Contents

NOTE: All code is in my pcl_simd repo on kforge.

## Representing point data

In PCL currently, points are stored with their fields interleaved. For the simplest `PointXYZ` type, this looks like:

XYZ_XYZ_XYZ_XYZ_ ...

where `_` denotes an extra padding `float` so that each point is 16-byte aligned. Operating on `XYZ_` data efficiently often requires the use of horizontal SSE instructions, which perform computations using multiple elements of the same SSE register.

Instead a vertical representation, aka Structure-of-Arrays (SoA), can be used:

XXXXXXXX ... YYYYYYYY ... ZZZZZZZZ ...

This layout fits traditional vertical SSE processing better. Most arithmetic SSE instructions are binary operations on corresponding elements of two SSE registers.

## Why does PCL use AoS?

PCL's use of AoS, normally non-optimal, does have its logic. In PCL, frequently we wish to process only some (indexed / valid) subset of a point cloud. Besides dense processing of all points, we then have two other cases.

### Indexed subsets

PCL operators routinely provide a `setIndices()` method, ordering them to use only certain points identified by index. With the AoS representation, each individual point can be used in an SSE register via a simple aligned load. Indexed access therefore does not much complicate an SSE-optimized implementation.

Vertical SSE (in the dense case) processes four adjacent points simultaneously, and indexed access breaks the adjacency requirement. Instead of an aligned load, the implementation must "gather" the data for the next four indexed points (spread out in memory).

### Organized point clouds

PCL permits point clouds with missing data. For imager-based 3D sensors, this allows point clouds to retain their 2D structure, making it trivial to identify nearby points. Invalid points have each field set to `NaN`, so that it is clear when invalid data is accidentally used in a computation.

Handling invalid points in PCL (with AoS) is again rather simple. For each point, check if `X` is NaN; if so, ignore it.

The SoA situation is much more complicated. Since we operate on four points at a time, we have to check if any of the four points are invalid. If so, it becomes very tricky to use SSE at all without destroying our result. Masking tricks are possible, but imply some overhead over the simple dense code.

## Horizontal or vertical?

Both representations have pros and cons:

Horizontal

- Pros
- More intuitive, easier to write code for
- Handling indexed subsets is simple - can still use aligned loads
Handling NaNs also simple

- Cons
- Clearly slower at dense processing
- Waste space and computation on padding elements

Vertical

- Pros
- Clearly faster at dense processing
- No wasted space - only 3/4 as many loads required
- No wasted computation
- May have less loop overhead, since you process 4 points per iteration instead of 1

- Cons
- Less intuitive
- Indexed subsets require gathering data for non-adjacent points
Handling NaNs is complicated

## Data structures

For benchmarking, we define two very simple point cloud representations:

## Computations considered

We benchmark two basic operations:

- Compute the dot product of every point in a cloud with a given point
- Compute the centroid of a point cloud

For both operations, we implemented several versions covering the space of:

- Horizontal (AoS) or vertical (SoA)
- Dense or indexed
- SSE instruction set

Representative examples are listed below.

### Dot product

Vertical (SoA), SSE2-optimized:

```
1 void dotSSE2 (const SOA& vectors, const AOS& vector,
2 float* result, unsigned long size)
3 {
4 float x = vector.x, y = vector.y, z = vector.z;
5
6 // Broadcast X, Y, Z of constant vector into 3 SSE registers
7 __m128 vX = _mm_set_ps1(x);
8 __m128 vY = _mm_set_ps1(y);
9 __m128 vZ = _mm_set_ps1(z);
10 __m128 X, Y, Z;
11
12 unsigned i = 0;
13 for ( ; i < size - 3; i += 4)
14 {
15 // Load data for next 4 points
16 X = _mm_load_ps (vectors.x + i);
17 Y = _mm_load_ps (vectors.y + i);
18 Z = _mm_load_ps (vectors.z + i);
19
20 // Compute X*X'+Y*Y'+Z*Z' for each point
21 X = _mm_mul_ps (X, vX);
22 Y = _mm_mul_ps (Y, vY);
23 Z = _mm_mul_ps (Z, vZ);
24 X = _mm_add_ps (X, Y);
25 X = _mm_add_ps (X, Z);
26
27 // Store results
28 _mm_store_ps(result + i, X);
29 }
30
31 // Handle any leftovers at the end
32 for ( ; i < size; ++i)
33 {
34 result[i] = vectors.x[i]*x + vectors.y[i]*y + vectors.z[i]*z;
35 }
36 }
```

Horizontal (AoS), SSE4.1-optimized (with horizontal DPPS instruction):

```
1 void dotSSE4_1 (const AOS* vectors, const AOS& vector,
2 float* result, unsigned long size)
3 {
4 // Load constant vector into an SSE register
5 __m128 vec = _mm_load_ps ((const float*) &vector);
6 __m128 XYZH;
7
8 // Set mask to ignore the padding elements
9 const int mask = 123;
10 for (unsigned i = 0; i < size; ++i)
11 {
12 // Load next point
13 XYZH = _mm_load_ps ((const float*)(vectors + i));
14
15 // Dot product from SSE4.1
16 XYZH = _mm_dp_ps (XYZH, vec, mask);
17
18 // Store single result (the bottom register element)
19 _mm_store_ss (&(result [i]), XYZH);
20 }
21 }
```

### Centroid

Vertical (SoA), SSE2-optimized:

```
1 void centroidSSE2 (const SOA& vectors, AOS& result, size_t size)
2 {
3 __m128 X_sum = _mm_setzero_ps();
4 __m128 Y_sum = _mm_setzero_ps();
5 __m128 Z_sum = _mm_setzero_ps();
6 __m128 X, Y, Z;
7
8 size_t i = 0;
9 for ( ; i < size - 3; i += 4)
10 {
11 // Load next 4 points
12 X = _mm_load_ps (vectors.x + i);
13 Y = _mm_load_ps (vectors.y + i);
14 Z = _mm_load_ps (vectors.z + i);
15
16 // Accumulate 4 sums in each dimension
17 X_sum = _mm_add_ps(X_sum, X);
18 Y_sum = _mm_add_ps(Y_sum, Y);
19 Z_sum = _mm_add_ps(Z_sum, Z);
20 }
21
22 // Horizontal adds (HADD from SSE3 could help slightly)
23 float* pX = reinterpret_cast<float*>(&X_sum);
24 float* pY = reinterpret_cast<float*>(&Y_sum);
25 float* pZ = reinterpret_cast<float*>(&Z_sum);
26 result.x = pX[0] + pX[1] + pX[2] + pX[3];
27 result.y = pY[0] + pY[1] + pY[2] + pY[3];
28 result.z = pZ[0] + pZ[1] + pZ[2] + pZ[3];
29
30 // Leftover points
31 for ( ; i < size; ++i)
32 {
33 result.x += vectors.x[i];
34 result.y += vectors.y[i];
35 result.z += vectors.z[i];
36 }
37
38 // Average
39 float inv_size = 1.0f / size;
40 result.x *= inv_size;
41 result.y *= inv_size;
42 result.z *= inv_size;
43 }
```

Horizontal (AoS), SSE2-optimized:

```
1 void centroidSSE2 (const AOS* vectors, AOS& result, size_t size)
2 {
3 __m128 sum = _mm_setzero_ps();
4
5 for (unsigned i = 0; i < size; ++i)
6 {
7 __m128 XYZH = _mm_load_ps ((const float*)(vectors + i));
8 sum = _mm_add_ps(sum, XYZH);
9 }
10 _mm_store_ps((float*)&result, sum);
11
12 float inv_size = 1.0f / size;
13 result.x *= inv_size;
14 result.y *= inv_size;
15 result.z *= inv_size;
16 }
```

### Indexed

When using point indices, the vertical implementation can no longer use aligned loads. Instead it's best to use the `_mm_set_ps` intrinsic to gather the next four points.

Vertical (SoA) dot product, SSE2-optimized:

```
1 void dotIndexedSSE2 (const SOA& vectors, const AOS& vector, const int* indices, float* result, unsigned long size)
2 {
3 float x = vector.x, y = vector.y, z = vector.z;
4
5 __m128 vX = _mm_set_ps1(x);
6 __m128 vY = _mm_set_ps1(y);
7 __m128 vZ = _mm_set_ps1(z);
8 __m128 X, Y, Z;
9
10 unsigned i = 0;
11 for ( ; i < size - 3; i += 4)
12 {
13 int i0 = indices[i + 0];
14 int i1 = indices[i + 1];
15 int i2 = indices[i + 2];
16 int i3 = indices[i + 3];
17
18 // Gather next four indexed points
19 X = _mm_set_ps(vectors.x[i3], vectors.x[i2], vectors.x[i1], vectors.x[i0]);
20 Y = _mm_set_ps(vectors.y[i3], vectors.y[i2], vectors.y[i1], vectors.y[i0]);
21 Z = _mm_set_ps(vectors.z[i3], vectors.z[i2], vectors.z[i1], vectors.z[i0]);
22
23 // Computation
24 X = _mm_mul_ps (X, vX);
25 Y = _mm_mul_ps (Y, vY);
26 Z = _mm_mul_ps (Z, vZ);
27 X = _mm_add_ps (X, Y);
28 X = _mm_add_ps (X, Z);
29
30 // Store result
31 _mm_store_ps(result + i, X);
32 }
33
34 for ( ; i < size; ++i)
35 {
36 int idx = indices[i];
37 result[i] = vectors.x[idx]*x + vectors.y[idx]*x + vectors.z[idx]*z;
38 }
39 }
```

## Benchmarks (random data)

The test point cloud is randomly generated, 640x480, dense. Each operation is repeated 1000 times.

For indexed tests, the indices list every 4th point. More random index patterns would change execution time by affecting caching and prefetching, but I'd expect such effects to be similar for horizontal and vertical code.

"Scalar" code uses no vector instructions, otherwise the instruction set is listed. A trailing u# means the code was unrolled by factor #.

### Dot product

#### Dense

Horizontal (AOS) Scalar: 0.621674 seconds SSE2: 0.756300 seconds SSE4.1: 0.532441 seconds SSE4.1u4: 0.476841 seconds Vertical (SOA) Scalar: 0.519625 seconds SSE2: 0.215499 seconds

The vertical SSE2 code is the clear winner, more than twice as fast as horizontal code even with the special horizontal dot product from SSE4.1.

On the i7 at the ski house, horizontal SSE4.1 was actually the *slowest* implementation. Unrolling it x4 helped significantly, although it was still much worse than vertical SSE2. I attributed this to the very high latency of the DPPS instruction; it takes 11 cycles before the result can be stored. Unrolling helps hide the latency by providing more computation to do during that time. I don't know why the results from my WG i7 (shown above) are so different.

#### Indexed

Horizontal (AOS) Scalar: 0.271768 seconds SSE2: 0.276114 seconds SSE4.1: 0.259613 seconds Vertical (SOA) Scalar: 0.193394 seconds SSE2: 0.177262 seconds

SSE optimization actually gives meager benefits in both the horizontal and vertical cases. However vertical SSE2 is still the winner.

### Centroid

The story for centroid is similar; vertical SSE2 is fastest, significantly so for dense data.

#### Dense

Horizontal (AOS) Scalar: 0.628597 seconds SSE2: 0.326645 seconds SSE2u2: 0.247539 seconds SSE2u4: 0.236474 seconds Vertical (SOA) Scalar: 0.711040 seconds SSE2: 0.149806 seconds

#### Indexed

Horizontal (AOS) Scalar: 0.256237 seconds SSE2: 0.195724 seconds Vertical (SOA) Scalar: 0.194030 seconds SSE2: 0.166639 seconds

## Vertical SSE for organized point clouds

We still need a way to effectively use vertical SSE for organized point clouds (containing NaNs). A promising approach is to compute a run-length encoding (RLE) of the valid points as a preprocessing step. The data structure is very simple:

The RLE counts the length of alternating runs of valid and invalid points. Once computed, it allows us to process only valid points without explicitly checking each one for NaNs. In fact, operations become O(#valid points) instead of O(#total points), which can itself be a win if many points are invalid.

In real scenes, valid points are clustered together (into objects), so valid (and invalid) runs should be lengthy on average. A long run of valid points can be split into <4 beginning points, <4 final points, and a run of aligned, valid point data which can be safely processed with vertical SSE.

## Abstracting point iteration

We are still left with three distinct cases for processing point clouds, requiring different methods of iterating over point data:

Dense (no NaNs)

- Indexed
Organized (contains NaNs)

Writing and maintaining three copies of each PCL algorithm is a huge burden. The RLE for organized data in particular imposes a relatively complicated iteration method. Ideally we should be able to write the computational core of an algorithm only once, and have it work efficiently in each of the three cases.

Currently PCL does not meet this goal. In fact, core algorithms tend to have four near-identical implementations:

- Dense
- Dense indexed
- Organized
- Organized indexed

I think it's unnecessary to distinguish between "dense indexed" and "organized indexed", if we require that indices point to valid data.

### Writing algorithms as computational kernels

As an experiment, I rewrote the vertical centroid as a "kernel" class. This implements only the computation, without worrying about the memory layout of the whole cloud:

```
1 struct CentroidKernel
2 {
3 // State
4 float x_sum, y_sum, z_sum;
5 __m128 X_sum, Y_sum, Z_sum;
6 size_t count;
7 AOS result;
8
9 void init()
10 {
11 // Initialization
12 x_sum = y_sum = z_sum = 0.0f;
13 X_sum = _mm_setzero_ps();
14 Y_sum = _mm_setzero_ps();
15 Z_sum = _mm_setzero_ps();
16 count = 0;
17 }
18
19 // Scalar operator
20 inline void operator() (float x, float y, float z)
21 {
22 x_sum += x;
23 y_sum += y;
24 z_sum += z;
25 ++count;
26 }
27
28 // SIMD operator
29 inline void operator() (__m128 X, __m128 Y, __m128 Z)
30 {
31 X_sum = _mm_add_ps(X_sum, X);
32 Y_sum = _mm_add_ps(Y_sum, Y);
33 Z_sum = _mm_add_ps(Z_sum, Z);
34 count += 4;
35 }
36
37 void reduce()
38 {
39 float* pX = reinterpret_cast<float*>(&X_sum);
40 float* pY = reinterpret_cast<float*>(&Y_sum);
41 float* pZ = reinterpret_cast<float*>(&Z_sum);
42 result.x = pX[0] + pX[1] + pX[2] + pX[3] + x_sum;
43 result.y = pY[0] + pY[1] + pY[2] + pY[3] + y_sum;
44 result.z = pZ[0] + pZ[1] + pZ[2] + pZ[3] + z_sum;
45
46 float inv_count = 1.0f / count;
47 result.x *= inv_count;
48 result.y *= inv_count;
49 result.z *= inv_count;
50 }
51 };
```

### Kernel applicators

We can then define "applicator" functions that apply a kernel to a particular case of point cloud. The dense version simply uses aligned loads:

```
1 template <typename Kernel>
2 void applyDense (Kernel& kernel, const SOA& pts)
3 {
4 kernel.init();
5
6 size_t i = 0;
7 for ( ; i < pts.size - 3; i += 4)
8 {
9 __m128 X = _mm_load_ps (pts.x + i);
10 __m128 Y = _mm_load_ps (pts.y + i);
11 __m128 Z = _mm_load_ps (pts.z + i);
12
13 kernel(X, Y, Z);
14 }
15 for ( ; i < pts.size; ++i)
16 {
17 kernel(pts.x[i], pts.y[i], pts.z[i]);
18 }
19
20 kernel.reduce();
21 }
```

The indexed version performs the necessary data gathering:

```
1 template <typename Kernel>
2 void applySparse (Kernel& kernel, const SOA& pts, const std::vector<int>& indices)
3 {
4 kernel.init();
5
6 size_t i = 0;
7 for ( ; i < indices.size() - 3; i += 4)
8 {
9 int i0 = indices[i + 0];
10 int i1 = indices[i + 1];
11 int i2 = indices[i + 2];
12 int i3 = indices[i + 3];
13
14 // Gather next four indexed points
15 __m128 X = _mm_set_ps(pts.x[i3], pts.x[i2], pts.x[i1], pts.x[i0]);
16 __m128 Y = _mm_set_ps(pts.y[i3], pts.y[i2], pts.y[i1], pts.y[i0]);
17 __m128 Z = _mm_set_ps(pts.z[i3], pts.z[i2], pts.z[i1], pts.z[i0]);
18
19 kernel(X, Y, Z);
20 }
21 for ( ; i < indices.size(); ++i)
22 {
23 int idx = indices[i];
24 kernel(pts.x[idx], pts.y[idx], pts.z[idx]);
25 }
26
27 kernel.reduce();
28 }
```

The organized version is most complicated, and uses the RLE to vectorize as much of the computation as possible:

```
1 template <typename Kernel>
2 void applyOrganized (Kernel& kernel, const SOA& pts, const RLE& rle)
3 {
4 kernel.init();
5
6 size_t i = 0;
7 for (RLE::const_iterator rle_it = rle.begin(); rle_it != rle.end(); ++rle_it)
8 {
9 // Process current stretch of good pixels
10 size_t good = rle_it->good;
11 size_t skip = rle_it->skip;
12 size_t good_end = i + good;
13
14 // Any unaligned points at start
15 size_t unaligned_end = std::min( (i + 3) & ~3, good_end );
16 for ( ; i < unaligned_end; ++i)
17 kernel(pts.x[i], pts.y[i], pts.z[i]);
18 // Aligned SIMD point data
19 for ( ; i + 4 <= good_end; i += 4)
20 {
21 __m128 X = _mm_load_ps (pts.x + i);
22 __m128 Y = _mm_load_ps (pts.y + i);
23 __m128 Z = _mm_load_ps (pts.z + i);
24
25 kernel(X, Y, Z);
26 }
27 // <4 remaining points
28 for ( ; i < good_end; ++i)
29 kernel(pts.x[i], pts.y[i], pts.z[i]);
30
31 // Skip the following stretch of NaNs
32 i += skip;
33 }
34
35 kernel.reduce();
36 }
```

The kernel + applicator combinations for the dense and indexed cases were added to the centroid benchmark for random point data, and show identical performance to the hand-written vertical SSE2 code.

The above code is written with simplicity in mind. The biggest improvement would be to combine the scalar and SSE `operator() (...)` functions; this could possibly be achieved by using `Eigen::Array` as an SSE backend (similar to how Eigen::Matrix maps are currently used), something like:

## Benchmarks (real point clouds)

Finally, we compare `CentroidKernel` + applicator to `pcl::compute3DCentroid()` for several real organized (and one dense) point clouds.

The point clouds used are:

capture0001.pcd (organized, 640x480, 57553 NaNs)

PCL: 0.926901 seconds RLE: 0.348173 seconds Kernel: 0.174194 seconds

capture0002.pcd (organized, 640x480, 57269 NaNs)

PCL: 0.931111 seconds RLE: 0.345437 seconds Kernel: 0.171373 seconds

Even if you include the RLE computation time (which could be amortized over several operations, and perhaps optimized) in the total, the vertical kernel beats the current PCL implementation. Discounting RLE, it's more than 5x faster.

table_scene_mug_stereo_textured.pcd (organized, 640x480, 97920 NaNs)

PCL: 3.36001 seconds RLE: 0.379737 seconds Kernel: 0.183159 seconds

The very poor performance of PCL on the mug scene is a mystery to me. Perhaps the larger number of NaNs has an effect?

table_scene_lms400.pcd (dense, 460400 pts)

PCL: 0.678805 seconds RLE: N/A Kernel: 0.242546 seconds

## Conclusions

For the simple operations considered here, vertical SSE is a huge win. In the best case, this suggests that much of PCL could get at least a 3x speedup by switching to the more SSE-friendly memory layout.

Vertical SSE presents some complications in usage and implementation for PCL, but good solutions (RLE, kernel abstraction) are possible.

Looking at instruction sets, vertical SSE is especially advantageous both on older and very new processors. On older processors, because it makes excellent use of SSE2 instructions, whereas horizontal SSE may require horizontal instructions (introduced in SSE3 and later) for good performance. On new processors, because the latest AVX extensions expand SSE register to 256 bits, allowing 8 floating point operations at a time instead of 4. The vertical SSE techniques shown here trivially extend to AVX, and future instruction sets will likely expand SSE registers even further. The upcoming AVX2 extensions add dedicated "gather" instructions, which should improve performance with indices.

## Remaining questions

Are there PCL algorithms that aren't easily implementable in the proposed kernel style?

How to handle nearest neighbor searches? These may be hard to vectorize.