# Advanced dense layouts¶

Tensors (Tensors of scalars) can be *placed* in a specific shape and *layout*.
Defining a proper layout can be critical to performance, especially for memory-bound applications. A carefully designed data layout can significantly improve cache/TLB-hit rates and cacheline utilization. Although when performance is not the first priority, you probably don’t have to worry about it.

In Taichi, the layout is defined in a recursive manner. See Structural nodes (SNodes) for more details about how this works. We suggest starting with the default layout specification (simply by specifying `shape`

when creating tensors using `ti.var/Vector/Matrix`

),
and then migrate to more advanced layouts using the `ti.root.X`

syntax if necessary.

Taichi decouples algorithms from data layouts, and the Taichi compiler automatically optimizes data accesses on a specific data layout. These Taichi features allow programmers to quickly experiment with different data layouts and figure out the most efficient one on a specific task and computer architecture.

## From `shape`

to `ti.root.X`

¶

For example, this declares a 0-D tensor:

```
x = ti.var(ti.f32)
ti.root.place(x)
# is equivalent to:
x = ti.var(ti.f32, shape=())
```

This declares a 1D tensor of size `3`

:

```
x = ti.var(ti.f32)
ti.root.dense(ti.i, 3).place(x)
# is equivalent to:
x = ti.var(ti.f32, shape=3)
```

This declares a 2D tensor of shape `(3, 4)`

:

```
x = ti.var(ti.f32)
ti.root.dense(ti.ij, (3, 4)).place(x)
# is equivalent to:
x = ti.var(ti.f32, shape=(3, 4))
```

You may wonder, why not simply specify the `shape`

of the tensor? Why bother using the more complex version?
Good question, let go forward and figure out why.

## Row-major versus column-major¶

Let’s start with the simplest layout.

Since address spaces are linear in modern computers, for 1D Taichi tensors, the address of the `i`

-th element is simply `i`

.

To store a multi-dimensional tensor, however, it has to be flattened, in order to fit into the 1D address space.
For example, to store a 2D tensor of size `(3, 2)`

, there are two ways to do this:

- The address of
`(i, j)`

-th is`base + i * 2 + j`

(row-major).- The address of
`(i, j)`

-th is`base + j * 3 + i`

(column-major).

To specify which layout to use in Taichi:

```
ti.root.dense(ti.i, 3).dense(ti.j, 2).place(x) # row-major (default)
ti.root.dense(ti.j, 2).dense(ti.i, 3).place(y) # column-major
```

Both `x`

and `y`

have the same shape of `(3, 2)`

, and they can be accessed in the same manner, where `0 <= i < 3 && 0 <= j < 2`

. They can be accessed in the same manner: `x[i, j]`

and `y[i, j]`

.
However, they have a very different memory layouts:

```
# address low ........................... address high
# x: x[0,0] x[0,1] x[0,2] | x[1,0] x[1,1] x[1,2]
# y: y[0,0] y[1,0] | y[0,1] y[1,1] | y[0,2] y[1,2]
```

See? `x`

first increases the first index (i.e. row-major), while `y`

first increases the second index (i.e. column-major).

Note

For those people from C/C++, here’s what they looks like:

```
int x[3][2]; // row-major
int y[2][3]; // column-major
for (int i = 0; i < 3; i++) {
for (int j = 0; j < 2; j++) {
do_something ( x[i][j] );
do_something ( y[j][i] );
}
}
```

## Array of Structures (AoS), Structure of Arrays (SoA)¶

Tensors of same size can be placed together.

For example, this places two 1D tensors of size `3`

(array of structure, AoS):

```
ti.root.dense(ti.i, 3).place(x, y)
```

Their memory layout:

```
# address low ............. address high
# x[0] y[0] | x[1] y[1] | x[2] y[2]
```

In contrast, this places two tensor placed separately (structure of array, SoA):

```
ti.root.dense(ti.i, 3).place(x)
ti.root.dense(ti.i, 3).place(y)
```

Now, their memory layout:

```
# address low ............. address high
# x[0] x[1] x[2] | y[0] y[1] y[2]
```

Normally, you don’t have to worry about the performance nuances between different layouts, and should just define the simplest layout as a start. However, locality sometimes have a significant impact on the performance, especially when the tensor is huge.

**To improve spatial locality of memory accesses (i.e. cache hit rate / cacheline utilization), it’s sometimes helpful to place the data elements within relatively close storage locations if they are often accessed together.**
Take a simple 1D wave equation solver for example:

```
N = 200000
pos = ti.var(ti.f32)
vel = ti.var(ti.f32)
ti.root.dense(ti.i, N).place(pos)
ti.root.dense(ti.i, N).place(vel)
@ti.kernel
def step():
pos[i] += vel[i] * dt
vel[i] += -k * pos[i] * dt
```

Here, we placed `pos`

and `vel`

seperately. So the distance in address space between `pos[i]`

and `vel[i]`

is `200000`

. This will result in a poor spatial locality and lots of cache-misses, which damages the performance.
A better placement is to place them together:

```
ti.root.dense(ti.i, N).place(pos, vel)
```

Then `vel[i]`

is placed right next to `pos[i]`

, this can increase the cache-hit rate and therefore increase the performance.

## Flat layouts versus hierarchical layouts¶

By default, when allocating a `ti.var`

, it follows the simplest data layout.

```
val = ti.var(ti.f32, shape=(32, 64, 128))
# C++ equivalent: float val[32][64][128]
```

However, at times this data layout can be suboptimal for certain types of computer graphics tasks.
For example, `val[i, j, k]`

and `val[i + 1, j, k]`

are very far away (`32 KB`

) from each other, and leads to poor access locality under certain computation tasks. Specifically, in tasks such as texture trilinear interpolation, the two elements are not even within the same `4KB`

pages, creating a huge cache/TLB pressure.

A better layout might be

```
val = ti.var(ti.f32)
ti.root.dense(ti.ijk, (8, 16, 32)).dense(ti.ijk, (4, 4, 4)).place(val)
```

This organizes `val`

in `4x4x4`

blocks, so that with high probability `val[i, j, k]`

and its neighbours are close to each other (i.e., in the same cacheline or memory page).

## Struct-fors on advanced dense data layouts¶

Struct-fors on nested dense data structures will automatically follow their data order in memory. For example, if 2D scalar tensor `A`

is stored in row-major order,

```
for i, j in A:
A[i, j] += 1
```

will iterate over elements of `A`

following row-major order. If `A`

is column-major, then the iteration follows the column-major order.

If `A`

is hierarchical, it will be iterated level by level. This maximizes the memory bandwidth utilization in most cases.

Struct-for loops on sparse tensors follow the same philosophy, and will be discussed further in Sparse computation (WIP).

## Examples¶

2D matrix, row-major

```
A = ti.var(ti.f32)
ti.root.dense(ti.ij, (256, 256)).place(A)
```

2D matrix, column-major

```
A = ti.var(ti.f32)
ti.root.dense(ti.ji, (256, 256)).place(A) # Note ti.ji instead of ti.ij
```

8x8 blocked 2D array of size 1024x1024

```
density = ti.var(ti.f32)
ti.root.dense(ti.ij, (128, 128)).dense(ti.ij, (8, 8)).place(density)
```

3D Particle positions and velocities, AoS

```
pos = ti.Vector(3, dt=ti.f32)
vel = ti.Vector(3, dt=ti.f32)
ti.root.dense(ti.i, 1024).place(pos, vel)
# equivalent to
ti.root.dense(ti.i, 1024).place(pos(0), pos(1), pos(2), vel(0), vel(1), vel(2))
```

3D Particle positions and velocities, SoA

```
pos = ti.Vector(3, dt=ti.f32)
vel = ti.Vector(3, dt=ti.f32)
for i in range(3):
ti.root.dense(ti.i, 1024).place(pos(i))
for i in range(3):
ti.root.dense(ti.i, 1024).place(vel(i))
```