

zipped, tiled

coalesced layout: remove modes (axes) of size (dim) 1, remove modes that traverse contiguously with the preceding mode: (s0:d0, s1:s0\*d0) --> (s0\*s1, d0)

sorted layout: strides are non-decreasing

B by itself has its image disjoint from A except for position 0 -- because of disjoint coordinates when embedded in C=(A,B)

complement of layout A wrt. size M: layout B such that (A, B) (concatenated) is contiguous (dense) of size M; i.e. B fills the gaps of A due to strides

CuTe layout algebra

(functional) composition of layouts:

$$R(c) := (A \circ B)(c) := A(B(c))$$

composition is left-distributive with concatenation, so focus on B with 1 mode

R has coordinates compatible with B

 $(s0:d0) \circ (s1:d1) = s1:(d0*d1)$ every d0th element, s1 total

coalesce A and let B=N:r, then A o B has the modes (dimensions and strides) from the middle of A with first and last mode adjusted, such that size(A o B)=N, and the first stride = the corresponding stride of A \*/ (r / the cumulative size of modes to the left)

complement(A,M) has one more mode than A: leading mode (d0:1) where d0 is leading stride of A; except d0=1 same num of modes

2025.03.07

Tilers add expressivity via hierarchical layouts:

layouts: A o  $(B1, B2) = (A \circ B1, A \circ B2)$ 

cortinued

tilers: (A1, A2) o <B1, B2> = (A1 o B1, A2 o B2)

A o B selects parts of A and reshapes them

zipped divide: puts the iterator modes coming from different pieces of the tiler into one toplevel sublayout

Logical divide:

A / B := A o (B, complement(B, size(A)))
split A into elems pointed to by B, and the rest

CuTe layout algebra

logical dividing by a tiler keeps
the hierarchical structure: (A1, A2)
/ <B1, B2> = (A1 / B1, A2 / B2)

in the hierarchical layout of A/B, the first toplevel mode is the tile, and the second is the iterator over tiles

tiled divide: puts the intra-tile modes of the tiler pieces into one toplevel sublayout

dividing by a multimode tiler: multidim tiles that look differently for different modes (axes) of A

layout<0>(zipped\_divide(A, B)) = \( \) layout<0>(tiled\_divide(A, B)) = A o B

Logical product:

A x B := (A, complement(A, size(A)\*cosize(B)) o B)

logical\_product is tricky because B needs to be designed for A; blocked\_product simplifies this

A is the tile, B is the number and order of repetitions, A\* is the available repetition layout

raked\_product disperses the tiles

tensor = layout + engine + [non]owning engine = RAM iterator + mem tag mem tag = global | shared | registers |

Thread-Value partitioning: a layout mapping threads + per-thread values to coordinates

CuTe: tensors and algorithms

tensor slice: new non-owning tensor with offset iterator, and new (restricted) layout

partitioning: zipped\_divide then slice; inner\_partition, local\_tile: give each e.g. threadgroup a tile; outer\_partiotion, local\_partition: iterate at fixed position / slice of the tiles

since CuTe can represent datatype, mem tag, shape and stride at compile time, algorithms can specialize to hardware-specific instructions

copy: the compiler-selected impl. may need a specific explicit synchronization

the type-inferred default implican be overriden

axpby: y := a\*x+b\*y (generalizes FMA)

copy\_if: copy with a mask tensor

gemm: accumulating, one of: element-wise product, outer product, batched outer product, matrix product; selected by a type param

fill, clear memory

transform: map in-place

Genencs:

fold an operation (C++17 style)

2025 03.10

hardware instruction levels:

quadpair: kinda 8 thread tensor)

- single thread (e.g. FMA)

core, i.e. 4 QPs per warp;

- quadpair (Volta: V100)

QP 0 is threads 0-3 & 16-19 etc.)

- mnemonic:
- D := A\*B + C
- single warp (Ampere: A100, RTX 30x0)
- warpgroup (Hopper: H100)

not introduced, but much optimized)

MMA and Copy atoms: expose specific PTX instructions with a unified, templated interface

for a specific PTX operation, provide A,B,C,D types and layouts

CuTe Matrix Multiply

//often represents inputs as (M,K),(N,K) rather than (M,K),(K,N) M/N/K-major: has stride 1 in the M/N/K mode f

Tensor  $gA = local_tile(mA, cta_tiler, cta_coord, Step<_1,X,_1>{});$ 

auto cta\_coord = make\_coord(blockIdx.x, blockIdx.y, \_);

28 1D threads

simplest example: tiles over threadblocks) aka CTAs

(example: independently) spcifying shared mem

layouts, partitioning patterns,

PTX instruction via TiledCopy and

/TiledMMA

static templated args eg. **\bM=128** x bN=128 x bK=8/

selects sublayouts except for pos X for both cta\_tiler and cta\_coord

```
auto tA = make_layout(make_shape(Int<32>{},Int<8>{}));
Tensor tAgA/tAsA = local_patition(gA/sA, tA, threadIdx.x);
     copy(tAgA(_,_,k_tile), tAsA); cp_async_fence();
     cp_async_wait<0>(); gemm(tCsA, tCsB, tCrC); __synctreads(); /
```

builders take Copy\_Atom<...> / UniversalFMA<TC,TA,TB> and threads (and values for TiledCopy) layout(s), then TiledX.get\_slice(threadIdx.x) to extract tAgA, tAsA, tCsA...



