Dominik Honnef

Blend stack memory

Dynamically managing the memory for the blend stack is a subproblem of the Dynamic GPU memory problem.

Clipping and blending (which are handled by the same code) in Vello are implemented as part of the compute shaders. Instead of rendering and compositing separate textures, the fine shader pushes the current pixel values onto a blend stack when entering a layer, and pops from the stack and blends with the new pixel values when leaving a layer. This has (theoretical) performance advantages because the shader can (partially) rely on vector registers instead of having to access global memory in a separate compositing step. It also needs less memory because we don’t have to allocate a full-size texture per layer. With readback from the coarse shader we could allocate smaller textures, limited by bounding boxes, but that can still be a lot more memory than doing per-tile blending.
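
To make the push/pop behaviour concrete, here is a minimal CPU-side sketch for a single pixel. It is not Vello's shader code: the names `Pixel`, `begin_layer`, `end_layer` and `fill` are hypothetical, colors are assumed to be premultiplied RGBA, and the sketch assumes that entering a layer resets the current color to transparent and that layers are combined with plain source-over at a given opacity.

```rust
/// Premultiplied RGBA color of one pixel (assumption for this sketch).
type Rgba = [f32; 4];

/// Source-over compositing of premultiplied colors: `src` drawn on top of `dst`.
fn source_over(dst: Rgba, src: Rgba) -> Rgba {
    let inv = 1.0 - src[3];
    [
        src[0] + dst[0] * inv,
        src[1] + dst[1] * inv,
        src[2] + dst[2] * inv,
        src[3] + dst[3] * inv,
    ]
}

/// Hypothetical per-pixel state: the color accumulated so far plus the blend stack.
struct Pixel {
    current: Rgba,
    stack: Vec<Rgba>,
}

impl Pixel {
    /// Draw a color on top of whatever has been accumulated in the current layer.
    fn fill(&mut self, color: Rgba) {
        self.current = source_over(self.current, color);
    }

    /// Entering a layer: save the current pixel value and start from transparent.
    fn begin_layer(&mut self) {
        self.stack.push(self.current);
        self.current = [0.0; 4];
    }

    /// Leaving a layer: pop the saved backdrop and blend the layer's pixels onto it.
    fn end_layer(&mut self, opacity: f32) {
        let backdrop = self.stack.pop().expect("end_layer without begin_layer");
        let layer = self.current.map(|c| c * opacity);
        self.current = source_over(backdrop, layer);
    }
}
```

No per-layer texture appears anywhere in this sketch; the only extra storage is one saved color per pixel per open layer, which is what the blend stack has to hold.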

While parts of the blend stack can live in actual registers, limiting ourselves to registers would impose a small, hard limit on the maximum supported depth.
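
The trade-off can be sketched as a stack with a few fixed slots (standing in for registers) that spills deeper entries to a separately provided buffer. This is only an illustration under assumptions: `BlendStack`, `REGISTER_SLOTS` and `Overflow` are hypothetical names, pixel values are packed into a `u32`, and the spill area is a flat slice rather than any particular layout Vello has used.

```rust
/// Number of entries kept in the fixed array that stands in for registers
/// (assumption for this sketch).
const REGISTER_SLOTS: usize = 4;

/// Returned when both the fixed slots and the spill buffer are full.
#[derive(Debug, PartialEq)]
struct Overflow;

/// Hypothetical blend stack: shallow entries live in `regs`, deeper ones in `spill`.
struct BlendStack<'a> {
    regs: [u32; REGISTER_SLOTS],
    spill: &'a mut [u32],
    depth: usize,
}

impl<'a> BlendStack<'a> {
    fn new(spill: &'a mut [u32]) -> Self {
        BlendStack { regs: [0; REGISTER_SLOTS], spill, depth: 0 }
    }

    /// Push a packed pixel value; spills to memory once the fixed slots are used up.
    fn push(&mut self, value: u32) -> Result<(), Overflow> {
        if self.depth < REGISTER_SLOTS {
            self.regs[self.depth] = value;
        } else {
            let slot = self
                .spill
                .get_mut(self.depth - REGISTER_SLOTS)
                .ok_or(Overflow)?;
            *slot = value;
        }
        self.depth += 1;
        Ok(())
    }

    /// Pop the most recently pushed value, reading from the spill area if needed.
    fn pop(&mut self) -> Option<u32> {
        if self.depth == 0 {
            return None;
        }
        self.depth -= 1;
        Some(if self.depth < REGISTER_SLOTS {
            self.regs[self.depth]
        } else {
            self.spill[self.depth - REGISTER_SLOTS]
        })
    }
}
```

With an empty spill slice this degenerates into the register-only case, where the fifth nested layer already fails. The size of the spill buffer is therefore what determines the maximum depth, and that is the part that needs dynamic memory management.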

Coarse history

2020/10/16 - https://github.com/linebender/vello/pull/34 basic clips, no stack

2020/11/23 - https://github.com/linebender/vello/pull/44 clip stack, limit of 4

2020/11/25 - https://github.com/linebender/vello/pull/45 clip stack of 4 in registers, spilling to memory with a linked list of stack frames

2020/12/07 - https://github.com/linebender/vello/commit/4de67d9081db59883838d5060c5d966462838775 Unifies GPU memory management, using a single buffer for all memory

2021/04/08 - https://github.com/linebender/vello/commit/22507dea0e15b003bcac8dad215e3e80921ff78a Preallocates stack space, removing allocations from the fine shader. Removes local blend stack, always using the buffer.

2021/04/10 - https://github.com/linebender/vello/issues/83 Discovers poor performance on Adreno 640, a mobile GPU

2021/04/29 - https://github.com/linebender/vello/issues/83 Discovers that poor performance is due to unifying the buffers, and accessing read-only data through a read/write binding. Decides to merge PR 77 to fix this.

2021/07/13 - https://github.com/linebender/vello/pull/77 Removes the buffer-based blend stack, instead allocates an array with 128 entries, assuming drivers will move it to scratch space.

2021/12/08 - https://github.com/linebender/vello/issues/83 Discovers that the approach from PR 77 works poorly, causing artifacts and crashes on an AMD 5700 XT, and learns that drivers have poor support for large arrays and that relying on scratch space is unreliable.

2022/02/26 - https://github.com/linebender/vello/issues/155 Points out that the large array has a negative performance impact.

2022/03/02 - https://github.com/linebender/vello/issues/156 Lays out “final” plan for blend stack.

2022/04/04 - https://github.com/linebender/vello/issues/163 Another issue pointing out the poor performance of a large array.

2022/05/20 - https://github.com/linebender/vello/pull/173 Implements the blend stack as described in issue 156, except it still uses a single, large buffer for all memory, and there is no OOM detection. This basically reverts PR 77.

2022/07/14 - https://github.com/linebender/vello/pull/181 Implements “robust dynamic memory”. Fine-grained OOM information per stage. Separate buffer for blend stack (finally fixing issue 83). Retrying stages when they need more memory. Blend buffer is reallocated according to the result of the coarse stage, before running fine stage.

2022/10/24–2022/11/27 - Rewrite using WebGPU. In the process we’ve completely lost blend stack spilling and robust dynamic memory.

2023/01/19 - https://github.com/linebender/vello/pull/257 Adds some GPU-side work for robust memory, but it’s not hooked up to anything. Blend stack spilling is still missing.

2024/04/01 - https://github.com/linebender/vello/pull/537 Adds some GPU-side work for robust memory (again?). Still not hooked up to anything, still no blend stack spilling.

2024/08/06 - https://github.com/linebender/vello/pull/657 Adds blend stack spilling. The buffer for this has a fixed size, and frankly overflows quickly.