What changed after we split the pools
We ran a two-week proof of concept. I split the cluster into two pools: eight GPUs dedicated to prompt processing and the remaining GPUs handling token generation. No new hardware, no new cluster, just a configuration change in the serving layer and a routing policy that sent every request to the right pool based on its inference phase. The prompt-processing pool hit 90–95% compute utilization consistently because that's all it did. No token generation competing for scheduling slots. No decode requests sitting idle while a prefill burst hogged the cores.
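The routing policy really is that simple in spirit. Here's a minimal sketch of phase-based routing; the pool sizes match the text, but the request fields, pool names, and load-spreading scheme are illustrative assumptions, not the actual serving-layer config:

```python
# Hypothetical sketch of phase-based pool routing. Pool names, request
# fields, and the hash-based spread are illustrative assumptions.
from dataclasses import dataclass

PREFILL_POOL = [f"gpu-{i}" for i in range(8)]      # 8 GPUs: prompt processing
DECODE_POOL = [f"gpu-{i}" for i in range(8, 24)]   # the rest: token generation

@dataclass
class Request:
    request_id: str
    phase: str  # "prefill" or "decode"

def route(req: Request) -> str:
    """Send each request to the pool matching its inference phase."""
    pool = PREFILL_POOL if req.phase == "prefill" else DECODE_POOL
    # Deterministic spread for the sketch; a real router would also
    # track per-GPU queue depth and KV-cache placement.
    return pool[hash(req.request_id) % len(pool)]
```

The point is that the isolation comes entirely from the dispatch decision: once prefill and decode never share a GPU, each pool can be tuned for its own bottleneck.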
The token-generation pool was the bigger surprise. By batching hundreds of concurrent decode requests together, the memory reads got amortized across more work. Bandwidth utilization climbed above 70%, far better than the 30% we'd been seeing when decode requests were interleaved with prefill on the same GPU. Overall compute efficiency roughly doubled.
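The amortization argument is worth making concrete. Each decode step streams the model weights from memory once regardless of batch size, so that read is shared across every request in the batch. A back-of-the-envelope sketch, with purely illustrative numbers (neither the model size nor the KV-cache footprint comes from the engagement):

```python
# Why batching decode lifts bandwidth utilization: the weight read is
# paid once per step, then split across the batch. Numbers are
# illustrative assumptions, not measurements from the cluster.
WEIGHT_BYTES = 140e9       # e.g. a 70B-parameter model at fp16 (assumption)
PER_REQ_KV_BYTES = 0.5e9   # KV-cache traffic per request per step (assumption)

def bytes_per_token(batch_size: int) -> float:
    """Memory traffic per generated token at a given decode batch size."""
    total = WEIGHT_BYTES + batch_size * PER_REQ_KV_BYTES
    return total / batch_size

# One lone request pays the full weight read; 256 requests share it.
ratio = bytes_per_token(1) / bytes_per_token(256)
print(f"~{ratio:.0f}x less memory traffic per token")
```

Under these assumptions the per-token traffic drops by two orders of magnitude, which is why a decode-only pool can keep the memory bus busy instead of idling between interleaved prefills.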
The cost math followed. The customer was spending about $2M annually on inference GPU-hours. After disaggregation they were on track to cut that by $600–800K while serving the same request volume at the same latency targets. No new hardware purchased. Same GPUs, same cluster, same model weights, different architecture.
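As a quick sanity check on those figures (using only the numbers stated above):

```python
# Sanity check: projected savings as a share of the annual inference bill.
annual_spend = 2_000_000
savings_low, savings_high = 600_000, 800_000

pct_low = savings_low / annual_spend * 100    # 30.0
pct_high = savings_high / annual_spend * 100  # 40.0
print(f"{pct_low:.0f}-{pct_high:.0f}% of annual inference spend")
```

A 30–40% reduction from a configuration change alone, which is consistent with roughly doubling compute efficiency on the portion of the fleet that was previously contended.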
