Why Your ML Inference Pipeline Is Slower Than It Needs to Be
Most inference slowdowns aren't where you think. A practical guide to finding and fixing the real bottlenecks.
Most teams I work with have the same complaint: inference is too slow. And most of them have already tried the obvious fixes — batching, quantization, maybe a GPU upgrade. But the pipeline is still slower than it should be.
The problem is rarely the model itself. It’s everything around it.
The usual suspects aren’t the real culprits
When inference latency is high, the instinct is to optimize the model. Smaller architecture, fewer parameters, distillation. These help, but they're expensive in engineering time and often cost model quality.
In my experience, 80% of inference latency comes from three places:
- Data preprocessing on the critical path. Tokenization, normalization, feature extraction — all running synchronously before the model even sees the input.
- Serialization overhead. Moving data between services, formats, or devices. Protobuf, JSON, even tensor conversions add up.
- Memory allocation patterns. Allocating and freeing memory on every request instead of using pools or pre-allocated buffers.
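To see how much serialization alone costs, it's worth timing it in isolation. A minimal sketch (the 1024-float `features` list is a hypothetical stand-in for a real payload):

```python
import json
import pickle
import time

# Hypothetical payload: a 1024-float feature vector.
features = [0.1] * 1024

def per_call_seconds(fn, n=1000):
    # Average wall-clock cost of n calls.
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

json_cost = per_call_seconds(lambda: json.dumps(features))
pickle_cost = per_call_seconds(lambda: pickle.dumps(features))
print(f"json:   {json_cost * 1e6:.1f} us per call")
print(f"pickle: {pickle_cost * 1e6:.1f} us per call")
```

Multiply the per-call cost by every hop in your pipeline and it stops looking negligible.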
Measure before you optimize
The first thing I do on any engagement is profile. Not benchmark — profile. There’s a difference.
A benchmark tells you the system is slow. A profile tells you where it’s slow. Flame graphs, trace spans, hardware counters — these are the tools that turn “it’s slow” into “this function allocates 47MB per request.”
# Don't guess. Measure.
import cProfile
import pstats

with cProfile.Profile() as pr:
    result = inference_pipeline(input_data)  # your pipeline's entry point

stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20)  # top 20 entries by cumulative time
Quick wins that compound
Once you know where the time goes, the fixes are often straightforward:
- Move preprocessing off the critical path. Pre-compute what you can. Cache tokenized inputs. Use async preprocessing.
- Batch at the right layer. Dynamic batching at the model level, not the API level.
- Reuse allocations. Memory pools for tensors. Pre-allocated output buffers. This alone can cut latency by 20-30%.
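Batching at the model level means collecting in-flight requests into one forward pass instead of calling the model once per API request. A minimal sketch of that idea, not production code — `DynamicBatcher` and `model_fn` are illustrative names, and `model_fn` stands in for your batched inference call:

```python
import asyncio

class DynamicBatcher:
    """Collect individual requests into batches for the model."""

    def __init__(self, model_fn, max_batch=8, max_wait_ms=5):
        self.model_fn = model_fn          # batched inference call
        self.max_batch = max_batch        # flush when the batch is full...
        self.max_wait = max_wait_ms / 1000  # ...or when the wait expires
        self.queue = asyncio.Queue()

    async def infer(self, item):
        # Each caller enqueues its input and awaits its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            # Block until at least one request arrives.
            item, fut = await self.queue.get()
            batch, futures = [item], [fut]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            # Collect more requests until the batch fills or time runs out.
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futures.append(fut)
            # One forward pass for the whole batch.
            for f, r in zip(futures, self.model_fn(batch)):
                f.set_result(r)
```

The key design choice is the deadline: a small `max_wait_ms` bounds the latency any single request pays for the benefit of batching.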
None of these require changing your model. They’re systems engineering — the kind of work that’s unglamorous but effective.
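The allocation-reuse point is the simplest to sketch: hand out fixed-size buffers from a pool instead of allocating fresh ones per request. `BufferPool` and the sizes below are illustrative, not a specific library's API:

```python
from collections import deque

class BufferPool:
    """Reuse fixed-size output buffers instead of allocating per request."""

    def __init__(self, nbytes, count):
        self.nbytes = nbytes
        # Pre-allocate `count` buffers up front.
        self._free = deque(bytearray(nbytes) for _ in range(count))

    def acquire(self):
        # Pool hit: hand out an existing buffer.
        # Pool miss: fall back to a fresh allocation.
        return self._free.popleft() if self._free else bytearray(self.nbytes)

    def release(self, buf):
        self._free.append(buf)

# Usage: acquire before inference, write the model output into the
# buffer, release once the response has been serialized.
pool = BufferPool(nbytes=4096, count=32)
buf = pool.acquire()
# ... write output into buf ...
pool.release(buf)
```

The same pattern applies to tensors: frameworks that support writing into a caller-provided output buffer let you skip the allocation entirely.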
The deeper lesson
Performance engineering isn’t about knowing a bag of tricks. It’s about having the discipline to measure, form hypotheses, and validate. Every system is different. Every bottleneck has its own shape.
The teams that build fast inference pipelines aren’t the ones with the most clever optimizations. They’re the ones that measure first and optimize second.
If your inference pipeline is slower than it should be, let’s talk. We help teams find and fix the bottlenecks that matter.