LLM Inference Is Not Just Bigger Inference

Tags: llm, inference, healthcare-it, architecture, systems

Introduction

In healthcare analytics, we’ve lived with “inference” for years.

HEDIS gap closure models. Risk adjustment predictors. Fraud detection. CNNs on imaging. Gradient boosting on claims.

Those systems assume something very stable:

- Fixed input and output shapes
- The same compute cost for every request
- Stateless, fire-and-forget serving

It’s clean. Deterministic. Boring in a good way.

LLMs break almost all of that.


Main Content

1. Variable-Length Computation Changes Everything

Traditional inference assumes uniform work per request.

LLMs don’t.

Two requests hit your system:

- A 10-token eligibility question
- A 4,000-token prior-authorization summary

One finishes almost instantly; the other streams tokens for seconds. You cannot just batch them together and wait for the slowest to finish.

LLM systems use continuous batching, dynamically inserting and removing requests as tokens complete.

```mermaid
flowchart LR
    A[Incoming Requests] --> B[Continuous Batching Engine]
    B --> C[GPU Batch]
    C --> D[Token Completion]
    D --> B
```

The batch is fluid, not fixed.

In healthcare IT terms: this is the difference between a nightly HEDIS job and a real-time clinical documentation assistant.
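The scheduling policy can be sketched in a few lines of Python, assuming one generated token per request per step. The `Request` class and all names here are hypothetical illustrations, not any real engine's API:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    max_new_tokens: int
    generated: int = 0

def continuous_batching(incoming, batch_size=4):
    """Toy scheduler: admit and evict requests between every token step."""
    waiting = deque(incoming)
    active: list[Request] = []
    completed = []
    while waiting or active:
        # Admit new requests whenever a slot frees up -- per step, not per batch.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One "forward pass": every active request produces one token.
        for req in active:
            req.generated += 1
        # Evict finished requests immediately so their slots are reusable.
        still_active = []
        for req in active:
            if req.generated >= req.max_new_tokens:
                completed.append(req.rid)
            else:
                still_active.append(req)
        active = still_active
    return completed

reqs = [Request("short", 2), Request("long", 8), Request("mid", 4)]
print(continuous_batching(reqs, batch_size=2))  # ['short', 'mid', 'long']
```

The short request leaves the batch as soon as it finishes, and the waiting request takes its slot mid-generation. That is the whole trick.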


2. Prefill vs Decode: Two Different Workloads

LLM inference splits into:

- Prefill: the entire prompt processed in one parallel pass (compute-bound)
- Decode: tokens generated one at a time (memory-bandwidth-bound)

They behave differently.

Running both on the same GPUs causes interference and latency jitter.

High-performance systems separate them.

```mermaid
flowchart LR
    A[Client Request] --> B[Prefill Pool]
    B --> C[KV State Transfer]
    C --> D[Decode Pool]
    D --> E[Streaming Tokens to Client]
```

Prefill is like loading the full patient history into memory.
Decode is writing the note sentence by sentence.

Treating them as identical workloads is architectural laziness.
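A minimal sketch of the handoff, assuming a toy transfer queue between a prefill thread and a decode thread. All names are hypothetical, and real systems move KV tensors between GPU pools over fast interconnects, not through a Python queue:

```python
import queue
import threading

def prefill_worker(requests, transfer: queue.Queue):
    # Compute-bound phase: the whole prompt is processed in one parallel pass.
    for rid, prompt in requests:
        kv_state = {"rid": rid, "cache_len": len(prompt)}  # stand-in for K/V tensors
        transfer.put(kv_state)  # hand off to the decode pool
    transfer.put(None)          # sentinel: no more work

def decode_worker(transfer: queue.Queue, results: dict):
    # Memory-bandwidth-bound phase: one token per step, extending the cache.
    while (kv_state := transfer.get()) is not None:
        tokens = []
        for step in range(3):           # pretend each request wants 3 tokens
            kv_state["cache_len"] += 1  # each new token appends to the cache
            tokens.append(f"t{step}")
        results[kv_state["rid"]] = tokens

transfer, results = queue.Queue(), {}
p = threading.Thread(target=prefill_worker,
                     args=([("r1", "patient history..."), ("r2", "short note")], transfer))
d = threading.Thread(target=decode_worker, args=(transfer, results))
p.start(); d.start(); p.join(); d.join()
print(results)  # {'r1': ['t0', 't1', 't2'], 'r2': ['t0', 't1', 't2']}
```

The point of the separation: each pool can be sized, batched, and scaled for its own bottleneck instead of averaging two incompatible workloads.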


3. KV Cache Is a Systems Problem

LLMs rely on a KV cache to store attention state.

In multi-turn healthcare conversations:

- Context grows with every turn
- Cache state must survive between requests
- Many conversations share the same GPU memory

This introduces:

- Memory pressure and fragmentation
- Eviction and preemption policy
- Opportunities for cache reuse

Modern engines implement paged KV caches, similar to virtual memory systems.

Traditional models finish, release memory, and disappear.
LLMs persist.

That changes infrastructure design.
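The virtual-memory analogy can be sketched directly, assuming fixed-size blocks drawn from a shared free list. This is a toy stand-in for what engines like vLLM do with real K/V tensors; every name here is hypothetical:

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size blocks allocated from a shared pool,
    so long conversations never need one big contiguous buffer."""

    def __init__(self, num_blocks=8, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free list, like page frames
        self.tables = {}   # conversation id -> list of physical block ids
        self.lengths = {}  # conversation id -> tokens stored

    def append(self, cid, n_tokens):
        """Reserve blocks for n_tokens more tokens of attention state."""
        length = self.lengths.get(cid, 0) + n_tokens
        needed = -(-length // self.block_size)  # ceiling division
        table = self.tables.setdefault(cid, [])
        while len(table) < needed:
            if not self.free:
                raise MemoryError("KV pool exhausted -- evict or preempt")
            table.append(self.free.pop())
        self.lengths[cid] = length

    def release(self, cid):
        """Conversation ended: return its blocks to the shared pool."""
        self.free.extend(self.tables.pop(cid, []))
        self.lengths.pop(cid, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
cache.append("visit-1", 20)  # 20 tokens -> 2 blocks
cache.append("visit-2", 10)  # 10 tokens -> 1 block
cache.release("visit-1")     # frees 2 blocks for new conversations
print(len(cache.free))       # 3
```

Fragmentation, eviction, and preemption all fall out of this one data structure. It is an operating-systems problem wearing an ML costume.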


4. Routing Stops Being Stateless

Traditional scaling:

- Any replica can serve any request
- A round-robin load balancer is enough

LLMs benefit from prefix-aware routing.

If replica A holds the KV cache for a conversation, routing the next request to replica B destroys cache locality.

```mermaid
flowchart LR
    A[Router]
    A -->|Prefix Hash| B[Replica 1]
    A -->|Prefix Hash| C[Replica 2]
    B --> D[KV Cache]
    C --> E[KV Cache]
```

In applied healthcare IT, this mirrors longitudinal care systems. Context matters. Routing must respect state.
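A minimal sketch of the routing policy, assuming we hash a stable conversation identifier. Production routers typically hash the actual token prefix and use consistent hashing so replicas can join and leave; the names below are hypothetical:

```python
import hashlib

REPLICAS = ["replica-1", "replica-2", "replica-3"]

def route(conversation_id: str) -> str:
    """Prefix-aware routing: hash the conversation id so every turn of the
    same conversation lands on the replica that already holds its KV cache."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

# Every turn of the same conversation routes identically: cache locality holds.
turn_1 = route("patient-12345-encounter-7")
turn_2 = route("patient-12345-encounter-7")
assert turn_1 == turn_2
```

Compare that with round-robin, where turn 2 has roughly a two-in-three chance of landing on a replica with a cold cache and paying the full prefill cost again.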


5. Mixture-of-Experts Is Not Replication

LLMs increasingly use Mixture-of-Experts (MoE), where a router activates only a few experts per token:

```mermaid
flowchart TD
    A[Token] --> B[Shared Attention]
    B --> C1[Expert 1]
    B --> C2[Expert 2]
    B --> C3[Expert N]
```

This is one distributed model with dynamic internal traffic.

Not just more replicas.
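A toy top-k gating function makes the distinction concrete. All constants and weights here are hypothetical stand-ins for a learned gating network:

```python
import math
import random

NUM_EXPERTS, TOP_K = 8, 2

random.seed(0)
# Stand-in for learned gating weights: one score vector per expert.
WEIGHTS = [[random.gauss(0, 1) for _ in range(4)] for _ in range(NUM_EXPERTS)]

def gate(token_features):
    """Score every expert for this token, then keep only the top-k.
    Compute per token stays small even as total parameters grow."""
    scores = [sum(f * w for f, w in zip(token_features, expert_w))
              for expert_w in WEIGHTS]
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    # Softmax over the selected experts only; all others get zero weight.
    exp_scores = [math.exp(scores[e]) for e in top]
    total = sum(exp_scores)
    return [(e, v / total) for e, v in zip(top, exp_scores)]

routed = gate([0.5, -1.0, 0.2, 0.8])
print(routed)  # two (expert_id, weight) pairs; the weights sum to 1
```

Every token takes a different path through the same model. Scaling this means balancing dynamic expert traffic across GPUs, not stamping out identical copies behind a load balancer.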


Conclusion

LLM inference is not interesting because the models are large.

It is interesting because the systems problem is different:

- Variable-length computation
- Two distinct phases, prefill and decode
- Persistent KV state
- Stateful, cache-aware routing
- One distributed model, not many replicas

If we deploy LLMs into healthcare workflows using infrastructure assumptions from 2018, we will blame the model for what is actually a systems failure.

LLM inference is a different class of architecture.

And healthcare IT is becoming a systems discipline.