Why Your Inference Lives Closer Than You Think
Edge AI is not faster because the chips are bigger. It is faster because the network round-trip disappears — and for the workloads that actually matter, that round-trip is the whole product.
Key Takeaways
- A cloud round-trip on 4G/5G lands between 50 and 200 ms. Edge inference on the same workload lands between 5 and 20 ms. That gap, as much as 40× at the extremes, is the difference between "reacts in time" and "is too late."
- A modern AV generates roughly 1.5 GB of raw sensor data per second. Shipping that to the cloud is not a networking problem — it is a physics problem. Inference has to live where the sensors live.
- The dominant edge-AI deployment pattern is train in the cloud, infer at the edge, with model packages pushed via orchestrators like AWS IoT Greengrass or KubeEdge. The cloud stays the brain; the edge becomes the reflex.
- Edge is not free. A 28 GB Llama2-7B does not fit on most edge devices. Sharding, quantization, model staleness, and orchestration cost are real engineering tax.
- The right answer in 2026 is almost never "edge or cloud." It is "which part of the inference runs where" — and the answer is decided by latency budget, bandwidth budget, model size, and privacy regime.
---
Picture an autonomous vehicle closing on a stalled truck at highway speed. The physics deadline for emergency braking is roughly 100 ms — and most of that is mechanical, not computational. If the perception stack has to round-trip to a data center 50 milliseconds away, the vehicle has already eaten half of its decision budget on the network. Now layer in eight cameras and a LiDAR pushing 1.5 GB of raw data per second, and the picture stops being an architecture debate. It becomes a question of whether the car can do its job at all.
I started writing this expecting edge inference to be a niche optimization. By the third source, I had to update. Edge inference is no longer a "nice to have." It is the default architecture for the categories of AI that are dangerous to do wrong — safety, control, monitoring, anything that touches the physical world.
Why cloud-only inference breaks at the speed of physics
A cloud inference call pays for serialization at the device, transmission over the access network, queueing and processing in the data center, and the return trip. On a healthy 5G link, that floor is around 50 ms. On 4G, 80–120 ms. On a moving vehicle handing off towers, it is volatile. The Baidu developer-center analysis puts the realistic cloud range at 50–200 ms for inference of non-trivial models.
Edge inference on a well-tuned local runtime lands between 5 and 20 ms for the same workload. That figure is production telemetry from Jetson-class devices, not marketing. A concrete comparison from an AV reference design:
| Task | Edge latency | Cloud latency | Delta |
|------|-------------:|--------------:|------:|
| Object detection (camera + LiDAR fusion) | 8 ms | 120 ms | 15× |
| Local path planning | 15 ms | 200 ms | 13× |
For an emergency brake that must trigger inside a 100 ms budget, those numbers are not abstract. The cloud column is the car that did not stop in time. The edge column is the car that did.
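To put the table in physical terms, here is a back-of-the-envelope sketch of how far a vehicle travels while it waits for an inference result. The 30 m/s speed (about 108 km/h) is my assumption for illustration; it is not a figure from the reference design.

```python
# Distance a vehicle covers while waiting for an inference result.
# ASSUMPTION: 30 m/s (~108 km/h) is an illustrative highway speed.
SPEED_M_PER_S = 30.0


def distance_during_latency(latency_ms: float,
                            speed_m_per_s: float = SPEED_M_PER_S) -> float:
    """Metres travelled before the inference result arrives."""
    return speed_m_per_s * latency_ms / 1000.0


# Latencies (ms) taken from the table above: (task, edge, cloud).
for task, edge_ms, cloud_ms in [
    ("object detection", 8, 120),
    ("local path planning", 15, 200),
]:
    edge_m = distance_during_latency(edge_ms)
    cloud_m = distance_during_latency(cloud_ms)
    print(f"{task}: edge {edge_m:.2f} m, cloud {cloud_m:.2f} m")
```

At the assumed speed, object detection over the cloud path costs 3.6 m of travel before the car knows anything, against 0.24 m at the edge.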
Two other reasons cloud-only breaks, and they are often the bigger ones. Bandwidth: an AV with 8 cameras, 1–2 LiDARs, and radar generates ~1.5 GB/s; a factory with 1,000 vibration sensors on a stamping line produces ~80 GB/day, of which 99% is "system behaving normally." Uploading all of it and then deciding what to keep is not just expensive — it is wasteful in a way that compounds with fleet size. Regulation: healthcare, biometric, and personal data under GDPR or HIPAA often cannot be exported for inference. Running the model where the data is born is sometimes the only legal option.
```mermaid
flowchart LR
A[Sensor<br/>on device] -->|1.5 GB/s raw| B{Edge node<br/>local model}
B -->|8–15 ms inference| C[Action<br/>brake / sort / alert]
B -->|summary only| D[Cloud]
A -.->|all data| E[Cloud DC]
E -->|120–200 ms RTT| F[Action]
style B fill:#dff5e1
style E fill:#fde2e2
```
The green path is what edge gives you; the red path is what cloud-only forces on you. The cloud path is not "wrong" — it is the wrong tool for the speed-of-physics problem.
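The bandwidth side of the red path is easy to check by hand. The sketch below converts the 1.5 GB/s raw sensor rate into the sustained uplink it would need and the volume it produces per hour. It assumes decimal units (1 GB = 10^9 bytes) and no compression, which is a simplification.

```python
# What would shipping 1.5 GB/s of raw sensor data actually require?
# ASSUMPTIONS: decimal units (1 GB = 1e9 bytes), no compression, no protocol overhead.
RAW_RATE_GB_PER_S = 1.5  # figure quoted in the text
SECONDS_PER_HOUR = 3_600


def required_uplink_gbit_per_s(rate_gb_per_s: float) -> float:
    """Sustained uplink in Gbit/s needed to ship the raw stream."""
    return rate_gb_per_s * 8


def volume_tb_per_hour(rate_gb_per_s: float) -> float:
    """Raw data volume produced in one hour, in TB."""
    return rate_gb_per_s * SECONDS_PER_HOUR / 1_000


print(f"uplink needed: {required_uplink_gbit_per_s(RAW_RATE_GB_PER_S):.0f} Gbit/s")
print(f"volume: {volume_tb_per_hour(RAW_RATE_GB_PER_S):.1f} TB per hour of driving")
```

Under those assumptions a single vehicle needs a sustained 12 Gbit/s uplink and produces 5.4 TB per hour, which is why only the summary travels.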
How edge nodes actually cache and serve a model
The phrase "edge inference" hides a fair amount of plumbing. The dominant 2026 pattern is a six-step loop:
1. Train in the cloud on a fleet of GPUs against a curated dataset.
2. Package the model for the device: exported, optimized (TensorRT, ONNX, TFLite, Core ML), and bundled with its pre/post-processing. AWS packages this as a *Greengrass component* and a *SageMaker Edge Manager* artifact. KubeEdge wraps it in a container and ships it as a Kubernetes Deployment that happens to land on an edge node.
3. Push to the edge over MQTT, HTTPS, or a vendor control plane. Updates can be staged (canary 5% of fleet, observe, roll forward) or instant.
4. Serve locally. The edge node runs a runtime (ONNX Runtime, TensorRT-LLM, vLLM, NVIDIA Triton, Jetson Inference) and answers requests entirely on-device. AWS Greengrass is explicit: inference on a Greengrass device "incurs no data transfer costs and adds no latency," because the network is not in the path.
5. Summarize back to the cloud: logs, drift metrics, rare events, retraining triggers. The bandwidth asymmetry is the point.
6. Stay alive when the network doesn't. KubeEdge EdgeCore and AWS Greengrass Core both support offline autonomy. If the WAN drops, the inference path keeps working. State reconciles when the link returns.
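The staged rollout in step 3 can be sketched with a deterministic hash bucket, so the same devices stay in the canary cohort for a given model version. This is a generic illustration, not how Greengrass or KubeEdge implement staged deployments; `in_canary`, the device IDs, and the version string are all hypothetical.

```python
import hashlib


def in_canary(device_id: str, model_version: str, percent: float) -> bool:
    """Deterministically decide whether a device is in the canary cohort.

    Hashing version + device ID gives a stable bucket in 0..9999, so a device
    stays in or out of the cohort for the lifetime of that model version.
    """
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100  # e.g. 5% -> buckets 0..499


# Hypothetical fleet of 20,000 devices and a hypothetical model version.
fleet = [f"device-{i:05d}" for i in range(20_000)]
canary = [d for d in fleet if in_canary(d, "detector-v2", 5.0)]
print(f"{len(canary)} of {len(fleet)} devices receive the canary build")
```

Rolling forward is then a matter of raising `percent`; devices already in the cohort remain in it because their bucket does not change.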
```mermaid
sequenceDiagram
participant Cloud
participant Edge as Edge node
participant Sensor
Note over Cloud,Edge: One-time: model deployment
Cloud->>Edge: Push model package (Greengrass component / K8s Deployment)
Edge->>Edge: Cache model + runtime locally
Note over Edge,Sensor: Hot path: per-request inference
Sensor->>Edge: Raw data
Edge->>Edge: Local inference (5–20 ms)
Edge-->>Sensor: Action
Edge->>Cloud: Async summary / metrics
Note over Edge: If WAN drops: keep serving locally
```
That is the answer to "how does an edge node serve a model locally": the model is downloaded, cached, and executed by a runtime on the device. The cloud is in the lifecycle, not the request path.
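A toy version of that split, in plain Python, might look like the sketch below. `EdgeNode`, `toy_model`, and the summary format are hypothetical stand-ins: a real node would load an optimized model into a runtime such as ONNX Runtime or TensorRT, and the upload would go through the vendor's control plane.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class EdgeNode:
    """Toy edge node: cached model on the hot path, cloud only for async summaries."""

    def __init__(self) -> None:
        self.model: Optional[Callable[[List[float]], str]] = None
        self.model_version: Optional[str] = None
        self.outbox: Deque[dict] = deque()  # summaries waiting for upload
        self.wan_up = True

    def deploy(self, version: str, model: Callable[[List[float]], str]) -> None:
        """One-time step: cache the pushed model locally."""
        self.model_version, self.model = version, model

    def infer(self, reading: List[float]) -> str:
        """Hot path: runs the cached model and never touches the network."""
        if self.model is None:
            raise RuntimeError("no model deployed")
        action = self.model(reading)
        self.outbox.append({"version": self.model_version, "action": action})
        return action

    def flush(self, send: Callable[[dict], None]) -> int:
        """Upload queued summaries while the WAN is up; otherwise keep them queued."""
        sent = 0
        while self.wan_up and self.outbox:
            send(self.outbox.popleft())
            sent += 1
        return sent


def toy_model(reading: List[float]) -> str:
    """Hypothetical stand-in for a real runtime: threshold on peak amplitude."""
    return "alert" if max(reading) > 0.8 else "ok"


node = EdgeNode()
node.deploy("v1", toy_model)
node.wan_up = False  # simulate a WAN outage
actions = [node.infer(r) for r in ([0.1, 0.2], [0.9, 0.3])]
uploaded: List[dict] = []
sent_offline = node.flush(uploaded.append)  # nothing leaves while offline
node.wan_up = True
sent_online = node.flush(uploaded.append)  # backlog drains on reconnect
print(actions, sent_offline, sent_online)
```

The point of the shape is that `infer` has no network dependency at all; only `flush` cares whether the link is up.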
One detail that is easy to miss: the edge node does not have to be the sensor. An "edge" can be a Jetson in a car, a Greengrass Core on a factory gateway, a 5G base station's ARC box, or a small data center in a regional POP. What makes it edge is geographic proximity to the data source and removal of the WAN from the inner loop, not the form factor. A 72 TOPS Tesla FSD chip is a different purchase decision than a Coral accelerator on a smart camera.
IoT: where the economics actually pencil out
The IoT case for edge inference is less about heroism and more about arithmetic. Imagine you're the engineer who has just wired 1,000 vibration sensors onto stamping presses in a metal-fab plant. Each sensor produces roughly 1 KB/sec; the plant produces ~80 GB/day, 99% of it "normal." Ship it all to the cloud and you pay for transport, ingest, storage, and a model invocation on a stream that is, by construction, mostly boring. Don't ship it, and the model lives on a Jetson AGX at the cell.
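The arithmetic is short enough to run. This sketch takes "roughly 1 KB/sec" as exactly 1,000 bytes per second and uses decimal gigabytes, both of which are my simplifications.

```python
# Daily data volume for the stamping-line example.
# ASSUMPTIONS: 1 KB = 1,000 bytes, 1 GB = 1e9 bytes, sensors stream continuously.
SENSORS = 1_000
BYTES_PER_SENSOR_PER_S = 1_000
SECONDS_PER_DAY = 86_400
NORMAL_FRACTION = 0.99  # share of data that is "system behaving normally"


def daily_volume_gb(sensors: int, bytes_per_sensor_per_s: int) -> float:
    """Raw data produced per day, in GB."""
    return sensors * bytes_per_sensor_per_s * SECONDS_PER_DAY / 1e9


raw_gb = daily_volume_gb(SENSORS, BYTES_PER_SENSOR_PER_S)
interesting_gb = raw_gb * (1 - NORMAL_FRACTION)
print(f"raw: {raw_gb:.1f} GB/day, worth shipping: {interesting_gb:.2f} GB/day")
```

With those inputs the plant produces 86.4 GB/day, in line with the ~80 GB/day figure, and under 1 GB/day of it is anything other than "normal."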
The Baidu smart-factory case is unusually clean: a vibration-analysis workload went from a 2-second cloud response to 80 ms at the edge, cut bandwidth cost by 62%, and reached 99.2% defect-recognition accuracy on Jetson. The latency improvement is what gets quoted; the bandwidth saving is what makes the project survive its second budget review.
Privacy stops being a marketing line in this category. A home security camera doing person detection on-device sends a labeled event to the cloud, not a face. A hospital bedside monitor doing arrhythmia classification on the gateway keeps PHI inside the facility. AWS Greengrass, Azure IoT Edge, and NVIDIA Metropolis all make this pitch, and it is correct — under HIPAA, GDPR, and the patchwork of US state privacy laws, the edge is often the only place the data is allowed to live.
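As a minimal sketch of "a labeled event, not a face," the function below builds the payload that would leave the device from on-device detections. The `Detection` type, the field names, and the confidence threshold are all hypothetical; this is not any vendor's event schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One on-device detection result (hypothetical shape)."""
    label: str
    confidence: float


def to_cloud_event(camera_id: str, timestamp: float,
                   detections: List[Detection],
                   min_confidence: float = 0.5) -> str:
    """Build the JSON payload that leaves the device: labels only, no pixels."""
    labels = sorted({d.label for d in detections if d.confidence >= min_confidence})
    return json.dumps({"camera_id": camera_id, "timestamp": timestamp, "labels": labels})


event = to_cloud_event(
    "cam-front-door",  # hypothetical camera ID
    1_700_000_000.0,
    [Detection("person", 0.92), Detection("cat", 0.30)],
)
print(event)
```

The frame itself never appears in the payload, so what reaches the cloud is a few dozen bytes of metadata.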
The honest IoT case is also where I have to flag a tradeoff up front: edge IoT inference is *not* free. It is a capex line (device, gateway, on-prem server) replacing an opex line (cloud bills). For ten sensors it is not worth it. For ten thousand, the math flips hard. Break-even depends on traffic shape, not latency.
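A rough way to see where the math flips is a break-even calculation. Every price below is a made-up placeholder, not a quoted rate; swap in your own numbers before drawing any conclusion.

```python
import math
from typing import Optional

# ASSUMPTIONS: all figures are hypothetical placeholders for illustration only.
CLOUD_COST_PER_SENSOR_MONTH = 2.0  # USD: transport + ingest + storage + inference
GATEWAY_CAPEX = 3_000.0            # USD per edge gateway
GATEWAY_OPEX_MONTH = 50.0          # USD per gateway: power, maintenance
SENSORS_PER_GATEWAY = 1_000


def breakeven_months(edge_capex: float, edge_monthly_opex: float,
                     cloud_monthly_cost: float) -> Optional[float]:
    """Months until edge capex is repaid by avoided cloud spend.

    Returns None when the cloud is never more expensive per month.
    """
    monthly_saving = cloud_monthly_cost - edge_monthly_opex
    if monthly_saving <= 0:
        return None
    return edge_capex / monthly_saving


def breakeven_for_fleet(sensors: int) -> Optional[float]:
    """Break-even for a fleet, buying one gateway per SENSORS_PER_GATEWAY sensors."""
    gateways = math.ceil(sensors / SENSORS_PER_GATEWAY)
    return breakeven_months(
        gateways * GATEWAY_CAPEX,
        gateways * GATEWAY_OPEX_MONTH,
        sensors * CLOUD_COST_PER_SENSOR_MONTH,
    )


for n in (10, 10_000):
    months = breakeven_for_fleet(n)
    print(n, "sensors:", "never pays back" if months is None else f"{months:.1f} months")
```

With these placeholder prices, ten sensors never repay the gateway, while ten thousand repay it in under two months. The shape of that result matters more than the specific values.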
Autonomous vehicles: the case where the architecture is the safety claim
The AV case is the cleanest argument for edge inference, and the most extreme. Tesla's published description of its Full Self-Driving software is a study in what edge inference has to look like at the limit. A full build of the perception and planning networks is 48 distinct networks, producing 1,000 tensors (predictions) at each timestep, at 36 fps across 12 cameras. Training takes 70,000 GPU hours. Inference runs on Tesla's custom dual-NPU FSD SoC, tightly coupled to the camera pipeline.
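Those figures imply a per-frame time budget that is worth computing. The sketch assumes one inference timestep per camera frame, which is my reading of the numbers rather than something Tesla states, and it reuses the ~50 ms 5G floor from earlier.

```python
# Per-frame budget implied by the published FSD figures.
# ASSUMPTION: one inference timestep per camera frame.
FPS = 36
TENSORS_PER_TIMESTEP = 1_000
CLOUD_RTT_FLOOR_MS = 50  # healthy-5G floor quoted earlier in the article

frame_budget_ms = 1000 / FPS
tensors_per_second = TENSORS_PER_TIMESTEP * FPS
frames_per_rtt = CLOUD_RTT_FLOOR_MS / frame_budget_ms

print(f"frame budget: {frame_budget_ms:.1f} ms")
print(f"tensor outputs per second: {tensors_per_second}")
print(f"frames elapsed during one cloud round-trip: {frames_per_rtt:.1f}")
```

At 36 fps each frame has about 27.8 ms, so even the best-case cloud round-trip spans 1.8 frames before a result could come back.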
Two design choices stand out:
- Custom silicon. Tesla does not run its safety stack on a general-purpose GPU. The FSD chip is purpose-built for CNN inference at fixed latency — the same bet NVIDIA's IGX Orin and DRIVE platforms make for industrial and automotive customers.
- Latency as a first-class metric. Tesla's own code-foundation priorities name throughput, latency, correctness, and determinism as the metrics it optimizes for. On the vehicle, latency is not a property of the network; it is a property of the silicon and the software. If the team cared primarily about cloud round-trips, the FSD chip would not exist.
When your workload has a hard physics deadline, you do not negotiate with the network. You remove it from the path. The 8 ms / 120 ms gap is not a benchmark; it is the difference between a vehicle that stops and one that does not.
Autonomous vehicles are also the cleanest example of the *hybrid* architecture that the tradeoffs section insists on. Tesla's fleet *does* use the cloud — for fleet learning, for collecting rare scenarios, for over-the-air model updates. The cloud is the brain; the car is the reflex. A modern FSD build is a hybrid system, not pure edge, and pretending otherwise misrepresents how it works.
The tradeoffs nobody puts on the first slide
I find it easier to believe an architecture pitch when the seller admits the cost. Here is the honest ledger.
Edge is not free. A full-precision Llama2-7B is ~28 GB (7 billion parameters at 4 bytes each). That exceeds the memory of most Jetson-class modules and many industrial edge servers. To run an LLM on edge-class hardware you have to do one of three things: quantize (costs accuracy), shard across devices (costs orchestration and inter-device bandwidth; EdgeShard reports up to a 50% latency cut and 2× throughput, but the engineering bill is real), or accept a smaller model (costs capability). The IEEE IoT Journal paper on EdgeShard is explicit that the sharded path exists to get around the 28 GB wall.
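The 28 GB figure, and what quantization or sharding buys you, follows from a one-line formula. The sketch below counts weights only; it ignores activations, KV cache, and runtime overhead, so real requirements are higher. The 16 GB device memory is a hypothetical example, and 7e9 is the nominal parameter count.

```python
import math

# Bytes per parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(n_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def devices_needed(n_params: float, precision: str, device_mem_gb: float) -> int:
    """Lower bound on devices if weights are sharded evenly across them."""
    return math.ceil(weights_gb(n_params, precision) / device_mem_gb)


N_PARAMS = 7e9         # nominal size of a "7B" model
DEVICE_MEM_GB = 16.0   # ASSUMPTION: hypothetical edge device memory

for precision in BYTES_PER_PARAM:
    size = weights_gb(N_PARAMS, precision)
    count = devices_needed(N_PARAMS, precision, DEVICE_MEM_GB)
    print(f"{precision}: {size:.1f} GB of weights, at least {count} device(s)")
```

Full precision needs at least two of the assumed 16 GB devices, while 8-bit or 4-bit weights fit on one. That is the quantize-versus-shard choice in numbers.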
Edge shifts the operational cost line. Babu and Stewart's ACM/IEEE Symposium on Edge Computing 2019 study on energy-latency-staleness tradeoffs is uncomfortable reading: naive scheduling policies burn 7× more energy than the best random-walk policy they tested. Edge is not "no infrastructure" — it is a different infrastructure. KubeEdge, Greengrass, and Azure IoT Edge exist because someone has to manage model versions, autoscale inference pods, and reconcile state when the WAN drops. Whoever owns that surface owns the cost.
Edge is not always more private. Local inference keeps raw data on-device, which is a real win. But the model itself is now physically accessible to anyone with device access. Model extraction, model inversion, and on-device adversarial attacks are an active research area. A naive "edge is more secure" claim is half-true at best; the security model has to be designed.
Edge can be wasteful at low scale. A 4-TOPS Coral Edge TPU serving 10 requests/minute is a poor use of silicon. Cloud inference is a near-perfect variable cost: pay per request, scale to zero, no idle hardware. The break-even is real and depends on request shape, not latency.
Edge does not replace the cloud; it complements it. In production, edge and cloud are not competing. The cloud is for training, aggregation, drift monitoring, and rare-event retraining. The edge is for the inner loop. The decision is *where in the inference pipeline* each piece runs, not "edge or cloud" as a binary.
```mermaid
flowchart TB
subgraph Cloud
T[Training]
M[Model registry]
D[Drift monitoring]
end
subgraph Edge
R[Local runtime]
C[Model cache]
end
subgraph Device
S[Sensor]
end
M -->|deploy / update| C
C --> R
S --> R
R -->|summary| D
T -->|new model| M
```
The cloud stays. The edge is added. The decision is about which milliseconds get deleted.
| Dimension | Cloud inference | Edge inference | Notes |
|-----------|-----------------|----------------|-------|
| Latency (typical) | 50–200 ms | 5–20 ms | Up to 40× delta on 4G/5G |
| Bandwidth need | 100% of raw data | 30–70% of raw data (after local filtering) | Hidden line item |
| Cost model | Pay-per-use (opex) | One-time hardware (capex) | Break-even depends on request volume |
| Privacy posture | Data leaves device | Data stays on device | Both have their own threat model |
| Max model size | Effectively unlimited (cluster) | Constrained by device (28 GB Llama2-7B does not fit) | Quantize / shard / smaller model required |
| Operational complexity | Low (managed) | High (orchestration, model lifecycle, offline autonomy) | KubeEdge / Greengrass are the answer to this |
| Real-time suitability | Poor for <50 ms deadlines | Excellent | The reason the architecture exists |
The decision I now make by default
When the question is "edge or cloud," I have stopped answering it as a binary. The question I ask instead is: what is the latency budget, and is the network in it?
If the budget is above 200 ms — chatbot responses, document summarization, batch analytics, training itself — the cloud is the right answer. Bandwidth and elasticity dominate; latency is irrelevant.
If the budget is below 20 ms — vehicle control, robotic actuation, anomaly-triggered shutdown, real-time vision on a moving platform — the edge is the only answer. The network cannot be in the path.
In between, you go hybrid: inference at the edge, training and aggregation in the cloud, model updates flowing downward, summaries flowing up. Greengrass and KubeEdge are both, fundamentally, products that exist to make this hybrid clean.
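Written down as code, the rule of thumb is only a few lines. The thresholds come from the paragraphs above. The `data_must_stay_local` flag is my addition, to illustrate the privacy constraint discussed earlier, and the function name is hypothetical.

```python
def placement(latency_budget_ms: float, data_must_stay_local: bool = False) -> str:
    """Rule-of-thumb placement for an inference workload.

    Above 200 ms the cloud wins; below 20 ms the edge is the only option;
    anything in between is hybrid. A data-residency constraint rules out
    pure cloud regardless of the latency budget.
    """
    if latency_budget_ms < 20:
        return "edge"
    if latency_budget_ms > 200 and not data_must_stay_local:
        return "cloud"
    return "hybrid"


# Illustrative workloads with assumed latency budgets.
for name, budget_ms, local in [
    ("document summarization", 2_000, False),
    ("emergency braking", 10, False),
    ("factory anomaly alert", 80, False),
    ("bedside arrhythmia triage", 1_000, True),
]:
    print(f"{name}: {placement(budget_ms, local)}")
```

It is deliberately crude: bandwidth budget and model size, the other two inputs named in the takeaways, still need a human in the loop.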
The numbers behind that decision are not new. What is new is the willingness of serious engineering teams to act on them. The market sizing tells the same story: edge computing, at roughly USD 19–25 billion in 2024–2025 and growing at a 21–48% CAGR depending on which slice you measure, is no longer the budget line that gets cut. It is the budget line that gets defended.
The 8 ms / 120 ms gap is the headline. The harder part is what you do once you've decided the gap is real. That is what the rest of the edge stack — model packaging, KubeEdge, Greengrass, Tesla's FSD silicon, EdgeShard's sharded LLMs — is for. It is the plumbing that turns "we should probably do inference at the edge" into a system that actually does it, at 5–20 ms, on a million devices.
---
References:
- IBM — What is Edge AI? — definition, market sizing, and use-case catalog.
- Red Hat — What is IoT Edge computing? — IoT edge architecture and the autonomous-vehicle case.
- NVIDIA — Edge Computing — vendor stack for enterprise, industrial, embedded, and network edge.
- AWS — IoT Greengrass ML Inference — "no data transfer costs, no added latency" vendor claim and packaging model.
- Tesla — AI & Robotics — FSD inference architecture: 48 networks, 1,000 tensors per timestep, 70,000 GPU hours per build.
- Baidu Developer Center — 边缘计算驱动无人驾驶 — 8 ms / 120 ms AV latency comparison and architecture.
- M. Zhang et al., "EdgeShard: Efficient LLM Inference via Collaborative Edge Computing," IEEE IoT Journal 12(10), 2025 — peer-reviewed evidence on sharded edge LLMs (50% latency cut, 2× throughput, 28 GB model wall).
- Babu & Stewart, "Energy, Latency and Staleness Tradeoffs in AI-driven IoT," ACM/IEEE Symposium on Edge Computing, 2019 — factorial design showing 7× energy penalty from naive scheduling at the edge.
- Market Research Future — Edge Computing Market Report — market sizing ($19.38B in 2024 → $1.51T in 2035, 48.62% CAGR).
- KubeEdge — Cloud-Native Edge Computing — CNCF-graduated orchestration for edge inference workloads.
Closing thought: The 8 ms versus 120 ms gap is the headline; the capex bill that edge forces on you is the second chapter. The architectures that survive the next five years will be the ones that answer that bill honestly, not the ones that pretend edge is free.