A 100 Hz Day in the Life of selfdrived
The reason openpilot ships on 332 cars is not the model. It is the decision to treat every subsystem as an independently faulting process, communicating through Cap'n Proto messages on a fixed-frequency bus.
Key Takeaways
- selfdrived is a 100 Hz pub/sub consumer, not a controller: it makes the engagement decision; the controller is downstream.
- The process boundary is the fault boundary: a crashing modeld cannot stall the lane-keeping actuator because the two are separate processes on the same device.
- Frequency is the discipline. CAN traffic runs at 100 Hz, model inference at 20 Hz, the IMU at 104 Hz, the camera frames at 20 Hz. Each subscriber decides what to do when a message is late.
- This is why a Python codebase can ship in a safety-critical system: the safety code is not in Python. It is in the C firmware of the panda device, which is the only thing the car ever listens to.
Here are the eight most important lines in the entire openpilot repository. They live in openpilot/selfdrive/selfdrived/selfdrived.py, between lines 86 and 93. I will show them to you, then I will show you why they are the entire architecture in miniature.
self.sm = messaging.SubMaster(['deviceState', 'pandaStates', 'peripheralState', 'modelV2', 'liveCalibration',
                               'carOutput', 'driverMonitoringState', 'longitudinalPlan', 'livePose', 'liveDelay',
                               'managerState', 'liveParameters', 'radarState', 'liveTorqueParameters',
                               'controlsState', 'carControl', 'driverAssistance', 'alertDebug', 'userBookmark', 'audioFeedback',
                               'lateralManeuverPlan'] + \
                              self.camera_packets + self.sensor_packets + self.gps_packets,
                              ignore_alive=ignore, ignore_avg_freq=ignore,
                              ignore_valid=ignore, frequency=int(1/DT_CTRL))
Read it again. That is a single SubMaster call. It registers a subscription to 21 named message services, plus the services listed in self.camera_packets, self.sensor_packets, and self.gps_packets. The call is parameterised by frequency=int(1/DT_CTRL), and DT_CTRL is the canonical control-loop period, defined in openpilot/common/realtime.py as 0.01 seconds. One hundred hertz. That single argument means: "every 10 milliseconds, deliver the latest value of each of those services to me, in one batch, or skip this tick if any of them is overdue." That is the entire process architecture of openpilot, in one call. Now let me show you the rest of it.
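To make the "latest value of every service, once per tick" idea concrete, here is a minimal sketch in plain Python. ToySubMaster and all of its fields are hypothetical stand-ins I wrote for this article; they are not cereal's SubMaster implementation. The only thing taken from the source is frequency=int(1/DT_CTRL).

```python
DT_CTRL = 0.01  # control-loop period in seconds, as quoted above


class ToySubMaster:
    """Hypothetical stand-in: keeps only the newest message per service."""

    def __init__(self, services, frequency):
        self.frequency = frequency
        self.data = {s: None for s in services}       # latest payload
        self.recv_tick = {s: None for s in services}  # tick of last arrival
        self.updated = {s: False for s in services}   # fresh on this tick?
        self.tick = 0

    def update(self, incoming):
        """Advance one tick; `incoming` maps service name -> new payload."""
        self.tick += 1
        for s in self.data:
            self.updated[s] = s in incoming
            if s in incoming:
                self.data[s] = incoming[s]
                self.recv_tick[s] = self.tick

    def alive(self, service, max_age_ticks):
        """True if the service delivered within the last `max_age_ticks`."""
        last = self.recv_tick[service]
        return last is not None and self.tick - last <= max_age_ticks


sm = ToySubMaster(["modelV2", "carOutput"], frequency=int(1 / DT_CTRL))
sm.update({"modelV2": "plan-0", "carOutput": "out-0"})
sm.update({"carOutput": "out-1"})  # modelV2 is late on this tick
```

After the second update, sm.data["modelV2"] still holds the previous plan while sm.updated["modelV2"] is False. The consumer gets to decide what a late message means; the subscription itself never waits.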
The bus and the kernel
Imagine you are reading telemetry from your own car at 100 Hz. The road-camera frame arrives at 20 Hz. The CAN bus at 100 Hz. The driving model at 20 Hz. The IMU at 104 Hz. They never block each other. If the model is slow on a tick, selfdrived skips the model on that tick and uses the last good plan. If a camera frame drops, the lane-keeping controller uses the last visible lane. If the GPS packets are absent for three seconds, selfdrived keeps driving on the localizer. This is the entire point of the architecture: the slowest producer in the system cannot stall the fastest consumer, because they are separate processes and the bus drops messages rather than blocks.
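The "drops rather than blocks" behaviour can be imitated with a one-slot mailbox. This is a toy simulation under my own assumptions (a deque standing in for the bus, integer ticks standing in for time, one invented dropped publication); it says nothing about how cereal implements its queues.

```python
from collections import deque

# One-slot mailbox: appending to a full deque(maxlen=1) discards the old item,
# so the publisher never waits for the reader.
mailbox = deque(maxlen=1)
mailbox.append("plan-old")
mailbox.append("plan-new")   # replaces "plan-old" instead of blocking
newest_only = list(mailbox)  # only the newest message survives
mailbox.clear()

TICKS = 20         # 20 consumer ticks = 200 ms at 100 Hz
MODEL_EVERY = 5    # a 20 Hz producer publishes on every 5th 100 Hz tick
DROPPED_TICK = 10  # pretend the model misses this publication

last_plan = None
fresh, held = 0, 0
for tick in range(TICKS):
    if tick % MODEL_EVERY == 0 and tick != DROPPED_TICK:
        mailbox.append(f"plan@{tick}")
    if mailbox:
        last_plan = mailbox.pop()  # non-blocking read of the newest message
        fresh += 1
    else:
        held += 1                  # nothing new: keep using the last plan
```

The consumer completes all 20 ticks regardless of what the producer does. It sees a fresh plan on three of them and reuses the last plan on the other seventeen, including the tick where the producer went missing.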
The bus is implemented in openpilot/cereal/. It is Cap'n Proto, the same serialization library that Cloudflare and Sandstorm use, chosen because it is small, zero-copy, and has no schema compilation step on the Python side. Each service is declared in services.py with a frequency, a decimation factor, and a queue size. can runs at 100 Hz with a decimation of 2053 (about three messages per logged segment, the rest are dropped to keep disk usage bounded). controlsState runs at 100 Hz with a decimation of 10 and a 2 MB queue. modelV2 runs at 20 Hz. accelerometer and gyroscope run at 104 Hz — the raw sensor rate. The schema is a single Cap'n Proto file (log.capnp) and the service list is a single Python file. The whole bus fits in your head.
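The decimation arithmetic is easy to check by hand. The sketch below uses only the frequencies and decimation factors quoted above. The Service dataclass and the 60-second segment length are my assumptions for illustration, and this is not the layout of the real services.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    frequency_hz: float
    decimation: int  # keep every Nth message in the decimated log


# Hypothetical table containing only the figures quoted in the text.
SERVICES = {
    "can": Service(100.0, 2053),
    "controlsState": Service(100.0, 10),
}

SEGMENT_SECONDS = 60  # assumption: one logged segment covers one minute


def kept_per_segment(svc: Service) -> float:
    """Average number of messages that survive decimation in one segment."""
    return svc.frequency_hz * SEGMENT_SECONDS / svc.decimation


def keep(counter: int, decimation: int) -> bool:
    """Keep message number `counter` if it lands on the decimation stride."""
    return counter % decimation == 0
```

Under the one-minute assumption, can produces 6000 messages per segment and keeps about 2.9 of them, which matches the "about three messages per logged segment" figure. controlsState keeps 600.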
The kernel is panda: a separate C codebase, kept in its own repository and included at panda/ as a git submodule. panda is the only component that talks to the car's CAN bus directly. It checks every command the openpilot Python stack sends against the safety rules for the fingerprinted car, its firmware is written to MISRA C coding guidelines, and a hardware watchdog physically cuts the actuator line if the firmware stops responding. openpilot's Python code can issue the command "steer left 3 degrees at 5 Nm". panda can refuse to forward it, and if panda refuses, the car never sees the command. That is the safety kernel, and it is not in Python. It is in C, in firmware, on a separate microcontroller. The Python that surrounds it is, structurally, just a slightly fancy keyboard.
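The shape of that "can refuse" gate is worth seeing, even in toy form. The real checks are C firmware in panda; what follows is a Python sketch of the idea only. Every name (SteerCommand, SafetyLimits, gate) and every limit value is invented for illustration and does not come from panda.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SteerCommand:
    torque_nm: float


@dataclass(frozen=True)
class SafetyLimits:
    max_torque_nm: float       # illustrative absolute limit
    max_torque_step_nm: float  # illustrative limit on change per command


def gate(cmd: SteerCommand, prev: SteerCommand, limits: SafetyLimits,
         controls_allowed: bool) -> Optional[SteerCommand]:
    """Return `cmd` if it passes every check, else None (nothing is forwarded)."""
    if not controls_allowed:
        return None
    # NaN compares False against any limit, so reject non-finite values first.
    if not math.isfinite(cmd.torque_nm):
        return None
    if abs(cmd.torque_nm) > limits.max_torque_nm:
        return None
    if abs(cmd.torque_nm - prev.torque_nm) > limits.max_torque_step_nm:
        return None
    return cmd


LIMITS = SafetyLimits(max_torque_nm=6.0, max_torque_step_nm=1.0)
PREV = SteerCommand(4.5)
```

The gate never repairs a command. It forwards it unchanged or it forwards nothing, so everything upstream of it can be wrong without the car finding out.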
graph LR
subgraph Sensors
IMU[IMU 104Hz]
CAN[CAN 100Hz]
Cam[Road/Wide/Driver Cameras 20Hz]
GPS[GPS]
end
subgraph Openpilot[openpilot Python stack]
Modeld[modeld<br/>driving model]
DM[monitoring<br/>driver monitor]
Radard[radard<br/>radar tracker]
Selfdrived[selfdrived<br/>engagement state]
Controlsd[controlsd<br/>PID/MPC]
Plannerd[plannerd<br/>longitudinal plan]
end
Panda[panda<br/>C firmware<br/>safety kernel]
Car[Car CAN bus<br/>steering / gas / brake]
IMU -->|livePose| Selfdrived
CAN --> Panda
Cam --> Modeld
GPS --> Selfdrived
Modeld -->|modelV2 20Hz| Selfdrived
Modeld -->|modelV2 20Hz| Controlsd
DM -->|driverMonitoringState| Selfdrived
Radard -->|radarState 20Hz| Controlsd
Selfdrived -->|selfdriveState 100Hz| Controlsd
Plannerd -->|longitudinalPlan 20Hz| Controlsd
Controlsd -->|carControl 100Hz| Panda
Panda -->|validated| Car
Car -->|feedback| CAN
I drew that diagram with Mermaid for legibility, but the actual one in your head does not need the rounded boxes. The point is the shape: sensors on the left, kernel on the right, the openpilot Python stack in the middle, every arrow a typed message, every rectangle a process. There is no monolith. There is no central orchestrator. There is a bus and a kernel and a set of consumers that subscribe to whatever they need.
The case for many small processes
Here is the part of the architecture that took me the longest to internalise, and that I think most newcomers get wrong. The natural first reaction to "30+ processes" is "this is overengineered, why not one big Python program?". The answer is that the process boundary is the fault boundary. Consider three failure modes.
A driving model that returns a NaN on a particular input frame. If the model and the controller were in the same process, a NaN could propagate through the controller and into the actuator command. With them separated: the model process crashes. The next modelV2 message is missing. selfdrived and controlsd notice the missing message and use the last valid plan. The model process is restarted by the manager. The car never sees a NaN.
A driver-monitoring model that hangs on a single frame. If the DM model and the engagement state machine were the same process, the hang would freeze the entire engagement logic. With them separated: the DM process is wedged, but selfdrived keeps running. The driver-monitoring message becomes stale, and — this is the safety chapter's material — the state machine begins escalating through AlertLevel warnings, eventually disengaging openpilot.
A library upgrade that introduces a RecursionError deep in some logging helper. Same story. A separate process dies, the bus keeps running, the system degrades gracefully.
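All three failure modes reduce to the same mechanism: the operating system contains the crash, and a supervisor restarts the child. Here is a minimal sketch using only the standard library. run_isolated and supervise are hypothetical helpers for this article, not the openpilot manager's API.

```python
import subprocess
import sys


def run_isolated(source: str) -> int:
    """Run `source` in a fresh Python process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def supervise(source: str, max_restarts: int) -> int:
    """Launch the child, restarting on failure; return how many launches ran."""
    launches = 0
    for _ in range(max_restarts + 1):  # first launch plus allowed restarts
        launches += 1
        if run_isolated(source) == 0:
            break
    return launches


crash_code = run_isolated("raise RecursionError('deep in a logging helper')")
```

The parent process is still running after the child dies with a non-zero exit code. A function call inside a single program would not give you that guarantee.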
The 100 Hz frequency is the discipline that makes this work. If a process is allowed to be late — if the bus is allowed to wait — then a slow process stalls a fast process. By setting the bus to a fixed 100 Hz tick and dropping late messages, comma enforces the rule that no slow process ever blocks a fast one. The price is that some messages are dropped; the benefit is that the system as a whole never freezes.
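A fixed tick is mostly bookkeeping: advance the deadline by a constant interval, and report when an iteration overran it. The sketch below takes the clock as an argument so the behaviour is deterministic. ToyRatekeeper is a hypothetical class for illustration and is not openpilot's Ratekeeper.

```python
class ToyRatekeeper:
    """Hypothetical fixed-rate scheduler driven by an injected clock."""

    def __init__(self, rate_hz: float, now: float):
        self.interval = 1.0 / rate_hz
        self.next_deadline = now + self.interval
        self.frame = 0

    def keep_time(self, now: float):
        """Call at the end of each loop iteration.

        Returns (sleep_seconds, lagged). Deadlines advance by a fixed
        interval, so one slow iteration does not shift every later tick.
        """
        remaining = self.next_deadline - now
        self.next_deadline += self.interval
        self.frame += 1
        lagged = remaining < 0
        return max(remaining, 0.0), lagged


rk = ToyRatekeeper(rate_hz=100, now=0.0)
on_time = rk.keep_time(now=0.004)  # finished 6 ms before the 10 ms deadline
overrun = rk.keep_time(now=0.025)  # finished 5 ms after the 20 ms deadline
```

The overrun iteration gets no sleep and is flagged as lagged, but the third deadline still sits at 30 ms. The schedule does not stretch to accommodate the slow iteration.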
I want to be honest about the cost. The first time you read selfdrived.py and see the explicit subscription list, the per-service frequency and decimation constants, the three layers of SubMaster config (ignore_alive, ignore_avg_freq, ignore_valid), the per-timer Ratekeeper, the per-process REPLAY/SIMULATION/TESTING_CLOSET environment switches — the first time you read it, the codebase will feel heavy. It is heavy. The reason it is heavy is that every one of those knobs corresponds to a real bug that comma has debugged in production. The codebase is the bug log. That is the only honest way I can describe the architecture.
What this enables, and what it costs
The upside of the bus-and-kernel architecture is the thing we are here to discuss: 332 cars. Every supported car has a different CAN bus dialect, a different actuator command set, a different set of supported ADAS features, and a different driver-monitoring camera position. The architecture lets comma treat each of those differences as a "driver" — a separate Python module in openpilot/selfdrive/car/ that translates between the bus messages and the per-car fingerprint. To add a new car, you write a new driver. You do not touch selfdrived. You do not touch controlsd. You do not touch modeld. You write a driver, you add it to the list, you ship.
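The "write a driver, add it to the list" workflow looks roughly like a plug-in registry. The sketch below is my own illustration of that pattern. The CarDriver interface, the register decorator, the CAN addresses, and the byte encodings are all invented, and none of them reflect the real car interface code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class CanFrame:
    address: int
    data: bytes


@dataclass(frozen=True)
class CarState:
    speed_mps: float


class CarDriver(ABC):
    """Hypothetical per-car translator between CAN frames and bus messages."""

    @abstractmethod
    def parse(self, frame: CanFrame) -> CarState: ...

    @abstractmethod
    def steer_frame(self, torque: int) -> CanFrame: ...


DRIVERS = {}  # "the list": car name -> driver class


def register(name):
    def wrap(cls):
        DRIVERS[name] = cls
        return cls
    return wrap


@register("toy_car")
class ToyCar(CarDriver):
    STEER_ADDR = 0x200  # invented address

    def parse(self, frame: CanFrame) -> CarState:
        # Invented encoding: unsigned 16-bit big-endian, 0.01 m/s per bit.
        raw = int.from_bytes(frame.data[:2], "big")
        return CarState(speed_mps=raw * 0.01)

    def steer_frame(self, torque: int) -> CanFrame:
        # Invented encoding: signed 16-bit big-endian torque request.
        return CanFrame(self.STEER_ADDR, torque.to_bytes(2, "big", signed=True))


car = DRIVERS["toy_car"]()
```

Nothing outside ToyCar knows the byte layout. That containment is the property that lets a new car land without edits to the engagement logic or the controller.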
The cost is operational. Thirty processes means thirty things to monitor, thirty logs to correlate, thirty places a regression can hide. comma's answer to that is the testing closet — the famous line in the README: "We run the latest openpilot in a testing closet containing 10 comma devices continuously replaying routes." We will get to that in E04. The point for now is that the architecture is not free. It is the price of 332 cars.
There is one more thing the architecture enables that is easy to miss. Because every process is independent, you can take any one of them and run it standalone. The driving model can be evaluated on a recorded log without the controller. The DM model can be tested against a static image. The state machine can be unit-tested with synthetic event sequences. None of this requires the car. That is why openpilot is a research project as well as a product. The bus makes the components individually testable, and the kernel makes the system as a whole safe even when the components are imperfect.
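Testing a state machine with synthetic event sequences needs nothing but a transition table. The states and events below are a simplified, hypothetical model I made up for this example. They are not the actual openpilot engagement state machine.

```python
from enum import Enum, auto


class State(Enum):
    DISABLED = auto()
    ENABLED = auto()
    SOFT_DISABLING = auto()


class Event(Enum):
    ENABLE = auto()
    USER_DISABLE = auto()
    SOFT_DISABLE = auto()
    CLEARED = auto()
    TIMEOUT = auto()


# Hypothetical transition table; unlisted (state, event) pairs are ignored.
TRANSITIONS = {
    (State.DISABLED, Event.ENABLE): State.ENABLED,
    (State.ENABLED, Event.USER_DISABLE): State.DISABLED,
    (State.ENABLED, Event.SOFT_DISABLE): State.SOFT_DISABLING,
    (State.SOFT_DISABLING, Event.CLEARED): State.ENABLED,
    (State.SOFT_DISABLING, Event.TIMEOUT): State.DISABLED,
    (State.SOFT_DISABLING, Event.USER_DISABLE): State.DISABLED,
}


def step(state: State, event: Event) -> State:
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.DISABLED) -> State:
    """Feed a synthetic event sequence through the machine."""
    for event in events:
        state = step(state, event)
    return state
```

A test is then a list of events and an expected final state. No car, no camera, and no bus are involved.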
I came into this analysis thinking "30 processes is bloat". I came out of it thinking "the process boundary is the fault boundary, and the fault boundary is the safety boundary". The bloat is the budget for safety. It is paid in lines of Python and clock cycles. It is, in my view, the single most important design decision in the codebase.
Now we know the processes exist. We know the bus drops messages, and the kernel validates every command. We know the 100 Hz tick is the discipline that keeps the system from freezing. The next question is the obvious one: what stops any one of those 30 processes from doing the wrong thing at 100 Hz? That is the safety chapter, and it is the question comma has been answering in C, in Python, and in social contracts, for the better part of a decade.