Building ACRAS: Real-Time Crash Detection from Highway Cameras
The Short Version:
I chained YOLOv8, optical flow, and a custom CNN into a pipeline that watches highway cameras and detects crashes faster than a 911 call.
Highway crashes kill 40,000 people a year in the US. The average time between a crash occurring and emergency services being dispatched is 4–8 minutes — often because detection depends on someone calling 911. Cameras are already everywhere on US highways. They just aren't being watched intelligently.
That's what ACRAS does. It watches 50+ live DOT feeds simultaneously, detects crashes in real time, and generates an incident report before most bystanders have unlocked their phones.
This is how I built it.
The Problem with Naive Approaches
The first instinct is to throw a classifier at each frame: "does this frame contain a crash?" But this fails immediately for several reasons.
Crashes are temporally sparse. A 24-hour camera feed might contain a single crash. Training a frame-level classifier on this data gives you something that learns to say "no crash" 99.99% of the time and scores excellent accuracy while being completely useless.
Crashes are also visually ambiguous in a single frame. A car stopped on the shoulder looks identical to a car that just crashed. A truck changing lanes fast looks suspicious. You need motion context — what happened in the frames before this one.
And DOT cameras are low quality. 480p, compressed, with artifacts. Pre-trained ImageNet weights don't transfer cleanly.
The solution I landed on chains three models together, each responsible for a different part of the problem.
The Three-Stage Pipeline
Stage 1: Vehicle Detection with YOLOv8
Every frame first goes through YOLOv8n (the nano variant — speed matters at 30fps). The goal here isn't crash detection. It's just to find vehicles: their bounding boxes, class (car, truck, motorcycle), and confidence scores.
I fine-tuned on a dataset of DOT camera frames specifically — not COCO, not dashcam footage. The domain gap between a clean dashcam and a rain-blurred overhead DOT feed is significant. Detection accuracy jumped from 71% to 89% after domain-specific fine-tuning on ~8,000 manually labeled frames.
The output of stage 1 is a set of tracked bounding boxes per frame, linked across time using a simple IoU-based tracker. Each vehicle gets an ID that persists across frames.
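A minimal version of that IoU matching step might look like this. This is a greedy sketch of the idea, not the production tracker; the class and parameter names are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class IoUTracker:
    """Greedy IoU tracker: match each detection to the best-overlapping
    existing track, otherwise start a new track."""
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}   # track id -> last known bbox
        self.next_id = 0

    def update(self, detections):
        assigned = {}
        unmatched = dict(self.tracks)
        for det in detections:
            best_id, best_iou = None, self.iou_threshold
            for tid, bbox in unmatched.items():
                score = iou(det, bbox)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:          # no sufficient overlap: new vehicle
                best_id = self.next_id
                self.next_id += 1
            else:                        # each track matched at most once
                del unmatched[best_id]
            self.tracks[best_id] = det
            assigned[best_id] = det
        # drop tracks with no detection this frame (no re-identification)
        for tid in unmatched:
            del self.tracks[tid]
        return assigned
```

This greedy scheme is exactly what loses vehicles in heavy occlusion, which is why more robust trackers come up in the "What I'd Do Differently" section.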
Stage 2: Optical Flow Analysis
With tracked bounding boxes, I compute dense optical flow (Farneback method) within each vehicle's bounding region across a 15-frame sliding window.
The key insight: a normal driving vehicle has smooth, consistent flow vectors. A crash creates sudden, violent, multi-directional flow changes — a signature that's hard to fake and easy to detect.
I extract three features per vehicle per window:
- Flow magnitude variance — how chaotic is the motion?
- Flow direction entropy — are vectors pointing in many directions simultaneously?
- Velocity delta — how fast did the speed change between frames?
A sudden spike in all three simultaneously is the strongest signal. This catches the moment of impact even before the vehicles stop moving.
```python
import cv2
import numpy as np
from scipy.stats import entropy

def extract_flow_features(frames, bbox):
    """Dense Farneback flow features within one vehicle's bounding
    region across a window of frames."""
    x1, y1, x2, y2 = bbox
    features = []
    for i in range(len(frames) - 1):
        prev = cv2.cvtColor(frames[i][y1:y2, x1:x2], cv2.COLOR_BGR2GRAY)
        curr = cv2.cvtColor(frames[i + 1][y1:y2, x1:x2], cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        features.append({
            "mag_variance": np.var(magnitude),
            "angle_entropy": entropy(np.histogram(angle, bins=36)[0] + 1e-8),
            # velocity delta is derived downstream from the frame-to-frame
            # change in mean magnitude
            "mean_magnitude": np.mean(magnitude),
        })
    return features
```

Optical flow alone generates too many false positives — trucks merging, emergency stops, wind on trailers. That's where stage 3 comes in.
Stage 3: CNN Crash Classifier
The final stage takes a 15-frame clip centered on the suspicious event and runs it through a custom 3D CNN. The architecture is a lightweight C3D variant (3D convolutions for spatiotemporal feature extraction) trained specifically on crash vs. non-crash clips.
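To make "lightweight C3D variant" concrete, here is a minimal 3D-convolutional classifier over a (batch, channels, frames, height, width) clip. This is an illustrative stand-in, not the actual ACRAS architecture; layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    """Minimal C3D-style classifier: stacked 3D convolutions extract
    spatiotemporal features, then a linear head scores crash probability."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # pool space first, preserve time
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),   # then pool time and space together
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, clip):           # clip: (B, 3, 15, H, W)
        x = self.features(clip).flatten(1)
        return torch.sigmoid(self.classifier(x))   # crash probability
```

The 3D kernels are what distinguish this from a per-frame classifier: each convolution sees motion across adjacent frames, not just appearance within one.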
Training data was the hard part. I collected ~2,400 verified crash clips from public DOT archives and court records (US traffic crashes are public record in most states), plus ~6,000 hard negative examples: emergency stops, merges, trucks with flapping tarps, rain artifacts.
The model outputs a probability and a confidence interval. If the optical-flow features flag an anomaly and the CNN scores above threshold, an incident is confirmed.
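That two-stage agreement rule can be sketched as a simple gate. The threshold values here are illustrative, not the production settings:

```python
def confirm_incident(flow_flagged, cnn_prob, cnn_ci_width,
                     prob_threshold=0.85, max_ci_width=0.2):
    """Confirm an incident only when both stages agree: the optical-flow
    features flagged an anomaly AND the CNN is confidently above threshold
    (a wide confidence interval means the CNN is unsure, so we hold off)."""
    if not flow_flagged:
        return False
    return cnn_prob >= prob_threshold and cnn_ci_width <= max_ci_width
```

Requiring both signals is what drives the false positive rate down: either stage alone fires on merges and emergency stops, but rarely do both fire on the same clip.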
Final accuracy: 94.2% precision, 91.7% recall on the held-out test set. False positive rate: ~0.3 per camera per hour — acceptable for a system that's still 10x faster than human detection.
The Infrastructure Problem
Running three models on 50+ simultaneous streams is not a small compute problem.
The architecture that worked:
- Frame ingestion: Each camera stream runs as an independent async worker, pulling frames via RTSP. Workers are stateless and horizontally scalable.
- GPU batching: Frames from all streams are batched before hitting YOLOv8 — single-model inference on a batch of 64 frames is far cheaper than 64 sequential inferences.
- Tiered processing: Stage 1 runs on every frame. Stage 2 only runs when stage 1 detects vehicles in motion. Stage 3 only runs when stage 2 flags an anomaly. ~95% of frames never leave stage 1.
- PostGIS for geolocation: Each camera has a known lat/lng. When a crash is detected, the incident is inserted into PostGIS with a geographic point. This enables proximity queries — "find all active incidents within 5 miles of this address" — which is how emergency dispatchers use it.
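The proximity lookup can be sketched as a parametrized PostGIS query. The table and column names here are assumptions for illustration, not the actual schema:

```python
def nearby_incidents_query(lat, lng, radius_miles):
    """Build a PostGIS proximity query for active incidents.
    Casting to geography makes ST_DWithin measure distance in meters,
    so the radius is converted from miles first."""
    meters = radius_miles * 1609.34
    sql = """
        SELECT id, camera_id, detected_at, severity
        FROM incidents
        WHERE status = 'active'
          AND ST_DWithin(
                location::geography,
                ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                %s
              )
        ORDER BY detected_at DESC;
    """
    # PostGIS points are (longitude, latitude) order
    return sql, (lng, lat, meters)
```

A geography-typed `ST_DWithin` with a spatial index makes "all incidents within 5 miles" an index scan rather than a full-table distance computation.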
The FastAPI backend handles both the real-time pipeline and the REST API for the Next.js dashboard. The dashboard shows a live map, camera feeds, active incidents, and generated reports.
Generating the Report
The final step — turning a detection into a human-readable incident report — was simpler than expected. I pull structured fields from the pipeline output (time, camera ID, location, vehicles involved, estimated severity from flow magnitude) and pass them to a templated report generator.
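A templated generator of this kind needs nothing more than string substitution. The field names and layout below are illustrative, not the actual report format:

```python
from string import Template

REPORT_TEMPLATE = Template(
    "INCIDENT REPORT\n"
    "Time:      $time\n"
    "Camera:    $camera_id\n"
    "Location:  $lat, $lng\n"
    "Vehicles:  $vehicle_count involved\n"
    "Severity:  $severity (estimated from flow magnitude)\n"
)

def generate_report(detection):
    """Fill the template from the pipeline's structured output dict.
    Raises KeyError if a required field is missing, which is the
    desired behavior for a safety-critical report."""
    return REPORT_TEMPLATE.substitute(detection)
```

Because every field comes from the pipeline rather than a generative model, the report contains nothing that wasn't directly measured.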
The report is available to emergency coordinators in under 30 seconds from the moment of impact. Compare that to the national average of 4–8 minutes via 911.
What I'd Do Differently
Transformer-based tracking. My IoU tracker loses vehicles in heavy occlusion. ByteTrack or DeepSORT would hold tracks through multi-vehicle pileups better.
More training data for nighttime. The pipeline degrades in low-light conditions. DOT cameras switch to infrared at night but the model wasn't trained on enough IR footage.
Edge deployment. Running all inference centrally is a bandwidth problem — streaming raw video from 50 cameras is expensive. The right architecture pushes stage 1 (YOLO) to edge devices at the cameras and only sends flagged clips to the central server.
Why This Matters
The US has ~4 million miles of highway. There are tens of thousands of DOT cameras already deployed. The compute to run this pipeline on all of them costs less than the salaries of a handful of human traffic operators. The math is not complicated.
Every minute of reduced emergency response time saves lives. ACRAS shows that the models to do this exist, the inference speed is achievable in real time, and the pipeline is deployable today — not in five years.