Lazizbek

Building FleetSight: Making 4.4 Million Carrier Records Actually Useful

·6 min read

The Short Version:

I took 4.4 million rows of raw federal trucking data and built a search tool that tells you in under 200ms whether a carrier is safe to hire — including whether they've rebranded to hide a bad safety record.

The US freight industry moves $900 billion in cargo per year. Every truck on the road is registered with FMCSA — the Federal Motor Carrier Safety Administration. Every carrier's safety record, inspection history, and crash data is public. And yet, freight brokers and shippers routinely hire carriers they know nothing about, because accessing that data is genuinely painful.

The FMCSA website is slow, returns limited data per query, and doesn't let you do any real analysis. Third-party tools exist but cost thousands per month. The data itself — all 4.4 million records — is available as a free bulk download. So I built FleetSight.

The Data Problem

The FMCSA bulk dataset is a CSV dump that lands at around 4GB uncompressed. It's structured but dirty — inconsistent address formats, missing values, deprecated carrier status codes that aren't documented anywhere in current FMCSA docs, and duplicate entries for carriers that changed their DOT numbers.

The first challenge was loading this into PostgreSQL in a way that makes search fast.

The naive approach — full-text search on carrier name — is too slow and misses too much. Carrier names are abbreviated inconsistently: "J.B. Hunt", "JB Hunt Transport", "JB HUNT TRANSPORT INC" are all the same company. Fuzzy matching on 4.4 million rows is not a real-time operation.

My solution was a two-layer index:

  1. Trigram index (pg_trgm) on the carrier name column. Trigrams break strings into overlapping 3-character chunks and index them — "JB HUNT" becomes the set ["JB ", "B H", "HU", "HUN", "UNT", "NT "]. Similarity search using the % operator runs in ~150ms even on 4.4M rows.

  2. Exact-match indexes on DOT number, MC number, and EIN. These are the most common lookup keys for compliance teams and are O(log n).

The result is sub-200ms search for both fuzzy name queries and exact ID lookups.

-- Trigram index for fuzzy name search
CREATE INDEX idx_carriers_name_trgm
ON carriers USING gin(legal_name gin_trgm_ops);
 
-- Query pattern
SELECT * FROM carriers
WHERE legal_name % 'JB Hunt'
ORDER BY similarity(legal_name, 'JB Hunt') DESC
LIMIT 20;

The 7-Tab Detail View

Once you find a carrier, the detail page has seven tabs. This wasn't arbitrary — it reflects how compliance teams actually evaluate carriers.

1. Overview — DOT number, MC number, operating status, authority type, contact info, years in operation. The basics.

2. Safety Ratings — The FMCSA's formal safety rating (Satisfactory/Conditional/Unsatisfactory) plus the date it was assigned. Many carriers have never been rated — this is itself a signal.

3. BASIC Scores — This is the most technically interesting tab. BASIC (Behavior Analysis and Safety Improvement Categories) is FMCSA's scoring system across 7 categories: Unsafe Driving, Hours-of-Service Compliance, Driver Fitness, Controlled Substances, Vehicle Maintenance, Hazardous Materials, and Crash Indicator. Each score is a percentile — 90th percentile means worse than 90% of comparable carriers.

I visualize each BASIC score as a bar with a threshold indicator. Scores above the intervention threshold (70th percentile for most categories) are highlighted in red. This lets you see at a glance whether a carrier is borderline or clearly problematic.

4. Inspections — Historical inspection records: date, state, level (1-6), violations cited, out-of-service orders. I show trend lines — is this carrier improving or getting worse?

5. Crashes — Crash records going back 24 months as required by FMCSA reporting. Total crashes, fatal crashes, injury crashes, tow-away crashes. Crash rate per million miles is the normalized metric that matters.

6. Fleet — Number of power units, number of drivers, cargo type. Useful for understanding scale and exposure.

7. Authority History — This tab is the most important one and the one no other tool shows well.

Chameleon Carrier Detection

This is the feature I'm most proud of technically.

A "chameleon carrier" is one that shuts down after accumulating a bad safety record and re-registers under a new name (and sometimes new ownership) to escape FMCSA scrutiny. This is not rare — it's a documented fraud pattern that has caused fatal crashes.

The signal: a new carrier (registered < 2 years ago) with the same phone number, address, EIN, or principals as an older carrier with a poor safety record.

I built a graph-based detection algorithm. Every carrier is a node. Edges connect carriers that share:

  • Phone number (after normalization)
  • Physical address (fuzzy-matched and geocoded)
  • EIN
  • Named principals (owner/officer names)
async function detectChameleonCarriers(carrierId: string) {
  const carrier = await db.carrier.findUnique({ where: { id: carrierId } });
 
  // Find all carriers sharing identifying attributes
  const related = await db.$queryRaw`
    SELECT c2.dot_number, c2.legal_name, c2.created_date,
           c2.safety_rating, c2.unsafe_driving_score,
           similarity(c1.phone, c2.phone) as phone_sim
    FROM carriers c1
    JOIN carriers c2 ON (
      c1.phone = c2.phone OR
      c1.ein = c2.ein OR
      c1.physical_address_hash = c2.physical_address_hash
    )
    WHERE c1.dot_number = :carrierId
      AND c2.dot_number != :carrierId
      AND c2.created_date > c1.created_date
  `;
 
  return related.filter(r =>
    r.unsafe_driving_score > 70 || r.safety_rating === 'Unsatisfactory'
  );
}

When a potential chameleon match is found, the carrier detail page shows a warning banner with the linked carrier's safety record. This has caught real cases — carriers that had been cited for serious violations, shut down, and re-opened under a new trade name at the same address within months.

Performance at Scale

The Prisma + PostgreSQL stack handles the query load well, but a few things required tuning:

Connection pooling. At peak, the BASIC scores tab triggers 7 parallel queries (one per category). Without pgBouncer-style connection pooling, this exhausts the connection limit fast. I added Prisma Accelerate for connection pooling in production.

Caching carrier detail pages. Carrier records don't change daily — FMCSA updates the bulk dataset weekly. I cache full carrier detail responses in Redis with a 24-hour TTL, cutting database load by ~80% for popular carriers.

Materialized views for analytics. Aggregate queries ("what percentage of carriers in Texas have unsatisfactory safety ratings?") are precomputed as materialized views and refreshed nightly after the FMCSA data sync.

The Impact

The freight industry's carrier vetting process is broken in a specific, fixable way: the data exists, it's public, but it's not accessible. FleetSight makes it accessible.

The chameleon carrier detection alone addresses a safety gap that contributes to real fatalities. When a broker unknowingly hires a rebranded unsafe carrier because they couldn't see the connection, trucks that shouldn't be on the road end up carrying loads across the country.

The right version of this becomes mandatory infrastructure for the freight industry — the way credit scores became standard for lending. The data exists. The models work. The hard part is distribution and adoption.