Finding statistical signal in public court data.
A data pipeline that turns a messy public dataset into a defensible position.
+32 percentage-point spread on the strongest signal we isolated.
Sector
Legal research, quantitative signal
Tech
Python, BeautifulSoup, Selenium, pdfplumber, spaCy, Claude, Groq, OpenAI, pandas
The challenge
A research fund wanted to know whether the procedural history of a case could predict its outcome. They needed statistical confidence, not intuition. The source data was public but a mess. Documents were PDFs with inconsistent formatting. Labels in the source dataset were often wrong. Sample sizes were small enough that one mis-extracted record could shift a headline number.
Our approach
- Scraper pulling the full public docket, linking every associated PDF, and storing structured events in a database.
- PDF extraction pipeline using pdfplumber, with LLM-assisted parsing as a fallback on noisy documents (sketched after this list).
- NLP pipeline detecting specific legal patterns in the filings (spaCy plus Claude for edge cases; also sketched below).
- Statistical engine computing conditional outcome rates across each procedural feature, with explicit sample-size flags (see the third sketch below).
- Per-case reports: signal profile, peer-group comparison, projected outcome window, written conviction narrative.
- Data-quality audits on every run, surfacing missing data and extraction confidence.
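To make the fallback pattern concrete, here is a minimal sketch. pdfplumber's API is real; the `llm_parse` wrapper, the character-count threshold, and the method tag are illustrative assumptions standing in for the production code.

```python
import pdfplumber

MIN_CHARS_PER_PAGE = 200  # assumed heuristic: below this, treat the document as noisy

def llm_parse(pdf_path: str) -> str:
    """Hypothetical LLM fallback; in this project it was backed by the
    Claude/Groq/OpenAI clients listed in the tech stack."""
    raise NotImplementedError("wire up a model client here")

def extract_text(pdf_path: str) -> tuple[str, str]:
    """Extract text with pdfplumber; hand noisy documents to the LLM fallback.

    Returns (text, method) so the audit layer can record how every
    document was parsed.
    """
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    text = "\n".join(pages)

    # If pdfplumber recovered too little text, the layout is probably
    # scanned or badly encoded; fall back to LLM-assisted parsing.
    if len(text) < MIN_CHARS_PER_PAGE * max(len(pages), 1):
        return llm_parse(pdf_path), "llm_fallback"
    return text, "pdfplumber"
```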
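The pattern-detection layer used spaCy's rule-based matching for the common cases, with an LLM pass for documents the rules miss. The two patterns below are illustrative only; the real pattern set targeted the specific procedural events under study.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Example patterns; stand-ins for the production pattern set.
matcher.add("MOTION_TO_DISMISS", [[{"LOWER": "motion"}, {"LOWER": "to"}, {"LOWER": "dismiss"}]])
matcher.add("SUMMARY_JUDGMENT", [[{"LOWER": "summary"}, {"LOWER": "judgment"}]])

def detect_patterns(text: str) -> set[str]:
    """Return the set of procedural patterns found in one filing.
    Filings where the rules find nothing would go to the LLM edge-case pass."""
    doc = nlp(text)
    return {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}
```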
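And a sketch of the statistical engine's core computation, assuming a pandas DataFrame with boolean procedural-feature columns and a binary `outcome` column. The column names and the `MIN_N` threshold are assumptions, not the production schema.

```python
import pandas as pd

MIN_N = 30  # assumed threshold: cohorts smaller than this get flagged

def conditional_rates(df: pd.DataFrame, features: list[str],
                      outcome: str = "outcome") -> pd.DataFrame:
    """Conditional outcome rate per procedural feature, with explicit
    sample-size flags so a thin cohort can't hide inside a headline number."""
    rows = []
    for feat in features:
        cohort = df[df[feat]]
        rest = df[~df[feat]]
        rate_with, rate_without = cohort[outcome].mean(), rest[outcome].mean()
        rows.append({
            "feature": feat,
            "rate_with": rate_with,
            "rate_without": rate_without,
            "spread_pp": 100 * (rate_with - rate_without),
            "n_with": len(cohort),
            "n_without": len(rest),
            "low_sample": min(len(cohort), len(rest)) < MIN_N,
        })
    return pd.DataFrame(rows).sort_values("spread_pp", ascending=False)
```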
The outcome
A written report, per case, that a partner can read in ten minutes and feel confident presenting. Executive summary, signal table, peer-group analysis, and an explicit section on where the numbers could be wrong. The largest signal we isolated showed a 32-percentage-point spread between the top-signal cohort and the bottom-signal cohort.
Where the real work was
Treating data quality as a first-class deliverable, not a footnote. A meaningful share of the merits documents in the source data were mislabelled at source. Getting to a defensible answer meant auditing every record before it entered the statistical layer.
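The shape of that audit, as a sketch: every record gets explicit quality flags before it can enter the statistical layer, and a flagged record is excluded rather than silently corrected. The column names and checks here are illustrative assumptions, not the production schema.

```python
import pandas as pd

def audit_records(df: pd.DataFrame) -> pd.DataFrame:
    """Per-record audit run before the statistical layer sees anything."""
    report = pd.DataFrame(index=df.index)
    report["missing_outcome"] = df["outcome"].isna()
    report["missing_filing_date"] = df["filing_date"].isna()
    # Extraction confidence recorded by the PDF pipeline (see sketch above).
    report["low_extraction_confidence"] = df["extraction_method"].eq("llm_fallback")
    # Label sanity check: the stated outcome must agree with the docket's
    # terminal event; mismatches were the records mislabelled at source.
    report["label_mismatch"] = df["outcome"] != df["docket_terminal_outcome"]
    report["excluded"] = report.any(axis=1)
    return report
```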