Finding statistical signal in public court data.
A data pipeline that turns a messy public dataset into a defensible position.
+32 percentage-point spread on the strongest signal we isolated.
Sector
Legal research, quantitative signal
Tech
Python, BeautifulSoup, Selenium, pdfplumber, spaCy, Claude, Groq, OpenAI, pandas
The challenge
A research fund wanted to know whether the procedural history of a case could predict its outcome. They needed statistical confidence, not intuition. The source data was public but a mess. Documents were PDFs with inconsistent formatting. Labels in the source dataset were often wrong. Sample sizes were small enough that one mis-extracted record could shift a headline number.
Our approach
- Scraper pulling the full public docket, linking every associated PDF, and storing structured events in a database.
- PDF extraction pipeline using pdfplumber, with LLM-assisted parsing as a fallback on noisy documents (sketched after this list).
- NLP pipeline detecting specific legal patterns in the filings (spaCy plus Claude for edge cases; also sketched below).
- Statistical engine computing conditional outcome rates across each procedural feature, with explicit sample-size flags (see the third sketch below).
- Per-case reports: signal profile, peer-group comparison, projected outcome window, written conviction narrative.
- Data-quality audits on every run, surfacing missing data and extraction confidence.
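To make the fallback pattern concrete, here is a minimal sketch. pdfplumber's API is real; the `llm_parse` wrapper, the character-count threshold, and the method tag are illustrative assumptions standing in for the production code.

```python
import pdfplumber

MIN_CHARS_PER_PAGE = 200  # assumed heuristic: below this, treat the document as noisy

def llm_parse(pdf_path: str) -> str:
    """Hypothetical LLM fallback; in this project it was backed by the
    Claude/Groq/OpenAI clients listed in the tech stack."""
    raise NotImplementedError("wire up a model client here")

def extract_text(pdf_path: str) -> tuple[str, str]:
    """Extract text with pdfplumber; hand noisy documents to the LLM fallback.

    Returns (text, method) so the audit layer can record how every
    document was parsed.
    """
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    text = "\n".join(pages)

    # If pdfplumber recovered too little text, the layout is probably
    # scanned or badly encoded; fall back to LLM-assisted parsing.
    if len(text) < MIN_CHARS_PER_PAGE * max(len(pages), 1):
        return llm_parse(pdf_path), "llm_fallback"
    return text, "pdfplumber"
```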
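The pattern-detection layer used spaCy's rule-based matching for the common cases, with an LLM pass for documents the rules miss. The two patterns below are illustrative only; the real pattern set targeted the specific procedural events under study.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Example patterns; stand-ins for the production pattern set.
matcher.add("MOTION_TO_DISMISS", [[{"LOWER": "motion"}, {"LOWER": "to"}, {"LOWER": "dismiss"}]])
matcher.add("SUMMARY_JUDGMENT", [[{"LOWER": "summary"}, {"LOWER": "judgment"}]])

def detect_patterns(text: str) -> set[str]:
    """Return the set of procedural patterns found in one filing.
    Filings where the rules find nothing would go to the LLM edge-case pass."""
    doc = nlp(text)
    return {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}
```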
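And a sketch of the statistical engine's core computation, assuming a pandas DataFrame with boolean procedural-feature columns and a binary `outcome` column. The column names and the `MIN_N` threshold are assumptions, not the production schema.

```python
import pandas as pd

MIN_N = 30  # assumed threshold: cohorts smaller than this get flagged

def conditional_rates(df: pd.DataFrame, features: list[str],
                      outcome: str = "outcome") -> pd.DataFrame:
    """Conditional outcome rate per procedural feature, with explicit
    sample-size flags so a thin cohort can't hide inside a headline number."""
    rows = []
    for feat in features:
        cohort = df[df[feat]]
        rest = df[~df[feat]]
        rate_with, rate_without = cohort[outcome].mean(), rest[outcome].mean()
        rows.append({
            "feature": feat,
            "rate_with": rate_with,
            "rate_without": rate_without,
            "spread_pp": 100 * (rate_with - rate_without),
            "n_with": len(cohort),
            "n_without": len(rest),
            "low_sample": min(len(cohort), len(rest)) < MIN_N,
        })
    return pd.DataFrame(rows).sort_values("spread_pp", ascending=False)
```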
The outcome
A written report, per case, that a partner can read in ten minutes and feel confident presenting. Executive summary, signal table, peer-group analysis, and an explicit section on where the numbers could be wrong. The largest signal we isolated showed a 32-percentage-point spread between the top-signal cohort and the bottom-signal cohort.
Where the real work was
Treating data quality as a first-class deliverable, not a footnote. A meaningful share of the merits documents in the source data were mislabelled at source. Getting to a defensible answer meant auditing every record before it entered the statistical layer.
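The shape of that audit, as a sketch: every record gets explicit quality flags before it can enter the statistical layer, and a flagged record is excluded rather than silently corrected. The column names and checks here are illustrative assumptions, not the production schema.

```python
import pandas as pd

def audit_records(df: pd.DataFrame) -> pd.DataFrame:
    """Per-record audit run before the statistical layer sees anything."""
    report = pd.DataFrame(index=df.index)
    report["missing_outcome"] = df["outcome"].isna()
    report["missing_filing_date"] = df["filing_date"].isna()
    # Extraction confidence recorded by the PDF pipeline (see sketch above).
    report["low_extraction_confidence"] = df["extraction_method"].eq("llm_fallback")
    # Label sanity check: the stated outcome must agree with the docket's
    # terminal event; mismatches were the records mislabelled at source.
    report["label_mismatch"] = df["outcome"] != df["docket_terminal_outcome"]
    report["excluded"] = report.any(axis=1)
    return report
```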