Legal Research

Delivered, in recurring use

Finding statistical signal in public court data.

A data pipeline that turns a messy public dataset into a defensible position.

+32 percentage-point spread on the strongest signal we isolated.

Sector

Legal research, quantitative signal

Tech

Python, BeautifulSoup, Selenium, pdfplumber, spaCy, Claude, Groq, OpenAI, pandas

The challenge

A research fund wanted to know whether the procedural history of a case could predict its outcome. They needed statistical confidence, not intuition. The source data was public but a mess. Documents were PDFs with inconsistent formatting. Labels in the source dataset were often wrong. Sample sizes were small enough that one mis-extracted record could shift a headline number.
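To make the small-sample point concrete, here is a minimal arithmetic sketch. The cohort sizes are illustrative assumptions, not figures from the project, and `rate_shift_from_one_record` is a hypothetical helper name.

```python
def rate_shift_from_one_record(n: int) -> float:
    """Percentage-point change in an outcome rate when one of n records flips."""
    if n <= 0:
        raise ValueError("n must be positive")
    # One record is 1/n of the cohort, so flipping its outcome moves the rate by 100/n points.
    return 100.0 / n


# Hypothetical cohort sizes, chosen only to show the scale of the effect.
for n in (10, 25, 100, 1000):
    shift = rate_shift_from_one_record(n)
    print(f"n={n:>4}: one flipped record moves the rate by {shift:.1f} pp")
```

In a cohort of 25 cases, a single mis-extracted outcome moves the rate by 4 percentage points, which is why extraction errors matter more here than they would in a large dataset.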

Our approach

  • Scraper pulling the full public docket, linking every associated PDF, and storing structured events in a database.
  • PDF extraction pipeline using pdfplumber, with LLM-assisted parsing as fallback on noisy documents.
  • NLP pipeline for detecting specific legal patterns in the filings (spaCy plus Claude for edge cases).
  • Statistical engine computing conditional outcome rates across each procedural feature, with explicit sample-size flags.
  • Per-case reports: signal profile, peer-group comparison, projected outcome window, written conviction narrative.
  • Data-quality audits on every run, surfacing missing data and extraction confidence.
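The statistical step in the list above can be sketched roughly as follows. This is a simplified illustration, not the production engine: the field names (`had_remand`, `won`), the `MIN_SAMPLE` threshold, and the function name are all assumptions made for the example, and it uses only the standard library rather than pandas.

```python
from collections import defaultdict

MIN_SAMPLE = 30  # assumed threshold for the small-sample flag, not from the project


def conditional_rates(cases, feature, outcome="won", min_sample=MIN_SAMPLE):
    """Outcome rate per value of one procedural feature, with a sample-size flag."""
    counts = defaultdict(lambda: [0, 0])  # feature value -> [n, positive outcomes]
    for case in cases:
        value = case.get(feature)
        if value is None:
            continue  # skipped here; a fuller version would count and report missing data
        counts[value][0] += 1
        counts[value][1] += 1 if case[outcome] else 0
    return {
        value: {"n": n, "rate": positives / n, "small_sample": n < min_sample}
        for value, (n, positives) in counts.items()
    }


# Hypothetical records, for illustration only.
cases = [
    {"had_remand": True, "won": True},
    {"had_remand": True, "won": False},
    {"had_remand": False, "won": False},
    {"had_remand": None, "won": True},  # missing feature value
]
table = conditional_rates(cases, "had_remand", min_sample=3)
for value, row in table.items():
    print(value, row)
```

Carrying `n` and the flag alongside every rate means a reader never sees a percentage without knowing how many cases sit behind it.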

The outcome

A written report, per case, that a partner can read in ten minutes and feel confident presenting. Executive summary, signal table, peer-group analysis, and an explicit section on where the numbers could be wrong. The largest signal we isolated showed a 32-percentage-point spread between the top-signal cohort and the bottom-signal cohort.
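A percentage-point spread between two cohorts, and a rough sense of how uncertain each cohort's rate is, can be computed as in the sketch below. The counts are made up and do not reproduce the 32-point figure. The Wilson score interval is our choice for the illustration, since the write-up does not say which interval method, if any, the reports use.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def spread_pp(top, bottom) -> float:
    """Difference in outcome rate between two cohorts, in percentage points."""
    (s_top, n_top), (s_bot, n_bot) = top, bottom
    return 100.0 * (s_top / n_top - s_bot / n_bot)


# Hypothetical cohorts as (favourable outcomes, cohort size).
top_cohort, bottom_cohort = (30, 50), (15, 50)
print(f"spread: {spread_pp(top_cohort, bottom_cohort):.1f} pp")
for name, (s, n) in (("top", top_cohort), ("bottom", bottom_cohort)):
    lo, hi = wilson_interval(s, n)
    print(f"{name}: rate={s / n:.2f}, interval=({lo:.2f}, {hi:.2f})")
```

Reporting the interval next to the spread is one way to fill the "where the numbers could be wrong" section with something a reader can check.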

Where the real work was

Treating data quality as a first-class deliverable, not a footnote. A meaningful share of the merits documents in the source data were mislabelled at source. Getting to a defensible answer meant auditing every record before it entered the statistical layer.
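One way to enforce "audit before statistics" is a gate that only passes reviewed, confidently extracted records downstream. The sketch below is an assumption about how such a gate might look, not the project's code: the `Record` fields, the 0.9 confidence threshold, and the reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    case_id: str
    source_label: str              # label as published in the source dataset
    audited_label: Optional[str]   # label after review; None if not yet audited
    extraction_confidence: float   # 0.0 to 1.0, as reported by the extraction step


def audit_gate(records, min_confidence=0.9):
    """Split records into those cleared for analysis and those held back, with reasons."""
    accepted, held_back = [], []
    for rec in records:
        reasons = []
        if rec.audited_label is None:
            reasons.append("not audited")
        if rec.extraction_confidence < min_confidence:
            reasons.append("low extraction confidence")
        if reasons:
            held_back.append((rec.case_id, reasons))
        else:
            accepted.append(rec)
    return accepted, held_back


# Hypothetical records, for illustration only.
records = [
    Record("A", "merits", "merits", 0.98),
    Record("B", "merits", "procedural", 0.95),  # mislabelled at source, corrected in audit
    Record("C", "merits", None, 0.99),
    Record("D", "merits", "merits", 0.40),
]
accepted, held_back = audit_gate(records)
relabelled = [r.case_id for r in accepted if r.audited_label != r.source_label]
print([r.case_id for r in accepted], held_back, relabelled)
```

Corrected records pass the gate with their audited label, while the count of relabelled records stays visible so the audit itself can be reported.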

Tags

Statistical analysis, NLP, Data scraping, PDF extraction, Quantitative research

Have a problem like this?

ben@atab.ai or use the contact form.
