Predictive Coding (DataPhilly)

Predictive Coding Bill Dimm DataPhilly December 16, 2014

PredictiveCodingBook.com

What is E-Discovery?
• Organization is required to produce any documents it has relevant to some matter
  – Litigation
  – Mergers

The Problem
• 1 TB = 1.8 million documents
  – 1.1 million emails
  – 460,000 scanned documents
  – 230,000 loose files
• $50 for hard drive to store
• $10 million to review
• Cheap storage → junk & duplicates
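The figures on this slide can be turned into a per-document review cost. The sketch below uses only the numbers quoted above; the variable and function names are illustrative, and the per-document cost is derived here rather than stated in the deck.

```python
# Document counts per terabyte and costs, as quoted on the slide.
DOCS_PER_TB = {
    "emails": 1_100_000,
    "scanned documents": 460_000,
    "loose files": 230_000,
}
STORAGE_COST_USD = 50          # hard drive
REVIEW_COST_USD = 10_000_000   # manual review


def review_cost_per_document(review_cost, doc_counts):
    """Average review cost per document (a derived figure, not from the deck)."""
    return review_cost / sum(doc_counts.values())


total_docs = sum(DOCS_PER_TB.values())
per_doc = review_cost_per_document(REVIEW_COST_USD, DOCS_PER_TB)
cost_ratio = REVIEW_COST_USD / STORAGE_COST_USD

print(f"{total_docs:,} documents per TB")
print(f"about ${per_doc:.2f} to review each document")
print(f"review costs {cost_ratio:,.0f} times as much as storage")
```

The three categories sum to 1.79 million, which the slide rounds to 1.8 million.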

Document Deletion
• Document Retention Policy
  – Policy of routinely deleting old documents is OK if done in good faith
• Litigation Hold
  – Documents must be preserved if litigation is anticipated

Technological Solutions
• Keyword Search
• Unsupervised Machine Learning
  – Organizing / analyzing without knowing what you are looking for
  – Clustering
• Supervised Machine Learning
  – Learning by example
  – “Predictive Coding”

Keyword Search
• Study by Blair and Maron in 1985
  – 2 lawyers with paralegals and search engine
  – Aim for at least 75% recall
  – Got 75% recall once out of 40 tasks
  – Averaged 20% recall
• Queries often fail to recognize all of the ways to say the same thing

Keyword Search Sorting
• Strategy: Very broad query + hope relevant docs tend to be at top of sort
• Term frequency-inverse document frequency
  – Sorting is based on a heuristic rule
  – Not specific to the problem at hand
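To make the sorting heuristic concrete, here is a minimal sketch of TF-IDF ranking. The slide does not give a formula, so this assumes one common variant (raw term count times log(N / document frequency)); the function name `tfidf_rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(query_terms, docs):
    """Return (ranked document indices, scores) for a simple TF-IDF sort.

    Assumed variant: score = sum over query terms of tf * log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(
            tf[t] * math.log(n_docs / df[t]) for t in query_terms if df[t] > 0
        )
        scores.append(score)
    # Python's sort is stable, so tied documents keep their original order.
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return order, scores


docs = [
    "contract breach contract damages",
    "lunch menu for friday",
    "contract renewal reminder",
    "breach of the lunch policy",
]
order, scores = tfidf_rank(["contract", "breach"], docs)
print(order)
```

In this toy collection the document about a lunch policy scores the same as the contract renewal reminder, which illustrates the slide's point: the rule rewards term statistics and knows nothing about the matter at hand.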

Keyword Search Sorting

Relevance Score Sorting

Precision & Recall
• Recall
  – % of relevant docs found
  – High recall required for defensibility
• Precision
  – % of docs predicted to be relevant that actually are
  – High precision desired to reduce cost
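The two definitions above can be computed directly from sets of document IDs. This is a small sketch; the function name and the example IDs are made up for illustration.

```python
def precision_recall(predicted_relevant, actually_relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    true_positives = len(predicted_relevant & actually_relevant)
    # Precision: share of predicted-relevant documents that really are relevant.
    precision = (
        true_positives / len(predicted_relevant) if predicted_relevant else 0.0
    )
    # Recall: share of all relevant documents that were found.
    recall = true_positives / len(actually_relevant) if actually_relevant else 0.0
    return precision, recall


predicted = {1, 2, 3, 4, 5}   # documents the system flagged as relevant
relevant = {2, 4, 6, 8}       # documents that are truly relevant
precision, recall = precision_recall(predicted, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

Here two of the five flagged documents are relevant (40% precision), and two of the four relevant documents were found (50% recall).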

Training – Word Exposure

Training – Feature Prevalence

Training – The Big Debate
• Random Sampling
• Judgmental Sampling / Active Learning
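The two sides of the debate differ in how the next training documents are chosen. The sketch below contrasts random sampling with uncertainty sampling, which is only one possible active learning strategy; the slide does not say which strategy is meant, and the function names and scores are hypothetical.

```python
import random


def random_sample(doc_ids, k, seed=0):
    """Pick k documents uniformly at random for human review."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), k)


def uncertainty_sample(scores, k):
    """Pick the k documents whose predicted relevance is closest to 0.5.

    `scores` maps document ID to a model's predicted probability of
    relevance. This is one assumed form of active learning.
    """
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))[:k]


# Hypothetical relevance probabilities from a partially trained model.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}

print(random_sample(scores, 2))
print(uncertainty_sample(scores, 2))
```

Random sampling ignores the model entirely, while uncertainty sampling sends reviewers the documents the model is least sure about, here "b" and "d".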

PredictiveCodingBook.com
