
Predictive Coding (DataPhilly)

Clustify
December 16, 2014

E-discovery and predictive coding. Presented at DataPhilly on December 16, 2014.


Transcript

  1. What is E-Discovery?
     • Organization is required to produce any documents it has relevant to some matter
       – Litigation
       – Mergers
  2. The Problem
     • 1 TB = 1.8 million documents
       – 1.1 million emails
       – 460,000 scanned documents
       – 230,000 loose files
     • $50 for hard drive to store
     • $10 million to review
     • Cheap storage → junk & duplicates
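
The figures on this slide imply a per-document review cost that the slide does not state. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the per-document cost and the cost ratio are derived values, not numbers from the talk.

```python
# Figures taken from the slide; everything marked "derived" is computed here
# and does not appear in the original deck.
emails = 1_100_000
scanned_documents = 460_000
loose_files = 230_000

# The three categories sum to 1.79 million, which the slide rounds to 1.8 million.
total_docs = emails + scanned_documents + loose_files

storage_cost_usd = 50          # hard drive to store 1 TB
review_cost_usd = 10_000_000   # review of the same 1 TB

# Derived: average review cost per document, and review cost relative to storage.
cost_per_doc_usd = review_cost_usd / total_docs
review_to_storage_ratio = review_cost_usd / storage_cost_usd

print(f"documents per TB: {total_docs:,}")
print(f"review cost per document: ${cost_per_doc_usd:.2f}")
print(f"review costs {review_to_storage_ratio:,.0f}x the storage")
```

On these numbers, review comes to a few dollars per document, which is why retaining junk and duplicates is expensive even though storing them is not.
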
  3. Document Deletion
     • Document Retention Policy
       – Policy of routinely deleting old documents is OK if done in good faith
     • Litigation Hold
       – Documents must be preserved if litigation is anticipated
  4. Technological Solutions
     • Keyword Search
     • Unsupervised Machine Learning
       – Organizing / analyzing without knowing what you are looking for
       – Clustering
     • Supervised Machine Learning
       – Learning by example
       – “Predictive Coding”
  5. Keyword Search
     • Study by Blair and Maron in 1985
       – 2 lawyers with paralegals and search engine
       – Aimed for at least 75% recall
       – Got 75% recall once out of 40 tasks
       – Averaged 20% recall
     • Queries often fail to recognize all of the ways to say the same thing
  6. Keyword Search Sorting
     • Strategy: Very broad query + hope relevant docs tend to be at top of sort
     • Term frequency-inverse document frequency
       – Sorting is based on a heuristic rule
       – Not specific to the problem at hand
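
To make the sorting heuristic concrete, here is a minimal sketch of TF-IDF ranking in plain Python. It uses one common textbook variant (raw term count times log(N / document frequency)); the deck does not say which formula any particular search engine uses, and the function name `tfidf_rank`, the sample documents, and the query are all made up for illustration.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Rank documents for a query with a simple TF-IDF score.

    Returns (order, scores): document indices sorted best first, and the
    score of each document in its original position.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if doc_freq[term]:  # skip query terms that appear in no document
                score += term_freq[term] * math.log(n_docs / doc_freq[term])
        scores.append(score)

    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical documents and query, invented for this example.
docs = [
    "merger agreement draft merger terms",
    "lunch menu for friday",
    "notes on the merger meeting",
    "quarterly agreement renewal",
]
order, scores = tfidf_rank(docs, "merger agreement")
for i in order:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The score depends only on word counts in the collection. Nothing in it reflects what counts as relevant to a specific case, which is the slide's point that the rule is not specific to the problem at hand.
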
  7. Precision & Recall
     • Recall
       – % of relevant docs found
       – High recall required for defensibility
     • Precision
       – % of docs predicted to be relevant that actually are
       – High precision desired to reduce cost
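
The two definitions above can be written out as a small calculation. This is a sketch with hypothetical document IDs; the helper name `precision_recall` and the example sets are assumptions for illustration, not part of the talk.

```python
def precision_recall(predicted_relevant, actually_relevant):
    """Compute (precision, recall) from two collections of document IDs."""
    predicted = set(predicted_relevant)
    relevant = set(actually_relevant)
    true_positives = len(predicted & relevant)

    # Precision: share of predicted-relevant docs that really are relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of all relevant docs that were found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 8 relevant documents exist; 10 are predicted
# relevant, and 6 of those predictions are correct.
actually_relevant = {1, 2, 3, 4, 5, 6, 7, 8}
predicted_relevant = {1, 2, 3, 4, 5, 6, 20, 21, 22, 23}

precision, recall = precision_recall(predicted_relevant, actually_relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

In this made-up case, 4 of the 10 documents sent to review are wasted effort (lower precision raises cost), while 2 of the 8 relevant documents are never produced (lower recall weakens defensibility).
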