Predictive Coding (DataPhilly)

Clustify
December 16, 2014

E-discovery and predictive coding. Presented at DataPhilly on December 16, 2014.

Transcript

  1. What is E-Discovery?
     • Organization is required to produce any documents it has relevant to some matter
       – Litigation
       – Mergers
  2. The Problem
     • 1 TB = 1.8 million documents
       – 1.1 million emails
       – 460,000 scanned documents
       – 230,000 loose files
     • $50 for hard drive to store
     • $10 million to review
     • Cheap storage → junk & duplicates
  3. Document Deletion
     • Document Retention Policy
       – Policy of routinely deleting old documents is OK if done in good faith
     • Litigation Hold
       – Documents must be preserved if litigation is anticipated
  4. Technological Solutions
     • Keyword Search
     • Unsupervised Machine Learning
       – Organizing / analyzing without knowing what you are looking for
       – Clustering
     • Supervised Machine Learning
       – Learning by example
       – “Predictive Coding”
  5. Keyword Search
     • Study by Blair and Maron in 1985
       – 2 lawyers with paralegals and search engine
       – Aim for at least 75% recall
       – Got 75% recall once out of 40 tasks
       – Averaged 20% recall
     • Queries often fail to recognize all of the ways to say the same thing
  6. Keyword Search Sorting
     • Strategy: Very broad query + hope relevant docs tend to be at top of sort
     • Term frequency-inverse document frequency
       – Sorting is based on a heuristic rule
       – Not specific to the problem at hand
  7. Precision & Recall
     • Recall
       – % of relevant docs found
       – High recall required for defensibility
     • Precision
       – % of docs predicted to be relevant that actually are
       – High precision desired to reduce cost
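
The two measures defined on the Precision & Recall slide can be illustrated with a small calculation. The following is a minimal sketch, not code from the talk: the function name `precision_recall`, the integer document IDs, and the toy data are all assumptions made for illustration, and returning 0.0 for an empty set is an arbitrary convention chosen here.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for two collections of document IDs.

    predicted: IDs the review process flagged as relevant.
    relevant:  IDs that are actually relevant (the ground truth).
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision: share of docs predicted to be relevant that actually are.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of all relevant docs that were found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: 4 docs flagged, 5 docs truly relevant, 2 in common.
predicted = {1, 2, 3, 4}
relevant = {2, 3, 5, 6, 7}
precision, recall = precision_recall(predicted, relevant)
print(f"precision={precision:.0%} recall={recall:.0%}")
# prints: precision=50% recall=40%
```

In this toy case half of the flagged documents are wasted review effort (precision 50%), and more than half of the relevant documents are missed (recall 40%), which is the trade-off between cost and defensibility that the slide describes.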