EY Aleph: Deep Learning applied to jurimetrics practice

EY ALEPH: deep learning applied to jurimetrics
PAPIs | June 2018

relevance

Total lawsuits in the balance sheets (R$ billions) [1]
Tax: 283.4
Civil: 76.9
Labor: 39.1

Tax litigation in relation to the market value: 32% [1]

Top litigation taxes (R$ billions) [1]
ICMS: 90.4
Social Contribution on Net Income: 65.2
Income Tax: 22.8
PIS/Cofins: 15.57
CIDE: 11.72

[1] Source: “O contencioso tributário sob a perspectiva corporativa”, Ana Teresa L R Lopes. All data refer to 2014 and cover the 30 top companies in Brazil.

real jurimetrics for labor law
legal provisions
deal & defense models
deal pricing
operation optimization
law firms performance

regional labor tribunal
204 courts in region 2 (São Paulo)

universal data
physical lawsuits: more than 7 million documents (PDF) available
electronic lawsuits: all lawsuits from 2015, more structured (HTML)

our hacking skills
automatic scraping (not mimicking a human)
captcha solving

sometimes it is not easy, even for pros, but we can do little tricks…

80% data prep, 20% cv + cnns

putting all together
public data
collections, normalization, attribute extractions, audit, visualization
scraping, rpa, deep learning, computer vision, document conversions, unrtf, poppler, regex, machine learning, human check, support web app, web app, viztools

challenge
large-scale processing of files
constant evolution
accessibility
speed

rate
our setup
stack defined by the AI programming language & Azure public cloud
monthly bill < 1K USD

job management, back-end app, rest api, front-end app, storage, queue, no-sql, sql, git, ci, table

cosmosdb
no friction when the data structure changes, but not a good fit for the BI teams' dataviz tools or for the application stack

mysql
perfect for dataviz tools and for supporting the app
summary of the no-sql database, read-only scenario

job management

ops redis as queue management

blob: stores objects, with the option of local redundancy (3 copies) or global redundancy (6 copies)
queue: has local redundancy (3 copies of the message); messages expire in 7 days
table: key-value storage, "No-sql like"; does not allow map-reduce operations; filters only by key (recommended)
raw documents, cleaned documents, ml models, job management, orchestration, configurations

continuous deployment

slack everywhere

ai solution in a box
[architecture diagram: cosmos, no-sql, app insights, sql, aleph admin, ruby, functions, queue, blob, tables, jenkins, tfs, celery, users, api management, redis cache, cloud for b2b customers, ey aleph, ruby + ember, mechanical turk, staff]

check ai black boxes but carefully

our open boxes

our challenge: real unstructured data…
In 20/02/2017 was declared … -> 20/02/2017
In 20 of December of 2017 … -> 20/12/2017
In the second day of January of two thousand and seventeen … -> 02/01/2017
In eleventh day of March of 2016 … -> 01/03/2016
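
The date examples above suggest a normalization step that maps free-text dates to DD/MM/YYYY. The following is a minimal sketch of that idea, not the deck's actual implementation. It assumes English month names, as in the translated slide (the real documents are in Portuguese), and it covers only the numeric and "20 of December of 2017" patterns. Fully spelled-out dates are not handled. The names `normalize_dates`, `NUMERIC`, and `WORDY` are hypothetical.

```python
import re
from datetime import date

# Assumption: English month names, matching the translated slide examples.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# 20/02/2017
NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
# 20 of December of 2017
WORDY = re.compile(r"\b(\d{1,2}) of ([A-Za-z]+) of (\d{4})\b")


def normalize_dates(text):
    """Return the dates found in text, in reading order, as DD/MM/YYYY."""
    found = []
    for m in NUMERIC.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        found.append((m.start(), day, month, year))
    for m in WORDY.finditer(text):
        month = MONTHS.get(m.group(2).lower())
        if month is not None:
            found.append((m.start(), int(m.group(1)), month, int(m.group(3))))

    normalized = []
    for _, day, month, year in sorted(found):
        try:
            normalized.append(date(year, month, day).strftime("%d/%m/%Y"))
        except ValueError:
            continue  # skip impossible dates such as 31/02/2017
    return normalized


print(normalize_dates("In 20/02/2017 was declared"))
print(normalize_dates("In 20 of December of 2017"))
```

Validating through `datetime.date` rejects impossible dates early, which matters when the input comes from scanned or converted documents.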

…and it gets worse
…. Foundation Extra Hours Worker claims that the hours after work were not ….
D E C I S I O N Of the additional of unhealthiness. The author worked for the claimed ones …
II – FOUNDATION - Rescission sums The author postulates the payment of the amounts resulting from the unmotivated waiver …
J u d g e m e n t … moral damages. The requester claimed that during his work at …
REQUESTS CONCLUSION
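
Headings such as "D E C I S I O N" and "J u d g e m e n t" in the excerpts above are typed with a space between every letter, which breaks any tokenizer that splits on whitespace. The sketch below shows one possible cleanup using a regular expression. It is an illustration only, not the deck's method. The function name `collapse_spaced_words` and the minimum of four letters are assumptions, chosen so that ordinary single-letter words are left alone.

```python
import re

# Any Unicode letter (so accented Portuguese letters are covered too).
LETTER = r"[^\W\d_]"
# Four or more single letters, each separated by exactly one space.
SPACED = re.compile(rf"\b(?:{LETTER} ){{3,}}{LETTER}\b")


def collapse_spaced_words(text):
    """Join letter-spaced words, e.g. 'D E C I S I O N' becomes 'DECISION'."""
    return SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_words("D E C I S I O N Of the additional of unhealthiness."))
print(collapse_spaced_words("J u d g e m e n t … moral damages."))
```

The trailing word boundary stops the match before a normal word such as "Of", so only the spaced-out heading is rewritten.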

Human Lawyers -> Annotated Database -> Classifier Model -> Predictions

ey mechanical turk

classifiers
1. TYPE OF PHRASE
• Requests
• Sentences
• Other
2. DECISION
• Granted
• Overruled

traditional nlp approaches
text representation: WORD COUNT, TF-IDF
tokenizers: N-GRAMS + STOPWORDS; UNIGRAM + STEMMING + STOPWORDS; UNIGRAM + STOPWORDS
tested algorithms: LOGISTIC REGRESSION, RANDOM FOREST, GRADIENT BOOSTING CLASSIFIER, MULTINOMIAL NAÏVE BAYES, STOCHASTIC GRADIENT DESCENT, SUPPORT VECTOR CLASSIFIER
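
To make the "unigram + stopwords + TF-IDF" representation concrete, here is a small standard-library sketch. It is not the code behind the deck, which does not name its library. It uses the plain weight tf × ln(N / df), whereas common libraries apply smoothed variants. The stopword list and the names `tokenize` and `tfidf` are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real one would be larger
# and in Portuguese for these documents).
STOPWORDS = {"the", "of", "a", "to", "that", "were", "not"}


def tokenize(text):
    """Lowercase unigram tokenizer that drops stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the worker claims overtime", "the worker claims moral damages"]
for vector in tfidf(docs):
    print(vector)
```

Terms that occur in every document ("worker", "claims") get weight zero, so the classifier sees only the terms that distinguish one phrase from another.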

first results
logistic regression + tf-idf + no stopwords + stemming
training f1-score: 0.90
testing f1-score: 0.81
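
For reference, the F1 score reported here is the harmonic mean of precision and recall. The sketch below computes it for a single positive class with hypothetical labels. The slide does not say how the score was averaged across classes, so this is an illustration of the metric only, and the function name `f1_score` is an assumption.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only.
y_true = ["granted", "granted", "overruled", "overruled"]
y_pred = ["granted", "overruled", "granted", "overruled"]
print(f1_score(y_true, y_pred, positive="granted"))
```

A training score noticeably above the testing score, as on this slide, is the usual sign that the model fits the annotated phrases better than it generalizes to unseen ones.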

classifiers
1. TYPE OF PHRASE
• Requests
• Sentences
• Other
2. DECISION
• Granted
• Overruled
DEEP LEARNING, TRADITIONAL NLP

EMBEDDINGS (Words): Task Specific, Pre-trained (Word2Vec)
LAYERS (Phrases): 1DConvNet, LSTM, Bidirectional-LSTM
OPTIMIZER: SGD, RMSProp, Nadam

embeddings: pre-trained, f-score = 0.897

embeddings: task-specific, f-score = 0.896

recurrent neural networks
“I do not find the defendant – in light of all available evidence and according to the law and the decision of the jury, and so it goes and yadda yadda – to be guilty.”

RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory

EMBEDDINGS: Pre-trained (Word2Vec)
LAYERS: Bidirectional-LSTM
OPTIMIZER: Nadam

alternative approaches
conv1D/dense, lstm, word2vec
documents / phrases / words

stop guessing

THANK YOU
Michel Fernandes
Rafael Kenski
