or maybe not...
PDF Reader
Grim
Docsplit
pdftk
Apache PDFBox
Processing pipeline
Slide 31
Slide 31 text
How do we cut price
on iText, then?
Slide 32
Slide 32 text
Fix PDFs before
splitting
iText + Apache PDFBox
Reduce file-based license cost by 80%
Use free software to split PDFs
Processing pipeline
Slide 33
Slide 33 text
Table extraction
Processing pipeline
Fix content stream
Split into single pages
Download report
Slide 34
Slide 34 text
How do we know which
pages are relevant?
Slide 35
Slide 35 text
We care only
about balance
sheets
Close to the end of a report
⅓ 40% of pages
Processing pipeline
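A minimal sketch of the "close to the end of a report" heuristic, assuming it means keeping only the last 40% of pages as balance sheet candidates. The function name `candidate_pages` and the `tail_percent` parameter are illustrative and do not come from the talk.

```python
def candidate_pages(total_pages: int, tail_percent: int = 40) -> list[int]:
    """Return 1-based page numbers covering the last `tail_percent` of a report."""
    if total_pages <= 0:
        return []
    # Integer ceiling division, so short reports still keep at least one page.
    count = (total_pages * tail_percent + 99) // 100
    first = total_pages - count + 1
    return list(range(first, total_pages + 1))


print(candidate_pages(10))  # [7, 8, 9, 10]
```

Working with whole percentages avoids floating-point rounding when the page count is computed.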
Slide 36
Slide 36 text
Seems quite a lot, how can
we reduce that number?
Slide 37
Slide 37 text
Keyword analysis
of unstructured
text
Google Vision API
$0.0015 per page
Every single page can be processed
40% → 8% of pages get qualified
Processing pipeline
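Keyword analysis on the raw OCR text could look roughly like the sketch below. The keyword list and the `min_hits` threshold are assumptions made for illustration; the talk does not say which keywords or thresholds were actually used, and the OCR call itself is left out.

```python
import re

# Hypothetical keyword list; the real one is not given in the talk.
BALANCE_SHEET_KEYWORDS = (
    "balance sheet",
    "total assets",
    "total liabilities",
    "net assets",
)


def keyword_hits(page_text: str, keywords=BALANCE_SHEET_KEYWORDS) -> int:
    """Count how many distinct keywords occur in the OCR text of one page."""
    # Collapse whitespace so keywords broken across lines still match.
    normalized = re.sub(r"\s+", " ", page_text.lower())
    return sum(1 for keyword in keywords if keyword in normalized)


def relevant_pages(pages: dict[int, str], min_hits: int = 2) -> list[int]:
    """Return page numbers whose text contains at least `min_hits` keywords."""
    return sorted(n for n, text in pages.items() if keyword_hits(text) >= min_hits)


pages = {
    12: "Chairman's statement. It was a good year.",
    47: "Balance sheet\nTotal assets 3,456\nTotal   liabilities (2,109)",
}
print(relevant_pages(pages))  # [47]
```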
Slide 38
Slide 38 text
Why don’t we use Google
Vision API for more?
Slide 39
Slide 39 text
Google Vision API
OCR != layout detection
Highly specialized
OCR
Labels
Processing pipeline
Slide 40
Slide 40 text
Keyword analysis
Table extraction
Processing pipeline
Fix content stream
Split into single pages
Convert to image
Extract raw text
Detect relevant pages
Download report
Slide 41
Slide 41 text
OK, but how to
extract tables?
Slide 42
Slide 42 text
Tables with text
from images
Abbyy OCR SDK
OCR is never 100% correct
RTF output contains the detected tables
$0.03 per page - a bit pricy...
Processing pipeline
Slide 43
Slide 43 text
It’s cheaper to use
Google Vision API to
get keywords and pass
only the most relevant
pages to Abbyy
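The saving can be checked with simple arithmetic using the per-page prices from the slides. The sketch assumes that both percentages refer to all pages of a report (40% candidates, 8% qualified) and that Google Vision API is run only on the 40% candidate pages. The talk does not spell out either detail, so treat the numbers as an illustration.

```python
VISION_COST_PER_PAGE = 0.0015  # USD, from the slides
ABBYY_COST_PER_PAGE = 0.03     # USD, from the slides


def cost_abbyy_only(total_pages: int, candidate_share: float = 0.40) -> float:
    """Cost of sending every candidate page straight to Abbyy."""
    return total_pages * candidate_share * ABBYY_COST_PER_PAGE


def cost_with_prefilter(
    total_pages: int,
    candidate_share: float = 0.40,
    qualified_share: float = 0.08,
) -> float:
    """Cost of Vision OCR on candidate pages plus Abbyy on qualified pages only."""
    vision = total_pages * candidate_share * VISION_COST_PER_PAGE
    abbyy = total_pages * qualified_share * ABBYY_COST_PER_PAGE
    return vision + abbyy


# For a 100-page report under the assumptions above:
print(f"{cost_abbyy_only(100):.2f}")      # 1.20
print(f"{cost_with_prefilter(100):.2f}")  # 0.30
```

Even if Vision were run on every page instead of only the candidates, the Vision part would be 0.15 USD per 100 pages, still far below the Abbyy-only figure.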
Slide 44
Slide 44 text
How good is Abbyy?
Slide 45
Slide 45 text
~30 reports
assessed
manually
Missing headers,
Missing tables,
text distortion
Headers not included
Tables not detected
Text distorted by OCR
Processing pipeline
Slide 46
Slide 46 text
What if tables are missing?
Slide 47
Slide 47 text
Tables are there… most of the time
Slide 48
Slide 48 text
How to read
missing tables?
Slide 49
Slide 49 text
Convert RTF
to HTML
online-convert.com
Most tools can handle only a single
type of table
Abbyy uses more than one
Processing pipeline
Slide 50
Slide 50 text
How to fix other problems
with tables?
Slide 51
Slide 51 text
Pre-analysis
cleanup
Try to correct common OCR errors
Insert missing headers with years
Split tables by header detection
Processing pipeline
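As an illustration of the first cleanup step, the sketch below corrects common look-alike characters in cells that are meant to be numeric. The substitution table is an assumption. The talk does not list which OCR errors were corrected.

```python
import re

# Hypothetical look-alike substitutions (letter read by OCR -> intended digit).
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Digits with thousands separators, optionally wrapped in parentheses (negatives).
NUMERIC_SHAPE = re.compile(r"^\(?[\d,.]+\)?$")


def clean_numeric_cell(cell: str) -> str:
    """Fix look-alike characters, but only if the cell then reads as a number."""
    stripped = cell.strip()
    # Require at least one real digit so words such as "SOLO" are left alone.
    if not any(ch.isdigit() for ch in stripped):
        return stripped
    candidate = stripped.translate(DIGIT_LOOKALIKES)
    return candidate if NUMERIC_SHAPE.match(candidate) else stripped


print(clean_numeric_cell("3,4S6"))         # 3,456
print(clean_numeric_cell("(2,lO9)"))       # (2,109)
print(clean_numeric_cell("Total assets"))  # Total assets
```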
Slide 52
Slide 52 text
Where is the value though?
Slide 53
Slide 53 text
Extraction based
on a heat map
Total assets, UK, 2017
Processing pipeline
Balance sheet        2017     2017     2016     2016
£'000                UK       US       UK       US
Total assets         3,456    6,543    2,345    5,432
Total liabilities    (2,109)  (3,210)  (1,098)  (2,109)
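One way to read "extraction based on a heat map" is that every cell gets a score from the keywords found in its row label and column headers, and the hottest cell wins. The sketch below applies that reading to the example table for the query "Total assets, UK, 2017". The scoring scheme is an assumption, not the implementation from the talk.

```python
TABLE = [
    ["Balance sheet", "2017", "2017", "2016", "2016"],
    ["£'000", "UK", "US", "UK", "US"],
    ["Total assets", "3,456", "6,543", "2,345", "5,432"],
    ["Total liabilities", "(2,109)", "(3,210)", "(1,098)", "(2,109)"],
]


def heat_map(table, row_keyword, column_keywords, header_rows=2):
    """Score each data cell: +1 per matching column header, +1 for a matching row label."""
    width = len(table[0])
    column_heat = [0] * width
    for header in table[:header_rows]:
        for col, cell in enumerate(header):
            if cell in column_keywords:
                column_heat[col] += 1
    heat = {}
    for r, row in enumerate(table[header_rows:], start=header_rows):
        row_heat = 1 if row[0].lower() == row_keyword.lower() else 0
        for c in range(1, width):
            heat[(r, c)] = row_heat + column_heat[c]
    return heat


def extract(table, row_keyword, column_keywords):
    """Return the text of the cell with the highest heat."""
    heat = heat_map(table, row_keyword, column_keywords)
    r, c = max(heat, key=heat.get)
    return table[r][c]


print(extract(TABLE, "Total assets", {"UK", "2017"}))  # 3,456
```

The full `heat_map` result is also useful on its own: cells that score almost as high as the winner indicate an ambiguous extraction.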
Slide 54
Slide 54 text
Keyword analysis
Table extraction
Processing pipeline
Fix content stream
Split into single pages
Convert to image
Extract raw text
Detect relevant pages
Extract tables
Extract data points
Download report
Slide 55
Slide 55 text
How reliable is the
extracted value?
Slide 56
Slide 56 text
Probability
assessment
3 models tested
Weighted keywords
Value formatting and range
Other similarly good results
Processing pipeline
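The three signals listed above could be combined into a single confidence value along the following lines. This is only a sketch: the weights, the value range, and the function names are hypothetical, and the talk does not describe how the three tested models actually worked.

```python
import re


def formatting_score(value: str, low: int = 0, high: int = 10_000_000) -> float:
    """1.0 for a well-formed number in the expected range, 0.5 if out of range, else 0.0."""
    match = re.fullmatch(r"\(?(\d{1,3}(?:,\d{3})*|\d+)\)?", value.strip())
    if not match:
        return 0.0
    number = int(match.group(1).replace(",", ""))
    return 1.0 if low <= number <= high else 0.5


def ambiguity_score(best_heat: int, other_heats: list[int]) -> float:
    """Drops towards 0 when other cells scored as well as the chosen one."""
    rivals = sum(1 for heat in other_heats if heat >= best_heat)
    return 1.0 / (1 + rivals)


def confidence(keyword_score, value, best_heat, other_heats, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of keyword match, value formatting, and lack of rival results."""
    parts = (
        keyword_score,
        formatting_score(value),
        ambiguity_score(best_heat, other_heats),
    )
    return sum(w * p for w, p in zip(weights, parts))


print(round(confidence(1.0, "3,456", 3, [2, 2, 1]), 2))  # 1.0
print(round(confidence(1.0, "3,4S6", 3, [3, 2]), 2))     # 0.55
```

A low value does not prove the extraction is wrong; it only marks the data point as one a human should look at first.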
Slide 57
Slide 57 text
Wait, so it might be wrong...
Slide 58
Slide 58 text
Probability is calculated
to bring attention to
potential errors as
extraction might be
wrong
Slide 59
Slide 59 text
What’s the error rate?
Slide 60
Slide 60 text
60%
Error rate
Slide 61
Slide 61 text
Meaningful data
extraction from
images is hard for
computers
Slide 62
Slide 62 text
But it’s easy
for humans...
Slide 63
Slide 63 text
No content
Slide 64
Slide 64 text
How to automate this?
Slide 65
Slide 65 text
Mechanical Turk
Scalable workforce from Amazon
Comparable cost per page to Abbyy
Pay only for approved answers
Automated cross-check policy
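An automated cross-check policy might, for example, accept a value only when enough workers independently give the same answer. The sketch below assumes that rule, with `min_agreement` as a hypothetical threshold. The actual policy used in the talk is not described, and the Mechanical Turk API calls are left out.

```python
from collections import Counter


def cross_check(answers: list[str], min_agreement: int = 2):
    """Return the value most workers agree on, or None if agreement is too low."""
    # Ignore thousands separators so "3,456" and "3456" count as the same answer.
    normalized = [a.replace(",", "").strip() for a in answers if a.strip()]
    if not normalized:
        return None
    value, votes = Counter(normalized).most_common(1)[0]
    return value if votes >= min_agreement else None


print(cross_check(["3,456", "3456", "3,465"]))  # 3456
print(cross_check(["3,456", "3,465"]))          # None
```

Answers that match the agreed value would be the ones to approve and pay for; a `None` result means the task needs more workers or a manual review.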
Slide 66
Slide 66 text
How does it work?
Slide 67
Slide 67 text
No content
Slide 68
Slide 68 text
What’s the error rate?
Slide 69
Slide 69 text
60%
Error rate for
Automated extraction
18%
Error rate for
Mechanical Turk
Slide 70
Slide 70 text
Where do we go from
here?
Slide 71
Slide 71 text
Next steps
RW Labs
More reliable probability model
Limit relevant pages
Add connector for Mechanical Turk
…
AI?
Slide 72
Slide 72 text
What have we learned?
Slide 73
Slide 73 text
Lessons learned
Build, measure, learn
Automation is a part of the picture
Human supervision is necessary
Mechanical Turk - valid alternative
Start simple and be pragmatic
Cost estimation is important for
development
Slide 74
Slide 74 text
BONUS: How to configure
Heroku for all of that?
Slide 75
Slide 75 text
Buildpacks
Buildpack API
Ruby
Node.js
Java
Google Cloud SDK
Heroku deployment
Slide 76
Slide 76 text
Why do you need Java?
Slide 77
Slide 77 text
java -jar ...
iText, Apache PDFBox
Requires pom.xml
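<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>GROUP_HERE</groupId>
  <artifactId>ARTIFACT_HERE</artifactId>
  <version>VERSION_HERE</version>
</project>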
And some memory...
Heroku deployment
Slide 78
Slide 78 text
What about
Google Cloud SDK?
Slide 79
Slide 79 text
Google Cloud
SDK
heroku-google-cloud-buildpack
Download and install SDK
Build JSON with credentials from
ENV variables
Activate service account using
profile.d script
Heroku deployment
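The "build JSON with credentials from ENV variables" step can be sketched as follows. The variable names (`GOOGLE_PROJECT_ID`, `GOOGLE_CLIENT_EMAIL`, `GOOGLE_PRIVATE_KEY`) are assumptions chosen for the example, the JSON shows only a subset of the fields found in a real service account key file, and the buildpack itself most likely does this in a shell script rather than in Python.

```python
import json
import os


def build_credentials(env=os.environ) -> str:
    """Assemble a service account key document from environment variables."""
    credentials = {
        "type": "service_account",
        # Hypothetical variable names; use whatever your config vars are called.
        "project_id": env["GOOGLE_PROJECT_ID"],
        "client_email": env["GOOGLE_CLIENT_EMAIL"],
        # Config vars often hold the key with literal "\n" sequences;
        # turn them back into real newlines for the PEM block.
        "private_key": env["GOOGLE_PRIVATE_KEY"].replace("\\n", "\n"),
    }
    return json.dumps(credentials, indent=2)


example_env = {
    "GOOGLE_PROJECT_ID": "my-project",
    "GOOGLE_CLIENT_EMAIL": "reader@my-project.iam.gserviceaccount.com",
    "GOOGLE_PRIVATE_KEY": "-----BEGIN PRIVATE KEY-----\\nABC\\n-----END PRIVATE KEY-----\\n",
}
print(json.loads(build_credentials(example_env))["project_id"])  # my-project
```

The resulting file can then be passed to `gcloud auth activate-service-account --key-file=<path>` when the dyno starts, which is the role of the profile.d script mentioned above.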