Slide 1

Slide 1 text

How to extract tabular data from scanned PDFs? A proof of concept by Railsware

Slide 2

Slide 2 text

Artur Hebda Software Engineer at Railsware

Slide 3

Slide 3 text

Agenda
Project overview
PDF challenges
Processing pipeline
Bonus: Heroku deployment

Slide 4

Slide 4 text

Seems simple, right?
                   2017      2016
Assets            £'000     £'000
Total assets    536,812   495,788

Slide 5

Slide 5 text

Well… it’s not that simple...

Slide 6

Slide 6 text

Source: time.com

Slide 7

Slide 7 text

Who was the client?

Slide 8

Slide 8 text

Fin-tech company: UK-based, more fin than tech
Looks for automation instead of hiring people
Cares about market trends
Project overview

Slide 9

Slide 9 text

What did the client want?

Slide 10

Slide 10 text

Ideal flow: Excel (company number) in → financial data points extracted for the last two years → Excel out
Project overview

Slide 11

Slide 11 text

What was the goal?

Slide 12

Slide 12 text

Semi-automated extraction: proof of concept in 3 weeks
27 → 3 financial data points
500 → 100 reports
Project overview

Slide 13

Slide 13 text

What did we do?

Slide 14

Slide 14 text

Research & MVP: rapid prototyping with Rails
Sidekiq-based processing pipeline
Cloud storage
3rd-party services
Project overview

Slide 15

Slide 15 text

Let’s talk about PDF

Slide 16

Slide 16 text

What is a PDF? Low-level drawing instructions: move cursor, draw, text box (see the PDF Reference; inspect with iText RUPS). Example content stream:
((in%)) Tj
/CS0 cs 0.894 0.11 0.224 scn
/GS0 gs
/T1_2 1 Tf
6.3 0 0 6.3 135.13 690.3 Tm
[(France) -3925.4 (Eurozone) -2460.4 (Kingdom)] TJ
15.932 1.159 Td (United) Tj
7.173 -1.159 Td (States) Tj
-0.206 1.159 Td (United) Tj
PDF challenges

Slide 17

Slide 17 text

Where are the tables then?

Slide 18

Slide 18 text

Tables are not there... per se, but text elements are
Source: Nurminen
PDF challenges

Slide 19

Slide 19 text

A table in a PDF is not a duck: if it looks like a table, it definitely isn't one

Slide 20

Slide 20 text

How do we know where a table is?

Slide 21

Slide 21 text

Image processing: edge detection, position analysis
Tabula, Nurminen's algorithm
PDF challenges

Slide 22

Slide 22 text

How difficult can it be?

Slide 23

Slide 23 text

Processing pipeline Download report

Slide 24

Slide 24 text

Processing pipeline: Download report → Fix content stream → Table extraction

Slide 25

Slide 25 text

Why would you ever have to fix a PDF?

Slide 26

Slide 26 text

Warning: content stream ended unexpectedly; output may be incorrect
Processing pipeline

Slide 27

Slide 27 text

How to fix the broken files?

Slide 28

Slide 28 text

Split pages with iText: fast, reliable, but quite expensive (volume-based license)
Processing pipeline

Slide 29

Slide 29 text

Maybe something else would work, though

Slide 30

Slide 30 text

or maybe not... PDF Reader, Grim, Docsplit, pdftk, Apache PDFBox
Processing pipeline

Slide 31

Slide 31 text

How do we cut price on iText, then?

Slide 32

Slide 32 text

Fix PDFs before splitting: iText + Apache PDFBox
Reduce file-based license cost by 80%
Use free software to split PDFs
Processing pipeline
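The repair-then-split step could be orchestrated roughly like this; the jar name, PDFBox tool names, and file paths are illustrative assumptions, not the project's actual commands:

```ruby
require "shellwords"

# Hypothetical sketch: repair a PDF first, then split it with free tooling
# so the expensive iText license only covers the repair step.
def repair_and_split_commands(pdf_path, out_prefix)
  [
    # Re-save the document with Apache PDFBox to repair broken content streams...
    "java -jar pdfbox-app.jar WriteDecodedDoc #{Shellwords.escape(pdf_path)} repaired.pdf",
    # ...then split the repaired document into single pages for free.
    "java -jar pdfbox-app.jar PDFSplit -outputPrefix #{Shellwords.escape(out_prefix)} repaired.pdf"
  ]
end
```

Shelling out to `java -jar` keeps the Ruby pipeline free of JVM bindings, which also matters for the Heroku setup discussed later.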

Slide 33

Slide 33 text

Processing pipeline: Download report → Fix content stream → Split into single pages → Table extraction

Slide 34

Slide 34 text

How do we know which pages are relevant?

Slide 35

Slide 35 text

We care only about balance sheets
Close to the end of a report: the last ⅓, about 40% of pages
Processing pipeline

Slide 36

Slide 36 text

Seems quite a lot, how can we reduce that number?

Slide 37

Slide 37 text

Keyword analysis of unstructured text
Google Vision API: $0.0015 per page, so every single page can be processed
40% → 8% of pages get qualified
Processing pipeline
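The qualification step can be sketched as a simple keyword filter over the raw OCR text; the keyword list and hit threshold below are illustrative assumptions, and in the real pipeline the text would come from the Google Vision API response for each page image:

```ruby
# Keywords that typically appear on a UK balance sheet page (assumed list).
BALANCE_SHEET_KEYWORDS = [
  "balance sheet", "total assets", "total liabilities", "net assets"
].freeze

# A page qualifies when enough distinct keywords are found in its OCR text.
def balance_sheet_page?(ocr_text, min_hits: 2)
  text = ocr_text.downcase
  BALANCE_SHEET_KEYWORDS.count { |kw| text.include?(kw) } >= min_hits
end
```

Only pages that pass this cheap check would be forwarded to the expensive table-extraction OCR.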

Slide 38

Slide 38 text

Why don’t we use Google Vision API for more?

Slide 39

Slide 39 text

Google Vision API: OCR != layout detection
Highly specialized: OCR, labels
Processing pipeline

Slide 40

Slide 40 text

Processing pipeline: Download report → Fix content stream → Split into single pages → Convert to image → Extract raw text → Detect relevant pages (keyword analysis) → Table extraction

Slide 41

Slide 41 text

OK, but how to extract tables?

Slide 42

Slide 42 text

Tables with text from images: Abbyy OCR SDK
OCR is never 100% correct
The RTF output contains the detected tables
$0.03 per page, a bit pricey...
Processing pipeline

Slide 43

Slide 43 text

It’s cheaper to use Google Vision API to get keywords and pass only the most relevant pages to Abbyy

Slide 44

Slide 44 text

How good is Abbyy?

Slide 45

Slide 45 text

~30 reports assessed manually
Issues found: headers not included, tables not detected, text distorted by OCR
Processing pipeline

Slide 46

Slide 46 text

What if tables are missing?

Slide 47

Slide 47 text

Tables are there… most of the time

Slide 48

Slide 48 text

How to read missing tables?

Slide 49

Slide 49 text

Convert RTF to HTML: online-convert.com
Most tools can handle only a single type of table, while Abbyy uses more than one
Processing pipeline

Slide 50

Slide 50 text

How to fix other problems with tables?

Slide 51

Slide 51 text

Pre-analysis cleanup:
Try to correct common OCR errors
Insert missing headers with years
Split tables by header detection
Processing pipeline
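The "correct common OCR errors" step could look like the following for numeric cells; the substitution table is an illustrative assumption about which letter/digit confusions are common:

```ruby
# Common OCR letter-to-digit confusions (assumed mapping).
OCR_DIGIT_FIXES = { "O" => "0", "o" => "0", "l" => "1", "I" => "1", "S" => "5" }.freeze

# Fix confusions only inside values that already contain digits,
# e.g. "5O6,8l2" -> "506,812", while leaving ordinary words untouched.
def fix_numeric_cell(cell)
  return cell unless cell.match?(/\d/) # only touch cells that look numeric
  cell.gsub(/[OolIS]/) { |ch| OCR_DIGIT_FIXES[ch] }
end
```

Gating the substitution on the presence of digits keeps row labels like "Sales" intact.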

Slide 52

Slide 52 text

Where is the value though?

Slide 53

Slide 53 text

Extraction based on a heat map: Total assets, UK, 2017
Balance sheet
                       2017               2016
£'000               UK       US       UK       US
Total assets       3,456    6,543    2,345    5,432
Total liabilities (2,109)  (3,210)  (1,098)  (2,109)
Processing pipeline
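The heat-map idea can be sketched as scoring each cell by how many query attributes its row label and column header match; the table shape (column metadata plus labelled rows) and the +1-per-match scoring are assumptions to show the idea, not the real model:

```ruby
# Pick the "hottest" cell in an extracted table for a given query.
def extract_data_point(table, label:, year:, region:)
  best = nil
  table[:rows].each do |row|
    next unless row[:label].downcase.include?(label.downcase)
    row[:values].each_with_index do |value, i|
      col = table[:columns][i]
      # Each matching attribute heats the cell up by one point.
      score = (col[:year] == year ? 1 : 0) + (col[:region] == region ? 1 : 0)
      best = { value: value, score: score } if best.nil? || score > best[:score]
    end
  end
  best
end
```

Querying the slide's balance sheet for Total assets / UK / 2017 would light up the first column of the "Total assets" row.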

Slide 54

Slide 54 text

Processing pipeline: Download report → Fix content stream → Split into single pages → Convert to image → Extract raw text → Detect relevant pages (keyword analysis) → Extract tables → Extract data points (table extraction)

Slide 55

Slide 55 text

How reliable is the extracted value?

Slide 56

Slide 56 text

Probability assessment: 3 models tested
Weighted keywords
Value formatting and range
Other models gave similarly good results
Processing pipeline
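A weighted model combining the listed signals might look like this; the weights, the formatting regex, and the magnitude range are illustrative assumptions:

```ruby
# Combine keyword match, thousands formatting, and plausible magnitude
# into a single confidence score between 0.0 and 1.0.
def extraction_confidence(label_matched:, value:)
  score = 0.0
  score += 0.5 if label_matched                                # weighted keywords
  score += 0.3 if value.match?(/\A\(?\d{1,3}(,\d{3})*\)?\z/)   # £'000 formatting, optional parens
  score += 0.2 if value.delete("^0-9").length.between?(3, 9)   # plausible magnitude
  score
end
```

Low scores would flag a data point for human review instead of rejecting it outright.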

Slide 57

Slide 57 text

Wait, so it might be wrong...

Slide 58

Slide 58 text

Probability is calculated to bring attention to potential errors, since the extraction might be wrong

Slide 59

Slide 59 text

What’s the error rate?

Slide 60

Slide 60 text

Error rate: 60%

Slide 61

Slide 61 text

Meaningful data extraction from images is hard for computers

Slide 62

Slide 62 text

But it’s easy for humans...

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

How to automate this?

Slide 65

Slide 65 text

Mechanical Turk: scalable workforce from Amazon
Cost per page comparable to Abbyy
Pay only for approved answers
Automated cross-check policy
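An automated cross-check policy can be sketched as majority voting: the same page goes to several workers, and an answer is approved (and paid for) only when enough of them agree. The agreement threshold below is an assumption:

```ruby
# Approve the most common answer only if it reaches the agreement threshold;
# otherwise return nil so the page can be re-submitted or escalated.
def approved_answer(answers, min_agreement: 2)
  value, count = answers.tally.max_by { |_, c| c }
  count >= min_agreement ? value : nil
end
```

Because payment is tied to approval, workers who submit outlier answers are simply not paid, which keeps the incentive aligned with accuracy.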

Slide 66

Slide 66 text

How does it work?

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

What’s the error rate?

Slide 69

Slide 69 text

Automated extraction: 60% error rate
Mechanical Turk: 18% error rate

Slide 70

Slide 70 text

Where do we go from here?

Slide 71

Slide 71 text

Next steps (RW Labs):
More reliable probability model
Limit relevant pages
Add connector for Mechanical Turk
... AI?

Slide 72

Slide 72 text

What have we learned?

Slide 73

Slide 73 text

Lessons learned:
Build, measure, learn
Automation is a part of the picture
Human supervision is necessary
Mechanical Turk is a valid alternative
Start simple and be pragmatic
Cost estimation is important for development

Slide 74

Slide 74 text

BONUS: How to configure Heroku for all of that?

Slide 75

Slide 75 text

Buildpacks (Buildpack API): Ruby, Node.js, Java, Google Cloud SDK
Heroku deployment

Slide 76

Slide 76 text

Why do you need Java?

Slide 77

Slide 77 text

java -jar ... for iText, Apache PDFBox
Requires pom.xml:
<project>
  <modelVersion>4.0.0</modelVersion>
  <dependencies>
    <dependency>
      <groupId>GROUP_HERE</groupId>
      <artifactId>ARTIFACT_HERE</artifactId>
      <version>VERSION_HERE</version>
    </dependency>
  </dependencies>
</project>
And some memory...
Heroku deployment

Slide 78

Slide 78 text

What about Google Cloud SDK?

Slide 79

Slide 79 text

Google Cloud SDK: heroku-google-cloud-buildpack
Download and install the SDK
Build JSON with credentials from ENV variables
Activate the service account using a .profile.d script
Heroku deployment
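The "build JSON with credentials from ENV variables" step could be sketched as follows: the service-account key file cannot be committed to the repo, so it is rebuilt from config vars at boot. The variable names and the subset of JSON fields shown are assumptions, not the actual buildpack contract:

```ruby
require "json"

# Rebuild a service-account credentials file from environment variables.
# Heroku config vars store the private key with literal "\n" sequences,
# so they are converted back to real newlines here.
def build_credentials_json(env)
  {
    "type"         => "service_account",
    "project_id"   => env.fetch("GCP_PROJECT_ID"),
    "client_email" => env.fetch("GCP_CLIENT_EMAIL"),
    "private_key"  => env.fetch("GCP_PRIVATE_KEY").gsub('\n', "\n")
  }.to_json
end
```

The resulting string would be written to a temp file whose path is handed to the SDK's service-account activation step.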

Slide 80

Slide 80 text

Questions?

Slide 81

Slide 81 text

Thank you