PDF Full Text Search in Go (Peter Williams 2019)

PDF Full Text Search in Go
Why and how I wrote it
November 2019

Source Code https://github.com/PaperCutSoftware/pdfsearch

Talk Format
► Background (10 minutes)
► Code (15 minutes)
► Summary and Questions (5 minutes)

About me
► Work at PaperCut
► Write system software, document processing and image processing
► Background in C and Python
► Developing in Go for the last few years

About PaperCut
► PaperCut makes print monitoring software and print servers
Our server software:
► Runs on customer servers, in the cloud and on edge nodes.
► Runs on Windows, Mac and several Linuxes.
► Is tightly coupled to the OS it runs on.
► Is usually soft real time.

Why PaperCut uses Go
► Go binaries run on Windows, Mac and Linuxes
► Reasonable memory footprint
► Fast enough for soft real time
► Simple language. Good for writing complex software.
► Fast development cycles
► Libraries for common needs: string processing, networking, crypto, graphics, etc.
► Libraries follow the simplicity of the base language.

PaperCut’s IPP Print Servers
► PaperCut makes some IPP print servers
► 10M end users

IPP Print Server
► IPP is an HTTP-based printing protocol
► Widely used
► Basis of CUPS, the printing subsystem on Mac and Unix
► Basis of AirPrint, the native printing on iOS devices
► Needs a directory service, typically DNS or mDNS
► Uses PDF as internal print document format

PaperCut IPP Print Server
► We wrote a pure Go IPP print server in 2016
► Much simpler than CUPS, the best known IPP server
► Built on top of net/http
► Standard Go libraries and some elbow grease.

Can we add value?
IPP servers store print documents as PDFs. Full text search of the documents can help customers in several ways:
► Print alerts: Search for a fixed term in each document before it is printed.
► Print archive search: Search over all of a user’s documents. Present her with a list and allow her to print selected documents or pages.
► Qrdoc Program: Adds a QR code to documents. Readers of the paper document access the digital copy by scanning the QR code.

Requirements
► Needs to run on all OSes
► Has a modest memory footprint
We have to write this ourselves!
Rules out Elasticsearch / Solr / Lucene.
We don’t complicate our product developers’ lives.

Extra Requirement
► Generate previews of matches for client software

The Pocket mobile app displays document previews before printing

Anatomy of a PDF Full Text Search
1. Convert PDF to text.
2. Do full text search on the extracted text. Get top matches.
3. Find page number and location of matches in PDFs.
4. Mark up PDF pages.
5. Optionally rasterize PDF pages to PNG at specified PPI.

General Solution
► PDF text extraction library
► Full text search library
► PDF markup library
► Code to connect all the above

Choosing libraries
Pure Go worked well for us

Choosing libraries - PDF text extractor
► Pure Go had worked well for us.
► No Go PDF library had a high quality text extractor.
So I added a text extractor to UniDoc, the PDF library we used.
► Easier to write the code for what we needed
► More work than using an existing PDF text extractor
► UniDoc is simple, with consistent patterns - Go library ethos

Choosing libraries - Full text search
► Pure Go had worked well for us.
► bleve looked simple and clean.
► Simpler, with less history than Lucene.
► Had all the features we needed. We didn’t need many.
► We still think Lucene is great.

Domain Knowledge
PDF is a graphics language, not a text markup language. Text is drawn at specified positions, not necessarily in reading order. E.g.
    1 0 0 1 100 700 Tm (World!)Tj
    1 0 0 1 20 700 Tm (Hello)Tj
writes “Hello World!” near the top left of an A4 page.
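
A minimal sketch of why position matters, assuming each text-showing operation has already been reduced to a string plus the x, y position set by its Tm operator. The fragment type and readingOrder function are hypothetical names for illustration; they are not part of UniDoc or pdfsearch, and a real extractor also has to deal with fonts, rotation, columns and line tolerance.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fragment is a hypothetical record of one text-showing operation:
// the string drawn and the position set by the preceding Tm operator.
type fragment struct {
	Text string
	X, Y float64 // PDF user space: origin at bottom left, Y grows upward
}

// readingOrder sorts fragments top to bottom, then left to right, and
// joins them with spaces. The input slice is left unmodified.
func readingOrder(frags []fragment) string {
	sorted := append([]fragment(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Y != sorted[j].Y {
			return sorted[i].Y > sorted[j].Y // higher on the page comes first
		}
		return sorted[i].X < sorted[j].X
	})
	words := make([]string, len(sorted))
	for i, f := range sorted {
		words[i] = f.Text
	}
	return strings.Join(words, " ")
}

func main() {
	// Fragments in content stream order, as in the example on the slide.
	frags := []fragment{
		{Text: "World!", X: 100, Y: 700},
		{Text: "Hello", X: 20, Y: 700},
	}
	fmt.Println(readingOrder(frags)) // Hello World!
}
```

Reading the content stream in order would give "World! Hello"; sorting by position recovers the text a reader sees.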

Key Insight
PDF text extractors need to track the position of text on a page. We can use that positional information to find the location of text search matches on a PDF page.

Writing the program - basic idea
Extract text and its position on the page from PDF.
    type TextMark struct {
        Text string
        BBox model.PdfRectangle
    }
Typically one TextMark per few characters.

Writing the program - basic idea
► Sort slice of TextMarks by their offsets in the extracted text.
► Search extracted text and return offsets of start and end of match.
► Binary search slice of TextMarks to obtain the sub-slice of TextMarks corresponding to the search match.
► Find region of PDF page covered by sub-slice of TextMarks.
► Mark up PDF page.
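
The sort and binary search steps above can be sketched with the standard library alone. The mark type and rangeOffset function below are simplified, hypothetical stand-ins, not the UniDoc types; the sketch assumes each mark records the offset at which its text starts in the extracted page text.

```go
package main

import (
	"fmt"
	"sort"
)

// mark is a hypothetical, simplified stand-in for a TextMark: a piece of
// text plus the offset where it starts in the extracted page text.
type mark struct {
	Offset int
	Text   string
}

// rangeOffset returns the marks with start <= Offset < end.
// marks must already be sorted by Offset.
func rangeOffset(marks []mark, start, end int) []mark {
	if end < start {
		return nil
	}
	// sort.Search does a binary search for the first index where the
	// predicate is true, or len(marks) if there is none.
	lo := sort.Search(len(marks), func(i int) bool { return marks[i].Offset >= start })
	hi := sort.Search(len(marks), func(i int) bool { return marks[i].Offset >= end })
	return marks[lo:hi]
}

func main() {
	// Marks for the page text "Hello World!", deliberately out of order.
	marks := []mark{{6, "Wor"}, {0, "Hel"}, {9, "ld!"}, {3, "lo "}}
	sort.Slice(marks, func(i, j int) bool { return marks[i].Offset < marks[j].Offset })

	// Suppose a search matched "World!" at offsets 6 to 12.
	for _, m := range rangeOffset(marks, 6, 12) {
		fmt.Printf("%d %q\n", m.Offset, m.Text)
	}
}
```

One limitation of this sketch: a mark that starts before the match but extends into it is not returned, because only the starting offset is compared.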

Writing the program - extract text from PDF
Extract text and its positions from PDF.
extractor.TextMarkArray is the text and positions of a PDF page.
    // ExtractPageTextMarks returns the extracted text and corresponding TextMarks on `page`.
    func ExtractPageTextMarks(page *model.PdfPage) (string, *extractor.TextMarkArray, error)

Writing the program - bounding box of matched text
    // RangeOffset returns the TextMarks in `ma` that have `start` <= TextMark.Offset < `end`.
    func (ma *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)

    // BBox returns the smallest axis-aligned rectangle that encloses all the TextMarks in `ma`.
    func (ma *TextMarkArray) BBox() (model.PdfRectangle, bool)
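
A rough sketch of what "smallest axis-aligned rectangle that encloses all the TextMarks" means in practice. The rect type and bbox function are hypothetical stand-ins written for this example; they are not the UniDoc implementation, and the field names are an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// rect is a hypothetical stand-in for a PDF rectangle, given by its
// lower-left (Llx, Lly) and upper-right (Urx, Ury) corners.
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// bbox returns the smallest axis-aligned rectangle that encloses every
// rectangle in rects. The bool is false when rects is empty.
func bbox(rects []rect) (rect, bool) {
	if len(rects) == 0 {
		return rect{}, false
	}
	u := rects[0]
	for _, r := range rects[1:] {
		u.Llx = math.Min(u.Llx, r.Llx)
		u.Lly = math.Min(u.Lly, r.Lly)
		u.Urx = math.Max(u.Urx, r.Urx)
		u.Ury = math.Max(u.Ury, r.Ury)
	}
	return u, true
}

func main() {
	// Two adjacent marks on the same line; the second has a descender.
	marks := []rect{
		{Llx: 100, Lly: 700, Urx: 120, Ury: 712},
		{Llx: 120, Lly: 698, Urx: 140, Ury: 712},
	}
	if box, ok := bbox(marks); ok {
		fmt.Printf("%+v\n", box)
	}
}
```

A match that wraps across two lines would produce one tall rectangle covering both lines with this approach; splitting marks by line before taking the union is one way to avoid that.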

Writing the program - full text search
Index page text like this. Index is a bleve index.
    id := fmt.Sprintf("%04X.%d", dp.DocIdx, dp.PageIdx)
    idText := IDText{ID: id, Text: dp.Text}
    err = index.Index(id, idText)
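
The ID encodes both the document and the page, so a search hit can be traced back to a page. A small sketch of building and decoding an ID in that format follows; pageID and parsePageID are hypothetical helpers, and the integer types chosen for the two indexes are an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageID builds an ID in the format used on the slide: the document
// index as at least four upper-case hex digits, a dot, then the page
// index in decimal.
func pageID(docIdx uint64, pageIdx uint32) string {
	return fmt.Sprintf("%04X.%d", docIdx, pageIdx)
}

// parsePageID reverses pageID.
func parsePageID(id string) (docIdx uint64, pageIdx uint32, err error) {
	docHex, pageDec, ok := strings.Cut(id, ".")
	if !ok {
		return 0, 0, fmt.Errorf("page ID %q has no '.' separator", id)
	}
	docIdx, err = strconv.ParseUint(docHex, 16, 64)
	if err != nil {
		return 0, 0, err
	}
	p, err := strconv.ParseUint(pageDec, 10, 32)
	if err != nil {
		return 0, 0, err
	}
	return docIdx, uint32(p), nil
}

func main() {
	id := pageID(42, 7)
	fmt.Println(id) // 002A.7

	docIdx, pageIdx, err := parsePageID(id)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(docIdx, pageIdx) // 42 7
}
```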

Writing the program - full text search
Search the bleve index like this.
    query := bleve.NewMatchQuery(term)
    search := bleve.NewSearchRequest(query)
    ..
    searchResults, err := index.Search(search)

Writing the program - Bleve match to PDF location
bleve matches are returned in a DocumentMatch struct. Inside DocumentMatch are Location structs:
    type Location struct {
        // Start and End are the byte offsets of the term in the field
        Start uint64 `json:"start"`
        End   uint64 `json:"end"`
    }
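
Because Start and End are byte offsets, they can slice a Go string directly, but they differ from character counts as soon as the text contains multi-byte UTF-8. The sketch below shows the difference. runeRange is a hypothetical helper, and whether a conversion like this is needed depends on how the text positions are indexed, which this sketch does not assume either way.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeRange converts the byte range [start, end) in text to the
// equivalent range of rune (character) indexes.
func runeRange(text string, start, end uint64) (int, int, error) {
	if start > end || end > uint64(len(text)) {
		return 0, 0, fmt.Errorf("byte range [%d, %d) is outside text of %d bytes", start, end, len(text))
	}
	return utf8.RuneCountInString(text[:start]), utf8.RuneCountInString(text[:end]), nil
}

func main() {
	text := "café menu" // "é" takes two bytes in UTF-8
	start, end := uint64(6), uint64(10)

	// Byte offsets slice the string directly.
	fmt.Println(text[start:end]) // menu

	// The same match counted in characters starts one position earlier.
	rs, re, err := runeRange(text, start, end)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rs, re) // 5 9
}
```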

Writing the program - Completion
► Map the Start and End offsets in the extracted text to a set of character positions
► Find the bounding box of those character positions
► Mark up the PDF

To allow matches to be displayed like this

Thank you!
© PaperCut 2019
T +61 (3) 8376 8610
F +61 (3) 8621 8983
www.papercut.com
Level 1, 3 Prospect Hill Road, Camberwell, Victoria, 3124, Australia

GopherConAU
