
PDF Full Text Search in Go (Peter Williams 2019)

GopherConAU

October 31, 2019

Transcript

  1. About me
    ► Work at PaperCut
    ► Write system software, document processing and image processing
    ► Background in C and Python
    ► Developing in Go for the last few years
  2. About PaperCut
    ► PaperCut makes print monitoring software and print servers
    Our server software:
    ► Runs on customer servers, in the cloud and on edge nodes.
    ► Runs on Windows, Mac and several Linuxes.
    ► Is tightly coupled to the OS it runs on.
    ► Is usually soft real time.
  3. Why PaperCut uses Go
    ► Go binaries run on Windows, Mac and Linuxes
    ► Reasonable memory footprint
    ► Fast enough for soft real time
    ► Simple language. Good for writing complex software.
    ► Fast development cycles
    ► Libraries for common needs: string processing, networking, crypto, graphics, etc.
    ► Libraries follow the simplicity of the base language.
  4. IPP Print Server
    ► IPP is an HTTP based printing protocol
    ► Widely used
    ► Basis of CUPS, the printing subsystem on Mac and Unix
    ► Basis of AirPrint, the native printing on iOS devices
    ► Needs a directory service, typically DNS or mDNS
    ► Uses PDF as internal print document format
  5. PaperCut IPP Print Server
    ► We wrote a pure Go IPP print server in 2016
    ► Much simpler than CUPS, the best known IPP server
    ► Built on top of net/http
    ► Standard Go libraries and some elbow grease.
  6. Can we add value?
    IPP servers store print documents as PDFs. Full text search of the documents can help customers in several ways:
    ► Print alerts: Search for a fixed term in each document before it is printed.
    ► Print archive search: Search over all of a user’s documents. Present her with a list and allow her to print selected documents or pages.
    ► Qrdoc Program: Adds a QR code to documents. Readers of the paper document access the digital copy by scanning the QR code.
  7. Requirements
    ► Needs to run on all OSes
    ► Have a modest memory footprint
    We have to write this ourselves!
    Rules out Elasticsearch / Solr / Lucene
    We don’t complicate our product developers’ lives
  8. Anatomy of a PDF Full Text Search
    1. Convert PDF to text.
    2. Do full text search on PDF. Get top matches.
    3. Find page number and location of matches in PDFs.
    4. Mark up PDF pages.
    5. Optionally rasterize PDF pages to PNG at specified PPI.
  9. General Solution
    ► PDF text extraction library
    ► Full text search library
    ► PDF markup library
    ► Code to connect all the above
  10. Choosing libraries - PDF text extractor
    ► Pure Go had worked well for us.
    ► No Go PDF library had a high quality text extractor
    So I added a text extractor to UniDoc, the PDF library we used.
    Insert more and more info
    Easier to write the code for what we needed
    More work than an existing PDF text extractor
    UniDoc is simple, with consistent patterns - Go library ethos
  11. Choosing libraries - Full text search
    ► Pure Go had worked well for us.
    ► bleve looked simple and clean.
    ► Simpler and less history than Lucene.
    ► Had all the features we needed. We didn’t need many.
    ► We still think Lucene is great.
  12. Domain Knowledge
    PDF is a graphics language, not a text markup language.
    Text is drawn at specified positions, not necessarily in reading order.
    E.g.
        1 0 0 1 100 700 Tm (World!)Tj
        1 0 0 1 20 700 Tm (Hello)Tj
    writes “Hello World!” near the top left of an A4 page.
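
    To make the reading-order point concrete, here is a minimal Go sketch that recovers "Hello World!" from the two draw operations on the slide. The Fragment type, the ReadingOrder function and the line tolerance are hypothetical names made up for this illustration, not part of UniDoc, and joining fragments with a single space is a simplification of what a real extractor has to do.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Fragment is a hypothetical stand-in for one piece of text drawn on a page.
// X and Y are the text origin in PDF user space, where Y grows upward.
type Fragment struct {
	Text string
	X, Y float64
}

// ReadingOrder groups fragments into lines (top of page first) and orders
// each line left to right. Fragments whose Y is within lineTol of the first
// fragment of a line are treated as belonging to that line.
func ReadingOrder(frags []Fragment, lineTol float64) string {
	sorted := append([]Fragment(nil), frags...) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines []string
	for start := 0; start < len(sorted); {
		end := start + 1
		for end < len(sorted) && sorted[start].Y-sorted[end].Y <= lineTol {
			end++
		}
		line := sorted[start:end]
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
		words := make([]string, len(line))
		for i, f := range line {
			words[i] = f.Text
		}
		// Simplification: always join with one space.
		lines = append(lines, strings.Join(words, " "))
		start = end
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Draw order from the slide: "World!" first, then "Hello".
	frags := []Fragment{
		{Text: "World!", X: 100, Y: 700},
		{Text: "Hello", X: 20, Y: 700},
	}
	fmt.Println(ReadingOrder(frags, 2))
}
```

    The sketch sorts by position rather than by draw order, which is why the extractor needs to keep positions around at all.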
  13. Key Insight
    PDF text extractors need to track the position of text on a page.
    We can use that positional information to find the location of text search matches on a PDF page.
  14. Writing the program - basic idea
    Extract text and its position on the page from PDF.
        type TextMark struct {
            Text string
            BBox model.PdfRectangle
        }
    Typically one TextMark per few characters.
  15. Writing the program - basic idea
    ► Sort slice of TextMarks by their offsets in the extracted text.
    ► Search extracted text and return offsets of start and end of match.
    ► Binary search slice of TextMarks to obtain the sub-slice of TextMarks corresponding to the search match.
    ► Find region of PDF page covered by sub-slice of TextMarks.
    ► Mark up PDF page.
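
    The steps above can be sketched in plain Go without any PDF library. Everything here is an assumption made for illustration: Mark and Rect are hypothetical stand-ins for the TextMark and rectangle types, the boxes are invented numbers, and strings.Index stands in for the full text search. The range rule (start <= Offset < end) only selects marks that begin inside the match.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Rect is a hypothetical axis-aligned rectangle in PDF user space.
type Rect struct{ Llx, Lly, Urx, Ury float64 }

// Mark is a hypothetical stand-in for a TextMark: a few characters,
// their offset in the extracted text, and where they sit on the page.
type Mark struct {
	Text   string
	Offset int
	BBox   Rect
}

// sortByOffset orders marks by their offset in the extracted text.
func sortByOffset(marks []Mark) {
	sort.Slice(marks, func(i, j int) bool { return marks[i].Offset < marks[j].Offset })
}

// marksInRange binary searches marks (sorted by Offset) and returns
// the sub-slice with start <= Offset < end.
func marksInRange(marks []Mark, start, end int) []Mark {
	lo := sort.Search(len(marks), func(i int) bool { return marks[i].Offset >= start })
	hi := sort.Search(len(marks), func(i int) bool { return marks[i].Offset >= end })
	if hi < lo {
		return nil
	}
	return marks[lo:hi]
}

// union returns the smallest rectangle enclosing every mark's box.
// The bool is false when there are no marks.
func union(marks []Mark) (Rect, bool) {
	if len(marks) == 0 {
		return Rect{}, false
	}
	r := marks[0].BBox
	for _, m := range marks[1:] {
		r.Llx = math.Min(r.Llx, m.BBox.Llx)
		r.Lly = math.Min(r.Lly, m.BBox.Lly)
		r.Urx = math.Max(r.Urx, m.BBox.Urx)
		r.Ury = math.Max(r.Ury, m.BBox.Ury)
	}
	return r, true
}

func main() {
	text := "Hello World!"
	marks := []Mark{ // deliberately out of order, as a page might draw them
		{"Wor", 6, Rect{100, 700, 120, 712}},
		{"Hel", 0, Rect{20, 700, 38, 712}},
		{"ld!", 9, Rect{120, 700, 135, 712}},
		{"lo ", 3, Rect{38, 700, 50, 712}},
	}
	sortByOffset(marks)

	term := "World"
	start := strings.Index(text, term) // stand-in for the full text search
	if start < 0 {
		fmt.Println("no match")
		return
	}
	hits := marksInRange(marks, start, start+len(term))
	if box, ok := union(hits); ok {
		fmt.Printf("%q is inside %+v\n", term, box)
	}
}
```

    The resulting rectangle is what the last step would draw on the page.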
  16. Writing the program - extract text from PDF
    Extract text and its positions from PDF.
    extractor.TextMarkArray is the text and positions of a PDF page.
        // ExtractPageTextMarks returns the extracted text and corresponding TextMarks on `page`.
        func ExtractPageTextMarks(page *model.PdfPage) (string, *extractor.TextMarkArray, error)
  17. Writing the program - bounding box of matched text
        // RangeOffset returns the TextMarks in `ma` that have `start` <= TextMark.Offset < `end`.
        func (ma *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)

        // BBox returns the smallest axis-aligned rectangle that encloses all the TextMarks in `ma`.
        func (ma *TextMarkArray) BBox() (model.PdfRectangle, bool)
  18. Writing the program - full text search
    Index page text like this. Here index is a bleve index.
        id := fmt.Sprintf("%04X.%d", dp.DocIdx, dp.PageIdx)
        idText := IDText{ID: id, Text: dp.Text}
        err = index.Index(id, idText)
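
    The ID format on this slide packs a document index and a page index into one string. A search hit presumably has to be mapped back to those two numbers, so the following is a small sketch of a round trip. The pageID and parsePageID helpers and their integer types are hypothetical, since the slide does not show the types of dp.DocIdx and dp.PageIdx or how the talk's program decodes IDs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageID builds an ID in the "%04X.%d" layout shown on the slide:
// document index as at least four upper-case hex digits, a dot, then
// the page index in decimal. Helper name and types are assumptions.
func pageID(docIdx uint64, pageIdx int) string {
	return fmt.Sprintf("%04X.%d", docIdx, pageIdx)
}

// parsePageID reverses pageID.
func parsePageID(id string) (docIdx uint64, pageIdx int, err error) {
	docPart, pagePart, found := strings.Cut(id, ".")
	if !found {
		return 0, 0, fmt.Errorf("page id %q has no '.' separator", id)
	}
	docIdx, err = strconv.ParseUint(docPart, 16, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("bad document index in %q: %w", id, err)
	}
	pageIdx, err = strconv.Atoi(pagePart)
	if err != nil {
		return 0, 0, fmt.Errorf("bad page index in %q: %w", id, err)
	}
	return docIdx, pageIdx, nil
}

func main() {
	id := pageID(26, 3)
	doc, page, err := parsePageID(id)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(id, doc, page)
}
```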
  19. Writing the program - full text search
    Search the bleve index like this.
        query := bleve.NewMatchQuery(term)
        search := bleve.NewSearchRequest(query)
        ..
        searchResults, err := index.Search(search)
  20. Writing the program - Bleve match to PDF location
    bleve matches are in a struct DocumentMatch.
    Inside DocumentMatch is a
        type Location struct {
            // Start and End are the byte offsets of the term in the field
            Start uint64 `json:"start"`
            End   uint64 `json:"end"`
        }
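
    Because Start and End are byte offsets, they can slice a Go string directly, but they are not character counts once the text contains multi-byte UTF-8. The sketch below illustrates that with a local Location type that copies only the two fields shown on the slide. The matchedText helper is hypothetical, and strings.Index stands in for a search result.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Location is a local copy of the two fields shown on the slide.
type Location struct {
	Start uint64
	End   uint64
}

// matchedText returns the part of text covered by loc, treating Start and
// End as byte offsets. It checks bounds so a stale offset cannot panic.
func matchedText(text string, loc Location) (string, error) {
	if loc.Start > loc.End || loc.End > uint64(len(text)) {
		return "", fmt.Errorf("location [%d,%d) is outside text of %d bytes",
			loc.Start, loc.End, len(text))
	}
	return text[loc.Start:loc.End], nil
}

func main() {
	text := "café menu" // "é" takes two bytes in UTF-8
	term := "menu"
	start := strings.Index(text, term) // byte offset, like Location.Start
	loc := Location{Start: uint64(start), End: uint64(start + len(term))}

	got, err := matchedText(text, loc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("matched:", got)
	fmt.Println("byte offset:", loc.Start)
	fmt.Println("character offset:", utf8.RuneCountInString(text[:loc.Start]))
}
```

    Whatever offsets the text marks carry need to be in the same unit as the search offsets for the lookup to line up.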
  21. Writing the program - Completion
    ► Map the Start and End offsets in the extracted text to a set of character positions
    ► Find the bounding box of those character positions
    ► Mark up the PDF
  22. © PaperCut 2019
    T +61 (3) 8376 8610
    F +61 (3) 8621 8983
    www.papercut.com
    Level 1, 3 Prospect Hill Road, Camberwell, Victoria, 3124 Australia
    Thank you!