Extracting Searchable Text from Arabic PDFs

Extracting Searchable Text from Arabic PDFs
Brian Carrier, Ph.D.
Director of Digital Forensics
Basis Technology Corp.

Motivation
• Need to get the text out of files before they can be indexed and searched.
• Arabic PDF files can be challenging.
[Figure: sample Arabic text, labeled "Database"]

PDF Basics
• Raw file contents are organized into objects.
• Each object stores a specific type of info:
  • Document (Root) object
  • Page objects
  • Font objects
• Basic structure of the file is viewable text:
  […]
  7 0 obj
  <</Metadata 4 0 R/Pages 3 0 R/Type/Catalog/PageLabels 1 0 R>>
  endobj
  […]
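Because the object structure is plain text, it can be located with a simple scan. The following is a minimal Python sketch, not a real PDF parser: the `list_objects` helper and the `SAMPLE` bytes are illustrative assumptions, and real files also need cross-reference tables and compressed object streams handled.

```python
import re

# Sample bytes modeled on the object shown on the slide.
SAMPLE = (
    b"7 0 obj\n"
    b"<</Metadata 4 0 R/Pages 3 0 R/Type/Catalog/PageLabels 1 0 R>>\n"
    b"endobj\n"
)

# "<number> <generation> obj ... endobj" delimits one indirect object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(data: bytes):
    """Return (object number, generation, body) for each object found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(data)
    ]


objects = list_objects(SAMPLE)
```

Here `objects` holds one entry for object 7, whose body is the Catalog dictionary that points at the Pages object.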

PDF Text
• Text is stored in chunks of one or more characters.
• Each chunk is located at a given X,Y coordinate.
• Chunks can be stored in any order in the file.

Typical Encoding and Rendering
• Files store text in an encoding:
  • ISO-8859-6 maps a 1-byte value to a Latin or Arabic character
  • Unicode maps values to characters in many languages
• The OS uses the encoding value and a specific font to find the correct glyph to display.
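As a small illustration of the encoding step, the Python sketch below decodes two ISO-8859-6 bytes into Unicode code points. The byte values are chosen for the example only; glyph selection by the OS is not shown.

```python
# Two single-byte ISO-8859-6 values: a Latin letter and an Arabic letter.
raw = bytes([0x41, 0xC7])

# Decoding maps each byte to a Unicode character.
text = raw.decode("iso-8859-6")

# 0x41 is Latin "A"; 0xC7 is Arabic Alef.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The same Alef stored as UTF-8 takes two bytes instead of one.
alef_utf8 = text[1].encode("utf-8")
```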

PDF Fonts and Encodings
• PDF fonts typically store only the glyphs that are used.
• Each text chunk stores an index into a PDF font object.
• The font object may map a glyph to a Unicode value.

Rendering Difference
• Displaying a PDF requires the PDF engine to map fonts.
• Note that standard encoding values are not required.

Basic Extraction Approach
1. Parse PDF file to identify page content objects
2. Parse page content stream into text chunks
3. Sort text chunks based on coordinates
4. Process chunks in order:
   1. Get index for each character
   2. Use font information to map index to Unicode (if defined)
   3. Add Unicode value to end of string
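Steps 3 and 4 above can be sketched in Python as follows. This is a toy model under stated assumptions: the `TextChunk` class, the `TO_UNICODE` table, and the glyph indexes are hypothetical stand-ins for what a real parser would read from the page content stream and font objects.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    x: float
    y: float
    glyph_indexes: list  # indexes into the font's glyph table
    font: str


# Hypothetical per-font map from glyph index to Unicode character.
TO_UNICODE = {"F1": {1: "H", 2: "e", 3: "l", 4: "o", 5: "W", 6: "r", 7: "d"}}


def extract(chunks):
    # Default PDF coordinates start at the bottom-left corner, so a larger
    # y is higher on the page: sort top to bottom, then left to right.
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))
    out = []
    for chunk in ordered:
        table = TO_UNICODE.get(chunk.font, {})
        for index in chunk.glyph_indexes:
            ch = table.get(index)
            if ch is not None:  # only when the font defines a mapping
                out.append(ch)
    return "".join(out)


# Chunks appear in the file in a different order than on the page.
chunks = [
    TextChunk(x=150, y=700, glyph_indexes=[5, 4, 6, 3, 7], font="F1"),
    TextChunk(x=100, y=700, glyph_indexes=[1, 2, 3, 3, 4], font="F1"),
]
result = extract(chunks)
```

The result is "HelloWorld" with no space between the words, because the chunks carry no explicit space character.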

English Extraction Example

Arabic Glyphs
• Arabic characters have different shapes depending on their location in a word.
• Each shape is a different glyph in a font.
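Unicode exposes these positional shapes as separate presentation-form code points. The Python sketch below lists the four forms of the letter Beh using the standard `unicodedata` module; the `describe` helper is an illustrative name, not part of any library.

```python
import unicodedata

# The four presentation forms of Arabic Beh (U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(ch):
    """Return (position tag, base letter) for a presentation form."""
    # decomposition() returns a string such as "<initial> 0628".
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))


forms = [describe(ch) for ch in BEH_FORMS]
```

Each entry pairs a position (isolated, final, initial, medial) with the same base letter, U+0628.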

Arabic Extraction Example

Logical and Presentation Orders
• Text in computers is typically stored in logical order:
  • First character stored is first character read or written
• Presentation order is based on screen layout.
• The orders are the same for left-to-right (LTR) languages.
• They are opposite for right-to-left (RTL) languages.

Possible Order Solution
• PDF stores data in presentation (display) order.
• Text editors need the text in logical order though.
• Need to convert from presentation to logical order.
• Obvious solution:
  • After decoding each line, reverse the order of the Arabic text.
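The obvious solution is a one-line string reversal. The Python sketch below shows it working on a pure Arabic word and going wrong as soon as digits appear on the line; `reverse_line` is an illustrative name.

```python
def reverse_line(presentation: str) -> str:
    """Naively convert presentation order to logical order."""
    return presentation[::-1]


# The word kaf-teh-alef-beh as read left to right off the page:
# beh, alef, teh, kaf.
presentation = "\u0628\u0627\u062A\u0643"
logical = reverse_line(presentation)

# A line that also contains a number: the digits are reversed too.
mixed = reverse_line("123 \u0628")
```

`logical` comes out correctly as kaf, teh, alef, beh, but `mixed` turns "123" into "321".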

Bi-directional Text
• How should the following be logically stored?

Bi-directional Text
• Text can have both RTL and LTR characters, and each should go in the correct direction.
• The Unicode Bi-directional Text (BiDi) algorithm defines how to order characters in a paragraph based on:
  • Dominant direction of text in paragraph
  • Direction of each character in text
  • Punctuation and neighboring characters
  • Implicit direction markers
• BiDi lets you convert from logical to presentation order.
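The per-character directions that BiDi relies on are available in Python through `unicodedata.bidirectional`. The sketch below prints them and applies a first-strong-character rule for the paragraph direction, which is one way the Unicode algorithm picks a direction when none is given. The helper names are illustrative, and this is not a BiDi implementation.

```python
import unicodedata

# Strong classes and the direction they imply.
STRONG = {"L": "LTR", "R": "RTL", "AL": "RTL"}


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def first_strong_direction(text, default="LTR"):
    """Direction of the first strongly directional character."""
    for ch in text:
        direction = STRONG.get(unicodedata.bidirectional(ch))
        if direction:
            return direction
    return default


# Latin letter, Arabic Alef, space, European digit, Arabic-Indic digit.
classes = bidi_classes("a\u0627 1\u0661")
```

Digits and spaces are not strong characters, so their final direction depends on their neighbors.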

Reverse Bi-directional Algorithm
• We need Reverse BiDi to convert from presentation to logical order.

Updated Extraction Approach
1. Parse PDF file to identify page content objects
2. Parse page content stream into text chunks
3. Sort text chunks based on coordinates
4. Determine dominant text direction
5. Process chunks in order and by line:
   1. Get index for each character
   2. Use font information to map index to Unicode
   3. Add Unicode value to end of “presentation order” string
   4. Apply reverse BiDi algorithm to “presentation order” string
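Steps 4 and 5.4 can be approximated with a much simpler procedure than a full reverse BiDi. The Python sketch below is an assumption-laden simplification, not the Unicode algorithm and not the method described in this talk: it picks the dominant direction by counting letters, reverses the whole line, then restores runs of Latin text and digits. Some mixed lines are genuinely ambiguous and will not round-trip.

```python
import unicodedata

RTL = {"R", "AL"}
LTR = {"L", "EN", "AN"}  # letters and digits that read left to right


def dominant_direction(text):
    """Guess line direction by counting strong RTL and LTR letters."""
    rtl = sum(unicodedata.bidirectional(c) in RTL for c in text)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in text)
    return "RTL" if rtl > ltr else "LTR"


def to_logical(presentation):
    """Simplified presentation-to-logical conversion for one line."""
    if dominant_direction(presentation) == "LTR":
        return presentation  # embedded RTL in LTR lines is not handled
    chars = list(presentation[::-1])
    classes = [unicodedata.bidirectional(c) for c in chars]
    i = 0
    while i < len(chars):
        if classes[i] in LTR:
            # Extend the run to the last LTR character before the next
            # RTL character, so spaces inside the run stay inside it.
            j = k = i
            while k < len(chars) and classes[k] not in RTL:
                if classes[k] in LTR:
                    j = k
                k += 1
            chars[i:j + 1] = chars[i:j + 1][::-1]
            i = j + 1
        else:
            i += 1
    return "".join(chars)


# "2009" followed by the reversed word kaf-teh-alef-beh.
logical = to_logical("2009 \u0628\u0627\u062A\u0643")
```

Here the Arabic word is put back into logical order while the digits keep their left-to-right order.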

Presentation Forms / Ligatures
• Encodings typically define only the general form of Arabic characters.
  • Unicode is an exception.
• The OS determines which glyph form to use (initial, medial, etc.) based on the context of the character.
• PDF stores the specific form of each Arabic character.
• Unicode presentation forms should not be used in a string, and many tools cannot process them.
• Need to normalize text from presentation to general forms.
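One way to perform this normalization is Unicode compatibility normalization, shown in the Python sketch below with the standard `unicodedata` module. Using NFKC here is an assumption for illustration: it also rewrites other compatibility characters, so a production tool might restrict it to the Arabic presentation-form blocks.

```python
import unicodedata


def normalize_forms(text):
    """Map presentation forms and ligatures back to general forms."""
    return unicodedata.normalize("NFKC", text)


# The word kaf-teh-alef-beh written with presentation forms.
presentation = "".join(
    unicodedata.lookup(name)
    for name in (
        "ARABIC LETTER KAF INITIAL FORM",
        "ARABIC LETTER TEH MEDIAL FORM",
        "ARABIC LETTER ALEF FINAL FORM",
        "ARABIC LETTER BEH ISOLATED FORM",
    )
)
general = normalize_forms(presentation)

# A Lam-Alef ligature expands into its two letters.
lam_alef = normalize_forms("\uFEFB")
```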

Arabic Extraction Example 2

Font-specific Ligature Implementations
• U+FDF2 is the Unicode Arabic ligature for Allah (ﷲ).
• The single ligature represents four characters:
  • “Alef, Lam, Lam, Heh”
• Some fonts implement the ligature differently:
  • “Lam, Lam, Heh”
  • They add a separate “Alef” before the ligature:
    • Alef (U+0627) Allah (U+FDF2)
• When decomposing using Unicode specs:
  • “Alef Alef Lam Lam Heh”
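The doubled Alef is easy to reproduce. The Python sketch below shows the problem and one possible workaround; the `decompose_allah` helper and its rule (treat an Alef directly before the ligature as the font's duplicate) are assumptions for illustration, not a documented fix.

```python
import unicodedata

ALEF = "\u0627"
ALLAH = "\uFDF2"

# Decomposing the font's "Alef + ligature" sequence doubles the Alef.
doubled = unicodedata.normalize("NFKC", ALEF + ALLAH)


def decompose_allah(text):
    """Decompose the ligature without doubling a preceding Alef."""
    # Assumption: an Alef directly before the ligature is the font's
    # separate Alef, so it is dropped before decomposition.
    text = text.replace(ALEF + ALLAH, ALLAH)
    return unicodedata.normalize("NFKC", text)


fixed = decompose_allah(ALEF + ALLAH)
```

`doubled` contains five characters (Alef Alef Lam Lam Heh), while `fixed` contains the expected four.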

Diacritic Placement
• Vocalizations and diacritics can be separate glyphs.
• With Unicode:
  • Diacritics are stored after the base character in logical order
  • Diacritics are placed over the base character when rendered on screen
• With PDF:
  • Diacritics are stored in separate text chunks
  • Coordinates cause them to overlap
  • Diacritic chunk can be before or after the chunk it modifies
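Merging can be sketched as a coordinate overlap test. In the Python example below, the `Chunk` class, the single-character base chunks, and the rule "attach a mark to the base whose horizontal span contains the mark's center" are all simplifying assumptions.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Chunk:
    x: float
    width: float
    text: str


def is_diacritic(chunk):
    """True when every character in the chunk is a combining mark."""
    return all(unicodedata.combining(ch) for ch in chunk.text)


def merge_diacritics(chunks):
    """Return base chunk texts in x order with diacritics appended."""
    bases = sorted((c for c in chunks if not is_diacritic(c)),
                   key=lambda c: c.x)
    merged = {id(b): b.text for b in bases}
    for mark in (c for c in chunks if is_diacritic(c)):
        center = mark.x + mark.width / 2
        for base in bases:
            if base.x <= center <= base.x + base.width:
                # Unicode stores the mark after its base character.
                merged[id(base)] += mark.text
                break
    return [merged[id(b)] for b in bases]


# The Fatha chunk appears in the file before the Beh it modifies.
line = [
    Chunk(x=10.5, width=2, text="\u064E"),  # Fatha (diacritic)
    Chunk(x=10, width=5, text="\u0628"),    # Beh
    Chunk(x=15, width=5, text="\u0643"),    # Kaf
]
merged_line = merge_diacritics(line)
```

The result is still in presentation order; the direction fix is a separate step.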

Diacritic Insertion

Spacing Estimation
• Spaces and newlines are not explicitly stored.
• Spacing is achieved by direct placement of text.
• Extraction requires guessing where spaces and newlines should exist:
  • Is this text chunk’s X-value further away than we expected?
  • Is this text chunk’s Y-value further away than we expected?
• Spacing estimation can be done by keeping track of the average character width thus far.
• Newline estimation can be done by keeping track of character heights.
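The two questions above translate into two threshold checks. The Python sketch below is illustrative only: the `Glyph` class and the `space_factor` and `line_factor` thresholds are assumed values, not tuned ones, and the glyphs are assumed to be sorted in reading order already.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    x: float
    y: float
    width: float
    height: float
    char: str


def join_with_spacing(glyphs, space_factor=0.5, line_factor=0.5):
    """Insert guessed spaces and newlines between positioned glyphs."""
    out = []
    total_width = 0.0
    count = 0
    prev = None
    for g in glyphs:
        if prev is not None:
            avg_width = total_width / count  # average width so far
            if abs(g.y - prev.y) > line_factor * prev.height:
                out.append("\n")  # Y moved further than expected
            elif g.x - (prev.x + prev.width) > space_factor * avg_width:
                out.append(" ")   # X gap is wider than expected
        out.append(g.char)
        total_width += g.width
        count += 1
        prev = g
    return "".join(out)


glyphs = [
    Glyph(0, 700, 5, 10, "a"),
    Glyph(5, 700, 5, 10, "b"),
    Glyph(14, 700, 5, 10, "c"),  # gap of 4 units after "b"
    Glyph(0, 688, 5, 10, "d"),   # lower on the page
]
text = join_with_spacing(glyphs)
```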

PDFBox
• PDFBox is an open source Apache Incubator project.
• It worked well for many documents in LTR languages.
• We enhanced it to:
  • Correct direction of RTL text
  • Normalize ligatures and presentation forms
  • Merge diacritics into text
  • Better estimate where to add spaces
  • Fix parsing issues
  • Deal with corrupt / non-compliant files
• Can be freely downloaded (in next release): http://incubator.apache.org/pdfbox/

Thank You!

