
Extracting searchable text from Arabic PDFs

roohullah
January 03, 2012

Transcript

  1. Motivation
     • Need to get the text out of files before they can be indexed and searched.
     • Arabic PDF files can be challenging.
     [Figure: sample Arabic text; label "Database"]
  2. PDF Basics
     • Raw file contents are organized into objects.
     • Each object stores a specific type of info:
       • Document (Root) object
       • Page objects
       • Font objects
     • Basic structure of file is viewable text:
       […] 7 0 obj <</Metadata 4 0 R/Pages 3 0 R/Type/Catalog/PageLabels 1 0 R>> endobj […]
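As a rough illustration of the "viewable text" point, the sketch below pulls object numbers and their indirect references out of a string shaped like the sample above. The function name `list_objects` and both regular expressions are assumptions made for this example; real PDF files also contain binary streams, compressed object streams and cross-reference tables that this does not handle.

```python
import re
from typing import Dict

# "<number> <generation> obj ... endobj" delimits one object.
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj(.*?)endobj", re.S)
# "/Key <number> <generation> R" is an indirect reference to another object.
REF_RE = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R")


def list_objects(raw: str) -> Dict[int, Dict[str, int]]:
    """Map each object number to the objects it references, keyed by name."""
    result: Dict[int, Dict[str, int]] = {}
    for num, _gen, body in OBJ_RE.findall(raw):
        refs = {name: int(target) for name, target, _g in REF_RE.findall(body)}
        result[int(num)] = refs
    return result


SAMPLE = "7 0 obj <</Metadata 4 0 R/Pages 3 0 R/Type/Catalog/PageLabels 1 0 R>> endobj"
print(list_objects(SAMPLE))
```

Here object 7 is the catalog (root) object, and its `/Pages` entry points at the object that leads to the page objects.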
  3. PDF Text
     • Text is stored in chunks of one or more characters.
     • Each chunk is located at a given X,Y coordinate.
     • Chunks can be stored in any order in the file.
  4. Typical Encoding and Rendering
     • Files store text in an encoding:
       • ISO-8859-6 maps a 1-byte value to a Latin or Arabic character
       • Unicode maps values to characters in many languages
     • The OS uses the encoding value and a specific font to find the correct glyph to display.
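A small sketch of the encoding step, using Python's built-in `iso8859_6` codec. The helper name `decode_iso_8859_6` is an assumption for this example; the byte value 0xC7 is the ISO-8859-6 code for Alef.

```python
import unicodedata


def decode_iso_8859_6(raw: bytes) -> str:
    """Turn ISO-8859-6 bytes into a Unicode string."""
    return raw.decode("iso8859_6")


alef = decode_iso_8859_6(bytes([0xC7]))
# The same character now has a Unicode code point and a standard name.
print(f"U+{ord(alef):04X}", unicodedata.name(alef))
```

Either way the stored value identifies a character, not a glyph; picking the glyph is left to the OS and the font.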
  5. PDF Fonts and Encodings
     • PDF fonts typically store only the glyphs that are used.
     • A text chunk stores an index into a PDF font object.
     • The font object may map a glyph to a Unicode value.
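The lookup described above can be sketched with a plain dictionary standing in for the font's glyph-to-Unicode map. `SUBSET_FONT`, its index values and `glyphs_to_text` are all hypothetical; a real font object stores this mapping in its own format, and may not store it at all.

```python
from typing import Dict, List

REPLACEMENT = "\ufffd"  # emitted when the font defines no Unicode value

# Hypothetical subset font: only the three glyphs the document uses.
SUBSET_FONT: Dict[int, str] = {
    1: "\u0633",  # Seen
    2: "\u0644",  # Lam
    3: "\u0627",  # Alef
}


def glyphs_to_text(glyph_indexes: List[int], to_unicode: Dict[int, str]) -> str:
    """Map each glyph index in a text chunk to its Unicode value, if defined."""
    return "".join(to_unicode.get(i, REPLACEMENT) for i in glyph_indexes)


print(glyphs_to_text([1, 2, 3], SUBSET_FONT))
```

Note that the indexes 1, 2, 3 mean nothing outside this font, which is why the text cannot be read straight from the content stream.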
  6. Rendering Difference
     • Displaying a PDF requires the PDF engine to map fonts.
     • Note that standard encoding values are not required.
  7. Basic Extraction Approach
     1. Parse PDF file to identify page content objects
     2. Parse page content stream into text chunks
     3. Sort text chunks based on coordinates
     4. Process chunks in order:
        1. Get index for each character
        2. Use font information to map index to Unicode (if defined)
        3. Add Unicode value to end of string
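Steps 3 and 4 above can be sketched as follows, assuming parsing has already produced a list of chunks. The `Chunk` type, `extract_text` and the rule of grouping lines by rounded Y value are assumptions for illustration, not the method of any particular library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    x: float
    y: float
    glyphs: List[int]  # indexes into the font


def extract_text(chunks: List[Chunk], to_unicode: Dict[int, str]) -> str:
    """Sort chunks top to bottom, left to right, then decode them in order."""
    # PDF Y coordinates grow upward, so a larger Y is an earlier line.
    ordered = sorted(chunks, key=lambda c: (-round(c.y), c.x))
    lines: List[str] = []
    last_row = None
    for chunk in ordered:
        text = "".join(to_unicode.get(g, "\ufffd") for g in chunk.glyphs)
        row = round(chunk.y)
        if row != last_row:
            lines.append(text)   # first chunk on a new line
        else:
            lines[-1] += text    # same line: append to end of string
        last_row = row
    return "\n".join(lines)


font = {1: "a", 2: "b", 3: "c"}
page = [Chunk(50, 700, [2]), Chunk(10, 700, [1]), Chunk(10, 680, [3])]
print(extract_text(page, font))
```

The chunks are deliberately given out of order in `page`, since the file may store them in any order.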
  8. Arabic Glyphs
     • Arabic characters have different shapes depending on their location in a word.
     • Each shape is a different glyph in a font.
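To make the shapes concrete, the sketch below lists the four positional forms of the letter Beh using Python's `unicodedata` module. The names `BEH_FORMS` and `describe_forms` are assumptions for this example; the code points are the Unicode presentation forms for Beh.

```python
import unicodedata
from typing import List, Tuple

# Isolated, final, initial and medial forms of Beh (general form: U+0628).
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def describe_forms(chars: List[str]) -> List[Tuple[str, str]]:
    """Return (code point, Unicode name) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


for code_point, name in describe_forms(BEH_FORMS):
    print(code_point, name)
```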
  9. Logical and Presentation Orders
     • Text in computers is typically stored in logical order
       • First character stored is first character read or written
     • Presentation order is based on screen layout
     • Orders are the same for Left to Right (LTR) languages.
     • Orders are opposite for Right to Left (RTL) languages.
  10. Possible Order Solution
     • PDF stores data in presentation (display) order.
     • Text editors need the text in logical order though.
     • Need to convert from presentation to logical order.
     • Obvious solution:
       • After decoding each line, reverse the order of the Arabic text.
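A minimal sketch of the obvious solution, with `naive_to_logical` as a hypothetical name. The second call shows where it goes wrong: digits are displayed left to right even inside Arabic text, so reversing the whole line scrambles the number.

```python
def naive_to_logical(presentation_line: str) -> str:
    """Reverse a decoded line, treating every character as right-to-left."""
    return presentation_line[::-1]


# Pure Arabic: the word Ain-Alef-Meem as it sits on screen, left to right.
print(naive_to_logical("\u0645\u0627\u0639"))

# Arabic word followed (in logical order) by the number 2012:
# on screen the number appears to the left, with its digits still left to right.
print(naive_to_logical("2012 \u0645\u0627\u0639"))  # the digits come out as 2102
```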
  11. Bi-directional Text
     • Text can have both RTL and LTR characters and each should go in the correct direction
     • Unicode Bi-directional Text (BiDi) algorithm defines how to order characters in a paragraph based on:
       • Dominant direction of text in paragraph
       • Direction of each character in text
       • Punctuation and neighboring characters
       • Implicit direction markers
     • BiDi lets you convert from logical to presentation order.
  12. Reverse Bi-directional Algorithm
     • We need Reverse BiDi to convert from presentation to logical order.
  13. Updated Extraction Approach
     1. Parse PDF file to identify page content objects
     2. Parse page content stream into text chunks
     3. Sort text chunks based on coordinates
     4. Determine dominant text direction
     5. Process chunks in order and by line:
        1. Get index for each character
        2. Use font information to map index to Unicode
        3. Add Unicode value to end of “presentation order” string
        4. Apply reverse BiDi algorithm to “presentation order” string
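Steps 4 and 5.4 can be sketched in a heavily simplified form. This is not the Unicode BiDi algorithm run backwards; it only handles the simplest case, a right-to-left line containing single left-to-right runs such as numbers or lone Latin words. Spaces and punctuation inside an embedded run, and left-to-right lines with embedded Arabic, are not handled. All function names are assumptions for this example.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # Hebrew-style and Arabic letters
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters, European and Arabic-Indic digits


def dominant_direction(line: str) -> str:
    """Guess the line direction by counting strongly directional characters."""
    rtl = sum(unicodedata.bidirectional(c) in RTL_CLASSES for c in line)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in line)
    return "rtl" if rtl > ltr else "ltr"


def presentation_to_logical(line: str) -> str:
    """Simplified reverse BiDi for one decoded line."""
    if dominant_direction(line) == "ltr":
        return line  # simplification: embedded RTL runs are left untouched
    out, run = [], []
    for c in line[::-1]:  # reverse the whole line first
        if unicodedata.bidirectional(c) in LTR_CLASSES:
            run.append(c)  # collect a left-to-right run
        else:
            out.extend(reversed(run))  # put the run back in its own order
            run = []
            out.append(c)
    out.extend(reversed(run))
    return "".join(out)


print(presentation_to_logical("2012 \u0645\u0627\u0639"))
```

Compared with reversing the whole line, the number keeps its digit order while the Arabic word is still restored to logical order.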
  14. Presentation Forms / Ligatures
     • Encodings typically define only the general form of Arabic characters.
       • Unicode is an exception.
     • The OS determines which glyph form to use (initial, medial, etc.) based on the context of the character.
     • PDF stores the specific form of each Arabic character.
     • Unicode presentation forms should not be used in a string and many tools cannot process them.
     • Need to normalize text from presentation to general forms.
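One way to do this normalization is Unicode compatibility normalization (NFKC), which maps each presentation form to its general form. The sketch below restricts it to the two Arabic Presentation Forms blocks so that other characters are left alone; that restriction and the name `normalize_presentation_forms` are assumptions for this example.

```python
import unicodedata


def is_arabic_presentation_form(c: str) -> bool:
    """True for Arabic Presentation Forms-A (FB50-FDFF) and -B (FE70-FEFF)."""
    cp = ord(c)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def normalize_presentation_forms(text: str) -> str:
    """Replace positional forms and ligatures with general-form characters."""
    return "".join(
        unicodedata.normalize("NFKC", c) if is_arabic_presentation_form(c) else c
        for c in text
    )


# Beh initial + Alef final + Beh isolated -> Beh Alef Beh in general forms.
print(normalize_presentation_forms("\ufe91\ufe8e\ufe8f"))
```

A ligature glyph such as Lam-Alef (U+FEFB) comes out as two characters, Lam followed by Alef.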
  15. Font-specific Ligature Implementations
     • U+FDF2 is the Unicode Arabic ligature for Allah (ﷲ).
     • The single ligature represents four characters:
       • “Alef, Lam, Lam, Heh”
     • Some fonts implement the ligature differently:
       • “Lam, Lam, Heh”
       • They add a separate “Alef” before the ligature.
       • Alef (U+0627) Allah (U+FDF2)
     • When decomposing using Unicode specs:
       • “Alef Alef Lam Lam Heh”
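The doubled Alef can be reproduced with Python's `unicodedata`, and one possible workaround is to drop the ligature's own Alef when a separate Alef already precedes it. This heuristic and the name `expand_allah_ligature` are assumptions for illustration; the slides do not say how the problem was actually handled. The input is assumed to be in logical order.

```python
import unicodedata

ALEF = "\u0627"
ALLAH_LIGATURE = "\ufdf2"
# Unicode decomposition of the ligature: Alef, Lam, Lam, Heh.
EXPANSION = unicodedata.normalize("NFKC", ALLAH_LIGATURE)


def expand_allah_ligature(text: str) -> str:
    """Expand U+FDF2, skipping its Alef if the font drew one separately."""
    out = []
    for c in text:
        if c != ALLAH_LIGATURE:
            out.append(c)
        elif out and out[-1] == ALEF:
            out.append(EXPANSION[1:])  # Alef already present: add Lam Lam Heh
        else:
            out.append(EXPANSION)
    return "".join(out)


plain = unicodedata.normalize("NFKC", ALEF + ALLAH_LIGATURE)
print(len(plain))                                         # 5 characters: doubled Alef
print(len(expand_allah_ligature(ALEF + ALLAH_LIGATURE)))  # 4 characters
```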
  16. Diacritic Placement
     • Vocalizations and diacritics can be separate glyphs
     • With Unicode:
       • Diacritics are stored after the base character in logical order
       • Diacritics are placed over the base character when rendered on screen
     • With PDF:
       • Diacritics are stored in separate text chunks
       • Coordinates cause them to overlap
       • A diacritic chunk can be before or after the chunk it modifies
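A sketch of merging diacritics back into the text, assuming each glyph arrives with an X position and a width. Each diacritic is attached to the base character whose horizontal center is nearest, then emitted right after it. The `Glyph` type, `merge_diacritics`, the nearest-center rule and the shortcut of reading a purely right-to-left line from right to left are all assumptions for this example.

```python
import unicodedata
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float
    width: float


def center(g: Glyph) -> float:
    return g.x + g.width / 2


def merge_diacritics(glyphs: List[Glyph], right_to_left: bool = True) -> str:
    """Emit base characters in reading order, each followed by its diacritics."""
    is_mark = lambda g: unicodedata.category(g.char) == "Mn"
    bases = sorted((g for g in glyphs if not is_mark(g)),
                   key=lambda g: g.x, reverse=right_to_left)
    attached: List[List[str]] = [[] for _ in bases]
    for mark in (g for g in glyphs if is_mark(g)):
        if not bases:
            continue  # nothing to attach to; the mark is dropped in this sketch
        nearest = min(range(len(bases)),
                      key=lambda i: abs(center(bases[i]) - center(mark)))
        attached[nearest].append(mark.char)
    return "".join(b.char + "".join(m) for b, m in zip(bases, attached))


# The Fatha is stored before the Beh it sits on, as a PDF may do.
line = [Glyph("\u064e", 21, 3), Glyph("\u0628", 20, 6), Glyph("\u0627", 10, 5)]
print(merge_diacritics(line))
```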
  17. Spacing Estimation
     • Spaces and newlines are not explicitly stored.
     • Spacing is achieved by direct placement of text.
     • Extraction requires guessing where spaces and newlines should exist.
       • Is this text chunk’s X-value further away than we expected?
       • Is this text chunk’s Y-value further away than we expected?
     • Spacing estimation can be done by keeping track of average character width thus far.
     • Newline estimation can be done by keeping track of character heights.
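The two questions above can be sketched as threshold tests against a running average character width and the previous chunk's height. The `PlacedChunk` type, `join_chunks` and both 0.5 factors are assumptions for illustration; the chunks are assumed to be sorted already and laid out left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlacedChunk:
    text: str
    x: float
    y: float
    width: float
    height: float


def join_chunks(chunks: List[PlacedChunk],
                space_factor: float = 0.5,
                line_factor: float = 0.5) -> str:
    """Insert guessed spaces and newlines between positioned chunks."""
    out: List[str] = []
    prev = None
    total_width, total_chars = 0.0, 0
    for c in chunks:
        if prev is not None:
            if abs(c.y - prev.y) > line_factor * prev.height:
                out.append("\n")  # Y moved further than expected
            else:
                avg = total_width / total_chars if total_chars else 0.0
                gap = c.x - (prev.x + prev.width)
                if gap > space_factor * avg:
                    out.append(" ")  # X gap wider than expected
        out.append(c.text)
        total_width += c.width       # running average character width so far
        total_chars += len(c.text)
        prev = c
    return "".join(out)


sample = [
    PlacedChunk("Hel", 0, 100, 15, 10),
    PlacedChunk("lo", 15, 100, 10, 10),
    PlacedChunk("world", 30, 100, 25, 10),
    PlacedChunk("next", 0, 88, 20, 10),
]
print(join_chunks(sample))
```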
  18. PDFBox
     • PDFBox is an open source Apache Incubator project
     • It worked well for many documents in LTR languages
     • We enhanced it to:
       • Correct direction of RTL text
       • Normalize ligatures and presentation forms
       • Merge diacritics into text
       • Better estimate where to add spaces
       • Fix parsing issues
       • Deal with corrupt / non-compliant files
     • Can be freely downloaded (in next release): http://incubator.apache.org/pdfbox/