
Scraping and Parsing PDFs in Python

Ian Hopkinson, Sr. Data Scientist @Scraperwiki, talk at Data Science London @ds_ldn

Data Science London

July 13, 2013

Transcript

  1. # The steps of a scrape

    """
    * Get root URL
    * Iterate over the resources to scrape
    * Download source files (HTML, CSV, PDF, xls...)
    * Parse files and write out
    """

    Scraping and Parsing PDFs in Python With Ian Hopkinson from ScraperWiki
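
The four steps on slide 1 can be sketched with the standard library alone. This is a minimal illustration, not the speaker's code: `scrape`, `LinkCollector`, `parse_csv` and the `PAGES` dictionary are hypothetical names, and the downloader is passed in as a function so the example runs without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def scrape(root_url, fetch, parse, out):
    """Run the four steps; fetch(url) -> str and parse(text) -> rows are supplied by the caller."""
    # 1. Get the root URL.
    collector = LinkCollector()
    collector.feed(fetch(root_url))
    writer = csv.writer(out)
    # 2. Iterate over the resources to scrape.
    for href in collector.links:
        url = urljoin(root_url, href)
        # 3. Download the source file.
        source = fetch(url)
        # 4. Parse the file and write the rows out.
        for row in parse(source):
            writer.writerow([url, *row])


def parse_csv(text):
    """Hypothetical parser for this example: treat each source file as CSV."""
    return csv.reader(io.StringIO(text))


# Hypothetical in-memory "site" standing in for real downloads.
PAGES = {
    "http://example.org/": '<a href="a.csv">A</a> <a href="b.csv">B</a>',
    "http://example.org/a.csv": "1,2\n3,4",
    "http://example.org/b.csv": "5,6",
}

out = io.StringIO()
scrape("http://example.org/", PAGES.__getitem__, parse_csv, out)
print(out.getvalue())
```

Against a live site, `fetch` could be something like a function that returns `requests.get(url).text`, and `parse` would be replaced by whatever suits the file type (HTML, CSV, PDF, xls).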
  2. # Handy tools

    """
    * Browser Developer Tools
      XPath helper
      JSONView
    * requests library
    * lxml library (includes XPath)
    * scraperwiki library (databases, pdftoxml)
    * Scrapy framework
    * Selenium remote control browser
    """
  3. # HTML wrangling

    >>> import requests
    >>> import lxml.html
    >>> r = requests.get(url)  # assumed: r is a requests response for the page URL
    >>> doc = lxml.html.fromstring(r.content)
    >>> s = doc.xpath("//td[@class='thing']")
    >>> links = doc.xpath("//a")
    >>> P = doc.xpath("//*[contains(@href,'cgi')]")
    ...
    >>> element_text = s[0].text  # xpath() returns a list, so index into it
    >>> elements_text = s[0].text_content()
    >>> import html2text
    >>> elements_text = html2text.html2text(lxml.html.tostring(s[0], encoding="unicode"))
  4. # PDF wrangling
    # Royal Society Membership List

  5. # PDF wrangling
    # L'Académie des sciences
  6. # PDF wrangling
    # Hints for PDF parsers

    * Use every layout feature you can find!
    * Regular expressions
    * Anticipate irregularity
    * Don't be afraid of manual edits
    * pdfminer
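
As one illustration of "use every layout feature", the sketch below groups text fragments into table rows by their vertical position. It assumes XML shaped like the output of `pdftohtml -xml` (which `scraperwiki.pdftoxml` is understood to wrap), with `top` and `left` attributes on each `<text>` element. The sample document, its contents and the `rows_from_page` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed pdftohtml -xml shape; the values are made up.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="80" width="60" height="14" font="0"><b>Name</b></text>
<text top="100" left="300" width="70" height="14" font="0"><b>Elected</b></text>
<text top="121" left="300" width="40" height="14" font="1">1900</text>
<text top="120" left="80" width="110" height="14" font="1">A. N. Example</text>
</page>
</pdf2xml>"""


def rows_from_page(page, tolerance=2):
    """Group <text> elements into rows when their `top` values differ by at most `tolerance`."""
    items = sorted(
        (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
        for t in page.iter("text")
    )
    rows = []
    row_top = None
    for top, left, text in items:
        # Start a new row when this fragment sits clearly below the current row.
        if row_top is None or top - row_top > tolerance:
            rows.append([])
            row_top = top
        rows[-1].append((left, text))
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row)] for row in rows]


root = ET.fromstring(SAMPLE_XML)
rows = rows_from_page(root.find("page"))
print(rows)  # [['Name', 'Elected'], ['A. N. Example', '1900']]
```

The `tolerance` parameter is a nod to "anticipate irregularity": fragments on the same visual line often differ by a pixel or two in `top`. The `font` attribute can be used in the same way, for example to tell headings from body text.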
  7. # UN Democracy
    # with Julian Todd

    """
    ** Provide the verbatim records of the UN ** in easily navigable, accessible form
    * Available content:
    * Security Council, General Assembly
      http://www.undemocracy.com/
    * 21,000 documents - varying formats
    * Currently not updating
    """
  8. # UN Democracy
    # PDFs with consistent formatting

  9. # UN Democracy
    # PDF with regular formatting

    * President
    * Speaker ID
    * Stage directions
    * Votes
    * Links to other files

  10. # UN Democracy
    # XML from PDF
  11. # UN Democracy
    # regular expressions!

    import re

    msodecided = re.match(r"(?:There being no objection, )?[Ii]t (?:was|is) so decided(?: \(decision [\d/]*\s*(?:A|B|C|A and B)?\))?\.?$", ptext)
    # case-insensitive match, passed as a flag
    mwasadopted = re.match(r".*?(?:resolution|decision|agenda|amendment|recommendation).*?(?:was|were) adopted", ptext, re.IGNORECASE)
    mcalledorder = re.match(r"The meeting (?:was called to order|rose|was suspended|was adjourned|resumed|was resumed) (?:at|on)", ptext)
    mtookchair = re.match(r"\s*(?:In the absence of the President, )?(.*?)(?:, \(?Vice[\-\s]President\)?,)? (?:took|in) the [Cc]hair\.?$", ptext)
    mretchair = re.match(r"(?:The President|.*?, Vice-President,|Mrs. Albright.*?|Baroness Amos) (?:returned to|in) the Chair.$", ptext)
    mescort = re.search(r"(?:was escorted|escorted the.*?) (?:(?:from|to) the (?:rostrum|podium|platform)|(?:from|into|to its place in) the (?:General Assembly Hall|Conference Room|Security Council Chamber))(?: by the President and the Secretary-General)?\.?$", ptext)
    msecball = re.search(r"A vote was taken by secret ballot\.(?: The meeting was suspended at|$)", ptext)
    mminsil = re.search(r"The (?:members of the (?:General )?Assembly|Council) observed (?:a|one) minute of (?:silent prayer (?:or|and) meditation|silence)\.$", ptext)
    mtellers = re.search(r"At the invitations? of the (?:Acting )?Presidents?.*?acted as tellers\.$|Having been drawn by lot", ptext)
    melected = re.search(r"[Hh]aving obtained (?:the required (?:two-thirds )?|an absolute )majority.*?(?:(?:were|was|been|is) s?elected|will be included [io]n the list)", ptext)
    mmisc = re.search(r"The Acting President drew the following.*?from the box|sang.*?for the General Assembly|The Assembly heard a musical performance|The Secretary- General presented the award to|From the .*? Group:|Having been drawn by lot by the (?:President|Secretary-General),|were elected members of the Organizational Committee|President \w+ and then Vice-President|Vice-President \S+ \S+ presided over|The following .*? States have.*?been elected members of the Security Council", ptext)
    mmiscnote = re.search(r"\[In the 79th plenary .*? III.\]$", ptext)
    mmstar = re.match(r"\*", ptext)  # insert * in the text
    mmspokein = re.match(r"\(spoke in \w+(?:; interpretation.*?|; .*? the delegation)?\)$", ptext)
    matinvite = re.match(r"(?:At the invitation of the President, )?.*? (?:(?:took (?:a |the )?|were escorted to their )seats? at the Council table|(?:took|was invited to take) (?:(?:the |a |their )?(?:seat|place)s? reserved for \w+|a seat|a place|places|seats|their seats|his seat) at the (?:side of the )?Council (?:[Cc]hamber|table))(?:;.*?Chamber)?\.$", ptext)
    mscsilence = re.match(r"The members of the (?:Security )?Council observed a minute of silence.$", ptext)
    mscescort = re.search(r"(?:were|was) escorted to (?:seats|a seat|his place|a place) at the (?:Security )?Council table.$", ptext)
    mvtape = re.match(r"A video ?(?:tape)? was (?:shown|played|displayed) in the (?:Council Chamber|General Assembly Hall).$|An audio tape, in Arabic,|The members of the General Assembly heard a musical performance.$", ptext)
    mvprojscreen = re.match(r"(?:An image was|Two images were|A video was) projected on screen\.$", ptext)
    mvresuadjourned = re.match(r"The meeting was resumed and adjourned on.*? a\.m\.$", ptext)
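
A long run of one-off `re.match` calls like the one on slide 11 can also be organised as a table of labelled patterns. The sketch below reuses three of the slide's expressions unchanged. The labels, the `STAGE_DIRECTIONS` table and the `classify` function are hypothetical, and this is not how the UN Democracy parser is necessarily structured.

```python
import re

# Three patterns copied from the slide; the labels are invented for this example.
STAGE_DIRECTIONS = [
    ("called_to_order", re.compile(
        r"The meeting (?:was called to order|rose|was suspended|was adjourned|resumed|was resumed) (?:at|on)")),
    ("spoke_in", re.compile(
        r"\(spoke in \w+(?:; interpretation.*?|; .*? the delegation)?\)$")),
    ("minute_of_silence", re.compile(
        r"The members of the (?:Security )?Council observed a minute of silence.$")),
]


def classify(ptext):
    """Return the label of the first pattern that matches at the start of ptext, else None."""
    for label, pattern in STAGE_DIRECTIONS:
        if pattern.match(ptext):
            return label
    return None


print(classify("The meeting was called to order at 10.15 a.m."))
print(classify("(spoke in French)"))
print(classify("I thank the representative for his statement."))
```

Keeping the patterns in a list makes it cheap to add a new case each time an irregular paragraph turns up, and the order of the list decides which pattern wins when several could match.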
  12. # Tools for probing legacy code

    >>> import pycallgraph
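
Slide 12 shows only the import. As a rough illustration of the caller-to-callee information a call graph presents, the sketch below uses the standard library's `sys.setprofile` rather than pycallgraph itself. `trace_calls`, `parse_file` and `parse_line` are hypothetical names standing in for a tracer and for legacy code.

```python
import sys
from collections import Counter


def trace_calls(func, *args, **kwargs):
    """Run func and count (caller, callee) pairs for calls between Python functions."""
    edges = Counter()

    def profiler(frame, event, arg):
        # "call" fires for Python-level calls; calls into C functions are reported
        # as "c_call" and are ignored here.
        if event == "call" and frame.f_back is not None:
            edges[(frame.f_back.f_code.co_name, frame.f_code.co_name)] += 1

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        # Restore whatever profiler was installed before.
        sys.setprofile(previous)
    return result, edges


# Hypothetical "legacy" functions to probe.
def parse_line(line):
    return line.strip().split(",")


def parse_file(lines):
    parsed = []
    for line in lines:
        parsed.append(parse_line(line))
    return parsed


result, edges = trace_calls(parse_file, ["a,b\n", "c,d\n"])
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

A dedicated tool adds timing and a rendered graph on top of this kind of bookkeeping, which is what makes it useful for finding a way into unfamiliar code.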
  13. # Tools for probing legacy code
    # Clonedigger
  14. • EU FP7 natural language processing project
    • Process streams of daily news in 4 languages:
        o what, where, when and who
        o extract events: temporal and causal relations
        o store -> a history recorder.
        o Organise and visualise as stories, scripts, plots to provide more efficient access
    • Netherlands (VU, LexisNexis, Synerscope), Spain (Basque University), UK (ScraperWiki) and Italy (Fondazione Bruno Kessler, Trento)
    http://www.newsreader-project.eu/
  15. # Natural language processing toolkit

    >>> import nltk
    >>> from nltk.book import *
    >>> text6
    <Text: Monty Python and the Holy Grail>
    >>> text6.concordance("Thank", width=41)
    you ! Thank you ! Thank you ! ... ZOOT :
    orde ! CONCORDE : Thank you , sir ! Most
    h . ARTHUR : Oh . Thank you . ROBIN : Ahh
    . Fine . ARTHUR : Thank you . ROBIN : Spl
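
To show what a concordance view does, here is a small keyword-in-context sketch in plain Python. It is not NLTK's implementation: the `concordance` function, its case-insensitive matching and its handling of `width` are assumptions made for the example.

```python
def concordance(tokens, word, width=41):
    """Return keyword-in-context lines of at most `width` characters each."""
    # Characters of context on each side, leaving room for the word and two spaces.
    half = max((width - len(word) - 2) // 2, 1)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[:i])[-half:]
            right = " ".join(tokens[i + 1:])[:half]
            # Right-align the left context so the keyword lines up in a column.
            lines.append(f"{left:>{half}} {tok} {right}")
    return lines


tokens = "ARTHUR : Oh . Thank you . ROBIN : Ahh".split()
lines = concordance(tokens, "thank")
for line in lines:
    print(line)
```

Aligning every hit on the keyword is what makes recurring phrases, such as "Thank you" above, easy to spot by eye.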