Scraping and Parsing PDFs in Python

With Ian Hopkinson from ScraperWiki
http://beta.scraperwiki.com
@SmallCasserole
[email protected]

# The steps of a scrape
"""
* Get root URL
* Iterate over the resources to scrape
* Download source files (HTML, CSV, PDF, xls...)
* Parse files and write out
"""
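The four steps above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the speaker's code: `find_resources`, `scrape` and the injected `fetch`/`parse`/`write` callables are hypothetical names, the link scan assumes a very simple page layout, and the demo uses an in-memory dictionary in place of real HTTP requests.

```python
import re
from urllib.parse import urljoin


def find_resources(root_url, html):
    """Step 2: list absolute URLs of linked PDFs (crude href scan, assumed layout)."""
    hrefs = re.findall(r'href="([^"]+\.pdf)"', html)
    return [urljoin(root_url, href) for href in hrefs]


def scrape(root_url, fetch, parse, write):
    """Run the four steps; fetch, parse and write are supplied by the caller."""
    index_html = fetch(root_url)                       # 1. get root URL
    for url in find_resources(root_url, index_html):   # 2. iterate over resources
        raw = fetch(url)                               # 3. download source file
        for record in parse(raw):                      # 4. parse ...
            write(url, record)                         #    ... and write out


# Demo with a fake in-memory "web" (hypothetical URLs and contents).
FAKE_WEB = {
    "http://example.org/list/": '<a href="a.pdf">A</a> <a href="/docs/b.pdf">B</a>',
    "http://example.org/list/a.pdf": "alpha\nbeta",
    "http://example.org/docs/b.pdf": "gamma",
}
rows = []
scrape(
    "http://example.org/list/",
    fetch=FAKE_WEB.__getitem__,
    parse=str.splitlines,
    write=lambda url, record: rows.append((url, record)),
)
```

In a real scrape `fetch` would wrap something like `requests.get(url).content`, and `write` would save to a database rather than a list.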

# Handy tools
"""
* Browser Developer Tools, XPath helper, JSONView
* requests library
* lxml library (includes XPath)
* scraperwiki library (databases, pdftoxml)
* Scrapy framework
* Selenium remote control browser
"""

# HTML wrangling
>>> import requests
>>> import lxml.html
>>> r = requests.get(url)  # url: the page to scrape (not shown on the slide)
>>> doc = lxml.html.fromstring(r.content)
>>> s = doc.xpath("//td[@class='thing']")  # xpath() returns a list of elements
>>> links = doc.xpath("//a")
>>> P = doc.xpath("//*[contains(@href,'cgi')]")
...
>>> element_text = s[0].text  # text directly inside the first match
>>> elements_text = s[0].text_content()  # text including child elements
>>> import html2text
>>> elements_text = html2text.html2text(lxml.html.tostring(s[0], encoding="unicode"))

# PDF wrangling
# Royal Society Membership List

# PDF wrangling
>>> import scraperwiki.pdftoxml
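To show what there is to work with after conversion, here is a minimal sketch of reading pdftoxml-style output. It assumes the XML follows the `pdftohtml -xml` layout, with each text fragment in a `<text>` element carrying `top`, `left`, `width`, `height` and `font` attributes. The sample XML is made up for illustration, `text_fragments` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only, shaped like pdftohtml -xml output; not from a real PDF.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="120" height="14" font="0"><b>Name</b></text>
<text top="100" left="300" width="60" height="14" font="0">Elected</text>
<text top="120" left="50" width="110" height="14" font="1">Ada Example</text>
<text top="120" left="300" width="30" height="14" font="1">1999</text>
</page>
</pdf2xml>"""


def text_fragments(xml_string):
    """Return one dict per <text> element with its page, position, font and text."""
    root = ET.fromstring(xml_string)
    fragments = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            fragments.append({
                "page": int(page.get("number")),
                "top": int(el.get("top")),
                "left": int(el.get("left")),
                "font": el.get("font"),
                # itertext() also collects text nested inside <b> or <i>
                "text": "".join(el.itertext()).strip(),
            })
    return fragments


fragments = text_fragments(SAMPLE_XML)
```

The positions and font ids are the raw material for the layout-based parsing that follows: fragments sharing a `top` value sit on the same line, and a change of `font` often marks a heading.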

# PDF wrangling
# L'Académie des sciences

# PDF wrangling
# Hints for PDF parsers
* Use every layout feature you can find!
* Regular expressions
* Anticipate irregularity
* Don't be afraid of manual edits
* pdfminer
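A minimal sketch of the first three hints, assuming text fragments arrive as `(top, left, text)` tuples: the vertical position groups fragments into rows, the horizontal position orders the cells, and a forgiving regular expression absorbs small irregularities. The data, the two-column layout, `group_rows`, `parse_row` and the `YEAR` pattern are all invented for illustration.

```python
import re


def group_rows(fragments, tolerance=2):
    """Group (top, left, text) fragments into rows of text, ordered left to right.

    Fragments whose `top` is within `tolerance` of the row's first fragment
    are treated as the same line, since PDF baselines often differ by a pixel.
    """
    rows = []  # list of (row_top, [(left, text), ...])
    for top, left, text in sorted(fragments):
        if rows and abs(top - rows[-1][0]) <= tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Tolerate an optional "c." prefix and a trailing footnote star around the year.
YEAR = re.compile(r"^(?:c\.\s*)?(\d{4})\*?$")


def parse_row(cells):
    """Return (name, year), or None when the row does not look like data."""
    if len(cells) != 2:
        return None
    match = YEAR.match(cells[1].strip())
    if not match:
        return None
    return (cells[0].strip(), int(match.group(1)))


fragments = [
    (100, 50, "Name"), (100, 300, "Elected"),      # header row
    (120, 50, "Ada Example"), (121, 300, "1999"),  # baseline off by one
    (140, 300, "2004*"), (141, 50, "Bob Sample"),  # cells out of order
    (160, 50, "Page 3"),                           # page footer
]
rows = group_rows(fragments)
records = [record for record in map(parse_row, rows) if record]
```

Rows that fail to parse are worth logging rather than dropping silently; they are the candidates for the manual edits mentioned in the hints.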

# UN Democracy
# with Julian Todd
"""
** Provide the verbatim records of the UN
** in easily navigable, accessible form
* Available content:
* Security Council, General Assembly
http://www.undemocracy.com/
* 21,000 documents - varying formats
* Currently not updating
"""

# UN Democracy
# PDFs with consistent formatting

# UN Democracy
# PDF with regular formatting
* President
* Speaker ID
* Stage directions
* Votes
* Links to other files

# UN Democracy
# XML from PDF

# UN Democracy
# regular expressions!
import re

# ptext holds the text being classified (assumed: one paragraph of the record)
msodecided = re.match(r"(?:There being no objection, )?[Ii]t (?:was|is) so decided(?: \(decision [\d/]*\s*(?:A|B|C|A and B)?\))?\.?$", ptext)
# the inline (?i) flag has to come first in the pattern on current Python versions
mwasadopted = re.match(r"(?i).*?(?:resolution|decision|agenda|amendment|recommendation).*?(?:was|were) adopted", ptext)
mcalledorder = re.match(r"The meeting (?:was called to order|rose|was suspended|was adjourned|resumed|was resumed) (?:at|on)", ptext)
mtookchair = re.match(r"\s*(?:In the absence of the President, )?(.*?)(?:, \(?Vice[\-\s]President\)?,)? (?:took|in) the [Cc]hair\.?$", ptext)
mretchair = re.match(r"(?:The President|.*?, Vice-President,|Mrs. Albright.*?|Baroness Amos) (?:returned to|in) the Chair.$", ptext)
mescort = re.search(r"(?:was escorted|escorted the.*?) (?:(?:from|to) the (?:rostrum|podium|platform)|(?:from|into|to its place in) the (?:General Assembly Hall|Conference Room|Security Council Chamber))(?: by the President and the Secretary-General)?\.?$", ptext)
msecball = re.search(r"A vote was taken by secret ballot\.(?: The meeting was suspended at|$)", ptext)
mminsil = re.search(r"The (?:members of the (?:General )?Assembly|Council) observed (?:a|one) minute of (?:silent prayer (?:or|and) meditation|silence)\.$", ptext)
mtellers = re.search(r"At the invitations? of the (?:Acting )?Presidents?.*?acted as tellers\.$|Having been drawn by lot", ptext)
melected = re.search(r"[Hh]aving obtained (?:the required (?:two-thirds )?|an absolute )majority.*?(?:(?:were|was|been|is) s?elected|will be included [io]n the list)", ptext)
mmisc = re.search(r"The Acting President drew the following.*?from the box|sang.*?for the General Assembly|The Assembly heard a musical performance|The Secretary-General presented the award to|From the .*? Group:|Having been drawn by lot by the (?:President|Secretary-General),|were elected members of the Organizational Committee|President \w+ and then Vice-President|Vice-President \S+ \S+ presided over|The following .*? States have.*?been elected members of the Security Council", ptext)
mmiscnote = re.search(r"\[In the 79th plenary .*? III.\]$", ptext)
mmstar = re.match(r"\*", ptext)  # insert * in the text
mmspokein = re.match(r"\(spoke in \w+(?:; interpretation.*?|; .*? the delegation)?\)$", ptext)
matinvite = re.match(r"(?:At the invitation of the President, )?.*? (?:(?:took (?:a |the )?|were escorted to their )seats? at the Council table|(?:took|was invited to take) (?:(?:the |a |their )?(?:seat|place)s? reserved for \w+|a seat|a place|places|seats|their seats|his seat) at the (?:side of the )?Council (?:[Cc]hamber|table))(?:;.*?Chamber)?\.$", ptext)
mscsilence = re.match(r"The members of the (?:Security )?Council observed a minute of silence.$", ptext)
mscescort = re.search(r"(?:were|was) escorted to (?:seats|a seat|his place|a place) at the (?:Security )?Council table.$", ptext)
mvtape = re.match(r"A video ?(?:tape)? was (?:shown|played|displayed) in the (?:Council Chamber|General Assembly Hall).$|An audio tape, in Arabic,|The members of the General Assembly heard a musical performance.$", ptext)
mvprojscreen = re.match(r"(?:An image was|Two images were|A video was) projected on screen\.$", ptext)
mvresuadjourned = re.match(r"The meeting was resumed and adjourned on.*? a\.m\.$", ptext)
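A long run of `re.match` and `re.search` calls like this can also be arranged as a table of compiled patterns tried in order. The sketch below is one possible arrangement, not the UN Democracy code: the labels, `STAGE_DIRECTIONS` and `classify` are hypothetical names, only three of the patterns are carried over, and the test sentences are made up.

```python
import re

# (label, compiled pattern, use_search); labels are hypothetical names for this sketch.
STAGE_DIRECTIONS = [
    ("called_order",
     re.compile(r"The meeting (?:was called to order|rose|was suspended"
                r"|was adjourned|resumed|was resumed) (?:at|on)"),
     False),
    ("secret_ballot",
     re.compile(r"A vote was taken by secret ballot\."
                r"(?: The meeting was suspended at|$)"),
     True),
    ("projected",
     re.compile(r"(?:An image was|Two images were|A video was) projected on screen\.$"),
     False),
]


def classify(ptext):
    """Return the label of the first pattern that fits ptext, or None."""
    for label, pattern, use_search in STAGE_DIRECTIONS:
        # match() anchors at the start of ptext; search() scans the whole text
        found = pattern.search(ptext) if use_search else pattern.match(ptext)
        if found:
            return label
    return None


labels = [classify(text) for text in [
    "The meeting was called to order at 10.15 a.m.",
    "A vote was taken by secret ballot.",
    "A video was projected on screen.",
    "I thank the representative for his statement.",
]]
```

Compiling once and keeping the patterns in a list makes it easy to add a new case when an unexpected paragraph turns up, and to log paragraphs that return `None` for manual review.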

# UN Democracy
# PDFs with consistent formatting
Deep_in_the_PDF_mines

# Tools for probing legacy code
>>> import pycallgraph

# Tools for probing legacy code
# Clonedigger

• EU FP7 natural language processing project
• Process streams of daily news in 4 languages:
    o what, where, when and who
    o extract events: temporal and causal relations
    o store -> a history recorder.
    o Organise and visualise as stories, scripts, plots to provide more efficient access
• Netherlands (VU, LexisNexis, Synerscope), Spain (Basque University), UK (ScraperWiki) and Italy (Fondazione Bruno Kessler, Trento)
http://www.newsreader-project.eu/

#Natural language processing toolkit
>>> import nltk
>>> from nltk.book import *
>>> text6
<Text: Monty Python and the Holy Grail>
>>> text6.concordance("Thank",width=41)
you ! Thank you ! Thank you ! ... ZOOT :
orde ! CONCORDE : Thank you , sir ! Most
h . ARTHUR : Oh . Thank you . ROBIN : Ahh
. Fine . ARTHUR : Thank you . ROBIN : Spl

Data Science London
