How soon is now: using machine to extract publication dates of news articles online

Talk given at PyData SF in August 2016.

almosteverywhere

October 16, 2016


Transcript

  1. How soon is now: Using machine learning to extract publication dates of news articles. Julie Lavoie, [email protected]. PyData SF 2016.
  2. Sites use templates: if you can extract one page from a site, you can extract them all.
  3. Extracting the date of “any” page is hard. You can do the same for 10 or 15 sites, but what about 10,000 sites? Will you write 10,000 cases?
  4. Could get people to do it: Mechanical Turk has an API; send tasks to real humans at ~$0.01 per page. Programming humans! Problem: not scalable. With 1 million+ pages to process per week, that's $10,000/week, and the quality is not the best. (Foreshadowing: we're going to need this anyway.)
  5. Get a programmer to do it: look at the data, see a pattern, write a program/algorithm. E.g. if you see text that says “Published:”, extract the date from that; if you don't find it, look for a div with class="pubdate"; etc. Solutions like this can be good first approximations. Problem: patterns can be more complex than a human can understand. (Foreshadowing: we're also going to need this anyway.)
  6. The big guns: get machine learning to do it. It can “see” a pattern that is more complex than humans can see. Note: ML includes both previous solutions. We're going to do supervised learning: you will need humans to produce example pages with correct dates to train the model, and you will need heuristics to figure out which features to tell the model to pay attention to.
  7. But: machine learning is a PITA. It's an enormous amount of work; only use it if the volume of your data or the complexity of the problem is too much for the previous solutions. E.g. a friend who wanted to parse dates was weighing hiring an intern vs. ML. I asked her how much data she had: 2,000 results. At that point the fastest and most accurate option is to just do it by hand, even faster than signing up for Mechanical Turk.
  8. How do humans recognize a publication date? • Close to the title, close to the author, near the top of the page, looks like a date • Not in the footer, not in the comments
  9. 3 different types of cues. Linguistic cues: language like “Updated:”. Rendering/presentation cues: e.g. near the top, close to the title and author. Markup cues, not read by humans: <div class="pubdate">, <time> tags.
  10. The hardest part of machine learning… is setting up the problem so machine learning can answer it. “You set them up and I knock them down.”
  11. Enter this research paper: “Web Page Publication Date Extraction and Application” by Chen, Ma, Rui, Sun, Shao, and Ren, Journal of Computational Information Systems. http://www.jofcis.com/publishedpapers/2010_6_1_279_285.pdf
  12. Training: 1) Get data (HTML + publication date). 2) Split the page into chunks: DOM elements that contain a date (+ label). 3) Calculate features on each of these chunks (+ label). 4) Train the model.
  13. Trained model in production: 1) Split the page into chunks: DOM elements that contain a date. 2) Calculate features on each of these chunks. 3) Pass each feature vector to the trained classifier and get a label. 4) Decide which chunk to take if more than one gets a true answer. 5) Extract the date string from the chunk. 6) Fine-tuning.
  14. On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. -Charles Babbage, 1864
  15. Step 1) Data. Big CSV with URLs; people on oDesk look at each page and enter the date in the CSV: URL, PUBLICATION DATE. Download all pages from the URLs to use as training data. People are often wrong, so 3 people fill out the same CSV and we only take a result if all 3 agree.
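The all-three-agree rule can be sketched as a small consensus filter; the `{url: date}` input shape is an assumption for illustration, not the talk's actual CSV layout:

```python
from collections import defaultdict

def consensus_labels(annotation_sets):
    """Each annotator's answers as a {url: date} dict; keep a URL only
    when every annotator answered and all the answers agree."""
    votes = defaultdict(list)
    for answers in annotation_sets:
        for url, d in answers.items():
            votes[url].append(d.strip())
    n = len(annotation_sets)
    return {url: dates[0] for url, dates in votes.items()
            if len(dates) == n and len(set(dates)) == 1}
```

Unanimity is strict, but with only three annotators a simple majority vote would still accept one careless answer out of three whenever two people make the same mistake.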
  16. Mo data, mo problems. • After the correct publication date is entered into the CSV, the web page goes 404 • Articles have publication dates when viewed in the browser, but when you download the page with requests you get no data because the site uses react.js • CSV and UTF-8 errors: four horsemen of the parsing apocalypse • We drop results that are “N/A”, but people enter “NA” instead, so that still gets parsed • People enter incorrect results because they make a mistake or don't care
  17. Step 2) Split into chunks. Easy for dates because of the limited formats: June 23rd, 2016; 23-06-2016; … Regex + lxml. <li class="date">Aug 3, 2016</li> <p class="update-time">Updated 5:58 PM ET, Thu August 4, 2016 <span class="video__source top_source" id="js-pagetop_video_source"></span></p> You need the whole chunk, not just the date, because of the markup info.
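A minimal sketch of the regex + lxml chunking, assuming a loose date pattern (the real pattern set would cover more formats):

```python
import re
from lxml import html

# Loose pattern covering formats like "Aug 3, 2016" or "23-06-2016".
DATE_RE = re.compile(
    r"\b(?:\d{1,2}[-/]\d{1,2}[-/]\d{2,4}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
    r" \d{1,2}(?:st|nd|rd|th)?,? \d{4})\b",
    re.IGNORECASE)

def date_chunks(page_html):
    """Return the DOM elements whose direct text contains a date.
    We keep the whole element, not just the date string, so the markup
    cues (tag name, class, position) stay available as features."""
    tree = html.fromstring(page_html)
    return [el for el in tree.iter()
            if el.text and DATE_RE.search(el.text)]
```

Checking each element's direct text (rather than its recursive text content) keeps the chunks small: otherwise every ancestor of a date node would match too.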
  18. Step 3) Feature selection. It's a good idea to start with heuristics, because many features come from there: contains the word “posted”, contains “modified”, contains “published”; header level; is a link; bold or not; position from the top of the page; position relative to the title.
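These heuristics can be encoded as a feature function per chunk; the exact feature names and thresholds below are illustrative assumptions, not the talk's actual feature set:

```python
def chunk_features(text, tag, position, title_position):
    """Feature dict for one chunk. `position` and `title_position` are
    chunk indices in document order; `tag` is the element's tag name."""
    lower = text.lower()
    return {
        "contains_posted": "posted" in lower,
        "contains_modified": "modified" in lower,
        "contains_published": "published" in lower,
        "is_header": tag in {"h1", "h2", "h3", "h4", "h5", "h6"},
        "is_link": tag == "a",
        "is_bold": tag in {"b", "strong"},
        "near_top": position < 10,
        "distance_from_title": abs(position - title_position),
    }
```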
  19. Step 4) Training the model. Classification problem at the chunk level: does this chunk contain the publication date or not? NLTK + Naive Bayes; could have used scikit-learn. There are probably better solutions, but this was the easiest one to understand: the NLTK tutorial is understandable; scikit-learn, wtf.
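The chunk-level classification can be sketched with NLTK's Naive Bayes classifier, as in the talk; the feature dicts and labels below are toy data standing in for the real training set:

```python
import nltk

# Toy labeled chunks: (feature dict, is_publication_date). In the real
# pipeline the feature dicts come from step 3 and the labels from the
# annotated CSVs; these four rows are made up for illustration.
train_set = [
    ({"contains_published": True, "near_top": True}, True),
    ({"contains_published": False, "near_top": True}, False),
    ({"contains_published": True, "near_top": False}, True),
    ({"contains_published": False, "near_top": False}, False),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
label = classifier.classify({"contains_published": True, "near_top": True})
```

`NaiveBayesClassifier.train` takes a list of (featureset, label) pairs directly, which is part of why NLTK is the gentler on-ramp here.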
  20. Step 5) THERE CAN BE ONLY ONE. How to decide which chunk to take? Tried multiple things; in my case, just pick the first one, since it correlates with being at the top of the page.
  21. Step 6) Fine-tuning. Pre-filter: ditch DOM elements that lead to false positives: comments, footer, JavaScript (often a library date). Post-filter: remove results that are obviously wrong: dates in the future, dates far in the past (but beware: historical articles). Hacks: if there is a date in the URL, it is almost always the publication date, e.g. http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
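The URL hack and the future-date post-filter can be sketched like this; the regex and month table are assumptions covering the example URL's /YYYY/Mon/DD/ layout:

```python
import re
from datetime import date

# Matches path segments like /2016/Apr/20/ or /2016/04/20/.
URL_DATE_RE = re.compile(r"/(\d{4})/([A-Za-z]{3}|\d{1,2})/(\d{1,2})/")

MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def date_from_url(url):
    """Hack from the talk: a date embedded in the URL path is almost
    always the publication date. Returns a date or None."""
    m = URL_DATE_RE.search(url)
    if not m:
        return None
    year, month, day = m.groups()
    month_num = MONTHS.get(month.lower()[:3]) or int(month)
    return date(int(year), month_num, int(day))

def plausible(d, today):
    """Post-filter: reject dates in the future. Very old dates are left
    alone, since historical articles are legitimate."""
    return d <= today
```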
  22. Results! 88% precision on articles sampled from ~550,000 publications. The paper reported ~96% precision (on its research data set, not the same as ours). There are several reasons why:
  23. Differences between production and research. The major one: performance issues in production. We did not implement the headless-browser features: too slow for production, which parses 300,000 pages per day vs. chugging along in the lab. Also, avoiding false positives > avoiding no answers.
  24. Lessons learned. I was the only person I knew working with data: I bounced ideas off the CTO, but he also had no experience with this, and I had problems my web-dev friends didn't have. I watched a lot of PyData talks to try to understand best practices, what other people were doing, and how to organize my code. Wanted to list some lessons as well.
  25. Lessons learned. On data: if you're doing supervised learning, go through the whole process with a small set of data to make sure the data is in the correct format. Spreadsheets: give examples; English is often not the annotators' first language, and clearer instructions are better, so don't use slang or idioms.
  26. Write tools to debug the data pipeline at each stage. You parse a page and get the wrong answer: what is the problem? Do we get data at all when we download the page? Do we get chunks from that page? (The chunking algorithm may be wrong.) Does the correct chunk get classified as true? (The classifier may be wrong.) Are there multiple true chunks but the wrong one gets picked? (The chunk-picking algorithm is wrong.) Are there chunks all the way through but an empty result? (The post-filtering is wrong.)
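This decision tree can be turned into a small diagnostic helper that reports the first stage where a page went wrong; the stage callables here are hypothetical stand-ins for the real pipeline functions:

```python
def diagnose(page, download, chunk, classify, pick, postfilter):
    """Walk the pipeline stage by stage for one page and report the
    first failing stage. Each argument after `page` is a callable
    implementing one pipeline stage."""
    html_text = download(page)
    if not html_text:
        return "download failed: no data for page"
    chunks = chunk(html_text)
    if not chunks:
        return "no chunks: chunking algorithm maybe wrong"
    true_chunks = [c for c in chunks if classify(c)]
    if not true_chunks:
        return "no chunk classified true: classifier maybe wrong"
    chosen = pick(true_chunks)
    result = postfilter(chosen)
    if result is None:
        return "chunks all the way through but empty result: post-filtering wrong"
    return f"ok: {result}"
```

Spotting the "wrong chunk picked" case still needs the ground-truth date from the annotation CSV, so it lives in a separate comparison step rather than in this helper.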
  27. Machine learning is about thinking statistically… vs. algorithmically. Out of 100 pages, 99 will have the date at the top and 1 at the bottom. If you write something that is correct for the 99 pages, you'll probably get that 1 page wrong, but this is better than getting 1 right and 99 wrong. Search for what is right MOST of the time, vs. algorithmically correct (where 1 wrong answer is a bug).
  28. Thank you for listening to this talk! Thanks to Lee Semel, Peat Bakke, and the PDX Python Users Group for helping improve this talk.
  29. Talk to me now / talk to me later. No Q&A, but come talk to me in person afterwards. Talk to me later: I work on cool data problems, remotely. Julie Lavoie / [email protected] (P.S. I send clients candy from Japan.)