Ferret: an open-source library to extract data from web news pages

1 Federal University of Bahia, Computer Science Department
Ferret: an open-source library to extract data from web news pages
Victor Martinez. Advisor: Ivan Machado

2 2015 - 2016

3 Crawler Team

4 Web News

6 Goal: ~500 crawlers

7 Goal: ~500 crawlers (Machine Learning + Programming)

9 That's enough!

10 ~200 crawlers, 200K extracted documents per month

13 URL, language / HTML → Ferret → JSON
{
  'title': 'This is the title',
  'publish_date': '2017-04-06T14:00:00',
  'content': '<p>Dissertation … </p>',
  'lang': 'en',
  'html': '<!DOCTYPE><head>'
}

14 Fundamental Observations & Scientific Investigation

16 "There are many pages on the Web for which the HTML does not adhere to the HTML specification correctly." (Baeza-Yates and Ribeiro Neto, 2013)

17 "Approximately 95% of HTML documents on the web do not adhere to W3C HTML standards." (Ofuonye et al., 2010)
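Because most pages are not valid HTML, an extractor has to read sloppy markup without failing. The sketch below is only an illustration with Python's standard `html.parser`, not Ferret's own parser; the `TitleCollector` class and the `BROKEN_HTML` sample are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text inside <title>, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


# Malformed on purpose: no doctype, unclosed <head>, unquoted attribute,
# unclosed <p> and <body>.
BROKEN_HTML = (
    "<html><head><title>Breaking news</title>"
    "<body class=main><p>First paragraph<p>Second"
)

collector = TitleCollector()
collector.feed(BROKEN_HTML)
collector.close()
recovered_title = "".join(collector.title_parts).strip()
print(recovered_title)
```

The parser is event-based and does not validate the document, so the missing closing tags do not stop the title from being recovered.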

18 Architecture

19 extensibility, easy to contribute, portability, usability, testability

23 http://edition.cnn.com/2017/04/03/opinions/russia-terror-attack-opinion-bergen-sterman/index.html

24 http://edition.cnn.com/2017/04/03/opinions/russia-terror-attack-opinion-bergen-sterman/index.html

25 Title Extraction: OpenGraphTitleExtractor, TwitterTitleExtractor, TitleTagExtractor
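One way to read this slide is as a chain of fallbacks: try the Open Graph title, then the Twitter card title, then the `<title>` tag. The sketch below reuses the three class names from the slide, but the method names, the ordering, and the helper `_HeadCollector` are assumptions made for illustration, not Ferret's actual API.

```python
from html.parser import HTMLParser


class _HeadCollector(HTMLParser):
    """Hypothetical helper: gathers <meta> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first occurrence of each key.
                self.meta.setdefault(key.lower(), attrs["content"].strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class OpenGraphTitleExtractor:
    def extract(self, head):
        return head.meta.get("og:title")


class TwitterTitleExtractor:
    def extract(self, head):
        return head.meta.get("twitter:title")


class TitleTagExtractor:
    def extract(self, head):
        return head.title.strip() or None


# Assumed priority order: most specific metadata first, <title> as last resort.
TITLE_EXTRACTORS = [
    OpenGraphTitleExtractor(),
    TwitterTitleExtractor(),
    TitleTagExtractor(),
]


def extract_title(html):
    """Return the first non-empty title found by the extractor chain."""
    head = _HeadCollector()
    head.feed(html)
    head.close()
    for extractor in TITLE_EXTRACTORS:
        title = extractor.extract(head)
        if title:
            return title
    return None
```

Keeping each strategy in its own small class means a new source of titles can be added by appending one more extractor to the list.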

26 Publish Date Extraction: OpenGraphPublishedDateExtractor, MetaTagsPublishedDateExtractor
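Publish dates can be approached the same way. In the sketch below the two class names come from the slide; everything else is an assumption for illustration: that the Open Graph extractor reads `article:published_time`, that the meta-tag extractor checks the names listed in `CANDIDATE_NAMES`, and that values are ISO 8601 strings.

```python
from datetime import datetime
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Hypothetical helper: maps <meta> property/name/itemprop to content."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name") or attrs.get("itemprop")
        if key and attrs.get("content"):
            self.meta.setdefault(key.lower(), attrs["content"].strip())


def _parse_iso(value):
    """Parse an ISO 8601 timestamp; return None when it cannot be parsed."""
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


class OpenGraphPublishedDateExtractor:
    def extract(self, meta):
        return _parse_iso(meta.get("article:published_time", ""))


class MetaTagsPublishedDateExtractor:
    # Assumed list of meta names; real pages use many more variants.
    CANDIDATE_NAMES = ("pubdate", "date", "datepublished")

    def extract(self, meta):
        for name in self.CANDIDATE_NAMES:
            parsed = _parse_iso(meta.get(name, ""))
            if parsed:
                return parsed
        return None


def extract_publish_date(html):
    """Return the first parseable publish date as an ISO 8601 string."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    extractors = (OpenGraphPublishedDateExtractor(), MetaTagsPublishedDateExtractor())
    for extractor in extractors:
        parsed = extractor.extract(collector.meta)
        if parsed:
            return parsed.isoformat()
    return None
```

Unparseable values such as free-text dates simply fall through to the next extractor, so one bad tag does not hide a good one.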

27 Content Extraction

28 Working with Ferret

30 Analysis

31 Regression Tests: 228 websites from different domains (Brazilian Portuguese: 203, English: 25)

32 Regression Directory Test Cases

33 $ py.test tests/regression
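A regression suite of this kind can be pictured as one directory per website, each holding a saved page and the values the extractor is expected to return. The layout below (`page.html` plus `expected.json`), the stand-in `extract_title` function, and the helper names are all hypothetical; the slides do not show how Ferret's test cases are actually stored.

```python
import json
import re
import tempfile
from pathlib import Path


def extract_title(html):
    """Stand-in extractor (not Ferret's API): returns the <title> text."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def load_cases(root):
    """Yield (name, html, expected) for each case directory under root."""
    for case_dir in sorted(Path(root).iterdir()):
        if not case_dir.is_dir():
            continue
        html = (case_dir / "page.html").read_text(encoding="utf-8")
        expected = json.loads((case_dir / "expected.json").read_text(encoding="utf-8"))
        yield case_dir.name, html, expected


def failing_title_cases(root):
    """Return the names of cases whose extracted title differs from the expected one."""
    return [
        name
        for name, html, expected in load_cases(root)
        if extract_title(html) != expected["title"]
    ]


def test_regression_titles():
    # pytest collects functions named test_*; this one builds a single
    # throwaway case so the sketch runs without any saved pages.
    with tempfile.TemporaryDirectory() as root:
        case = Path(root) / "example-site"
        case.mkdir()
        (case / "page.html").write_text(
            "<html><title>This is the title</title></html>", encoding="utf-8"
        )
        (case / "expected.json").write_text(
            json.dumps({"title": "This is the title"}), encoding="utf-8"
        )
        assert failing_title_cases(root) == []
```

Storing the saved HTML next to the expected output keeps the tests independent of the live websites, which may change or disappear.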

34 86% regarding title extraction

35 87% regarding publish date extraction

36 X% regarding content extraction: lack of existing approaches, complexity to measure

37 Concluding Remarks and Future Work

38 Research Contributions
1. A study on Data Mining, Web Mining and Web Article Extraction
2. A study aimed to extract data from web news pages
3. Ferret: an open-source library to extract data from web news pages

41 Future Work
1. Stimulate contributions
2. Quality Attributes
3. Extract other elements
4. Work with other languages
5. Benchmark with existing projects
6. Test and analysis of content extraction

47 Victor Martinez, Information Systems @ UFBA, Software Engineer
