Ferret: an open-source library to extract data from web news pages

Victor Martinez

November 25, 2017

Transcript

  1. Federal University of Bahia, Computer Science Department

    Ferret: an open-source library to extract data from web news pages
    Victor Martinez. Advisor: Ivan Machado
  2. 2015 - 2016

  3. Crawler Team

  4. Web News

  6. Goal: ~500 crawlers

  7. Goal: ~500 crawlers. Machine Learning + Programming

  9. That's enough!

  10. Extracted documents: 200K per month. Crawlers: ~200

  13. URL, language / HTML → Ferret → JSON

    { 'title': 'This is the title', 'publish_date': '2017-04-06T14:00:00', 'content': '<p>Dissertation … </p>', 'lang': 'en', 'html': '<!DOCTYPE><head>' }
  14. Fundamental Observations & Scientific Investigation

  16. "There are many pages on the Web for which the HTML does not adhere to the HTML specification correctly." (Baeza-Yates and Ribeiro Neto, 2013)

  17. "Approximately 95% of HTML documents on the web do not adhere to W3C HTML standards." (Ofuonye et al., 2010)
  18. Architecture

  19. Extensibility, easy to contribute, portability, usability, testability

  23. http://edition.cnn.com/2017/04/03/opinions/russia-terror-attack-opinion-bergen-sterman/index.html

  25. Title Extraction: OpenGraphTitleExtractor, TwitterTitleExtractor, TitleTagExtractor
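
The three extractor names on this slide suggest a fallback chain: try the Open Graph title, then the Twitter card title, then the `<title>` tag. The sketch below illustrates that idea with the standard library only. The class names come from the slide, but the `extract` method, the `parse_head` helper, and the ordering are assumptions made for illustration and are not Ferret's actual API.

```python
from html.parser import HTMLParser


class _HeadParser(HTMLParser):
    """Collects <meta> key -> content pairs and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first occurrence of each key.
                self.meta.setdefault(key.lower(), attrs["content"].strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_head(html):
    """Hypothetical helper: parse once, share the result among extractors."""
    parser = _HeadParser()
    parser.feed(html)
    return parser


class OpenGraphTitleExtractor:
    def extract(self, head):
        return head.meta.get("og:title")


class TwitterTitleExtractor:
    def extract(self, head):
        return head.meta.get("twitter:title")


class TitleTagExtractor:
    def extract(self, head):
        return head.title.strip() or None


# Assumed priority order: most specific metadata first, <title> last.
TITLE_EXTRACTORS = [
    OpenGraphTitleExtractor(),
    TwitterTitleExtractor(),
    TitleTagExtractor(),
]


def extract_title(html):
    """Return the first non-empty title found by the chain, else None."""
    head = parse_head(html)
    for extractor in TITLE_EXTRACTORS:
        title = extractor.extract(head)
        if title:
            return title
    return None
```

Keeping each source in its own small class means a new source can be added by appending one more object to the list, which fits the extensibility goal listed on the architecture slide.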

  26. Publish Date Extraction: OpenGraphPublishedDateExtractor, MetaTagsPublishedDateExtractor
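
Publish date extraction can be sketched the same way: read the Open Graph `article:published_time` property first, then fall back to other `<meta>` tags. The class names are taken from the slide. Everything else is an assumption for illustration, including the `extract` method, the list of fallback tag names in `NAMES`, and the choice to accept only ISO 8601 values. It should not be read as Ferret's implementation.

```python
from datetime import datetime
from html.parser import HTMLParser


class _MetaParser(HTMLParser):
    """Collects <meta> key -> content pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name") or attrs.get("itemprop")
        if key and attrs.get("content"):
            self.meta.setdefault(key.lower(), attrs["content"].strip())


def _parse_iso(value):
    """Parse an ISO 8601 string, returning None when it is not valid."""
    try:
        # Older Python versions reject a trailing "Z", so normalise it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


class OpenGraphPublishedDateExtractor:
    def extract(self, meta):
        return _parse_iso(meta.get("article:published_time", ""))


class MetaTagsPublishedDateExtractor:
    # Assumed fallback names; real pages use many more variants.
    NAMES = ("pubdate", "date", "dc.date", "datepublished")

    def extract(self, meta):
        for name in self.NAMES:
            parsed = _parse_iso(meta.get(name, ""))
            if parsed is not None:
                return parsed
        return None


def extract_publish_date(html):
    """Return the publish date as an ISO 8601 string, or None."""
    parser = _MetaParser()
    parser.feed(html)
    extractors = (OpenGraphPublishedDateExtractor(), MetaTagsPublishedDateExtractor())
    for extractor in extractors:
        found = extractor.extract(parser.meta)
        if found is not None:
            return found.isoformat()
    return None
```

Returning `None` for unparseable values, instead of raising, lets the chain continue to the next source when a page carries a malformed date.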

  27. Content Extraction

  28. Working with Ferret

  30. Analysis

  31. Regression Tests: 228 websites from different domains

    Brazilian-Portuguese: 203
    English: 25
  32. Regression Directory: Test Cases

  33. $ py.test tests/regression
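
A regression suite over stored pages can be organised as one directory per website, each holding the saved HTML and the values an extractor is expected to return. The sketch below shows one possible arrangement. The layout (`page.html` plus `expected.json`), the helper names `load_cases` and `field_accuracy`, and the exact-match comparison are all hypothetical; the slides do not show how the cases under `tests/regression` are actually stored or scored.

```python
import json
from pathlib import Path


def load_cases(root):
    """Yield (name, html, expected) for every case directory under root.

    Assumed layout: <root>/<site>/page.html and <root>/<site>/expected.json.
    """
    for case_dir in sorted(Path(root).iterdir()):
        html_file = case_dir / "page.html"
        expected_file = case_dir / "expected.json"
        if html_file.is_file() and expected_file.is_file():
            html = html_file.read_text(encoding="utf-8")
            expected = json.loads(expected_file.read_text(encoding="utf-8"))
            yield case_dir.name, html, expected


def field_accuracy(cases, extract, field):
    """Fraction of cases where extract(html) equals the expected field value."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(
        1 for _, html, expected in cases if extract(html) == expected.get(field)
    )
    return hits / len(cases)
```

Under pytest, the same cases could instead be fed to a single test function through `pytest.mark.parametrize`, so that each website is reported as its own pass or failure.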

  34. 86% regarding title extraction

  35. 87% regarding publish date extraction

  36. X% regarding content extraction

    Lack of existing approaches
    Complexity to measure
  37. Concluding Remarks and Future Work

  38. Research Contributions

    1. A study on Data Mining, Web Mining and Web Article Extraction
    2. A study aimed at extracting data from web news pages
    3. Ferret: an open-source library to extract data from web news pages
  41. Future Work

    1. Stimulate contributions
    2. Quality Attributes
    3. Extract other elements
    4. Work with other languages
    5. Benchmark with existing projects
    6. Test and analysis of content extraction
  47. Victor Martinez, vcrmartinez@gmail.com. Information Systems @ UFBA, Software Engineer