Ferret: an open-source library to extract data from web news pages

Slide 1

Slide 1 text

Federal University of Bahia
Computer Science Department
Ferret: an open-source library to extract data from web news pages
Victor Martinez
Advisor: Ivan Machado

Slide 2

Slide 2 text

2015 - 2016

Slide 3

Slide 3 text

Crawler Team

Slide 4

Slide 4 text

Web News

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Goal: ~500 crawlers

Slide 7

Slide 7 text

Goal: ~500 crawlers
Machine Learning + Programming

Slide 8

Slide 8 text


Slide 9

Slide 9 text

That's enough!

Slide 10

Slide 10 text

~200 crawlers
200K extracted documents per month

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

URL, language / HTML → Ferret → JSON

{
  'title': 'This is the title',
  'publish_date': '2017-04-06T14:00:00',
  'content': 'Dissertation …',
  'lang': 'en',
  'html': ''
}
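
As a rough illustration of how a consumer might handle a record shaped like the one on this slide, the sketch below decodes the JSON and converts `publish_date` into a `datetime`. The field names come from the slide; the sample values and the `load_record` helper are hypothetical and are not part of Ferret's API.

```python
import json
from datetime import datetime

# Sample payload mirroring the fields shown on the slide (values are illustrative).
SAMPLE = json.dumps({
    "title": "This is the title",
    "publish_date": "2017-04-06T14:00:00",
    "content": "Dissertation …",
    "lang": "en",
    "html": "",
})


def load_record(payload):
    """Hypothetical helper: decode a record and parse its ISO 8601 date."""
    record = json.loads(payload)
    record["publish_date"] = datetime.fromisoformat(record["publish_date"])
    return record


record = load_record(SAMPLE)
print(record["title"], record["publish_date"].year)
```

Parsing the date at the boundary keeps later code free of string handling, at the cost of rejecting records whose date is not ISO 8601.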

Slide 14

Slide 14 text

Scientific Investigation & Fundamental Observations

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Baeza-Yates and Ribeiro-Neto, 2013: There are many pages on the Web for which the HTML does not adhere to the HTML specification correctly.

Slide 17

Slide 17 text

Ofuonye et al., 2010: Approximately 95% of HTML documents on the web do not adhere to W3C HTML standards.
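
One practical consequence of these observations is that an extractor needs a parser that tolerates malformed markup. The sketch below uses Python's standard `html.parser` to pull text out of deliberately broken HTML; it only illustrates lenient parsing and is not necessarily the parser Ferret uses. The `TextCollector` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text nodes without requiring well-formed markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Deliberately malformed: unquoted attribute, unclosed <p> and <b>, stray </div>.
BROKEN_HTML = "<html><body><p class=lead>First paragraph<p><b>Second</div> paragraph"

collector = TextCollector()
collector.feed(BROKEN_HTML)
collector.close()  # flush any text still buffered by the parser
print(collector.chunks)
```

The parser reports tags and text as it encounters them and does not check nesting, so unbalanced tags do not stop extraction.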

Slide 18

Slide 18 text

Architecture

Slide 19

Slide 19 text

extensibility, easy to contribute, portability, usability, testability

Slide 20

Slide 20 text

Slide 21

Slide 21 text


Slide 22

Slide 22 text


Slide 23

Slide 23 text

http://edition.cnn.com/2017/04/03/opinions/russia-terror-attack-opinion-bergen-sterman/index.html

Slide 24

Slide 24 text

http://edition.cnn.com/2017/04/03/opinions/russia-terror-attack-opinion-bergen-sterman/index.html

Slide 25

Slide 25 text

Title Extraction
OpenGraphTitleExtractor
TwitterTitleExtractor
TitleTagExtractor
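
The three extractor names on this slide suggest a chain of fallbacks. A minimal sketch of that idea follows, using only the standard library. The function names, the `_HeadScanner` helper, and the priority order (Open Graph, then Twitter Card, then the `<title>` tag) are assumptions made for illustration; Ferret's actual classes and ordering may differ.

```python
from html.parser import HTMLParser


class _HeadScanner(HTMLParser):
    """Collects <meta> property/name values and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        return "".join(self._title_parts).strip() or None


def open_graph_title(scan):
    return scan.meta.get("og:title")


def twitter_title(scan):
    return scan.meta.get("twitter:title")


def title_tag(scan):
    return scan.title


# Assumed priority order: the first extractor that yields a value wins.
TITLE_EXTRACTORS = (open_graph_title, twitter_title, title_tag)


def extract_title(html):
    scan = _HeadScanner()
    scan.feed(html)
    scan.close()
    for extractor in TITLE_EXTRACTORS:
        value = extractor(scan)
        if value:
            return value
    return None


SAMPLE = (
    '<html><head><title>Tag title</title>'
    '<meta name="twitter:title" content="Twitter title"></head></html>'
)
print(extract_title(SAMPLE))
```

Keeping each strategy as a separate callable means a new source of titles can be added by appending one entry to `TITLE_EXTRACTORS`.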

Slide 26

Slide 26 text

Publish Date Extraction
OpenGraphPublishedDateExtractor
MetaTagsPublishedDateExtractor
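
Publish date extraction can be sketched in the same spirit: look for the Open Graph `article:published_time` property first, then fall back to other `<meta>` tags. The helper names and the list of fallback keys below are assumptions chosen for illustration, not taken from Ferret, and the sketch only accepts ISO 8601 values.

```python
from datetime import datetime
from html.parser import HTMLParser

# Candidate keys in priority order. "article:published_time" is the Open Graph
# article property; the other names are common conventions used here only as
# examples, not a list taken from Ferret.
DATE_KEYS = ("article:published_time", "date", "pubdate", "dc.date.issued")


class _MetaScanner(HTMLParser):
    """Collects the content of every <meta> tag, keyed by property or name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content:
            self.meta.setdefault(key.lower(), content.strip())


def extract_publish_date(html):
    """Return the first candidate value that parses as ISO 8601, else None."""
    scanner = _MetaScanner()
    scanner.feed(html)
    scanner.close()
    for key in DATE_KEYS:
        raw = scanner.meta.get(key)
        if not raw:
            continue
        try:
            return datetime.fromisoformat(raw)
        except ValueError:
            continue  # not ISO 8601; try the next candidate
    return None


SAMPLE = '<meta property="article:published_time" content="2017-04-06T14:00:00">'
print(extract_publish_date(SAMPLE))
```

Real pages use many date formats, so a production extractor would need a more permissive date parser than `datetime.fromisoformat`.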

Slide 27

Slide 27 text

Content Extraction

Slide 28

Slide 28 text

Working with Ferret

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Analysis

Slide 31

Slide 31 text

Regression Tests
228 websites from different domains
Brazilian Portuguese: 203
English: 25

Slide 32

Slide 32 text

Regression Directory Test Cases

Slide 33

Slide 33 text

$ py.test tests/regression
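
To make the idea of a regression directory concrete, here is a small standard-library sketch that walks a directory of test cases and reports how many pass. The layout (one folder per site holding `page.html` and `expected.json`), the `run_regression` function, and the stand-in `title_from_tag` extractor are all hypothetical; the slides do not show how `tests/regression` is organized or how the py.test suite is written.

```python
import json
import re
import tempfile
from pathlib import Path


def title_from_tag(html):
    """Stand-in extractor: text of the first <title> element, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def run_regression(root, extract, field="title"):
    """Return (passed, total) over every case directory under root."""
    passed = total = 0
    for case in sorted(Path(root).iterdir()):
        if not case.is_dir():
            continue
        html = (case / "page.html").read_text(encoding="utf-8")
        expected = json.loads((case / "expected.json").read_text(encoding="utf-8"))
        total += 1
        if extract(html) == expected.get(field):
            passed += 1
    return passed, total


def build_demo_cases(root):
    """Create two tiny cases: one the extractor handles, one it does not."""
    cases = {
        "site-a": ("<title>Alpha</title>", "Alpha"),
        "site-b": ("<h1>No title tag</h1>", "Beta"),
    }
    for name, (html, title) in cases.items():
        case = Path(root) / name
        case.mkdir()
        (case / "page.html").write_text(html, encoding="utf-8")
        (case / "expected.json").write_text(json.dumps({"title": title}), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_cases(tmp)
    passed, total = run_regression(tmp, title_from_tag)
    print(f"{passed}/{total} cases passed ({passed / total:.0%})")
```

Storing the saved page next to its expected values lets the suite run offline and keeps results stable when the live site changes.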

Slide 34

Slide 34 text

86% regarding title extraction

Slide 35

Slide 35 text

87% regarding publish date extraction

Slide 36

Slide 36 text

X% regarding content extraction
Lack of existing approaches
Complexity to measure

Slide 37

Slide 37 text

Concluding Remarks and Future Work

Slide 38

Slide 38 text

Research Contributions
1. A study on Data Mining, Web Mining and Web Article Extraction
2. A study aimed to extract data from web news pages
3. Ferret: an open-source library to extract data from web news pages

Slide 41

Slide 41 text

Future Work
1. Stimulate contributions
2. Quality Attributes
3. Extract other elements
4. Work with other languages
5. Benchmark with existing projects
6. Test and analysis of content extraction

Slide 47

Slide 47 text

Victor Martinez
Software Engineer
Information Systems @ UFBA