Federal University of Bahia
Computer Science Department
Victor Martinez
Ferret: an open-source library to extract
data from web news pages
Advisor: Ivan Machado
Title Extraction
OpenGraphTitleExtractor
TwitterTitleExtractor
TitleTagExtractor
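
The three extractors named above suggest a fallback chain: try the Open Graph title, then the Twitter Card title, then the plain title tag. The sketch below illustrates that idea with the Python standard library only. The class names come from the slide, but the `extract` method, the `_HeadParser` helper, the `extract_title` function, and the ordering are assumptions made for illustration, not Ferret's actual implementation.

```python
from html.parser import HTMLParser


class _HeadParser(HTMLParser):
    """Hypothetical helper: collects <meta> contents and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # maps a meta property/name to its content
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first occurrence of each key.
                self.meta.setdefault(key, attrs["content"].strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.title = (self.title or "") + data.strip()


class OpenGraphTitleExtractor:
    def extract(self, head):
        return head.meta.get("og:title")


class TwitterTitleExtractor:
    def extract(self, head):
        return head.meta.get("twitter:title")


class TitleTagExtractor:
    def extract(self, head):
        return head.title


def extract_title(html):
    """Return the first non-empty title found, or None (assumed ordering)."""
    head = _HeadParser()
    head.feed(html)
    chain = (OpenGraphTitleExtractor(), TwitterTitleExtractor(), TitleTagExtractor())
    for extractor in chain:
        title = extractor.extract(head)
        if title:
            return title
    return None
```

Placing the metadata extractors first is a design guess: publishers usually fill `og:title` with the headline alone, while the title tag often carries the site name as well.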
Publish Date Extraction
OpenGraphPublishedDateExtractor
MetaTagsPublishedDateExtractor
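
A publish date can be read the same way, from page metadata. The sketch below assumes the Open Graph `article:published_time` property is tried first and a handful of common meta names second. The candidate names, the `extract` method, and the helpers `_MetaParser`, `_parse_iso`, and `extract_published_date` are hypothetical, and only ISO 8601 values are handled; Ferret's real extractors may differ.

```python
from datetime import datetime
from html.parser import HTMLParser


class _MetaParser(HTMLParser):
    """Hypothetical helper: collects <meta> contents keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name") or attrs.get("itemprop")
        content = attrs.get("content")
        if key and content:
            self.meta.setdefault(key.lower(), content.strip())


def _parse_iso(value):
    """Parse an ISO 8601 string; return None when it cannot be parsed."""
    try:
        # Older Python versions reject a "Z" suffix, so map it to an offset.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


class OpenGraphPublishedDateExtractor:
    def extract(self, meta):
        return _parse_iso(meta.get("article:published_time", ""))


class MetaTagsPublishedDateExtractor:
    # Assumed candidate names; real pages use many variants.
    CANDIDATES = ("datepublished", "date", "pubdate", "dc.date.issued")

    def extract(self, meta):
        for name in self.CANDIDATES:
            parsed = _parse_iso(meta.get(name, ""))
            if parsed is not None:
                return parsed
        return None


def extract_published_date(html):
    """Return the first parseable publish date, or None."""
    parser = _MetaParser()
    parser.feed(html)
    chain = (OpenGraphPublishedDateExtractor(), MetaTagsPublishedDateExtractor())
    for extractor in chain:
        date = extractor.extract(parser.meta)
        if date is not None:
            return date
    return None
```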
Content Extraction
Working with
Ferret
Analysis
Regression Tests
228 websites from different domains
Brazilian Portuguese: 203
English: 25
Regression Directory
Test Cases
$ py.test tests/regression
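
The deck does not show how the regression cases are stored or scored, so the following is only a sketch of one way such a suite could measure extraction accuracy. It assumes a hypothetical layout in which each test case is a JSON file holding the page `html` and the expected value; `load_cases` and `accuracy` are illustrative names, not part of Ferret.

```python
import json
from pathlib import Path


def load_cases(directory):
    """Yield one dict per *.json file in `directory` (hypothetical layout)."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def accuracy(cases, extract, field):
    """Return the fraction of cases where extract(html) equals the expected field."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if extract(case["html"]) == case[field])
    return hits / len(cases)
```

With a real extractor, a call such as `accuracy(load_cases("tests/regression"), extract_title, "title")` would give the share of pages whose title was extracted exactly; `extract_title` here stands for whatever extraction function is under test.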
86%
regarding title extraction
87%
regarding publish date
extraction
X%
regarding content
extraction
Lack of existing approaches
Difficult to measure
Concluding Remarks
and Future Work
Research Contributions
1. A study on Data Mining, Web Mining, and Web Article Extraction
2. A study aimed at extracting data from web news pages
3. Ferret: an open-source library to extract data from web news pages
Future Work
1. Stimulate contributions
2. Quality Attributes
3. Extract other elements
4. Work with other languages
5. Benchmark with existing projects
6. Test and analysis of content extraction
Victor Martinez
[email protected]
Information Systems @ UFBA
Software Engineer