μRaptor: A DOM based system with appetite for hCard elements

Emir Muñoz
October 20, 2014

LD4IE Challenge 2014 Submission

Transcript

 1. 3.
 2. 8.

  Training Phase

  Clean the HTML; DOM sub-trees; CSS class co-occurrence; Value Constraints; CSS Selectors

  We could determine patterns for emails, for example:

    vcard:email  mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE
    vcard:email  mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com
    vcard:email  mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
    vcard:email  mailto : ALPHA @ ALPHANUMERIC . com
    vcard:email  mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE

  … or even for birthdays:

    vcard:bday  NUMBER - SMALL_NUMBER - SMALL_NUMBER
    vcard:bday  MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
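  The value patterns on this slide abstract each token of a property value into a coarse class. The transcript does not define those classes, so the following is only a minimal sketch under assumed definitions: the numeric thresholds, the set of tokens kept literally (including "mailto"), and the names classify and abstract are all hypothetical, not taken from μRaptor.

```python
import re

# Assumed thresholds: the slide does not say where SMALL/MEDIUM numbers end.
SMALL_MAX = 99
MEDIUM_MAX = 9999

# Assumed set of tokens that are kept as-is instead of being abstracted.
LITERALS = {"mailto", ".", "@", "-", ":"}

# Split a value into alphanumeric runs and single non-space symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def classify(token: str) -> str:
    """Map one token to a coarse class name (assumed definitions)."""
    if token in LITERALS:
        return token
    if token.isdigit():
        n = int(token)
        if n <= SMALL_MAX:
            return "SMALL_NUMBER"
        if n <= MEDIUM_MAX:
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"  # mixed-case letters
    if token.isalnum():
        return "ALPHANUMERIC"
    return "PUNCTUATION"


def abstract(value: str) -> str:
    """Turn a raw value into a space-separated pattern of token classes."""
    return " ".join(classify(t) for t in TOKEN_RE.findall(value))


if __name__ == "__main__":
    print(abstract("mailto:Jane@example42.org"))
    print(abstract("1985-04-12"))
```

  Under these assumptions the two sample values reproduce the shape of the third email pattern and the second birthday pattern listed on the slide. The slide also shows the same value generalized at more than one level (". com" next to ". ALL_LOWERCASE"); this sketch emits a single level only.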
 3. 9.

  Extraction Phase

  Clean the HTML; DOM sub-trees; CSS class co-occurrence; Value Constraints; Pattern Detection; CSS Selectors
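  To make the idea of pulling hCard elements out of DOM sub-trees concrete, here is a small sketch. It is not the μRaptor implementation: it uses plain class-name matching with Python's standard html.parser instead of a CSS selector engine, and HCardParser, extract_hcards and the short property list are hypothetical names chosen for illustration. The class names themselves (vcard, fn, email, bday, tel, url, org) come from the hCard microformat.

```python
from html.parser import HTMLParser

# A small subset of hCard property class names (illustrative, not exhaustive).
HCARD_PROPS = {"fn", "email", "bday", "tel", "url", "org"}
# Elements without a closing tag; they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardParser(HTMLParser):
    """Collect the text of hCard properties found inside class="vcard" sub-trees."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard sub-tree
        self._open = []   # stack of (tag, is_vcard_root, property names)

    def _in_card(self):
        return any(is_root for _, is_root, _ in self._open)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        is_root = "vcard" in classes and not self._in_card()
        if is_root:
            self.cards.append({})
        inside = is_root or self._in_card()
        props = sorted(classes & HCARD_PROPS) if inside else []
        self._open.append((tag, is_root, props))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed children.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._in_card():
            return
        card = self.cards[-1]
        # Attribute the text to every enclosing element that carries a property class.
        for _, _, props in self._open:
            for prop in props:
                card[prop] = (card.get(prop, "") + " " + text).strip()


def extract_hcards(html: str) -> list:
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Doe</span>
  <a class="email" href="mailto:jane@example.org">jane@example.org</a>
  <span class="bday">1985-04-12</span>
</div>
"""

if __name__ == "__main__":
    print(extract_hcards(SAMPLE))
```

  The sketch reads element text only. Values carried in attributes, such as the href of an email link, would need extra handling.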
 4. 10.

  Extraction Phase

  Clean the HTML; DOM sub-trees; CSS class co-occurrence; Value Constraints; Pattern Detection; Elements Qualification; CSS Selectors
 5. 11.

  Extraction Phase

  Clean the HTML; DOM sub-trees; CSS class co-occurrence; Value Constraints; Pattern Detection; Elements Qualification; Models Validation; CSS Selectors

  RDF Model From μRaptor; RDF Model Test set; ?

  = 0.94; = 0.7; = 0.8
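  The three values on this slide have lost their labels in the transcript. One plausible reading, and it is only an assumption, is precision, recall and F1 for the μRaptor RDF model against the test-set RDF model. The short check below shows that the third value is at least consistent with that reading; f1_score is a hypothetical helper name.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Assumed reading of the slide: precision = 0.94, recall = 0.7.
    print(round(f1_score(0.94, 0.7), 2))
```

  Rounded to two decimals the result is 0.8, which matches the third value on the slide, but this does not confirm which metric each number actually stood for.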
 6. 13.

  We made the discovery of the new μRaptor species and I am very pleased some researchers helped us understand its feeding habits.

  Godzilla is a doll compared to μRaptor!

  I am currently working on a script for an upcoming movie.

  As a kid I always wanted to see an actual dinosaur. Today my dream comes true.

  Damn, he is better than me!