μRaptor: A DOM based system with appetite for hCard elements

Emir Muñoz
October 20, 2014

LD4IE Challenge 2014 Submission


Transcript

  1. Training Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, CSS Selectors

    We could determine patterns for emails, for example:

    vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE
    vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com
    vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
    vcard:email mailto : ALPHA @ ALPHANUMERIC . com
    vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE

    … or even for birthdays:

    vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER
    vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
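
The value patterns on this slide can be read as sequences of token classes. The following is a minimal sketch of how such a pattern might be derived from a raw value; it is not taken from the deck. The function names, the tokenizing regex, and the numeric thresholds for `SMALL_NUMBER` and `MEDIUM_NUMBER` are all assumptions, since the slides name the classes without defining them.

```python
import re

# Hypothetical thresholds: the slides name SMALL_NUMBER and MEDIUM_NUMBER
# but do not say where the boundaries lie.
SMALL_MAX = 99
MEDIUM_MAX = 9999

# Split a value into alphanumeric runs and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def token_class(token):
    """Map one token to a class name; punctuation is kept as a literal."""
    if token.isdigit():
        number = int(token)
        if number <= SMALL_MAX:
            return "SMALL_NUMBER"
        if number <= MEDIUM_MAX:
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"  # mixed-case letters
    if token.isalnum():
        return "ALPHANUMERIC"
    return token


def value_pattern(value, keep_literals=("mailto",)):
    """Abstract a property value into a space-separated class pattern."""
    tokens = TOKEN_RE.findall(value)
    return " ".join(
        tok if tok in keep_literals else token_class(tok) for tok in tokens
    )


print(value_pattern("mailto:John@example42.com"))
# mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
print(value_pattern("1985-03-12"))
# MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
```

Under these assumptions, patterns collected from training values could then serve as value constraints: a candidate `vcard:email` or `vcard:bday` value would be accepted only if its pattern was seen during training.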
  2. Extraction Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, Pattern Detection, CSS Selectors
  3. Extraction Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, Pattern Detection, Elements Qualification, CSS Selectors
  4. Extraction Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, Pattern Detection, Elements Qualification, Models Validation, CSS Selectors

    RDF Model (from μRaptor), RDF Model (test set) ? = 0.94 = 0.7 = 0.8
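
The "CSS class co-occurrence" step recurs on every pipeline slide but is not spelled out in the transcript. One plausible reading, sketched below, counts how often pairs of CSS classes appear together inside the same hCard sub-tree. This is an illustration under stated assumptions rather than the authors' method: it assumes the cleaned HTML is well-formed enough to parse as XML, that sub-trees are rooted at elements carrying the `vcard` class, and the function name `class_cooccurrence` is hypothetical.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def class_cooccurrence(xhtml, root_class="vcard"):
    """Count pairs of CSS classes found in the same root_class sub-tree.

    Assumes `xhtml` is well-formed markup (i.e. already cleaned).
    """
    tree = ET.fromstring(xhtml)
    counts = Counter()
    for node in tree.iter():
        if root_class not in node.get("class", "").split():
            continue
        # Collect every class used in this sub-tree, root included.
        classes = set()
        for descendant in node.iter():
            classes.update(descendant.get("class", "").split())
        # Each unordered pair of classes counts once per sub-tree.
        counts.update(combinations(sorted(classes), 2))
    return counts


SAMPLE = (
    '<div>'
    '<div class="vcard"><span class="fn">A</span>'
    '<a class="email" href="mailto:a@b.com">mail</a></div>'
    '<div class="vcard"><span class="fn">B</span></div>'
    '</div>'
)

pairs = class_cooccurrence(SAMPLE)
print(pairs[("fn", "vcard")])     # 2
print(pairs[("email", "vcard")])  # 1
```

Counts of this kind could indicate which classes reliably accompany `vcard`, which is one way candidate CSS selectors might be ranked; the deck itself does not say how the counts are used.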
  5. We made the discovery of the new μRaptor species and I am very pleased some researchers helped us understand its feeding habits

    Godzilla is a doll compared to μRaptor!

    I am currently working on a script for an upcoming movie

    As a kid I always wanted to see an actual dinosaur. Today my dream comes true

    Damn, he is better than me!