μRaptor: A DOM based system with appetite for hCard elements

Emir Muñoz
October 20, 2014

LD4IE Challenge 2014 Submission

Transcript

  1. 3.
  2. 8.

    Training Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, CSS Selectors

    vcard:email  mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE
    vcard:email  mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com
    vcard:email  mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
    vcard:email  mailto : ALPHA @ ALPHANUMERIC . com
    vcard:email  mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE
    vcard:bday   NUMBER - SMALL_NUMBER - SMALL_NUMBER
    vcard:bday   MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER

    We could determine patterns for emails for example: … or even for birthdays
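
One way to read the value constraints on this slide is as token-class abstractions of property values, such as `mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE`. The following is a minimal sketch of that idea and not the μRaptor implementation: the function names, the tokenizer and the numeric thresholds for `SMALL_NUMBER` and `MEDIUM_NUMBER` are all assumptions, since the slides name the classes without defining them. The more specific variants with a literal `com` are not modelled.

```python
import re

# Assumed thresholds; the slides do not define the number classes.
SMALL_MAX = 99      # e.g. days and months
MEDIUM_MAX = 9999   # e.g. four-digit years

# Symbols and words kept verbatim, as they appear literally in the slide patterns.
LITERAL_SYMBOLS = {":", "@", ".", "-"}
LITERAL_WORDS = {"mailto"}

# A token is a run of ASCII letters/digits or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|\S")


def classify(token: str) -> str:
    """Map one token to a class name or keep it as a literal."""
    if token in LITERAL_WORDS or token in LITERAL_SYMBOLS:
        return token
    if token.isdigit():
        value = int(token)
        if value <= SMALL_MAX:
            return "SMALL_NUMBER"
        if value <= MEDIUM_MAX:
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"  # mixed-case letters
    if token.isalnum():
        return "ALPHANUMERIC"
    return "PUNCTUATION"


def abstract(value: str) -> str:
    """Turn a raw property value into a space-separated pattern."""
    return " ".join(classify(tok) for tok in TOKEN_RE.findall(value))


def satisfies(value: str, learned_patterns: set[str]) -> bool:
    """Check a candidate value against patterns collected during training."""
    return abstract(value) in learned_patterns


print(abstract("mailto:John@example1.org"))
# mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
print(abstract("1985-04-12"))
# MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
```

Under these assumptions, patterns collected from training values can be stored per property (for example `vcard:email` or `vcard:bday`) and reused as a filter on candidate values at extraction time.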
  3. 9.

    Extraction Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, Pattern Detection, CSS Selectors
  4. 10.

    Extraction Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, Pattern Detection, Elements Qualification, CSS Selectors
  5. 11.

    Extraction Phase: Clean the HTML, DOM sub-trees, CSS class co-occurrence, Value Constraints, Pattern Detection, Elements Qualification, Models Validation, CSS Selectors

    RDF Model From μRaptor RDF Model Test set ? = 0.94 = 0.7 = 0.8
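
The validation slide shows three values (0.94, 0.7 and 0.8) without labels in this transcript. They are numerically consistent with precision, recall and F1, because the harmonic mean of 0.94 and 0.7 is about 0.80. That reading is an assumption and is not stated in the source. A small check of the arithmetic:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed reading of the slide values: precision = 0.94, recall = 0.7.
print(round(f1_score(0.94, 0.7), 2))  # 0.8
```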
  6. 13.

    We made the discovery of the new μRaptor species and I am very pleased some researchers helped us understand its feeding habits

    Godzilla is a doll compared to μRaptor!

    I am currently working on a script for an upcoming movie

    As a kid I always wanted to see an actual dinosaur. Today my dream comes true

    Damn, he is better than me!