μRaptor: A DOM based system with appetite for hCard elements

Emir Muñoz
October 20, 2014

LD4IE Challenge 2014 Submission


Transcript

  1. μRaptor
    A DOM based system with appetite for hCard elements

  2. μRaptor
    is hungry

  3. (slide with no text)

  4. Training Phase
    Clean the HTML

  5. Training Phase
    Clean the HTML
    DOM sub-trees
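
A minimal sketch of the "DOM sub-trees" step, not μRaptor's actual code. It assumes that the sub-trees of interest are those rooted at an element carrying the hCard root class `vcard`, and that the HTML has already been cleaned so that tags are balanced. The names `HCardSubtrees` and `vcard_subtrees` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HCardSubtrees(HTMLParser):
    """Collect (tag, classes) for every element inside a 'vcard' root."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of currently open non-void tags
        self.root_depth = None  # stack depth of the current vcard root, if any
        self.subtrees = []      # one list of (tag, classes) per vcard root

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Start a new sub-tree at the outermost element with class "vcard".
        if self.root_depth is None and "vcard" in classes and tag not in VOID:
            self.root_depth = len(self.open_tags)
            self.subtrees.append([])
        if self.root_depth is not None:
            self.subtrees[-1].append((tag, classes))
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or not self.open_tags:
            return
        self.open_tags.pop()
        # The sub-tree ends when its root element is closed.
        if self.root_depth is not None and len(self.open_tags) == self.root_depth:
            self.root_depth = None


def vcard_subtrees(html):
    """Return the list of vcard sub-trees found in an HTML string."""
    parser = HCardSubtrees()
    parser.feed(html)
    parser.close()
    return parser.subtrees
```

Calling `vcard_subtrees(page)` on a cleaned page yields one list per hCard candidate, which later steps can inspect for class names and values.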

  6. Training Phase
    Clean the HTML
    DOM sub-trees
    CSS class co-occurrence
    author
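
One way to picture the co-occurrence step is to count how often two CSS class names show up in the same sub-tree. The sketch below is an illustration under that assumption, not the method used in μRaptor; the function name `class_cooccurrence` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def class_cooccurrence(subtrees):
    """Count unordered pairs of CSS classes that appear in the same sub-tree.

    `subtrees` is an iterable of collections of class names, one per sub-tree.
    """
    pairs = Counter()
    for classes in subtrees:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(classes)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy input: the class names found in three sub-trees.
toy = [
    ["vcard", "fn", "email"],
    ["vcard", "fn", "author"],
    ["post", "author"],
]
counts = class_cooccurrence(toy)
```

In this toy data, a class such as `author` that appears together with `vcard` in some sub-trees could be treated as a hint for hCard content on pages that lack the standard class names.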

  7. Training Phase
    Clean the HTML
    DOM sub-trees
    CSS class co-occurrence
    CSS Selectors

  8. Training Phase
    Clean the HTML
    DOM sub-trees
    CSS class co-occurrence
    Value Constraints
    CSS Selectors
    We could determine patterns for emails, for example:
    vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE
    vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com
    vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
    vcard:email mailto : ALPHA @ ALPHANUMERIC . com
    vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE
    … or even for birthdays:
    vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER
    vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
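
The token patterns on this slide can be approximated by mapping each piece of a value to a coarse class. The sketch below is only an illustration: the slide does not define the classes, so the rules here are assumptions (numbers below 100 are `SMALL_NUMBER`, below 10000 are `MEDIUM_NUMBER`, otherwise `NUMBER`; `mailto` and punctuation stay literal). The names `token_class` and `value_pattern` are hypothetical.

```python
import re

# Runs of ASCII letters/digits, or a single non-space punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

# Assumption: some keywords are kept as literals instead of being generalised.
LITERALS = {"mailto"}


def token_class(token):
    """Map one token to a coarse class name (thresholds are assumptions)."""
    if token in LITERALS:
        return token
    if token.isdigit():
        n = int(token)
        if n < 100:
            return "SMALL_NUMBER"
        if n < 10000:
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"
    if token.isalnum():
        return "ALPHANUMERIC"
    return token  # punctuation is kept as-is


def value_pattern(value):
    """Return the space-separated pattern for a property value."""
    return " ".join(token_class(t) for t in TOKEN_RE.findall(value))
```

With these assumed rules, `value_pattern("1985-07-04")` gives `MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER`, the same shape as one of the birthday patterns on the slide.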

  9. Extraction Phase
    Clean the HTML
    DOM sub-trees
    CSS class co-occurrence
    Value Constraints
    Pattern Detection
    CSS Selectors

  10. Extraction Phase
    Clean the HTML
    DOM sub-trees
    CSS class co-occurrence
    Value Constraints
    Pattern Detection
    Elements Qualification
    CSS Selectors

  11. Extraction Phase
    Clean the HTML
    DOM sub-trees
    CSS class co-occurrence
    Value Constraints
    Pattern Detection
    Elements Qualification
    Models Validation
    CSS Selectors
    RDF Model from μRaptor vs. RDF Model from the test set?
    = 0.94
    = 0.7
    = 0.8
  12. μRaptor
    https://github.com/emir-munoz/uraptor

  13. We made the discovery of the new μRaptor species, and I am very pleased that some researchers helped us understand its feeding habits.
    Godzilla is a doll compared to μRaptor! I am currently working on a script for an upcoming movie.
    As a kid I always wanted to see an actual dinosaur. Today my dream comes true.
    Damn, he is better than me!
