Accelerating Knowledge Base Construction

Jaeho Shin
August 11, 2016

Transcript

  1. Accelerating Knowledge Base Construction
    Jaeho Shin
    Advisor: Christopher Ré
    Readers: Hector Garcia-Molina, Kunle Olukotun
    Oral Examiner: Peter Bailis
    Chair: Parag Mallick

  2. Accelerating Knowledge Base Construction
    1. Background: Knowledge Base Construction
    2. KBC with DeepDive
    [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016]
    github.com/HazyResearch/deepdive
    3. Machine Efficiency
    [VLDB 2015] [VLDB Journal 2016]
    (Best of VLDB) (SIGMOD Research Highlight Award 2015)
    github.com/netj/mkmimo
    4. Human Productivity
    [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016]
    github.com/HazyResearch/mindbender

  4. Macroscopic Questions
    • Where do human-trafficking crimes happen? (Trafficking Distribution)
    • Which gene mutations cause certain diseases? (Gene-Phenotype Map)
    • What is the impact of climate change on biodiversity? (Biodiversity Curve: # Species over Time)

  6. Knowledge Bases Can Answer
    Natural Science KBs (e.g., Fishbase, PaleoDB, ...)
      Taxon      | Time | Ecosystem
      O. mykiss  | 1991 | Gulf of Alaska
      T. thynnus | 1983 | Caribbean Sea
      …          | …    | …
    Biomedical KBs (e.g., OMIM, MeSH, HPO, GO, ...)
      Gene    | Phenotype
      HLA-B27 | Ankylosing spondylitis
      PAX6    | Optic nerve hypoplasia
      …       | …
    Law Enforcement KBs (e.g., MEMEX, PolarisProject, ...)
      MSA | Price   | Phone #
      SF  | $200/hr | 415-555-2242
      NY  | $150/hr | 646-555-9792
      …   | …       | …
    Answers: Trafficking Distribution, Biodiversity Curve (# Species over Time), Gene-Phenotype Map
    Structured Data needed!

  7. Knowledge in Unstructured Sources
    Unstructured sources:
    • Text, Tables, Figures in Scientific Literature
    • Pathway Diagrams
    • Doctor's notes
    • Medical images
    • News Articles
    • Web Postings (Sex Ads, Reviews)
    Target KBs: Natural Science KBs, Biomedical KBs, Law Enforcement KBs

  9. Knowledge Base Construction by Human
    • Error-prone
    • Slow
    • Expensive

  11. Knowledge Base Construction by Machine
    • Faster
    • Cheaper
    • Repeatable
    • Scalable

  12. KBC Machine Successes
    Unstructured Information → Knowledge Base
    Successful KBC applications from our group:
    • Genomics
    • Drug Repurposing
    • Paleobiology
    • Anti-Human Trafficking
    • TAC-KBP 2014 Winner
    • Material Science
    Collaborators including: [images]

  13. Iterative KBC with DeepDive
    Unstructured Information → Knowledge Base
    Human in the loop: Run, Analyze, Improve
    • Development Loop improves iteratively
    • More Rapid Iteration → Successful KBC
    • Goal: High Quality
      • precision
      • recall
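
The two quality goals named on this slide can be made concrete with a small sketch. The function name `precision_recall_f1` and the example facts are hypothetical and only illustrate the standard definitions; nothing here is taken from DeepDive itself.

```python
def precision_recall_f1(extracted, ground_truth):
    """Compare a set of extracted facts against a set of ground-truth facts."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    # Precision: fraction of extracted facts that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of correct facts that were extracted.
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


# Hypothetical spouse facts for illustration only.
extracted_facts = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
true_facts = {("a", "b"), ("c", "d"), ("x", "y")}
p, r, f = precision_recall_f1(extracted_facts, true_facts)
```

Each development iteration would aim to move one of these numbers without hurting the other.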

  14. Iterative KBC Challenges
    Slow Unreliable Humans
    • Humans are easily distracted
    • Machine-optimized data is not human-friendly
    • Collecting metadata is tedious and error-prone
    • Exploring data is not interactive and too laborious
    Slow Unwieldy Machines
    • Normal runs are too slow for small incremental changes
    • Time and resources are wasted by inefficient data processing
    • Machines waste time executing what humans did not intend

  15. Accelerating KBC: Focus of My Work
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    • Execution planning and micro-step operations
    • Better serialization for data in motion
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface
    • Metrics monitoring dashboard
    • Easier training set creation

  17. Accelerating KBC
    0. KBC with DeepDive
    • Extracting Databases from Dark Data with DeepDive. [SIGMOD 2016 Industrial]
    • DeepDive: Declarative Knowledge Base Construction. [SIGMOD Record 2016]
    • github.com/HazyResearch/deepdive
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface

  18. DeepDive Among Others
    Areas: Machine Learning, Data Management, Information Extraction
    Systems: Caffe, Torch, SystemT, CoreNLP, FACTORIE, MapReduce, Alchemy, BUGS, Xlog, MCDB, ProbKB, Lixto, GATE, Various Rule-based Systems (XPath, regexp)

  20. KBC with DeepDive
    • Relational Databases
    • Declarative Languages
      • SQL & DDlog
    • Standard Tools Integration
      • as UDF (User-defined Functions)
      • written in Java, Python, Perl
      • e.g., CoreNLP, regular expressions
    • Semi-Supervised Machine Learning
      • Probabilistic Graphical Models (Factor Graphs)
      • Approximate Inference (Gibbs Sampling)
      • Learning with Asynchronous SGD (HogWild!)
    Stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

  22. DeepDive Programs in DDlog
    Example input: “President Barack Obama and his wife Michelle Obama step out of Air Force One on Sunday.”
    Aspirational Schema:
      has_spouse?(p1_id text,
                  p2_id text).
      person_mention(p_id text,
                     doc_id text,
                     sent_id text).
      sentences(doc_id text, sent_id int,
                tokens text[], lemmas text[],
                pos_tags text[], ner_tags text[], …).
      articles(id text, content text).

  23. DeepDive Programs in DDlog
    DDlog inherits Datalog, allowing UDFs for integrated data processing
    Candidate Generation (UDF in Python: extracting text spans by NER tags):
      person_mention +=
        udf_map_candidates(sent, ner_tags, …) :-
        sentences(sent, ner_tags, …).
      has_spouse(p1, p2) :-
        person_mention(p1, doc, sent),
        person_mention(p2, doc, sent).
    Feature Extraction (UDF in Python: extracting text features):
      spouse_features +=
        udf_extract_features(p1, p2, sent) :-
        has_spouse(p1, p2),
        person_mention(p1, doc, sent),
        sentences(doc, sent).
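
As a rough illustration of what a candidate-generation UDF "extracting text spans by NER tags" might look like, here is a minimal Python sketch. The function name `person_mentions`, the mention-id format, and the `PERSON` tag value are assumptions made for this example; the slides do not show the body of `udf_map_candidates`.

```python
def person_mentions(doc_id, sent_id, tokens, ner_tags):
    """Yield (mention_id, doc_id, sent_id, text) for each maximal run of PERSON tags."""
    start = None
    # A sentinel "O" tag at the end flushes a run that reaches the last token.
    for i, tag in enumerate(list(ner_tags) + ["O"]):
        if tag == "PERSON" and start is None:
            start = i
        elif tag != "PERSON" and start is not None:
            # Hypothetical id scheme: document, sentence, first and last token index.
            mention_id = f"{doc_id}_{sent_id}_{start}_{i - 1}"
            yield mention_id, doc_id, sent_id, " ".join(tokens[start:i])
            start = None


tokens = ("President Barack Obama and his wife Michelle Obama "
          "step out of Air Force One on Sunday .").split()
ner_tags = (["O", "PERSON", "PERSON", "O", "O", "O", "PERSON", "PERSON"]
            + ["O"] * 7 + ["DATE", "O"])
mentions = list(person_mentions("d1", 0, tokens, ner_tags))
```

Every pair of mentions produced for the same sentence would then become a `has_spouse` candidate under the second rule on the slide.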

  24. DeepDive Programs in DDlog
    DDlog inherits Markov Logic Networks (Richardson & Domingos) and Tuffy (Niu et al.)
    Supervision (distant supervision with known facts):
      has_spouse(p1, p2) = true :-
        freebase_person(p1, e1), freebase_person(p2, e2),
        freebase_marriage(e1, e2).
    Learning / Inference (features and domain knowledge):
      # Features
      @weight(f)
      has_spouse(p1, p2) :-
        spouse_features(p1, p2, f).
      # Inference rule: Symmetry
      @weight(3.0)
      has_spouse(p1, p2) => has_spouse(p2, p1).
      # Inference rule: Only one marriage
      @weight(-1.0)
      has_spouse(p1, p2) => has_spouse(p1, p3) :-
        p2 != p3.
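
The distant supervision rule on this slide is essentially a join, which the following sketch spells out in plain Python. It assumes `freebase_person` pairs a mention id with an entity id and `freebase_marriage` pairs two entity ids; those readings, the function name, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def distant_supervision(freebase_person, freebase_marriage):
    """Mirror the rule: has_spouse(p1, p2) = true :- freebase_person(p1, e1),
    freebase_person(p2, e2), freebase_marriage(e1, e2)."""
    mentions_of = defaultdict(list)  # entity id -> mention ids linked to it
    for mention_id, entity_id in freebase_person:
        mentions_of[entity_id].append(mention_id)
    positives = set()
    for e1, e2 in freebase_marriage:
        # Every mention pair whose entities are known to be married gets a positive label.
        for p1 in mentions_of[e1]:
            for p2 in mentions_of[e2]:
                positives.add((p1, p2))
    return positives


# Hypothetical rows for illustration only.
linked = [("m1", "barack"), ("m2", "michelle"), ("m3", "joe")]
marriages = [("barack", "michelle")]
labels = distant_supervision(linked, marriages)
```

Note that the rule as written is directional, so the pair `("m2", "m1")` is not labeled here; the symmetry inference rule on the slide covers that direction during inference.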

  25. DeepDive’s Semantics
    User Relations (DDlog Relations):
      R(x, y): r1 = (a, 0), r2 = (a, 1), r3 = (a, 2)
      S(y): s1 = (0), s2 = (10)
      Q(x): q1 = (a)
    Inference Rules (DDlog Inference Rules):
      F1: q(x) :- R(x,y), written in DDlog as R(x,y) => Q(x).
      F2: q(x) :- R(x,y), S(y), written in DDlog as R(x,y), S(y) => Q(x).
    Grounding produces the Factor Graph:
      Variables V: r1, r2, r3, s1, s2, q1
      Factors F: F1, F2
    Factor function corresponds to Equation 1 in Section 2.4.
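
Grounding, as pictured on this slide, turns each rule into factors by evaluating the rule body as a join. The sketch below does this for the two example rules; the function name `ground` and the tuple-based encoding of variables and factors are hypothetical choices for this example, not DeepDive's internal representation.

```python
def ground(R, S):
    """Ground rules F1 and F2 over relations R(x, y) and S(y).

    Returns a list of factors; each factor is (rule_name, variables), where a
    variable is identified by (relation_name, tuple).
    """
    factors = []
    s_values = set(S)
    for (x, y) in R:
        # F1: R(x,y) => Q(x) yields one factor per R tuple.
        factors.append(("F1", (("R", (x, y)), ("Q", (x,)))))
        # F2: R(x,y), S(y) => Q(x) yields a factor only when the join on y succeeds.
        if y in s_values:
            factors.append(("F2", (("R", (x, y)), ("S", (y,)), ("Q", (x,)))))
    return factors


# The example data from the slide.
R = [("a", 0), ("a", 1), ("a", 2)]
S = [0, 10]
factors = ground(R, S)
f1_count = sum(1 for name, _ in factors if name == "F1")
f2_count = sum(1 for name, _ in factors if name == "F2")
```

With the slide's data, F1 grounds to three factors (one per R tuple) and F2 to one, because only y = 0 appears in both R and S.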

  26. DeepDive’s Semantics
    Factor Graph
    • factor graph $(V, F, \hat{w})$
    • random variables $V$
    • hyperedges of variables $F \subseteq \{f \mid f \subseteq V\}$
    • weight function $\hat{w} : F \times \{0,1\}^V \to \mathbb{R}$
    • all possible worlds $\mathcal{I} \subseteq \{I : V \to \{0,1\}\}$
    • joint probability
      $$\Pr[I] = Z^{-1} \exp\{\hat{W}(F, I)\} \quad\text{where}\quad Z = \sum_{I \in \mathcal{I}} \exp\{\hat{W}(F, I)\}, \qquad \hat{W}(F, I) = \sum_{f \in F} \hat{w}(f, I)$$
    • marginal probability
      $$\Pr[v] = \sum_{I \in \mathcal{I}^+} \Pr[I] \quad\text{where}\quad \mathcal{I}^+ = \{I \in \mathcal{I} \mid I(v) = 1\}$$
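
For a factor graph small enough to enumerate, the joint and marginal probabilities defined on this slide can be computed directly. The sketch below does so by brute force; the function name `marginals` and the two toy factors are hypothetical, and real KBC graphs are far too large for this, which is why approximate inference is used instead.

```python
import itertools
import math


def marginals(variables, factors):
    """Exact marginals Pr[v] by enumerating every possible world.

    `factors` is a list of weight functions w(I) -> float, where I maps each
    variable name to 0 or 1.
    """
    worlds = [dict(zip(variables, bits))
              for bits in itertools.product((0, 1), repeat=len(variables))]
    # Unnormalized probability exp{W(F, I)}, with W(F, I) the sum of factor weights.
    scores = [math.exp(sum(w(I) for w in factors)) for I in worlds]
    Z = sum(scores)  # partition function
    # Pr[v] sums Pr[I] over the worlds in which v is 1.
    return {v: sum(s for I, s in zip(worlds, scores) if I[v] == 1) / Z
            for v in variables}


# Toy graph: a unary factor on "a" and an implication factor a => b.
toy_factors = [
    lambda I: 1.0 if I["a"] == 1 else 0.0,
    lambda I: 2.0 if (I["a"] == 0 or I["b"] == 1) else 0.0,
]
probs = marginals(["a", "b"], toy_factors)
```

The cost grows as 2^|V|, so this is useful only as a reference implementation for checking samplers on tiny graphs.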

  27. DeepDive’s Semantics
    Factor Graph → Marginal Probabilities Pr[ᐧ]
    • Approximate Inference by Gibbs Sampling
    • Learning by Asynchronous SGD (Hogwild! [Niu et al.])
    • High-performance implementation on modern hardware (NUMA & many cores) (DimmWitted [Zhang et al.])
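
To give a feel for Gibbs sampling over binary variables, here is a deliberately naive single-threaded sketch. The function name `gibbs_marginals`, its parameters, and the toy factors are assumptions for this example; it re-evaluates every factor at each step, whereas an efficient sampler would only touch the factors adjacent to the variable being resampled.

```python
import math
import random


def gibbs_marginals(variables, factors, num_sweeps=20000, burn_in=1000, seed=0):
    """Estimate Pr[v] for binary variables by single-site Gibbs sampling."""
    rng = random.Random(seed)
    world = {v: rng.randint(0, 1) for v in variables}
    counts = {v: 0 for v in variables}

    def total_weight(I):
        return sum(w(I) for w in factors)

    for sweep in range(burn_in + num_sweeps):
        for v in variables:
            # Pr[v = 1 | rest] depends only on the weight difference between
            # the world with v = 1 and the world with v = 0.
            world[v] = 1
            w1 = total_weight(world)
            world[v] = 0
            w0 = total_weight(world)
            p1 = 1.0 / (1.0 + math.exp(w0 - w1))
            world[v] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:
            for v in variables:
                counts[v] += world[v]
    return {v: counts[v] / num_sweeps for v in variables}


# Toy graph: a unary factor on "a" and an implication factor a => b.
toy_graph = [
    lambda I: 1.0 if I["a"] == 1 else 0.0,
    lambda I: 2.0 if (I["a"] == 0 or I["b"] == 1) else 0.0,
]
estimates = gibbs_marginals(["a", "b"], toy_graph)
```

On this toy graph the exact marginals are about 0.607 for `a` and 0.731 for `b`, so the estimates should land close to those values.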

  29. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    Incremental knowledge base construction using DeepDive. (Best of VLDB) [VLDB 2015] [VLDB Journal 2016] (SIGMOD Research Highlight Award 2015)
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface

  32. Problem: Slow Incremental Runs
    “News Reading” System: 1.8M documents → 2.4M facts, 6 hours
    (e.g., “… Barack Obama and his wife Michelle Obama”)
    Incremental Updates:
    • +∆1: 6 hours
    • +∆1 +∆2: 7 hours
    • +∆1 +∆2 +∆3: 8 hours

  33. Fast Incremental Runs
    “News Reading” System: 1.8M documents → 2.4M facts, 6 hours
    (e.g., “… Barack Obama and his wife Michelle Obama”)
    Incremental Updates (+∆1, +∆2, +∆3): < 30 mins

  34. Types of Incremental Updates
    New DDlog Rules + Data → Incremental Grounding → Updates to Factor Graph (FG → ∆FG)
    • Candidate Generation: + addition of variables (V V … +)
    • Learning / Inference: + addition of factors (F) connecting variables (V)
    • Feature Extraction: ⟳ mutation of factors (F → F’)
    • Supervision: evidence on variables (V -/+)

  40. Speeding up Incremental Runs
    1. Materialization (Once): Original Factor Graph FG → Reusable Data
    2. Incremental Grounding (Repeated many times): Incremental Updates → ∆FG
    3. Incremental Inference (Repeated many times): Reusable Data + ∆FG → Pr(∆FG)[ᐧ]

  44. Incremental Maintenance: Two Approaches
    Original Factor Graph FG, + ∆FG → Pr(∆FG)[ᐧ]
    1. Sampling-based: Samples of Possible Worlds
    2. Variational-based: Approximate Factor Graph

  45. 1. Sampling Approach
    Materialization: Original Factor Graph FG → Generate many Samples of Possible Worlds (e.g., 00111, 11101, 01001, 01001)
    Incremental Inference: Updates to the Factor Graph ∆FG → Reuse with Acceptance Tests (Independent Metropolis-Hastings sampling w.r.t. ∆FG) → Updated Probabilities Pr(∆FG)[ᐧ]
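
One way to read "reuse with acceptance tests" is as an independent Metropolis-Hastings chain whose proposals are the samples stored at materialization time. The sketch below follows that reading under stated assumptions: the function name `incremental_marginals`, the toy distribution, and the single added factor are all hypothetical, and the stored samples are drawn exactly here only to keep the example self-contained.

```python
import math
import random


def incremental_marginals(stored_samples, delta_weight, seed=0):
    """Independent Metropolis-Hastings using stored samples as proposals.

    stored_samples: possible worlds (dicts of variable -> 0/1) drawn from the
        ORIGINAL factor graph during materialization.
    delta_weight: function I -> float giving the total weight that the update
        adds to world I (the new or changed factors only).
    """
    rng = random.Random(seed)
    current = stored_samples[0]
    counts = {v: 0 for v in current}
    for proposal in stored_samples[1:]:
        # Proposal and target differ only by exp(delta weight), so the
        # original factors cancel out of the acceptance ratio.
        log_ratio = delta_weight(proposal) - delta_weight(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        for v in current:
            counts[v] += current[v]
    n = len(stored_samples) - 1
    return {v: counts[v] / n for v in counts}


# Toy materialization: exact samples from an original distribution over (a, b).
worlds = [{"a": a, "b": b} for a in (0, 1) for b in (0, 1)]
original_weight = lambda I: (1.0 if I["a"] else 0.0) + (2.0 if (not I["a"] or I["b"]) else 0.0)
sampler = random.Random(1)
stored = sampler.choices(worlds, weights=[math.exp(original_weight(I)) for I in worlds], k=20000)

# Update: a new factor that favors b = 1.
updated = incremental_marginals(stored, lambda I: 0.5 if I["b"] else 0.0)
```

Only the factors in the update are evaluated per sample, which is where the savings come from; when the update is large, more proposals are rejected and the stored samples are used up faster.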

  46. 2. Variational Approach
    Materialization: Original Factor Graph FG → Approximate (Log-determinant Relaxation) → Simpler Factor Graph FG’ (with only binary potentials)
    Incremental Inference: Updates to the Factor Graph ∆FG → Updated Simpler Factor Graph FG’ + ∆FG → Run Gibbs Sampling after update → Updated Probabilities Pr(∆FG)[ᐧ]

  47. Quality Over Time
    Consistent ~10x speedup across 5 KBC systems
    [Plot (“News Reading” system): Quality (F1 score) vs. Cumulative Execution Time (s), Rerun vs. Incremental; simulated incremental development with 6 different rules]
    • 22x Overall Speedup
    • 12 hour Materialization; Still >2x Faster
    • 99% Overlap (Pr[v] > 0.9)

  48. Tradeoff of Two Approaches
    Neither dominates the other (depends on the workload)
    → Rule-based optimizer
    [Plots (synthetic dataset): Incremental Inference Time (s) of Sampling vs. Variational against (a) Acceptance Rate and (b) Sparsity of Correlations; axis annotations: (slower), (larger updates), (more correlations)]

  49. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    github.com/netj/mkmimo
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface

  50. Problem: Inefficient Data Processing
    • KBC requires lots of data processing
    • DeepDive’s UDFs (user-defined functions)
      • simple, easy to debug
      • language agnostic for integrating arbitrary tools/libraries
      • efficient?
    Unload → UDF → Load

  52. Naive Parallelization
    Executing UDF processes in Parallel
    Unload (Batch) → Split → File, File, File → UDF, UDF, UDF → File, File, File → Load (Batch)
    • Data Duplication
    • Unnecessary I/O

  54. Better Parallelization
    Executing UDF processes in Parallel, Streaming Data
    Unload (Streaming) → Split → Pipe, Pipe, Pipe → UDF, UDF, UDF → File, File, File → Load (Batch)
    • Reduced Duplication
    • Throughput bounded by Stragglers!

  55. DeepDive’s Efficient Data Processing
    Executing UDF processes in Parallel, Streaming Data, Balancing Load
    Unload (Streaming) → mkmimo → Pipe, Pipe, Pipe → UDF, UDF, UDF → Pipe, Pipe, Pipe → mkmimo → Load (Streaming)
    • Zero Footprint!
    • Speed up ~3x
    • Parallel Unload, Parallel Load: Speed up ~20x
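
Why a fixed split is "bounded by stragglers" while load balancing is not can be seen in a toy scheduling model. The sketch below is not how mkmimo is implemented; both function names, the record costs, and the per-worker slowdown factors are hypothetical and serve only to compare the two policies.

```python
import heapq


def static_split_makespan(costs, slowdowns):
    """Round-robin split up front: each worker gets a fixed share of records."""
    finish = [0.0] * len(slowdowns)
    for i, cost in enumerate(costs):
        w = i % len(slowdowns)
        finish[w] += cost * slowdowns[w]
    return max(finish)


def balanced_makespan(costs, slowdowns):
    """Hand each record to whichever worker becomes free first."""
    free_at = [(0.0, w) for w in range(len(slowdowns))]
    heapq.heapify(free_at)
    latest = 0.0
    for cost in costs:
        t, w = heapq.heappop(free_at)  # earliest available worker
        t += cost * slowdowns[w]
        latest = max(latest, t)
        heapq.heappush(free_at, (t, w))
    return latest


costs = [1.0] * 6              # six records of equal size
slowdowns = [1.0, 1.0, 4.0]    # the third worker is a straggler
static_time = static_split_makespan(costs, slowdowns)
balanced_time = balanced_makespan(costs, slowdowns)
```

In this model the static split waits for the straggler to finish its two records, while the balanced policy gives it only one and lets the faster workers absorb the rest.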

  58. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface
    Feature engineering for knowledge base construction. [IEEE DEBul 2014]

  59. Problem: Easily Distracted Humans
    • Working in an ad-hoc fashion
      • adding useless features
      • perfecting features with little impact
      • solving obvious errors, not common ones
      • fiddling with statistical procedures

  60. DeepDive's Macro Error Analysis Support
    Calibration Plots
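
A calibration plot compares predicted probabilities with how often those predictions turn out to be correct. The sketch below computes the numbers behind such a plot; the function name `calibration_buckets`, the bucket count, and the held-out examples are assumptions for illustration and are not taken from DeepDive's own tooling.

```python
def calibration_buckets(predictions, num_buckets=10):
    """Group (probability, is_correct) pairs into equal-width probability buckets.

    Returns a list of (bucket_low, bucket_high, count, accuracy), with accuracy
    set to None for empty buckets.
    """
    totals = [0] * num_buckets
    correct = [0] * num_buckets
    for prob, is_correct in predictions:
        # Clamp so that prob == 1.0 falls into the last bucket.
        b = min(int(prob * num_buckets), num_buckets - 1)
        totals[b] += 1
        correct[b] += 1 if is_correct else 0
    return [(b / num_buckets, (b + 1) / num_buckets, totals[b],
             correct[b] / totals[b] if totals[b] else None)
            for b in range(num_buckets)]


# Hypothetical held-out predictions: (predicted probability, was it correct?).
held_out = [(0.95, True), (0.92, True), (0.97, False), (1.0, True),
            (0.15, False), (0.12, False), (0.55, True), (0.52, False)]
buckets = calibration_buckets(held_out)
```

For a well-calibrated system, the accuracy in each bucket should sit near the bucket's probability range; large gaps point at where to look during error analysis.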

  61. DeepDive's Micro Error Analysis Guideline
    [Flowchart; legend: Correction, Inspection] [IEEE DE Bul 2014, VLDB 2015]
    Main loop
    • Start Error Analysis
    • Find All Error Examples
    • Any new example? (Yes / No)
    • Take a look at the example, along with its origins
    • Is it an error? What type of error?
      • Not an error: Fix ground truth
      • False Negative: Debug Recall Error
      • False Positive: Debug Precision Error
    • Made Enough Changes? (Yes / No)
    • Rerun DeepDive Pipelines
    • End Error Analysis
    Debug Precision Error
    • Found a false positive error example
    • Find relevant features to the example
    • All features look correct? No: Fix bug in feature extractors
    • Yes: Find features with high weights
    • Why did a feature get high weight?
      • Skew in train set: Take a look at training set covered by features; Label more negative examples; Add distant supervision rules/constraints
    • No more features: Continue to next example
    Debug Recall Error
    • Found a false negative error example
    • Extracted as candidate? No: Fix extractors to extract the example as candidate
    • Yes: Find relevant features to the example
    • Enough features? No: Add/Fix feature extractors to extract more features for the example
    • All features look correct? No: Fix bug in feature extractors
    • Yes: Find features with low weights
    • Why did a feature get low weight?
      • Sparse feature: Fix feature extractor to cover more cases
      • Skew in train set: Take a look at training set covered by features; Label more positive examples; Add distant supervision rules/constraints
    • No more features: Continue to next example

  62. DeepDive's Micro Error Analysis Guideline
    [IEEE DE Bul 2014, VLDB 2015]
    Slow data inspection and metadata collection

  63. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface
    Mindtagger: a demonstration of data labeling in knowledge base construction. [VLDB 2015 Demo]
    github.com/HazyResearch/mindbender

  64. Problem: Data Model Mismatch
    Have data
    • Machine-optimized
    • in Relational schema
    • Normalized
    Need data
    • Human-friendly
    • in Document model
    • Denormalized
    Large Gap

  65. Problem: Painful Data Inspection
    • Data not in human-friendly format
      • Awful to work with data meant for machine consumption
      • Too slow and tedious to write SQL queries to understand or explore data
    • Unreliable to collect metadata manually
      • Difficult to predict schema of metadata needed for ad-hoc analysis

  66. Mindtagger: Tool for Data Labeling
    • Interactive user interface
    • Human-friendly data presentation
    • Quick, reliable metadata collection
    • Customizable task template
    Mindtagger: Data Items, Task Template, Interactive UI, Metadata
    Little Gap

  69. Mindtagger in Action
    • Precision mode (spouse example)
    • Ground-truth collection (Tracking down slavery)
    • Clustering errors by ad-hoc tags
    Making #machinelearning fun!

  71. Problem: Slow Data Exploration
    Breakdown of time consumption in an error analysis iteration (semiconductor material KBC)
    [Chart: Time consumed and Cumulative Ratio per Error analysis step; annotations: Productive steps with Mindtagger, Tedious Data Exploration]

  72. Automatic Search Interface Generation
    • Data denormalization
    • Interactive keyword search
    DDlog annotations:
      @extraction
      has_spouse?(@ref(…) p1_id text,
                  @ref(…) p2_id text).
      person_mention(@key p_id text,
                     @ref(…) doc_id text,
                     @ref(…) sent_id text).
      sentences(@key doc_id text,
                @key sent_id int,
                tokens text[], lemmas text[],
                pos_tags text[], ner_tags text[], …).
      @source
      articles(id text, content text).
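
The idea of denormalizing along key and reference annotations can be sketched as a pair of lookups that fold each extraction and everything it refers to into one document. The function name `denormalize`, the dictionary-based rows, and the output layout are hypothetical; the slide does not show how the real tool resolves `@ref(…)` targets.

```python
def denormalize(has_spouse, person_mention, sentences):
    """Join normalized relations into one self-contained document per extraction."""
    mention_by_id = {m["p_id"]: m for m in person_mention}
    sentence_by_key = {(s["doc_id"], s["sent_id"]): s for s in sentences}

    def expand(p_id):
        # Follow the reference from a mention to its sentence.
        mention = dict(mention_by_id[p_id])
        sentence = sentence_by_key[(mention["doc_id"], mention["sent_id"])]
        mention["sentence"] = " ".join(sentence["tokens"])
        return mention

    return [{"p1": expand(row["p1_id"]), "p2": expand(row["p2_id"])}
            for row in has_spouse]


# Hypothetical rows for illustration only.
spouse_rows = [{"p1_id": "m1", "p2_id": "m2"}]
mention_rows = [{"p_id": "m1", "doc_id": "d1", "sent_id": 0},
                {"p_id": "m2", "doc_id": "d1", "sent_id": 0}]
sentence_rows = [{"doc_id": "d1", "sent_id": 0,
                  "tokens": ["Barack", "and", "Michelle"]}]
documents = denormalize(spouse_rows, mention_rows, sentence_rows)
```

Documents of this shape are what a keyword search index would typically ingest, since each one can be displayed to a person without further joins.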

  73. Increased Productivity, Lowered Bar
    [Chart: Time to reach sufficient quality (years, months, weeks) by calendar year (2013, 2014, 2015, 2016)]
    DeepDive coevolving over calendar years
    • 1 computer scientist, 2-3 paleontologists: 1.5 years
    • 3-5 programmers, physicists: 6 months
    • 2 computer scientists: 3 months
    • 2 biomedical scientists, 2 computer scientists: 3-4 months
    • undergrad students: 4-8 weeks
    • 5-6 programmers: 3-4 months
    • 1 biomedical scientist, 1 computer scientist: 1 year

  74. Future Work in Accelerating KBC
    1. Increasing Machine Efficiency
    • Scale-out learning/inference
    • Scale-out data processing
    • Execution optimization across heterogeneous compute resources
    2. Increasing Human Productivity
    • Training data generation (data programming) → Snorkel [HILDA 2016]
    • Interactive rule composing
    • Rule auto suggestions

  75. Accelerating Knowledge Base Construction
    0. KBC with DeepDive
    [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016]
    github.com/HazyResearch/deepdive
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    (Best of VLDB) [VLDB 2015] [VLDB Journal 2016] (SIGMOD Research Highlight Award)
    github.com/netj/mkmimo
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface
    [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016]
    github.com/HazyResearch/mindbender

  76. Acknowledgment
    Chris Ré
    Intern Hosts at Google & Facebook
    Alon Halevy, Chris Olston, Alkis, Avery Ching
    Hazy Research Group
    Ce, Sen, Feiran, Alex, Theo, Ivan, Michael FitzPatrick, Henry, Jason,
    Steve, Stephen, Matteo, Chris Aberger & De Sa
    Nobu, Masayuki, Yuichi TOSHIBA
    Gill Bejerano, Johannes, and all DeepDive users
    Feng, Zifei, Raphael, Xiao LATTICE
    Reading & Oral Committee
    Parag Mallick
    Peter Bailis
    Kunle Olukotun
    InfoLab
    Jure, Jeff, Gio, Rok, Semih, Vikesh,
    Steven, Hyunjung, Akash, Manas,
    Saint, Vasilis, Asif, Andrej
    Mike Cafarella
    Hector Jennifer Andreas

  77. Acknowledgment
    Friends & Community

  78. Acknowledgment
    Family

  79. Acknowledgment
    Hailey
    Suyeun

  80. Next Stop
