Slide 1

Slide 1 text

Accelerating Knowledge Base Construction
Jaeho Shin
Advisor: Christopher Ré
Readers: Hector Garcia-Molina, Kunle Olukotun
Oral Examiner: Peter Bailis
Chair: Parag Mallick

Slide 2

Slide 2 text

Accelerating Knowledge Base Construction
1. Background: Knowledge Base Construction
2. KBC with DeepDive [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016] github.com/HazyResearch/deepdive
3. Machine Efficiency [VLDB 2015] [VLDB Journal 2016] (Best of VLDB) (SIGMOD Research Highlight Award 2015) github.com/netj/mkmimo
4. Human Productivity [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016] github.com/HazyResearch/mindbender

Slide 3

Slide 3 text

Macroscopic Questions
• Where do human-trafficking crimes happen?
• Which gene mutations cause certain diseases?
• What is the impact of climate change on biodiversity?

Slide 4

Slide 4 text

Macroscopic Questions
• Where do human-trafficking crimes happen?
• Which gene mutations cause certain diseases?
• What is the impact of climate change on biodiversity?
[Figures: Trafficking Distribution; Biodiversity Curve (# Species over Time); Gene-Phenotype Map]

Slide 5

Slide 5 text

Knowledge Bases Can Answer
Natural Science KBs (e.g., Fishbase, PaleoDB, ...)
Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …
Biomedical KBs (e.g., OMIM, MeSH, HPO, GO, ...)
Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …
Law Enforcement KBs (e.g., MEMEX, PolarisProject, ...)
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …
[Figures: Trafficking Distribution; Biodiversity Curve (# Species over Time); Gene-Phenotype Map]

Slide 6

Slide 6 text

Knowledge Bases Can Answer
Structured Data needed!
Natural Science KBs (e.g., Fishbase, PaleoDB, ...)
Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …
Biomedical KBs (e.g., OMIM, MeSH, HPO, GO, ...)
Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …
Law Enforcement KBs (e.g., MEMEX, PolarisProject, ...)
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …
[Figures: Trafficking Distribution; Biodiversity Curve (# Species over Time); Gene-Phenotype Map]

Slide 7

Slide 7 text

Knowledge in Unstructured Sources
Unstructured sources: Pathway Diagrams; Doctor's notes; Medical images; News Articles; Web Postings (Sex Ads, Reviews); Text, Tables, Figures in Scientific Literature
Natural Science KBs
Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …
Biomedical KBs
Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …
Law Enforcement KBs
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …

Slide 8

Slide 8 text

Knowledge Base Construction by Human
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …

Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …

Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …

Slide 9

Slide 9 text

Knowledge Base Construction by Human
• Error-prone
• Slow
• Expensive
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …

Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …

Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …

Slide 10

Slide 10 text

Knowledge Base Construction by Machine
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …

Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …

Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …

Slide 11

Slide 11 text

Knowledge Base Construction by Machine
• Faster
• Cheaper
• Repeatable
• Scalable
MSA | Price | Phone #
SF | $200/hr | 415-555-2242
NY | $150/hr | 646-555-9792
… | …

Gene | Phenotype
HLA-B27 | Ankylosing spondylitis
PAX6 | Optic nerve hypoplasia
… | …

Taxon | Time | Ecosystem
O. mykiss | 1991 | Gulf of Alaska
T. thynnus | 1983 | Caribbean Sea
… | … | …

Slide 12

Slide 12 text

KBC Machine Successes
[Figure labels: Unstructured Information, Knowledge Base]
Successful KBC applications from our group: Genomics, Drug Repurposing, Paleobiology, Anti-Human Trafficking, TAC-KBP 2014 Winner, Material Science
Collaborators including:

Slide 13

Slide 13 text

Iterative KBC with DeepDive
[Figure: Run, Analyze, Improve loop with a human in the loop, between Unstructured Information and Knowledge Base]
• Development Loop improves iteratively
• More Rapid Iteration → Successful KBC
• Goal: High Quality
  • precision
  • recall
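
The two quality goals named on this slide can be made concrete with a small sketch. It assumes the knowledge base and the ground truth are both plain sets of tuples; the function name and the toy facts are hypothetical, not part of DeepDive.

```python
def precision_recall_f1(extracted, truth):
    """Compare extracted facts against ground-truth facts (both iterables of tuples)."""
    extracted, truth = set(extracted), set(truth)
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (assumed): one extracted pair is right, one is wrong, one true pair is missed.
extracted = {("person_a", "person_b"), ("person_a", "thing_c")}
truth = {("person_a", "person_b"), ("person_d", "person_e")}
p, r, f1 = precision_recall_f1(extracted, truth)
```

F1, the harmonic mean of the two, is the quality measure used on the later "Quality Over Time" slide.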

Slide 14

Slide 14 text

Iterative KBC Challenges
Slow Unreliable Humans
• Humans are easily distracted
• Machine-optimized data is not human-friendly
• Collecting metadata is tedious and error-prone
• Exploring data is not interactive and too laborious
Slow Unwieldy Machines
• Normal runs are too slow for small incremental changes
• Time and resources are wasted by inefficient data processing
• Machines waste time executing what humans did not intend

Slide 15

Slide 15 text

Accelerating KBC: Focus of My Work
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
• Execution planning and micro-step operations
• Better serialization for data in motion
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface
• Metrics monitoring dashboard
• Easier training set creation

Slide 16

Slide 16 text

Accelerating KBC: Focus of My Work
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface

Slide 17

Slide 17 text

Accelerating KBC
0. KBC with DeepDive
• Extracting Databases from Dark Data with DeepDive. [SIGMOD 2016 Industrial]
• DeepDive: Declarative Knowledge Base Construction. [SIGMOD Record 2016]
• github.com/HazyResearch/deepdive
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface

Slide 18

Slide 18 text

DeepDive Among Others
Areas: Machine Learning, Data Management, Information Extraction
Systems shown: Caffe, Torch, SystemT, CoreNLP, FACTORIE, MapReduce, Alchemy, BUGS, Xlog, MCDB, ProbKB, Lixto, GATE, Various Rule-based Systems, XPath, regexp

Slide 19

Slide 19 text

KBC with DeepDive
• Relational Databases
• Declarative Languages
  • SQL & DDlog
• Standard Tools Integration
  • as UDF (User-defined Functions)
  • written in Java, Python, Perl
  • e.g., CoreNLP, regular expressions
• Semi-Supervised Machine Learning
  • Probabilistic Graphical Models (Factor Graphs)
  • Approximate Inference (Gibbs Sampling)
  • Learning with Asynchronous SGD (Hogwild!)

Slide 20

Slide 20 text

KBC with DeepDive
• Relational Databases
• Declarative Languages
  • SQL & DDlog
• Standard Tools Integration
  • as UDF (User-defined Functions)
  • written in Java, Python, Perl
  • e.g., CoreNLP, regular expressions
• Semi-Supervised Machine Learning
  • Probabilistic Graphical Models (Factor Graphs)
  • Approximate Inference (Gibbs Sampling)
  • Learning with Asynchronous SGD (Hogwild!)
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 21

Slide 21 text

DeepDive Programs in DDlog 15 President Barack Obama and his wife Michelle Obama step out of Air Force One on Sunday. … Candidate Generation Feature Extraction Supervision Learning / Inference Aspirational Schema

Slide 22

Slide 22 text

DeepDive Programs in DDlog
Input text: President Barack Obama and his wife Michelle Obama step out of Air Force One on Sunday. …
has_spouse?(p1_id text, p2_id text).
person_mention(p_id text, doc_id text, sent_id text).
sentences(doc_id text, sent_id int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], …).
articles(id text, content text).
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 23

Slide 23 text

DeepDive Programs in DDlog
DDlog inherits Datalog, allowing UDFs for integrated data processing
# Candidate generation (UDF in Python: extracting text spans by NER tags)
person_mention += udf_map_candidates(sent, ner_tags, …) :-
    sentences(sent, ner_tags, …).
has_spouse(p1, p2) :-
    person_mention(p1, doc, sent),
    person_mention(p2, doc, sent).
# Feature extraction (UDF in Python: extracting text features)
spouse_feature += udf_extract_features(p1, p2, sent) :-
    has_spouse(p1, p2),
    person_mention(p1, doc, sent),
    sentences(doc, sent).
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema
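
As an illustration of "extracting text spans by NER tags", here is a minimal Python sketch of what a candidate-generation UDF might compute for one sentence. It is a standalone function and does not use DeepDive's actual UDF interface; `person_spans`, the tag name `PERSON`, and the hand-written tags are assumptions for the example.

```python
from itertools import combinations


def person_spans(tokens, ner_tags, target="PERSON"):
    """Return (start, end, text) for each maximal run of tokens tagged `target`."""
    spans = []
    start = None
    # A trailing "O" sentinel closes a run that reaches the end of the sentence.
    for i, tag in enumerate(list(ner_tags) + ["O"]):
        if tag == target and start is None:
            start = i
        elif tag != target and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    return spans


tokens = "President Barack Obama and his wife Michelle Obama step out".split()
# Hand-assigned tags for illustration; a real pipeline would get them from an NLP tool.
ner_tags = ["O", "PERSON", "PERSON", "O", "O", "O", "PERSON", "PERSON", "O", "O"]

mentions = person_spans(tokens, ner_tags)
# Every pair of person mentions in the same sentence is a has_spouse candidate.
candidates = list(combinations(mentions, 2))
```

Pairing mentions within one sentence mirrors the `has_spouse(p1, p2)` rule, which joins two `person_mention` tuples on the same `doc` and `sent`.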

Slide 24

Slide 24 text

DeepDive Programs in DDlog
DDlog inherits Markov Logic Networks (Richardson & Domingos) and Tuffy (Niu et al.)
# Distant Supervision with known facts
has_spouse(p1, p2) = true :-
    freebase_person(p1, e1), freebase_person(p2, e2),
    freebase_marriage(e1, e2).
# Features and domain knowledge
# Features
@weight(f) has_spouse(p1, p2) :- spouse_feature(p1, p2, f).
# Inference rule: Symmetry
@weight(3.0) has_spouse(p1, p2) => has_spouse(p2, p1).
# Inference rule: Only one marriage
@weight(-1.0) has_spouse(p1, p2) => has_spouse(p1, p3) :- p2 != p3.
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema
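
The distant supervision rule can be read as a join between candidate mentions and known facts. The sketch below restates that join in plain Python under stated assumptions: the mention IDs, entity IDs, and the two lookup tables are invented stand-ins for `freebase_person` and `freebase_marriage`.

```python
# Hypothetical stand-in for freebase_person: mention id -> linked entity id.
entity_of = {"m1": "e_barack", "m2": "e_michelle", "m3": "e_air_force_one"}
# Hypothetical stand-in for freebase_marriage: known (entity, entity) facts.
known_marriages = {("e_barack", "e_michelle")}


def distant_label(m1, m2):
    """Return True when the linked entities are a known married pair, else None.

    None means "no label": the candidate stays an unknown for inference,
    matching a rule that only ever asserts `= true`.
    """
    e1, e2 = entity_of.get(m1), entity_of.get(m2)
    if e1 is None or e2 is None:
        return None
    return True if (e1, e2) in known_marriages else None


labels = {pair: distant_label(*pair) for pair in [("m1", "m2"), ("m1", "m3")]}
```

Like the rule as written, this only produces positive labels; negative examples would need separate rules or constraints.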

Slide 25

Slide 25 text

DeepDive’s Semantics
[Figure: Grounding. User relations R(x, y), S(y), Q(x) and inference rules F1: q(x) :- R(x,y) and F2: q(x) :- R(x,y), S(y) are grounded into a factor graph with variables V (r1, r2, r3, s1, s2, q1) and factors F (F1, F2)]
DDlog Relations + DDlog Inference Rules → Grounding → Factor Graph
DDlog Inference Rules:
R(x,y) => Q(x).
R(x,y), S(y) => Q(x).
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 26

Slide 26 text

DeepDive’s Semantics
[Figure: Grounding of user relations (R, S, Q) and inference rules (F1, F2) into a factor graph with variables V and factors F]
Factor graph $(V, F, \hat{w})$
• random variables $V$
• hyperedges of variables $F \subseteq \{ f \mid f \subseteq V \}$
• weight function $\hat{w} : F \times \{0,1\}^{V} \to \mathbb{R}$
• all possible worlds $\mathcal{I} \subseteq \{ I : V \to \{0,1\} \}$
Joint probability:
$\Pr[I] = Z^{-1} \exp\{ \hat{W}(F, I) \}$, where $Z = \sum_{I \in \mathcal{I}} \exp\{ \hat{W}(F, I) \}$ and $\hat{W}(F, I) = \sum_{f \in F} \hat{w}(f, I)$
Marginal probability:
$\Pr[v] = \sum_{I \in \mathcal{I}^{+}} \Pr[I]$, where $\mathcal{I}^{+} = \{ I \in \mathcal{I} \mid I(v) = 1 \}$
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema
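
To make the joint and marginal probability definitions concrete, the following sketch computes exact marginals for a tiny factor graph by enumerating every possible world. It is feasible only for a handful of variables and is not how DeepDive performs inference; the representation of factors as Python functions and the toy "a implies b" factor with weight 2.0 are assumptions for illustration.

```python
import itertools
import math


def exact_marginals(variables, factors):
    """Pr[v = 1] for each variable, by summing over all 2^|V| worlds.

    Each factor is a function from a world (dict: variable -> 0/1) to a real
    weight; the sum over factors plays the role of W(F, I).
    """
    worlds = [dict(zip(variables, bits))
              for bits in itertools.product((0, 1), repeat=len(variables))]
    scores = [math.exp(sum(f(world) for f in factors)) for world in worlds]
    z = sum(scores)  # partition function Z
    return {v: sum(s for world, s in zip(worlds, scores) if world[v] == 1) / z
            for v in variables}


# Toy graph (assumed): a single factor rewarding worlds in which a implies b.
factors = [lambda w: 2.0 if (not w["a"] or w["b"]) else 0.0]
probs = exact_marginals(["a", "b"], factors)
```

Three of the four worlds satisfy the implication, so the factor pushes probability toward `b = 1` and away from `a = 1`.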

Slide 27

Slide 27 text

DeepDive’s Semantics
• Approximate Inference by Gibbs Sampling
• Learning by Asynchronous SGD (Hogwild! [Niu et al.])
• High-performance implementation on modern hardware (NUMA & many cores) (DimmWitted [Zhang et al.])
Factor Graph → Marginal Probabilities Pr[ᐧ]
[Figure: Grounding of user relations (R, S, Q) and inference rules into a factor graph with variables V and factors F]
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema
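
A minimal single-threaded sketch of Gibbs sampling over boolean variables may help show what "approximate inference" estimates. It is not DimmWitted's implementation: it re-evaluates every factor for each variable (a real sampler would touch only adjacent factors), and the factor-as-function representation, parameter names, and toy factor are assumptions.

```python
import math
import random


def gibbs_marginals(variables, factors, sweeps=20000, burn_in=1000, seed=0):
    """Estimate Pr[v = 1] by resampling one variable at a time."""
    rng = random.Random(seed)
    world = {v: rng.randint(0, 1) for v in variables}
    counts = {v: 0 for v in variables}
    for sweep in range(sweeps + burn_in):
        for v in variables:
            # Total weight of the world with v = 1 and with v = 0,
            # holding all other variables fixed.
            world[v] = 1
            w1 = sum(f(world) for f in factors)
            world[v] = 0
            w0 = sum(f(world) for f in factors)
            p1 = 1.0 / (1.0 + math.exp(w0 - w1))  # conditional Pr[v = 1 | rest]
            world[v] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:
            for v in variables:
                counts[v] += world[v]
    return {v: counts[v] / sweeps for v in variables}


# Toy graph (assumed): one factor with weight 2.0 rewarding "a implies b".
factors = [lambda w: 2.0 if (not w["a"] or w["b"]) else 0.0]
estimates = gibbs_marginals(["a", "b"], factors)
```

On a graph this small the estimates can be checked against exhaustive enumeration; on real KBC factor graphs enumeration is infeasible, which is why sampling is used.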

Slide 28

Slide 28 text

Accelerating KBC
0. KBC with DeepDive
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface

Slide 29

Slide 29 text

Accelerating KBC
0. KBC with DeepDive
1. Increasing Machine Efficiency
• Faster incremental runs
  Incremental knowledge base construction using DeepDive. (Best of VLDB) [VLDB 2015] [VLDB Journal 2016] (SIGMOD Research Highlight Award 2015)
• More efficient data processing
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface

Slide 30

Slide 30 text

Problem: Slow Incremental Runs
“News Reading” System: 1.8M documents (… Barack Obama and his wife Michelle Obama …) → 2.4M facts, 6 hours

Slide 31

Slide 31 text

Problem: Slow Incremental Runs
“News Reading” System: 1.8M documents (… Barack Obama and his wife Michelle Obama …) → 2.4M facts, 6 hours
Incremental Updates

Slide 32

Slide 32 text

Problem: Slow Incremental Runs
“News Reading” System: 1.8M documents (… Barack Obama and his wife Michelle Obama …) → 2.4M facts, 6 hours
Incremental Updates:
• +∆1: 6 hours
• +∆1 +∆2: 7 hours
• +∆1 +∆2 +∆3: 8 hours

Slide 33

Slide 33 text

Fast Incremental Runs
“News Reading” System: 1.8M documents (… Barack Obama and his wife Michelle Obama …) → 2.4M facts, 6 hours
Incremental Updates +∆1, +∆2, +∆3: < 30 mins

Slide 34

Slide 34 text

Types of Incremental Updates
New DDlog Rules + Data → Updates to Factor Graph
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 35

Slide 35 text

Types of Incremental Updates
New DDlog Rules + Data → Updates to Factor Graph
[Figure: updates to variables V; highlighted stage: Candidate Generation; legend: + addition]
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 36

Slide 36 text

Types of Incremental Updates
New DDlog Rules + Data → Updates to Factor Graph
[Figure: updates to variables V and factors F; highlighted stage: Learning / Inference; legend: + addition]
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 37

Slide 37 text

Types of Incremental Updates
New DDlog Rules + Data → Updates to Factor Graph
[Figure: updates to variables V and factors F (F → F’); highlighted stage: Feature Extraction; legend: + addition, ⟳ mutation]
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 38

Slide 38 text

Types of Incremental Updates
New DDlog Rules + Data → Updates to Factor Graph
[Figure: updates to variables V and factors F (F → F’, V -/+); highlighted stage: Supervision; legend: + addition, ⟳ mutation, evidence]
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 39

Slide 39 text

Types of Incremental Updates
New DDlog Rules + Data → Incremental Grounding (FG) → ∆FG: Updates to Factor Graph
[Figure: updates to variables V and factors F (F → F’, V -/+); legend: + addition, ⟳ mutation, evidence]
Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

Slide 40

Slide 40 text

Speeding up Incremental Runs
Inputs: Original Factor Graph FG; Incremental Updates

Slide 41

Slide 41 text

Speeding up Incremental Runs
Inputs: Original Factor Graph FG (Once); Incremental Updates (Repeated many times)
1. Materialization: Original Factor Graph FG → Reusable Data

Slide 42

Slide 42 text

Speeding up Incremental Runs
Inputs: Original Factor Graph FG (Once); Incremental Updates (Repeated many times)
1. Materialization: Original Factor Graph FG → Reusable Data
2. Incremental Grounding: Incremental Updates → ∆FG

Slide 43

Slide 43 text

Speeding up Incremental Runs
Inputs: Original Factor Graph FG (Once); Incremental Updates (Repeated many times)
1. Materialization: Original Factor Graph FG → Reusable Data
2. Incremental Grounding: Incremental Updates → ∆FG
3. Incremental Inference: Reusable Data + ∆FG → Pr(∆FG)[ᐧ]

Slide 44

Slide 44 text

Incremental Maintenance: Two Approaches
Original Factor Graph FG is materialized in one of two forms:
1. Sampling-based: Samples of Possible Worlds + ∆FG → Pr(∆FG)[ᐧ]
2. Variational-based: Approximate Factor Graph + ∆FG → Pr(∆FG)[ᐧ]

Slide 45

Slide 45 text

1. Sampling Approach
Materialization: Original Factor Graph FG → Generate many Samples of Possible Worlds (e.g., 00111, 11101, 01001, 01001)
Incremental Inference: Updates to the Factor Graph ∆FG → Reuse with Acceptance Tests (Independent Metropolis-Hastings sampling w.r.t. ∆FG) → Updated Probabilities Pr(∆FG)[ᐧ]
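
A sketch of the reuse idea, assuming the stored worlds really are samples from the original factor graph: with independent Metropolis-Hastings, proposing a stored world and targeting the updated graph makes the original weights cancel, so the acceptance test only needs the weight contributed by ∆FG. The function and variable names, the one-variable toy graph, and the single added factor are assumptions for illustration, not the paper's implementation.

```python
import math
import random


def reuse_samples(stored_worlds, delta_weight, seed=0):
    """Run independent Metropolis-Hastings using materialized samples as proposals.

    `delta_weight(world)` is the total weight of the updated factors only.
    Returns the resulting chain and the fraction of proposals accepted.
    """
    rng = random.Random(seed)
    current = stored_worlds[0]
    chain, accepted = [], 0
    for proposal in stored_worlds[1:]:
        # log of the acceptance ratio: only the changed factors matter.
        log_ratio = delta_weight(proposal) - delta_weight(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
            accepted += 1
        chain.append(current)
    return chain, accepted / max(1, len(stored_worlds) - 1)


# Toy setup (assumed): the original graph has one variable v and no factors,
# so materialized samples are fair coin flips.
sample_rng = random.Random(1)
stored = [{"v": sample_rng.randint(0, 1)} for _ in range(20000)]
# The update adds one factor with weight 1.0 on v = 1.
chain, acceptance_rate = reuse_samples(stored, lambda w: 1.0 if w["v"] == 1 else 0.0)
updated_marginal = sum(w["v"] for w in chain) / len(chain)
```

A low acceptance rate means few stored samples survive the update, which is the quantity the tradeoff slide varies for the sampling approach.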

Slide 46

Slide 46 text

2. Variational Approach
Materialization: Original Factor Graph FG → Approximate (Log-determinant Relaxation) → Simpler Factor Graph FG’ (with only binary potentials)
Incremental Inference: Simpler Factor Graph FG’ + Updates to the Factor Graph ∆FG → Updated Simpler Factor Graph FG’ + ∆FG → Run Gibbs Sampling after update → Updated Probabilities Pr(∆FG)[ᐧ]

Slide 47

Slide 47 text

Quality Over Time
[Figure: Quality (F1 score) vs. Cumulative Execution Time (s), Rerun vs. Incremental, “News Reading” system; simulated incremental development with 6 different rules]
• 22x Overall Speedup
• Consistent ~10x speedup across 5 KBC systems
• 12 hour Materialization: Still >2x Faster
• 99% Overlap (Pr[v] > 0.9)

Slide 48

Slide 48 text

Tradeoff of Two Approaches
• Neither dominates the other (depends on the workload) → Rule-based optimizer
[Figure (synthetic dataset): Incremental Inference Time (s) for Sampling vs. Variational, plotted against (a) Acceptance Rate and (b) Sparsity of Correlations; annotations: slower, larger updates, more correlations]

Slide 49

Slide 49 text

Accelerating KBC
0. KBC with DeepDive
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
  github.com/netj/mkmimo
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface

Slide 50

Slide 50 text

Problem: Inefficient Data Processing
• KBC requires lots of data processing
• DeepDive’s UDFs (user-defined functions)
  • simple, easy to debug
  • language agnostic for integrating arbitrary tools/libraries
  • efficient?
[Figure: Unload → UDF → Load]

Slide 51

Slide 51 text

Naive Parallelization
Executing UDF processes in Parallel
Unload (Batch) → Split → File, File, File → UDF, UDF, UDF → File, File, File → Load (Batch)

Slide 52

Slide 52 text

Naive Parallelization
Executing UDF processes in Parallel
Unload (Batch) → Split → File, File, File → UDF, UDF, UDF → File, File, File → Load (Batch)
• Data Duplication
• Unnecessary I/O

Slide 53

Slide 53 text

Better Parallelization
Executing UDF processes in Parallel, Streaming Data
Unload (Streaming) → Split → Pipe, Pipe, Pipe → UDF, UDF, UDF → File, File, File → Load (Batch)
• Reduced Duplication

Slide 54

Slide 54 text

Better Parallelization
Executing UDF processes in Parallel, Streaming Data
Unload (Streaming) → Split → Pipe, Pipe, Pipe → UDF, UDF, UDF → File, File, File → Load (Batch)
• Reduced Duplication
• Throughput bounded by Stragglers!

Slide 55

Slide 55 text

DeepDive’s Efficient Data Processing
Executing UDF processes in Parallel, Streaming Data, Balancing Load
Unload (Streaming) → mkmimo → Pipe, Pipe, Pipe → UDF, UDF, UDF → Pipe, Pipe, Pipe → mkmimo → Load (Streaming)

Slide 56

Slide 56 text

DeepDive’s Efficient Data Processing
Executing UDF processes in Parallel, Streaming Data, Balancing Load
Unload (Streaming) → mkmimo → Pipe, Pipe, Pipe → UDF, UDF, UDF → Pipe, Pipe, Pipe → mkmimo → Load (Streaming)
• Zero Footprint!
• Speed up ~3x

Slide 57

Slide 57 text

DeepDive’s Efficient Data Processing
Executing UDF processes in Parallel, Streaming Data, Balancing Load
Parallel Unload (Streaming) → mkmimo → Pipe, Pipe, Pipe → UDF, UDF, UDF → Pipe, Pipe, Pipe → mkmimo → Parallel Load (Streaming)
• Zero Footprint!
• Speed up ~3x
• Parallel Unload and Parallel Load: Speed up ~20x
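
Why load balancing matters when one UDF process is a straggler can be seen in a small deterministic simulation. This is only a toy model of record dispatch, not mkmimo's implementation; the per-record costs, record count, and function names are assumptions.

```python
import heapq


def static_split_makespan(num_records, costs):
    """Fixed up-front split: every worker gets an (almost) equal share of records."""
    n = len(costs)
    shares = [num_records // n + (1 if i < num_records % n else 0) for i in range(n)]
    # The run ends when the slowest worker finishes its share.
    return max(share * cost for share, cost in zip(shares, costs))


def balanced_makespan(num_records, costs):
    """Dynamic dispatch: each record goes to whichever worker is free first."""
    free_at = [(0.0, i) for i in range(len(costs))]  # (time worker becomes free, worker)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_records):
        t, i = heapq.heappop(free_at)
        t += costs[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


costs = [1.0, 1.0, 4.0]  # simulated seconds per record; the third worker is a straggler
static_time = static_split_makespan(90, costs)
balanced_time = balanced_makespan(90, costs)
```

With a fixed split the straggler dictates the finish time, while dynamic dispatch lets the faster workers absorb most of the records.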

Slide 58

Slide 58 text

Accelerating KBC
0. KBC with DeepDive
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity
• Error analysis guidelines
  Feature engineering for knowledge base construction. [IEEE DEBul 2014]
• Easier data labeling
• Automatic search interface

Slide 59

Slide 59 text

Problem: Easily Distracted Humans
• Working in an ad-hoc fashion
  • adding useless features
  • perfecting features with little impact
  • solving obvious errors, not common ones
  • fiddling with statistical procedures

Slide 60

Slide 60 text

DeepDive's Macro Error Analysis Support 38 Calibration Plots
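
The data behind a calibration plot can be sketched as follows: predictions are bucketed by predicted probability, and each bucket reports how often its predictions were actually true. This is a generic illustration, not DeepDive's plotting code; the function name, bucket count, and toy predictions are assumptions.

```python
def calibration_buckets(predictions, num_buckets=10):
    """Group (probability, is_true) pairs into equal-width probability buckets.

    Returns (bucket_lower, bucket_upper, count, fraction_true) for each
    non-empty bucket. For a well-calibrated model, fraction_true falls
    close to the bucket's probability range.
    """
    buckets = [[0, 0] for _ in range(num_buckets)]  # [count, true_count]
    for prob, is_true in predictions:
        # A probability of exactly 1.0 belongs to the last bucket.
        idx = min(int(prob * num_buckets), num_buckets - 1)
        buckets[idx][0] += 1
        buckets[idx][1] += int(is_true)
    return [(i / num_buckets, (i + 1) / num_buckets, count, true_count / count)
            for i, (count, true_count) in enumerate(buckets) if count]


# Toy held-out predictions (assumed): (predicted probability, ground-truth label).
predictions = [(0.95, True), (0.92, True), (0.91, False),
               (0.15, False), (0.12, True), (1.0, True)]
rows = calibration_buckets(predictions)
```

Buckets whose fraction of true predictions is far from their probability range point to miscalibration worth investigating during error analysis.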

Slide 61

Slide 61 text

DeepDive's Micro Error Analysis Guideline [IEEE DE Bul 2014, VLDB 2015]
[Flowchart; legend: Correction, Inspection]
Main loop: Start Error Analysis; Find All Error Examples; Any new example?; Take a look at the example, along with its origins; Is it an error? (Not an error: Fix ground truth); What type of error? (False Positive: Debug Precision Error; False Negative: Debug Recall Error); Made Enough Changes?; Rerun DeepDive Pipelines; End Error Analysis
Debug Precision Error: Found a false positive error example; Find relevant features to the example; All features look correct? (No: Fix bug in feature extractors; Yes: Find features with high weights); Why did a feature get high weight? (Skew in train set); Take a look at training set covered by features; Label more negative examples; Add distant supervision rules/constraints; Continue to next example (No more feature)
Debug Recall Error: Found a false negative error example; Extracted as candidate? (No: Fix extractors to extract the example as candidate); Find relevant features to the example; Enough features? (No: Add/Fix feature extractors to extract more features for the example); All features look correct? (No: Fix bug in feature extractors; Yes: Find features with low weights); Why did a feature get low weight? (Sparse feature: Fix feature extractor to cover more cases; Skew in train set); Take a look at training set covered by features; Label more positive examples; Add distant supervision rules/constraints; Continue to next example (No more feature)

Slide 62

Slide 62 text

DeepDive's Micro Error Analysis Guideline [IEEE DE Bul 2014, VLDB 2015]
[Flowchart; legend: Correction, Inspection]
Main loop: Start Error Analysis; Find All Error Examples; Any new example?; Take a look at the example, along with its origins; Is it an error? (Not an error: Fix ground truth); What type of error? (False Positive: Debug Precision Error; False Negative: Debug Recall Error); Made Enough Changes?; Rerun DeepDive Pipelines; End Error Analysis
Debug Precision Error: Found a false positive error example; Find relevant features to the example; All features look correct? (No: Fix bug in feature extractors; Yes: Find features with high weights); Why did a feature get high weight? (Skew in train set); Take a look at training set covered by features; Label more negative examples; Add distant supervision rules/constraints; Continue to next example (No more feature)
Debug Recall Error: Found a false negative error example; Extracted as candidate? (No: Fix extractors to extract the example as candidate); Find relevant features to the example; Enough features? (No: Add/Fix feature extractors to extract more features for the example); All features look correct? (No: Fix bug in feature extractors; Yes: Find features with low weights); Why did a feature get low weight? (Sparse feature: Fix feature extractor to cover more cases; Skew in train set); Take a look at training set covered by features; Label more positive examples; Add distant supervision rules/constraints; Continue to next example (No more feature)
Slow data inspection and metadata collection

Slide 63

Slide 63 text

Accelerating KBC
0. KBC with DeepDive
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
  Mindtagger: a demonstration of data labeling in knowledge base construction. [VLDB 2015 Demo] github.com/HazyResearch/mindbender
• Automatic search interface

Slide 64

Slide 64 text

Problem: Data Model Mismatch
Have data
• Machine-optimized
• in Relational schema
• Normalized
Need data
• Human-friendly
• in Document model
• Denormalized
Large Gap

Slide 65

Slide 65 text

Problem: Painful Data Inspection
• Data not in human-friendly format
• Awful to work with data meant for machine consumption
• Too slow and tedious to write SQL queries to understand or explore data
• Unreliable to collect metadata manually
• Difficult to predict schema of metadata needed for ad-hoc analysis

Slide 66

Slide 66 text

Mindtagger: Tool for Data Labeling
• Interactive user interface
• Human-friendly data presentation
• Quick, reliable metadata collection
• Customizable task template
[Figure: Mindtagger combines a Task Template and Data Items into an Interactive UI that collects Metadata]
Little Gap

Slide 67

Slide 67 text

Mindtagger in Action
• Precision mode (spouse example)
• Ground-truth collection (Tracking down slavery)
• Clustering errors by ad-hoc tags

Slide 68

Slide 68 text

Mindtagger in Action
• Precision mode (spouse example)
• Ground-truth collection (Tracking down slavery)
• Clustering errors by ad-hoc tags

Slide 69

Slide 69 text

Mindtagger in Action
• Precision mode (spouse example)
• Ground-truth collection (Tracking down slavery)
• Clustering errors by ad-hoc tags
Making #machinelearning fun!

Slide 70

Slide 70 text

Problem: Slow Data Exploration
[Figure: Breakdown of time consumption in an error analysis iteration (semiconductor material KBC); x-axis: Error analysis step; y-axes: Time consumed, Cumulative Ratio]

Slide 71

Slide 71 text

Problem: Slow Data Exploration
[Figure: Breakdown of time consumption in an error analysis iteration (semiconductor material KBC); x-axis: Error analysis step; y-axes: Time consumed, Cumulative Ratio; annotations: Productive steps with Mindtagger, Tedious Data Exploration]

Slide 72

Slide 72 text

Automatic Search Interface Generation
• DDlog annotations
• Data denormalization
• Interactive keyword search
@extraction
has_spouse?(@ref(…) p1_id text, @ref(…) p2_id text).
person_mention(@key p_id text, @ref(…) doc_id text, @ref(…) sent_id text).
sentences(@key doc_id text, @key sent_id int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], …).
@source
articles(id text, content text).

Slide 73

Slide 73 text

Increased Productivity, Lowered Bar
[Figure: Time to reach sufficient quality (years, months, weeks) per project, over calendar years 2013 to 2016; DeepDive coevolving over calendar years]
• 1 computer scientist, 2-3 paleontologists: 1.5 years
• 3-5 programmers, physicists: 6 months
• 2 computer scientists: 3 months
• 2 biomedical scientists, 2 computer scientists: 3-4 months
• undergrad students: 4-8 weeks
• 5-6 programmers: 3-4 months
• 1 biomedical scientist, 1 computer scientist: 1 year

Slide 74

Slide 74 text

Future Work in Accelerating KBC
1. Increasing Machine Efficiency
• Scale-out learning/inference
• Scale-out data processing
• Execution optimization across heterogeneous compute resources
2. Increasing Human Productivity
• Training data generation (data programming) → Snorkel [HILDA 2016]
• Interactive rule composing
• Rule auto suggestions

Slide 75

Slide 75 text

Accelerating Knowledge Base Construction
0. KBC with DeepDive [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016]
1. Increasing Machine Efficiency [VLDB 2015] [VLDB Journal 2016] (Best of VLDB) (SIGMOD Research Highlight Award)
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016]
• Error analysis guidelines
• Easier data labeling
• Automatic search interface
github.com/HazyResearch/deepdive
github.com/HazyResearch/mindbender
github.com/netj/mkmimo

Slide 76

Slide 76 text

Acknowledgment 50 Chris Ré Intern Hosts at Google & Facebook Alon Halevy, Chris Olston, Alkis, Avery Ching Hazy Research Group Ce, Sen, Feiran, Alex, Theo, Ivan, Michael FitzPatrick, Henry, Jason, Steve, Stephen, Matteo, Chris Aberger & De Sa Nobu, Masayuki, Yuichi TOSHIBA Gill Bejerano, Johannes, and all DeepDive users Feng, Zifei, Raphael, Xiao LATTICE Reading & Oral Committee Parag Mallick Peter Bailis Kunle Olukotun InfoLab Jure, Jeff, Gio, Rok, Semih, Vikesh, Steven, Hyunjung, Akash, Manas, Saint, Vasilis, Asif, Andrej Mike Cafarella Hector Jennifer Andreas

Slide 77

Slide 77 text

Acknowledgment 51 Friends & Community

Slide 78

Slide 78 text

Acknowledgment 52 Family

Slide 79

Slide 79 text

Acknowledgment 53 Hailey Suyeun

Slide 80

Slide 80 text

Next Stop 54

Slide 81

Slide 81 text

Accelerating Knowledge Base Construction
0. KBC with DeepDive [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016]
1. Increasing Machine Efficiency [VLDB 2015] [VLDB Journal 2016] (Best of VLDB) (SIGMOD Research Highlight Award)
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016]
• Error analysis guidelines
• Easier data labeling
• Automatic search interface
github.com/HazyResearch/deepdive
github.com/HazyResearch/mindbender
github.com/netj/mkmimo