Accelerating Knowledge Base Construction

Jaeho Shin
Advisor: Christopher Ré
Readers: Hector Garcia-Molina, Kunle Olukotun
Oral Examiner: Peter Bailis
Chair: Parag Mallick

Accelerating Knowledge Base Construction
1. Background: Knowledge Base Construction
2. KBC with DeepDive [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016] github.com/HazyResearch/deepdive
3. Machine Efficiency [VLDB 2015] [VLDB Journal 2016] (Best of VLDB) (SIGMOD Research Highlight Award 2015) github.com/netj/mkmimo
4. Human Productivity [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016] github.com/HazyResearch/mindbender

Macroscopic Questions
• Where do human-trafficking crimes happen?
• Which gene mutations cause certain diseases?
• What is the impact of climate change on biodiversity?
Figures: Trafficking Distribution; Biodiversity Curve (# Species over Time); Gene-Phenotype Map

Knowledge Bases Can Answer
Structured data needed!
Natural Science KBs (e.g., Fishbase, PaleoDB, ...):
    Taxon | Time | Ecosystem
    O. mykiss | 1991 | Gulf of Alaska
    T. thynnus | 1983 | Caribbean Sea
    … | … | …
Biomedical KBs (e.g., OMIM, MeSH, HPO, GO, ...):
    Gene | Phenotype
    HLA-B27 | Ankylosing spondylitis
    PAX6 | Optic nerve hypoplasia
    … | …
Law Enforcement KBs (e.g., MEMEX, PolarisProject, ...):
    MSA | Price | Phone #
    SF | $200/hr | 415-555-2242
    NY | $150/hr | 646-555-9792
    … | … | …
Figures: Trafficking Distribution; Biodiversity Curve (# Species over Time); Gene-Phenotype Map

Knowledge in Unstructured Sources
• Target KBs: Natural Science KBs (Taxon, Time, Ecosystem), Biomedical KBs (Gene, Phenotype), Law Enforcement KBs (MSA, Price, Phone #)
• Sources: Text, Tables, Figures in Scientific Literature; Pathway Diagrams; Doctor's notes; Medical images; News Articles; Web Postings (Sex Ads, Reviews)

Knowledge Base Construction by Human
• Error-prone
• Slow
• Expensive
[Figure: the example KB tables (MSA, Price, Phone #; Gene, Phenotype; Taxon, Time, Ecosystem) filled in by hand]

Knowledge Base Construction by Machine
• Faster
• Cheaper
• Repeatable
• Scalable
[Figure: the example KB tables (MSA, Price, Phone #; Gene, Phenotype; Taxon, Time, Ecosystem) filled in by machine]

KBC Machine Successes
Unstructured Information → Knowledge Base
Successful KBC applications from our group:
• Genomics
• Drug Repurposing
• Paleobiology
• Anti-Human Trafficking
• TAC-KBP 2014 Winner
• Material Science
Collaborators including: (collaborator logos)

Iterative KBC with DeepDive
Unstructured Information → Knowledge Base, with a human in the loop: Run, Analyze, Improve
• The development loop improves the result iteratively
• More rapid iteration → successful KBC
• Goal: high quality
  • precision
  • recall

Iterative KBC Challenges
Slow, Unreliable Humans
• Humans are easily distracted
• Machine-optimized data is not human-friendly
• Collecting metadata is tedious and error-prone
• Exploring data is not interactive and too laborious
Slow, Unwieldy Machines
• Normal runs are too slow for small incremental changes
• Time and resources are wasted by inefficient data processing
• Machines waste time executing what the human did not intend

Accelerating KBC: Focus of My Work
1. Increasing Machine Efficiency
• Faster incremental runs
• More efficient data processing
• Execution planning and micro-step operations
• Better serialization for data in motion
2. Increasing Human Productivity
• Error analysis guidelines
• Easier data labeling
• Automatic search interface
• Metrics monitoring dashboard
• Easier training set creation

Accelerating KBC
0. KBC with DeepDive
• Extracting Databases from Dark Data with DeepDive. [SIGMOD 2016 Industrial]
• DeepDive: Declarative Knowledge Base Construction. [SIGMOD Record 2016]
• github.com/HazyResearch/deepdive

DeepDive Among Others
Areas: Machine Learning, Data Management, Information Extraction
Related systems and tools: Caffe, Torch, SystemT, CoreNLP, FACTORIE, MapReduce, Alchemy, BUGS, Xlog, MCDB, ProbKB, Lixto, GATE, various rule-based systems, XPath, regexp

KBC with DeepDive
• Relational Databases
• Declarative Languages
  • SQL & DDlog
• Standard Tools Integration
  • as UDFs (User-defined Functions)
  • written in Java, Python, Perl
  • e.g., CoreNLP, regular expressions
• Semi-Supervised Machine Learning
  • Probabilistic Graphical Models (Factor Graphs)
  • Approximate Inference (Gibbs Sampling)
  • Learning with Asynchronous SGD (HogWild!)
Stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema

DeepDive Programs in DDlog
Example input: President Barack Obama and his wife Michelle Obama step out of Air Force One on Sunday. …
Schema declarations:
    has_spouse?(p1_id text, p2_id text).
    person_mention(p_id text, doc_id text, sent_id text).
    sentences(doc_id text, sent_id int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], …).
    articles(id text, content text).

DeepDive Programs in DDlog
DDlog inherits Datalog, allowing UDFs for integrated data processing.
Candidate generation (UDF in Python: extracting text spans by NER tags):
    person_mention += udf_map_candidates(sent, ner_tags, …) :-
        sentences(sent, ner_tags, …).
    has_spouse(p1, p2) :-
        person_mention(p1, doc, sent),
        person_mention(p2, doc, sent).
Feature extraction (UDF in Python: extracting text features):
    spouse_feature += udf_extract_features(p1, p2, sent) :-
        has_spouse(p1, p2),
        person_mention(p1, doc, sent),
        sentences(doc, sent).
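
As a hedged illustration of the candidate-generation step, the Python sketch below extracts contiguous PERSON spans from a sentence's NER tags and pairs them up as spouse candidates. It is not DeepDive's actual UDF code: the function name `person_spans`, the NER tags assigned to the example sentence, and the choice to pair only distinct mentions are assumptions made for illustration.

```python
from typing import List, Tuple


def person_spans(ner_tags: List[str], label: str = "PERSON") -> List[Tuple[int, int]]:
    """Return (start, end) token ranges, end exclusive, for maximal runs of `label`."""
    spans = []
    start = None
    for i, tag in enumerate(ner_tags):
        if tag == label:
            if start is None:
                start = i  # a new run of PERSON tokens begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(ner_tags)))  # run reaches the end of the sentence
    return spans


# Hypothetical NER output for the example sentence on the slides.
tokens = ["President", "Barack", "Obama", "and", "his", "wife",
          "Michelle", "Obama", "step", "out"]
ner_tags = ["O", "PERSON", "PERSON", "O", "O", "O",
            "PERSON", "PERSON", "O", "O"]

mentions = [" ".join(tokens[s:e]) for s, e in person_spans(ner_tags)]

# Pair up distinct mentions in the same sentence as has_spouse candidates.
candidates = [(a, b) for i, a in enumerate(mentions) for b in mentions[i + 1:]]
print(candidates)
```

Keeping the UDF a plain function over token and tag lists makes it easy to test outside the pipeline, which matches the "simple, easy to debug" goal the deck states for UDFs.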

DeepDive Programs in DDlog
DDlog inherits Markov Logic Networks (Richardson & Domingos) and Tuffy (Niu et al.).
Distant supervision with known facts:
    has_spouse(p1, p2) = true :-
        freebase_person(p1, e1), freebase_person(p2, e2),
        freebase_marriage(e1, e2).
Features and domain knowledge:
    # Features
    @weight(f)
    has_spouse(p1, p2) :- spouse_feature(p1, p2, f).
    # Inference rule: Symmetry
    @weight(3.0)
    has_spouse(p1, p2) => has_spouse(p2, p1).
    # Inference rule: Only one marriage
    @weight(-1.0)
    has_spouse(p1, p2) => has_spouse(p1, p3) :- p2 != p3.

DeepDive’s Semantics
[Figure: grounding. User relations R(x, y), S(y), and Q(x), together with inference rules F1: q(x) :- R(x,y) and F2: q(x) :- R(x,y), S(y), are grounded into a factor graph with variables V (r1, r2, r3, s1, s2, q1) and factors F (F1, F2). Figure note: "Factor function corresponds to Equation 1 in Section 2.4."]
DDlog relations and DDlog inference rules are grounded into the factor graph:
    R(x,y) => Q(x).
    R(x,y), S(y) => Q(x).

DeepDive’s Semantics
Factor graph:
• factor graph $(V, F, \hat{w})$
• random variables $V$
• hyperedges of variables $F \subseteq \{ f \mid f \subseteq V \}$
• weight function $\hat{w} : F \times \{0,1\}^V \to \mathbb{R}$
• all possible worlds $\mathcal{I} \subseteq \{ I : V \to \{0,1\} \}$
Joint probability:
$$\Pr[I] = Z^{-1} \exp\{ \hat{W}(F, I) \}, \quad \text{where } Z = \sum_{I \in \mathcal{I}} \exp\{ \hat{W}(F, I) \} \text{ and } \hat{W}(F, I) = \sum_{f \in F} \hat{w}(f, I)$$
Marginal probability:
$$\Pr[v] = \sum_{I \in \mathcal{I}^+} \Pr[I], \quad \text{where } \mathcal{I}^+ = \{ I \in \mathcal{I} \mid I(v) = 1 \}$$
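
To make the definitions above concrete, here is a minimal Python sketch that computes Pr[v] exactly by enumerating every possible world of a tiny factor graph. The two factors and their weights are hypothetical, loosely modeled on a feature rule and an implication rule. Enumeration is exponential in |V|, so this only illustrates the semantics and is not how DeepDive performs inference.

```python
import itertools
import math


def exact_marginals(variables, factors):
    """Compute Pr[v] = sum of Pr[I] over worlds I with I(v) = 1, by brute force.

    `factors` is a list of functions mapping a world (dict: variable -> 0/1)
    to a weight, i.e. one w_hat(f, I) per factor f.
    """
    worlds = [dict(zip(variables, bits))
              for bits in itertools.product((0, 1), repeat=len(variables))]
    # Unnormalized probability exp(W_hat(F, I)) for each world I.
    scores = [math.exp(sum(f(w) for f in factors)) for w in worlds]
    z = sum(scores)  # partition function Z
    return {v: sum(s for w, s in zip(worlds, scores) if w[v] == 1) / z
            for v in variables}


# Hypothetical factors: a feature of weight 1.0 on `a`, and an implication
# a => b that contributes weight 3.0 whenever it is satisfied.
def feature_on_a(world):
    return 1.0 * world["a"]


def a_implies_b(world):
    return 3.0 if (not world["a"]) or world["b"] else 0.0


marginals = exact_marginals(["a", "b"], [feature_on_a, a_implies_b])
print(marginals)
```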

DeepDive’s Semantics
Factor graph → marginal probabilities Pr[·]
• Approximate inference by Gibbs sampling
• Learning by asynchronous SGD (Hogwild! [Niu et al.])
• High-performance implementation on modern hardware (NUMA & many cores) (DimmWitted [Zhang et al.])
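
The Gibbs sampling step can be sketched in a few lines of single-threaded Python. This is an illustrative toy under stated assumptions, not the DimmWitted implementation: the factor weights, the number of sweeps, and the burn-in length are all hypothetical choices.

```python
import math
import random

VARIABLES = ["a", "b"]


def total_weight(world):
    """Hypothetical factors: weight 1.0 on a, and weight 3.0 when a => b holds."""
    implication_holds = (not world["a"]) or world["b"]
    return 1.0 * world["a"] + (3.0 if implication_holds else 0.0)


def gibbs_marginals(variables, weight, sweeps=20000, burn_in=1000, seed=0):
    """Estimate Pr[v] by resampling one variable at a time given all the others."""
    rng = random.Random(seed)
    world = {v: rng.randint(0, 1) for v in variables}
    counts = {v: 0 for v in variables}
    for sweep in range(burn_in + sweeps):
        for v in variables:
            # Compare the total weight of the world with v = 1 and with v = 0.
            world[v] = 1
            w1 = weight(world)
            world[v] = 0
            w0 = weight(world)
            # Conditional probability of v = 1 given the rest of the world.
            p1 = 1.0 / (1.0 + math.exp(w0 - w1))
            world[v] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:
            for v in variables:
                counts[v] += world[v]
    return {v: counts[v] / sweeps for v in variables}


estimates = gibbs_marginals(VARIABLES, total_weight)
print(estimates)
```

For clarity the sketch recomputes the total weight of the whole world at every step. A practical sampler would only evaluate the factors that touch the variable being resampled.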

Accelerating KBC
1. Increasing Machine Efficiency: Faster incremental runs
• Incremental knowledge base construction using DeepDive. (Best of VLDB) [VLDB 2015] [VLDB Journal 2016] (SIGMOD Research Highlight Award 2015)

Problem: Slow Incremental Runs
“News Reading” system: 1.8M documents (… Barack Obama and his wife Michelle Obama …) → 2.4M facts, 6 hours
Incremental updates:
• +∆1: 6 hours
• +∆1 +∆2: 7 hours
• +∆1 +∆2 +∆3: 8 hours

Fast Incremental Runs
“News Reading” system: 1.8M documents (… Barack Obama and his wife Michelle Obama …) → 2.4M facts, 6 hours
Incremental updates (+∆1, +∆2, +∆3): < 30 mins

Types of Incremental Updates
New DDlog rules + data → updates to the factor graph (∆FG); incremental grounding computes ∆FG against the existing FG.
• Candidate Generation: addition of variables (V + V …)
• Learning / Inference: addition of factors (+ F)
• Feature Extraction: mutation of factors (F ⟳ F’)
• Supervision: change of evidence on variables (V -/+)

Speeding up Incremental Runs
1. Materialization (once): original factor graph FG → reusable data
2. Incremental Grounding (repeated many times): incremental updates → ∆FG
3. Incremental Inference (repeated many times): reusable data + ∆FG → Pr(∆FG)[·]

Incremental Maintenance: Two Approaches
Original factor graph FG + ∆FG → Pr(∆FG)[·]
1. Sampling-based: materialize samples of possible worlds
2. Variational-based: materialize an approximate factor graph

1. Sampling Approach
• Materialization: generate many samples of possible worlds (e.g., 00111, 11101, 01001, 01001) from the original factor graph FG
• Incremental inference: given updates to the factor graph ∆FG, reuse the samples with acceptance tests (independent Metropolis-Hastings sampling w.r.t. ∆FG) to obtain the updated probabilities Pr(∆FG)[·]
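
The reuse-with-acceptance-tests idea can be sketched as an independent Metropolis-Hastings chain whose proposals are the stored samples. The sketch assumes the update only adds factors over existing variables, so the target is proportional to the original probability times exp(∆W) and the acceptance probability reduces to min(1, exp(∆W(I') - ∆W(I))). All weights are hypothetical, and the stored samples are drawn exactly from a two-variable toy graph rather than by Gibbs sampling.

```python
import itertools
import math
import random

VARIABLES = ["a", "b"]


def original_weight(world):
    """Total weight in the original factor graph FG (hypothetical factors)."""
    implication_holds = (not world["a"]) or world["b"]
    return 1.0 * world["a"] + (3.0 if implication_holds else 0.0)


def delta_weight(world):
    """Total weight of the factors added by the update ∆FG (hypothetical)."""
    return 1.5 * world["a"]


def enumerate_worlds(weight):
    """Return all possible worlds and their unnormalized probabilities."""
    worlds = [dict(zip(VARIABLES, bits))
              for bits in itertools.product((0, 1), repeat=len(VARIABLES))]
    return worlds, [math.exp(weight(w)) for w in worlds]


def materialize(n_samples, rng):
    """Materialization: draw samples of possible worlds from the original FG."""
    worlds, scores = enumerate_worlds(original_weight)
    return rng.choices(worlds, weights=scores, k=n_samples)


def incremental_inference(stored, rng):
    """Walk through the stored samples, accepting or rejecting each as a proposal."""
    current = stored[0]
    counts = {v: 0 for v in VARIABLES}
    accepted = 0
    for proposal in stored[1:]:
        # Only the change ∆FG is evaluated; the original factors cancel out.
        log_ratio = delta_weight(proposal) - delta_weight(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
            accepted += 1
        for v in VARIABLES:
            counts[v] += current[v]
    steps = len(stored) - 1
    return {v: counts[v] / steps for v in VARIABLES}, accepted / steps


rng = random.Random(0)
stored_samples = materialize(20000, rng)
updated, acceptance_rate = incremental_inference(stored_samples, rng)
print(updated, acceptance_rate)
```

The acceptance rate reported at the end indicates how far the update moved the distribution: the larger the change, the more stored samples are rejected and the less the materialized samples help.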

2. Variational Approach
• Materialization: approximate the original factor graph FG by a simpler factor graph FG’ (with only binary potentials) using a log-determinant relaxation
• Incremental inference: apply the updates ∆FG to obtain the updated simpler factor graph FG’ + ∆FG, then run Gibbs sampling after the update to obtain the updated probabilities Pr(∆FG)[·]

Quality Over Time
[Plot: Quality (F1 score) vs. Cumulative Execution Time (s, log scale), Rerun vs. Incremental, for the “News Reading” system; simulated incremental development with 6 different rules.]
• 22x overall speedup
• 12-hour materialization: still >2x faster
• 99% overlap (Pr[v] > 0.9)
• Consistent ~10x speedup across 5 KBC systems

Tradeoff of Two Approaches
Neither dominates the other (depends on the workload) → rule-based optimizer
[Plots (synthetic dataset): Incremental Inference Time (s) for Sampling vs. Variational, against (a) Acceptance Rate and (b) Sparsity of Correlations; annotations: (larger updates), (more correlations), (slower).]

Accelerating KBC
1. Increasing Machine Efficiency: More efficient data processing
• github.com/netj/mkmimo

Problem: Inefficient Data Processing
• KBC requires lots of data processing
• DeepDive’s UDFs (user-defined functions)
  • simple, easy to debug
  • language agnostic for integrating arbitrary tools/libraries
  • efficient?
Data flow: Unload → UDF → Load

Naive Parallelization
Executing UDF processes in parallel
Data flow: Unload (batch) → Split → Files → UDF processes → Files → Load (batch)
• Data duplication
• Unnecessary I/O

Better Parallelization
Executing UDF processes in parallel, streaming data
Data flow: Unload (streaming) → Split → Pipes → UDF processes → Files → Load (batch)
• Reduced duplication
• Throughput bounded by stragglers!

DeepDive’s Efficient Data Processing
Executing UDF processes in parallel, streaming data, balancing load
Data flow: Parallel Unload (streaming) → mkmimo → Pipes → UDF processes → Pipes → mkmimo → Parallel Load (streaming)
• Zero footprint!
• Speed up ~3x
• Parallel unload and parallel load: speed up ~20x
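
Why balancing load matters when one UDF process is a straggler can be seen in a small simulation. The Python sketch below compares a static even split of records against dispatching each record to whichever worker becomes free first. It is a toy model with hypothetical per-record costs, not a description of how mkmimo itself is implemented.

```python
import heapq


def static_makespan(n_records, costs):
    """Split records evenly up front; the slowest worker determines the finish time."""
    k = len(costs)
    counts = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    return max(count * cost for count, cost in zip(counts, costs))


def dynamic_makespan(n_records, costs):
    """Hand each record to the worker that becomes free earliest."""
    free_at = [(0.0, i) for i in range(len(costs))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_records):
        t, i = heapq.heappop(free_at)  # earliest available worker
        t += costs[i]                  # it is busy for one record's worth of time
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


# Hypothetical setup: three fast UDF processes and one 4x slower straggler.
costs = [1.0, 1.0, 1.0, 4.0]
static_time = static_makespan(1200, costs)
dynamic_time = dynamic_makespan(1200, costs)
print(static_time, dynamic_time)
```

With a static split the straggler receives as many records as every other worker, so the whole job waits for it. With dynamic dispatch the straggler simply ends up processing fewer records.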

Accelerating KBC
2. Increasing Human Productivity: Error analysis guidelines
• Feature engineering for knowledge base construction. [IEEE DEBul 2014]

Problem: Easily Distracted Humans
• Working in ad-hoc fashion
  • adding useless features
  • perfecting features with little impact
  • solving obvious errors, not common ones
  • fiddling with statistical procedure

DeepDive's Macro Error Analysis Support
• Calibration Plots
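
A calibration plot compares predicted probabilities with observed accuracy on held-out labels. As a minimal sketch of the data behind such a plot, the Python below buckets predictions by probability and reports, per bucket, how many predictions fall in it and what fraction are actually true. The bucket count, the function name `calibration_buckets`, and the example predictions are assumptions for illustration, not DeepDive's own code.

```python
def calibration_buckets(predictions, n_buckets=10):
    """Group (probability, is_true) pairs into equal-width probability buckets.

    Returns one (count, accuracy) pair per bucket; accuracy is None for an
    empty bucket. A well-calibrated model has accuracy close to each
    bucket's probability range.
    """
    counts = [0] * n_buckets
    trues = [0] * n_buckets
    for prob, is_true in predictions:
        # A probability of exactly 1.0 belongs to the last bucket.
        bucket = min(int(prob * n_buckets), n_buckets - 1)
        counts[bucket] += 1
        trues[bucket] += 1 if is_true else 0
    return [(c, (t / c) if c else None) for c, t in zip(counts, trues)]


# Hypothetical held-out predictions: (predicted probability, ground truth).
held_out = [(0.05, False), (0.95, True), (0.92, True), (0.91, False), (1.0, True)]
buckets = calibration_buckets(held_out)
for i, (count, accuracy) in enumerate(buckets):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}) count={count} accuracy={accuracy}")
```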

DeepDive's Micro Error Analysis Guideline [IEEE DE Bul 2014, VLDB 2015]
[Flowchart; legend: Correction steps and Inspection steps.]
Main loop:
• Start Error Analysis
• Find All Error Examples
• Any new example? (Yes / No)
• Take a look at the example, along with its origins
• Is it an error? Not an error: Fix ground truth
• What type of error? False Positive: Debug Precision Error. False Negative: Debug Recall Error
• Made Enough Changes? (Yes / No)
• Rerun DeepDive Pipelines
• End Error Analysis
Debug Precision Error (found a false positive error example):
• Find relevant features to the example
• All features look correct? No: Fix bug in feature extractors
• Yes: Find features with high weights
• Why did a feature get high weight? Skew in train set: Label more negative examples; or take a look at the training set covered by features and add distant supervision rules/constraints
• No more feature: Continue to next example
Debug Recall Error (found a false negative error example):
• Extracted as candidate? No: Fix extractors to extract the example as candidate
• Yes: Find relevant features to the example
• Enough features? No: Add/Fix feature extractors to extract more features for the example
• All features look correct? No: Fix bug in feature extractors
• Yes: Find features with low weights
• Why did a feature get low weight? Sparse feature: Fix feature extractor to cover more cases. Skew in train set: Label more positive examples; or take a look at the training set covered by features and add distant supervision rules/constraints
• No more feature: Continue to next example
Bottleneck: slow data inspection and metadata collection

Accelerating KBC
2. Increasing Human Productivity: Easier data labeling
• Mindtagger: a demonstration of data labeling in knowledge base construction. [VLDB 2015 Demo]
• github.com/HazyResearch/mindbender

Problem: Data Model Mismatch
Have data:
• Machine-optimized
• in relational schema
• Normalized
Need data:
• Human-friendly
• in document model
• Denormalized
Large gap between the two

Problem: Painful Data Inspection
• Data is not in a human-friendly format
• Awful to work with data meant for machine consumption
• Too slow and tedious to write SQL queries to understand or explore data
• Unreliable to collect metadata manually
• Difficult to predict the schema of metadata needed for ad-hoc analysis

Mindtagger: Tool for Data Labeling
• Interactive user interface
• Human-friendly data presentation
• Quick, reliable metadata collection
• Customizable task template
Mindtagger: Data Items + Task Template → Interactive UI → Metadata (little gap)

Mindtagger in Action
• Precision mode (spouse example)
• Ground-truth collection (Tracking down slavery)
• Clustering errors by ad-hoc tags
Making #machinelearning fun!

Problem: Slow Data Exploration
[Chart: breakdown of time consumption in an error analysis iteration (semiconductor material KBC); time consumed and cumulative ratio per error analysis step.]
• Productive steps with Mindtagger
• Tedious data exploration

Automatic Search Interface Generation
• Data denormalization
• Interactive keyword search
• DDlog annotations:
    @extraction
    has_spouse?(@ref(…) p1_id text, @ref(…) p2_id text).
    person_mention(@key p_id text, @ref(…) doc_id text, @ref(…) sent_id text).
    sentences(@key doc_id text, @key sent_id int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], …).
    @source
    articles(id text, content text).
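
As a rough sketch of what data denormalization followed by keyword search could look like, the Python below follows the references from each `has_spouse` candidate to its mentions and sentence, builds one self-contained document per candidate, and filters documents by keyword. The in-memory dictionaries, the `text` field on mentions, and the functions `denormalize` and `search` are hypothetical stand-ins. The actual tool generates its interface from the DDlog annotations and is not reproduced here.

```python
# Hypothetical normalized relations, keyed the way @key columns suggest.
sentences = {
    ("d1", 0): {"tokens": ["President", "Barack", "Obama", "and", "his", "wife",
                           "Michelle", "Obama", "step", "out"]},
}
person_mention = {
    "m1": {"doc_id": "d1", "sent_id": 0, "text": "Barack Obama"},
    "m2": {"doc_id": "d1", "sent_id": 0, "text": "Michelle Obama"},
}
has_spouse = [("m1", "m2")]


def denormalize(has_spouse, person_mention, sentences):
    """Follow @ref-style references to build one nested document per candidate."""
    docs = []
    for p1_id, p2_id in has_spouse:
        m1, m2 = person_mention[p1_id], person_mention[p2_id]
        sentence = sentences[(m1["doc_id"], m1["sent_id"])]
        docs.append({
            "p1": dict(m1, p_id=p1_id),
            "p2": dict(m2, p_id=p2_id),
            "sentence": " ".join(sentence["tokens"]),
        })
    return docs


def search(docs, keyword):
    """Case-insensitive keyword search over the denormalized sentence text."""
    needle = keyword.lower()
    return [d for d in docs if needle in d["sentence"].lower()]


docs = denormalize(has_spouse, person_mention, sentences)
print(search(docs, "wife"))
```

Denormalizing once up front means a reviewer can search and read each candidate with its full context, instead of writing a join by hand for every question.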

Increased Productivity, Lowered Bar
DeepDive coevolving over calendar years (2013, 2014, 2015, 2016)
[Chart: time to reach sufficient quality (years, months, weeks) per project team]
    Team | Time to reach sufficient quality
    1 computer scientist, 2-3 paleontologists | 1.5 years
    3-5 programmers, physicists | 6 months
    2 computer scientists | 3 months
    2 biomedical scientists, 2 computer scientists | 3-4 months
    undergrad students | 4-8 weeks
    5-6 programmers | 3-4 months
    1 biomedical scientist, 1 computer scientist | 1 year

Future Work in Accelerating KBC
1. Increasing Machine Efficiency
• Scale-out learning/inference
• Scale-out data processing
• Execution optimization across heterogeneous compute resources
2. Increasing Human Productivity
• Training data generation (data programming) → Snorkel [HILDA 2016]
• Interactive rule composing
• Rule auto-suggestions

Accelerating Knowledge Base Construction
0. KBC with DeepDive [SIGMOD Record 2016] [SIGMOD 2016 Industrial Track] github.com/HazyResearch/deepdive
1. Increasing Machine Efficiency (Best of VLDB) [VLDB 2015] [VLDB Journal 2016] (SIGMOD Research Highlight Award) github.com/netj/mkmimo
• Faster incremental runs
• More efficient data processing
2. Increasing Human Productivity [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016] github.com/HazyResearch/mindbender
• Error analysis guidelines
• Easier data labeling
• Automatic search interface

Acknowledgment
• Chris Ré
• Intern Hosts at Google & Facebook: Alon Halevy, Chris Olston, Alkis, Avery Ching
• Hazy Research Group: Ce, Sen, Feiran, Alex, Theo, Ivan, Michael FitzPatrick, Henry, Jason, Steve, Stephen, Matteo, Chris Aberger & De Sa
• Nobu, Masayuki, Yuichi (TOSHIBA)
• Gill Bejerano, Johannes, and all DeepDive users
• Feng, Zifei, Raphael, Xiao (LATTICE)
• Reading & Oral Committee: Parag Mallick, Peter Bailis, Kunle Olukotun
• InfoLab: Jure, Jeff, Gio, Rok, Semih, Vikesh, Steven, Hyunjung, Akash, Manas, Saint, Vasilis, Asif, Andrej
• Mike Cafarella, Hector, Jennifer, Andreas

Acknowledgment: Friends & Community

Acknowledgment: Family

Acknowledgment: Hailey Suyeun

Next Stop
