Accelerating Knowledge Base Construction

Jaeho Shin
August 11, 2016

Transcript

  1. Accelerating Knowledge Base Construction
    Jaeho Shin
    Advisor: Christopher Ré
    Readers: Hector Garcia-Molina, Kunle Olukotun
    Oral Examiner: Peter Bailis
    Chair: Parag Mallick
  2. Accelerating Knowledge Base Construction
    1. Background: Knowledge Base Construction
    2. KBC with DeepDive [SIGMOD 2016 Industrial Track] [SIGMOD Record 2016] github.com/HazyResearch/deepdive
    3. Machine Efficiency [VLDB 2015] [VLDB Journal 2016] (Best of VLDB) (SIGMOD Research Highlight Award 2015) github.com/netj/mkmimo
    4. Human Productivity [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016] github.com/HazyResearch/mindbender
  4. Macroscopic Questions
    • Where do human-trafficking crimes happen? (figure: Trafficking Distribution)
    • Which gene mutations cause certain diseases? (figure: Gene-Phenotype Map)
    • What is the impact of climate change on biodiversity? (figure: Biodiversity Curve, # species over time)
  6. Knowledge Bases Can Answer (structured data needed!)
    • Natural Science KBs (e.g., Fishbase, PaleoDB, ...) → Biodiversity Curve (# species over time)
        Taxon       | Time | Ecosystem
        O. mykiss   | 1991 | Gulf of Alaska
        T. thynnus  | 1983 | Caribbean Sea
        …           | …    | …
    • Biomedical KBs (e.g., OMIM, MeSH, HPO, GO, ...) → Gene-Phenotype Map
        Gene    | Phenotype
        HLA-B27 | Ankylosing spondylitis
        PAX6    | Optic nerve hypoplasia
        …       | …
    • Law Enforcement KBs (e.g., MEMEX, PolarisProject, ...) → Trafficking Distribution
        MSA | Price   | Phone #
        SF  | $200/hr | 415-555-2242
        NY  | $150/hr | 646-555-9792
        …   | …       | …
  7. Knowledge in Unstructured Sources
    • Sources: pathway diagrams; doctor's notes; medical images; news articles; web postings (sex ads, reviews); text, tables, and figures in scientific literature
    • Targets: Natural Science KBs (Taxon, Time, Ecosystem), Biomedical KBs (Gene, Phenotype), Law Enforcement KBs (MSA, Price, Phone #)
  9. Knowledge Base Construction by Human (filling the Taxon/Time/Ecosystem, Gene/Phenotype, and MSA/Price/Phone # tables by hand)
    • Error-prone
    • Slow
    • Expensive
  11. Knowledge Base Construction by Machine (filling the same tables automatically)
    • Faster
    • Cheaper
    • Repeatable
    • Scalable
  12. KBC Machine Successes (Unstructured Information → Knowledge Base)
    Successful KBC applications from our group (collaborators listed on the slide):
    • Genomics
    • Drug Repurposing
    • Paleobiology
    • Anti-Human Trafficking
    • TAC-KBP 2014 Winner
    • Material Science
  13. Iterative KBC with DeepDive (Unstructured Information → Knowledge Base, with a human in the loop: Run → Analyze → Improve)
    • Development loop improves iteratively
    • More rapid iteration → successful KBC
    • Goal: high quality
        • precision
        • recall
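
The two quality goals named above can be made concrete with a small calculation. The following is a minimal sketch, assuming extracted facts and ground-truth facts are represented as Python sets of tuples; the function name `precision_recall_f1` and the sample facts are illustrative and not part of DeepDive.

```python
def precision_recall_f1(extracted, truth):
    """Return (precision, recall, F1) for a set of extracted facts vs. ground truth."""
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical has_spouse facts: one correct extraction, one wrong, one missed.
extracted = {("Barack Obama", "Michelle Obama"), ("Barack Obama", "Air Force One")}
truth = {("Barack Obama", "Michelle Obama"), ("Michelle Obama", "Barack Obama")}
precision, recall, f1 = precision_recall_f1(extracted, truth)
```

Precision drops when wrong facts are extracted; recall drops when true facts are missed, which is why the error analysis later in the talk treats false positives and false negatives separately.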
  14. Iterative KBC Challenges
    Slow, Unreliable Humans
    • Humans are easily distracted
    • Machine-optimized data is not human-friendly
    • Collecting metadata is tedious and error-prone
    • Exploring data is not interactive and too laborious
    Slow, Unwieldy Machines
    • Normal runs are too slow for small incremental changes
    • Time and resources are wasted by inefficient data processing
    • Machines waste time executing what the human did not intend
  15. Accelerating KBC: Focus of My Work
    1. Increasing Machine Efficiency
    • Faster incremental runs
    • More efficient data processing
    • Execution planning and micro-step operations
    • Better serialization for data in motion
    2. Increasing Human Productivity
    • Error analysis guidelines
    • Easier data labeling
    • Automatic search interface
    • Metrics monitoring dashboard
    • Easier training set creation
  17. Accelerating KBC
    0. KBC with DeepDive
        Extracting Databases from Dark Data with DeepDive. [SIGMOD 2016 Industrial]
        DeepDive: Declarative Knowledge Base Construction. [SIGMOD Record 2016]
        github.com/HazyResearch/deepdive
    1. Increasing Machine Efficiency: faster incremental runs; more efficient data processing
    2. Increasing Human Productivity: error analysis guidelines; easier data labeling; automatic search interface
  18. DeepDive Among Others
    • Areas: Machine Learning, Data Management, Information Extraction
    • Systems shown: Caffe, Torch, SystemT, CoreNLP, FACTORIE, MapReduce, Alchemy, BUGS, Xlog, MCDB, ProbKB, Lixto, GATE, various rule-based systems, XPath, regexp
  20. KBC with DeepDive
    • Relational Databases
    • Declarative Languages
        • SQL & DDlog
    • Standard Tools Integration
        • as UDFs (user-defined functions)
        • written in Java, Python, Perl
        • e.g., CoreNLP, regular expressions
    • Semi-Supervised Machine Learning
        • Probabilistic Graphical Models (Factor Graphs)
        • Approximate Inference (Gibbs Sampling)
        • Learning with Asynchronous SGD (HogWild!)
    Pipeline stages: Candidate Generation, Feature Extraction, Supervision, Learning / Inference, Aspirational Schema
  22. DeepDive Programs in DDlog: Aspirational Schema
    Input text: "President Barack Obama and his wife Michelle Obama step out of Air Force One on Sunday. …"

        has_spouse?(p1_id text, p2_id text).
        person_mention(p_id text, doc_id text, sent_id text).
        sentences(doc_id text, sent_id int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], …).
        articles(id text, content text).
  23. DeepDive Programs in DDlog: Candidate Generation and Feature Extraction
    DDlog inherits Datalog, allowing UDFs for integrated data processing.

        # UDF in Python: extracting text spans by NER tags
        person_mention += udf_map_candidates(sent, ner_tags, …) :-
            sentences(sent, ner_tags, …).

        has_spouse(p1, p2) :-
            person_mention(p1, doc, sent),
            person_mention(p2, doc, sent).

        # UDF in Python: extracting text features
        spouse_features += udf_extract_features(p1, p2, sent) :-
            has_spouse(p1, p2),
            person_mention(p1, doc, sent),
            sentences(doc, sent).
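
The slide mentions a Python UDF that extracts text spans by NER tags without showing it. The following is a minimal sketch of what such a function might look like, assuming tokens and NER tags arrive as parallel lists and that person tokens are tagged `PERSON`; the function name, the mention-id format, and the tuple layout are illustrative assumptions, not DeepDive's actual UDF interface.

```python
def extract_person_mentions(doc_id, sent_id, tokens, ner_tags):
    """Yield (mention_id, doc_id, sent_id, text) for each maximal run of PERSON tokens."""
    start = None
    # A trailing "O" sentinel closes a run that reaches the end of the sentence.
    for i, tag in enumerate(list(ner_tags) + ["O"]):
        if tag == "PERSON" and start is None:
            start = i
        elif tag != "PERSON" and start is not None:
            mention_id = f"{doc_id}_{sent_id}_{start}_{i - 1}"  # hypothetical id scheme
            yield mention_id, doc_id, sent_id, " ".join(tokens[start:i])
            start = None


tokens = "President Barack Obama and his wife Michelle Obama step out".split()
ner_tags = ["O", "PERSON", "PERSON", "O", "O", "O", "PERSON", "PERSON", "O", "O"]
mentions = list(extract_person_mentions("doc1", 1, tokens, ner_tags))
```

Pairing every two mentions that share a sentence, as the `has_spouse(p1, p2)` rule does, then produces the candidate relation.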
  24. DeepDive Programs in DDlog: Supervision and Learning / Inference
    DDlog inherits Markov Logic Networks (Richardson & Domingos) and Tuffy (Niu et al.).

    Features and domain knowledge:

        # Features
        @weight(f) has_spouse(p1, p2) :- spouse_feature(p1, p2, f).
        # Inference rule: Symmetry
        @weight(3.0) has_spouse(p1, p2) => has_spouse(p2, p1).
        # Inference rule: Only one marriage
        @weight(-1.0) has_spouse(p1, p2) => has_spouse(p1, p3) :- p2 != p3.

    Distant supervision with known facts:

        has_spouse(p1, p2) = true :-
            freebase_person(p1, e1), freebase_person(p2, e2),
            freebase_marriage(e1, e2).
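
The distant supervision rule labels a candidate pair as true when both mentions link to entities with a known marriage. As a rough illustration only, the same join can be written in plain Python; the dictionaries, ids, and the name `distant_labels` below are hypothetical stand-ins for the `freebase_person` and `freebase_marriage` relations, not DeepDive code.

```python
def distant_labels(candidates, mention_to_entity, known_marriages):
    """Label a candidate pair True if its linked entities form a known marriage, else None."""
    labels = {}
    for p1, p2 in candidates:
        e1 = mention_to_entity.get(p1)
        e2 = mention_to_entity.get(p2)
        known = e1 is not None and e2 is not None and (e1, e2) in known_marriages
        # None means "unlabeled": the pair stays a query variable for inference.
        labels[(p1, p2)] = True if known else None
    return labels


mention_to_entity = {"m1": "barack_obama", "m2": "michelle_obama"}  # freebase_person
known_marriages = {("barack_obama", "michelle_obama")}              # freebase_marriage
candidates = [("m1", "m2"), ("m1", "m3")]
labels = distant_labels(candidates, mention_to_entity, known_marriages)
```

Only positive evidence comes out of this rule; negative examples would need additional rules, which is one of the corrections the error analysis guideline later suggests.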
  25. DeepDive's Semantics: Grounding
    • User relations (DDlog relations): R(x, y), S(y), Q(x)
    • DDlog inference rules:
        R(x,y) => Q(x).          (factor template F1: q(x) :- R(x,y))
        R(x,y), S(y) => Q(x).    (factor template F2: q(x) :- R(x,y), S(y))
    • Grounding turns the relations and rules into a factor graph with variables V (r1, r2, r3, s1, s2, q1) and factors F (instances of F1 and F2)
    • Factor function corresponds to Equation 1 in Section 2.4.
    [Figure: example tuples of R, S, Q and the grounded factor graph]
  26. DeepDive's Semantics: Factor Graph
    • factor graph: $(V, F, \hat{w})$
    • random variables: $V$
    • hyperedges of variables: $F \subseteq \{ f \mid f \subseteq V \}$
    • weight function: $\hat{w} : F \times \{0,1\}^V \to \mathbb{R}$
    • all possible worlds: $\mathcal{I} \subseteq \{ I : V \to \{0,1\} \}$
    • joint probability: $\Pr[I] = Z^{-1} \exp\{ \hat{W}(F, I) \}$, where $Z = \sum_{I \in \mathcal{I}} \exp\{ \hat{W}(F, I) \}$ and $\hat{W}(F, I) = \sum_{f \in F} \hat{w}(f, I)$
    • marginal probability: $\Pr[v] = \sum_{I \in \mathcal{I}^{+}} \Pr[I]$, where $\mathcal{I}^{+} = \{ I \in \mathcal{I} \mid I(v) = 1 \}$
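
The joint and marginal probabilities defined on this slide can be computed exactly for a tiny factor graph by enumerating every possible world. The sketch below does that under stated assumptions: factors are plain Python functions from a world to a weight, an implication factor contributes its weight whenever the implication holds, and the two variables and weights are made up for illustration. Enumeration is exponential in the number of variables, so this only serves to make the definitions concrete.

```python
import itertools
import math


def exact_marginals(variables, factors):
    """Enumerate all worlds I: V -> {0,1} and return Pr[v] for every variable v."""
    worlds = [dict(zip(variables, bits))
              for bits in itertools.product([0, 1], repeat=len(variables))]
    # exp(W(F, I)) where W(F, I) is the sum of all factor weights in world I.
    scores = [math.exp(sum(f(world) for f in factors)) for world in worlds]
    z = sum(scores)  # partition function Z
    return {v: sum(s for world, s in zip(worlds, scores) if world[v] == 1) / z
            for v in variables}


# Two ground variables: has_spouse(a, b) and has_spouse(b, a).
variables = ["ab", "ba"]
factors = [
    lambda I: 1.0 if I["ab"] == 1 else 0.0,                    # a feature with weight 1.0
    lambda I: 3.0 if (I["ab"] == 0 or I["ba"] == 1) else 0.0,  # symmetry: ab => ba
]
marginals = exact_marginals(variables, factors)
```

With these made-up weights the symmetry factor pulls the marginal of `ba` above that of `ab`, even though only `ab` has direct feature evidence.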
  27. DeepDive's Semantics: Learning / Inference (factor graph → marginal probabilities Pr[·])
    • Approximate inference by Gibbs sampling
    • Learning by asynchronous SGD (Hogwild! [Niu et al.])
    • High-performance implementation on modern hardware (NUMA & many cores) (DimmWitted [Zhang et al.])
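
Gibbs sampling estimates the marginal probabilities by resampling one variable at a time from its conditional distribution. The following single-threaded sketch illustrates the idea on a two-variable graph; it is not DeepDive's sampler (which the slide describes as a high-performance NUMA-aware implementation), and the factor encoding, weights, and parameter defaults are assumptions made for the example.

```python
import math
import random


def gibbs_marginals(variables, factors, num_samples=20000, burn_in=1000, seed=0):
    """Estimate Pr[v] for binary variables by Gibbs sampling over a small factor graph."""
    rng = random.Random(seed)
    world = {v: rng.randint(0, 1) for v in variables}
    counts = {v: 0 for v in variables}
    for step in range(burn_in + num_samples):
        for v in variables:
            # Compare total weight with v = 1 against v = 0, all else fixed.
            world[v] = 1
            w1 = sum(f(world) for f in factors)
            world[v] = 0
            w0 = sum(f(world) for f in factors)
            p1 = 1.0 / (1.0 + math.exp(w0 - w1))  # Pr[v = 1 | rest]
            world[v] = 1 if rng.random() < p1 else 0
        if step >= burn_in:
            for v in variables:
                counts[v] += world[v]
    return {v: counts[v] / num_samples for v in variables}


variables = ["ab", "ba"]
factors = [
    lambda I: 1.0 if I["ab"] == 1 else 0.0,                    # a feature with weight 1.0
    lambda I: 3.0 if (I["ab"] == 0 or I["ba"] == 1) else 0.0,  # symmetry: ab => ba
]
estimates = gibbs_marginals(variables, factors)
```

This sketch re-evaluates every factor for each variable; a practical sampler would touch only the factors adjacent to the variable being resampled.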
  29. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency: faster incremental runs; more efficient data processing
        Incremental knowledge base construction using DeepDive. [VLDB 2015] (Best of VLDB) [VLDB Journal 2016] (SIGMOD Research Highlight Award 2015)
    2. Increasing Human Productivity: error analysis guidelines; easier data labeling; automatic search interface
  32. Problem: Slow Incremental Runs ("News Reading" system)
    • Input: 1.8M documents (e.g., "… Barack Obama and his wife Michelle Obama …"); output: 2.4M facts; one run: 6 hours
    • Incremental updates rerun everything:
        +∆1: 6 hours
        +∆1 +∆2: 7 hours
        +∆1 +∆2 +∆3: 8 hours
  33. Fast Incremental Runs ("News Reading" system)
    • Initial run: 1.8M documents (e.g., "… Barack Obama and his wife Michelle Obama …") → 2.4M facts in 6 hours
    • Incremental updates (+∆1, +∆2, +∆3): < 30 mins
  34-39. Types of Incremental Updates (new DDlog rules + data → updates to the factor graph)
    • Candidate Generation: addition of variables (V, V, … + V)
    • Learning / Inference: addition of variables and factors (V, F)
    • Feature Extraction: mutation of factors (F → F')
    • Supervision: change of evidence on variables (V -/+)
    • Incremental grounding turns these updates into ∆FG, applied to the factor graph FG
  40-42. Speeding up Incremental Runs
    1. Materialization (once): Original Factor Graph FG → Reusable Data
    2. Incremental Grounding (repeated many times): Incremental Updates → ∆FG
    3. Incremental Inference (repeated many times): Reusable Data + ∆FG → Pr(∆FG)[·]
  43. Incremental Maintenance: Two Approaches (Original Factor Graph FG + ∆FG → Pr(∆FG)[·])
    1. Sampling-based: materialize samples of possible worlds
    2. Variational-based: materialize an approximate factor graph
  44. 1. Sampling Approach
    • Materialization: from the Original Factor Graph FG, generate many samples of possible worlds (e.g., 00111, 11101, 01001, 01001)
    • Incremental Inference: given updates to the factor graph ∆FG, reuse the samples with acceptance tests (independent Metropolis-Hastings sampling w.r.t. ∆FG)
    • Output: updated probabilities Pr(∆FG)[·]
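
The reuse-with-acceptance-tests idea can be sketched as independent Metropolis-Hastings in which the stored samples of the original graph serve as proposals. Because the proposal distribution is the original distribution, the acceptance ratio reduces to exp(ΔW(proposal) − ΔW(current)), where ΔW sums only the new factors in ∆FG. The code below is an illustration under assumptions: the toy graph, its weights, and all function names are made up, and the materialized samples are drawn by exact enumeration purely to keep the example short.

```python
import itertools
import math
import random


def world_weights(variables, factors):
    """Return all worlds (as bit tuples) and their unnormalized weights exp(W)."""
    worlds = list(itertools.product([0, 1], repeat=len(variables)))
    weights = [math.exp(sum(f(dict(zip(variables, w))) for f in factors))
               for w in worlds]
    return worlds, weights


def incremental_marginals(stored_samples, variables, delta_factors, seed=0):
    """Independent Metropolis-Hastings over stored samples; evaluates only delta factors."""
    rng = random.Random(seed)

    def delta_w(sample):
        world = dict(zip(variables, sample))
        return sum(f(world) for f in delta_factors)

    current = stored_samples[0]
    counts = [0] * len(variables)
    accepted = 0
    for proposal in stored_samples:
        # Acceptance test: min(1, exp(dW(proposal) - dW(current))).
        if rng.random() < math.exp(min(0.0, delta_w(proposal) - delta_w(current))):
            current = proposal
            accepted += 1
        for i, bit in enumerate(current):
            counts[i] += bit
    n = len(stored_samples)
    return [c / n for c in counts], accepted / n


variables = ["a", "b"]
original = [lambda I: 0.5 * I["a"], lambda I: 1.0 if I["a"] == I["b"] else 0.0]
delta = [lambda I: 1.5 * I["b"]]  # the update: one new factor on b

# Materialization (once): draw samples from the original factor graph.
worlds, weights = world_weights(variables, original)
stored = random.Random(1).choices(worlds, weights=weights, k=20000)

# Incremental inference (repeated): reuse the stored samples against the update.
estimate, acceptance_rate = incremental_marginals(stored, variables, delta)

# Reference only: exact marginals of the updated graph.
_, new_weights = world_weights(variables, original + delta)
exact = [sum(wt for w, wt in zip(worlds, new_weights) if w[i] == 1) / sum(new_weights)
         for i in range(len(variables))]
```

The acceptance rate falls as the update moves the distribution further from the original, which matches the tradeoff plotted later in the talk.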
  45. 2. Variational Approach
    • Materialization: approximate the Original Factor Graph FG with a Simpler Factor Graph FG' (with only binary potentials) using log-determinant relaxation
    • Incremental Inference: apply the updates ∆FG to obtain the Updated Simpler Factor Graph FG' + ∆FG, then run Gibbs sampling after the update
    • Output: updated probabilities Pr(∆FG)[·]
  46. Quality Over Time ("News Reading" system)
    • Simulated incremental development with 6 different rules
    • 22x overall speedup; still >2x faster with the 12-hour materialization
    • 99% overlap for Pr[v] > 0.9
    • Consistent ~10x speedup across 5 KBC systems
    [Figure: quality (F1 score, 0 to 0.4) vs. cumulative execution time (s, log scale), Rerun vs. Incremental]
  47. Tradeoff of Two Approaches (synthetic dataset)
    • Neither dominates the other (depends on the workload) → rule-based optimizer
    [Figure: incremental inference time (s) of Sampling vs. Variational against (a) acceptance rate, annotated "larger updates" and "slower", and (b) sparsity of correlations, annotated "more correlations" and "slower"]
  48. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency: faster incremental runs; more efficient data processing
        github.com/netj/mkmimo
    2. Increasing Human Productivity: error analysis guidelines; easier data labeling; automatic search interface
  49. Problem: Inefficient Data Processing (Unload → UDF → Load)
    • KBC requires lots of data processing
    • DeepDive's UDFs (user-defined functions)
        • simple, easy to debug
        • language-agnostic, for integrating arbitrary tools and libraries
        • efficient?
  51. Naive Parallelization: executing UDF processes in parallel
    • Unload (batch) → Split into files → UDF processes → output files → Load (batch)
    • Problems: data duplication, unnecessary I/O
  53. Better Parallelization: executing UDF processes in parallel, streaming data
    • Unload (streaming) → Split into pipes → UDF processes → output files → Load (batch)
    • Reduced duplication
    • Problem: throughput bounded by stragglers!
  54-56. DeepDive's Efficient Data Processing: executing UDF processes in parallel, streaming data, balancing load
    • Unload (streaming) → mkmimo → pipes → UDF processes → pipes → mkmimo → Load (streaming)
    • Zero footprint! Speedup ~3x
    • With parallel unload and parallel load: speedup ~20x
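
To see why balancing load matters when one UDF process is a straggler, the following sketch compares a static pre-split against dispatching each record to whichever worker becomes free first. It is a simplified simulation with made-up record costs and worker speeds, not mkmimo's implementation; the function names are hypothetical.

```python
import heapq


def static_split_makespan(costs, speeds):
    """Round-robin pre-split: record i always goes to worker i % n."""
    busy = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        worker = i % len(speeds)
        busy[worker] += cost / speeds[worker]
    return max(busy)


def dynamic_dispatch_makespan(costs, speeds):
    """Each record goes to the worker that becomes free earliest."""
    free_at = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for cost in costs:
        t, worker = heapq.heappop(free_at)
        t += cost / speeds[worker]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, worker))
    return finish


costs = [1.0] * 12          # twelve equally sized records
speeds = [1.0, 1.0, 0.25]   # the third worker is a 4x slower straggler
static_time = static_split_makespan(costs, speeds)
dynamic_time = dynamic_dispatch_makespan(costs, speeds)
```

In this toy setting the static split waits on the straggler for its full share of the records, while dynamic dispatch hands the straggler fewer records and finishes sooner.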
  57. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency: faster incremental runs; more efficient data processing
    2. Increasing Human Productivity: error analysis guidelines; easier data labeling; automatic search interface
        Feature engineering for knowledge base construction. [IEEE DEBul 2014]
  58. Problem: Easily Distracted Humans
    • Working in an ad-hoc fashion
        • adding useless features
        • perfecting features with little impact
        • solving obvious errors, not common ones
        • fiddling with the statistical procedure
  59-60. DeepDive's Micro Error Analysis Guideline [IEEE DE Bul 2014, VLDB 2015] (flowchart of inspection and correction steps; remaining problem: slow data inspection and metadata collection)
    Main loop
    • Start error analysis → find all error examples → any new example?
        • Yes: take a look at the example, along with its origins → is it an error? what type of error?
            • Not an error: fix ground truth
            • False positive: debug precision error
            • False negative: debug recall error
        • No: made enough changes? If yes, rerun DeepDive pipelines; if no, end error analysis
    Debug precision error (found a false positive error example)
    • Find relevant features to the example → all features look correct?
        • No: fix bug in feature extractors
        • Yes: find features with high weights → why did a feature get high weight?
            • Skew in train set: take a look at the training set covered by features; label more negative examples; add distant supervision rules/constraints
    • No more feature: continue to next example
    Debug recall error (found a false negative error example)
    • Extracted as candidate? If no, fix extractors to extract the example as candidate
    • Find relevant features to the example → enough features? If no, add/fix feature extractors to extract more features for the example
    • All features look correct?
        • No: fix bug in feature extractors
        • Yes: find features with low weights → why did a feature get low weight?
            • Sparse feature: fix feature extractor to cover more cases
            • Skew in train set: take a look at the training set covered by features; label more positive examples; add distant supervision rules/constraints
    • No more feature: continue to next example
  61. Accelerating KBC
    0. KBC with DeepDive
    1. Increasing Machine Efficiency: faster incremental runs; more efficient data processing
    2. Increasing Human Productivity: error analysis guidelines; easier data labeling; automatic search interface
        Mindtagger: a demonstration of data labeling in knowledge base construction. [VLDB 2015 Demo]
        github.com/HazyResearch/mindbender
  62. Problem: Data Model Mismatch (large gap)
    • Have data: machine-optimized; in relational schema; normalized
    • Need data: human-friendly; in document model; denormalized
  63. Problem: Painful Data Inspection
    • Data is not in a human-friendly format
        • Awful to work with data meant for machine consumption
        • Too slow and tedious to write SQL queries to understand or explore data
    • Unreliable to collect metadata manually
        • Difficult to predict the schema of metadata needed for ad-hoc analysis
  64. Mindtagger: Tool for Data Labeling (little gap)
    • Interactive user interface
    • Human-friendly data presentation
    • Quick, reliable metadata collection
    • Customizable task template
    Flow: Data Items + Task Template → Mindtagger (Interactive UI) → Metadata
  67. Mindtagger in Action: Making #machinelearning fun!
    • Precision mode (spouse example)
    • Ground-truth collection (tracking down slavery)
    • Clustering errors by ad-hoc tags
  68-69. Problem: Slow Data Exploration
    • Breakdown of time consumption in an error analysis iteration (semiconductor material KBC)
    • Productive steps with Mindtagger vs. tedious data exploration
    [Figure: time consumed and cumulative ratio per error analysis step]
  70. Automatic Search Interface Generation
    • DDlog annotations → data denormalization → interactive keyword search

        @extraction
        has_spouse?(@ref(…) p1_id text, @ref(…) p2_id text).
        person_mention(@key p_id text, @ref(…) doc_id text, @ref(…) sent_id text).
        sentences(@key doc_id text, @key sent_id int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], …).
        @source
        articles(id text, content text).
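
The denormalization step joins normalized relations into nested, human-friendly documents that a keyword search can index. The sketch below illustrates the idea in plain Python, following the `@key` and `@ref` columns of the schema above; the function names, the nested layout, and the substring-based search are assumptions for illustration and do not reflect how the generated search interface is actually implemented.

```python
def denormalize(articles, sentences, mentions):
    """Nest sentences under their article and mentions under their sentence."""
    docs = {a["id"]: {"id": a["id"], "content": a["content"], "sentences": []}
            for a in articles}
    by_key = {}
    for s in sentences:
        nested = {"sent_id": s["sent_id"], "tokens": s["tokens"], "mentions": []}
        docs[s["doc_id"]]["sentences"].append(nested)
        by_key[(s["doc_id"], s["sent_id"])] = nested  # (doc_id, sent_id) acts as the key
    for m in mentions:
        by_key[(m["doc_id"], m["sent_id"])]["mentions"].append(m["p_id"])
    return list(docs.values())


def keyword_search(documents, keyword):
    """Return ids of documents whose content contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [d["id"] for d in documents if keyword in d["content"].lower()]


articles = [{"id": "a1", "content": "Barack Obama and his wife Michelle Obama"},
            {"id": "a2", "content": "Air Force One landed on Sunday"}]
sentences = [{"doc_id": "a1", "sent_id": 1, "tokens": ["Barack", "Obama", "and", "his",
                                                        "wife", "Michelle", "Obama"]}]
mentions = [{"p_id": "m1", "doc_id": "a1", "sent_id": 1},
            {"p_id": "m2", "doc_id": "a1", "sent_id": 1}]
documents = denormalize(articles, sentences, mentions)
hits = keyword_search(documents, "michelle")
```

Each resulting document carries everything a person needs to inspect an extraction in one place, instead of requiring joins across three tables.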
  71. Increased Productivity, Lowered Bar (DeepDive coevolving over calendar years)
    Team and time to reach sufficient quality, in the order shown:
    • 1 computer scientist, 2-3 paleontologists: 1.5 years
    • 3-5 programmers, physicists: 6 months
    • 2 computer scientists: 3 months
    • 2 biomedical scientists, 2 computer scientists: 3-4 months
    • undergrad students: 4-8 weeks
    • 5-6 programmers: 3-4 months
    • 1 biomedical scientist, 1 computer scientist: 1 year
    [Figure: calendar year (2013, 2014, 2015, 2016) vs. time to reach sufficient quality (years, months, weeks)]
  72. Future Work in Accelerating KBC
    1. Increasing Machine Efficiency
    • Scale-out learning/inference
    • Scale-out data processing
    • Execution optimization across heterogeneous compute resources
    2. Increasing Human Productivity
    • Training data generation (data programming) → Snorkel [HILDA 2016]
    • Interactive rule composing
    • Rule auto-suggestions
  73. Accelerating Knowledge Base Construction
    0. KBC with DeepDive
        [SIGMOD Record 2016] [SIGMOD 2016 Industrial Track]
        github.com/HazyResearch/deepdive
    1. Increasing Machine Efficiency: faster incremental runs; more efficient data processing
        (Best of VLDB) [VLDB 2015] [VLDB Journal 2016] (SIGMOD Research Highlight Award)
        github.com/netj/mkmimo
    2. Increasing Human Productivity: error analysis guidelines; easier data labeling; automatic search interface
        [IEEE DEBul 2014] [VLDB 2015 Demo] [HILDA 2016]
        github.com/HazyResearch/mindbender
  74. Acknowledgment
    Chris Ré. Intern Hosts at Google & Facebook: Alon Halevy, Chris Olston, Alkis, Avery Ching. Hazy Research Group: Ce, Sen, Feiran, Alex, Theo, Ivan, Michael FitzPatrick, Henry, Jason, Steve, Stephen, Matteo, Chris Aberger & De Sa. Nobu, Masayuki, Yuichi (TOSHIBA). Gill Bejerano, Johannes, and all DeepDive users. Feng, Zifei, Raphael, Xiao (LATTICE). Reading & Oral Committee: Parag Mallick, Peter Bailis, Kunle Olukotun. InfoLab: Jure, Jeff, Gio, Rok, Semih, Vikesh, Steven, Hyunjung, Akash, Manas, Saint, Vasilis, Asif, Andrej. Mike Cafarella. Hector, Jennifer, Andreas.