Apache Drill - an interactive, ad-hoc query system for large-scale datasets

Talk about the Apache Drill incubator project at the Big Data User Group in Stuttgart, May 2013.

Michael Hausenblas

May 16, 2013

Transcript

  1. Apache Drill: an interactive, ad-hoc query system for large-scale datasets
     Michael Hausenblas, Chief Data Engineer EMEA, MapR
     Big Data User Group Stuttgart, 2013-05-16
  2. Which workloads do you encounter in your environment?
     http://www.flickr.com/photos/kevinomara/2866648330/ licensed under CC BY-NC-ND 2.0
  3. Batch processing
     … for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing → for the batch layer in Lambda architecture
  4. OLTP
     … user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc. → for the serving layer in Lambda architecture
  5. Stream processing
     … in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.) → for the speed layer in Lambda architecture
  6. Search/Information Retrieval
     … retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
  7. Use Case: Marketing Campaign
     • Jane, a marketing analyst
     • Determine target segments
     • Data from different sources
  8. Use Case: Logistics
     • Supplier tracking and performance
     • Queries
       – Shipments from supplier 'ACM' in last 24h
       – Shipments in region 'US' not from 'ACM'

     SUPPLIER_ID   NAME                  REGION
     ACM           ACME Corp             US
     GAL           GotALot Inc           US
     BAP           Bits and Pieces Ltd   Europe
     ZUP           Zu Pli                Asia

     { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
     { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" } …
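The two logistics queries on this slide can be sketched in plain Python as a join between the supplier table and the shipment records. This is only an illustration of what the queries ask for, not how Drill evaluates them. The function names, the explicit `now` argument, and shipment 100125 are assumptions added for the example; they do not appear in the talk.

```python
from datetime import datetime, timedelta

# Supplier table from the slide, keyed by SUPPLIER_ID.
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The first two records are from the slide; 100125 is hypothetical and
# only exists so that the second query returns a row.
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
    {"shipment": 100125, "supplier": "GAL", "timestamp": "2013-02-02",
     "description": "hypothetical record"},
]


def shipments_from_supplier(shipments, supplier_id, now, window=timedelta(hours=24)):
    """Shipments from one supplier whose timestamp lies in [now - window, now]."""
    cutoff = now - window
    return [
        s for s in shipments
        if s["supplier"] == supplier_id
        and cutoff <= datetime.strptime(s["timestamp"], "%Y-%m-%d") <= now
    ]


def shipments_in_region_excluding(shipments, suppliers, region, excluded_id):
    """Shipments whose supplier is in `region`, leaving out one supplier."""
    return [
        s for s in shipments
        if s["supplier"] != excluded_id
        and suppliers.get(s["supplier"], {}).get("region") == region
    ]


if __name__ == "__main__":
    now = datetime(2013, 2, 1, 12, 0)  # assumed "current time" for the demo
    print(shipments_from_supplier(SHIPMENTS, "ACM", now))
    print(shipments_in_region_excluding(SHIPMENTS, SUPPLIERS, "US", "ACM"))
```

The second query needs both data sets: the region lives in the supplier table while the shipments are JSON records, which is the "data from different sources" situation the use cases describe.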
  9. Use Case: Crime Detection
     • Online purchases
     • Fraud, bilking, etc.
     • Batch-generated overview
     • Modes
       – Explorative
       – Alerts
  10. Requirements
      • Support for different data sources
      • Support for different query interfaces
      • Low-latency/real-time
      • Ad-hoc queries
      • Scalable, reliable
  11. Google's Dremel
      http://research.google.com/pubs/pub36632.html
      Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
      "Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …"
  12. Apache Drill: key facts
      • Inspired by Google's Dremel
      • Standard SQL 2003 support
      • Plug-able data sources
      • Nested data is a first-class citizen
      • Schema is optional
      • Community driven, open, 100's involved
  13. Principled Query Execution
      Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
      • Source query (parser API): SQL 2003, DrQL, MongoQL, DSL
      • Optimizer: Topology, CF, etc.
      • Execution: scanner API
      • Logical plan:
        query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" }, …
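To make the logical plan fragment on this slide concrete, here is a toy Python interpreter for a `sequence` of `scan` and `filter` operators. It is a sketch under stated assumptions, not Drill's reference interpreter: the condition grammar (`field operator number`), the `sources` dictionary, and the function names are all hypothetical, and the sample rows are made up.

```python
import operator

# Comparison operators supported by this toy condition grammar (an assumption).
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def compile_condition(text):
    """Turn a string such as 'x > 3' into a predicate over dict records."""
    field, op, literal = text.split()
    compare = OPS[op]
    value = float(literal)
    return lambda record: compare(record[field], value)


def run_sequence(plan, sources):
    """Evaluate the steps of a 'sequence' operator one after another."""
    if plan["op"] != "sequence":
        raise ValueError("only 'sequence' plans are handled in this sketch")
    rows = None
    for step in plan["do"]:
        if step["op"] == "scan":
            # A scan starts the stream by reading the named source.
            rows = iter(sources[step["source"]])
        elif step["op"] == "filter":
            if rows is None:
                raise ValueError("filter needs a preceding scan")
            rows = filter(compile_condition(step["condition"]), rows)
        else:
            raise ValueError("unsupported operator: " + step["op"])
    return list(rows) if rows is not None else []


if __name__ == "__main__":
    plan = {
        "@id": "log",
        "op": "sequence",
        "do": [
            {"op": "scan", "source": "logs"},
            {"op": "filter", "condition": "x > 3"},
        ],
    }
    sources = {"logs": [{"x": 1}, {"x": 5}, {"x": 3}, {"x": 7}]}  # made-up rows
    print(run_sequence(plan, sources))
```

Because the plan is plain data, any front end (SQL, DrQL, a DSL) could emit it, which is the point of separating the parser from the logical plan in the pipeline.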
  14. Wire-level Architecture
      • Each node: Drillbit - maximize data locality
      • Co-ordination, query planning, execution, etc. are distributed
      • Any node can act as endpoint for a query: the foreman
      [Diagram: four nodes, each running a Drillbit next to a storage process]
  15. Wire-level Architecture
      • Curator/Zookeeper for ephemeral cluster membership info
      • Distributed cache (Hazelcast) for metadata, locality information, etc.
      [Diagram: Curator/Zk above four nodes, each with a Drillbit, a storage process, and a distributed cache]
  16. Wire-level Architecture
      • Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.
      • Streaming data communication avoiding SerDe
      [Diagram: Curator/Zk above four nodes, each with a Drillbit, a storage process, and a distributed cache]
  17. Wire-level Architecture
      Foreman turns into root of the multi-level execution tree, leaves activate their storage engine interface.
      [Diagram: execution tree of nodes, with Curator/Zk]
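The multi-level execution tree can be illustrated with a small Python sketch in which leaves read their local rows and every inner node merges the partial results of its children. This shows the general idea only; the `Node` class, its fields, and the sample values are assumptions for the example and say nothing about Drill's actual wire protocol.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node of the execution tree; the root plays the foreman role."""
    name: str
    local_rows: List[float] = field(default_factory=list)  # data a leaf scans
    children: List["Node"] = field(default_factory=list)

    def execute(self) -> Tuple[int, float]:
        """Return the partial aggregate (row count, sum) for this subtree."""
        count = len(self.local_rows)
        total = sum(self.local_rows)
        for child in self.children:
            child_count, child_total = child.execute()
            count += child_count
            total += child_total
        return count, total


def average(root: Node) -> float:
    """The root combines the merged partial aggregates into the final answer."""
    count, total = root.execute()
    return total / count if count else 0.0


if __name__ == "__main__":
    leaves = [
        Node("leaf-1", local_rows=[1.0, 2.0]),
        Node("leaf-2", local_rows=[3.0]),
        Node("leaf-3", local_rows=[4.0, 5.0]),
    ]
    middle = Node("intermediate", children=leaves[:2])
    foreman = Node("foreman", children=[middle, leaves[2]])
    print(foreman.execute(), average(foreman))
```

Only the small `(count, sum)` pairs travel up the tree, never the raw rows, which is why this shape suits aggregation queries over data that stays close to its storage.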
  18. Key features
      • Full SQL – ANSI SQL 2003
      • Nested Data as first class citizen
      • Optional Schema
      • Extensibility Points …
  19. Extensibility Points
      • Source query → parser API
      • Custom operators, UDF → logical plan
      • Serving tree, CF, topology → physical plan/optimizer
      • Data sources & formats → scanner API
      [Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
  20. … and Hadoop?
      • HDFS can be a data source
      • Complementary use cases*
      • … use Apache Drill
        – Find record with specified condition
        – Aggregation under dynamic conditions
      • … use MapReduce
        – Data mining with multiple iterations
        – ETL
      *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
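The two workloads assigned to Drill on this slide, finding records that match a condition and aggregating under a condition chosen at query time, can be sketched in a few lines of Python. The records, field names, and function names below are hypothetical and serve only to show the shape of such ad-hoc queries.

```python
from collections import defaultdict

# Hypothetical records; none of these values come from the talk.
EVENTS = [
    {"region": "US", "supplier": "ACM", "amount": 120.0},
    {"region": "US", "supplier": "GAL", "amount": 80.0},
    {"region": "Europe", "supplier": "BAP", "amount": 50.0},
    {"region": "US", "supplier": "ACM", "amount": 30.0},
]


def find_records(records, predicate):
    """Return every record for which the ad-hoc predicate holds."""
    return [r for r in records if predicate(r)]


def aggregate(records, predicate, group_key, value_key):
    """Sum `value_key` per `group_key` over the records matching the predicate."""
    totals = defaultdict(float)
    for record in records:
        if predicate(record):
            totals[record[group_key]] += record[value_key]
    return dict(totals)


if __name__ == "__main__":
    # The condition is supplied by the caller, not fixed in advance.
    print(find_records(EVENTS, lambda r: r["amount"] > 100))
    print(aggregate(EVENTS, lambda r: r["region"] == "US", "supplier", "amount"))
```

Both functions make a single pass with a condition that is only known when the query arrives, in contrast to the multi-iteration data mining and ETL jobs the slide leaves to MapReduce.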
  21. Basic Demo
      https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
      data source: donuts.json
        { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …
      logical plan: simple_plan.json
        query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …
      result: out.json
        { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
        { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
        { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
  22. Status
      • Heavy development by multiple organizations
      • Available
        – Logical plan (ADSP)
        – Reference interpreter
        – Basic SQL parser
        – Basic demo
  23. Status May 2013
      • Full SQL support (+JDBC)
      • Physical plan
      • In-memory compressed data interfaces
      • Distributed execution
  24. Status May 2013
      • HBase and MySQL storage engine
      • WebUI client
  25. Contributing
      Contributions appreciated (not only code drops) …
      • Test data & test queries
      • Use case scenarios (textual/SQL queries)
      • Documentation
      • Further schedule
        – Alpha Q2
        – Beta Q3
  26. Kudos to …
      • Julian Hyde, Pentaho
      • Lisen Mu, XingCloud
      • Tim Chen, Microsoft
      • Chris Merrick, RJMetrics
      • David Alves, UT Austin
      • Sree Vaadi, SSS/NGData
      • Jacques Nadeau, MapR
      • Ted Dunning, MapR
  27. Engage!
      • Follow @ApacheDrill on Twitter
      • Sign up at mailing lists (user | dev)
        http://incubator.apache.org/drill/mailing-lists.html
      • Standing G+ hangouts every Tuesday at 5pm GMT
        http://j.mp/apache-drill-hangouts
      • Keep an eye on http://drill-user.org/