
Apache Drill status 04-2013

Talk at the HUG Munich on 2013-04-19

Michael Hausenblas

April 19, 2013

Transcript

  1. Workloads
     •  Batch processing (MapReduce)
     •  Light-weight OLTP (HBase, Cassandra, etc.)
     •  Stream processing (Storm, S4)
     •  Search (Solr, Elasticsearch)
     •  Interactive, ad-hoc query and analysis (?)
  2. Use Case I
     •  Jane, a marketing analyst
     •  Determine target segments
     •  Data from different sources
  3. Use Case II
     •  Logistics – supplier status
     •  Queries
        –  How many shipments from supplier X?
        –  How many shipments in region Y?

     SUPPLIER_ID   NAME                  REGION
     ACM           ACME Corp             US
     GAL           GotALot Inc           US
     BAP           Bits and Pieces Ltd   Europe
     ZUP           Zu Pli                Asia

     { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
     { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
     …
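The two queries on this slide need data from both sources: the supplier table and the JSON shipment records. A minimal sketch in Python, using only the rows shown on the slide, illustrates the join they imply. The function names and the in-memory layout are assumptions made for illustration; this is not how Drill itself evaluates such a query.

```python
import json
from collections import Counter

# Supplier table from the slide: SUPPLIER_ID -> name and region.
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The two shipment records shown on the slide, as one JSON array.
SHIPMENTS_JSON = """
[
  {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
   "description": "first delivery today"},
  {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
   "description": "hope you enjoy it"}
]
"""


def shipments_by_supplier(shipments):
    """Count shipments per supplier ID (hypothetical helper)."""
    return Counter(s["supplier"] for s in shipments)


def shipments_by_region(shipments, suppliers):
    """Join each shipment to its supplier row, then count per region."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


shipments = json.loads(SHIPMENTS_JSON)
per_supplier = shipments_by_supplier(shipments)
per_region = shipments_by_region(shipments, SUPPLIERS)

print(per_supplier["ACM"])   # shipments from supplier X = ACM
print(per_region["Europe"])  # shipments in region Y = Europe
```

The second query is the interesting one: the region lives only in the supplier table, so it cannot be answered from the shipment records alone.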
  4. Today’s Solutions
     •  RDBMS-focused
        –  ETL data from MongoDB and Hadoop
        –  Query data using SQL
     •  MapReduce-focused
        –  ETL from RDBMS and MongoDB
        –  Use Hive, etc.
  5. Requirements
     •  Support for different data sources
     •  Support for different query interfaces
     •  Low-latency/real-time
     •  Ad-hoc queries
     •  Scalable, reliable
  6. Apache Drill Overview
     •  Inspired by Google’s Dremel
     •  Standard SQL 2003 support
     •  Other QL possible
     •  Pluggable data sources
     •  Support for nested data
     •  Schema is optional
     •  Community driven, open, 100’s involved
  7. High-level Architecture
     •  Each node: Drillbit - maximize data locality
     •  Co-ordination, query planning, execution, etc, are distributed
     •  By default Drillbits hold all roles
     •  Any node can act as endpoint for a query
     [Diagram: four nodes, each with a Storage Process and a Drillbit]
  8. High-level Architecture
     •  Zookeeper for ephemeral cluster membership info
     •  Distributed cache (Hazelcast) for metadata, locality information, etc.
     [Diagram: Curator/Zk and four nodes, each with a Storage Process, a Drillbit and a Distributed Cache]
  9. High-level Architecture
     •  Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
     •  Streaming data communication avoiding SerDe
     [Diagram: Curator/Zk and four nodes, each with a Storage Process, a Drillbit and a Distributed Cache]
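The architecture slides stress data locality: every node runs a Drillbit, and the foreman uses locality information when scheduling work. A toy sketch of what locality-aware assignment could look like follows. Everything in it is an assumption for illustration (the `assign_blocks` helper, the node and block names, the least-loaded tie-break); the slides do not describe Drill's actual scheduling algorithm.

```python
def assign_blocks(block_locations, drillbits):
    """Assign each block to the least-loaded Drillbit that holds a local copy.

    Falls back to the least-loaded Drillbit overall when no copy is local.
    Hypothetical helper, not Drill's scheduler.
    """
    load = {d: 0 for d in drillbits}
    assignment = {}
    for block, hosts in sorted(block_locations.items()):
        local = [h for h in hosts if h in load]
        candidates = local or list(drillbits)
        # Break ties on the node name so the result is deterministic.
        chosen = min(candidates, key=lambda d: (load[d], d))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: four nodes, each running a Drillbit.
drillbits = ["node1", "node2", "node3", "node4"]

# Hypothetical mapping of data blocks to the hosts that store them.
block_locations = {
    "block-a": ["node1", "node2"],
    "block-b": ["node1"],
    "block-c": ["node3", "node4"],
    "block-d": ["node9"],  # stored on a host without a Drillbit
}

assignment = assign_blocks(block_locations, drillbits)
for block, node in sorted(assignment.items()):
    print(block, "->", node)
```

Only `block-d` is read remotely in this run, because no Drillbit sits next to its data; every other block is processed on a node that stores it.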
  10. Principled Query Execution
      [Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
      Diagram labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; topology

      query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" },
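The logical plan fragment above is a sequence of operators: a scan of the source "logs" followed by a filter on "x > 3". The following toy interpreter, written in Python, shows how such a sequence could be evaluated step by step. It is a sketch under stated assumptions: the plan is rewritten as a plain Python dict, the condition grammar is limited to "field operator number", and the sample rows are invented. It is not Drill's reference interpreter.

```python
import operator

# Comparison operators supported by this toy condition parser (assumption).
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def parse_condition(text):
    """Parse a 'field op number' condition such as 'x > 3'."""
    field, op, value = text.split()
    return field, OPS[op], float(value)


def run_sequence(plan, sources):
    """Run the steps of a 'sequence' operator one after another."""
    rows = []
    for step in plan["do"]:
        if step["op"] == "scan":
            # A scan replaces the current rows with the named source.
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            field, compare, value = parse_condition(step["condition"])
            # Rows lacking the field are dropped rather than raising an error.
            rows = [r for r in rows if field in r and compare(r[field], value)]
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return rows


# The plan fragment from the slide, as a Python dict.
plan = {
    "@id": "log",
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

# Invented sample data standing in for the "logs" source.
sources = {"logs": [{"x": 1}, {"x": 4}, {"x": 7}, {"y": 9}]}

result = run_sequence(plan, sources)
print(result)
```

Keeping the plan as data rather than code is what lets different front ends (SQL 2003, DrQL, MongoQL) target the same execution layer.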
  11. Drillbit Modules
      [Diagram labels: DFS Engine, HBase Engine, Mongo; RPC Endpoint; SQL, HiveQL, Pig; Parser; Distributed Cache; Logical Plan; Physical Plan; Optimizer; Storage Engine Interface; Scheduler; Foreman; Operators]
  12. Key Features
      •  Full SQL 2003
      •  Nested data
      •  Optional schema
      •  Extensibility points
  13. Full SQL – ANSI SQL 2003
      •  SQL-like is often not enough
      •  Integration with existing tools
         –  Datameer, Tableau, Excel, SAP Crystal Reports
         –  Use standard ODBC/JDBC driver
  14. Nested Data
      •  Nested data becoming prevalent
         –  JSON/BSON, XML, ProtoBuf, Avro
         –  Some data sources support it natively (MongoDB, etc.)
      •  Flattening nested data is error-prone
      •  Extension to ANSI SQL 2003
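To make the "flattening is error-prone" point concrete, here is a small Python sketch. It flattens a nested array into one row per element, which is a common naive approach, and then compares aggregates over the nested and the flattened data. The records and the `flatten` helper are hypothetical and only loosely modelled on the donut example later in the deck.

```python
def flatten(records, array_field):
    """Naive flattening: emit one row per element of the nested array."""
    rows = []
    for rec in records:
        for item in rec.get(array_field, []):
            # Copy the parent fields, then add the child fields with a prefix.
            row = {k: v for k, v in rec.items() if k != array_field}
            row.update({f"{array_field}.{k}": v for k, v in item.items()})
            rows.append(row)
    return rows


# Hypothetical nested records.
records = [
    {"id": "0001", "quantity": 10,
     "batter": [{"type": "Regular"}, {"type": "Chocolate"}]},
    {"id": "0002", "quantity": 5, "batter": []},
]

flat = flatten(records, "batter")

nested_total = sum(r["quantity"] for r in records)
flat_total = sum(r["quantity"] for r in flat)
flat_ids = {r["id"] for r in flat}

print(nested_total)  # total over the original records
print(flat_total)    # total over the flattened rows
print(flat_ids)      # record IDs that survived flattening
```

Two things go wrong at once: the parent's quantity is counted once per batter, and the record with an empty array vanishes entirely. Querying the nested structure directly avoids both problems.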
  15. Optional Schema
      •  Many data sources don’t have rigid schemas
         –  Schema changes rapidly
         –  Different schema per record (e.g. HBase)
      •  Supports queries against unknown schema
      •  User can define schema or via discovery
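Schema discovery over records that each carry their own fields can be sketched in a few lines of Python. The helper names and the sample records below are assumptions for illustration; the slide does not say how Drill performs discovery.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def select(records, field, default=None):
    """Project one field, tolerating records that lack it."""
    return [rec.get(field, default) for rec in records]


# Hypothetical records with a different schema per record.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125", "region": "Asia"},
]

schema = discover_schema(records)
print(sorted(schema))                # every field seen in any record
print(sorted(schema["shipment"]))    # a field with more than one type
print(select(records, "supplier"))   # missing values become None
```

The discovered schema is a union over all records, so a query against an "unknown" schema has to cope with both missing fields and fields whose type varies.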
  16. Extensibility Points
      •  Source query → parser API
      •  Custom operators, UDF → logical plan
      •  Serving tree, CF, topology → physical plan/optimizer
      •  Data sources & formats → scanner API
      [Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
  17. … and Hadoop?
      •  HDFS can be a data source
      •  Complementary use cases*
      •  … use Apache Drill
         –  Find record with specified condition
         –  Aggregation under dynamic conditions
      •  … use MapReduce
         –  Data mining with multiple iterations
         –  ETL
      *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  18. Example
      https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

      data source: donuts.json
        { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …

      logical plan: simple_plan.json
        query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …

      result: out.json
        { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
        { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
        { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
  19. Status
      •  Heavy development by multiple organizations
      •  Available
         –  Logical plan (ADSP)
         –  Reference interpreter
         –  Basic SQL parser
         –  Basic demo
         –  Basic HBase back-end
  20. Status
      April 2013
      •  Extend SQL syntax
      •  Physical plan
      •  In-memory compressed data interfaces
      •  Distributed execution
  21. Contributing
      •  Learn where and how to contribute
         https://cwiki.apache.org/confluence/display/DRILL/Contributing
      •  Jira, Git, Apache build and test tools
      •  Preparing for dependencies
         –  Hazelcast
         –  Netflix Curator
  22. Contributing
      General contributions appreciated:
      •  Supersonic (?)
      •  Test data & test queries
      •  Use case scenarios (textual desc./SQL queries)
      •  Documentation
  23. Contributing
      •  Dremel-inspired columnar format
         –  Twitter’s Parquet
         –  Hive’s ORC file
      •  Integration with Hive metastore (?)
      •  DRILL-13 Storage Engine: Define Java Interface
      •  DRILL-15 Build HBase storage engine implementation
  24. Contributing
      •  DRILL-48 RPC interface for query submission and physical plan execution
      •  DRILL-53 Setup cluster configuration and membership mgmt system
      •  Further schedule
         –  Alpha Q2
         –  Beta Q3
  25. Kudos to …
      •  Julian Hyde, Pentaho
      •  Lisen Mu
      •  Tim Chen, Microsoft
      •  Chris Merrick, RJMetrics
      •  David Alves, UT Austin
      •  Sree Vaadi, SSS/NGData
      •  Jacques Nadeau, MapR
      •  Ted Dunning, MapR
  26. Engage!
      •  Follow @ApacheDrill on Twitter
      •  Sign up at mailing lists (user | dev)
         http://incubator.apache.org/drill/mailing-lists.html
      •  Standing G+ hangouts every Tuesday at 18:00 CET
      •  Keep an eye on http://drill-user.org/