Slide 1

Slide 1 text

© 2014 MapR Technologies
Sept 18th, 2014
Richard Shaw, Solutions Architect

Slide 2

Slide 2 text

Pre-Slideware Summary
Low Latency ANSI SQL on Hadoop Data & NoSQL
At the same time

Slide 3

Slide 3 text

MapR Enterprise Hadoop
• Top Ranked
• 500+ Customers
• Cloud Leaders

Slide 4

Slide 4 text

WORLDWIDE PRESENCE & CUSTOMER SUPPORT
[Map; label: HQ]

Slide 5

Slide 5 text

One of Our Key Strengths: We Innovate

Slide 6

Slide 6 text

Hadoop Distributions
[Diagram labels: Distribution A, Distribution C; layers: Open Source, Management, Architectural Innovations]

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text

Silos make analysis very difficult
• How do I identify a unique {customer, trade} across data sets?
• How can I guarantee the lack of anomalous behavior if I can’t see all data?

Slide 10

Slide 10 text

Here’s an idea
Give Users the Power to Query Across Silos
...Irrespective of Data Types

Slide 11

Slide 11 text

Rethink SQL for Big Data

Preserve
• ANSI SQL
  – Familiar and ubiquitous
• Performance
  – Interactive nature crucial for BI/Analytics
• One technology
  – Painful to manage different technologies
• Enterprise ready
  – System-of-record, HA, DR, Security, Multi-tenancy, …

Invent
• Flexible data-model
  – Allow schemas to evolve rapidly
  – Support semi-structured data types
• Agility
  – Self-service possible when developer and DBA are the same person
• Scalability
  – In all dimensions: data, speed, schemas, processes, management

Slide 12

Slide 12 text

SQL is here to stay

Slide 13

Slide 13 text

Hadoop is here to stay

Slide 14

Slide 14 text

Self-Describing Data Ubiquitous

Centralised schema
- Static
- Managed by the DBAs
- In a centralised repository
Long, meticulous data preparation process (ETL, create/alter schema, etc.)

Self-describing, or schema-less, data
- Dynamic/evolving
- Managed by the applications
- Embedded in the data
Less schema, more suitable for data that has higher volume, variety and velocity

Apache Drill

Slide 15

Slide 15 text

Drill
● Apache open source project
● Scale-out execution engine for low-latency SQL queries
● Unified SQL-based API for zero day analytics & operational applications
● Flexible data sources

Slide 16

Slide 16 text

Drill & Dremel
● Inspired by Google Tech
● SQL querying of Google data over GFS & BigTable
● In production use since 2006: 8 YEARS!
● Tens of thousands of concurrent users over PB of data
● Dremel paper released 2010

Slide 17

Slide 17 text

Drill
[Diagram: ZooKeeper alongside three nodes; each node runs a Drillbit with a Distributed Cache on top of DFS/HBase; a Query arrives at one Drillbit]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node

Slide 18

Slide 18 text

A Drill Database
• What is a database with Drill/MapR? There isn’t one
• Just a directory, with a bunch of related files or other sources

~/work/bugs

BugList
symptom    version  date     bugid  dump-name
app crash  3.1.1    14/7/14  12345  cust1.tgz
app slow   3.1.0    12/7/14  45678  cust2.tgz

Customers
name  rep    se       dump-name
xxxx  dkim   junhyuk  cust1.tgz
yyyy  yoshi  aki      cust2.tgz

Slide 19

Slide 19 text

Data Source is in the Query

    select timestamp, message
    from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
    where errorLevel > 2

dfs1: This is a cluster in Apache Drill
- DFS
- HBase
- Hive meta-store

logs: A work-space
- Typically a sub-directory
- Hive database

`AppServerLogs/2014/Jan/p001.parquet`: A table
- pathnames
- HBase table
- Hive table
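The three-part name in the query on this slide can be pulled apart mechanically. The following is a minimal sketch in plain Python, not part of Drill: the function name `split_table_reference` and the field names are hypothetical, and it assumes the table part is always backtick-quoted as on the slide.

```python
from typing import NamedTuple


class TableReference(NamedTuple):
    storage: str    # the cluster/storage entry, e.g. "dfs1"
    workspace: str  # the work-space, e.g. "logs"
    table: str      # the table: a pathname, HBase table or Hive table


def split_table_reference(reference: str) -> TableReference:
    """Split storage.workspace.`table` into its three parts.

    Assumes the table part is wrapped in backticks, as on the slide, so
    dots inside the pathname (".parquet") are not treated as separators.
    """
    prefix, _, rest = reference.partition("`")
    if not rest.endswith("`"):
        raise ValueError(f"table part is not backtick-quoted: {reference!r}")
    parts = prefix.rstrip(".").split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected storage.workspace before the table: {reference!r}")
    return TableReference(parts[0], parts[1], rest[:-1])


ref = split_table_reference("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(ref.storage, ref.workspace, ref.table)
```

Quoting the table part is what lets a pathname with slashes and dots sit inside a dotted SQL identifier.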

Slide 20

Slide 20 text

Can be an entire directory tree

    -- On a file
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
    group by errorLevel;

    -- On the entire data collection: all years, all months
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs`
    group by errorLevel;
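To make the difference between the two queries concrete, here is a rough Python sketch of the same aggregation over one file and over a whole directory tree. It is an illustration only, not how Drill executes the query: the sample data is made up, the helper `count_by_error_level` is hypothetical, and the files are JSON lines rather than Parquet so that only the standard library is needed.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_by_error_level(root: Path) -> Counter:
    """Count records per errorLevel in every *.json file at or below root.

    root may be a single file or a directory; a directory is searched
    recursively, which mirrors pointing the query at `/AppServerLogs`.
    """
    files = [root] if root.is_file() else sorted(root.rglob("*.json"))
    counts: Counter = Counter()
    for path in files:
        with path.open() as handle:
            for line in handle:
                if line.strip():
                    counts[json.loads(line)["errorLevel"]] += 1
    return counts


# Build a small, made-up log tree: AppServerLogs/<year>/<month>/part0001.json
with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp) / "AppServerLogs"
    sample = {"2013/Dec": [1, 3], "2014/Jan": [3, 3, 5]}
    for subdir, levels in sample.items():
        part = logs / subdir / "part0001.json"
        part.parent.mkdir(parents=True)
        part.write_text("\n".join(json.dumps({"errorLevel": lvl}) for lvl in levels))

    one_file = count_by_error_level(logs / "2014/Jan/part0001.json")
    whole_tree = count_by_error_level(logs)

print(dict(one_file))
print(dict(whole_tree))
```

The only thing that changes between the two calls is the path, which is the point the slide makes about the `from` clause.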

Slide 21

Slide 21 text

Combine data sources on the fly
• JSON
• CSV
• ORC (i.e., all Hive types)
• Parquet
• HBase tables
• … can combine them

    select USERS.name, USERS.emails.work
    from dfs.logs.`/data/logs` LOGS,
         dfs.users.`/profiles.json` USERS
    where LOGS.uid = USERS.uid
      and LOGS.errorLevel > 5
    group by USERS.name, USERS.emails.work
    order by count(*);

Slide 22

Slide 22 text

Queries are simple

    select b.bugid, b.symptom, b.date
    from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
    where c.`dump-name` = b.`dump-name`

Let’s say I want to cross-reference against your list:

    select b.bugid, b.symptom
    from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2
    where b.bugid = b2.xxx
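For readers who want to see what the first join returns, the sketch below replays it in plain Python over the sample rows from the "A Drill Database" slide. The hash join in `join_on` is simply a convenient technique for the illustration and is not a statement about how Drill plans or executes joins; the helper name is hypothetical.

```python
# Sample rows copied from the ~/work/bugs example earlier in the deck.
bug_list = [
    {"symptom": "app crash", "version": "3.1.1", "date": "14/7/14",
     "bugid": 12345, "dump-name": "cust1.tgz"},
    {"symptom": "app slow", "version": "3.1.0", "date": "12/7/14",
     "bugid": 45678, "dump-name": "cust2.tgz"},
]
customers = [
    {"name": "xxxx", "rep": "dkim", "se": "junhyuk", "dump-name": "cust1.tgz"},
    {"name": "yyyy", "rep": "yoshi", "se": "aki", "dump-name": "cust2.tgz"},
]


def join_on(left, right, key):
    """Inner hash join: index `right` by key, then probe with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(l, r) for l in left for r in index.get(l[key], [])]


# select b.bugid, b.symptom, b.date ... where c.`dump-name` = b.`dump-name`
result = [
    (bug["bugid"], bug["symptom"], bug["date"])
    for _customer, bug in join_on(customers, bug_list, "dump-name")
]
print(result)
```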

Slide 23

Slide 23 text

What does it mean?
• No ETL
• Reach out directly to the particular table/file
• As long as the permissions are fine, you can do it
• No need to have the meta-data
  – None needed

Slide 24

Slide 24 text

• Schema can change over course of query
• Operators are able to reconfigure themselves on schema change events
  – Minimize flexibility overhead
  – Support more advanced execution optimization based on actual data characteristics
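One way to picture an operator that reconfigures itself on schema change is the toy Python class below. It is a sketch under stated assumptions and not Drill's operator code: the names `schema_of` and `UppercaseStrings` are hypothetical, records arrive in batches, and each batch is assumed to be internally uniform so its schema can be read from the first record.

```python
def schema_of(batch):
    """Schema of a batch as a sorted tuple of (column, type name) pairs.

    Assumes every record in the batch has the same shape as the first one.
    """
    return tuple(sorted((name, type(value).__name__) for name, value in batch[0].items()))


class UppercaseStrings:
    """Upper-cases every string column, re-planning only on schema change."""

    def __init__(self):
        self.schema = None
        self.string_columns = ()
        self.reconfigurations = 0

    def process(self, batch):
        if not batch:
            return []
        schema = schema_of(batch)
        if schema != self.schema:
            # Schema change event: redo the per-schema setup once,
            # instead of inspecting types for every record.
            self.schema = schema
            self.string_columns = tuple(n for n, t in schema if t == "str")
            self.reconfigurations += 1
        return [
            {**row, **{c: row[c].upper() for c in self.string_columns}}
            for row in batch
        ]


batches = [
    [{"name": "classic"}, {"name": "choco"}],
    [{"name": "bostoncreme"}],                # same schema: no reconfiguration
    [{"name": "jelly", "cal": 600}],          # new column appears
    [{"name": "cream", "topping": "sugar"}],  # schema changes again
]
op = UppercaseStrings()
output = [row for batch in batches for row in op.process(batch)]
print(op.reconfigurations)
print(output)
```

Keeping the setup work behind the schema comparison is what keeps the overhead of flexibility low when the schema is stable.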

Slide 25

Slide 25 text

Querying JSON

donuts.json

    { "name": "classic",
      "fillings": [
        { "name": "sugar", "cal": 400 } ] }

    { "name": "choco",
      "fillings": [
        { "name": "sugar", "cal": 400 },
        { "name": "chocolate", "cal": 300 } ] }

    { "name": "bostoncreme",
      "fillings": [
        { "name": "sugar", "cal": 400 },
        { "name": "cream", "cal": 1000 },
        { "name": "jelly", "cal": 600 } ] }

Slide 26

Slide 26 text

Another example

    select d.name, count(d.fillings)
    from (select convert_from(cf1.`donut-json`, 'JSON') as d
          from hbase.user.`donuts`);

• convert_from(xx, 'JSON') invokes the JSON parser inside Drill
• What if you could plug in any parser?
  – XML?
  – Another NoSQL database format
  – Any other file format
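As a rough analogy for what the query does, the Python sketch below decodes JSON stored as raw bytes and then counts the fillings of each donut from the donuts.json slide. It is not Drill code: `decode_json` is a hypothetical stand-in for the parsing step, the `rows` list stands in for an HBase column, and it assumes the intent of `count(d.fillings)` is the number of fillings per donut.

```python
import json

# Stand-in for an HBase column holding one JSON document per row, as bytes.
rows = [
    b'{"name": "classic", "fillings": [{"name": "sugar", "cal": 400}]}',
    b'{"name": "choco", "fillings": [{"name": "sugar", "cal": 400},'
    b' {"name": "chocolate", "cal": 300}]}',
    b'{"name": "bostoncreme", "fillings": [{"name": "sugar", "cal": 400},'
    b' {"name": "cream", "cal": 1000}, {"name": "jelly", "cal": 600}]}',
]


def decode_json(raw: bytes) -> dict:
    """Turn stored bytes into a nested record (the parser is the pluggable part)."""
    return json.loads(raw.decode("utf-8"))


donuts = [decode_json(raw) for raw in rows]

# Number of fillings per donut, plus a nested-field aggregate for comparison.
filling_counts = [(d["name"], len(d["fillings"])) for d in donuts]
total_calories = {d["name"]: sum(f["cal"] for f in d["fillings"]) for d in donuts}
print(filling_counts)
print(total_calories)
```

Swapping `decode_json` for an XML or other parser would leave the rest of the code unchanged, which is the idea behind the "plug in any parser" bullet.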

Slide 27

Slide 27 text

No ETL
• Basically, Drill is querying the raw data directly
• Joining with processed data
• NO ETL
• Folks, this is very, very powerful
• NO ETL

Slide 28

Slide 28 text

Seamless integration with Apache Hive
• Low latency queries on Hive tables
• Support for 100s of Hive file formats
• Ability to reuse Hive UDFs
• Support for multiple Hive metastores in a single query

Slide 29

Slide 29 text

A Quick Tour through Apache Drill

Slide 30

Slide 30 text

Apache Drill
• FLEXIBLE SCHEMA MANAGEMENT: Analyze data, self-described or central metadata
• FRICTIONLESS ANALYTICS ON NESTED DATA: Analyze semi structured & nested data
• PLUG AND PLAY WITH EXISTING: Reuse investments in SQL/BI tools and Apache Hive
… and with an architecture built ground up for Low Latency queries at Scale

Slide 31

Slide 31 text

Apache Drill Roadmap

1.0: Data exploration/ad-hoc queries
• Low-latency SQL
• Schema-less execution
• Files & HBase/M7 support
• Hive integration
• BI and SQL tool support via ODBC/JDBC

1.1: Advanced analytics and operational data
• HBase query speedup
• Nested data functions
• Advanced SQL functionality

2.0: Operational SQL
• Ultra low latency queries
• Single row insert/update/delete
• Workload management

Slide 32

Slide 32 text

Apache Drill Resources
• Drill 0.5
• Getting started with Drill is easy
  – Download Drill Sandbox from mapr.com
• Mailing lists
  – drill-user@incubator.apache.org
  – drill-dev@incubator.apache.org
• Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
• Fork us on GitHub: http://github.com/apache/incubator-drill/
• Create a JIRA: https://issues.apache.org/jira/browse/DRILL

Slide 33

Slide 33 text

Active Drill Community
• Large community, growing rapidly
  – 35-40 contributors, 16 committers
  – Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities
• In 2014
  – over 20 meet-ups, many more coming soon
  – 2 hackathons, with 40+ participants
• Encourage you to join, learn, contribute and have fun …

Slide 34

Slide 34 text

Drill at MapR
• World-class SQL team, ~20 people
• 150+ years combined experience building commercial databases
• Oracle, DB2, ParAccel, Teradata, SQL Server, Vertica
• Team works on Drill, Hive, Impala
• Fixed some of the toughest problems in Apache Hive

Slide 35

Slide 35 text

Thank you!
Richard Shaw
rshaw@mapr.com
@aggress