Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Phoenix Data Conference 2014 - Mike Farnbach

Phoenix Data Conference 2014 - Mike Farnbach

Apache Drill: Highly Flexible, High Performance Queries on Big Data

teamclairvoyant

October 25, 2014
Tweet

More Decks by teamclairvoyant

Other Decks in Technology

Transcript

  1. ® © 2014 MapR Technologies 1 ® © 2014 MapR

    Technologies Self-Service Data Exploration and Nested-Data Analytics on Hadoop Michael Farnbach October 2014
  2. ® © 2014 MapR Technologies 2 Top Ranked 500+ Customers

    Cloud Leaders MapR Enterprise Hadoop
  3. ® © 2014 MapR Technologies 3 UNSTRUCTURED DATA STRUCTURED DATA

    1980 2000 2010 1990 2020 Unstructured data will account for more than 80% of the data collected by organizations Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data Total Data Stored
  4. ® © 2014 MapR Technologies 4 Today’s Data Comes in

    Different Shapes… Social Media Messages Audio Sensors Mobile Data Email Clickstream
  5. ® © 2014 MapR Technologies 5 Data landscape is changing

    { "data": [ "id": "X999_Y999", "from": { "name": "Tom Brady", "id": "X12" }, "message": "Looking forward to 2014!", "actions": [ { "name": "Comment", "link": "http://www.facebook.com/X99/posts Y999" }, { "name": "Like", "link": "http://www.facebook.com/X99/posts Y999" } ], "type": "status", "created_time": "2013-08-02T21:27:44+0000", "updated_time": "2013-08-02T21:27:44+0000" } } JSON New types of applications •  Social, mobile, Web, “Internet of Things”, Cloud… •  Iterative/Agile in nature •  More users, more data New data models & data types •  Flexible (schema-less) data •  Rapidly changing •  Semi-structured •  Nested
  6. ® © 2014 MapR Technologies 9 Rethink SQL for Big

    Data •  ANSI SQL •  Familiar and ubiquitous •  Interactive •  Interactive nature crucial for BI/Analytics •  Support Structured Data •  Traditional Databases •  Flexible •  Semi-structured data types •  Nested data types •  Many data sources •  Agility •  Self describing data •  Schema-less access •  Scalability •  Process efficiently vast amounts of data Preserve Invent
  7. ® © 2014 MapR Technologies 10 SQL with existing technologies

    on Hadoop is limited select * from A where exists ( select 1 from B where B.b < 100 ); Did you know Apache HIVE cannot compute this query? –  e.g. Hive, Impala, Spark SQL
  8. ® © 2014 MapR Technologies 11 Self-described Data is not

    handled at all select cf.month, cf.year from hbase.table1; •  Did you know normal SQL cannot handle the above? •  Nor can HIVE and its variants like Impala, Shark? •  Because there’s no meta-store definition available
  9. ® © 2014 MapR Technologies 13 Performance and Scale: Distributed

    Database Zookeeper DFS/HBase DFS/HBase DFS/HBase Drillbit Distributed Cache Drillbit Distributed Cache Drillbit Distributed Cache Query 1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf) 2. Drillbit generates execution plan based on query optimization & locality 3. Fragments are farmed to individual nodes 4. Result is returned to driving node
  10. ® © 2014 MapR Technologies 14 Performance and Scale: Efficient

    Processing Run Time Compilation Binary Data Representation Columnar Model
  11. ® © 2014 MapR Technologies 15 Performance and Scale: No

    Checkpointing 0 20 40 60 80 100 120 140 160 Speed vs. check-pointing Apache Drill
  12. ® © 2014 MapR Technologies 16 Performance and Scale: Disk

    Spill Capability •  Random access: sort without copy or restructuring •  Avoids serialization/deserialization •  Off-heap (no GC woes when lots of memory) •  Full specification + off-heap + batch –  Enables C/C++ operators (fast!) •  Read/write to disk –  when data larger than memory DISK Drill Bit Memory overflow uses disk
  13. ® © 2014 MapR Technologies 17 Apache Drill brings SQL

    capabilities to many data sources Centralized  schema   -­‐  Sta&c   -­‐  Managed  by  the  DBAs   -­‐  In  a  centralized  repository     Long,  me&culous  data  prepara&on  process  (ETL,   create/alter  schema,  etc.)    –  can  take  6-­‐18  months   Self-­‐describing,  or  schema-­‐less,  data   -­‐  Dynamic/evolving   -­‐  Managed  by  the  applica&ons   -­‐  Embedded  in  the  data     Less  schema,  more  suitable  for  data  that  has   higher  volume,  variety  and  velocity     Apache  Drill  
  14. ® © 2014 MapR Technologies 18 Flexibility: Modern Data Model

    HBase JSON BSON CSV TSV Parquet Avro Schema-less Fixed schema Flat Complex Flexibility Flexibility Name! Gender! Age! Michael! M! 6! Jennifer! F! 3! {! name: {! first: Michael,! last: Smith! },! hobbies: [ski, soccer],! district: Los Altos! }! {! name: {! first: Jennifer,! last: Gates! },! hobbies: [sing],! preschool: CCLC! }! RDBMS/SQL-on-Hadoop table Apache Drill table
  15. ® © 2014 MapR Technologies 19 Drill { “ID”: 1,

    “NAME”: “Fairmont San Francisco”, “DESCRIPTION”: “Historic grandeur…”, “AVG_REVIEWER_SCORE”: “4.3”, “AMENITY”: {“TYPE”: “gym”, DESCRIPTION: “fitness center” }, {“TYPE”: “wifi”, “DESCRIPTION”: “free wifi”}, “RATE_TYPE”: “nightly”, “PRICE”: “$199”, “REVIEWS”: [“review_1”, “review_2”], “ATTRACTIONS”: “Chinatown”, } JSON Drill Flexible Schema Management HotelID AmenityID 1 1 1 2 ID Type Descriptio n 1 Gym Fitness center 2 Wifi Free wifi Drill doesn’t require any schema definitions to query data making it faster to get insights from data for users. Drill leverages schema definitions if exists.
  16. ® © 2014 MapR Technologies 20 Flexibility: No Schema Required

    Data Modeling and Transformation Data Visualization IT-driven IT-driven IT-driven Self-service IT-driven Self-service Optional Self-service Traditional BI w/ RDBMS Self-Service BI w/ RDBMS SQL-on-Hadoop Self-Service Data Exploration Zero-day analytics
  17. ® © 2014 MapR Technologies 21 Flexibility: Change Schemas on

    the Fly •  Schema can change over course of query •  Operators are able to reconfigure themselves on schema change events –  Minimize flexibility overhead –  Support more advanced execution optimization based on actual data characteristics
  18. ® © 2014 MapR Technologies 22 Flexibility: What does it

    mean? •  No ETL •  Reach out directly to the particular table/file •  As long as the permissions are fine, you can do it •  No need to have the meta-data –  None needed
  19. ® © 2014 MapR Technologies 23 Flexibility: Automatic Handling of

    Nested Data SQL Parser Optimizer Scheduler Pig Parser Physical Plan Mongo Cassandra HiveQL Parser RPC Endpoint Distributed Cache Storage Engine Interface Operators Operators Foreman Logical Plan HDFS HBase JDBC Endpoint ODBC Endpoint
  20. ® © 2014 MapR Technologies 24 Flexibility: Combine data sources

    on the fly •  JSON •  CSV •  ORC (ie, all Hive types) •  Parquet •  HBase tables •  … can combine them Select USERS.name, USERS.emails.work from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS, where LOGS.uid = USERS.uid and errorLevel > 5 order by count(*);
  21. ® © 2014 MapR Technologies 25 Seamless Integration: ANSI SQL

    2003 •  Users want “standard” SQL rather than SQL-like (HiveQL) –  Leverage existing expertise –  Better support for BI/DI tools •  Drill’s supports ANSI SQL –  Extensions to handle complex data •  Current status: –  Drill 0.5 runs 15 of 22 TPC-H queries unmodified –  Impala 1.4 runs 2 of 22 TPC-H queries unmodified •  JDBC/ODBC Driver support
  22. ® © 2014 MapR Technologies 26 Seamless Integration: Apache Hive

    •  Low latency queries on Hive tables •  Support for 100s of Hive file formats •  Ability to reuse Hive UDFs •  Support for multiple Hive metastores in a single query
  23. ® © 2014 MapR Technologies 27 ! SELECT timestamp, message!

    FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` ! WHERE errorLevel > 2! Data Source is in the Query A storage engine instance -  DFS -  HBase -  Hive Metastore/HCatalog A workspace -  Sub-directory -  Hive database -  HBase namespace A table -  pathnames -  HBase table -  Hive table
  24. ® © 2014 MapR Technologies 28 ! // On a

    file select errorLevel, count(*)
 from dfs.logs.`/AppServerLogs/2014/Jan/ part0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*)
 from dfs.logs.`/AppServerLogs`
 group by errorLevel Examples: Can be an entire directory tree
  25. ® © 2014 MapR Technologies 29 ! // count the

    number of tweets per customer, where the customers are in Hive, and their tweets are in HBase. Note that the hbase data has no meta-data information select c.customerName, hb.tweets.count from hive.CustomersDB.`Customers` c join hbase.user.`SocialData` hb on c.customerId = convert_from( hb.rowkey, UTF-8); Examples: De-centralized metadata
  26. ® © 2014 MapR Technologies 30 Further Reading •  http://incubator.apache.org/drill/

    •  https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill +Wiki
  27. ® © 2014 MapR Technologies 31 Q & A @mapr

    maprtech [email protected] Engage with us! MapR maprtech mapr-technologies
  28. ® © 2014 MapR Technologies 32 Data entering Hadoop Hadoop

    / HBase DW platforms (e.g., Oracle, Teradata) Move to traditional systems Use Case 1: SQL on Hadoop Process data using Hive/Pig/Mapreduce Data exploration / Raw data analysis Enterprise users Ad-hoc query/BI reporting Enterprise users Ad-hoc query/ BI reporting Instant / “Day 0” queries
  29. ® © 2014 MapR Technologies 33 Data entering HBase directly

    Hadoop / HBase Mongo/Cassandra Move data from NoSQL systems without transformation Use Case 2: Single Interface for Structured and Semi-Structured Data (SQL-on-Hadoop and SQL-on-NoSQL) Process data using Hive/Pig/MapReduce and populate HBase Data exploration/ Raw data analysis Enterprise users Ad-hoc query/ BI reporting
  30. ® © 2014 MapR Technologies 34 MapR Distribution for Apache

    Hadoop MapR Data Platform (Random Read/Write) Data Hub Enterprise Grade Operational MapR-FS (POSIX) MapR-DB (High-Performance NoSQL) Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & Coordination Savannah* Mahout MLLib ML, Graph GraphX MapReduce v1 & v2 APACHE HADOOP AND OSS ECOSYSTEM EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala SparkSQL Drill SQL Sentry* Oozie ZooKeeper Sqoop Knox* Whirr Falcon* Flume Data Integration & Access HttpFS Hue NFS   HDFS  API   HBase  API   JSON  API   MapR Control System (Management and Monitoring) * In Roadmap for inclusion/certification CLI   GUI   REST  API