[Figure: total data stored, 1980–2020, with unstructured data overtaking structured data]
Unstructured data will account for more than 80% of the data collected by organizations.
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Preserve:
• ANSI SQL – familiar and ubiquitous
• Interactive – the interactive nature is crucial for BI/analytics
• Support for structured data – traditional databases

Invent:
• Flexible – semi-structured data types, nested data types, many data sources
• Agility – self-describing data, schema-less access
• Scalability – process vast amounts of data efficiently
SQL on Hadoop is limited

  select * from A
  where exists (select 1 from B where B.b < 100);

• Did you know Apache Hive cannot compute this query?
  – Nor can its variants, e.g. Impala, Spark SQL
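The limitation above can often be worked around by hand, by rewriting the uncorrelated EXISTS into a join form these engines do accept. A minimal sketch of one such rewrite, keeping the table names A and B from the slide:

```sql
-- Sketch: rewrite the uncorrelated EXISTS as a cross join against an
-- aggregated existence flag; table names A and B are from the slide.
select A.*
from A
cross join (select count(*) as cnt
            from B
            where B.b < 100) t
where t.cnt > 0;
```

This returns every row of A when B contains at least one qualifying row and nothing otherwise, which matches the semantics of the uncorrelated EXISTS.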
handled at all

  select cf.month, cf.year
  from hbase.table1;

• Did you know standard SQL cannot handle the above query?
• Nor can Hive and its variants, like Impala or Shark
• Because there is no metastore definition available for the HBase column family `cf`
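Drill, by contrast, exposes an HBase column family as a map, so the same columns can be reached without any metastore definition. A hedged sketch, following the table and qualifier names on the slide; the CONVERT_FROM calls are there because HBase stores values as raw bytes:

```sql
-- Sketch: the column family `cf` surfaces as a map of byte arrays;
-- CONVERT_FROM decodes each value into a readable string.
select convert_from(t.cf.`month`, 'UTF8') as `month`,
       convert_from(t.cf.`year`, 'UTF8')  as `year`
from hbase.`table1` t;
```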
[Diagram: Zookeeper coordinating Drillbits, each with a distributed cache, running alongside DFS/HBase on every node]

1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates an execution plan based on query optimization & locality
3. Fragments are farmed out to individual nodes
4. Result is returned to the driving node
Spill Capability
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when using lots of memory)
• Full specification + off-heap + batch
  – Enables C/C++ operators (fast!)
• Read/write to disk – when data is larger than memory

[Diagram: a Drillbit spilling from memory to disk on memory overflow]
capabilities to many data sources

Centralized schema
- Static
- Managed by the DBAs
- In a centralized repository
Long, meticulous data preparation process (ETL, create/alter schema, etc.) – can take 6–18 months

Self-describing, or schema-less, data
- Dynamic/evolving
- Managed by the applications
- Embedded in the data
Less schema – more suitable for data with higher volume, variety and velocity – Apache Drill
{
  "NAME": "Fairmont San Francisco",
  "DESCRIPTION": "Historic grandeur…",
  "AVG_REVIEWER_SCORE": "4.3",
  "AMENITY": [
    {"TYPE": "gym", "DESCRIPTION": "fitness center"},
    {"TYPE": "wifi", "DESCRIPTION": "free wifi"}
  ],
  "RATE_TYPE": "nightly",
  "PRICE": "$199",
  "REVIEWS": ["review_1", "review_2"],
  "ATTRACTIONS": "Chinatown"
}

JSON + Drill = Flexible Schema Management

The same amenity data modeled relationally:

HotelAmenity:  HotelID | AmenityID
               1       | 1
               1       | 2

Amenity:  ID | Type | Description
          1  | Gym  | Fitness center
          2  | Wifi | Free wifi

Drill doesn't require any schema definitions to query data, making it faster for users to get insights from their data. Drill leverages schema definitions if they exist.
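To illustrate the point, a hedged sketch of querying the hotel document above directly – no schema definition, with array indexing reaching into the nested amenities (the file path is an assumption):

```sql
-- Sketch: query the raw JSON in place; /data/hotels.json is a
-- hypothetical path, and AMENITY[0] indexes into the nested array.
select t.NAME,
       t.AMENITY[0].TYPE as first_amenity,
       t.PRICE
from dfs.`/data/hotels.json` t;
```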
                                Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS         IT-driven                        | IT-driven
Self-Service BI w/ RDBMS        IT-driven                        | Self-service
SQL-on-Hadoop                   IT-driven                        | Self-service
Self-Service Data Exploration   Optional                         | Self-service  (zero-day analytics)
the Fly
• Schema can change over the course of a query
• Operators are able to reconfigure themselves on schema change events
  – Minimizes the overhead of this flexibility
  – Supports more advanced execution optimization based on actual data characteristics
mean?
• No ETL
• Reach out directly to the particular table/file
• As long as the permissions allow it, you can query it
• No need to have the metadata – none is needed
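A minimal sketch of what "no ETL" looks like in practice – the query points straight at a file, with no load step and no metastore entry (the path is hypothetical):

```sql
-- Sketch: no preparation step -- query a raw file in place;
-- /data/events/2014/jan/events.json is a hypothetical path.
select *
from dfs.`/data/events/2014/jan/events.json`
limit 5;
```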
on the fly
• JSON
• CSV
• ORC (i.e., all Hive file formats)
• Parquet
• HBase tables
• … and you can combine them

  select USERS.name, USERS.emails.work, count(*)
  from dfs.logs.`/data/logs` LOGS,
       dfs.users.`/profiles.json` USERS
  where LOGS.uid = USERS.uid
    and LOGS.errorLevel > 5
  group by USERS.name, USERS.emails.work
  order by count(*);
• Low latency queries on Hive tables • Support for 100s of Hive file formats • Ability to reuse Hive UDFs • Support for multiple Hive metastores in a single query
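A hedged sketch of that integration: once the hive storage plugin points at a metastore, its tables are queryable like any other Drill source (the database and table names below are assumptions):

```sql
-- Sketch: query a Hive-managed table through Drill's hive storage
-- plugin; `default`.`customers` is a hypothetical database/table.
select customerName, count(*)
from hive.`default`.`customers`
group by customerName;
```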
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
  WHERE errorLevel > 2

The data source is in the query:
• A storage engine instance – DFS, HBase, Hive metastore/HCatalog
• A workspace – sub-directory, Hive database, HBase namespace
• A table – pathnames, HBase table, Hive table
file

  -- On a single file
  select errorLevel, count(*)
  from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
  group by errorLevel;

  -- On the entire data collection: all years, all months
  select errorLevel, count(*)
  from dfs.logs.`/AppServerLogs`
  group by errorLevel;

Examples: can be an entire directory tree
number of tweets per customer, where the customers are in Hive and their tweets are in HBase. Note that the HBase data has no metadata information.

  select c.customerName, hb.tweets.count
  from hive.CustomersDB.`Customers` c
  join hbase.user.`SocialData` hb
    on c.customerId = convert_from(hb.rowkey, 'UTF8');

Examples: De-centralized metadata
Use Case 1: SQL on Hadoop
• Today: raw data lands in Hadoop/HBase, is processed using Hive/Pig/MapReduce, and is moved to traditional DW platforms (e.g., Oracle, Teradata) where enterprise users run ad-hoc query/BI reporting
• With Drill: enterprise users get ad-hoc query/BI reporting plus data exploration/raw data analysis directly on Hadoop – instant / "Day 0" queries
Use Case 2: Single Interface for Structured and Semi-Structured Data (SQL-on-Hadoop and SQL-on-NoSQL)
• Today: data in Hadoop/HBase is processed using Hive/Pig/MapReduce to populate HBase before enterprise users can run ad-hoc query/BI reporting
• With Drill: query NoSQL systems (Mongo/Cassandra) and Hadoop/HBase directly, without moving or transforming the data – data exploration/raw data analysis and ad-hoc query/BI reporting in one interface