SQL-on-Hadoop for BI and Analytics : What Are my Options, What is the Future

Slide 1

Slide 1 text

MARK RITTMAN, COO, RITTMAN MEAD BUDAPEST DATA FORUM, JUNE 2016 SQL-ON-HADOOP FOR ANALYTICS + BI:   WHAT ARE MY OPTIONS, WHAT’S THE FUTURE?

Slide 2

Slide 2 text

[email protected] www.rittmanmead.com @rittmanmead 2 •Many customers and organisations are now running initiatives around “big data” •Some are IT-led and are looking for cost-savings around data warehouse storage + ETL •Others are “skunkworks” projects in the marketing department that are now scaling-up •Projects now emerging from pilot exercises •And design patterns starting to emerge Many Organisations are Running Big Data Initiatives

Slide 3

Slide 3 text

[email protected] www.rittmanmead.com @rittmanmead Highly Scalable (and Aﬀordable) Cluster Computing •Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering ‣Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-oﬀs •Hadoop was designed from outside for massive horizontal scalability - using cheap hardware •Anticipates hardware failure and makes multiple copies of data as protection •More nodes you add, more stable it becomes •And at a fraction of the cost of traditional  RDBMS platforms

Slide 4

Slide 4 text

[email protected] www.rittmanmead.com @rittmanmead •We can now aﬀordably create a single, massive archive of all our corporate data •Separate to our OLTP and operational BI tools •Leave it stored at the individual transaction level •And then run multiple compute frameworks on it ‣SQL queries through Hive, Impala etc ‣Data processing using Spark, MR ‣Graph Analysis, etc One Platform, Multiple Processing Frameworks Hadoop Data Lake Webserver  Log Files (txt) Social Media  Logs (JSON) DB Archives  (CSV) Sensor Data  (XML) `Spatial & Graph  (XML, txt) IoT Logs  (JSON, txt) Chat Transcripts  (Txt) DB Transactions  (CSV, XML) Blogs, Articles  (TXT, HTML) Raw Data Processed Data NoSQL Key-Value  Store DB Tabular Data  (Hive Tables) Aggregates  (Impala Tables) NoSQL Document   Store DB

Slide 5

Slide 5 text

[email protected] www.rittmanmead.com @rittmanmead 5 •Typical implementation of Hadoop and big data in an analytic context is the “data lake” •Additional data storage platform with cheap storage, ﬂexible schema support + compute •Data lands in the data lake or reservoir in raw form, then minimally processed •Data then accessed directly by “data scientists”, or processed further into DW In the Context of BI & Analytics : The Data Reservoir Data Transfer Data Access Data Factory Data Reservoir Business Intelligence Tools Hadoop Platform File Based Integration Stream Based Integration Data streams Discovery & Development Labs Safe & secure Discovery and Development environment Data sets and samples Models and programs Marketing / Sales Applications Models Machine Learning Segments Operational Data Transactions Customer Master ata Unstructured Data Voice + Chat Transcripts ETL Based Integration Raw Customer Data Data stored in the original format (usually files) such as SS7, ASN.1, JSON etc. Mapped Customer Data Data sets produced by mapping and transforming raw data

Slide 6

Slide 6 text

[email protected] www.rittmanmead.com @rittmanmead •Programming model for processing large data sets in parallel on a cluster •Not specific to a particular language, but usually written in Java •Inspired by the map and reduce functions commonly used in functional programming ‣Map() performs filtering and sorting ‣Reduce() aggregates the output of mappers ‣and a Shuffle() step to redistribute output by keys •Resolved several complications of distributed computing: ‣Allows unlimited computations on unlimited data ‣Map and reduce functions can be easily distributed ‣Originated at Google; Hadoop was Yahoo’s open-source  implementation of MapReduce, + two are synonymous MapReduce - The Original Big Data Query Framework Mapper Filter, Project Mapper Filter, Project Mapper Filter, Project Reducer Aggregate Reducer Aggregate Output  One HDFS file per reducer,  in a directory

Slide 7

Slide 7 text

[email protected] www.rittmanmead.com @rittmanmead 7 •Original developed at Facebook, now foundational within the Hadoop project •Allows users to query Hadoop data using SQL-like language •Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON) •Generates MapReduce code to return required data •Extensible through SerDes and Storage Handlers •JDBC and ODBC drivers for most platforms/tools •Perfect for set-based access + batch ETL work Apache Hive : SQL Metadata + Engine over Hadoop

Slide 8

Slide 8 text

[email protected] www.rittmanmead.com @rittmanmead •Data integration tools could now load and process Hadoop data •BI tools could treat Hadoop as just another data source •Generally use MapReduce and Hive to access data ‣ODBC and JDBC access to Hive tabular data ‣Allows Hadoop unstructured/semi-structured  data on HDFS to be accessed like RDBMS Hive Provides a SQL Interface for BI + ETL Tools Access direct Hive or extract using ODI12c for structured OBIEE dashboard analysis What pages are people visiting? Who is referring to us on Twitter? What content has the most reach?

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Hive is slow

Slide 12

Slide 12 text

MapReduce is for losers

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

But the future is fast

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

[email protected] www.rittmanmead.com @rittmanmead 16 Hadoop 2.0 Processing Frameworks + Tools

Slide 17

Slide 17 text

[email protected] www.rittmanmead.com @rittmanmead 17 •MapReduce 2 (MR2) splits the functionality of the JobTracker  by separating resource management and job scheduling/monitoring •Introduces YARN (Yet Another Resource Manager) •Permits other processing frameworks to MR ‣For example, Apache Spark •Maintains backwards compatibility with MR1 •Introduced with CDH5+ MapReduce 2 and YARN Node  Manager Node  Manager Node  Manager Resource  Manager Client Client

Slide 18

Slide 18 text

[email protected] www.rittmanmead.com @rittmanmead 18 •Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc •Models processing as an entire data flow graph (DAG), rather than separate job steps ‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems ‣Dataflow steps pass data between them as streams, rather than writing/reading from disk •Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez •Favoured In-memory / Hive v2   route by Hortonworks Apache Tez Input Data TEZ DAG Map() Map() Map() Reduce() Output Data Reduce() Reduce() Reduce() Input Data Map() Map() Reduce() Reduce()

Slide 19

Slide 19 text

[email protected] www.rittmanmead.com @rittmanmead 19 Tez Advantage - Drop-In Replacement for MR with Hive, Pig set hive.execution.engine=mr set hive.execution.engine=tez 4m 17s 2m 25s

Slide 20

Slide 20 text

[email protected] www.rittmanmead.com @rittmanmead 20 •Cloudera’s answer to Hive query response time issues •MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access •Mostly in-memory, but spills to disk if required •Uses Hive metastore to access Hive table metadata •Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc Cloudera Impala - Fast, MPP-style Access to Hadoop Data

Slide 21

Slide 21 text

[email protected] www.rittmanmead.com @rittmanmead 21 •Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list •Run SHOW TABLES Impala SQL command to view tables available •Run COUNT(*) on main ACCESS_PER_POST table to see typical response time Enabling Hive Tables for Impala [oracle@bigdatalite ~]$ impala-shell Starting Impala Shell without Kerberos authentication [bigdatalite.localdomain:21000] > invalidate metadata; Query: invalidate metadata Fetched 0 row(s) in 2.18s [bigdatalite.localdomain:21000] > show tables; Query: show tables +-----------------------------------+ | name | +-----------------------------------+ | access_per_post | | access_per_post_cat_author | | … | | posts | |——————————————————————————————————-+ Fetched 45 row(s) in 0.15s [bigdatalite.localdomain:21000] > select count(*)   from access_per_post; Query: select count(*) from access_per_post +----------+ | count(*) | +----------+ | 343 | +----------+ Fetched 1 row(s) in 2.76s

Slide 22

Slide 22 text

[email protected] www.rittmanmead.com @rittmanmead 22 •Beginners usually store data in HDFS using text file formats (CSV) but these have limitations •Apache AVRO often used for general-purpose processing ‣Splitability, schema evolution, in-built metadata, support for block compression •Parquet now commonly used with Impala due to column-orientated storage ‣Mirrors work in RDBMS world around column-store ‣Only return (project) the columns you require across a wide table Apache Parquet - Column-Orientated Storage for Analytics

Slide 23

Slide 23 text

Hive (and HDFS) is Append-Only

Slide 24

Slide 24 text

[email protected] www.rittmanmead.com @rittmanmead 24 •Main DML operation in HiveQL is SELECT statement HiveQL Limited to SELECT and Bulk Load •CREATE TABLE test (  product_id int,  product_desc string);  •SHOW TABLES;  •CREATE TABLE test2  AS SELECT * FROM test; •SELECT SUM(sales)  FROM sales_summary;  •LOAD DATA INPATH ‘/user/mrittman/logs’ INTO TABLE log_entries;

Slide 25

Slide 25 text

What if you need to UPDATE,  INSERT or DELETE?

Slide 26

Slide 26 text

For example …   SCD2 Dimension Tables?

Slide 27

Slide 27 text

[email protected] www.rittmanmead.com @rittmanmead 27 •Extract data from source into flat files •Load files into Apache HBase tables, with Apache Hive table metadata over them ‣Loading into HBase permits full CRUD activity, rather than append-only loading •Query using Apache Hive and BI/ETL tool; update existing rows by LOAD DATA into Hive Standard Updatable-Hive Table Solution - Hive-on-Base

Slide 28

Slide 28 text

[email protected] www.rittmanmead.com @rittmanmead 28 •Based on BigTable paper from Google, 2006, Dean et al. ‣“Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”Key Features: ‣Distributed storage across cluster of machines – Random, online read and write data access ‣Schemaless data model (“NoSQL”) ‣Self-managed data partitions •Alternative to HDFS as Hive storage layer ‣Allows us to do update and delete  activity rather than just Hive append-only ‣Very fast for incremental loading ‣Can deﬁne Hive tables over HBase ones,  allowing OBIEE to then access them What is HBase?

Slide 29

Slide 29 text

[email protected] www.rittmanmead.com @rittmanmead 29 •HBase Shell CLI allows you to create HBase tables •GET and PUT commands can then be used to add/update cells, query cells etc Creating HBase Tables using HBase Shell hbase shell create 'carriers','details' create 'geog_origin','origin' create 'geog_dest','dest' create 'flight_delays','dims','measures' put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles' put 'geog_dest','LAX','dest:city','Los Angeles, CA' put 'geog_dest','LAX','dest:state','California' put 'geog_dest','LAX','dest:id','12892' hbase(main):015:0> scan 'geog_dest' ROW COLUMN+CELL  LAX column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles  LAX column=dest:city, timestamp=1432067861375, value=Los Angeles,CA  LAX column=dest:id, timestamp=1432067862018,value=12892  LAX column=dest:state, timestamp=1432067861404,value=California  1 row(s) in 0.0240 seconds

Slide 30

Slide 30 text

[email protected] www.rittmanmead.com @rittmanmead 30 •Create Hive tables over the HBase ones to provide SQL load/query capabilities ‣Uses HBaseStorageHandler Storage Handler for HBAse ‣HBase columns mapped to Hive columns using SERDEPROPERTIES Create Hive Table Metadata over HBase Tables CREATE EXTERNAL TABLE hbase_flight_delays (key string, year string, carrier string, orig string, dest string, flights string, late string, cancelled string, distance string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,  measures:flights,measures:late,measures:cancelled,measures:distance") TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");

Slide 31

Slide 31 text

[email protected] www.rittmanmead.com @rittmanmead 31 •Hive-on-HBase tables have the opposite query characteristics to Hive-on-HDFS tables ‣For small tables, query response-time is fast (HBase cell-level random access) ‣But response time increases (beyond regular HDFS-stored Hive tables) when lots of  data needs to be retrieved and aggregated But … (and there’s always a “But…”) hive> select * from hbase_geog_dest where dest_city = 'San Francisco, CA'; OK JCC San Francisco, CA: China Basin Heliport San Francisco, CA California 12457 SFO San Francisco, CA: San Francisco International San Francisco, CA California 14771 Time taken: 0.171 seconds, Fetched: 2 row(s) hive> select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f > join hbase_geog_origin o on f.orig = o.key > and o.origin_state = 'California' > group by o.origin_airport_name; ... OK 17638Arcata/Eureka, CA: Arcata 9146Bakersfield, CA: Meadows Field 125433Burbank, CA: Bob Hope ... 1653Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field Time taken: 51.757 seconds, Fetched: 27 row(s)

Slide 32

Slide 32 text

[email protected] www.rittmanmead.com @rittmanmead 32 •Either use Impala engine to query the Hive-on-HBase tables as they are ‣Makes overall SELECT operations faster •Or CREATE TABLE … AS SELECT snapshot the Hive-on-HBase tables to parquet ﬁles ‣Takes advantage of Parquet column-storage optimization Partial Solution - Snapshot to Impala + Parquet [oracle@bigdatalite ~]$ impala-shell Starting Impala Shell without Kerberos authentication [bigdatalite.localdomain:21000] > invalidate metadata; Query: invalidate metadata Fetched 0 row(s) in 2.18s [bigdatalite.localdomain:21000] > show tables; Query: show tables +-----------------------------------+ | name | +-----------------------------------+ | access_per_post | | access_per_post_cat_author | | … | | posts | |——————————————————————————————————-+ Fetched 45 row(s) in 0.15s [bigdatalite.localdomain:21000] > select count(*)   from access_per_post; Query: select count(*) from access_per_post +----------+ | count(*) | +----------+ | 343 | +----------+ Fetched 1 row(s) in 2.76s

Slide 33

Slide 33 text

[email protected] www.rittmanmead.com @rittmanmead 33 •But Parquet (and HDFS) have significant limitation for real-time analytics applications ‣Append-only orientation, focus on column-store   makes streaming ingestion harder •Cloudera Kudu aims to combine best of HDFS + HBase ‣Real-time analytics-optimised ‣Supports updates to data ‣Fast ingestion of data ‣Accessed using SQL-style tables  and get/put/update/delete API Cloudera Kudu - Combining Best of HBase and Column-Store

Slide 34

Slide 34 text

[email protected] www.rittmanmead.com @rittmanmead 34 •Kudu storage used with Impala - create tables using Kudu storage handler •Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA Example Impala DDL + DML Commands with Kudu CREATE TABLE `my_first_table` ( `id` BIGINT, `name` STRING ) TBLPROPERTIES( 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', 'kudu.table_name' = 'my_first_table', 'kudu.master_addresses' = 'kudu-master.example.com:7051', 'kudu.key_columns' = 'id' ); INSERT INTO my_first_table VALUES (99, "sarah"); INSERT IGNORE INTO my_first_table VALUES (99, "sarah"); UPDATE my_first_table SET name="bob" where id = 3; DELETE FROM my_first_table WHERE id < 3; DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;

Slide 35

Slide 35 text

and there’s more

Slide 36

Slide 36 text

[email protected] www.rittmanmead.com @rittmanmead 36 •Many Hadoop tools now use in-memory processing (R, Python, Spark etc) •But they all work to their own standards •Considerable overhead in serialising / deserialising data between tools •Apache Arrow standardised how in-memory data is held •Considerably reduced overhead and latency between tools Apache Arrow - Standardising In-Memory Structures

Slide 37

Slide 37 text

[email protected] www.rittmanmead.com @rittmanmead 37 Apache Arrow : Common In-Memory Layer for Hadoop

Slide 38

Slide 38 text

[email protected] www.rittmanmead.com @rittmanmead 38 •Similar concept to Oracle BI Server ‣Logical star scheme for business model ‣Maps to federated data sources ‣Integrates and models data for query ‣Can run alongside Kylin MOLAP server •Part of wider next-gen Hadoop BI stack ‣Apache Zepplin web-based notebook ‣Spark/MR/Tez execution engines ‣Hive and Drill for ETL/ad-hoc SQL •Slowly the BI stack gets built out... Apache Lens : Logical Dimensional Model for Hadoop

Slide 39

Slide 39 text

For data processing, In Oracle we have PL/SQL

Slide 40

Slide 40 text

In SQL Server we have TSQL

Slide 41

Slide 41 text

In Hadoop we have ... Spark

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

[email protected] www.rittmanmead.com @rittmanmead 43 •Another DAG execution engine running on YARN •More mature than TEZ, with richer API and more vendor support •Uses concept of an RDD (Resilient Distributed Dataset) ‣RDDs like tables or Pig relations, but can be cached in-memory ‣Great for in-memory transformations, or iterative/cyclic processes •Spark jobs comprise of a DAG of tasks operating on RDDs •Access through Scala, Python or Java APIs •Related projects include ‣Spark SQL ‣Spark Streaming Apache Spark

Slide 44

Slide 44 text

[email protected] www.rittmanmead.com @rittmanmead 44 •Native support for multiple languages   with identical APIs ‣Python - prototyping, data wrangling ‣Scala - functional programming features ‣Java - lower-level, application integration •Use of closures, iterations, and other   common language constructs to minimize code •Integrated support for distributed +  functional programming •Unified API for batch and streaming Rich Developer Support + Wide Developer Ecosystem scala> val logfile = sc.textFile("logs/access_log") 14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353)   called with curMem=234759, maxMem=309225062 14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2   stored as values to memory (estimated size 75.5 KB, free 294.6 MB) logfile: org.apache.spark.rdd.RDD[String] =   MappedRDD[31] at textFile at :15 scala> logfile.count() 14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1 14/05/12 21:19:06 INFO SparkContext: Starting job: count at :1 ... 14/05/12 21:19:06 INFO SparkContext: Job finished:   count at :18, took 0.192536694 s res7: Long = 154563 scala> val logfile = sc.textFile("logs/access_log").cache scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/")) biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at :17 scala> biapps11g.count() ... 14/05/12 21:28:28 INFO SparkContext: Job finished: count at :20, took 0.387960876 s res9: Long = 403

Slide 45

Slide 45 text

[email protected] www.rittmanmead.com @rittmanmead 45 •Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries •Bring in and federate additional data from JDBC sources •Load, read and save data in Hive, Parquet and other structured tabular formats Spark SQL - SQL within Apache Spark val accessLogsFilteredDF = accessLogs .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*")) .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF() .registerTempTable("accessLogsFiltered") val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*)   as total   FROM accessLogsFiltered a   JOIN posts p ON a.endpoint = p.POST_SLUG   GROUP BY p.POST_TITLE, p.POST_AUTHOR   ORDER BY total DESC LIMIT 10 ") // Persist top ten table for this window to HDFS as parquet file topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet"  , "parquet", SaveMode.Overwrite)

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

[email protected] www.rittmanmead.com @rittmanmead •Apache Drill is another SQL-on-Hadoop project that focus on schema-free data discovery •Inspired by Google Dremel, innovation is querying raw data with schema optional •Automatically infers and detects schema from semi-structured datasets and NoSQL DBs •Join across different silos of data e.g. JSON records, Hive tables and HBase database •Aimed at different use-cases than Hive -   low-latency queries, discovery   (think Endeca vs OBIEE) Apache Drill - Leverage Embedded Metadata in Files

Slide 48

Slide 48 text

[email protected] www.rittmanmead.com @rittmanmead •Most modern datasource formats embed their schema in the data (“schema-on-read”) •Apache Drill makes these as easy to join to traditional datasets as “point me at the data” •Cuts out unnecessary work in defining Hive schemas for data that’s self-describing •Supports joining across files,  databases, NoSQL etc Self-Describing Data - Parquet, AVRO, JSON etc

Slide 49

Slide 49 text

[email protected] www.rittmanmead.com @rittmanmead •Files can exist either on the local filesystem, or on HDFS •Connection to directory or file defined in storage configuration •Can work with CSV, TXT, TSV etc •First row of file can provide schema (column names) Apache Drill and Text Files SELECT * FROM dfs.`/tmp/csv_with_header.csv2`; +-------+------+------+------+ | name | num1 | num2 | num3 | +-------+------+------+------+ | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | +-------+------+------+------+ 7 rows selected (0.12 seconds) SELECT * FROM dfs.`/tmp/csv_no_header.csv`; +------------------------+ | columns | +------------------------+ | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | +------------------------+ 7 rows selected (0.112 seconds)

Slide 50

Slide 50 text

[email protected] www.rittmanmead.com @rittmanmead •JSON (Javascript Object Notation) documents are often used for data interchange •Exports from Twitter and other consumer services •Web service responses and other B2B interfaces •A more lightweight form of XML that is “self-describing” •Handles evolving schemas, and optional attributes •Drill treats each document as a row, and has features to •Flatten nested data (extract elements from arrays) •Generate key/value pairs for loosely structured data Apache Drill and JSON Documents use dfs.iot; show files; select in_reply_to_user_id, text from `all_tweets.json` limit 5; +---------------------+------+ | in_reply_to_user_id | text | +---------------------+------+ | null | BI Forum 2013 in Brighton has now sold-out | | null | "Football has become a numbers game | | null | Just bought Lyndsay Wise’s Book | | null | An Oracle BI "Blast from the Past" | | 14716125 | Dilbert on Agile Programming | +---------------------+------+ 5 rows selected (0.229 seconds) select name, flatten(fillings) as f   from dfs.users.`/donuts.json`   where f.cal < 300;

Slide 51

Slide 51 text

[email protected] www.rittmanmead.com @rittmanmead •Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores) •NoSQL databases (HBase etc) •Parquet files (native storage format - columnar + self describing) Apache Drill and Hive, HBase, Parquet Sources etc USE hbase; SELECT * FROM students; +-------------+-----------------------+-----------------------------------------------------+ | row_key | account | address | +-------------+-----------------------+------------------------------------------------------+ | [B@e6d9eb7 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="} | | [B@2823a2b4 | {"name":"Qm9i"} | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"} | | [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="} | | [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} | +-------------+-----------------------+----------------------------------------------------------------------+ SELECT firstname,lastname FROM   hiveremote.`customers` limit 10;`  +------------+------------+ | firstname | lastname | +------------+------------+ | Essie | Vaill | | Cruz | Roudabush | | Billie | Tinnes | | Zackary | Mockus | | Rosemarie | Fifield | | Bernard | Laboy | | Marianne | Earman | +------------+------------+ SELECT * FROM dfs.`iot_demo/geodata/region.parquet`; +--------------+--------------+-----------------------+ | R_REGIONKEY | R_NAME | R_COMMENT | +--------------+--------------+-----------------------+ | 0 | AFRICA | lar deposits. blithe | | 1 | AMERICA | hs use ironic, even | | 2 | ASIA | ges. thinly even pin | | 3 | EUROPE | ly final courts cajo | | 4 | MIDDLE EAST | uickly special accou | +--------------+--------------+-----------------------+

Slide 52

Slide 52 text

[email protected] www.rittmanmead.com @rittmanmead •Drill developed for real-time, ad-hoc data exploration with schema discovery on-the-fly •Individual analysts exploring new datasets, leveraging corporate metadata/data to help •Hive is more about large-scale, centrally curated set-based big data access •Drill models conceptually as JSON, vs. Hive’s tabular approach •Drill introspects schema from whatever it connects to, vs. formal modeling in Hive Apache Drill vs. Apache Hive Interactive Queries  (Data Discovery, Tableau/VA) Reporting Queries  (Canned Reports, OBIEE) ETL  (ODI, Scripting, Informatica) Apache Drill Apache Hive Interactive Queries 100ms - 3mins Reporting Queries 3mins - 20mins ETL & Batch Queries 20mins - hours

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

www.rittmanmead.com

Slide 55

Slide 55 text

MARK RITTMAN, COO, RITTMAN MEAD BUDAPEST DATA FORUM, JUNE 2016 SQL-ON-HADOOP FOR ANALYTICS + BI:   WHAT ARE MY OPTIONS, WHAT’S THE FUTURE?