Qrious about Insights -- Big Data in the Real World
Presentation for the Data Science Research Group Workshop on 7 February 2017 at AUT. The talk centres on the problems in Big Data analytics, the tools for overcoming them, and the way Qrious uses these tools to build solutions.
Outline
1 The Problem
2 Examples
3 The Solution
4 Tools of the Trade
5 Boxing up a Solution
6 Flotsam and Jetsam
Guy Kloss | Big Data in the Real World
Who/What is Qrious?
We help New Zealand businesses and public sector organisations create value and solve their most pressing business problems by turning data into actionable insight.
Who/What is Qrious?
- Backed by Spark
- Approx. 60 employees
- Offices in Auckland & Wellington
- Substantial investment across Data, Platform & People
- Built from the ground up (new generation technology and working principles)
- One of the largest Data Science teams in the country, with > 80% qualified to Masters & PhD level and over 60 years of combined experience
- NZ's leading data analytics specialist by 2017
Our Capabilities
- Advanced analytics
- Location insights
- Big Data platforms
- Consulting services
- BI & Warehousing
Who am I?
- Chemical Engineer (Masters)
- Rocket Scientist (German Aerospace Centre)
- Computer Scientist (PhD)
- Former lecturer (AUT)
- Lead Software Developer and Head Crypto Geek @ Mega
- Enterprise Architect at Qrious
- Dad, baseballer, diver, ... general geek!
1 The Problem
An exponentially growing data world
[Figure: Relative Speeds. Source: http://www.cs.cmu.edu/~amarp/cpu-io-gap]
Size Does Matter!
- Access/processing beyond a single machine (RAM, disk, CPU)
- Expensive data transfers at volume (latency, throughput)
Storage Issues
- Storage, access, index, find
- Transfer, manage, prevent data loss
Correlating ... co-relating ... mashing ...
- Not a single-record problem, but an m:n problem
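To make the m:n point concrete, here is a minimal sketch of correlating two small data sets on a shared key. The data, the key names and the `correlate` helper are hypothetical and not taken from the talk; the point is only that every left record pairs with every matching right record, so the output grows with the product of the group sizes.

```python
from collections import defaultdict

def correlate(left, right):
    """Pair every left record with every right record sharing the same key."""
    by_key = defaultdict(list)
    for key, value in right:
        by_key[key].append(value)
    return [(key, lval, rval) for key, lval in left for rval in by_key[key]]

# Hypothetical data: device sightings and points of interest per map cell.
sightings = [("cell_a", "device_1"), ("cell_a", "device_2"), ("cell_b", "device_3")]
places = [("cell_a", "cafe"), ("cell_a", "station"), ("cell_a", "mall")]

pairs = correlate(sightings, places)
print(len(pairs))  # 2 sightings x 3 places in cell_a = 6 pairs
```

With m records on one side and n on the other for the same key, the result has m x n rows, which is why such workloads outgrow a single machine quickly.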
Beyond Exponential
- Problems are between exponential and hyperexponential
→ Enabling data processing in an exponential world
2 Examples
Number of Records
- > 1 trillion (10^9) records: Spark’s location-based data set
- Anonymised for privacy (on ingest)
- Fully encrypted (at rest and in transport)
- Continuous/stream ingestion
- Normalisation and segmentation on data set
- Correlating with external data set
→ Finding insights in this “hay mountain”
Data Volume
- 100s of TB to PB of “Data Lakes”
- Not just a backup/data grave
- Fully encrypted (at rest and in transport)
- Includes data querying and processing capability
→ Capability to “store everything” (every thing and kind)
3 The Solution
Divide and Conquer
- Massively parallel processing: MPP
- Parallelise: Map-Reduce
- Pipelines: Stream processing
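The Map-Reduce idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how Hadoop or any other framework implements it: the partitions are hypothetical in-memory lists, and a thread pool stands in for the nodes of a cluster. Each partition is mapped to a partial result independently, and the partial results are then reduced into one answer.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

def word_count(partitions, workers=4):
    """Run the map step per partition in parallel, then reduce the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())

# Hypothetical input, already split into two partitions.
partitions = [
    ["big data in the real world"],
    ["data lakes store big data"],
]
totals = word_count(partitions)
print(totals["data"])  # 3
print(totals["big"])   # 2
```

Because the map step never looks outside its own partition, adding partitions and workers scales the work out without changing the code.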
The Right Tools
- Don’t re-invent the wheel
- Use existing high-performing tools where possible
- Available high-productivity frameworks, making use of high-level languages
- The right tool for the type of data
- Use the Source, Luke! (Leverage open source based tooling with a community)
The Right Data Organisation
- Row vs. columnar storage
→ For analytics often better in columnar format
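A small sketch of the difference between the two layouts, using hypothetical sales records held in memory (real columnar stores add compression and on-disk encodings that are not shown here). An aggregate over one field has to walk every full record in the row layout, but only one list in the columnar layout.

```python
# Hypothetical records in row layout: one dict per record.
rows = [
    {"id": 1, "region": "Auckland", "amount": 120.0},
    {"id": 2, "region": "Wellington", "amount": 80.0},
    {"id": 3, "region": "Auckland", "amount": 50.0},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns

columns = to_columns(rows)

# Row layout: every record is visited in full just to read one field.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is scanned.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 250.0 250.0
```

Analytical queries typically touch few columns across many rows, which is the access pattern the columnar layout favours.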
In, Out, Cha-Cha-Cha
- Ingest data from (legacy, external) source systems
→ ETL: Extract, Transform, Load
- Make sure the rhythm fits (no missing “Out”)
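The three ETL steps can be sketched with the Python standard library alone. The CSV content, the `sales` table and the function names are hypothetical; an in-memory SQLite database stands in for the target system. The final query is the "Out" step: loaded data has to be retrievable again.

```python
import csv
import io
import sqlite3

# Hypothetical export from a legacy source system.
SOURCE_CSV = """customer,amount,currency
alice, 120.50 ,NZD
bob,not-a-number,NZD
carol,80,NZD
"""

def extract(text):
    """Extract: parse the raw CSV text into dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: clean and type the fields, setting aside unparseable rows."""
    clean, rejected = [], []
    for record in records:
        try:
            clean.append((record["customer"].strip(), float(record["amount"])))
        except ValueError:
            rejected.append(record)
    return clean, rejected

def load(connection, rows):
    """Load: write the cleaned rows into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
clean, rejected = transform(extract(SOURCE_CSV))
load(connection, clean)

# The "Out" step: query the loaded data back.
total = connection.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total, len(rejected))  # 200.5 1
```

Keeping rejected rows instead of dropping them silently makes it possible to audit what the transform step filtered out.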
4 Tools of the Trade
Hadoop
- Hadoop and distributions
- Processing tools for relational, streaming, batch, graph, text, search, ...
- Allocates cluster resources dynamically
- Data distributed (with redundancy), so compute allocated where data is
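The last point, moving compute to the data, can be illustrated with a toy scheduler. This is not Hadoop's scheduler or API; the block names, node names and the `schedule` function are hypothetical. Each block is stored on several nodes, and a task is placed on a node holding a replica whenever one has a free slot.

```python
def schedule(block_locations, free_slots):
    """Assign each block to a node that stores a replica, if one has a free slot.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of tasks the node can still accept
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)
    assignment = {}
    for block, replicas in block_locations.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder is free: fall back to the node with most free
            # slots, which then has to fetch the block over the network.
            node, is_local = max(slots, key=slots.get), False
            if slots[node] <= 0:
                raise RuntimeError("no free slots left in the cluster")
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment

# Hypothetical cluster: three blocks, each replicated on two of three nodes.
blocks = {
    "block_1": ["node_a", "node_b"],
    "block_2": ["node_a", "node_c"],
    "block_3": ["node_a", "node_b"],
}
assignment = schedule(blocks, {"node_a": 1, "node_b": 1, "node_c": 1})
print(assignment)
```

Redundant storage gives the scheduler a choice of nodes per block, so all three tasks here can run locally even though every block has a replica on the same busy node.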
Hadoop Distributions
- Many Hadoop distributions: similar to Linux distributions
- Cloudera
- Partnership with Qrious
- “Bronze” partner
- Ambitions to become “Silver” partner and MSP (managed service provider)
MPP Databases
- DB for massively parallel processing (MPP)
- Greenplum database and forks (based on PostgreSQL)
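A rough sketch of how an MPP database spreads work: rows are placed on segments by hashing a distribution key, each segment aggregates its own rows, and the partial results are combined. This is a simplified model in plain Python, not Greenplum code; the rows, the choice of CRC32 as hash and the function names are assumptions made for illustration.

```python
import zlib
from collections import defaultdict

def distribute(rows, key, segments):
    """Place each row on a segment chosen by hashing its distribution key."""
    placed = defaultdict(list)
    for row in rows:
        segment = zlib.crc32(str(row[key]).encode()) % segments
        placed[segment].append(row)
    return placed

def parallel_sum(placed, group_by, value):
    """Aggregate per segment first, then combine the partial results."""
    partials = []
    for segment_rows in placed.values():
        partial = defaultdict(float)
        for row in segment_rows:
            partial[row[group_by]] += row[value]
        partials.append(partial)
    combined = defaultdict(float)
    for partial in partials:
        for group, subtotal in partial.items():
            combined[group] += subtotal
    return dict(combined)

# Hypothetical table, distributed by "id" across four segments.
rows = [
    {"id": 1, "region": "Auckland", "amount": 120.0},
    {"id": 2, "region": "Wellington", "amount": 80.0},
    {"id": 3, "region": "Auckland", "amount": 50.0},
]
placed = distribute(rows, key="id", segments=4)
result = parallel_sum(placed, group_by="region", value="amount")
print(result)
```

The result does not depend on which segment a row lands on, which is what allows the per-segment work to run in parallel.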
Generic and Specialised DBs
- Generic RDBMS (where useful)
- NoSQL
- Graph DB
- Other columnar species
5 Boxing up a Solution
Delivering a Suitable Solution
Includes:
- System management
- Connectivity
- Application logic
- Services
- Yummy add-ons
System Management Framework
- Security
  - Dedicated sub-networks with specific firewall rules
  - External firewalls
  - User and credentials management
  - Log collector
  - Other security tools ...
- System access
  - VPN
  - Remote desktop services
Application Logic
- Platform-as-a-Service (PaaS)
- Huge benefits of containerising application logic (using Docker)
→ Much shorter delivery cycles
- APIs, Micro-Services
- Orchestration of Big Data analysis
Services
- Solutioning, build
- Analytics and development
- Operation and maintenance
Bonus Points for ...
- Provenance (reproducibility, auditability, compliance)
- AI and ML
- Blockchain (non-repudiation, trust, “smart contracts”, identity management, federation, ...)
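One building block shared by provenance logs and blockchains is a hash chain, sketched below. It shows tamper evidence only: each entry commits to the hash of the previous one, so changing an earlier entry invalidates everything after it. Consensus, signatures (needed for non-repudiation) and smart contracts are out of scope, and the entry layout and function names are hypothetical.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # Placeholder "previous hash" for the first entry.

def entry_hash(previous, payload):
    """Hash an entry's payload together with the hash of its predecessor."""
    body = json.dumps({"previous": previous, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

def append_entry(chain, payload):
    """Append a new entry that commits to the current end of the chain."""
    previous = chain[-1]["hash"] if chain else GENESIS
    chain.append({"previous": previous, "payload": payload,
                  "hash": entry_hash(previous, payload)})

def verify(chain):
    """Recompute every hash and check that each entry links to its predecessor."""
    previous = GENESIS
    for entry in chain:
        if entry["previous"] != previous:
            return False
        if entry_hash(previous, entry["payload"]) != entry["hash"]:
            return False
        previous = entry["hash"]
    return True

# Hypothetical provenance log for a data processing run.
log = []
append_entry(log, {"step": "ingest", "source": "sales.csv"})
append_entry(log, {"step": "transform", "rows": 2})
append_entry(log, {"step": "load", "table": "sales"})
print(verify(log))  # True

tampered = copy.deepcopy(log)
tampered[1]["payload"]["rows"] = 3
print(verify(tampered))  # False
```

For auditability it is enough to publish or store the latest hash somewhere independent; anyone holding the log can then check that it has not been rewritten.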
6 Flotsam and Jetsam
In the Qrious Pipeline
- Make Big Data a commodity: Don’t buy, pay for what you need!
→ Big-Data-as-a-Service – BDPaaS
- Sliced, diced and configured to your needs
- Straight on bare metal, not on VMs (as with most cloud hosters)
Maximising the Job Market
- What skills do you need?
- RDBMS? SAS? NoSQL DBs?
- Maybe Hadoop is a good answer?
Questions?
Parallelise!
Guy Kloss [email protected]
Just a humble hair-dryer from the 30s: “One of the first machines used for permanent wave hairstyling back in the 1920’s and 1930’s.”
Dark Roasted Blend: http://www.darkroastedblend.com/2007/05/mystery-devices-issue-2.html