Slide 1

Slide 1 text

CLICKSTREAM ANALYSIS WITH SPARK – UNDERSTANDING VISITORS IN REAL-TIME Dr. Josef Adersberger QAware GmbH, Germany

Slide 2

Slide 2 text

THE CHALLENGE

Slide 3

Slide 3 text

One Kettle to Rule ‘em All Web Tracking Ad Tracking ERP CRM

Slide 4

Slide 4 text

One Kettle to Rule ‘em All Retention Reach Monetarization steer … § Campaigns § Offers § Contents

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

THE CONCEPTS by Randy Paulino

Slide 7

Slide 7 text

The First Sketch (= real-time) Tableau

Slide 8

Slide 8 text

User Journey Analysis C V VT VT VT C X C V V V V V V V V C V V C V V V VT VT V V V VT C V X Event stream: User journeys: Web / Ad tracking KPIs: § Unique users § Conversions § Ad costs / conversion value § … V X VT C Click View View Time Conversion

Slide 9

Slide 9 text

THE ARCHITECTURE

Slide 10

Slide 10 text

„Larry & Friends“ Architecture Collector Aggregation SQL DB Runs not well for more than 1 TB data in terms of ingestion speed, query time and optimization efforts

Slide 11

Slide 11 text

by adweek.com Nope. Sorry, no Big Data.

Slide 12

Slide 12 text

„Hadoop & Friends“ Architecture Collector Batch Processor [Hadoop] Collector Event Data Lake Batch Processor Analytics DB JSON Stream Aggregation takes too long Cumbersome programming model (can be solved with pig, cascading et al.) Not interactive enough

Slide 13

Slide 13 text

Nope. Too sluggish.

Slide 14

Slide 14 text

κ-Architecture Collector Stream Processor Analytics DB Persistence JSON Stream Cumbersome programming model Over-engineered: We only need 15min real-time ;-) Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.

Slide 15

Slide 15 text

λ-Architecture Collector Event Processor Event Data Lake Batch Processor Analytics DB Speed Layer Batch Layer JSON Stream Cumbersome programming model Complex architecture Redundant logic

Slide 16

Slide 16 text

Feels Over-Engineered… http://www.brainlazy.com/article/random-nonsense/over-engineered

Slide 17

Slide 17 text

The Final Architecture* *) Maybe called μ-architecture one day ;-)

Slide 18

Slide 18 text

Functional Architecture Strange Events Ingestion Raw Event Stream Collection Events Processing Analytics Warehouse Fact Entries Atomic Event Frames Data Lake Master Data Integration § Buffers load peeks § Ensures message delivery (fire & forget for client) § Create user journeys and unique user sets § Enrich dimensions § Aggregate events to KPIs § Ability to replay for schema evolution § The representation of truth § Multidimensional data model § Interactive queries for actions in realtime and data exploration § Eternal memory for all events (even strange ones) § One schema per event type. Time partitioned. class Analytics Model ´ factª WebFact ´ dimensionª Zeit ´ dimensionª Kampagne Jahr Quartal Monat Woche Tag Stunde Minute Kunde + Land: String Partner ´ dimensionª Tracking Tracking Group SensorTag + Typ: SensorTagType Platzierung + Format: ImageSize + Kostenmodell: KostenmodellArt Werbemittel + AdGroup: String + Format: ImageSize + Grˆ fle: KiloBytes + LandingPage: URL + Motif: URL Kampagne ´ dimensionª Client Kategorie Dev ice + Bezeichner: String + Hersteller: String + Typ: String Brow ser + Typ: String + Version: int ´ dimensionª Ausspielort Land Region Stadt ´ dimensionª Kanal Kanal ´ dimensionª Vermarktung ´ enumerationª SensorTagType ORDER_TAG MASTER_TAG CUSTOM_TAG Betriebssystem + Typ: String + Version: Version ? Dimension: Unabh‰ ng ig es Pr‰ dikat auf Metriken bei der Analyse ("kann isoliert dar¸ ber nachdenken / isoliert dazu Analysen fahren") ? H ierarchie: Sub-Pr‰ dikat auf Metriken. Erzeug t mehr als eine (zueinander diskunkte) Teilmeng en der Metriken. Entspricht den g ‰ ng ig en Drill-Down-Pfaden in den Reports bzw. den Batch-Ag g reg ate-Up-Pfaden in der Ag g reg ationslog ik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne". ? Asssoziation: Nicht verwendet. Separates Stammdatenmodell. ? Attribut: Ermˆ g licht eine weitere (querschneidende) Einschr‰ nkung der Metrikmeng e erg ‰ nzend zu den Hierarchien. Domain Website Tracking Site Vermarkter Auslieferungs- Domain Referral ´ enumerati... KostenmodellArt CPC CPM CPO CPA ´ abstractª DimensionValue + id: int + name: String + sourceId: String WebsiteFact + Bounces: int + Verweildauer: float + Visits: int BasicAdFact + Clicks: int + Sichtbare Views: int + Validierte Clicks: int + View (angefragt): int + View (ausgeliefert): int + View (gemessen): int ´ dimensionª Produkt Shop Produkt + Produktkategorie: String ´ dimensionª Zeitfenster Letzte X Tage ´ dimensionª User User Segment ´ dimensionª Order OrderStatus + Status: OrderStatus ´ enumerationª OrderStatus IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG UniquesFact + Unique Clicks: int + Unique Users: int + Unique Views: int AdCostFact + CPC: int + Kosten: float Conv ersionFact + PC: int + PR: int + PV: int + Umsatz PC: float + Umsatz PR: float + Umsatz PV: float AdVisibilityFact + Sichtbarkeitsdauer: float Activ atedOrderFact + Orders: int + Umsatz: float TrackingFact + Orders: int + Page Impressions: int + Umsatz: float X = {7, 14, 28, 30} § Fault tolerant message handling § Event handling: Apply schema, time-partitioning, De-dup, sanity checks, pre-aggregation, filtering, fraud detection § Tolerates delayed events § High throughput, moderate latency (~ 1min)

Slide 19

Slide 19 text

Series Connection of Streaming and Batching - all based on Spark. Ingestion Raw Event Stream Collection Event Data Lake Processing Analytics Warehouse Fact Entries SQL Interface Atomic Event Frames § Cool programming model § Uniform dev&ops § Simple solution § High compression ratio due to column-oriented storage § High scan speed § Cool programming model § Uniform dev&ops § High performance § Interface to R out-of-the-box § Useful libs: MLlib, GraphX, NLP, … § Good connectivity (JDBC, ODBC, …) § Interactive queries § Uniform ops § Can easily be replaced due to Hive Metastore § Obvious choice for cloud-scale messaging § Way the best throughput and scalability of all evaluated alternatives

Slide 20

Slide 20 text

LESSONS LEARNED by: http://hochmeister-alpin.at

Slide 21

Slide 21 text

Technology Mapping https://github.com/qaware/big-data-landscape User Interface Data Lake Data Warehouse Ingestion Processing Data Science Interactive Analysis Reporting & Dashboards Data Sources Analytics Micro Analytics Services instead of reporting servers. Charting Libraries: Dashboards: Analytics Frontends Algorithm Libraries Structured Data Lake: The eternal memory. Efficient data serialization formats: § Integated compression § Column-oriented storage § Predicate pushdown Distributed Filesystem or NoSQL DB Data Workflows ETL Jobs Massive Parallelization Pig Open Studio Data Logicstics Stream Processing NewSQL: SQL meets NoSQL. Polyglott Persistence Index Machines: Fast aggregation and search. In-Memory Databases: Fast access. Time Series Databases Atlas

Slide 22

Slide 22 text

Polyglott Analytics Data Lake Analytics Warehouse SQL lane R lane Timeseries lane Reporting Data Exploration Data Science

Slide 23

Slide 23 text

Micro Analytics Services Microservice Dashboard Microservice …

Slide 24

Slide 24 text

No Retention Paranoia Data Lake Analytics Warehouse § Eternal memory § Close to raw events § Allows replays and refills into warehouse Aggressive forgetting with clearly defined retention policy per aggregation level like: § 15min:30d § 1h:4m § … Events Strange Events

Slide 25

Slide 25 text

Continuous Tuning Ingestion Raw Event Stream Collection Event Data Lake Processing Analytics Warehouse Fact Entries SQL Interface Atomic Event Frames Load Generator Throughput & latency probes

Slide 26

Slide 26 text

In Numbers Overall dev effort until the first release: 250 person days Dimensions: 10 KPIs: 26 Integrated 3rd party systems: 7 Inbound data volume per day: 80GB New data in DWH per day: 2GB Total price of cheapest cluster which is able to handle production load:

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

THANK YOU. @adersberger josef.adersberger@qaware.de Download slides

Slide 29

Slide 29 text

Further Best Practices • Know and Monitor your Execution Plan

Slide 30

Slide 30 text

Bonus Topic: Roadmap Spark Kafka HDFS IaaS Simplify Ops with Mesos Faster aggregation & easier updates with Spark-on-Solr http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html

Slide 31

Slide 31 text

Bonus Topic: Smart Aggregation Ingestion Event Data Lake Processing Analytics Warehouse Fact Entries Analytics Atomic Event Frames 1 2 3

Slide 32

Slide 32 text

Architecture follows requirements class Analytics Model ´ factª WebFact ´ dimensionª Zeit ´ dimensionª Kampagne Jahr Quartal Monat Woche Tag Stunde Minute Kunde + Land: String Partner ´ dimensionª Tracking Tracking Group SensorTag + Typ: SensorTagType Platzierung + Format: ImageSize + Kostenmodell: KostenmodellArt Werbemittel + AdGroup: String + Format: ImageSize + Grˆ fle: KiloBytes + LandingPage: URL + Motif: URL Kampagne ´ dimensionª Client Kategorie Dev ice + Bezeichner: String + Hersteller: String + Typ: String Browser + Typ: String + Version: int ´ dimensionª Ausspielort Land Region Stadt ´ dimensionª Kanal Kanal ´ dimensionª Vermarktung ´ enumerationª SensorTagType ORDER_TAG MASTER_TAG CUSTOM_TAG Betriebssystem + Typ: String + Version: Version ? Dimension: Unabh‰ ngiges Pr‰ dikat auf Metriken bei der Analyse ("kann isoliert dar¸ ber nachdenken / isoliert dazu Analysen fahren") ? H ierarchie: Sub-Pr‰ dikat auf Metriken. Erzeugt mehr als eine (zueinander diskunkte) Teilmengen der Metriken. Entspricht den g‰ ngigen Drill-Down-Pfaden in den Reports bzw. den Batch-Aggregate-Up-Pfaden in der Aggregationslogik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne". ? Asssoziation: Nicht verwendet. Separates Stammdatenmodell. ? Attribut: Ermˆ glicht eine weitere (querschneidende) Einschr‰ nkung der Metrikmenge erg‰ nzend zu den Hierarchien. Domain Website Tracking Site Vermarkter Auslieferungs- Domain Referral ´ enumerati... KostenmodellArt CPC CPM CPO CPA ´ abstractª DimensionValue + id: int + name: String + sourceId: String WebsiteFact + Bounces: int + Verweildauer: float + Visits: int BasicAdFact + Clicks: int + Sichtbare Views: int + Validierte Clicks: int + View (angefragt): int + View (ausgeliefert): int + View (gemessen): int ´ dimensionª Produkt Shop Produkt + Produktkategorie: String ´ dimensionª Zeitfenster Letzte X Tage ´ dimensionª User User Segment ´ dimensionª Order OrderStatus + Status: OrderStatus ´ enumerationª OrderStatus IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG UniquesFact + Unique Clicks: int + Unique Users: int + Unique Views: int AdCostFact + CPC: int + Kosten: float Conv ersionFact + PC: int + PR: int + PV: int + Umsatz PC: float + Umsatz PR: float + Umsatz PV: float AdVisibilityFact + Sichtbarkeitsdauer: float Activ atedOrderFact + Orders: int + Umsatz: float TrackingFact + Orders: int + Page Impressions: int + Umsatz: float X = {7, 14, 28, 30} act Processing Processing Website Order Activation Ad Cost Conversion Uniques + Overlap Ad Visibility Basic Ads Tracking Dimensionsv erzeichnis aktualisieren Page Impressions und Orders zählen Aggregation TrackingFact Ad Impressions zählen Aggregation BasicAdFact CTR und Sichtbarkeitsrate berechnen Inferenz der Sichtbarkeitsdauer Aggregation AdVisibilityFact Dimensionsraum aufspannen Pro Vektor die Ev ents und die dort enthaltenen UserIds und Interaktionsarten ermitteln Menge der UserIds pro Interaktionsart erzeugen und deren Mächtigkeit bestimmen UniquesFact User Journeys erstellen (bzw. LV/LC pro User) Attribution v on Conv ersion auf Basis v orkonfiguriertem Conv ersion Modell Conv ersions zählen (PV, PC, PR) Warenkorbwert ermitteln bei Order-Conv ersion (PV, PC, PR) Aggregation Conv ersionFact Kosten pro Ev ent ermitteln Aggregation Nicht zuweisbare Kosten ermitteln CostFact Inferenz der Visits Visits zählen Bounces zählen Analyse der Verweildauer Aggregation WebsiteFact Order Status ermitteln Anzahl der nicht getrackten Orders ermitteln Aggregation Activ atedOrderFact :Analytics Warehouse :Event Data Lake «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» Data Processing Workflow Multidimensional Data Model

Slide 33

Slide 33 text

Sample results Geolocated and gender- specific conversions. Frequency of visits Performance of an ad campaign