Rocky Road to Big Data analytics

Speakers: Jonas Jarutis, Saulius Grigaliūnas

Agenda
• Analytics @ Vinted in the pre-Hadoop era
• Getting data into Hadoop
• Analytics with Hadoop
• Concluding remarks

Make secondhand the first choice worldwide
• 8 countries
• 10 million members
• 11 000 new members / day
• 20+ million listed items
• 90 new items / minute

Data at Vinted
• Disclaimer: our data is probably NOT bigger than your data
• Tens of terabytes stored
• 50+ gigabytes ingested per day
• Up to a billion user tracking events per day

Analytics @ Vinted circa March 2014
• Reports (the screenshot on the slide shows fake data)

Pre-Hadoop analytics
• Primary data store: MySQL, 8 markets
• Queries on:
  • Operational database structure
  • Denormalised tables

Language for writing reports
• Custom Ruby on Rails application
• Domain specific language for report building

    class ListingsReport < SqlPivotReport
      metrics listings: 'count(distinct i.id)',
              listers: 'count(distinct i.lister_id)'
      dimensions brand: 'brand_title'
      max_dimension_depth 3
      from 'items i'
      period_field 'i.created_at'
    end

→

    SELECT count(distinct i.id) listings,
           count(distinct i.lister_id) listers,
           brand_title
    FROM items i
    WHERE i.created_at BETWEEN .. AND ..
    GROUP BY brand_title
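How such a DSL might turn declarations into SQL can be sketched in a few lines of plain Ruby. This is only an illustration under assumptions: `SketchPivotReport`, `ListingsSketch` and `to_sql` are hypothetical names, `max_dimension_depth` is left out, and the real `SqlPivotReport` is not shown in the deck.

```ruby
# Hypothetical sketch of a report DSL base class; not Vinted's SqlPivotReport.
class SketchPivotReport
  class << self
    attr_reader :metric_defs, :dimension_defs, :source, :period_column

    # Each DSL keyword simply records its argument on the report class.
    def metrics(defs)
      @metric_defs = defs
    end

    def dimensions(defs)
      @dimension_defs = defs
    end

    def from(table)
      @source = table
    end

    def period_field(column)
      @period_column = column
    end

    # Assemble the SELECT from the recorded declarations.
    # Dates are interpolated for brevity; real code should bind parameters.
    def to_sql(from_date, to_date)
      columns = metric_defs.map { |name, expr| "#{expr} #{name}" } +
                dimension_defs.values
      "SELECT #{columns.join(', ')} FROM #{source} " \
        "WHERE #{period_column} BETWEEN '#{from_date}' AND '#{to_date}' " \
        "GROUP BY #{dimension_defs.values.join(', ')}"
    end
  end
end

class ListingsSketch < SketchPivotReport
  metrics listings: 'count(distinct i.id)',
          listers: 'count(distinct i.lister_id)'
  dimensions brand: 'brand_title'
  from 'items i'
  period_field 'i.created_at'
end

puts ListingsSketch.to_sql('2014-03-01', '2014-03-31')
```

Keeping the declarations as class-level data is what makes new reports quick to write: a report is just a handful of metric and dimension definitions.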

Tracking user events
• MySQL as a tracking event store

    CREATE TABLE `tracking_events`(
      `id` int(11) auto_increment,
      `event_type` int,
      `platform_id` int,
      `user_id` int,
      `created_at` datetime)

    CREATE TABLE `tracking_event_extras`(
      `id` int,
      `tracking_event_id` int,
      `extra1` int,
      `extra2` int)

+

    class TrackingEvent < ActiveRecord::Base
      EVENT_TYPES = {
        'session.start' => 1,
        'session.login' => 2,
        'visit.start' => 10,
        'favourite.item.user_msg' => 50,
        # ... around 500 of these ...
      }
    end

Intensive event tracking
• Item impressions, item views, catalog views
• Amazon S3 + Amazon Redshift
• Slow to load the cluster
• Expensive to keep up all the time

Pros and Cons
• + Easy and fast to write new reports
• - Reports run slowly
• - Impacts operational database and the end user
• - Cannot track and cheaply analyse a lot of data
• - Hard to reuse dimension definitions
• - Additional layer of knowledge (e.g. event type IDs)

Business requirements
• Fast insights
• Rich reporting: OLAP style slicing, funnels, cohorts, retention tracking
• Should not affect product development speed
• Extendable with possibility to contribute back to the product
  • Collaborative filtering
  • Improving search results and feed

Requirements for a new system
• Scalable
• Fast to query
• Testable
• Easily deployable
• Simple

Hadoop stack
• Cloudera's CDH distribution
• Components: Hive, Impala, Oozie, Sqoop, Hue
• Managed by Chef
• 10-40 servers
• 4 engineers
• GitHub for code changes

Getting data into Hadoop
• Data from operational database imported with Apache Sqoop
• 900 GB daily
• 720 tables from 8 MySQL replicas
• Lots of effort to tame Sqoop
• Streaming MySQL replication service in pre-production state

Tracking user data 1: Schema registry
• Avro for serialisation. JSON, Hive table schemas
• Event schema store service

Tracking user data 2: Kafka
• Kafka: high throughput persistent commit log
• Distributed: partitions messages across multiple nodes
• Reliable: messages replicated across multiple nodes
• Persistent: all messages are persisted to disk
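The "partitions messages across multiple nodes" point can be illustrated with a toy model in plain Ruby. This is a sketch under assumptions, not the Kafka client API: `partition_for` and `append_all` are hypothetical helpers, CRC32 stands in for the hash (Kafka's own partitioner uses a different one), and replication is not modelled.

```ruby
require 'zlib'

# Toy model of key-based partitioning, not the Kafka client API.
# CRC32 stands in for the hash; Kafka's own partitioner uses a different one.
def partition_for(key, partition_count)
  Zlib.crc32(key) % partition_count
end

# Append each event to the log of the partition its key maps to,
# so all events for one key stay in order within one partition.
def append_all(events, partition_count)
  logs = Array.new(partition_count) { [] }
  events.each do |event|
    logs[partition_for(event[:key], partition_count)] << event
  end
  logs
end

events = [
  { key: 'user-1', type: 'session.start' },
  { key: 'user-2', type: 'session.start' },
  { key: 'user-1', type: 'visit.start' }
]
logs = append_all(events, 4)
```

Because the partition depends only on the key, ordering is preserved per key while the overall load is spread across partitions.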

Tracking user data 3: Camus
• Kafka to HDFS bridge
• Map-Reduce job
• Extended with additional ETL steps via Cloudera Kite Morphlines
• Camus sweeper: periodic small file consolidation

Stop I: Use a faster SQL engine

Hive + Impala
• 1bn row scan
  • 24s in Impala
  • 46s in Hive
• Feature-wise similar to MySQL (at the end of 2014)
• Prepare denormalised fact tables with Hive
• Aggregate with Impala at runtime
• We can crunch all our data now!

Data transformation process
[Diagram: raw data in HDFS → fact and dimension tables (star schema) → join in Hive → denormalised fact tables → Impala → aggregated data]

Processed data sample

However…
• We had severe stability issues with Impala
• Limited functionality at that time
• Hard to extend with User Defined Functions (UDFs)
• UDFs require separate code base
• SQL is hard to test
• Impala can't read tables with complex types (Enum, Array, etc.)
• String concatenation mess to arrive at the final SQL code

Stop II

Data transformations in Scalding
• A Pipe in Scalding is a collection of transformations
• Read data, pass to a Pipe, write output
• No UDFs, just plain Scala functions
• Unit tests (what is a unit of data transformation?)
• Can run locally

    def firstVisit(viewScreenPipe: TypedPipe[ViewScreen]) = {
      viewScreenPipe
        .groupBy { v => (v.portal, v.anon_id) }
        .sortWithTake(1)(_.time < _.time)
        .values
        .flatten
    }
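The logic of `firstVisit` (group by portal and anonymous ID, keep the earliest event per group) can be mirrored in plain Ruby to show what the "unit of data transformation" is. This is a sketch, not Scalding: `first_visit` is a hypothetical helper, the sample rows follow the deck's unit test data, and `screen` is an assumed name for the fourth `ViewScreen` field, which the deck does not name.

```ruby
# Plain-Ruby mirror of firstVisit; field names follow the slide except
# `screen`, which is a hypothetical name for the fourth ViewScreen field.
ViewScreen = Struct.new(:portal, :anon_id, :time, :screen)

# Group by (portal, anon_id) and keep the earliest event of each group.
def first_visit(events)
  events
    .group_by { |v| [v.portal, v.anon_id] }
    .values
    .map { |group| group.min_by(&:time) }
end

input = [
  ViewScreen.new('a', 'a', 10, 'a'),
  ViewScreen.new('a', 'a', 5, 'b'),
  ViewScreen.new('a', 'a', 1, 'b'),
  ViewScreen.new('a', 'b', 2, 'b'),
  ViewScreen.new('a', 'b', 5, 'b')
]

# One expected row per (portal, anon_id) pair: the one with the lowest time.
expected = [input[2], input[3]]
result = first_visit(input)
```

A transformation written as a function from a collection to a collection can be checked against a handful of in-memory rows, with no cluster involved.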

We have tests!

    // Mock data
    val in = List(
      ViewScreen("a", "a", 10L, "a"),
      ViewScreen("a", "a", 5L, "b"),
      ViewScreen("a", "a", 1L, "b"),
      ViewScreen("a", "b", 2L, "b"),
      ViewScreen("a", "b", 5L, "b"))

    // Expected data
    val firstEvent = List(in(2), in(3))

    // Unit test
    "An UserAgentsJob" should {
      "find first event" in {
        Operations.firstVisit(TypedPipe.from(in))
          .toIterableExecution
          .waitFor(Config.default, Local(false))
          .get.toList should contain theSameElementsAs firstEvent
      }
    }

However…
• Getting anything but CSV files into Scalding was hard
• Poor documentation
• Hard to debug
• Slow (ran on Map-Reduce)
• Steep learning curve

Final Destination +

What is Spark?
• General computational framework
• Large community
• Readable source code
• Good documentation, examples
• Speed is in between Hive and Impala
• DataFrames: convenience of SQL but easier to test

Data transformation process
[Diagram: raw data in HDFS and external data (currency rates, user agents, …) → fact and dimension tables (star schema) → join with Spark SQL (DataFrames API) → denormalised fact tables → Spark + Algebird → aggregated data → serving layer]

Data input
• Spark has data readers for most of the usual formats in the Hadoop ecosystem
• Connects directly to RDBMS through JDBC
• Extendable

Our data ingestion helpers

    def readCoreTables(
        name: String,
        portals: List[String],
        columns: List[String]) = {
      val tables = portals.map { portal =>
        sqlCtx.readHiveTable("mysql_imports", s"${portal}__${name}")
      }
      tables.map(_.selectColumns(columns)).reduce(_ unionAll _)
    }

• We have a number of helpers on top of Spark data readers to:
  • Combine sharded data
  • Deduplicate
  • Apply custom data sampling strategies
  • Fix Sqoop related issues
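The "combine sharded data" and "deduplicate" steps can be sketched without Spark, using arrays of hashes as stand-ins for per-portal tables. Everything here is illustrative: `TABLES`, the portal codes, the sample rows and the `deduplicate` helper are assumptions, and the real helpers operate on DataFrames.

```ruby
# Stand-in for per-portal Hive tables: table name => rows (hypothetical data).
TABLES = {
  'lt__items' => [{ id: 1, title: 'Coat', price: 10 },
                  { id: 1, title: 'Coat', price: 10 }],
  'de__items' => [{ id: 7, title: 'Hat', price: 5 }]
}.freeze

# Read the same logical table from every portal, keep only the requested
# columns, and concatenate the shards into one row set.
def read_core_tables(name, portals, columns)
  portals
    .map { |portal| TABLES.fetch("#{portal}__#{name}") }
    .map { |rows| rows.map { |row| row.slice(*columns) } }
    .reduce(:+)
end

# Concatenation keeps duplicate rows, so deduplication is a separate step.
def deduplicate(rows)
  rows.uniq
end

rows = read_core_tables('items', %w[lt de], %i[id title])
unique_rows = deduplicate(rows)
```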

Custom functionality

    def seqContains = udf((seq: Seq[String], elem: String) =>
      seq.contains(elem)
    )

    def isValidDate = udf((date: String) =>
      Option(date).getOrElse("").isValidDate
    )

    dataframe.filter(isValidDate($"created_at"))

• Regular Scala functions wrapped in udf()
• You can use the lower-level (RDD) API for complex functions
  • For example, if state needs to be passed between rows

Transformations
• DataFrames API closely resembles regular SQL operations

    // Create a new DataFrame that contains "young users" only
    val young = users.filter($"age" < 21)

    // Increment everybody's age by 1
    young.select($"name", $"age" + 1)

    // Count the number of young users by gender
    young.groupBy($"gender").count

    // Join young users with another DataFrame called logs
    young.as("young").join(logs.as("logs"),
      $"young.userId" === $"logs.userId", "left_outer")

Cubing

    class NewVisitorsCube(hiveCtx: HiveContext) extends Cube {
      val maxRollupDepth: Int = 4
      val dimensionNames = Set("portal", "first_visit_platform")
      val metrics = MapAggregator(
        metric(
          id = "new_visitors",
          aggregator = Aggregator.size
        )
      )
      def facts = FirstAnonVisitEnrichedFact.read(hiveCtx)
    }

• Cubing is done with the help of the Algebird library
• All metrics calculated in one pass over the data
• Aggregated data is pushed to HBase
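The idea of computing a cube in one pass can be sketched in plain Ruby: for each fact, every combination of dimensions up to a maximum depth gets its counter incremented. This is an illustration only, not Algebird or the `Cube` class above; `cube_counts` is a hypothetical helper, the sample facts are made up, and reading `maxRollupDepth` as "maximum number of dimensions combined" is an assumption.

```ruby
# Toy one-pass cube: count facts for every combination of up to
# max_depth dimensions. Illustrative only; not Algebird or Vinted's Cube.
def cube_counts(facts, dimension_names, max_depth)
  # All dimension subsets of size 0..max_depth (size 0 is the grand total).
  groupings = (0..max_depth).flat_map do |depth|
    dimension_names.combination(depth).to_a
  end

  counts = Hash.new(0)
  facts.each do |fact| # single pass over the data
    groupings.each do |dims|
      key = dims.map { |d| [d, fact[d]] }.to_h
      counts[key] += 1
    end
  end
  counts
end

# Hypothetical facts: one row per new visitor.
facts = [
  { portal: 'lt', first_visit_platform: 'web' },
  { portal: 'lt', first_visit_platform: 'ios' },
  { portal: 'de', first_visit_platform: 'web' }
]
cube = cube_counts(facts, %i[portal first_visit_platform], 2)
```

Each key in `cube` is one cell of the cube: the empty hash holds the grand total, and `{ portal: 'lt' }` holds the count for that portal across all platforms. Precomputing every cell is what lets a serving layer answer slice queries with a lookup.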

Testing
• Unit tests for reusable functionality
• Integration tests for data transformations
• REPL for prototyping and debugging
• Sampled workflow test on production data
• Can be run locally

Spark pros/cons
• Fast (although not the fastest)
• Easy to extend
• Easy to test
• Easy data IO
• Moderate learning curve
• Growing ecosystem of Spark packages
• Configuration requires effort and is sometimes job-specific
• Hard to debug
• Hard to optimise

Notebooks @ Vinted

Zeppelin

Zeppelin
• Pros
  • Dashboards
  • Easy to use for SQL only
  • Nice out of the box visualisation capabilities
• Cons
  • Crashes once a day usually
  • Hard to use in a multi-user environment
• Most useful for quick data digging and ad-hoc BI tasks

Spark Notebook
• Pros
  • Context, dependencies configurable per notebook
  • Stable
  • Support available on Gitter
• Cons
  • More involved setup
  • Scala only
  • Targeted more at developers / data scientists

Job deployment
• Continuous integration: Jenkins + Ruby + Oozie
• Sampled workflow before production
• Catch bugs early

Achievements
• Fast queries for data
• Unified and explicit tracking event definitions
• Explicit fact and dimension definitions
• Fewer errors thanks to job testing, continuous integration and sampled workflow runs

Lessons learnt
• Invest in research time
• For small teams, coding solutions yourself should be a last resort
• Try to avoid bleeding-edge releases
• Invest in testing and automation

What's next?
• Simplified, more approachable fact and dimension generation
• Tracking event auditing and monitoring
• Aggregation during query time
• Replace Oozie and Sqoop

Thanks! @imsaulius saulius @jjarutis jarutis @VintedEng engineering.vinted.com
