Breeding Elephants Amongst Lions (Establishing Data Infrastructure and Hadoop Within A Large Existing System)

Adopting a big-data mindset and establishing a data infrastructure capable of supporting it is a challenge. This session, hosted by iHeartRadio, introduces current technologies (Hadoop/Hive/Impala/Parquet, Luigi, Kafka/Flume, ElasticSearch/Kibana), along with a few examples of data and machine-learning projects that leverage them. We'll leave the realm of theory and also cover operationalization strategies, cluster configurations, and automated deployment using Chef.

Presented by Pasha Katsev

iHeartRadio

February 12, 2014
Transcript

1. iHeartRadio: High Level
   • Provides a rich music experience to users, with features such as:
     – Live radio broadcasting (~2,000 stations; we own and operate 850)
     – 'Custom Radio' (make your own radio) drawing on >15 million tracks
     – 'Perfect For' and 'Custom Talk', two recently launched products
   • 40 million registered users since launch in 2011; the only company that has grown faster is Instagram
   • Almost 70 million unique visitors per month
   • Hence, tons of user activity, which results in tons of data and analytics
2. Next-Gen Backend Tech Stack
   • 10k+ requests/sec with strict latency requirements
   • our average backend response time is 20 ms
   • we pick technologies to solve challenges and problems with a mindset of scalability and maintainability
3. Breeding Elephants Amongst Lions
   Elephants:
   • big data (>1 TB)
   • multiple producers/consumers
   • ML
   • Hadoop
   • realtime
   • universal access
   Lions:
   • storage
   • RDB misuse
   • ETL
   • data formats
   • operations
   • architecture
   • culture
   • resources!
4. Elephant's Ugly Teens
   • a dated view of teenagers, but the point is not lost (especially circa April 2013):
   • scarce resources (docs, people, …)
   • barely maturing technology (Hadoop, Flume/Kafka, Luigi, …)
5. Whose Data is Bigger?
   • http://www.quora.com/Big-Data/How-much-data-is-Big-Data
   • Big Data Checklist [https://medium.com/what-i-learned-building/1b8e3214f96]
   • Facebook's Hadoop cluster is ~30 PB (easy to feel insecure)
   • But it's not just about size:
     – type of data (text, image, video) | representation (graph)
     – degree of structure
     – type of analysis/processing (query, machine learning, enc/dec)
6. Converting Tons to Bytes
   • Logs:
     – Search: ~150 MB/day [web] + ~500 MB/day [mobile], i.e. ~15 GB/month; not big data, unless looking at 6 years of it
     – Play (Custom Radio): ~20 GB/day, which is ~2 TB per quarter, let alone ~8 TB per year
     – Access
     – other logs: CDN, device reports, …
   • Other data sources: media catalogue, user data, thumbs, station data, …
7. Do Elephants Listen To Radio?
   • 10s of TB per year
   • many unstructured sources
   • we want to do more ML
   • existing data warehousing is not keeping up
8. Picking a Distribution: Cloudera
   • stable
   • trusted
   • integrated
   • must have the essential toolset (HDFS, MR, Hive, HBase)
   • need a realtime SQL query engine (Impala vs. Dremel)
   • open source
   • RPMs for our Linux (CentOS)
   • option of enterprise support
9. 5-Node Cluster in EC2
   • Cloudera Install Path A: automated by Cloudera Manager
   • Lions:
     – EBS vs. ephemeral storage: the workload is typically IO-bound, hence EBS is pretty bad; ephemeral disks are fast, but ephemeral
     – partitioning
     – network traffic costs of getting data to/from the cluster (we have a direct link, though)
     – automated install directories might be cumbersome, unless carefully researched and partitioned/mounted beforehand
10. First Elephant: RIP
   • poached by OPS when more instances were needed during a crisis
   • backups to S3 were manual, dated, and never really tested; resuscitation implausible
   • taught us many early lessons:
     – we need more control over the install automation
     – getting data to/from the cluster (Flume, file formats, compression)
     – automating workflows (Luigi)
     – writing MR in Python (streaming, Luigi, MRJob); see the sketch below
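A minimal sketch of that last lesson, using MRJob; the play log's field layout here is an invented example, not the deck's actual schema:

    from mrjob.job import MRJob

    class PlaysPerStation(MRJob):
        """Count plays per station in a tab-separated play log (hypothetical layout)."""

        def mapper(self, _, line):
            # assumed layout: timestamp \t user_id \t station_id \t track_id
            fields = line.split('\t')
            if len(fields) >= 3:
                yield fields[2], 1

        def reducer(self, station_id, counts):
            yield station_id, sum(counts)

    if __name__ == '__main__':
        PlaysPerStation.run()

Run locally with `python plays_per_station.py play.log`, or against the cluster by adding `-r hadoop`; MRJob ships the script to Hadoop Streaming either way.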
11. Flume: Minimally Invasive Surgery
   • great for connecting A (lightweight) to B
   • great for existing architectures
   • composed of simple/small building blocks (agents)
   • integrated: everything under one Apache project
   • many sources: (Avro, Thrift), syslog, JMS, spooling directory, custom
   • many sinks: (Avro, Thrift), HDFS, ElasticSearch, custom
   • extensible
   • basic ETL with serializers
12. Flume: Not Great for Creating Large Relays
   • Lions:
     – configuration per agent
     – JVM
     – source, channel, sink
     – channels: JDBC, file, memory
     – no distribution or partitioning
     – brittle
   • an example of one config description (the memory channel's byteCapacity):
     "Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB."
13. Kafka: Coronary Bypass
   • ideal for connecting A, B, C to D, E, F
   • a message queue at heart (like the ZMQ forwarder)
   • built-in partitioning (per topic)
   • built-in replication (per partition, with an adjustable replication factor)
   • eventual consistency
   • effectively constant performance with respect to data size
   • managed by ZooKeeper
   (a minimal producer sketch follows)
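To give a feel for the producer side, a sketch using the kafka-python client; note this is a newer client than the 0.7/0.8-era libraries discussed on the next slide, and the broker address, topic name, and payload are made up:

    from kafka import KafkaProducer

    # connect to the cluster; 'broker1:9092' is a placeholder address
    producer = KafkaProducer(bootstrap_servers='broker1:9092')

    # publish one play event to a hypothetical 'play_log' topic; Kafka
    # assigns it to one of the topic's partitions, and that partition is
    # replicated across brokers per the topic's replication factor
    producer.send('play_log', b'{"user_id": 42, "station_id": 7}')
    producer.flush()  # block until outstanding sends are acknowledged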
14. Kafka: Rebellious Teens
   • incompatible versions: 0.8 is not backwards compatible with 0.7.x
   • however, 0.8 is a huge step forward in functionality from 0.7.x:
     – replicated partitions
     – producer and consumer are replication-aware and can balance latency vs. safety
     – 'long-poll' consumers (end-to-end latency of a few milliseconds)
   • 0.8 is recommended for all new systems, yet existing producer/consumer projects are still on 0.7 (kafka-river, Camus, Python libs, …)
   • operationally difficult
   • selecting a more-or-less universal (consumer-wise) message format
15. Strategy
   • Long-term: Kafka
     – replace RabbitMQ for the 'play log'
     – support both online and batch consumers
     – provides throughput headroom (100k+ messages/sec)
   • Short-term: Flume
     – make the necessary passageways for lighter, non-critical data
     – great for our 'search log'
16. Kafka / ZooKeeper Ops
   • ZooKeeper:
     – 3 NSX VMs
     – dedicated log drives
     – managed by Exhibitor :: https://github.com/Netflix/exhibitor/wiki
     – Chef :: https://github.com/SimpleFinance/chef-zookeeper
   • Kafka:
     – 2 brokers on direct hardware boxes
     – dedicated log drives
     – Chef :: https://github.com/mthssdrbrg/kafka-cookbook
     – ~5 topics
17. File Formats: Flat (Minimal ETL)
   • gzip (48%)
     – good: compression ratio
     – bad: CPU-heavy; not indexed; not auto-partitioned; can't be loaded directly into Impala (use Hive => ETL)
   • LZO (35%)
     – good: fixes most of gzip's badness; the only supported compression for text in Impala
     – bad: still flat
   (a ratio-measuring sketch follows)
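Whatever the exact definition behind the percentages above (the deck doesn't say), measuring a compressed-to-raw ratio for one of your own log files is a few lines of Python; this sketch is an illustration, not from the deck:

    import gzip
    import os
    import sys

    # print the gzip compressed/raw size ratio for a file given on the
    # command line, in the spirit of the percentages quoted above
    path = sys.argv[1]
    raw_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        gz_size = len(gzip.compress(f.read()))
    print("gzip ratio: {:.0%}".format(gz_size / raw_size))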
18. File Formats: Columnar
   • Parquet (40%)
     – good: compression; performance; Impala-native
     – bad: no raw2parq (no direct raw-to-Parquet loader); ETL via Impala; ETL via MR
   • RCFile or ORC: doesn't seem to offer anything substantial over Parquet
   (a Parquet-writing sketch follows)
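At the time, the raw-to-Parquet path ran through Hive/Impala as noted above; today the same conversion can be sketched directly from Python with pyarrow (which postdates this deck; the column names are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # hypothetical play-log columns; Parquet lays each column out
    # contiguously, which is what makes Impala scans over a few columns cheap
    table = pa.table({
        'user_id':    [42, 43],
        'station_id': [7, 7],
        'ms_played':  [183000, 4000],
    })
    pq.write_table(table, 'play_log.parquet', compression='snappy')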
19. 5-Node Cluster on Direct Hardware
   • Chef-provisioned boxes, but still following CDH Installation Path A
   • Cobbler-partitioned
   • search log connected via Flume
   • play log loaded from csv.gz exports in S3:
     s3curl.pl --id=ihr {url} | zcat | sed 1d | lzop | hadoop fs -put - {fo}
     (the "sed 1d" drops the header row; there's no way to tell Hive to disregard the first row as headers)
   • ETL'd via Hive into Parquet-backed Impala tables
   • multiple dates orchestrated by Luigi; see the sketch below
   • functional testing of Impala proves optimistic | ready to dive in deeper
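A sketch of how one such dated load might look as a Luigi task; the S3 URL, HDFS path, use of subprocess, and the modern luigi.contrib.hdfs layout are all assumptions for illustration, not the deck's actual workflow:

    import subprocess

    import luigi
    from luigi.contrib.hdfs import HdfsTarget

    class LoadPlayLog(luigi.Task):
        """Load one day's play-log export from S3 into HDFS as LZO."""
        date = luigi.DateParameter()

        def output(self):
            # hypothetical HDFS layout: one LZO file per day
            return HdfsTarget('/data/play_log/%s.lzo' % self.date.isoformat())

        def run(self):
            # made-up export URL; the command mirrors the pipeline shown above
            url = ('https://exports.s3.amazonaws.com/play_log/%s.csv.gz'
                   % self.date.isoformat())
            cmd = ('s3curl.pl --id=ihr %s | zcat | sed 1d | lzop '
                   '| hadoop fs -put - %s' % (url, self.output().path))
            subprocess.check_call(cmd, shell=True)

    if __name__ == '__main__':
        luigi.run()

Because Luigi skips any task whose output() already exists, rerunning a whole range of dates is idempotent; only the missing days are fetched.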
20. Second Elephant: RIP
   • Lions:
     – partitioning / RAID: unnecessary RAID configs
     – HDFS shared a partition with verbose logs: contention and unpredictable HDFS capacity
     – CDH 5 released
     – manually created and Chef-created user accounts got mixed together
     – lots of manual path configuration via Cloudera Manager
21. Elephants Are Not Cacti
   • "Hadoop Operations at LinkedIn" (Mar 20, 2013) by Allen Wittenauer, Grid Systems Architect at LinkedIn
     http://www.slideshare.net/allenwittenauer/2013-hadoopsummitemea
   • "Hadoop is not a developer problem; it's an operations problem." (a true story from a Hadoop vendor's ex-employee)
   • many valuable insights into Hadoop ops: partitioning, hardware, security
   • [the slide shows example partition layouts for the NameNode and DataNode]
22. 10-Node Cluster: In Gestation
   • Chef + CDH Install Path B (RPMs)
   • proper partitioning
   • CDH 5
   • more integration with Kafka
23. Electing the Most Relevant Search Result
   • data science: Random Forests (multiple decision trees reaching a consensus)
   • the predictor values are search engine scores
   • pipeline: API servers > Flume > HDFS > MR > Weka > API servers
   (a sketch of the model follows)
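The deck's pipeline trains in Weka; the same idea expressed in scikit-learn, with an invented feature layout: each row holds the engine's scores for one candidate result, the label marks whether users actually picked it, and the forest's vote elects the winner.

    from sklearn.ensemble import RandomForestClassifier

    # hypothetical training data: per-candidate search engine scores
    # (e.g. title match, popularity, personalization) and whether the
    # user actually chose that result
    X = [[0.91, 0.80, 0.75],
         [0.40, 0.62, 0.10],
         [0.85, 0.30, 0.90],
         [0.05, 0.12, 0.20]]
    y = [1, 0, 1, 0]

    # 100 decision trees vote; the consensus is the predicted relevance
    model = RandomForestClassifier(n_estimators=100).fit(X, y)

    candidate = [[0.88, 0.70, 0.60]]
    print(model.predict(candidate))        # consensus label
    print(model.predict_proba(candidate))  # share of tree votes per class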
24. Analytics + Visualization
   • Shiny, a visualization framework for R
   • Hadoop/Impala/Hive via RJDBC (+ https://github.com/tlpinney/dplyr-impala)
   • query ~1 billion rows of search data from R
   • interactive exploratory visualizations in Shiny
   • in-browser (immediate company-wide access to metrics)
   • examples: search exit, mobile client Crashlytics
   (a Python equivalent is sketched below)
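The deck reaches Impala from R over RJDBC; for the Python side of the house, an equivalent query could be sketched with the impyla client (the host, port, and the table/column names are placeholders):

    from impala.dbapi import connect

    # connect to an impalad; 'impala-host' and the default port are placeholders
    conn = connect(host='impala-host', port=21050)
    cur = conn.cursor()

    # hypothetical search_log table: count searches per day
    cur.execute("""
        SELECT ds, COUNT(*) AS searches
        FROM search_log
        GROUP BY ds
        ORDER BY ds
    """)
    for ds, searches in cur.fetchall():
        print(ds, searches)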
25. Despite the Lions
   • dashboarding: ElasticSearch + Kibana
   • configuration management: Bcfg2 (at least look into it)
   • universal data access layer: HCatalog, Presto
   • YARN