Breeding Elephants Amongst Lions (Establishing Data Infrastructure and Hadoop Within A Large Existing System)

Adopting a big-data mindset and establishing a data infrastructure capable of supporting it is a challenge. This session, hosted by iHeartRadio, introduces current technologies (Hadoop/Hive/Impala/Parquet, Luigi, Kafka/Flume, ElasticSearch/Kibana), along with a few examples of data and machine-learning projects that leverage them. We'll leave the realm of theory and also cover operationalization strategies, cluster configurations, and automated deployment using Chef.

Presented by Pasha Katsev

iHeartRadio

February 12, 2014
Transcript

1. iHeartRadio: High Level
   • Provides a rich music experience to users, with features such as:
     – Live radio broadcasting (~2,000 stations; we own and operate 850)
     – 'Custom Radio' (make your own radio) drawing on >15 million tracks
     – 'Perfect For' and 'Custom Talk', two recently launched products
   • 40 million registered users since launch in 2011; the only company that has grown faster is Instagram
   • Almost 70 million unique visitors per month
   • Hence, tons of user activity, which results in tons of data and analytics
2. Next-Gen Backend Tech Stack
   • 10k+ requests/sec with strict latency requirements
   • our average backend response time is 20 ms
   • we pick technologies to solve challenges and problems with a mindset of scalability and maintainability
3. Breeding Elephants Amongst Lions
   Elephants:
   • big data (>1 TB)
   • multiple producers/consumers
   • ML
   • Hadoop
   • realtime
   • universal access
   Lions:
   • storage
   • RDB misuse
   • ETL
   • data formats
   • operations
   • architecture
   • culture
   • resources!
4. Elephant's Ugly Teens
   • a dated view of teenagers, but the point is not lost (especially circa April 2013):
   • scarce resources (docs, people, …)
   • barely maturing technology (Hadoop, Flume/Kafka, Luigi, …)
5. Whose Data is Bigger?
   • http://www.quora.com/Big-Data/How-much-data-is-Big-Data
   • Big Data Checklist [https://medium.com/what-i-learned-building/1b8e3214f96]
   • Facebook's Hadoop cluster is ~30 PB (easy to feel insecure)
   • But it's not just about size:
     – type of data (text, image, video) | representation (graph)
     – degree of structure
     – type of analysis/processing (query, machine learning, enc/dec)
6. Converting Tons to Bytes
   • Logs:
     – Search: ~150 MB/day [web] + ~500 MB/day [mobile], i.e. ~15 GB/month; not big data, unless looking at 6 years of it
     – Play (Custom Radio): ~20 GB/day, which is ~2 TB per quarter, let alone ~8 TB per year
     – Access
     – other logs: CDN, device reports, …
   • Other data sources: media catalogue, user data, thumbs, station data, …
7. Do Elephants Listen To Radio?
   • 10s of TB per year
   • many unstructured sources
   • we want to do more ML
   • existing data warehousing is not keeping up
8. Picking a Distribution: Cloudera
   • stable
   • trusted
   • integrated
   • must have the essential toolset (HDFS, MR, Hive, HBase)
   • need a realtime SQL query engine (Impala vs. Dremel)
   • open source
   • RPMs for our Linux (CentOS)
   • option of enterprise support
9. 5-Node Cluster in EC2
   • Cloudera Install Path A: automated by Cloudera Manager
   • Lions:
     – EBS vs. ephemeral storage: the workload is typically IO-bound, hence EBS is pretty bad; ephemeral disks are fast, but ephemeral
     – partitioning
     – network traffic costs of getting data to/from the cluster (we have a direct link, though)
     – automated install directories might be cumbersome, unless carefully researched and partitioned/mounted beforehand
10. First Elephant: RIP
   • poached by OPS when more instances were needed during a crisis
   • backups to S3 were manual, dated, and never really tested; resuscitation implausible
   • taught us many early lessons:
     – we need more control over the install automation
     – getting data to/from the cluster (Flume, file formats, compression)
     – automating workflows (Luigi)
     – writing MR in Python (streaming, Luigi, MRJob); see the sketch below
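A minimal sketch of that last lesson, using MRJob; the play log's field layout here is an invented example, not the deck's actual schema:

    from mrjob.job import MRJob

    class PlaysPerStation(MRJob):
        """Count plays per station in a tab-separated play log (hypothetical layout)."""

        def mapper(self, _, line):
            # assumed layout: timestamp \t user_id \t station_id \t track_id
            fields = line.split('\t')
            if len(fields) >= 3:
                yield fields[2], 1

        def reducer(self, station_id, counts):
            yield station_id, sum(counts)

    if __name__ == '__main__':
        PlaysPerStation.run()

Run locally with `python plays_per_station.py play.log`, or against the cluster by adding `-r hadoop`; MRJob ships the script to Hadoop Streaming either way.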
11. Flume: Minimally Invasive Surgery
   • great for connecting A (lightweight) to B
   • great for existing architectures
   • composed of simple/small building blocks (agents)
   • integrated: everything under one Apache project
   • many sources: (Avro, Thrift), syslog, JMS, spooling directory, custom
   • many sinks: (Avro, Thrift), HDFS, ElasticSearch, custom
   • extensible
   • basic ETL with serializers
12. Flume: Not Great for Creating Large Relays
   • Lions:
     – configuration per agent
     – JVM
     – source, channel, sink
     – channels: JDBC, file, memory
     – no distribution or partitioning
     – brittle
   • an example of one config description (the memory channel's byteCapacity):
     "Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB."
13. Kafka: Coronary Bypass
   • ideal for connecting A, B, C to D, E, F
   • a message queue at heart (like the ZMQ forwarder)
   • built-in partitioning (per topic)
   • built-in replication (per partition, with an adjustable replication factor)
   • eventual consistency
   • effectively constant performance with respect to data size
   • managed by ZooKeeper
   (a minimal producer sketch follows)
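To give a feel for the producer side, a sketch using the kafka-python client; note this is a newer client than the 0.7/0.8-era libraries discussed on the next slide, and the broker address, topic name, and payload are made up:

    from kafka import KafkaProducer

    # connect to the cluster; 'broker1:9092' is a placeholder address
    producer = KafkaProducer(bootstrap_servers='broker1:9092')

    # publish one play event to a hypothetical 'play_log' topic; Kafka
    # assigns it to one of the topic's partitions, and that partition is
    # replicated across brokers per the topic's replication factor
    producer.send('play_log', b'{"user_id": 42, "station_id": 7}')
    producer.flush()  # block until outstanding sends are acknowledged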
14. Kafka: Rebellious Teens
   • incompatible versions: 0.8 is not backwards compatible with 0.7.x
   • however, 0.8 is a huge step forward in functionality from 0.7.x:
     – replicated partitions
     – producer and consumer are replication-aware and can balance latency vs. safety
     – 'long-poll' consumers (end-to-end latency of a few milliseconds)
   • 0.8 is recommended for all new systems, yet existing producer/consumer projects are still on 0.7 (kafka-river, Camus, Python libs, …)
   • operationally difficult
   • selecting a more-or-less universal (consumer-wise) message format
15. Strategy
   • Long-term: Kafka
     – replace RabbitMQ for the 'play log'
     – support both online and batch consumers
     – provides throughput headroom (100k+ messages/sec)
   • Short-term: Flume
     – make the necessary passageways for lighter, non-critical data
     – great for our 'search log'
16. Kafka / ZooKeeper Ops
   • ZooKeeper:
     – 3 NSX VMs
     – dedicated log drives
     – managed by Exhibitor :: https://github.com/Netflix/exhibitor/wiki
     – Chef :: https://github.com/SimpleFinance/chef-zookeeper
   • Kafka:
     – 2 brokers on direct hardware boxes
     – dedicated log drives
     – Chef :: https://github.com/mthssdrbrg/kafka-cookbook
     – ~5 topics
17. File Formats: Flat (Minimal ETL)
   • gzip (48%)
     – good: compression ratio
     – bad: CPU-heavy; not indexed; not auto-partitioned; can't be loaded directly into Impala (use Hive => ETL)
   • LZO (35%)
     – good: fixes most of gzip's badness; the only supported compression for text in Impala
     – bad: still flat
   (a ratio-measuring sketch follows)
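Whatever the exact definition behind the percentages above (the deck doesn't say), measuring a compressed-to-raw ratio for one of your own log files is a few lines of Python; this sketch is an illustration, not from the deck:

    import gzip
    import os
    import sys

    # print the gzip compressed/raw size ratio for a file given on the
    # command line, in the spirit of the percentages quoted above
    path = sys.argv[1]
    raw_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        gz_size = len(gzip.compress(f.read()))
    print("gzip ratio: {:.0%}".format(gz_size / raw_size))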
18. File Formats: Columnar
   • Parquet (40%)
     – good: compression; performance; Impala-native
     – bad: no raw2parq (no direct raw-to-Parquet loader); ETL via Impala; ETL via MR
   • RCFile or ORC: doesn't seem to offer anything substantial over Parquet
   (a Parquet-writing sketch follows)
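At the time, the raw-to-Parquet path ran through Hive/Impala as noted above; today the same conversion can be sketched directly from Python with pyarrow (which postdates this deck; the column names are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # hypothetical play-log columns; Parquet lays each column out
    # contiguously, which is what makes Impala scans over a few columns cheap
    table = pa.table({
        'user_id':    [42, 43],
        'station_id': [7, 7],
        'ms_played':  [183000, 4000],
    })
    pq.write_table(table, 'play_log.parquet', compression='snappy')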
19. 5-Node Cluster on Direct Hardware
   • Chef-provisioned boxes, but still following CDH Installation Path A
   • Cobbler-partitioned
   • search log connected via Flume
   • play log loaded from csv.gz exports in S3:
     s3curl.pl --id=ihr {url} | zcat | sed 1d | lzop | hadoop fs -put - {fo}
     (the "sed 1d" drops the header row; there's no way to tell Hive to disregard the first row as headers)
   • ETL'd via Hive into Parquet-backed Impala tables
   • multiple dates orchestrated by Luigi; see the sketch below
   • functional testing of Impala proves optimistic | ready to dive in deeper
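A sketch of how one such dated load might look as a Luigi task; the S3 URL, HDFS path, use of subprocess, and the modern luigi.contrib.hdfs layout are all assumptions for illustration, not the deck's actual workflow:

    import subprocess

    import luigi
    from luigi.contrib.hdfs import HdfsTarget

    class LoadPlayLog(luigi.Task):
        """Load one day's play-log export from S3 into HDFS as LZO."""
        date = luigi.DateParameter()

        def output(self):
            # hypothetical HDFS layout: one LZO file per day
            return HdfsTarget('/data/play_log/%s.lzo' % self.date.isoformat())

        def run(self):
            # made-up export URL; the command mirrors the pipeline shown above
            url = ('https://exports.s3.amazonaws.com/play_log/%s.csv.gz'
                   % self.date.isoformat())
            cmd = ('s3curl.pl --id=ihr %s | zcat | sed 1d | lzop '
                   '| hadoop fs -put - %s' % (url, self.output().path))
            subprocess.check_call(cmd, shell=True)

    if __name__ == '__main__':
        luigi.run()

Because Luigi skips any task whose output() already exists, rerunning a whole range of dates is idempotent; only the missing days are fetched.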
20. Second Elephant: RIP
   • Lions:
     – partitioning / RAID: unnecessary RAID configs
     – HDFS shared a partition with verbose logs: contention and unpredictable HDFS capacity
     – CDH 5 released
     – manually created and Chef-created user accounts got mixed together
     – lots of manual path configuration via Cloudera Manager
21. Elephants Are Not Cacti
   • "Hadoop Operations at LinkedIn" (Mar 20, 2013) by Allen Wittenauer, Grid Systems Architect at LinkedIn
     http://www.slideshare.net/allenwittenauer/2013-hadoopsummitemea
   • "Hadoop is not a developer problem; it's an operations problem." (a true story from a Hadoop vendor's ex-employee)
   • many valuable insights into Hadoop ops: partitioning, hardware, security
   • [the slide shows example partition layouts for the NameNode and DataNode]
22. 10-Node Cluster: In Gestation
   • Chef + CDH Install Path B (RPMs)
   • proper partitioning
   • CDH 5
   • more integration with Kafka
23. Electing the Most Relevant Search Result
   • data science: Random Forests (multiple decision trees reaching a consensus)
   • the predictor values are search engine scores
   • pipeline: API servers > Flume > HDFS > MR > Weka > API servers
   (a sketch of the model follows)
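The deck's pipeline trains in Weka; the same idea expressed in scikit-learn, with an invented feature layout: each row holds the engine's scores for one candidate result, the label marks whether users actually picked it, and the forest's vote elects the winner.

    from sklearn.ensemble import RandomForestClassifier

    # hypothetical training data: per-candidate search engine scores
    # (e.g. title match, popularity, personalization) and whether the
    # user actually chose that result
    X = [[0.91, 0.80, 0.75],
         [0.40, 0.62, 0.10],
         [0.85, 0.30, 0.90],
         [0.05, 0.12, 0.20]]
    y = [1, 0, 1, 0]

    # 100 decision trees vote; the consensus is the predicted relevance
    model = RandomForestClassifier(n_estimators=100).fit(X, y)

    candidate = [[0.88, 0.70, 0.60]]
    print(model.predict(candidate))        # consensus label
    print(model.predict_proba(candidate))  # share of tree votes per class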
24. Analytics + Visualization
   • Shiny, a visualization framework for R
   • Hadoop/Impala/Hive via RJDBC (+ https://github.com/tlpinney/dplyr-impala)
   • query ~1 billion rows of search data from R
   • interactive exploratory visualizations in Shiny
   • in-browser (immediate company-wide access to metrics)
   • examples: search exit, mobile client Crashlytics
   (a Python equivalent is sketched below)
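The deck reaches Impala from R over RJDBC; for the Python side of the house, an equivalent query could be sketched with the impyla client (the host, port, and the table/column names are placeholders):

    from impala.dbapi import connect

    # connect to an impalad; 'impala-host' and the default port are placeholders
    conn = connect(host='impala-host', port=21050)
    cur = conn.cursor()

    # hypothetical search_log table: count searches per day
    cur.execute("""
        SELECT ds, COUNT(*) AS searches
        FROM search_log
        GROUP BY ds
        ORDER BY ds
    """)
    for ds, searches in cur.fetchall():
        print(ds, searches)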
25. Despite the Lions
   • dashboarding: ElasticSearch + Kibana
   • configuration management: Bcfg2 (at least look into it)
   • universal data access layer: HCatalog, Presto
   • YARN