
The promise and peril of abundance: Making Big Data small. BRENDAN MCADAMS at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/the-promise-and-peril-of-abundance-making-big-data-small/brendan-mcadams

Transcript

  1. Brendan McAdams, 10gen, Inc. ([email protected], @rit). A Modest Proposal for Taming and Clarifying the Promises of Big Data and the Software-Driven Future
  2. "In short, software is eating the world." - Marc Andreesen

    Wall Street Journal, Aug. 2011 http://on.wsj.com/XLwnmo Friday, November 16, 12
  3. Software is Eating the World • Amazon.com (and .uk, .es, etc.) started as a bookstore • Today, they sell just about everything: bicycles, appliances, computers, TVs, etc. • In some American cities, they even do home grocery delivery • No longer just a physical-goods company; increasingly defined by software • Pioneering the eBook revolution with the Kindle • EC2 runs a huge percentage of the public internet
  4. Software is Eating the World • Netflix started as a company to deliver DVDs to the home...
  5. Software is Eating the World • Netflix started as a company to deliver DVDs to the home... • But as they’ve grown, the business has shifted to an online streaming service • They are now rolling out rapidly in many countries, including Ireland, the UK, Canada and the Nordics • No need for physical inventory or postal distribution... just servers and digital copies
  6. But What Does All This Software Do? • Software always eats data – be it text files, user form input, emails, etc. • All things that eat must eventually excrete...
  7. So What Does Software Eat? • Software always eats data – be it text files, user form input, emails, etc. • But what does software excrete? • More data, of course... • This data gets bigger and bigger • The options for storing & processing it become narrower • Data fertilizes software, in an endless cycle...
  8. There’s a Big Market Here... • Lots of solutions for Big Data • Data warehouse software • Operational databases • Old-style systems being upgraded to scale storage + processing • NoSQL: Cassandra, MongoDB, etc. • Platforms: Hadoop
  9. Don’t Tilt At Windmills... • It is easy to get distracted by all of these solutions • Keep it simple • Use tools you (and your team) can understand • Use tools and techniques that can scale • Try not to reinvent the wheel
  10. ...And Don’t Bite Off More Than You Can Chew • Break it into smaller pieces • You can’t fit a whole pig into your mouth... • ...slice it into small parts that you can consume.
  11. Big Data at a Glance • Big Data can be gigabytes, terabytes, petabytes or exabytes • An ideal big data system scales up and down across these data sizes while providing a uniform view • Major concerns: • Can I read & write this data efficiently at different scales? • Can I run calculations on large portions of this data? [Diagram: a large dataset with “username” as primary key]
  12. Big Data at a Glance • Systems like Google File System (which inspired Hadoop’s HDFS) and MongoDB’s sharding handle the scale problem by chunking • Break data up into smaller chunks, spread across many data nodes • Each data node contains many chunks • If a chunk gets too large or a node becomes overloaded, data can be rebalanced [Diagram: a large dataset with “username” as primary key, divided into chunks]
  13. Chunks Represent Ranges of Values • Initially, an empty collection has a single chunk covering the range from minimum (-∞) to maximum (+∞) • As we add data (INSERT {USERNAME: “Bill”}, {USERNAME: “Becky”}, {USERNAME: “Brendan”}, {USERNAME: “Brad”}), more chunks are created covering new ranges • Individual or partial letter ranges (e.g. -∞ → “B”, “Ba” → “Be”, “Br” → +∞) are one possible chunk value... but they can get smaller! • The smallest possible chunk value is not a range but a single value (a toy sketch of range splitting follows) [Diagram: chunk ranges splitting as usernames are inserted]
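To make the range-splitting idea concrete, here is a toy Python sketch of chunks as half-open key ranges that split when they grow too large. It is illustrative only, not MongoDB’s implementation; the Chunk class, the MAX_KEYS_PER_CHUNK threshold and the split-at-the-median rule are all invented for the example.

```python
# Toy illustration of range-based chunking on a "username" shard key.
# Not MongoDB's implementation: Chunk, MAX_KEYS_PER_CHUNK and the
# split-at-the-median rule are invented for this sketch.
MAX_KEYS_PER_CHUNK = 4                   # tiny threshold so splits happen quickly

class Chunk:
    """Holds shard-key values in the half-open range [low, high)."""
    def __init__(self, low, high):
        self.low, self.high = low, high  # None stands for -inf / +inf
        self.keys = []

chunks = [Chunk(None, None)]             # empty collection: one chunk, (-inf, +inf)

def find_chunk(key):
    for c in chunks:
        if (c.low is None or key >= c.low) and (c.high is None or key < c.high):
            return c

def insert(username):
    c = find_chunk(username)
    c.keys.append(username)
    c.keys.sort()
    if len(c.keys) > MAX_KEYS_PER_CHUNK:  # chunk too big: split at the median key
        mid = c.keys[len(c.keys) // 2]
        left, right = Chunk(c.low, mid), Chunk(mid, c.high)
        left.keys = [k for k in c.keys if k < mid]
        right.keys = [k for k in c.keys if k >= mid]
        i = chunks.index(c)
        chunks[i:i + 1] = [left, right]

for name in ["Bill", "Becky", "Brendan", "Brad", "Alice", "Carol"]:
    insert(name)

for c in chunks:
    print(c.low, "->", c.high, c.keys)
```

Running it shows the initial (-∞, +∞) chunk splitting into narrower username ranges as documents arrive.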
  14. Big Data at a Glance • To simplify things, let’s look at our dataset split into chunks by letter • Each chunk is represented by a single letter marking its contents • You could think of “B” as really being “Ba” → “Bz” [Diagram: a large dataset keyed by “username”, chunked a, b, c, d, e, f, g, h ... s, t, u, v, w, x, y, z]
  15. Big Data at a Glance [Diagram: a large dataset keyed by “username”, chunked a through z]
  16. Big Data at a Glance • MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB) (a setup sketch follows) [Diagram: a large dataset keyed by “username”, chunked a through z]
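As a minimal sketch of how this is switched on in MongoDB, the following pymongo snippet enables sharding on a collection keyed by username. It assumes a sharded cluster with a mongos router reachable at localhost:27017; the database and collection names (mydb.users) are placeholders.

```python
# Minimal sketch: enable range-based sharding on a "username" key via pymongo.
# Assumes a running sharded cluster with a mongos at localhost:27017;
# "mydb" and "mydb.users" are placeholder names.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)           # connect to the mongos router
client.admin.command("enableSharding", "mydb")     # allow mydb to be sharded
client.admin.command("shardCollection", "mydb.users",
                     key={"username": 1})          # chunk the collection by username
```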
  17. Big Data at a Glance • Representing data as chunks allows many levels of scale across n data nodes [Diagram: chunks a through z spread over Data Nodes 1-4, each holding 25% of the chunks]
  18. Scaling • The set of chunks can be evenly distributed across n data nodes [Diagram: chunks a through z distributed over Data Nodes 1-5]
  19. Add Nodes: Chunk Rebalancing • The goal is equilibrium: an equal distribution • As nodes are added (or even removed), chunks can be redistributed for balance (a toy rebalancing sketch follows) [Diagram: chunks a through z rebalanced across Data Nodes 1-5]
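The following toy Python sketch shows the rebalancing idea: when an empty node joins, chunks are moved from the fullest nodes to the emptiest until the counts are roughly even. It illustrates the goal only, not MongoDB’s actual balancer (which migrates chunks incrementally in the background); the node names and data structures are invented.

```python
# Toy illustration of chunk rebalancing when a node is added. Real balancers
# (e.g. MongoDB's) migrate chunks incrementally in the background; the node
# names and data structures here are invented for the sketch.
import string

def rebalance(nodes):
    """Move chunks from the fullest node to the emptiest until counts differ by at most 1."""
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return nodes
        nodes[emptiest].append(nodes[fullest].pop())   # migrate one chunk

# 26 chunks ("a".."z") spread over 4 nodes, then an empty 5th node is added.
nodes = {f"node{i + 1}": list(string.ascii_lowercase[i::4]) for i in range(4)}
nodes["node5"] = []
rebalance(nodes)
print({name: sorted(chunks) for name, chunks in nodes.items()})
```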
  20. Don’t Bite Off More Than You Can Chew... • The answer to calculating on big data is much the same as storing it • We need to break our data into bite-sized pieces • Build functions which can be composed together repeatedly on partitions of our data • Process portions of the data across multiple calculation nodes • Aggregate the results into a final result set (a sketch of this pattern follows)
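A minimal Python sketch of this split/process/aggregate pattern, using a process pool to stand in for separate calculation nodes and a trivial partial-sum function as the composable unit of work. The dataset, partition size and function are placeholders.

```python
# Sketch of the split/process/aggregate pattern: break the data into
# partitions, run the same composable function on each partition in parallel
# (a process pool stands in for separate calculation nodes), then combine the
# partial results. The dataset and partial_sum function are placeholders.
from concurrent.futures import ProcessPoolExecutor

def partitions(data, size):
    """Yield bite-sized slices of the dataset."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def partial_sum(part):
    """The composable unit of work, applied to one partition."""
    return sum(part)

if __name__ == "__main__":
    dataset = list(range(1_000_000))
    with ProcessPoolExecutor() as pool:
        partials = pool.map(partial_sum, partitions(dataset, 100_000))
    print(sum(partials))   # aggregate the partial results into a final answer
```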
  21. Bite-Sized Pieces Are Easier to Swallow • These pieces are not chunks – rather, the individual data points that make up each chunk • Chunks also make useful data-transfer units for processing • Transfer chunks as “input splits” to calculation nodes, allowing for scalable parallel processing
  22. MapReduce the Pieces • The most common application of these techniques is MapReduce • Based on a Google whitepaper, it works with two primary functions – map and reduce – to calculate against large datasets
  23. MapReduce to Calculate Big Data • MapReduce is designed to effectively process data at varying scales • Composable function units can be reused repeatedly for scaled results
  24. MapReduce to Calculate Big Data • In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation • MongoDB can be integrated with Hadoop to MapReduce data • No HDFS storage needed – data moves directly between MongoDB and Hadoop’s MapReduce engine
  25. What is MapReduce? • MapReduce is made up of a series of phases, the primary ones being Map, Shuffle and Reduce • Let’s look at a typical MapReduce job: email records, counting the # of times a particular user has received email
  26. MapReducing Email • Sample records: • to: tyler, from: brendan, subject: Ruby Support • to: brendan, from: tyler, subject: Re: Ruby Support • to: mike, from: brendan, subject: Node Support • to: brendan, from: mike, subject: Re: Node Support • to: mike, from: tyler, subject: COBOL Support • to: tyler, from: mike, subject: Re: COBOL Support (WTF?)
  27. Map Step • The map function breaks each document into a key (grouping) and a value, calling emit(k, v) • For the six emails above it emits: key: tyler, value: {count: 1} • key: brendan, value: {count: 1} • key: mike, value: {count: 1} • key: brendan, value: {count: 1} • key: mike, value: {count: 1} • key: tyler, value: {count: 1} (a sketch follows)
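A plain-Python sketch of the map step for this example: each email document is mapped to a (key, value) pair keyed by the recipient. This illustrates the idea only; it is not MongoDB’s or Hadoop’s actual MapReduce API.

```python
# Plain-Python sketch of the map step: each email is turned into a
# (key, value) pair keyed by its recipient. Illustrative only, not the
# MongoDB/Hadoop MapReduce API.
EMAILS = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
]

def map_email(doc):
    """Emit one (key, value) pair per document: the recipient and a count of 1."""
    yield doc["to"], {"count": 1}

emitted = [pair for doc in EMAILS for pair in map_email(doc)]
print(emitted)   # [('tyler', {'count': 1}), ('brendan', {'count': 1}), ...]
```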
  28. Group/Shuffle Step • Input: the emitted pairs – key: tyler {count: 1}, key: brendan {count: 1}, key: mike {count: 1}, key: brendan {count: 1}, key: mike {count: 1}, key: tyler {count: 1} • Group like keys together, collecting their values into an array (done automatically by M/R frameworks)
  29. Group/Shuffle Step • After grouping: key: brendan, values: [{count: 1}, {count: 1}] • key: mike, values: [{count: 1}, {count: 1}] • key: tyler, values: [{count: 1}, {count: 1}] • Like keys are grouped together, their values collected into an array (done automatically by M/R frameworks) (a sketch follows)
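Continuing the sketch in plain Python, the group/shuffle step collects the emitted values under their keys; frameworks perform this automatically, but spelling it out with a dict makes the behaviour clear. The emitted pairs are copied from the map step above.

```python
# Plain-Python sketch of the group/shuffle step: collect the values emitted by
# the map step under their keys (frameworks do this for you automatically).
from collections import defaultdict

emitted = [                                   # output of the map step above
    ("tyler", {"count": 1}), ("brendan", {"count": 1}), ("mike", {"count": 1}),
    ("brendan", {"count": 1}), ("mike", {"count": 1}), ("tyler", {"count": 1}),
]

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

print(shuffle(emitted))
# {'tyler': [{'count': 1}, {'count': 1}], 'brendan': [...], 'mike': [...]}
```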
  30. Reduce Step • Input: key: brendan, values: [{count: 1}, {count: 1}] • key: mike, values: [{count: 1}, {count: 1}] • key: tyler, values: [{count: 1}, {count: 1}] • For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result) • Output: key: tyler, value: {count: 2} • key: mike, value: {count: 2} • key: brendan, value: {count: 2} (a sketch follows)
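And the reduce step in the same plain-Python sketch: each key’s list of values is collapsed to a single result by summing the counts, reproducing the totals shown above.

```python
# Plain-Python sketch of the reduce step: flatten each key's list of values to
# a single result by summing the counts.
grouped = {                                   # output of the shuffle step above
    "brendan": [{"count": 1}, {"count": 1}],
    "mike":    [{"count": 1}, {"count": 1}],
    "tyler":   [{"count": 1}, {"count": 1}],
}

def reduce_counts(key, values):
    return {"count": sum(v["count"] for v in values)}

print({key: reduce_counts(key, values) for key, values in grouped.items()})
# {'brendan': {'count': 2}, 'mike': {'count': 2}, 'tyler': {'count': 2}}
```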
  31. Processing Scalable Big Data • MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond) • MapReduce is supported in many places, including MongoDB & Hadoop • We now have effective answers to both of our concerns: • Can I read & write this data efficiently at different scales? • Can I run calculations on large portions of this data?
  32. Batch Isn’t a Sustainable Answer • There are downsides here – fundamentally, MapReduce is a batch process • Batch systems like Hadoop give us a “Catch-22” • You can get answers to questions from petabytes of data • But you can’t guarantee you’ll get them quickly • In some ways, this is a step backwards for our industry • Business stakeholders tend to want answers now • We must evolve
  33. Moving Away from Batch • The Big Data world is moving rapidly away from slow, batch-based processing solutions • Google has moved from batch toward more realtime processing over the last few years • Hadoop is replacing “MapReduce as assembly language” with more flexible resource management in YARN • Now MapReduce is just one framework implemented on top of YARN – we can build anything we want • Newer systems like Spark & Storm provide platforms for realtime processing
  34. In Closing • The world IS being eaten by software • All that software is leaving behind an awful lot of data • We must be careful not to “step in it” • More data means more software means more data means... • Practical solutions for processing & storing data will save us • As data scientists & technologists, we must always evolve our strategies, thinking and tools