I do a bit of this:
Software Engineer, bit of a geek. But then again, who isn’t?
Lead the Insights and Search Analytics Platform at Branded3.
Write lots of C#, PHP and JS, amongst other languages
Slide 3
Journeying into the world of Big Data and NoSQL.
Slide 4
Over the last several months:
Decisions associated with choosing a suitable Technology stack
Making architectural decisions around structuring data for performance
Future-proofing and scaling solutions
Slide 5
INFRASTRUCTURE
PROCESSING
DATA STORAGE
Slide 6
INFRASTRUCTURE
PROCESSING
DATA STORAGE
Slide 7
What is Big Data?
Large amounts of Structured and Unstructured Data
Volumes ranging from 100s of GB to PBs
Traditional RDBMSs struggle to let you effectively retrieve key information
Varies depending on the Application / Company
“That’s reeeeeeeaaaaal big data…”
– Jordan Appleson
Slide 10
It’s not just about size…
It’s about what you’re doing with the data
Analysing, Mining, Visualising
It’s just as much about the tool set
Slide 11
What is NoSQL? (Not Only SQL)
Built to scale to large volumes of data and aid distributed processing
Different types of NoSQL: Document, Graph, Column-Store
Schema-less (to a degree)
Loss of JOINs and certain aggregation features common to RDBMSs
Slide 12
Where I entered the Big Data scene…
“Hey Jordan, come check out this database we have! It’s almost 2TB and you’ll be taking it over…”
Slide 13
“… oh and by the way, I’m leaving in a week… Good Luck with it!”
Slide 14
_/(o_O)\_
Slide 15
I was left with (in terms of data):
a database server running MSSQL storing almost 2TB of data
8GB RAM and 4 cores. (pathetic)
Queries could take anywhere between 15 and 45 minutes
2 tables with 700 million and 300 million rows (500GB and 300GB per table respectively)
Other tables with 10s of millions of rows
Lots of other tables! Some with 100s of columns
Slide 16
Let’s be pragmatic…
We definitely need something that can handle this much data from the get-go and can handle an exponential increase
Is an RDBMS really the right solution for this?
How much time am I going to have to spend ops-side?
Slide 17
Let’s look at MySQL:
Partitioning data is not easy, i.e. Sharding <- Key Point
Bound by Schemas and linear tables
It has replication but how about transparent auto failover?
Maintaining it could potentially be a full-time job straight away… but I need to be writing code?
Slide 18
ALTER TABLE = 2 weeks?!
GTFO
Slide 19
Hadoop
MongoDB
Cassandra
Riak
Couchbase
HBase
PostgreSQL
MySQL
Neo4j
Slide 20
Apache Hadoop
Two Key Components
HDFS
MapReduce
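HDFS provides the distributed, replicated storage; MapReduce is the batch-processing model that runs over it. As a rough illustration of the model (not anything from the deck), here is the classic word count written for Hadoop Streaming, which lets any stdin/stdout program act as the mapper and reducer. Node.js is used here to stay in JavaScript; the file names are made up, and a job would be launched with the streaming jar's -mapper/-reducer/-input/-output options.

```js
// ---- mapper.js (hypothetical) ----
// Hadoop Streaming pipes each input split through this script:
// read lines from stdin, emit "word<TAB>1" per word on stdout.
require('readline')
  .createInterface({ input: process.stdin })
  .on('line', line => {
    for (const word of line.trim().split(/\s+/)) {
      if (word) process.stdout.write(word + '\t1\n');
    }
  });

// ---- reducer.js (hypothetical) ----
// Hadoop sorts mapper output by key before the reduce step, so identical
// words arrive consecutively and a running count is enough.
let current = null;
let count = 0;
require('readline')
  .createInterface({ input: process.stdin })
  .on('line', line => {
    const [word, n] = line.split('\t');
    if (word === current) { count += Number(n); return; }
    if (current !== null) process.stdout.write(current + '\t' + count + '\n');
    current = word;
    count = Number(n);
  })
  .on('close', () => {
    if (current !== null) process.stdout.write(current + '\t' + count + '\n');
  });
```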
Slide 21
Apache Hadoop
Can we just use Hadoop on its own? No.
Need for other systems to allow for real time access and queries
What about the learning curve?
Potentially good for some aspects of our longer term goals though!
Slide 22
Apache Cassandra
Column Based Database
Scales across different locations well
SQL Like Query Language (CQL)
Dynamic Schema
P2P-based clustering (and it’s not a pain to set up!)
Actually a very good option, even now!
Slide 23
Apache Cassandra
How much did I know about Column Based Store optimisations?
What’s the learning curve like?
Interop options with different technology stacks (PHP, C# specifically)
Documentation and community support
TTL? Hm….
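To get a feel for CQL and its built-in TTL from a JavaScript stack, here is a hedged sketch using the DataStax cassandra-driver package for Node.js. The keyspace, table and column names are invented for illustration; this is not code from the project, just what keyword data could look like if Cassandra had been chosen.

```js
// Sketch only: the 'insights' keyspace and 'rankings' table are hypothetical.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',   // required by recent driver versions
  keyspace: 'insights'
});

async function run() {
  // CQL reads a lot like SQL; USING TTL expires the row automatically (30 days here).
  await client.execute(
    'INSERT INTO rankings (keyword, day, position) VALUES (?, ?, ?) USING TTL 2592000',
    ['big data', '2014-03-01', 12],
    { prepare: true }
  );

  const result = await client.execute(
    'SELECT day, position FROM rankings WHERE keyword = ?',
    ['big data'],
    { prepare: true }
  );
  result.rows.forEach(row => console.log(row.day, row.position));
  await client.shutdown();
}

run().catch(console.error);
```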
Slide 24
MongoDB
Dynamic Schema (Document Based)
Out-of-the-box Replication and Sharding
JavaScript Query Engine
Large number of supported drivers
Documentation and Community Support is Excellent
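A quick mongo-shell sketch of what the dynamic schema and JavaScript query interface feel like; the collection and field names are invented for illustration, and documents in the same collection are free to carry different fields.

```js
// Two documents, different shapes, same collection - no ALTER TABLE required.
db.metrics.insertOne({ keyword: "big data", position: 12, day: ISODate("2014-03-01") });
db.metrics.insertOne({ keyword: "nosql", position: 3, day: ISODate("2014-03-01"),
                       engine: "google.co.uk" });

// Queries are plain JavaScript/JSON.
db.metrics.find({ keyword: "big data" }).sort({ day: -1 });
```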
Slide 25
MongoDB
TTL potential looked great
Built for large amounts of data
Migration possibilities for future-proofing ticked the box
But really, it looked like we could get a prototype up and running quickly.
Slide 26
MongoDB
TTL potential looked great (see the sketch below)
Migration possibilities for future-proofing ticked the box
But really, it looked like we could get a prototype up and running quickly.
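The TTL feature mentioned above boils down to a special index: mongod deletes documents once their indexed date is older than expireAfterSeconds. A minimal sketch with invented names:

```js
// Documents in rawChunks disappear roughly 30 days after their createdAt date.
db.rawChunks.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 });

db.rawChunks.insertOne({ createdAt: new Date(), keyword: "big data", position: 12 });
```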
Slide 27
First things first. Denormalise.
Slide 28
A Simple Blog
MongoDB: Posts
MySQL: Posts, Users, Comments, Tags
Slide 29
A Simple Blog - MySQL
[ER diagram: Posts, Users, Comments, Tags]
Slide 30
A Simple Blog - MySQL
Users: User ID, Name
Posts: Post ID, Post Title, Post Body
Comments: Comment ID, Post ID, Comment Body
Tags: Tag ID, Tag Name
Posts Tags: Tag ID, Post ID
Slide 31
MongoDB
Slide 32
MongoDB allows us to better shape our data layer to the application layer.
Our schema is our code… without migrations.
Slide 33
A Simple Blog - MongoDB
One collection. One Query.
No aggregated queries to other tables = No extra query overhead
Less overhead mapping the data layer to the application layer - it’s already formed.
Less time spent making the database, because your application layer can be your schema (see the sketch below).
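As a sketch of what that single collection could look like (field names are illustrative, not taken from the deck), the post, its author, its tags and its comments all live in one document and come back from one query:

```js
// The whole denormalised blog post as one document.
db.posts.insertOne({
  title: "Journeying into Big Data",
  body: "Post body goes here",
  author: { userId: 1, name: "Jordan" },
  tags: ["big data", "nosql", "mongodb"],
  comments: [
    { author: "Dave", body: "Great talk!", postedAt: new Date() }
  ]
});

// One query, no JOINs: author, tags and comments come back with the post.
db.posts.findOne({ title: "Journeying into Big Data" });
```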
Slide 34
Just because the schema’s dynamic does not mean you do not have a schema. Seriously.
Slide 35
More complex document structures…
Mongo allowed us to create buckets of data
With lots of daily data aggregated into buckets, one query could retrieve lots of data with minimal overhead.
Indexes on these sub-document objects
16MB document limit.
Our biggest has 7,000 objects inside, which is ~4MB
But you can effectively select which objects to return to make it even faster (see the sketch below)
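A hedged sketch of the bucket idea with invented names: one document per keyword per day, an index that reaches into the embedded objects, and a projection ($slice here) that returns only part of a large bucket.

```js
// One bucket document holds a whole day's entries for a keyword.
db.buckets.insertOne({
  keyword: "big data",
  day: ISODate("2014-03-01"),
  entries: [
    { engine: "google.co.uk", position: 12, seenAt: ISODate("2014-03-01T09:00:00Z") },
    { engine: "google.com",   position: 15, seenAt: ISODate("2014-03-01T09:05:00Z") }
  ]
});

// Indexes can reach inside the embedded objects…
db.buckets.createIndex({ keyword: 1, day: -1, "entries.engine": 1 });

// …and a projection can return just a slice of the bucket instead of all 7,000 objects.
db.buckets.find(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { entries: { $slice: -100 } }
);
```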
Slide 36
Buckets of data good.
Growing documents bad.
Slide 37
Performance Issues
Growing documents can cause page faults
MongoDB allocates padding
If you exceed the padding, data is moved on disk and performance issues ensue (one common mitigation is sketched below)
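One common mitigation from the MMAPv1 era (a sketch, not necessarily what was done here) is to pre-allocate the bucket at roughly its eventual size, so later writes overwrite slots in place instead of growing the document past its padding.

```js
// Pre-allocate a day's bucket with empty slots (names and sizes are illustrative).
var slots = [];
for (var i = 0; i < 288; i++) {
  slots.push({ engine: null, position: null, seenAt: null });
}
db.buckets.insertOne({ keyword: "big data", day: ISODate("2014-03-01"), entries: slots });

// Later writes replace a slot rather than appending, so the document doesn't move.
db.buckets.updateOne(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { $set: { "entries.42": { engine: "google.co.uk", position: 12, seenAt: new Date() } } }
);
```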
Slide 38
Initial System
2 million updates to the buckets in 1 hour
These updates were effectively small chunks of data that needed to be associated with a day
Push the small chunk to the bucket for the relevant day (the update pattern is sketched below)
Documents began to grow at an alarming rate. Fault… Fault.Fault.FaFaFault
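The problematic write pattern looked roughly like this (names invented): every incoming chunk is $push-ed straight into the day's bucket, so the bucket document keeps growing, outgrows its padding and gets moved on disk.

```js
// Called ~2 million times an hour in the initial system - each call grows the bucket a little.
db.buckets.updateOne(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { $push: { entries: { engine: "google.co.uk", position: 12, seenAt: new Date() } } },
  { upsert: true }
);
```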
Slide 39
Revised System
Insert data into initial collection chunk by chunk (inserts are blazing fast)
Compile the data for each day and then push to bucket
Reduced page faults
Super fast compared to the first version (the revised flow is sketched below)
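A sketch of the revised flow with invented names: raw chunks land in a plain collection with cheap inserts, and each day's bucket is written once from the compiled result.

```js
// 1. Cheap, append-only inserts as data arrives.
db.rawChunks.insertOne({ keyword: "big data", day: ISODate("2014-03-01"),
                         engine: "google.co.uk", position: 12, seenAt: new Date() });

// 2. Once the day is in, compile it and write the bucket in a single update.
var entries = db.rawChunks.find(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { _id: 0, engine: 1, position: 1, seenAt: 1 }
).toArray();

db.buckets.updateOne(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { $set: { entries: entries } },
  { upsert: true }
);
```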
Slide 40
Zero to Data API in a few lines of code
Prototype API mapped URL parameters to queries (a rough sketch follows below)
JSON API literally fell out of our computers in hours
Getting to these data buckets was easy, and our front end could consume and graph out the data with ease.
Allowed us to focus on processing and scaling
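A rough sketch of the "URL parameters become queries" idea using Express and the official Node.js MongoDB driver; the route, database and field names are all invented rather than the actual API.

```js
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();

MongoClient.connect('mongodb://localhost:27017').then(client => {
  const buckets = client.db('insights').collection('buckets');

  // e.g. GET /api/buckets?keyword=big+data&from=2014-03-01
  app.get('/api/buckets', async (req, res) => {
    const query = { keyword: req.query.keyword };
    if (req.query.from) query.day = { $gte: new Date(req.query.from) };

    const docs = await buckets.find(query).sort({ day: 1 }).toArray();
    res.json(docs);   // the front end graphs this JSON directly
  });

  app.listen(3000);
});
```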
Slide 41
Key Points
We needed something that could handle terabytes of data and didn’t require constant babysitting
We needed to be able to get off the ground quickly and get to our data as fast as we were putting data in
We needed to be able to adapt to the extremely volatile industry our metrics are based on
We needed scaling and redundancy options
One person needed to do all this in a matter of months.
Slide 42
Other Issues we had to deal with:
Reducing read/write contention
Ensuring the indexes and collections are better suited to our data access patterns (a sketch follows below)
Dealing with page faults
Mitigated by using SSDs
Growing documents can actually have a negative impact on space efficiency
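Matching indexes to access patterns mostly means building the compound index your common query shape needs and then checking the plan; a small sketch with invented names:

```js
// Index matches the typical access pattern: one keyword, newest days first.
db.buckets.createIndex({ keyword: 1, day: -1 });

// explain() shows whether the query walks the index or scans the whole collection,
// which is effectively the difference between staying in RAM and page faulting to disk.
db.buckets.find({ keyword: "big data" }).sort({ day: -1 }).explain();
```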
Slide 43
Things work best when they’re in memory.
Look at the way you access data: what kind of data should be in memory? (a quick check is sketched below)
If you’re lazy, use SSDs, but they won’t save you forever.
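A quick, rough way to sanity-check whether the working set has any chance of staying in memory is to compare data and index sizes against the RAM in the box (a sketch; sizes are reported in bytes):

```js
var s = db.stats();
print("data size:  " + s.dataSize);
print("index size: " + s.indexSize);

// Per-collection index size - indexes that don't fit in RAM will page fault.
db.buckets.totalIndexSize();
```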
Slide 44
Summary
Dealing with large amounts of data may warrant NoSQL technologies
Use them effectively by looking at how to structure your data to best suit your access and processing patterns.
Schemas are still important #sigh
Knowing the technology you’re using at a core level is VERY useful for performance enhancements
There’s more than one way to skin a cat