I do a bit of this:
Software Engineer, bit of a geek. But then again, who isn’t?
Lead the Insights and Search Analytics Platform at Branded3.
Write lots of C#, PHP and JS, amongst other languages
Slide 3
Journeying into the world of Big Data and NoSQL.
Slide 4
Over the last several months:
Decisions associated with choosing a suitable Technology stack
Making architectural decisions around structuring data for performance
Future-proofing and scaling solutions
Slide 5
INFRASTRUCTURE
PROCESSING
DATA STORAGE
Slide 6
INFRASTRUCTURE
PROCESSING
DATA STORAGE
Slide 7
What is Big Data?
Large amounts of Structured and Unstructured Data
Volumes ranging from 100s of GB to PBs
Traditional RDBMSs struggle to let you effectively retrieve key information
Varies depending on the Application / Company
“That’s reeeeeeeaaaaal big data…”
– Jordan Appleson
Slide 10
It’s not just about size…
It’s about what you’re doing with the data
Analysing, Mining, Visualising
It’s just as much about the tool set
Slide 11
What is NoSQL? (Not Only SQL)
Built to scale to large volumes of data and aid distributed processing
Different types of NoSQL: Document, Graph, Column-Store
Schema-less (to a degree)
Loss of JOINs and certain aggregation features common to RDBMSs
Slide 12
Where I entered the Big Data scene…
“Hey Jordan, come check out this database we have! It’s almost 2TB and you’ll be taking it over…”
Slide 13
“… oh and by the way, I’m leaving in a week… Good Luck with it!”
Slide 14
_/(o_O)\_
Slide 15
I was left with (in terms of data):
a database server running MSSQL storing almost 2TB of data
8GB RAM and 4 cores. (pathetic)
Queries could take anywhere between 15 and 45 minutes
2 tables with 700 million and 300 million rows (500GB and 300GB per table respectively)
Other tables with 10s of millions of rows
Lots of other tables! Some with 100s of columns
Slide 16
Let’s be pragmatic…
We definitely need something that can handle this much data from the get-go and can handle an exponential increase
Is an RDBMS really the right solution for this?
How much time am I going to have to spend ops-side?
Slide 17
Let’s look at MySQL:
Partitioning data is not easy, i.e. Sharding <- Key Point
Bound by Schemas and linear tables
It has replication but how about transparent auto failover?
Maintaining it could potentially be a full-time job straight away… but I need to be writing code?
Slide 18
ALTER TABLE = 2 weeks?!
GTFO
Slide 19
Hadoop
MongoDB
Cassandra
Riak
Couchbase
HBase
PostgreSQL
MySQL
Neo4j
Slide 20
Apache Hadoop
Two Key Components
HDFS
MapReduce
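HDFS provides the distributed, replicated storage; MapReduce is the batch-processing model that runs over it. As a rough illustration of the model (not anything from the deck), here is the classic word count written for Hadoop Streaming, which lets any stdin/stdout program act as the mapper and reducer. Node.js is used here to stay in JavaScript; the file names are made up, and a job would be launched with the streaming jar's -mapper/-reducer/-input/-output options.

```js
// ---- mapper.js (hypothetical) ----
// Hadoop Streaming pipes each input split through this script:
// read lines from stdin, emit "word<TAB>1" per word on stdout.
require('readline')
  .createInterface({ input: process.stdin })
  .on('line', line => {
    for (const word of line.trim().split(/\s+/)) {
      if (word) process.stdout.write(word + '\t1\n');
    }
  });

// ---- reducer.js (hypothetical) ----
// Hadoop sorts mapper output by key before the reduce step, so identical
// words arrive consecutively and a running count is enough.
let current = null;
let count = 0;
require('readline')
  .createInterface({ input: process.stdin })
  .on('line', line => {
    const [word, n] = line.split('\t');
    if (word === current) { count += Number(n); return; }
    if (current !== null) process.stdout.write(current + '\t' + count + '\n');
    current = word;
    count = Number(n);
  })
  .on('close', () => {
    if (current !== null) process.stdout.write(current + '\t' + count + '\n');
  });
```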
Slide 21
Apache Hadoop
Can we just use Hadoop on its own? No.
Need for other systems to allow for real time access and queries
What about the learning curve?
Potentially good for some aspects of our longer term goals though!
Slide 22
Apache Cassandra
Column Based Database
Scales across different locations well
SQL Like Query Language (CQL)
Dynamic Schema
P2P-based clustering (and it’s not a pain to set up!)
Actually a very good option, even now!
Slide 23
Apache Cassandra
How much did I know about Column Based Store optimisations?
What’s the learning curve like?
Interop options with different technology stacks (PHP, C# specifically)
Documentation and community support
TTL? Hm….
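To get a feel for CQL and its built-in TTL from a JavaScript stack, here is a hedged sketch using the DataStax cassandra-driver package for Node.js. The keyspace, table and column names are invented for illustration; this is not code from the project, just what keyword data could look like if Cassandra had been chosen.

```js
// Sketch only: the 'insights' keyspace and 'rankings' table are hypothetical.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',   // required by recent driver versions
  keyspace: 'insights'
});

async function run() {
  // CQL reads a lot like SQL; USING TTL expires the row automatically (30 days here).
  await client.execute(
    'INSERT INTO rankings (keyword, day, position) VALUES (?, ?, ?) USING TTL 2592000',
    ['big data', '2014-03-01', 12],
    { prepare: true }
  );

  const result = await client.execute(
    'SELECT day, position FROM rankings WHERE keyword = ?',
    ['big data'],
    { prepare: true }
  );
  result.rows.forEach(row => console.log(row.day, row.position));
  await client.shutdown();
}

run().catch(console.error);
```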
Slide 24
MongoDB
Dynamic Schema (Document Based)
Out-of-the-box Replication and Sharding
JavaScript Query Engine
Large number of supported drivers
Documentation and Community Support is Excellent
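A quick mongo-shell sketch of what the dynamic schema and JavaScript query interface feel like; the collection and field names are invented for illustration, and documents in the same collection are free to carry different fields.

```js
// Two documents, different shapes, same collection - no ALTER TABLE required.
db.metrics.insertOne({ keyword: "big data", position: 12, day: ISODate("2014-03-01") });
db.metrics.insertOne({ keyword: "nosql", position: 3, day: ISODate("2014-03-01"),
                       engine: "google.co.uk" });

// Queries are plain JavaScript/JSON.
db.metrics.find({ keyword: "big data" }).sort({ day: -1 });
```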
Slide 25
MongoDB
TTL potential looked great
Built for large amounts of data
Migration possibilities for future-proofing ticked the box
But really, it looked like we could get a prototype up and running quickly.
Slide 26
MongoDB
TTL potential looked great (see the sketch below)
Migration possibilities for future-proofing ticked the box
But really, it looked like we could get a prototype up and running quickly.
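The TTL feature mentioned above boils down to a special index: mongod deletes documents once their indexed date is older than expireAfterSeconds. A minimal sketch with invented names:

```js
// Documents in rawChunks disappear roughly 30 days after their createdAt date.
db.rawChunks.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 });

db.rawChunks.insertOne({ createdAt: new Date(), keyword: "big data", position: 12 });
```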
Slide 27
First things first. Denormalise.
Slide 28
A Simple Blog
MongoDB: Posts
MySQL: Posts, Users, Comments, Tags
Slide 29
A Simple Blog - MySQL
[ER diagram: Posts, Users, Comments, Tags]
Slide 30
A Simple Blog - MySQL
Users: User ID, Name
Posts: Post ID, Post Title, Post Body
Comments: Comment ID, Post ID, Comment Body
Tags: Tag ID, Tag Name
Posts Tags: Tag ID, Post ID
Slide 31
MongoDB
Slide 32
MongoDB allows us to better shape our data layer to the application layer.
Our schema is our code… without migrations.
Slide 33
A Simple Blog - MongoDB
One collection. One Query.
No aggregated queries to other tables = No extra query overhead
Less overhead mapping the data layer to the application layer - it’s already formed.
Less time spent making the database, because your application layer can be your schema (see the sketch below).
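As a sketch of what that single collection could look like (field names are illustrative, not taken from the deck), the post, its author, its tags and its comments all live in one document and come back from one query:

```js
// The whole denormalised blog post as one document.
db.posts.insertOne({
  title: "Journeying into Big Data",
  body: "Post body goes here",
  author: { userId: 1, name: "Jordan" },
  tags: ["big data", "nosql", "mongodb"],
  comments: [
    { author: "Dave", body: "Great talk!", postedAt: new Date() }
  ]
});

// One query, no JOINs: author, tags and comments come back with the post.
db.posts.findOne({ title: "Journeying into Big Data" });
```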
Slide 34
Just because the schema’s dynamic does not mean you do not have a schema. Seriously.
Slide 35
More complex document structures…
Mongo allowed us to create buckets of data
With lots of daily data aggregated into buckets, one query could retrieve lots of data with minimal overhead.
Indexes on these sub-document objects
16MB document limit.
Our biggest has 7,000 objects inside, which is ~4MB
But you can effectively select which objects to return to make it even faster (see the sketch below)
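A hedged sketch of the bucket idea with invented names: one document per keyword per day, an index that reaches into the embedded objects, and a projection ($slice here) that returns only part of a large bucket.

```js
// One bucket document holds a whole day's entries for a keyword.
db.buckets.insertOne({
  keyword: "big data",
  day: ISODate("2014-03-01"),
  entries: [
    { engine: "google.co.uk", position: 12, seenAt: ISODate("2014-03-01T09:00:00Z") },
    { engine: "google.com",   position: 15, seenAt: ISODate("2014-03-01T09:05:00Z") }
  ]
});

// Indexes can reach inside the embedded objects…
db.buckets.createIndex({ keyword: 1, day: -1, "entries.engine": 1 });

// …and a projection can return just a slice of the bucket instead of all 7,000 objects.
db.buckets.find(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { entries: { $slice: -100 } }
);
```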
Slide 36
Buckets of data good.
Growing documents bad.
Slide 37
Performance Issues
Growing documents can cause page faults
MongoDB allocates padding
If you exceed the padding, data is moved on disk and performance issues ensue (one common mitigation is sketched below)
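One common mitigation from the MMAPv1 era (a sketch, not necessarily what was done here) is to pre-allocate the bucket at roughly its eventual size, so later writes overwrite slots in place instead of growing the document past its padding.

```js
// Pre-allocate a day's bucket with empty slots (names and sizes are illustrative).
var slots = [];
for (var i = 0; i < 288; i++) {
  slots.push({ engine: null, position: null, seenAt: null });
}
db.buckets.insertOne({ keyword: "big data", day: ISODate("2014-03-01"), entries: slots });

// Later writes replace a slot rather than appending, so the document doesn't move.
db.buckets.updateOne(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { $set: { "entries.42": { engine: "google.co.uk", position: 12, seenAt: new Date() } } }
);
```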
Slide 38
Initial System
2 million updates to the buckets in 1 hour
These updates were effectively small chunks of data that needed to be associated with a day
Push the small chunk to the bucket for the relevant day (the update pattern is sketched below)
Documents began to grow at an alarming rate. Fault… Fault.Fault.FaFaFault
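The problematic write pattern looked roughly like this (names invented): every incoming chunk is $push-ed straight into the day's bucket, so the bucket document keeps growing, outgrows its padding and gets moved on disk.

```js
// Called ~2 million times an hour in the initial system - each call grows the bucket a little.
db.buckets.updateOne(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { $push: { entries: { engine: "google.co.uk", position: 12, seenAt: new Date() } } },
  { upsert: true }
);
```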
Slide 39
Revised System
Insert data into initial collection chunk by chunk (inserts are blazing fast)
Compile the data for each day and then push to bucket
Reduced page faults
Super fast compared to the first version (the revised flow is sketched below)
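A sketch of the revised flow with invented names: raw chunks land in a plain collection with cheap inserts, and each day's bucket is written once from the compiled result.

```js
// 1. Cheap, append-only inserts as data arrives.
db.rawChunks.insertOne({ keyword: "big data", day: ISODate("2014-03-01"),
                         engine: "google.co.uk", position: 12, seenAt: new Date() });

// 2. Once the day is in, compile it and write the bucket in a single update.
var entries = db.rawChunks.find(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { _id: 0, engine: 1, position: 1, seenAt: 1 }
).toArray();

db.buckets.updateOne(
  { keyword: "big data", day: ISODate("2014-03-01") },
  { $set: { entries: entries } },
  { upsert: true }
);
```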
Slide 40
Zero to Data API in a few lines of code
Prototype API mapped URL parameters to queries (a rough sketch follows below)
JSON API literally fell out of our computers in hours
Getting to these data buckets was easy, and our front end could consume and graph out the data with ease.
Allowed us to focus on processing and scaling
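A rough sketch of the "URL parameters become queries" idea using Express and the official Node.js MongoDB driver; the route, database and field names are all invented rather than the actual API.

```js
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();

MongoClient.connect('mongodb://localhost:27017').then(client => {
  const buckets = client.db('insights').collection('buckets');

  // e.g. GET /api/buckets?keyword=big+data&from=2014-03-01
  app.get('/api/buckets', async (req, res) => {
    const query = { keyword: req.query.keyword };
    if (req.query.from) query.day = { $gte: new Date(req.query.from) };

    const docs = await buckets.find(query).sort({ day: 1 }).toArray();
    res.json(docs);   // the front end graphs this JSON directly
  });

  app.listen(3000);
});
```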
Slide 41
Key Points
We needed something that could handle terabytes of data and didn’t require constant babysitting
We needed to be able to get off the ground quickly and get to our data as fast as we were putting data in
We needed to be able to adapt to the extremely volatile industry our metrics are based on
We needed scaling and redundancy options
One person needed to do all this in a matter of months.
Slide 42
Other Issues we had to deal with:
Reducing read/write contention
Ensuring the indexes and collections are better suited to our data access patterns (a sketch follows below)
Dealing with page faults
Mitigated by using SSDs
Growing documents can actually have a negative impact on space efficiency
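Matching indexes to access patterns mostly means building the compound index your common query shape needs and then checking the plan; a small sketch with invented names:

```js
// Index matches the typical access pattern: one keyword, newest days first.
db.buckets.createIndex({ keyword: 1, day: -1 });

// explain() shows whether the query walks the index or scans the whole collection,
// which is effectively the difference between staying in RAM and page faulting to disk.
db.buckets.find({ keyword: "big data" }).sort({ day: -1 }).explain();
```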
Slide 43
Things work best when they’re in memory.
Look at the way you access data: what kind of data should be in memory? (a quick check is sketched below)
If you’re lazy, use SSDs, but they won’t save you forever.
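A quick, rough way to sanity-check whether the working set has any chance of staying in memory is to compare data and index sizes against the RAM in the box (a sketch; sizes are reported in bytes):

```js
var s = db.stats();
print("data size:  " + s.dataSize);
print("index size: " + s.indexSize);

// Per-collection index size - indexes that don't fit in RAM will page fault.
db.buckets.totalIndexSize();
```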
Slide 44
Summary
Dealing with large amounts of data may warrant NoSQL technologies
Use them effectively by looking at how to structure your data to best suit your access and processing patterns.
Schemas are still important #sigh
Knowing the technology you’re using at a core level is VERY useful for performance enhancements
There’s more than one way to skin a cat