Franz Bettag
Director of R&D and Network Operations at
München, 2014
Through the Database-Jungle
Slide 2
Slide 2 text
Who is this guy?
Franz Bettag
28 years old (soon 29)
Nuremberg, Germany
Scala, Liftweb and Netty Connoisseur
FreeBSD, Redis, PostgreSQL and Cassandra aficionado
Director of R&D and Network Operations at wasted.io Ltd.
Slide 3
Slide 3 text
Seriously, what does he do..
Lead development on projects for the company
Client consulting in special cases
BGP, OSPF, etc, short - the Routing guy
Molesting Open Source projects
Slide 4
Slide 4 text
Not wasting IO anymore
17 years of coding (10 years commercial)
16 years on FreeBSD and NetBSD
8 years on PostgreSQL
6 years on Scala, Liftweb and Akka
3 years on Cassandra (since 0.6)
Slide 5
Slide 5 text
What am i going to talk about?
Common issues and general solutions
Databases
Combining Databases
Special cases
Slide 6
Slide 6 text
Common Issues and Solutions
Coders choose a DB they felt most comfortable with
Suggest and prove your suggestion
IT Staff is outdated
Update them, also prove your suggestion
Fear of Change/Never change a running System
Take extensive testing into account
Screwy code/code duplication
Takes more money, their fault. explain that!
Old Language/Project
Yeah, it’s gonna cost you.
Slide 7
Slide 7 text
Once picked, the most common mistakes
Only test a few corner cases
Worked with 2 concurrent users (developers)
Worked with 5 real concurrent users testing it
Worked with 10000 Rows
Use a *SINGLE* database to do all the work
Slide 8
Slide 8 text
Poor choices
Using Full-Text Index in your RDBMS
It’s 2014, we have SOLR and ElasticSearch
!
Using RDBMS as cache for other RDBMS-generated data
There is stuff like memcached and Redis
!
Using RDBMS for bulk inserts with client polling
Check your technology, or develop better one
Slide 9
Slide 9 text
Why SOLR or Elastic Search?
Boosting Queries
Geospatial Queries
Stemming Words
Language Support
Performance
Streaming Results (Elastic Search)
And many more
You’ve got Apache Lucene under the Hood!
Slide 10
Slide 10 text
Why Redis or Memcached?
Caching Data which is slow to query
Ease database query load
Data needed for limited time (TTL)
Database has big tomb-stoning/vacuuming penalty
No way to store it in the running application
Not designed for sharding
Need data easily accessible outside of your app
Slide 11
Slide 11 text
So what now?
SOLR/ElasticSearch for Searching
!
Redis/Memcached for Caching
!
Some database for persistency ;)
Slide 12
Slide 12 text
Which database?
Well we like Apache Cassandra
with Apache SOLR
and Apache Mahout
and Apache Pig
!
Especially Datastax Enterprise
Have you seen their startup program? AWESOME!
Slide 13
Slide 13 text
So what’s up with PostgreSQL
INET/CIDR DataType
Query for incl/excl. of INET in CIDR Range
Do a variety of stunts on them
Slide 14
Slide 14 text
PostgreSQL combined with Big Data
IP Based Analysis (netflow.io)
!
Cassandra stores each NetFlow and Counters
PostgreSQL contains the keys to query for an IP Range
!
Indexed Scan across a table containing ca. 2k Ranges
0.2ms
!
Querying Cassandra for current total of 127 IPs
3ms
Slide 15
Slide 15 text
Cassandra and SOLR (DSE)
Network Monitoring (netflow.io)
!
Cassandra stores the alerts generated by Snort
SOLR Indexes on those Events (severity, type, etc)
PostgreSQL knows the key for the IP Ranges
!
SOLR search for all Events regarding Hosts X, Y and Z
16ms
Slide 16
Slide 16 text
SOLR and other Databases
SOLR is great for searching, except IP Ranges ;)
!
ORMs usually have callbacks
onSave - write a new version of the data to SOLR
!
Query SOLR
Write to the Database
Slide 17
Slide 17 text
It’s not all rainbows
SOLR needs a Schema
(can be tricky the first few times)
!
Structuring your data right
IP ADDRESSES ARE NOT INTEGERS! IPv6 OMFG
!
If there is nothing out there?
Slide 18
Slide 18 text
So we had this client...
Heavy notification-system using RDBMS
Publisher-Follower-System
including a Web-Based Live-Video-Chat system
more than 4 million users
multiple publishers with 40k+ followers
200k inserts/sec at peak times
used MySQL, Memcache, PHP
Clients polling every 5 seconds for notifications
what do you do?
Slide 19
Slide 19 text
What was the setup?
He had TWO 48 core, 64GB MySQL
One read-only, one write-only (master)
both using well about 60-80% of all CPUs constantly
!
So many hard-disks died
So many hours went by for MySQL restores
So much data was being held just temporarily
Bulk deletion took out MySQL a couple of times
Slide 20
Slide 20 text
What we did
Reduced the load on the write-only host to
10-12%
!
Reduced the load on the read-only host to
20-30%
!
Introduced two major changes
SOLR for Full-Text Search (not the Live-Chat)
Designed/Wrote thruput.io (our own product)
Slide 21
Slide 21 text
Real-Time Messaging Platform
Browser, Mobile and Native Applications
Persistency
!
Demo time :)
What we did
Slide 22
Slide 22 text
Scales up to 250k concurrents on 4 CPUs/2GB
Browser (down to IE6) incl. JavaScript API (tiny)
Native API (OSX/IOS - drag and drop)
Scala API (JAR)
Delivery using curl/wget
Apple Push and Google Cloud Messaging builtin
Geo-targetted messaging
Uses Cassandra for persistency
Specifically
Slide 23
Slide 23 text
What we learned today
Diversify!
Use Applications for what they’re made for
Don’t be afraid of change
Evaluate existing setups constantly
Sometimes you have to write your own stuff. :)
Slide 24
Slide 24 text
Follow @wastedio on Twitter!
(and/or talk to me afterwards)
Are you wasting IO?
Slide 25
Slide 25 text
@fbettag on Twitter
!
code at http://github.com/fbettag
or http://github.com/wasted
!
Get home safely
Thanks for listening!