Slide 1

Slide 1 text

Franz Bettag Director of R&D and Network Operations at München, 2014 Through the Database-Jungle

Slide 2

Slide 2 text

Who is this guy? Franz Bettag 28 years old (soon 29) Nuremberg, Germany Scala, Liftweb and Netty Connoisseur FreeBSD, Redis, PostgreSQL and Cassandra aficionado Director of R&D and Network Operations at wasted.io Ltd.

Slide 3

Slide 3 text

Seriously, what does he do.. Lead development on projects for the company Client consulting in special cases BGP, OSPF, etc, short - the Routing guy Molesting Open Source projects

Slide 4

Slide 4 text

Not wasting IO anymore 17 years of coding (10 years commercial) 16 years on FreeBSD and NetBSD 8 years on PostgreSQL 6 years on Scala, Liftweb and Akka 3 years on Cassandra (since 0.6)

Slide 5

Slide 5 text

What am i going to talk about? Common issues and general solutions Databases Combining Databases Special cases

Slide 6

Slide 6 text

Common Issues and Solutions Coders choose a DB they felt most comfortable with Suggest and prove your suggestion IT Staff is outdated Update them, also prove your suggestion Fear of Change/Never change a running System Take extensive testing into account Screwy code/code duplication Takes more money, their fault. explain that! Old Language/Project Yeah, it’s gonna cost you.

Slide 7

Slide 7 text

Once picked, the most common mistakes Only test a few corner cases Worked with 2 concurrent users (developers) Worked with 5 real concurrent users testing it Worked with 10000 Rows Use a *SINGLE* database to do all the work

Slide 8

Slide 8 text

Poor choices Using Full-Text Index in your RDBMS It’s 2014, we have SOLR and ElasticSearch ! Using RDBMS as cache for other RDBMS-generated data There is stuff like memcached and Redis ! Using RDBMS for bulk inserts with client polling Check your technology, or develop better one

Slide 9

Slide 9 text

Why SOLR or Elastic Search? Boosting Queries Geospatial Queries Stemming Words Language Support Performance Streaming Results (Elastic Search) And many more You’ve got Apache Lucene under the Hood!

Slide 10

Slide 10 text

Why Redis or Memcached? Caching Data which is slow to query Ease database query load Data needed for limited time (TTL) Database has big tomb-stoning/vacuuming penalty No way to store it in the running application Not designed for sharding Need data easily accessible outside of your app

Slide 11

Slide 11 text

So what now? SOLR/ElasticSearch for Searching ! Redis/Memcached for Caching ! Some database for persistency ;)

Slide 12

Slide 12 text

Which database? Well we like Apache Cassandra with Apache SOLR and Apache Mahout and Apache Pig ! Especially Datastax Enterprise Have you seen their startup program? AWESOME!

Slide 13

Slide 13 text

So what’s up with PostgreSQL INET/CIDR DataType Query for incl/excl. of INET in CIDR Range Do a variety of stunts on them

Slide 14

Slide 14 text

PostgreSQL combined with Big Data IP Based Analysis (netflow.io) ! Cassandra stores each NetFlow and Counters PostgreSQL contains the keys to query for an IP Range ! Indexed Scan across a table containing ca. 2k Ranges 0.2ms ! Querying Cassandra for current total of 127 IPs 3ms

Slide 15

Slide 15 text

Cassandra and SOLR (DSE) Network Monitoring (netflow.io) ! Cassandra stores the alerts generated by Snort SOLR Indexes on those Events (severity, type, etc) PostgreSQL knows the key for the IP Ranges ! SOLR search for all Events regarding Hosts X, Y and Z 16ms

Slide 16

Slide 16 text

SOLR and other Databases SOLR is great for searching, except IP Ranges ;) ! ORMs usually have callbacks onSave - write a new version of the data to SOLR ! Query SOLR Write to the Database

Slide 17

Slide 17 text

It’s not all rainbows SOLR needs a Schema (can be tricky the first few times) ! Structuring your data right IP ADDRESSES ARE NOT INTEGERS! IPv6 OMFG ! If there is nothing out there?

Slide 18

Slide 18 text

So we had this client... Heavy notification-system using RDBMS Publisher-Follower-System including a Web-Based Live-Video-Chat system more than 4 million users multiple publishers with 40k+ followers 200k inserts/sec at peak times used MySQL, Memcache, PHP Clients polling every 5 seconds for notifications what do you do?

Slide 19

Slide 19 text

What was the setup? He had TWO 48 core, 64GB MySQL One read-only, one write-only (master) both using well about 60-80% of all CPUs constantly ! So many hard-disks died So many hours went by for MySQL restores So much data was being held just temporarily Bulk deletion took out MySQL a couple of times

Slide 20

Slide 20 text

What we did Reduced the load on the write-only host to 10-12% ! Reduced the load on the read-only host to 20-30% ! Introduced two major changes SOLR for Full-Text Search (not the Live-Chat) Designed/Wrote thruput.io (our own product)

Slide 21

Slide 21 text

Real-Time Messaging Platform Browser, Mobile and Native Applications Persistency ! Demo time :) What we did

Slide 22

Slide 22 text

Scales up to 250k concurrents on 4 CPUs/2GB Browser (down to IE6) incl. JavaScript API (tiny) Native API (OSX/IOS - drag and drop) Scala API (JAR) Delivery using curl/wget Apple Push and Google Cloud Messaging builtin Geo-targetted messaging Uses Cassandra for persistency Specifically

Slide 23

Slide 23 text

What we learned today Diversify! Use Applications for what they’re made for Don’t be afraid of change Evaluate existing setups constantly Sometimes you have to write your own stuff. :)

Slide 24

Slide 24 text

Follow @wastedio on Twitter! (and/or talk to me afterwards) Are you wasting IO?

Slide 25

Slide 25 text

@fbettag on Twitter ! code at http://github.com/fbettag or http://github.com/wasted ! Get home safely Thanks for listening!