
Cassandra @ Teads

Read how we use Cassandra @ Teads: Architecture on AWS and physical nodes, tuning, issues, tools, our fork and more!

Teads is #1 in Video Ads. Read how Teads handles up to ~1 million requests/s with Apache Cassandra: how we tuned Cassandra servers and clients, what issues we faced during the last year, how we provision our clusters, which tools we use (Datadog for monitoring and alerting, Cassandra Reaper, Rundeck, Sumo Logic, cassandra_snapshotter), and why we need a fork.

Romain Hardouin

February 16, 2017

Transcript

  1. Cassandra @ Teads. Agenda: I. About Teads • II. Architecture • III. Provisioning • IV. Monitoring & alerting • V. Tuning • VI. Tools • VII. C'est la vie • VIII. A light fork
  2. Teads is the inventor of native video advertising with inRead, an award-winning format* (*IPA Media owner survey 2014, IAB-recognized format)
  3. Apache Cassandra version: custom C* 2.1.16 • C* 3.0 jvm.options • C* 2.2 logback • Backports • Patches
  4. Topology: 2 regions, EU & US (3rd region APAC coming soon) • 4 clusters, 7 DCs, 110 nodes (up to 150 with temporary DCs) • HP server blades: 1 cluster, 18 nodes
  5. AWS instance types: i2.2xlarge (8 vCPU, 61 GB, 2 x 800 GB attached SSD in RAID0) • c3.4xlarge (16 vCPU, 30 GB, 2 x 160 GB attached SSD in RAID0) • c4.4xlarge (16 vCPU, 30 GB, EBS 3.4 TB + 1 TB) • Workloads: tons of counters; Big Data, wide rows; many billions of keys, LCS with TTL
  6. More on EBS nodes: 20 x c4.4xlarge with GP2 SSD • 3.4 TB data volume, 10,000 IOPS ⇒ 16 KB • 1 TB commitlog volume, 3,000 IOPS ⇒ 16 KB • 25 tables: batch + real time • Temporary DC • Cheap storage, great for STCS • Snapshots (S3 backup) • No coupling between disks and CPU/RAM • High latency => high I/O wait • Throughput: 160 MB/s • Unsteady performance
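      A quick sanity check of those gp2 figures (derived from the numbers above, not stated on the slide): 10,000 IOPS x 16 KB per I/O = 160,000 KB/s ≈ 160 MB/s, which matches the 160 MB/s throughput ceiling mentioned for the data volume.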
  7. Hardware nodes: HP Apollo XL170r Gen9 • 12 CPU Xeon @ 2.60 GHz • 128 GB RAM • 3 x 1.5 TB high-end SSD in RAID0 • For Big Data, supersedes the EBS DC
  8. Instance type change: the counters workload moved from 20 x i2.2xlarge (DC X) to 20 x c3.4xlarge (DC Y), cheaper and with more CPUs, via a DC rebuild
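      A new DC is typically populated with nodetool rebuild once the keyspaces replicate to it; a minimal sketch, run on each node of the new DC (the DC name is illustrative):
        nodetool rebuild DC_X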
  9. Workload isolation, step 1: DC split • DC A: 20 x i2.2xlarge (Counters + Big Data) • Rebuild into DC B: 20 x c3.4xlarge (Counters) and DC C: 20 x c4.4xlarge (Big Data, EBS)
  10. Workload isolation, step 2: cluster split • The Big Data DC (20 x c4.4xlarge, EBS) moves to its own cluster • AWS Direct Connect
  11. Now: Capistrano → Chef • Custom cookbooks: C*, C* tools, C* Reaper, Datadog wrapper • Chef provisioning to spawn a cluster
  12. DataStax OpsCenter Free version (v5): ring view • More than monitoring • Lots of metrics, but still lacks some metrics • Dashboard creation: no templates • Agent is heavy • Free version limitations: data stored in the production cluster, Apache C* <= 2.1 only
  13. Datadog: all the metrics you want • Dashboard creation: templating, TimeBoard vs ScreenBoard • Graph creation: aggregation, trend, rate, anomaly detection • No turnkey dashboards yet (may change: TLP templates) • Additional fees if >350 metrics; we need to increase this limit for our use case
  14. Now we can easily: find outliers • Compare a node to the average • Compare two DCs • Explore a node's metrics • Create overview dashboards • Create advanced dashboards for troubleshooting
  15. Datadog's cassandra.yaml:
      - include:
          bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
          attribute:
            - Count
      - include:
          bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
          attribute:
            - Count
            - 99thPercentile
      - include:
          bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize
      - include:
          bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
          attribute:
            Value:
              alias: cassandra.ntr.MaxTasksQueued
  16. Datadog alerting: down node • Exceptions • Commitlog size • High latency • High GC • High I/O wait • High pendings • Many hints • Long thrift connections • Clock out of sync (don't miss this one) • Disk space (don't forget /) • …
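      As an illustration (not from the deck), a Datadog monitor query for the disk space alert could look like the following; the role tag is hypothetical:
        avg(last_5m):avg:system.disk.in_use{role:cassandra} by {host,device} > 0.85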
  17. jvm.options (backport from C* 3.0), GC logs enabled:
      -XX:MaxGCPauseMillis=200
      -XX:G1RSetUpdatingPauseTimePercent=5
      -XX:G1HeapRegionSize=32m
      -XX:G1HeapWastePercent=25
      -XX:InitiatingHeapOccupancyPercent=?
      -XX:ParallelGCThreads=#CPU
      -XX:ConcGCThreads=#CPU
      -XX:+ExplicitGCInvokesConcurrent
      -XX:+ParallelRefProcEnabled
      -XX:+UseCompressedOops
      -XX:HeapDumpPath=<dir with enough free space>
      -XX:ErrorFile=<custom dir>
      -Djava.io.tmpdir=<custom dir>
      -XX:-UseBiasedLocking
      -XX:+UseTLAB
      -XX:+ResizeTLAB
      -XX:+PerfDisableSharedMem
      -XX:+AlwaysPreTouch
      ...
  18. AWS nodes (heap: c3.4xlarge 15 GB, i2.2xlarge 24 GB):
      num_tokens: 256
      native_transport_max_threads: 256 or 128
      compaction_throughput_mb_per_sec: 64
      concurrent_compactors: 4 or 2
      concurrent_reads: 64
      concurrent_writes: 128 or 64
      concurrent_counter_writes: 128
      hinted_handoff_throttle_in_kb: 10240
      max_hints_delivery_threads: 6 or 4
      memtable_cleanup_threshold: 0.6, 0.5 or 0.4
      memtable_flush_writers: 4 or 2
      trickle_fsync: true
      trickle_fsync_interval_in_kb: 10240
      dynamic_snitch_badness_threshold: 2.0
      internode_compression: dc
  19. AWS nodes, EBS volume != disk (heap: c4.4xlarge 15 GB):
      compaction_throughput_mb_per_sec: 32
      concurrent_compactors: 4
      concurrent_reads: 32
      concurrent_writes: 64
      concurrent_counter_writes: 64
      trickle_fsync_interval_in_kb: 1024
  20. Hardware nodes (heap: 24 GB):
      num_tokens: 8  (more on this later)
      initial_token: ...
      native_transport_max_threads: 512
      compaction_throughput_mb_per_sec: 128
      concurrent_compactors: 4
      concurrent_reads: 64
      concurrent_writes: 128
      concurrent_counter_writes: 128
      hinted_handoff_throttle_in_kb: 10240
      max_hints_delivery_threads: 6
      memtable_cleanup_threshold: 0.6
      memtable_flush_writers: 8
      trickle_fsync: true
      trickle_fsync_interval_in_kb: 10240
  21. Hardware nodes, why 8 tokens? Better repair performance, important for Big Data • Evenly distributed tokens, stored in a Chef data bag • Watch out! Know the drawbacks
      ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
      {
        "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
        "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
        "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
      }
      https://github.com/rhardouin/cassandra-scripts
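      For context (a sketch, not shown in the deck): each generated entry maps to one node's cassandra.yaml, with initial_token taking the comma-separated list and num_tokens matching its length, e.g. for 192.168.1.1 above:
        num_tokens: 4
        initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901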
  22. Compression: small entries, lots of reads ⇒ compression = { 'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor' } + nodetool scrub (few GB)
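      Applied to a table, that setting looks roughly like this (keyspace and table names are placeholders; rewriting existing SSTables, e.g. with nodetool scrub as above, is what makes it apply to old data):
        ALTER TABLE my_keyspace.my_table
          WITH compression = {
            'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor',
            'chunk_length_kb': '4'
          };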
  23. DataStax driver policy: LatencyAwarePolicy → TokenAwarePolicy • LatencyAwarePolicy: hotspots due to premature node eviction, needs thorough tuning and a steady workload, so we dropped it • TokenAwarePolicy: shuffle replicas depending on CL
  24. For cross-region scheduled jobs: VPN between AWS regions • 20 executors with 6 GB RAM
      output.consistency.level = (LOCAL_)ONE
      output.concurrent.writes = 50
      connection.compression = LZ4
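      These look like Spark Cassandra Connector settings (the deck does not name the tool on this slide); passed on the command line they would read roughly as follows, with (LOCAL_)ONE shown as LOCAL_ONE and the jar name a placeholder:
        spark-submit \
          --conf spark.cassandra.output.consistency.level=LOCAL_ONE \
          --conf spark.cassandra.output.concurrent.writes=50 \
          --conf spark.cassandra.connection.compression=LZ4 \
          my-job.jar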
  25. Rundeck, "Job Scheduler & Runbook Automation": {parallel SSH + cron} on steroids • Security • History: who/what/when/why (we added a "comment" field) • Output is kept • Jobs: CQL migration, rolling restart, nodetool or JMX commands, backup and snapshot jobs
  26. Cassandra Reaper: scheduled range repair • Segments: up to 20,000 for TB tables • Hosted fork for C* 2.1 • We will probably switch to TLP's fork • We do not use incremental repairs (see the fix in C* 4.0)
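      Each Reaper segment is roughly equivalent to a subrange repair; done by hand it would look something like this (tokens and names are placeholders):
        nodetool repair -st <segment_start_token> -et <segment_end_token> my_keyspace my_table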
  27. cassandra_snapshotter: backup on S3 • Scheduled with Rundeck • We created and use a fork • Some PRs merged upstream • Restore PR still to be merged
  28. Logs management (Sumo Logic):
      "C* " and "out of sync"
      "C* " and "new session: will sync" | count
      ...
      Alerts on pattern:
      "C* " and "[ERROR]"
      "C* " and "[WARN]" and not ( … )
      ...
  29. OS reboot… seems harmless, right? • Cassandra service enabled • Want a clue? C* 2.0 + counters • Without any obvious reason • Upgrade to C* 2.1 was a relief
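      Our reading of this slide (not stated verbatim) is that Cassandra restarting automatically after the reboot was part of the problem; whether auto-start is enabled can be checked and turned off with standard service tooling, for example:
        sudo systemctl is-enabled cassandra && sudo systemctl disable cassandra   (systemd)
        sudo update-rc.d cassandra disable                                        (SysV, Debian/Ubuntu)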
  30. Upgrade 2.0 → 2.1: the LCS cluster suffered • High load • Pending compactions were growing • Switched to off-heap memtables: less GC => less load • Reduced client load • Better after the sstables upgrade (took days)
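      For reference, a sketch of the two knobs involved (the deck does not say which off-heap option was chosen; offheap_objects is one of the C* 2.1 values):
        # cassandra.yaml
        memtable_allocation_type: offheap_objects
        # then rewrite SSTables to the 2.1 format on each node
        nodetool upgradesstables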
  31. Upgrade 2.0 → 2.1: lots of NTR "All time blocked" • The NTR queue was undersized for our workload: 128 (hard-coded) • We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096 • The NTR pool needs to be sized accordingly
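      To our knowledge the property added by CASSANDRA-11363 is cassandra.max_queued_native_transport_requests; with the patch applied it can be set at startup, e.g. (4096 being the value from the slide):
        # cassandra-env.sh
        JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"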
  32. After replacing nodes: DELETE FROM system.peers WHERE peer = '<replaced node>'; • system.peers is used by the DataStax driver for auto discovery
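      Stale entries can be spotted beforehand with a query like this (a sketch; compare the result against nodetool status):
        SELECT peer, data_center, host_id FROM system.peers;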
  33. One SSD failed • CPUs suddenly became slow on one server • Smart Array battery • BIOS bug • Yup, not an SSD...
  34. Why a fork? 1. Need to add a patch ASAP: high blocked NTR, CASSANDRA-11363; requires deploying from source • 2. Why not backport interesting tickets? • 3. Why not add small features/fixes? e.g. expose task queue length via JMX, CASSANDRA-12758 • You betcha!
  35. A tiny fork: we keep it as small as possible, to fit our needs • It will get even smaller when we upgrade to C* 3.0, as the backports will become obsolete
  36. The most impressive result, for a set of tables: before: 23 days; with CASSANDRA-12580: 16 hours • Longest repair for a single table: 2.5 days • It was impossible to repair this table before the patch • It now fits within gc_grace_seconds
  37. It was a critical fix for us; it should have landed in 2.1.17 IMHO* • Repair is a mandatory operation in many use cases • Paulo already made the patch for 2.1 • C* 2.1 is widely used • [*] Full post: http://www.mail-archive.com/[email protected]/msg49344.html
  38. Because C* is critical for our business • We don't need fancy stuff (SASI, MV, UDF, ...) • We just want a rock-solid, scalable DB • C* 2.1 does the job for the time being • We plan to upgrade to C* 3.0 in 2017 • We will do thorough tests ;-)
  39. C* 2.2 has some nice improvements: bootstrapping with LCS sends the source sstable level [1] • Range movement causes CPU & performance impact [2] • Resumable bootstrap/rebuild streaming [3] • [1] CASSANDRA-7460 [2] CASSANDRA-9258 [3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810 • But the migration path 2.2 → 3.0 is risky: just my opinion, based on the users mailing list, and DSE never used 2.2