
AHOY: A History of YAGNI

KMKLabs
July 29, 2016


Transcript

  1. Mar 2015
     Problem: How to track "unique plays"
     - 100% duration watched for short videos <= 30s
     - At least 40% watched for other videos
     - No repeated watch within 30 minutes (the rules are restated in the sketch below)
     Solution: Integrate https://github.com/ankane/ahoy into vidio (#89484470)
     - Rails engine to track visit sessions and arbitrary events
     - Client -> Server -> DB
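     A minimal sketch of those counting rules, assuming the video duration, seconds watched, and the time of the last counted play are available; this is not the actual vidio/ahoy code:

         from datetime import datetime, timedelta

         REPEAT_WINDOW = timedelta(minutes=30)

         def is_unique_play(duration_s, watched_s, last_counted_at=None, now=None):
             """Return True if this watch counts as a new 'unique play'."""
             now = now or datetime.utcnow()
             # No repeated watch within 30 minutes
             if last_counted_at is not None and now - last_counted_at < REPEAT_WINDOW:
                 return False
             # 100% duration watched for short videos (<= 30s)
             if duration_s <= 30:
                 return watched_s >= duration_s
             # At least 40% watched for other videos
             return watched_s >= 0.4 * duration_s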
  2. Almost no code change, LOL

     --- /dev/null
     +++ /tmp/architecture.txt-00 2015-03-26
     @@ -0,0 +1,2 @@
     +- Server: unicorn processes shared with vidio
     +- DB: vidio_production on pg9.3
  3. Apr 2015
     Problem: Events table size growing fast
     - About 1~2 million events a day
     Solution: Partition events table by month (#91482942); see the DDL sketch below
     - Can drop old partitions easily when running out of space
     - Did that in Apr 2016
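     A hedged sketch of a monthly partition on pg9.3, where partitioning means table inheritance plus a CHECK constraint (declarative partitioning only arrived in pg10); the table and column names here are assumptions, not taken from the deck:

         import psycopg2

         CREATE_PARTITION = """
         CREATE TABLE IF NOT EXISTS ahoy_events_y2015m04 (
             CHECK (time >= DATE '2015-04-01' AND time < DATE '2015-05-01')
         ) INHERITS (ahoy_events);
         """

         # Dropping a whole month later is a cheap metadata-only operation:
         DROP_OLD_PARTITION = "DROP TABLE IF EXISTS ahoy_events_y2015m04;"

         with psycopg2.connect("dbname=vidio_production") as conn:
             with conn.cursor() as cur:
                 cur.execute(CREATE_PARTITION)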
  4. Are we Big Data yet?

     --- /tmp/architecture.txt-00 2015-03-26
     +++ /tmp/architecture.txt-01 2015-04-01
     @@ -1,2 +1,3 @@
      - Server: unicorn processes shared with vidio
      - DB: vidio_production on pg9.3
     +  + monthly partition for events
  5. Apr 2015
     Problem: Visits table size growing fast too
     - About 4 million rows, taking up 1.7 GB after less than 20 days
     Solution: Partition visits table by month too (#92318652)
  6. Are we Big Data yet?

     --- /tmp/architecture.txt-01 2015-04-01
     +++ /tmp/architecture.txt-02 2015-04-13
     @@ -1,2 +1,3 @@
      - Server: unicorn processes shared with vidio
      - DB: vidio_production on pg9.3
     +  + monthly partition for visits
  7. Jun 2015
     Unrelated Problem: All unicorn workers busy waiting for responses from Gravity
     - Many 503 responses in vidio
     Solution: Move from unicorn to puma (#87351830)
  8. With more threads come more time to wait

     --- /tmp/architecture.txt-02 2015-04-13
     +++ /tmp/architecture.txt-03 2015-06-08
     @@ -1,2 +1,2 @@
     -- Server: unicorn processes shared with vidio
     +- Server: puma processes shared with vidio
      - DB: vidio_production on pg9.3
  9. Jun 2015
     Problem: Autovacuum triggered to prevent transaction id wraparound
     - Massive spike in newrelic request queuing time
     Solution: Change the conservative default settings (#96423416)
     - autovacuum_freeze_max_age: 1000000000 (1B)
     - vacuum_freeze_min_age: 10000000 (10M)
     - vacuum_freeze_table_age: 800000000 (800M)
     (a sketch for monitoring transaction id age follows below)
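     Those values live in the RDS parameter group; a small sketch (the dbname is an assumption) for watching how close each database is to the forced anti-wraparound vacuum, which kicks in once age(datfrozenxid) reaches autovacuum_freeze_max_age:

         import psycopg2

         with psycopg2.connect("dbname=vidio_production") as conn:
             with conn.cursor() as cur:
                 cur.execute(
                     "SELECT datname, age(datfrozenxid) AS xid_age "
                     "FROM pg_database ORDER BY xid_age DESC;"
                 )
                 for datname, xid_age in cur.fetchall():
                     # Compare xid_age against autovacuum_freeze_max_age (1B above)
                     print(f"{datname}: oldest unfrozen xid is {xid_age} transactions old")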
  10. Who needs a DBA?

      --- /tmp/architecture.txt-03 2015-06-08
      +++ /tmp/architecture.txt-04 2015-06-09
      @@ -1,2 +1,3 @@
       - Server: puma processes shared with vidio
       - DB: vidio_production on pg9.3
      +  + custom vacuum freeze settings
  11. Jun 2015
      Problem: Need to reboot RDS to apply certain settings
      - Downtime for vidio
      Solution: Split the database (#97491224)
  12. It's not you, it's me

      --- /tmp/architecture.txt-04 2015-06-09
      +++ /tmp/architecture.txt-05 2015-06-22
      @@ -1,2 +1,2 @@
       - Server: puma processes shared with vidio
      -- DB: vidio_production on pg9.3
      +- DB: ahoy_production on pg9.4
  13. Jul 2015
      Problem: RDS failover on ahoy_production, puma threads hang
      - Downtime for vidio
      Solution: Move from puma to unicorn
      - Run ahoy rails engine as a standalone rails app, a.k.a. anahoy (#98169580)
  14. TIL only unicorn can survive RDS failover

      --- /tmp/architecture.txt-05 2015-06-22
      +++ /tmp/architecture.txt-06 2015-07-02
      @@ -1,2 +1,2 @@
      -- Server: puma processes shared with vidio
      +- Server: unicorn processes on same machine as vidio
       - DB: ahoy_production on pg9.4
  15. BFF!

      --- /tmp/architecture.txt-06 2015-07-02
      +++ /tmp/architecture.txt-07 2015-08-11
      @@ -1,2 +1,2 @@
       - Server: unicorn processes on same machine as vidio
      -- DB: ahoy_production on pg9.4
      +- DB: ahoy_production on pg9.4 shared with analisis
  16. Oct 2015
      Problem: High I/O load on database from ETL to populate analisis tables
      - Requests taking more than 30 seconds were dropped
      Solution: Move the standalone rails app from anahoy to anahoy/ahoy (#105540400)
      - The former uses analisis_production, the latter stays on ahoy_production
  17. We can still be friends

      --- /tmp/architecture.txt-07 2015-08-11
      +++ /tmp/architecture.txt-08 2015-10-13
      @@ -1,2 +1,2 @@
       - Server: unicorn processes on same machine as vidio
      -- DB: ahoy_production on pg9.4 shared with analisis
      +- DB: ahoy_production on pg9.4
  18. Oct 2015
      Problem: High I/O load on database from regular events/visits inserts
      - Must be web scale
      Solution: Turn off synchronous_commit (#105732184)
      - Write events/visits to rabbitmq first (#106183286)
      - Separate consumer processes get messages from rabbitmq and then do batch inserts (see the consumer sketch below)
      - Client -> Server -> MQ -> DB
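      A minimal consumer sketch, assuming a queue named "ahoy_events" and an events(visit_id, name, properties, time) table shape; this is not the actual anahoy consumer. Messages are drained from rabbitmq and written in one batch insert instead of one INSERT per request:

          import json
          import pika
          import psycopg2
          from psycopg2.extras import execute_values

          pg = psycopg2.connect("dbname=ahoy_production")
          channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

          def drain_once(batch_size=500):
              rows, last_tag = [], None
              for _ in range(batch_size):
                  method, _props, body = channel.basic_get("ahoy_events")
                  if method is None:          # queue is empty
                      break
                  e = json.loads(body)
                  rows.append((e["visit_id"], e["name"], json.dumps(e["properties"]), e["time"]))
                  last_tag = method.delivery_tag
              if rows:
                  with pg, pg.cursor() as cur:
                      execute_values(
                          cur,
                          "INSERT INTO events (visit_id, name, properties, time) VALUES %s",
                          rows,
                      )
                  # Ack everything up to the last drained message only after the batch commits
                  channel.basic_ack(delivery_tag=last_tag, multiple=True)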
  19. No mongodb

      --- /tmp/architecture.txt-08 2015-10-13
      +++ /tmp/architecture.txt-09 2015-10-16
      @@ -1,2 +1,4 @@
       - Server: unicorn processes on same machine as vidio
      +- MQ: rabbitmq-3.5.6 (not enabled yet)
       - DB: ahoy_production on pg9.4
      +  + No synchronous_commit
  20. Oct 2015
      Problem: Sharing elb and nginx with vidio makes things complicated
      - HTTPCode_ELB_5XX alert: who to blame?
      Solution: Run ahoy as a new microservice, a.k.a. plenty (#106816908)
  21. Microservice FTW!

      --- /tmp/architecture.txt-09 2015-10-16
      +++ /tmp/architecture.txt-10 2015-10-28
      @@ -1,3 +1,3 @@ Oct 28, 2015
      -- Server: unicorn processes on same machine as vidio
      +- Server: unicorn processes
       - MQ: rabbitmq-3.5.6 (not enabled yet)
       - DB: ahoy_production on pg9.4
  22. Nov 2015
      Problem: Request time increased from 16ms to 32ms after enabling rabbitmq
      Solution: Benchmark different amqp client libraries (#107268400)
      - sinatra + bunny + puma/unicorn
      - pyramid + librabbitmq/pyamqp + waitress/uwsgi/unicorn
      - nginx + lua-resty-rabbitmqstomp
      (a rough timing sketch follows below)
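      A rough timing sketch in the spirit of that comparison, here with pika as the client; the queue name and payload are placeholders and the numbers depend entirely on the broker and machine:

          import time
          import pika

          N = 10_000
          channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
          channel.queue_declare(queue="bench")

          start = time.perf_counter()
          for _ in range(N):
              channel.basic_publish(exchange="", routing_key="bench", body=b'{"name": "play"}')
          elapsed = time.perf_counter() - start
          print(f"{N} publishes in {elapsed:.2f}s ({N / elapsed:.0f} msg/s)")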
  23. TL;DR Selling snake oil

      --- /tmp/architecture.txt-10 2015-10-28
      +++ /tmp/architecture.txt-11 2015-11-16
      @@ -1,3 +1,3 @@
      -- Server: unicorn processes
      +- Server: waitress processes
       - MQ: rabbitmq-3.5.6
       - DB: ahoy_production on pg9.4
  24. Feb 2016
      Problem: RDS can't keep up when traffic goes above 40k rpm
      - Too many retries during batch insert due to duplicate events
      - pg9.4, no ON CONFLICT DO NOTHING
      Solution: Put events into different queues based on visit_id[0] (see the sketch below)
      - One consumer process for each queue, no contention
      - RDS write can now keep up with traffic spikes until ~100k rpm :)
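      A sketch of that sharding idea, assuming visit_id is a UUID string so its first hex character (0-f) picks one of 16 queues and each consumer owns exactly one queue:

          def queue_for(visit_id: str) -> str:
              """Route an event to one of 16 queues using the first hex char of visit_id."""
              return f"ahoy_events_{visit_id[0].lower()}"

          # e.g. queue_for("7f9c2ba4-6f1e-4c3a-9d2b-0a1b2c3d4e5f") == "ahoy_events_7"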
  25. Micromanage! (like a boss)

      --- /tmp/architecture.txt-11 2015-11-16
      +++ /tmp/architecture.txt-12 2016-02-17
      @@ -1,3 +1,4 @@
       - Server: waitress processes
       - MQ: rabbitmq-3.5.6
      +  + 16 events queues
       - DB: ahoy_production on pg9.4
  26. Mar 2016
      Unknown Problem: rabbitmq-3.5.6 is stable (has been running 100+ days)
      - We redeployed the whole stack on Mar 7th, fairly routine...
  27. TIFU by making... no change

      --- /tmp/architecture.txt-12 2016-02-17
      +++ /tmp/architecture.txt-13 2016-03-07
      @@ -1,3 +1,3 @@
       - Server: waitress processes
      -- MQ: rabbitmq-3.5.6
      +- MQ: rabbitmq-3.6.1
       - DB: ahoy_production on pg9.4
  28. Mar 2016
      Problem: rabbitmq-3.6.1 is not stable (ever increasing memory usage)
      - Hung during the gerhana (eclipse) traffic spike
      Solution: Pin rabbitmq to 3.5.x (#115400419)
  29. Back from retirement

      --- /tmp/architecture.txt-13 2016-03-07
      +++ /tmp/architecture.txt-14 2016-03-09
      @@ -1,3 +1,3 @@
       - Server: waitress processes
      -- MQ: rabbitmq-3.6.1
      +- MQ: rabbitmq-3.5.7
       - DB: ahoy_production on pg9.4
  30. May 2016
      Problem: Reading from events.properties (json) is slow
      - Patience needed to analyze events data
      Solution: Upgrade to pg9.5 and change the column type to jsonb (#118078361)
      - Also add ON CONFLICT DO NOTHING to the batch insert (#119267289); see the sketch below
      - Didn't help that much
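      A hedged sketch of the two pg9.5-era changes: the one-off column type migration and a batch insert that skips duplicates with ON CONFLICT DO NOTHING; table, column, and conflict-target names are assumptions:

          import psycopg2
          from psycopg2.extras import execute_values, Json

          ALTER = "ALTER TABLE events ALTER COLUMN properties TYPE jsonb USING properties::jsonb;"
          INSERT = ("INSERT INTO events (id, visit_id, name, properties, time) VALUES %s "
                    "ON CONFLICT (id) DO NOTHING")

          with psycopg2.connect("dbname=ahoy_production") as conn:
              with conn.cursor() as cur:
                  cur.execute(ALTER)   # json -> jsonb, run once
                  execute_values(cur, INSERT, [
                      ("evt-1", "7f9c2ba4", "play", Json({"video_id": 123}), "2016-05-11 00:00:00"),
                  ])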
  31. Still no mongodb

      --- /tmp/architecture.txt-14 2016-03-09
      +++ /tmp/architecture.txt-15 2016-05-11
      @@ -1,3 +1,4 @@
       - Server: waitress processes
       - MQ: rabbitmq-3.5.7
      -- DB: ahoy_production on pg9.4
      +- DB: ahoy_production on pg9.5
      +  + events.properties in jsonb
  32. Unresolved Problem: Huge traffic spike from push notification events
      - More than 10x traffic within a few seconds, lasting only ~30 seconds
      - No chance for autoscaling (needs at least 2~3 minutes)
      Workaround:
      - Add random sleep time in the client (see the sketch below)
      - Less strict health check to allow more time to absorb the spike (#120924235)
      - More aggressive autoscaling to have spare capacity
      Solution: Go?
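      The client-side workaround restated as a tiny sketch; the 30-second window is an assumption matching the observed spike length, and the real clients are mobile/web rather than Python:

          import random
          import time

          def send_with_jitter(send, event, max_delay_s=30):
              """Spread a burst of push-notification events over a short random window."""
              time.sleep(random.uniform(0, max_delay_s))
              send(event)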
  33. Unresolved Problem: Less trust in rabbitmq after the memory usage fiasco with 3.6.x
      - Can't scale indefinitely, although maybe enough for our needs
      Workaround:
      - Monitor mailing list for progress in 3.6.x
      - Upgrade to larger instance type
      Solution: Kafka?
  34. Unresolved Problem: Base traffic is at ~100k rpm after adding events/visits from L6
      - RDS write can only keep up with traffic spikes until ~100k rpm :(
      - RDS read was already bad before this, now worse
      - 2TB volume is barely enough to hold 2 months of data
      Potential Workaround:
      - Upgrade to provisioned IOPS (up to 30000 at $6600/month)
      - Upgrade to larger volume (up to 6TB)
      Solution: Write to s3, process and read with prestodb (see the sketch below)
      - Client -> Server -> MQ -> S3 -> DB
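      A minimal sketch of the proposed S3 leg, assuming consumers flush newline-delimited JSON batches that prestodb can later query; the bucket name and key layout are placeholders:

          import json
          import boto3

          s3 = boto3.client("s3")

          def flush_to_s3(events, batch_id, bucket="ahoy-events", prefix="events/dt=2016-07-01"):
              """Write one consumer batch as newline-delimited JSON under a date-partitioned prefix."""
              body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
              s3.put_object(Bucket=bucket, Key=f"{prefix}/{batch_id}.json", Body=body)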
  35. Hello Big Data

      --- /tmp/architecture.txt-15 2016-05-11
      +++ /tmp/architecture.txt-16 2016-07-xx
      @@ -1,3 +1,3 @@
       - Server: waitress processes
       - MQ: rabbitmq-3.5.7
      -- DB: ahoy_production on pg9.5
      +- DB: prestodb 0.147