
AHOY: A History of YAGNI

KMKLabs
July 29, 2016


Transcript

  1. Mar 2015
     Problem: How to track "unique plays"
     - 100% duration watched for short videos <= 30s
     - At least 40% watched for other videos
     - No repeated watch within 30 minutes (the rules are restated in the sketch below)
     Solution: Integrate https://github.com/ankane/ahoy into vidio (#89484470)
     - Rails engine to track visit sessions and arbitrary events
     - Client -> Server -> DB
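     A minimal sketch of those counting rules, assuming the video duration, seconds watched, and the time of the last counted play are available; this is not the actual vidio/ahoy code:

         from datetime import datetime, timedelta

         REPEAT_WINDOW = timedelta(minutes=30)

         def is_unique_play(duration_s, watched_s, last_counted_at=None, now=None):
             """Return True if this watch counts as a new 'unique play'."""
             now = now or datetime.utcnow()
             # No repeated watch within 30 minutes
             if last_counted_at is not None and now - last_counted_at < REPEAT_WINDOW:
                 return False
             # 100% duration watched for short videos (<= 30s)
             if duration_s <= 30:
                 return watched_s >= duration_s
             # At least 40% watched for other videos
             return watched_s >= 0.4 * duration_s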
  2. Almost no code change, LOL

     --- /dev/null
     +++ /tmp/architecture.txt-00 2015-03-26
     @@ -0,0 +1,2 @@
     +- Server: unicorn processes shared with vidio
     +- DB: vidio_production on pg9.3
  3. Apr 2015
     Problem: Events table size growing fast
     - About 1~2 million events a day
     Solution: Partition events table by month (#91482942); see the DDL sketch below
     - Can drop old partitions easily when running out of space
     - Did that in Apr 2016
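     A hedged sketch of a monthly partition on pg9.3, where partitioning means table inheritance plus a CHECK constraint (declarative partitioning only arrived in pg10); the table and column names here are assumptions, not taken from the deck:

         import psycopg2

         CREATE_PARTITION = """
         CREATE TABLE IF NOT EXISTS ahoy_events_y2015m04 (
             CHECK (time >= DATE '2015-04-01' AND time < DATE '2015-05-01')
         ) INHERITS (ahoy_events);
         """

         # Dropping a whole month later is a cheap metadata-only operation:
         DROP_OLD_PARTITION = "DROP TABLE IF EXISTS ahoy_events_y2015m04;"

         with psycopg2.connect("dbname=vidio_production") as conn:
             with conn.cursor() as cur:
                 cur.execute(CREATE_PARTITION)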
  4. Are we Big Data yet?

     --- /tmp/architecture.txt-00 2015-03-26
     +++ /tmp/architecture.txt-01 2015-04-01
     @@ -1,2 +1,3 @@
      - Server: unicorn processes shared with vidio
      - DB: vidio_production on pg9.3
     +  + monthly partition for events
  5. Apr 2015
     Problem: Visits table size growing fast too
     - About 4 million rows, taking up 1.7 GB after less than 20 days
     Solution: Partition visits table by month too (#92318652)
  6. Are we Big Data yet?

     --- /tmp/architecture.txt-01 2015-04-01
     +++ /tmp/architecture.txt-02 2015-04-13
     @@ -1,2 +1,3 @@
      - Server: unicorn processes shared with vidio
      - DB: vidio_production on pg9.3
     +  + monthly partition for visits
  7. Jun 2015
     Unrelated Problem: All unicorn workers busy waiting for responses from Gravity
     - Many 503 responses in vidio
     Solution: Move from unicorn to puma (#87351830)
  8. With more threads come more time to wait

     --- /tmp/architecture.txt-02 2015-04-13
     +++ /tmp/architecture.txt-03 2015-06-08
     @@ -1,2 +1,2 @@
     -- Server: unicorn processes shared with vidio
     +- Server: puma processes shared with vidio
      - DB: vidio_production on pg9.3
  9. Jun 2015
     Problem: Autovacuum triggered to prevent transaction id wraparound
     - Massive spike in newrelic request queuing time
     Solution: Change the conservative default settings (#96423416)
     - autovacuum_freeze_max_age: 1000000000 (1B)
     - vacuum_freeze_min_age: 10000000 (10M)
     - vacuum_freeze_table_age: 800000000 (800M)
     (a sketch for monitoring transaction id age follows below)
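     Those values live in the RDS parameter group; a small sketch (the dbname is an assumption) for watching how close each database is to the forced anti-wraparound vacuum, which kicks in once age(datfrozenxid) reaches autovacuum_freeze_max_age:

         import psycopg2

         with psycopg2.connect("dbname=vidio_production") as conn:
             with conn.cursor() as cur:
                 cur.execute(
                     "SELECT datname, age(datfrozenxid) AS xid_age "
                     "FROM pg_database ORDER BY xid_age DESC;"
                 )
                 for datname, xid_age in cur.fetchall():
                     # Compare xid_age against autovacuum_freeze_max_age (1B above)
                     print(f"{datname}: oldest unfrozen xid is {xid_age} transactions old")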
  10. Who needs a DBA?

      --- /tmp/architecture.txt-03 2015-06-08
      +++ /tmp/architecture.txt-04 2015-06-09
      @@ -1,2 +1,3 @@
       - Server: puma processes shared with vidio
       - DB: vidio_production on pg9.3
      +  + custom vacuum freeze settings
  11. Jun 2015
      Problem: Need to reboot RDS to apply certain settings
      - Downtime for vidio
      Solution: Split the database (#97491224)
  12. It's not you, it's me

      --- /tmp/architecture.txt-04 2015-06-09
      +++ /tmp/architecture.txt-05 2015-06-22
      @@ -1,2 +1,2 @@
       - Server: puma processes shared with vidio
      -- DB: vidio_production on pg9.3
      +- DB: ahoy_production on pg9.4
  13. Jul 2015
      Problem: RDS failover on ahoy_production, puma threads hang
      - Downtime for vidio
      Solution: Move from puma to unicorn
      - Run ahoy rails engine as a standalone rails app, a.k.a. anahoy (#98169580)
  14. TIL only unicorn can survive RDS failover

      --- /tmp/architecture.txt-05 2015-06-22
      +++ /tmp/architecture.txt-06 2015-07-02
      @@ -1,2 +1,2 @@
      -- Server: puma processes shared with vidio
      +- Server: unicorn processes on same machine as vidio
       - DB: ahoy_production on pg9.4
  15. BFF!

      --- /tmp/architecture.txt-06 2015-07-02
      +++ /tmp/architecture.txt-07 2015-08-11
      @@ -1,2 +1,2 @@
       - Server: unicorn processes on same machine as vidio
      -- DB: ahoy_production on pg9.4
      +- DB: ahoy_production on pg9.4 shared with analisis
  16. Oct 2015
      Problem: High I/O load on database from ETL to populate analisis tables
      - Requests taking more than 30 seconds were dropped
      Solution: Move the standalone rails app from anahoy to anahoy/ahoy (#105540400)
      - The former uses analisis_production, the latter stays on ahoy_production
  17. We can still be friends

      --- /tmp/architecture.txt-07 2015-08-11
      +++ /tmp/architecture.txt-08 2015-10-13
      @@ -1,2 +1,2 @@
       - Server: unicorn processes on same machine as vidio
      -- DB: ahoy_production on pg9.4 shared with analisis
      +- DB: ahoy_production on pg9.4
  18. Oct 2015
      Problem: High I/O load on database from regular events/visits inserts
      - Must be web scale
      Solution: Turn off synchronous_commit (#105732184)
      - Write events/visits to rabbitmq first (#106183286)
      - Separate consumer processes get messages from rabbitmq and then do batch inserts (see the consumer sketch below)
      - Client -> Server -> MQ -> DB
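      A minimal consumer sketch, assuming a queue named "ahoy_events" and an events(visit_id, name, properties, time) table shape; this is not the actual anahoy consumer. Messages are drained from rabbitmq and written in one batch insert instead of one INSERT per request:

          import json
          import pika
          import psycopg2
          from psycopg2.extras import execute_values

          pg = psycopg2.connect("dbname=ahoy_production")
          channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

          def drain_once(batch_size=500):
              rows, last_tag = [], None
              for _ in range(batch_size):
                  method, _props, body = channel.basic_get("ahoy_events")
                  if method is None:          # queue is empty
                      break
                  e = json.loads(body)
                  rows.append((e["visit_id"], e["name"], json.dumps(e["properties"]), e["time"]))
                  last_tag = method.delivery_tag
              if rows:
                  with pg, pg.cursor() as cur:
                      execute_values(
                          cur,
                          "INSERT INTO events (visit_id, name, properties, time) VALUES %s",
                          rows,
                      )
                  # Ack everything up to the last drained message only after the batch commits
                  channel.basic_ack(delivery_tag=last_tag, multiple=True)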
  19. No mongodb

      --- /tmp/architecture.txt-08 2015-10-13
      +++ /tmp/architecture.txt-09 2015-10-16
      @@ -1,2 +1,4 @@
       - Server: unicorn processes on same machine as vidio
      +- MQ: rabbitmq-3.5.6 (not enabled yet)
       - DB: ahoy_production on pg9.4
      +  + No synchronous_commit
  20. Oct 2015
      Problem: Sharing elb and nginx with vidio makes things complicated
      - HTTPCode_ELB_5XX alert: who to blame?
      Solution: Run ahoy as a new microservice, a.k.a. plenty (#106816908)
  21. Microservice FTW!

      --- /tmp/architecture.txt-09 2015-10-16
      +++ /tmp/architecture.txt-10 2015-10-28
      @@ -1,3 +1,3 @@ Oct 28, 2015
      -- Server: unicorn processes on same machine as vidio
      +- Server: unicorn processes
       - MQ: rabbitmq-3.5.6 (not enabled yet)
       - DB: ahoy_production on pg9.4
  22. Nov 2015
      Problem: Request time increased from 16ms to 32ms after enabling rabbitmq
      Solution: Benchmark different amqp client libraries (#107268400)
      - sinatra + bunny + puma/unicorn
      - pyramid + librabbitmq/pyamqp + waitress/uwsgi/unicorn
      - nginx + lua-resty-rabbitmqstomp
      (a rough timing sketch follows below)
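      A rough timing sketch in the spirit of that comparison, here with pika as the client; the queue name and payload are placeholders and the numbers depend entirely on the broker and machine:

          import time
          import pika

          N = 10_000
          channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
          channel.queue_declare(queue="bench")

          start = time.perf_counter()
          for _ in range(N):
              channel.basic_publish(exchange="", routing_key="bench", body=b'{"name": "play"}')
          elapsed = time.perf_counter() - start
          print(f"{N} publishes in {elapsed:.2f}s ({N / elapsed:.0f} msg/s)")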
  23. TL;DR Selling snake oil

      --- /tmp/architecture.txt-10 2015-10-28
      +++ /tmp/architecture.txt-11 2015-11-16
      @@ -1,3 +1,3 @@
      -- Server: unicorn processes
      +- Server: waitress processes
       - MQ: rabbitmq-3.5.6
       - DB: ahoy_production on pg9.4
  24. Feb 2016
      Problem: RDS can't keep up when traffic goes above 40k rpm
      - Too many retries during batch insert due to duplicate events
      - pg9.4, no ON CONFLICT DO NOTHING
      Solution: Put events into different queues based on visit_id[0] (see the sketch below)
      - One consumer process for each queue, no contention
      - RDS write can now keep up with traffic spikes until ~100k rpm :)
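      A sketch of that sharding idea, assuming visit_id is a UUID string so its first hex character (0-f) picks one of 16 queues and each consumer owns exactly one queue:

          def queue_for(visit_id: str) -> str:
              """Route an event to one of 16 queues using the first hex char of visit_id."""
              return f"ahoy_events_{visit_id[0].lower()}"

          # e.g. queue_for("7f9c2ba4-6f1e-4c3a-9d2b-0a1b2c3d4e5f") == "ahoy_events_7"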
  25. Micromanage! (like a boss)

      --- /tmp/architecture.txt-11 2015-11-16
      +++ /tmp/architecture.txt-12 2016-02-17
      @@ -1,3 +1,4 @@
       - Server: waitress processes
       - MQ: rabbitmq-3.5.6
      +  + 16 events queues
       - DB: ahoy_production on pg9.4
  26. Mar 2016
      Unknown Problem: rabbitmq-3.5.6 is stable (has been running 100+ days)
      - We redeployed the whole stack on Mar 7th, fairly routine...
  27. TIFU by making... no change

      --- /tmp/architecture.txt-12 2016-02-17
      +++ /tmp/architecture.txt-13 2016-03-07
      @@ -1,3 +1,3 @@
       - Server: waitress processes
      -- MQ: rabbitmq-3.5.6
      +- MQ: rabbitmq-3.6.1
       - DB: ahoy_production on pg9.4
  28. Mar 2016
      Problem: rabbitmq-3.6.1 is not stable (ever increasing memory usage)
      - Hung during the gerhana (eclipse) traffic spike
      Solution: Pin rabbitmq to 3.5.x (#115400419)
  29. Back from retirement

      --- /tmp/architecture.txt-13 2016-03-07
      +++ /tmp/architecture.txt-14 2016-03-09
      @@ -1,3 +1,3 @@
       - Server: waitress processes
      -- MQ: rabbitmq-3.6.1
      +- MQ: rabbitmq-3.5.7
       - DB: ahoy_production on pg9.4
  30. May 2016
      Problem: Reading from events.properties (json) is slow
      - Patience needed to analyze events data
      Solution: Upgrade to pg9.5 and change the column type to jsonb (#118078361)
      - Also add ON CONFLICT DO NOTHING to the batch insert (#119267289); see the sketch below
      - Didn't help that much
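      A hedged sketch of the two pg9.5-era changes: the one-off column type migration and a batch insert that skips duplicates with ON CONFLICT DO NOTHING; table, column, and conflict-target names are assumptions:

          import psycopg2
          from psycopg2.extras import execute_values, Json

          ALTER = "ALTER TABLE events ALTER COLUMN properties TYPE jsonb USING properties::jsonb;"
          INSERT = ("INSERT INTO events (id, visit_id, name, properties, time) VALUES %s "
                    "ON CONFLICT (id) DO NOTHING")

          with psycopg2.connect("dbname=ahoy_production") as conn:
              with conn.cursor() as cur:
                  cur.execute(ALTER)   # json -> jsonb, run once
                  execute_values(cur, INSERT, [
                      ("evt-1", "7f9c2ba4", "play", Json({"video_id": 123}), "2016-05-11 00:00:00"),
                  ])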
  31. Still no mongodb

      --- /tmp/architecture.txt-14 2016-03-09
      +++ /tmp/architecture.txt-15 2016-05-11
      @@ -1,3 +1,4 @@
       - Server: waitress processes
       - MQ: rabbitmq-3.5.7
      -- DB: ahoy_production on pg9.4
      +- DB: ahoy_production on pg9.5
      +  + events.properties in jsonb
  32. Unresolved Problem: Huge traffic spike from push notification events
      - More than 10x traffic within a few seconds, lasting only ~30 seconds
      - No chance for autoscaling (needs at least 2~3 minutes)
      Workaround:
      - Add random sleep time in the client (see the sketch below)
      - Less strict health check to allow more time to absorb the spike (#120924235)
      - More aggressive autoscaling to have spare capacity
      Solution: Go?
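      The client-side workaround restated as a tiny sketch; the 30-second window is an assumption matching the observed spike length, and the real clients are mobile/web rather than Python:

          import random
          import time

          def send_with_jitter(send, event, max_delay_s=30):
              """Spread a burst of push-notification events over a short random window."""
              time.sleep(random.uniform(0, max_delay_s))
              send(event)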
  33. Unresolved Problem: Less trust in rabbitmq after the memory usage fiasco with 3.6.x
      - Can't scale indefinitely, although maybe enough for our needs
      Workaround:
      - Monitor mailing list for progress in 3.6.x
      - Upgrade to larger instance type
      Solution: Kafka?
  34. Unresolved Problem: Base traffic is at ~100k rpm after adding events/visits from L6
      - RDS write can only keep up with traffic spikes until ~100k rpm :(
      - RDS read was already bad before this, now worse
      - 2TB volume is barely enough to hold 2 months of data
      Potential Workaround:
      - Upgrade to provisioned IOPS (up to 30000 at $6600/month)
      - Upgrade to larger volume (up to 6TB)
      Solution: Write to s3, process and read with prestodb (see the sketch below)
      - Client -> Server -> MQ -> S3 -> DB
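      A minimal sketch of the proposed S3 leg, assuming consumers flush newline-delimited JSON batches that prestodb can later query; the bucket name and key layout are placeholders:

          import json
          import boto3

          s3 = boto3.client("s3")

          def flush_to_s3(events, batch_id, bucket="ahoy-events", prefix="events/dt=2016-07-01"):
              """Write one consumer batch as newline-delimited JSON under a date-partitioned prefix."""
              body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
              s3.put_object(Bucket=bucket, Key=f"{prefix}/{batch_id}.json", Body=body)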
  35. Hello Big Data

      --- /tmp/architecture.txt-15 2016-05-11
      +++ /tmp/architecture.txt-16 2016-07-xx
      @@ -1,3 +1,3 @@
       - Server: waitress processes
       - MQ: rabbitmq-3.5.7
      -- DB: ahoy_production on pg9.5
      +- DB: prestodb 0.147