Duy Nguyen - Scraping a Million Pokemon Battles: Distributed Systems By Example

I love Pokemon. However, I don't love how some players make the community less welcoming towards beginners by hiding their strategies. So I did what any defiant engineer would. I signed up for a free AWS account and began (responsibly) scraping millions of their unauthenticated Pokemon battles.

We'll journey together through this passion project of mine and draw on specific examples to better understand the trade-offs of working with distributed systems or microservice architectures in the cloud.

https://us.pycon.org/2019/schedule/presentation/172/

PyCon 2019

May 03, 2019

Transcript

  1. Duy Nguyen - Scraping a Million Pokemon Battles (bit.ly/pycon-poke): Introduction

    • Currently at Google in Core Systems
    • Particularly care about gender equality and mentorship in tech
    • Previously at Ellevation Education, then left to wander the world
  3. Objective and Scope

    • I love Pokemon...for the most part.
    • Distributed Systems “102”
      ◦ We’ll journey together through this passion project of mine and draw on specific examples to better understand how to get started with distributed systems or microservice architectures in the cloud.
    • Scalability and 3 “Pillars”
      ◦ Concurrency of Resources
      ◦ Asserting for Correctness
      ◦ Resilience against Failures

    lvh - Distributed Systems 101 - PyCon 2015
  4. Architecture Overview

    Pokemon Showdown
  5. Architecture Overview

    (1) Room List Watcher pushes new URLs onto the SQS queue
    (2) Download Bots pull new URLs from SQS and scrape the battle logs
    (3) Battle logs are stored in S3
    (4) Lambda Function watches the S3 Bucket for new objects
    (5) New objects are indexed in ElastiCache (Redis)
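Steps (4) and (5) on slide 5 could look roughly like the sketch below. It assumes the standard S3 event notification shape (a `Records` list whose entries carry `s3.bucket.name` and `s3.object.key`). The function names and the plain dict standing in for the Redis index are hypothetical, not taken from the talk.

```python
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def index_objects(event, index):
    """Record each new object in `index`.

    `index` is a plain dict here (an assumption for illustration); in the
    architecture from the slide it would be the Redis-backed index.
    """
    for bucket, key in extract_objects(event):
        index.setdefault(bucket, set()).add(key)
    return index
```

A Lambda handler would call `index_objects(event, index)` with the event it receives, so the indexing logic can be exercised locally with a hand-built event.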
  6. Concurrency

    “Our application cannot handle increases in traffic.”

        urls = room_list_watcher.scrape()

        for url in urls:
            download_bot.scrape(url=url)
  7. Concurrency - System Characteristics

        urls = room_list_watcher.scrape()

        for url in urls:
            download_bot.scrape(url=url)

    Annotations: Business Logic, Business Logic, State

  8. Concurrency - System Characteristics

    Same code and annotations as slide 7, plus:
    • t ~= 300 milliseconds

  9. Concurrency - System Characteristics

    Same code and annotations as slide 8, plus:
    • 2 seconds <= t <= 45 minutes

  10. Concurrency - Partitioning

    Same code and annotations as slide 9.
  11. Concurrency - Model of Computation (Python)

    “Unless you're implementing an operating system, use higher-level primitives such as atomic message queues.”
    Raymond Hettinger - Thinking about Concurrency - PyCon Russia 2016

        import queue  # named Queue in Python 2

        url_queue = queue.Queue()

        def produce():
            while True:
                urls = room_list_watcher.scrape()
                for url in urls:
                    url_queue.put(url)

        def consume():
            while True:
                url = url_queue.get()
                download_bot.scrape(url=url)
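A runnable version of the producer/consumer model on slide 11 might look like the sketch below. It is only an illustration: the scrapers are passed in as plain callables, and the sentinel-based shutdown, the `run` helper, and the consumer count are assumptions added so the example terminates (the slide's loops run forever).

```python
import queue
import threading

SENTINEL = object()  # hypothetical marker telling a consumer to stop


def produce(url_queue, batches, consumer_count):
    """Put every URL from each scraped batch onto the queue."""
    for urls in batches:
        for url in urls:
            url_queue.put(url)
    # One sentinel per consumer so that every consumer thread exits.
    for _ in range(consumer_count):
        url_queue.put(SENTINEL)


def consume(url_queue, scrape, results, lock):
    """Pull URLs until the sentinel arrives, scraping each one."""
    while True:
        url = url_queue.get()
        if url is SENTINEL:
            break
        log = scrape(url)
        with lock:
            results.append(log)


def run(batches, scrape, consumer_count=3):
    """Wire one producer to several consumers through a thread-safe queue."""
    url_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    consumers = [
        threading.Thread(target=consume, args=(url_queue, scrape, results, lock))
        for _ in range(consumer_count)
    ]
    for thread in consumers:
        thread.start()
    produce(url_queue, batches, consumer_count)
    for thread in consumers:
        thread.join()
    return results
```

Because the queue is the only shared channel for work, the number of consumers can change without touching the producer. Result order is not deterministic, which ties into the correctness concerns raised later in the deck.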
  12. Concurrency - Model of Computation (AWS)

        url_queue = SQSQueue(
            name=self._properties['queue']['name'])

        def produce():
            while True:
                urls = room_list_watcher.scrape()
                for url in urls:
                    url_queue.put(url)

        def consume():
            while True:
                url = url_queue.get()
                download_bot.scrape(url=url)

    github.com/boto/boto3
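The slides do not show how `SQSQueue` is implemented. One way such a wrapper could be built is sketched below, assuming a client object with the same `send_message`, `receive_message`, and `delete_message` methods as `boto3.client("sqs")`. The constructor takes a queue URL rather than a name, and every detail here is a hypothetical stand-in rather than the author's class.

```python
class SQSQueue:
    """Queue-like wrapper around an SQS client (hypothetical sketch).

    `client` is expected to behave like boto3.client("sqs"); it is injected
    so the class can be exercised without AWS credentials.
    """

    def __init__(self, client, queue_url):
        self._client = client
        self._queue_url = queue_url

    def put(self, url):
        """Send one URL as the message body."""
        self._client.send_message(QueueUrl=self._queue_url, MessageBody=url)

    def get(self, wait_seconds=20):
        """Return the next URL, or None if the long poll found nothing."""
        response = self._client.receive_message(
            QueueUrl=self._queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=wait_seconds,
        )
        messages = response.get("Messages", [])
        if not messages:
            return None
        message = messages[0]
        # Deleting immediately keeps the sketch short. A more careful consumer
        # would delete only after the scrape succeeds, so that a crash lets the
        # message become visible again.
        self._client.delete_message(
            QueueUrl=self._queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        return message["Body"]
```

Unlike `queue.Queue.get()`, this `get` can return `None`, so a consumer loop built on it needs to skip empty polls. Standard SQS queues also deliver at least once, so the download step should tolerate seeing the same URL twice.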
  13. Concurrency - Model of Computation (Go)

        urlQueue := make(chan string)

        // Producer: push every scraped URL onto the channel.
        go func() {
            for {
                urls := roomListWatcher.Scrape()
                for _, url := range urls {
                    urlQueue <- url
                }
            }
        }()

        // Consumer: pull URLs off the channel and scrape each battle.
        go func() {
            for {
                url := <-urlQueue
                downloadBot.Scrape(url)
            }
        }()

    “Concurrency is the composition of independently executing components.”
    Rob Pike - “Concurrency Is Not Parallelism” - Heroku
  14. Correctness

    “Our application is more difficult to reason about.”

    • New Problems
      ◦ Loss of determinism
      ◦ Long startup times
      ◦ Increased flakiness
    • Glossary

      Term     Definition
      Double   A generic term for any object that replaces or stands in for a production object during testing.
      Mock     Has expectations about the calls or interactions.
      Fake     Has a working implementation but takes some shortcut (e.g. InMemoryDatabase).

    Augie Fackler and Nathaniel Manista - Stop Mocking, Start Testing - PyCon 2012
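To make the glossary on slide 14 concrete, the sketch below tests the same small function once with a mock and once with a fake. `enqueue_all` and `InMemoryQueue` are hypothetical names invented for this example; only `unittest.mock` is a real library.

```python
from unittest import mock


class InMemoryQueue:
    """Fake: a working queue that takes a shortcut (a list, no network)."""

    def __init__(self):
        self._items = []

    def put(self, item):
        self._items.append(item)

    def get(self):
        return self._items.pop(0)


def enqueue_all(urls, url_queue):
    """Code under test: push each URL onto the queue."""
    for url in urls:
        url_queue.put(url)


def check_with_mock():
    """Mock: assert on the calls that were made, not on resulting state."""
    url_queue = mock.Mock()
    enqueue_all(["a", "b"], url_queue)
    url_queue.put.assert_has_calls([mock.call("a"), mock.call("b")])
    return url_queue.put.call_count


def check_with_fake():
    """Fake: run against a real (simplified) implementation and read it back."""
    url_queue = InMemoryQueue()
    enqueue_all(["a", "b"], url_queue)
    return [url_queue.get(), url_queue.get()]
```

A bare `mock.Mock()` accepts any method name, so the mock-based check would keep passing even if the real queue renamed `put`. That is the interface-drift risk the deck attributes to mocks; `mock.create_autospec` narrows it by copying the real interface.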
  15. Correctness - Testing External Dependencies

    Table columns: Implementation, Requirements, Trade-Offs
    Labels: Pokemon, Google, Community

    Google - Software Development Coding and Practices Guide - Mock Objects
    Mike Krieger - Scaling Instagram - Airbnb

  16. Correctness - Testing External Dependencies

    Same layout and references as slide 15, plus:

    Fake
      • Somewhat closer to production environment
      • Somewhat more engineering time for development and maintenance

  17. Correctness - Testing External Dependencies

    Same as slide 16, plus:

    Real
      • Closest to production environment
      • Least engineering time for development and maintenance
      • Longest test execution time
      • Most computationally expensive

  18. Correctness - Testing External Dependencies

    Same as slide 17, plus:

    Mock
      • The original Instagram engineering philosophy is “Do the simple thing first.”
      • Furthest from production environment
      • Most susceptible to false positives (interface drift)
  19. Resilience

    “Our application will inevitably fail.”

    Policy     Premise                                                               AKA
    Timeout    Beyond a certain wait, a success result is unlikely.                  "Don't wait forever"
    Retry      Many faults are transient and may self-correct after a short delay.   "Maybe it's just a blip"
    Fallback   Things will still fail - plan what you will do when that happens.     "Degrade gracefully"

    github.com/App-vNext/Polly
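The policies on slide 19 can be combined in a few lines of plain Python. The sketch below is an illustration only, not the retry library used in the talk: `call_with_policies` and its parameters are hypothetical, and the "timeout" here is a deadline on the whole retry loop rather than on a single call.

```python
import time


def call_with_policies(operation, retry_on, max_duration=5.0, wait_time=0.5,
                       fallback=None, clock=time.monotonic, sleep=time.sleep):
    """Retry `operation` on expected errors until a deadline, then fall back.

    Retry: exceptions matching `retry_on` are treated as transient.
    Timeout: give up once another wait would pass `max_duration` seconds.
    Fallback: return `fallback` instead of raising when time runs out.
    """
    deadline = clock() + max_duration
    while True:
        try:
            return operation()
        except retry_on:
            if clock() + wait_time > deadline:
                return fallback  # degrade gracefully
            sleep(wait_time)  # fixed wait between attempts
```

Exceptions outside `retry_on` propagate immediately, so only faults that are expected to be transient get retried.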
  20. Resilience - Timeout

        def find_button(self, locator):
            condition = expected_conditions.element_to_be_clickable(
                locator=locator)
            try:
                button = self._wait_context.until(condition)
            except selenium.common.exceptions.TimeoutException:
                button = None

            result = lookup.results.Find(value=button, zero_value=None)
            return result

    github.com/SeleniumHQ/selenium

  21. Resilience - Retry

        def build_policy(...):
            stop_strategy = retry.stop_strategies.AfterDuration(
                maximum_duration=self._properties['policy']['stop_strategy']['maximum_duration'])
            wait_strategy = retry.wait_strategies.Fixed(
                wait_time=self._properties['policy']['wait_strategy']['wait_time'])

            retry_policy = retry.PolicyBuilder() \
                .with_stop_strategy(stop_strategy) \
                .with_wait_strategy(wait_strategy) \
                .continue_on_exception(automation.exceptions.ConnectionLost) \
                .continue_on_exception(automation.exceptions.WebDriverError) \
                .continue_on_exception(exceptions.BattleNotCompleted) \
                .build()

    github.com/rholder/guava-retrying

  22. Resilience - Fallback

        def scrape(self, url):
            elements = None
            while elements is None:
                try:
                    elements = self._policy.execute(self._scraper.scrape, url=url)
                except retry.exceptions.MaximumRetry as e:
                    # The expected errors have persisted. Defer to the
                    # fallback.
                    elements = list()
                except selenium.common.exceptions.StaleElementReferenceException as e:
                    # An expected error has occurred that cannot be handled
                    # by alternative measures. Reload the existing scraper.
                    self._reload()

            return elements

    Amandine Lee - Passing Exceptions 101: Paradigms in Error Handling - PyCon 2017
  23. “We don’t ship code; we ship features.” “We don’t solve problems for computers; we solve problems for people.”

    Jack Diederich - “Stop Writing Classes” - PyCon 2012

  24. HAH, you think there’s still time for questions?

    bit.ly/pycon-poke