Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Debug Like a Scientist! Copenhagen Developers F...

mrzasa
August 28, 2024

Debug Like a Scientist! Copenhagen Developers Festival 2024

Debugging: being wrong, chasing your own tail, venting off. You know it, right?

Don't despair. We can leverage centuries-old practices used by scientists. Observation, hypothesis, experiment - rinse and repeat. This method worked for me in real cases. I'll tell about a non-deterministic test, a nasty production issue, and malfunctioning CI infrastructure.

This talk will show you how to debug without frustration. You'll see practices hardening you for a fight against complex bugs. Finally, you'll know how to scale debugging to the entire team.

mrzasa

August 28, 2024
Tweet

More Decks by mrzasa

Other Decks in Programming

Transcript

  1. Maciek Rząsa @mjrzasa Debug Like a Scientist Maciek Rząsa Photo

    by: National Cancer Institute on Unsplash STpcsDG
  2. Maciek Rząsa @mjrzasa Debugging Science If a system of coordinates

    K is chosen so that, in relation to it, physical laws hold good in their simplest form, the same laws hold good in relation to any other system of coordinates K' moving in uniform translation relatively to K. ... light is always propagated in empty space with a definite velocity c which is independent of the state of motion of the emitting body Einstein 1905 L(v) = L(0) ​ 1 − ​ c2 v2 Image by wikimedia.org
  3. Maciek Rząsa @mjrzasa 0. Surprising behaviour (aka bug) Calling a

    non-existent method Rails Exception:, #<NoMethodError: undefined method `association_tree' for #<ArLazyPreload::Contexts::AutoPreloadContext:0x000055f9902b13f0> Did you mean? association_needs_preload? Did you mean? association_needs_preload?> gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/associated_context_builder.rb:44: in `child_association_tree' gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/associated_context_builder.rb:35: in `perform' gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/associated_context_builder.rb:12: in `prepare' gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/contexts/base_context.rb:43: in `perform_preloading' gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/contexts/base_context.rb:28: in `try_preload_lazily' gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/active_record/base.rb:12: in `try_preload_lazily' gems/ar_lazy_preload-0.3.1/lib/ar_lazy_preload/active_record/association.rb:7: in `load_target' gems/activerecord-6.0.3.2/lib/active_record/associations/association.rb:65: in `reload'
  4. Maciek Rząsa @mjrzasa 1. Gathering: Context We somehow pass the

    guard statement ArLazyPreload.configure do |config| config.auto_preload = true return nil if ArLazyPreload.config.auto_preload? # we SURELY return here # the bug happens here vvvvvvvvvvvvvvvvvvvvvvvvvvvvv AssociationTreeBuilder.new(parent_context.association_tree). subtree_for(association_name) # our code # ar_lazy_preload/lib/ar_lazy_preload/associated_context_builder.rb class AssociatedContextBuilder def child_association_tree end
  5. Maciek Rząsa @mjrzasa 1. Gathering: Observation We do change the

    guard variable class Graphql::Support::ArLazyPreloadOperationsInstrumenter def before_query(query) ArLazyPreload.config.auto_preload = true if has_operations_field?(query) end def after_query(_query) ArLazyPreload.config.auto_preload = DEFAULT_AUTO_PRELOAD end
  6. Maciek Rząsa @mjrzasa 3. Experiment If we simulate interleaving threads,

    we should see the same error ArLazyPreload.config.auto_preload = true; preload = ArLazyPreload.config.auto_preload; m = Manager.find(123); ArLazyPreload.config.auto_preload = !preload; puts m.user.to_s; # loading association # #<NoMethodError: undefined method `association_tree' # for #<ArLazyPreload::Contexts::AutoPreloadContext ArLazyPreload.config.auto_preload = preload;
  7. Maciek Rząsa @mjrzasa Hypothesis-Driven Debugging 1. Gathering a. Context: Error

    after the guard statement b. Observation: Config is changed 2. Guess Hypothesis: Race condition 3. Experiment: Simulated race condition 4. Understanding: Thread-safety in before filter
  8. Maciek Rząsa @mjrzasa What if We Can’t Dig into the

    Code Locally? Photo by: Wes Hicks on Unsplash
  9. Maciek Rząsa @mjrzasa 1. Gathering: Observation Can’t connect to graphql-gateway

    container Failed to open TCP connection to graphql-gateway:5000 (getaddrinfo: Name or service not known) (Faraday::ConnectionFailed) Pushing schemas for staff, talent... /ruby/net/http.rb:949:in `rescue in block in connect' /ruby/net/http.rb:946:in `block in connect' /ruby/timeout.rb:93:in `block in timeout' /ruby/timeout.rb:103:in `timeout' /ruby/net/http.rb:945:in `connect' /ruby/net/http.rb:930:in `do_start' /ruby/net/http.rb:919:in `start' /gems/webmock-3.5.1/lib/webmock/http_lib_adapters/net_http.rb:109:in `request' /gems/faraday-0.17.3/lib/faraday/adapter/net_http.rb:87:in `perform_request' /gems/faraday-0.17.3/lib/faraday/adapter/net_http.rb:43:in `block in call' /gems/faraday-0.17.3/lib/faraday/adapter/net_http.rb:92:in `with_net_http_connection' /gems/faraday-0.17.3/lib/faraday/adapter/net_http.rb:38:in `call'
  10. Maciek Rząsa @mjrzasa 1. Gathering: Observation def upload_and_activate_schemas puts "====checking:

    #{ping(Settings.graphql_gateway.url)}" post(Settings.graphql_gateway.url, body: schema) # <<<< error is raised here end def ping(url) response = Faraday.head(url) response.status end ====checking: 404 Net::ReadTimeout (Net::ReadTimeout)
  11. Maciek Rząsa @mjrzasa 1. Gathering: Context A successful HTTP request

    has 3 main steps: 1. Open a TCP connection to the endpoint. 2. Send a request over the connection. 3. Read the response written to the connection. Net::ReadTimeout is raised when step 3 doesn’t succeed within the given time. This error is not the same as Net::OpenTimeout, which is raised when a connection cannot be setup within a given amount of time (i.e. when step 1 does not succeed within a given time). https://www.exceptionalcreatures.com/bestiary/Net/ReadTimeout
  12. Maciek Rząsa @mjrzasa 1. Gathering: Context Log output of a

    postgres container stops after running for a while which is causing subsequent database queries to hang due to failing to write to stderr. docker-compose (…) only reads the logs from those configured by the user (the main service when using docker-compose run). This likely builds up backpressure for all other containers as their output buffers fill, but are never drained, (…). Eventually, this backpressure causes the docker engine to stop processing the containers’ stdout/stderr. https://github.com/docker/compose/issues/6018 and we log a lot on /healthz endpoint!
  13. Maciek Rząsa @mjrzasa But would it Work on a Bigger

    Case? Photo by: Sigmund on Unsplash
  14. Maciek Rząsa @mjrzasa Crucial Piece of Notes 10:19:46 is first

    client timeout 10:20 started and dropped 10:20 started 10:23:45 saturated 10:27 dropped 10:27:30 is the memory peak so it is on startup 11:03:15 started 11:04:45 saturated 11:15:45 dropped
  15. Maciek Rząsa @mjrzasa Hypothesis-Driven Debugging: Scaling Multiple paths of inquiry

    Everything public, available to everybody Round-the-clock work (timezones) Progress metric: rejected hypotheses A way to coach junior teammates
  16. Maciek Rząsa @mjrzasa It Was a Harsh Learning Experience, Make

    Sure It Sticks https://twitter.com/ProfFeynman/status/1579123122980425731