
Surviving Black Friday: 329 billion requests with Falcon!

On Black Friday 2025, Shopify's Storefront served billions of requests on Falcon—85 million RPM at peak, zero dropped requests. One year earlier, we ran three different web servers and had no idea which parts of our 15-year-old Rails codebase were thread-safe.
This talk brings together three perspectives on migrating a massive Ruby application to Falcon:
Samuel Williams built Falcon and explains how it works — fibers, the async scheduler, and Long Tasks. He also investigates and fixes Crisis 3: the silent worker death caused by a torn pipe write in async-container.
Marc-Andre Cournoyer sets the scene: Shopify's Storefront Renderer, why Unicorn was showing its limits, and the bet to ship Falcon before BFCM. He covers the October rollback and closes with results and lessons.
Josh Teeter owns the rollout: phased deployment, scale tests, the rdkafka segfault crisis, and the Outbox — the solution that solved the Kafka connection explosion and earned the green light.

Samuel Williams

April 29, 2026

Transcript

  1. I'm Samuel Williams. I built Falcon. And BFCM 2025 was

    the moment we found out what it was actually worth. I'm joined by Marc and Josh — and together, we're going to tell you about the biggest bet we've ever made on Ruby. How We Survived BFCM with Falcon at Shopify RubyKaigi 2026 Falcon でBFCM を乗り越えた方法 Slide 1 of 170 0010-title.md Elapsed: 00:00 Slide: 00:14 🎤 Samuel
  2. Five parts. The bet we made, the things that broke,

    the darkest week of the project, BFCM itself, and the lessons we took away. Marc will take us through the bet. The Bet Things Start Breaking The Darkest Week BFCM Lessons 本日の流れ:賭け、問題発生、最も暗い一週間、BFCM 、教訓 Slide 2 of 170 0015-agenda.md Elapsed: 00:14 Slide: 00:11 🎤 Samuel
  3. Marc-Andre takes over. The bet... Act I: The Bet 賭け

    Slide 3 of 170 0020-act1-section.md Elapsed: 00:25 Slide: 00:03 🎤 Marc-Andre
  4. Ship Falcon to 100% of storefront traffic before BFCM.

    That ended up being: Ship Falcon to 100% of Shopify's storefront traffic before BFCM 2025. BFCM 2025 までにShopify のストアフロントトラフィックの100% にFalcon を展開する。 Slide 4 of 170 0021-the-bet.md Elapsed: 00:28 Slide: 00:07 🎤 Marc-Andre
  5. Three hundred and twenty-nine billion requests. 329,000,000,000 requests 3290 億リクエスト

    Slide 5 of 170 0030-number.md Elapsed: 00:35 Slide: 00:04 🎤 Marc-Andre
  6. During one weekend. Our biggest weekend of the year: BFCM

    — Black Friday, Cyber Monday. The app: One weekend. たった一週末。 Slide 6 of 170 0040-weekend.md Elapsed: 00:39 Slide: 00:08 🎤 Marc-Andre
  7. Storefront Renderer. A Rack app. It powers every online store

    on the platform. If you've ever visited a Shopify store, this is the code that rendered the page. The App: Shopify's Storefront Renderer. Powers every online store. Shopify のストアフロントレンダラーは、すべてのオンラインストアを動かしている。 Slide 7 of 170 0070-storefront.md Elapsed: 00:47 Slide: 00:10 🎤 Marc-Andre
  8. Some of the biggest online stores are hosted on Storefront

    Renderer: Mattel, Allbirds, Glossier, Ruggable. Slide 8 of 170 0071-storefronts.md Elapsed: 00:57 Slide: 00:12 🎤 Marc-Andre
  9. And once a year, for BFCM, all of our traffic

    doubles. This is the biggest shopping weekend on the internet. Once a year, all of that traffic doubles. 年に一度、そのトラフィックが倍になる。 Slide 9 of 170 0090-bfcm-doubles.md Elapsed: 01:09 Slide: 00:10 🎤 Marc-Andre
  10. This is what BFCM looks like. The peaks — Black

    Friday, Saturday, Sunday, and Cyber Monday. The dips are when people sleep. Then it ramps right back up. Requests Per Second — BFCM BFCM 中の秒間リクエスト数 Slide 10 of 170 0100-bfcm-image.md Elapsed: 01:19 Slide: 00:14 🎤 Marc-Andre
  11. We peak at 1.42 million requests per second. Every one

    of them goes through the Storefront Renderer. Peak: 1.42 million requests per second. ピーク:毎秒142 万リクエスト。 Slide 11 of 170 0110-peak.md Elapsed: 01:33 Slide: 00:11 🎤 Marc-Andre
  12. And for a long time we ran all this on

    Unicorn. It uses one process per request. It's simple, reliable. Problem is it blocks on I/O. Unicorn: One process. One request. Simple. Reliable. 1 プロセス。1 リクエスト。シンプル。信頼性。 Slide 12 of 170 0130-unicorn-model.md Elapsed: 01:44 Slide: 00:13 🎤 Marc-Andre
  13. When a request waits on I/O — calling an

    external service, for example — the whole worker just sits there. Doing nothing. The entire worker just sits there. Doing nothing. ワーカー全体が何もせずに遊んでいる。 Slide 13 of 170 0140-unicorn-problem.md Elapsed: 01:57 Slide: 00:08 🎤 Marc-Andre
  14. Here's the picture. Worker 1 is blocked waiting on I/O

    — it can't take another request. Requests queue up waiting for a free worker. Worker 2 is the one doing CPU work here, because it's not blocked on I/O. Unicorn Worker 1 Request A · I/O Wait Request C · Queued Request E · Queued Worker 2 Request B · Running Request D · Queued CPU Idle CPU Busy I/O 待ち中、CPU はアイドル。リクエストがキューに積まれる。 Slide 14 of 170 0150-unicorn-diagram.md Elapsed: 02:05 Slide: 00:15 🎤 Marc-Andre
  15. Falcon is a fiber-based, async Ruby web server, built by

    Samuel. Samuel had built Falcon — a fiber-based, async Ruby web server. Samuel がFalcon を作っていた ー ファイバーベースの非同期Ruby ウェブサーバー。 Slide 15 of 170 0170-falcon-intro.md Elapsed: 02:20 Slide: 00:07 🎤 Marc-Andre
  16. It can handle multiple requests concurrently in a single

    process. When one request waits on I/O, let the worker pick up another request. 一つのリクエストがI/O を待っている間に、ワーカーは別のリクエストを処理する。 Slide 16 of 170 0180-falcon-idea.md Elapsed: 02:27 Slide: 00:04 🎤 Marc-Andre
  17. In Falcon each worker handles multiple requests concurrently, in the

    same process, using fibers. When a fiber is waiting on I/O, another fiber picks up the next request. Now I'll hand it to Samuel to walk us through how Falcon works. Falcon Worker 1 Request A · I/O Wait Request B · Running Request C · I/O Wait Worker 2 Request D · I/O Wait Request E · Running CPU Busy CPU Busy I/O 待ち中、別のリクエストを処理。CPU はビジー。キューは空。 Slide 17 of 170 0190-falcon-diagram.md Elapsed: 02:31 Slide: 00:17 🎤 Marc-Andre
  18. Samuel takes over. So how does Falcon actually work? Let

    me walk you through the engine we were betting on. How Falcon Works Falcon の仕組み Slide 18 of 170 0230-how-falcon-section.md Elapsed: 02:48 Slide: 00:07 🎤 Samuel
  19. Falcon is a layered system. Async::Container manages the Falcon worker

    processes and is responsible for restarting them if they fail. Async::Container Worker 1 Worker 2 Async::Container がワーカープロセスを管理し、障害時には自動的に再起動する。 Slide 19 of 170 0231-overview.md Elapsed: 02:55 Slide: 00:08 🎤 Samuel
  20. Each worker runs the Falcon service using Async to multiplex

    requests. Async::Container Worker 1 Worker 2 Falcon Async Falcon Async 各ワーカーでFalcon がAsync のスケジューラー上で動き、複数のリクエストを多重処理する。 Slide 20 of 170 0232-overview.md Elapsed: 03:03 Slide: 00:06 🎤 Samuel
  21. Each fiber handles one request. When a fiber is waiting

    on I/O, Async suspends it and runs the next one. This is what lets a single worker handle multiple requests concurrently. Async::Container Worker 1 Falcon Async Worker 2 Falcon Async Fiber Request Fiber Request Fiber Request Fiber Request 各ファイバーが1 つのリクエストを担当。I/O 待ちになるとAsync がファイバーを一時停止し、次のファイバーに切り替える。これにより1 つのワーカーで複数 リクエストを並行処理できる。 Slide 21 of 170 0233-overview.md Elapsed: 03:09 Slide: 00:11 🎤 Samuel
  22. In addition to that, Falcon can have a supervisor, which

    monitors memory usage, process status and server utilization. These monitors ensure the Falcon server remains healthy and operational. Async::Container Worker 1 Falcon Async Fiber Request Fiber Request Worker 2 Falcon Async Fiber Request Fiber Request Supervisor Memory Monitor Process Monitor Utilization Monitor さらに、Falcon はスーパーバイザーを持つことができる。メモリ使用量・プロセス状態・サーバー利用率を監視し、ワーカーの健全性を維持する。 Slide 22 of 170 0239-overview.md Elapsed: 03:20 Slide: 00:13 🎤 Samuel
  23. Now let's look at how request scheduling actually works —

    how Async decides which fiber runs next, and what happens when a request is waiting on I/O. Async Fiber Request Fiber Request では、Async がどのようにファイバーをスケジュールするかを見てみましょう。 Slide 23 of 170 0240-async-fiber.md Elapsed: 03:33 Slide: 00:10 🎤 Samuel
  24. Here is a simplified example showing two requests. Async —

    Concurrent Requests Async do request_a = Async do cart = Core.fetch_cart(session) send_response(200, render(cart)) end request_b = Async do products = Spanner.query_catalog(shop) send_response(200, render(products)) end request_a.wait request_b.wait end 2 つのリクエストを並行処理する簡略化した例。 Slide 24 of 170 0250-requests.md Elapsed: 03:43 Slide: 00:04 🎤 Samuel
  25. Request A — a cart fetch from Core. Async —

    Concurrent Requests Async do request_a = Async do cart = Core.fetch_cart(session) send_response(200, render(cart)) end request_b = Async do products = Spanner.query_catalog(shop) send_response(200, render(products)) end request_a.wait request_b.wait end リクエストA — Core からカートを取得する。 Slide 25 of 170 0251-requests.md Elapsed: 03:47 Slide: 00:03 🎤 Samuel
  26. When this fiber hits the network wait, it doesn't block.

    It yields. Async picks up the next ready fiber. Async — Concurrent Requests Async do request_a = Async do cart = Core.fetch_cart(session) send_response(200, render(cart)) end request_b = Async do products = Spanner.query_catalog(shop) send_response(200, render(products)) end request_a.wait request_b.wait end ネットワーク待ちが発生すると、ファイバーはブロックせずにyield し、Async が次のファイバーに切り替える。 Slide 26 of 170 0252-requests.md Elapsed: 03:50 Slide: 00:07 🎤 Samuel
  27. That is request B — a catalog query from Spanner.

    Same pattern. Async — Concurrent Requests Async do request_a = Async do cart = Core.fetch_cart(session) send_response(200, render(cart)) end request_b = Async do products = Spanner.query_catalog(shop) send_response(200, render(products)) end request_a.wait request_b.wait end リクエストB — Spanner にカタログをクエリするI/O 処理。 Slide 27 of 170 0260-requests.md Elapsed: 03:57 Slide: 00:05 🎤 Samuel
  28. While this one waits on the database, request A can

    continue. The two requests execute concurrently. Async — Concurrent Requests Async do request_a = Async do cart = Core.fetch_cart(session) send_response(200, render(cart)) end request_b = Async do products = Spanner.query_catalog(shop) send_response(200, render(products)) end request_a.wait request_b.wait end リクエストB がDB 待ちの間、リクエストA が進行できる。2 つのリクエストが並行して実行される。 Slide 28 of 170 0261-requests.md Elapsed: 04:02 Slide: 00:07 🎤 Samuel
  29. Finally we wait for both requests to complete. But here's

    the key: they haven't been sitting idle. While request A was waiting on Core, request B was running. While request B was waiting on Spanner, request A was running. Two requests, interleaved. Async — Concurrent Execution Async do request_a = Async do cart = Core.fetch_cart(session) send_response(200, render(cart)) end request_b = Async do products = Spanner.query_catalog(shop) send_response(200, render(products)) end request_a.wait request_b.wait end 両方を wait で待つが、その間も互いにI/O 待ちを利用して並行実行されていた。 Slide 29 of 170 0270-requests.md Elapsed: 04:09 Slide: 00:14 🎤 Samuel
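
The interleaving shown on these slides can be imitated with plain Ruby fibers and no gems at all. This is a minimal sketch, not Falcon's implementation: Fiber.yield stands in for Async suspending a fiber at an I/O wait, and the Core and Spanner calls from the slides are replaced by log entries.

```ruby
# Toy model of the slide's two concurrent requests.
# Fiber.yield stands in for "suspend at an I/O wait"; the loop at the
# bottom plays the role of the Async scheduler resuming ready fibers.
log = []

request_a = Fiber.new do
  log << "A: fetching cart"    # would be Core.fetch_cart(session)
  Fiber.yield                  # suspended while waiting on Core
  log << "A: sending response"
end

request_b = Fiber.new do
  log << "B: querying catalog" # would be Spanner.query_catalog(shop)
  Fiber.yield                  # suspended while waiting on Spanner
  log << "B: sending response"
end

# Round-robin "event loop": resume each fiber until both complete.
[request_a, request_b].cycle do |fiber|
  fiber.resume if fiber.alive?
  break unless request_a.alive? || request_b.alive?
end

log.each { |line| puts line }
```

The log shows the point of the slide: A starts, suspends at its I/O wait, B starts, and each finishes when it is resumed — two requests interleaved on one thread.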
  30. At its heart, Async uses the io-event gem, which provides

    platform-specific backends for multiplexing blocking operations: io_uring and epoll on Linux and kqueue on macOS. io-event — the engine underneath: epoll , kqueue , io_uring . プラットフォーム固有のI/O バックエンド。 Slide 30 of 170 0290-io-event.md Elapsed: 04:23 Slide: 00:14 🎤 Samuel
  31. Let's take a look at how the Ruby Fiber Scheduler

    integrates with the io-event selector. The Fiber Scheduler class Async::Scheduler def io_wait(io, events, timeout) # Register interest in this I/O: fiber = Fiber.current monitor = @selector.io_wait(fiber, io, events) # Yield control to the event loop: transfer # When we resume, the I/O is ready: return events end def run while @selector.ready? @selector.select(interval) do |fiber| fiber.transfer end end end end ファイバースケジューラーのコードを見てみよう。 Slide 31 of 170 0310-scheduler-code.md Elapsed: 04:37 Slide: 00:05 🎤 Samuel
  32. When an I/O operation would otherwise block, Ruby invokes the

    io_wait hook. The Fiber Scheduler class Async::Scheduler def io_wait(io, events, timeout) # Register interest in this I/O: fiber = Fiber.current monitor = @selector.io_wait(fiber, io, events) # Yield control to the event loop: transfer # When we resume, the I/O is ready: return events end def run while @selector.ready? @selector.select(interval) do |fiber| fiber.transfer end end end end I/O 待ちが発生すると io_wait フックが呼ばれ、ファイバーをセレクターに登録する。 Slide 32 of 170 0311-scheduler-code.md Elapsed: 04:42 Slide: 00:05 🎤 Samuel
  33. The implementation registers the fiber with the io-event selector... The

    Fiber Scheduler class Async::Scheduler def io_wait(io, events, timeout) # Register interest in this I/O: fiber = Fiber.current monitor = @selector.io_wait(fiber, io, events) # Yield control to the event loop: transfer # When we resume, the I/O is ready: return events end def run while @selector.ready? @selector.select(interval) do |fiber| fiber.transfer end end end end io-event セレクターにファイバーを登録し、I/O の準備ができるまで監視する。 Slide 33 of 170 0312-scheduler-code.md Elapsed: 04:47 Slide: 00:04 🎤 Samuel
  34. ...then calls transfer — which yields control back to the

    event loop. The fiber is suspended. The Fiber Scheduler class Async::Scheduler def io_wait(io, events, timeout) # Register interest in this I/O: fiber = Fiber.current monitor = @selector.io_wait(fiber, io, events) # Yield control to the event loop: transfer # When we resume, the I/O is ready: return events end def run while @selector.ready? @selector.select(interval) do |fiber| fiber.transfer end end end end transfer でイベントループに制御を返す。ファイバーは一時停止し、次の準備済みファイバーに切り替わる。 Slide 34 of 170 0313-scheduler-code.md Elapsed: 04:51 Slide: 00:07 🎤 Samuel
  35. The actual event loop runs on the main fiber. The

    Event Loop # Yield control to the event loop: transfer # When we resume, the I/O is ready: return events end def run while @selector.ready? @selector.select(interval) do |fiber| fiber.transfer end end end end イベントループはメインファイバー上で動作する。 Slide 35 of 170 0320-scheduler-code.md Elapsed: 04:58 Slide: 00:04 🎤 Samuel
  36. It keeps selecting ready fibers and transferring to them. And

    when they need to wait, they come back to this loop. The Event Loop # Yield control to the event loop: transfer # When we resume, the I/O is ready: return events end def run while @selector.ready? @selector.select(interval) do |fiber| fiber.transfer end end end end 準備済みのファイバーを選択してtransfer を繰り返す。I/O 待ちのファイバーはここに戻ってくる。 Slide 36 of 170 0321-scheduler-code.md Elapsed: 05:02 Slide: 00:08 🎤 Samuel
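
The io_wait/run shape on these slides can be mimicked in a few lines of plain Ruby. This is a toy sketch only: the real Async::Scheduler registers fibers with an io-event selector, whereas here a simple ready queue stands in for it and every "I/O" completes immediately.

```ruby
# Toy version of the scheduler shape shown on the slides: fibers call
# `wait` instead of blocking, transferring control back to the event
# loop on the main fiber; the loop keeps transferring to ready fibers.
class ToyScheduler
  def initialize
    @ready = []
    @main = Fiber.current
  end

  # Analogue of io_wait: park the current fiber and yield to the loop.
  # (We pretend the I/O is immediately ready, so the fiber goes
  # straight back onto the ready queue.)
  def wait
    @ready << Fiber.current
    @main.transfer
  end

  def schedule(&block)
    @ready << Fiber.new { block.call; @main.transfer }
  end

  # Analogue of run: transfer to each ready fiber until none remain.
  def run
    @main = Fiber.current
    @ready.shift.transfer until @ready.empty?
  end
end

log = []
scheduler = ToyScheduler.new
scheduler.schedule { log << "A start"; scheduler.wait; log << "A resumed" }
scheduler.schedule { log << "B start"; scheduler.wait; log << "B resumed" }
scheduler.run

log.each { |line| puts line }
```

Both tasks start before either resumes — the same "come back to this loop" behaviour the slide describes, with transfer doing all the switching.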
  37. Now let's talk about how the supervisor works. Async::Container Worker

    1 Falcon Worker 2 Falcon Async Fiber Request Fiber Request Async Fiber Request Fiber Request Supervisor Memory Monitor Process Monitor Utilization Monitor Async::Container がワーカーとスーパーバイザーを管理。スーパーバイザーがワーカーの健全性を監視する。 Slide 37 of 170 0330-overview.md Elapsed: 05:10 Slide: 00:04 🎤 Samuel
  38. The Supervisor is a separate process inside the container. It

    runs monitors that keep the workers healthy — tracking memory usage, process metrics, and server utilization. Supervisor Memory Monitor Process Monitor Utilization Monitor スーパーバイザーは独立したプロセスとして動作し、各ワーカーの状態を監視する。 Slide 38 of 170 0331-supervisor.md Elapsed: 05:14 Slide: 00:11 🎤 Samuel
  39. Every worker has a bi-directional connection to the supervisor via

    the async-bus gem — a transparent Ruby RPC mechanism that lets monitors execute operations directly in the context of a worker process. Workers communicate using async-bus . ワーカーは双方向チャンネルでスーパーバイザーと通信する。 Slide 39 of 170 0332-supervisor.md Elapsed: 05:25 Slide: 00:13 🎤 Samuel
  40. The Memory Monitor is policy-driven — when a worker's memory

    usage crosses a threshold, it terminates the process. It supports both a maximum size limit and a minimum free size limit. The Memory Monitor terminates workers that exceed memory limits. メモリモニターはポリシーに基づき、閾値を超えたワーカープロセスを終了する。 Slide 40 of 170 0352-memory-monitor.md Elapsed: 05:38 Slide: 00:12 🎤 Samuel
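
The policy described here — a maximum size limit plus a minimum free size limit — can be sketched as a small decision object. All names and units below (MemoryPolicy, maximum_size, minimum_free, bytes) are illustrative, not the actual async-container API.

```ruby
# Hypothetical sketch of the Memory Monitor's policy: a worker is
# marked for termination when its usage exceeds a maximum size, or
# when the host's free memory drops below a minimum.
class MemoryPolicy
  def initialize(maximum_size: nil, minimum_free: nil)
    @maximum_size = maximum_size # worker's allowed usage, in bytes
    @minimum_free = minimum_free # free memory required on the host
  end

  # usage: the worker's memory usage; free: memory available on the host.
  def violated?(usage:, free:)
    return true if @maximum_size && usage > @maximum_size
    return true if @minimum_free && free < @minimum_free
    false
  end
end

policy = MemoryPolicy.new(maximum_size: 2 * 1024**3, minimum_free: 512 * 1024**2)

puts policy.violated?(usage: 1 * 1024**3, free: 1 * 1024**3)   # within limits
puts policy.violated?(usage: 3 * 1024**3, free: 1 * 1024**3)   # over maximum size
puts policy.violated?(usage: 1 * 1024**3, free: 256 * 1024**2) # under minimum free
```

A monitor loop would evaluate such a policy per worker and terminate any process for which it returns true, letting the container restart it cleanly.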
  41. The Process Monitor emits system metrics for each worker, including

    CPU usage, private and shared memory usage, memory map count, and so on, useful for diagnostics and debugging purposes. The Process Monitor emits system metrics for each worker (and all child processes). プロセスモニターはCPU ・メモリ使用量など、各ワーカーのシステムメトリクスを収集する。 Slide 41 of 170 0353-process-monitor.md Elapsed: 05:50 Slide: 00:12 🎤 Samuel
  42. The Utilization Monitor reads worker utilization information from a shared

    memory buffer, including things like number of active connections and number of active requests. This information can be used for controlling scaling and load balancing. The Utilization Monitor reads worker utilization information. 利用率モニターはアクティブな接続数・リクエスト数など、ワーカーの利用率情報を読み取る。 Slide 42 of 170 0354-utilization-monitor.md Elapsed: 06:02 Slide: 00:15 🎤 Samuel
  43. So in summary, Falcon workers, along with a supervisor, run

    inside the container. The container restarts processes if they fail, and the supervisor monitors the state of the workers to ensure they are healthy and behaving correctly. But here's the thing — we didn't turn on all of this concurrency on day one. Async::Container Worker 1 Falcon Worker 2 Falcon Async Fiber Request Fiber Request Supervisor Memory Monitor Process Monitor Utilization Monitor Async Fiber Request Fiber Request Async::Container がワーカーとスーパーバイザーを管理。スーパーバイザーがワーカーの健全性を監視する。 Slide 43 of 170 0359-overview.md Elapsed: 06:17 Slide: 00:18 🎤 Samuel
  44. Let me tell you about Long Tasks — the feature

    that gave us control over when to unlock concurrency. Long Tasks ロングタスク Slide 44 of 170 0360-long-tasks-title.md Elapsed: 06:35 Slide: 00:06 🎤 Samuel
  45. Handling multiple concurrent CPU-bound requests can make overall latency

    worse, so we wanted a way to control this carefully, while still enabling I/O concurrency. CPU-bound requests don't benefit from concurrency — they fight over the CPU. CPU バウンドリクエストは並行処理の恩恵を受けない — CPU を奪い合うだけ。 Slide 45 of 170 0362-why-unicorn-mode.md Elapsed: 06:41 Slide: 00:10 🎤 Samuel
  46. We introduced a semaphore — a permit a worker must

    acquire to accept a request. Before a request runs, it acquires the permit. When it finishes, it releases it. Falcon is now a drop-in replacement for Unicorn, with all the same behavioural characteristics. Same performance baseline, same operational profile. A safe starting point. We deployed Falcon in Unicorn mode first — one request per worker. 最初はUnicorn モードで展開した — ワーカーごとに1 リクエスト。 Slide 46 of 170 0364-unicorn-mode-first.md Elapsed: 06:51 Slide: 00:23 🎤 Samuel
  47. Long Tasks is opt-in concurrency. Wrapping an I/O-bound block in

    a Long Task releases the semaphore too — admitting the next request — then reacquires it when the I/O is done. CPU-bound requests never release the semaphore. They run without competition. Critical CPU-bound requests stay fast. 重要なリクエストは速いまま。 Slide 47 of 170 0365-long-tasks-why.md Elapsed: 07:14 Slide: 00:18 🎤 Samuel
  48. Here is an example. Two requests arrive. Neither can proceed

    until it acquires the semaphore. The semaphore is free. REQUEST A REQUEST B Waiting at gate… Waiting at gate… リクエストA とB が到着。セマフォは解放されている。 Slide 48 of 170 0380-long-task-diagram.md Elapsed: 07:32 Slide: 00:09 🎤 Samuel
  49. Request A acquires the semaphore and starts processing. Request B

    must wait. REQUEST A REQUEST B Waiting at gate… Waiting at gate… Admitted → acquire semaphore Running (CPU) リクエストA がセマフォを取得し処理を開始。B はA が解放するまで待機。 Slide 49 of 170 0381-long-task-a-admits.md Elapsed: 07:41 Slide: 00:06 🎤 Samuel
  50. Request A enters a Long Task — I/O-bound work. It

    releases the semaphore. Request B acquires it immediately. REQUEST A REQUEST B Waiting at gate… Waiting at gate… Admitted → acquire semaphore Running (CPU) Long Task → release semaphore Admitted → acquire semaphore リクエストA がロングタスクに入り、セマフォを解放。B が同時に取得する。 Slide 50 of 170 0382-long-task-release.md Elapsed: 07:47 Slide: 00:08 🎤 Samuel
  51. Request A runs its I/O. Request B runs on the

    CPU. Two requests in flight — one doing I/O, one doing CPU work. REQUEST A REQUEST B Waiting at gate… Waiting at gate… Admitted → acquire semaphore Running (CPU) Long Task → release semaphore Admitted → acquire semaphore Running (I/O) Running (CPU) A はI/O 処理中。B がCPU 処理を開始。2 つのリクエストが同時進行。 Slide 51 of 170 0383-long-task-b-admits.md Elapsed: 07:55 Slide: 00:09 🎤 Samuel
  52. The Long Task completes. Request A waits at the gate.

    Request B finishes and releases the semaphore. REQUEST A REQUEST B Waiting at gate… Waiting at gate… Admitted → acquire semaphore Running (CPU) Long Task → release semaphore Admitted → acquire semaphore Running (I/O) Running (CPU) Long task done → waiting at gate Complete ✓ → release semaphore ロングタスク完了。A はゲートでB の完了を待つ。 Slide 52 of 170 0384-long-task-b-done.md Elapsed: 08:04 Slide: 00:07 🎤 Samuel
  53. Two requests, one worker, zero CPU contention. That's Long Tasks.

    REQUEST A REQUEST B Waiting at gate… Waiting at gate… Admitted → acquire semaphore Running (CPU) Long Task → release semaphore Admitted → acquire semaphore Running (I/O) Running (CPU) Long task done → waiting at gate Complete ✓ → release semaphore Admitted → acquire semaphore Complete ✓ → release semaphore 2 つのリクエスト、1 つのワーカー、CPU の競合なし。ロングタスクの仕組み。 Slide 53 of 170 0385-long-task-a-done.md Elapsed: 08:11 Slide: 00:06 🎤 Samuel
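
The walkthrough above can be replayed as a script against a single permit. This is purely illustrative — Falcon's real gate coordinates fibers — but the permit arithmetic at each step matches the slides exactly.

```ruby
# Scripted replay of the Long Task walkthrough with one permit.
# `permits` is the gate's free-permit count; each step mirrors a slide.
permits = 1
log = []

step = lambda do |event, delta|
  permits += delta
  raise "gate oversubscribed" if permits.negative?
  log << "#{event} (free permits: #{permits})"
end

step.call("A admitted: acquires the permit",       -1) # A runs on the CPU
step.call("B waiting at the gate",                  0) # no permit free
step.call("A enters a Long Task: releases",        +1) # A is I/O-bound now
step.call("B admitted: acquires the permit",       -1) # B on CPU, A on I/O
step.call("A's Long Task done: waits at the gate",  0) # must reacquire first
step.call("B complete: releases",                  +1)
step.call("A readmitted: acquires the permit",     -1) # A back on the CPU
step.call("A complete: releases",                  +1)

log.each { |line| puts line }
```

The invariant check never fires: at no point do two requests hold CPU time at once, which is the whole point of the gate.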
  54. But, this only works if your code is fiber-safe. We

    had a massive Ruby codebase with native C extensions that was never built for this kind of concurrency. But... only works if your code is fiber-safe. ファイバーセーフなコードでなければ動かない。 Slide 54 of 170 0410-fiber-safe.md Elapsed: 08:17 Slide: 00:10 🎤 Samuel
  55. This is a key danger. A C extension that makes

    a blocking system call without yielding doesn't just stall one request — it stalls every concurrent request on that worker. Every fiber. The entire point of Falcon disappears. A C extension that blocks the thread blocks every fiber on that worker. スレッドをブロックするC 拡張は、そのワーカー上のすべてのファイバーをブロックする。 Slide 55 of 170 0411-not-fiber-safe.md Elapsed: 08:27 Slide: 00:15 🎤 Samuel
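
The hazard can be shown with two plain fibers — a toy illustration, not real native-extension behaviour. The "native call" below is simulated by a fiber that simply never yields, which is exactly how a blocking C call looks to a cooperative scheduler.

```ruby
# In cooperative scheduling, a task that never yields (like a C
# extension making a blocking syscall the scheduler can't see) holds
# up every other fiber on the worker.
log = []

blocking = Fiber.new do
  # Simulates a native call that blocks without ever yielding:
  log << "native call begins (no yield)"
  log << "native call ends"
end

polite = Fiber.new do
  log << "request starts"
  Fiber.yield # cooperative: suspends at its I/O wait
  log << "request finishes"
end

polite.resume   # request starts, then suspends at "I/O"
blocking.resume # nothing else can run until this returns
polite.resume   # only now can the first request finish

log.each { |line| puts line }
```

The suspended request cannot make progress until the blocking fiber returns — scale that up to every fiber on a worker and the concurrency win disappears.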
  56. These aren't obscure libraries. rdkafka wraps librdkafka in C. gRPC

    has its own threading model. They were written for a world where the thread and the execution context are the same thing. rdkafka . gRPC . Native gems we'd depended on for years. rdkafka 。gRPC 。長年依存してきたネイティブgem 。 Slide 56 of 170 0413-native-deps.md Elapsed: 08:42 Slide: 00:15 🎤 Samuel
  57. Some would break, we just didn't know which ones yet...

    let me pass over to Josh who will tell us what happened next! Some would break. We just didn't know which ones yet. いくつかは壊れる。どれかはまだわからなかった。 Slide 57 of 170 0416-about-to-find-out.md Elapsed: 08:57 Slide: 00:08 🎤 Samuel
  58. Josh takes over. We had a plan. A phased rollout.

    And for a while — it seemed like it was going to work. Act II: Things Start Breaking 壊れ始める Slide 58 of 170 0420-act2-section.md Elapsed: 09:05 Slide: 00:05 🎤 Josh
  59. We didn't flip a switch. You don't go from zero

    to one hundred on a new web server overnight just before Black Friday. We built a plan — a careful, methodical rollout designed to catch problems before they could reach merchants. The Rollout 段階的デプロイ Slide 59 of 170 0430-rollout-title.md Elapsed: 09:10 Slide: 00:12 🎤 Josh
  60. Confidence is high. We'd been running Falcon on lower-stakes workloads

    for some time. The fiber scheduler was solid. Samuel's code was battle-tested. We just hadn't put it under the full weight of Storefront at BFCM scale. Confidence is high. 自信は高い。 Slide 60 of 170 0440-confidence.md Elapsed: 09:22 Slide: 00:13 🎤 Josh
  61. The rollout was sequential and manually targeted. We had a Canary

    and phases 1 through N with the final phase being production. Canary Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Phase 6 1% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 10% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 25% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 50% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 75% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 100% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 順次・手動でロールアウト。カナリアからフェーズ6 まで、各段階で1% → 10% → 25% → 50% → 75% → 100% 。 Slide 61 of 170 0450-multi-layered-rollout.md Elapsed: 09:35 Slide: 00:08 🎤 Josh
  62. Only once canary was fully on Falcon did we start

    targeting parts of phase one. Then parts of phase two. And so on. At any moment only one cohort was moving and only for a fraction of its fractional traffic. Canary Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Phase 6 1% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 10% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 25% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 50% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 75% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 100% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 一度に一つのコホートだけを進める。現状が安定するまで次に進まない。 Slide 62 of 170 0451-multi-layered-rollout-1.md Elapsed: 09:43 Slide: 00:15 🎤 Josh
  63. We didn't advance until we were sure that the current

    step was stable. Canary Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Phase 6 1% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 10% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 25% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 50% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 75% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 100% Falcon Falcon Falcon Falcon Falcon Falcon Falcon 一度に一つのコホートだけを進める。現状が安定するまで次に進まない。 Slide 63 of 170 0452-multi-layered-rollout-2.md Elapsed: 09:58 Slide: 00:04 🎤 Josh
  64. And when something goes wrong: the box turns

    red and we determine if we need to roll back. If we do, we can go straight back to Unicorn. Minimal impact. Canary Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Phase 6 1% 10% 25% 50% 75% 100% Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn Unicorn 問題発生時:セルが赤くなり停止。Falcon に移行した分は全てUnicorn に即戻る。一つのコマンドで。 Slide 64 of 170 0455-rollback.md Elapsed: 10:02 Slide: 00:10 🎤 Josh
  65. These were two separate deployments, each with its own rollout.

    We worked through any issues with our GraphQL deployment first. The surface area is smaller and it stood to benefit the most from moving to Falcon. Two separate deployments and rollouts. GraphQL API IO wait → Template rendering CPU bound GraphQL API とテンプレートレンダリングは別々のデプロイ・ロールアウト。GraphQL を先に進める。 Slide 65 of 170 0460-sfapi-first.md Elapsed: 10:12 Slide: 00:13 🎤 Josh
  66. Once the GraphQL deployment was at 100% and stable, we started

    the process over from scratch for Liquid template rendering, which is the bulk of the traffic and much higher stakes. Two separate deployments and rollouts. GraphQL API IO wait → Template rendering CPU bound GraphQL が100% で安定したら、テンプレートレンダリングを最初からやり直す。 Slide 66 of 170 0461-sfapi-first.md Elapsed: 10:25 Slide: 00:11 🎤 Josh
  67. Company-wide scale tests are our way of simulating high

    traffic events like Black Friday before the real thing. Scale Tests スケールテスト Slide 67 of 170 0470-scale-tests.md Elapsed: 10:36 Slide: 00:06 🎤 Josh
  68. This is where we find out if Falcon can actually

    handle the load before it's showtime. Scale Tests スケールテスト Slide 68 of 170 0471-scale-tests-1.md Elapsed: 10:42 Slide: 00:05 🎤 Josh
  69. The first scale test. Low traffic — Falcon wasn't carrying

    much yet. We had cautious optimism. Everything worked as expected, and Marc-Andre posted to our team channel afterward: "Low numbers so far. But it's working." The first signal that Falcon could run in production under heavy load without falling apart. The first scale test. Falcon holds steady. Nothing breaks. 最初のスケールテスト。Falcon は安定。何も壊れない。 Slide 69 of 170 0480-scale1.md Elapsed: 10:47 Slide: 00:18 🎤 Josh
  70. The first scale test. Traffic ramps, plateaus, ramps down. Unicorn

    is carrying almost all of it at this point. Falcon's line is the green one at the bottom. Falcon's share is tiny by contrast, but it held steady, and nothing broke which is key. Requests Per Second 秒間リクエスト数 Slide 70 of 170 0481-scale1-data.md Elapsed: 11:05 Slide: 00:16 🎤 Josh
  71. The second scale test came with far more long tasks

    than April. I posted to the channel afterward: "Nothing exciting really happened." OPTIONAL: Just a week before the test, a real flash sale triggered millions of long tasks in production. External IO slowed down, Falcon absorbed it without scaling up. We didn't plan that test. We didn't need to. Falcon just did what it was built to do. The second test, more long tasks. "Nothing exciting really happened." 2 回目のテスト、ロングタスク急増。 「特に何も起こらなかった。 」 Slide 71 of 170 0490-scale2.md Elapsed: 11:21 Slide: 00:08 🎤 Josh
  72. Second scale test. Long tasks per minute jump from the

    baseline to a plateau during the test window, then fall back. More traffic, more long tasks. Still nothing broke. Long Tasks Created Per Minute 分間ロングタスク作成数 Slide 72 of 170 0491-scale2-data.md Elapsed: 11:29 Slide: 00:10 🎤 Josh
  73. Third test. Unicorn's utilization is climbing — headroom shrinking under

    load. Falcon? Steady. The third test. Unicorn is struggling with utilization. Falcon? Looking great. 3 回目のテスト。Unicorn が苦戦。Falcon ?好調。 Slide 73 of 170 0500-scale3.md Elapsed: 11:39 Slide: 00:06 🎤 Josh
  74. Scale test 3. Falcon's utilization stays flat around 10% the

    whole test. Unicorn spikes — 26% during the first load window, over 30% during the second. Same traffic, same cluster. Falcon is doing the work more efficiently. Utilization % 使用率 falcon unicorn Falcon は約10% で安定、Unicorn は26% 〜32% までスパイク。同じ負荷で、Falcon の方がずっと余裕がある。 Slide 74 of 170 0501-scale3-data.md Elapsed: 11:45 Slide: 00:18 🎤 Josh
  75. Tone shift. Scale test four is set to run. Three

    tests in, nothing catastrophic. We have no reason to expect this one to be any different. September: The Wake-Up Call 9 月:警鐘 Slide 75 of 170 0510-september-section.md Elapsed: 12:03 Slide: 00:08 🎤 Josh
  76. We watch the same dashboards we've been watching all year.

    We're expecting the same result. We were wrong. Scale Test 4 — September スケールテスト4 ー 9 月 Slide 76 of 170 0520-scale-test-4.md Elapsed: 12:11 Slide: 00:07 🎤 Josh
  77. The Falcon pods don't degrade. They don't slow down gracefully.

    They lock up. Utilization spikes to the ceiling. Load shedding kicks in. We are dropping requests on production. Falcon pods in two large regions lock up. Utilization spikes. Load shedding kicks in. 2 つの大きなリージョンでFalcon ポッドがロック。利用率がスパイク。ロードシェディング発動。 Slide 77 of 170 0540-lockup.md Elapsed: 12:18 Slide: 00:09 🎤 Josh
  78. The Unicorn pods are fine. Only Falcon is melting. Only

    Falcon is melting. 溶けているのはFalcon だけ。 Slide 78 of 170 0550-unicorn-fine.md Elapsed: 12:27 Slide: 00:04 🎤 Josh
  79. We dig in and we find not one problem, but

    three. Each one would have been bad enough on its own. We found not one problem, but three. Each one would have been bad enough on its own. 一つではなく、三つの問題を発見。どれか一つだけでも十分に深刻だった。 Slide 79 of 170 0560-three-problems.md Elapsed: 12:31 Slide: 00:06 🎤 Josh
  80. Together, they nearly killed the project. Together, they nearly killed

    the project. 合わさって、プロジェクトを殺しかけた。 Slide 80 of 170 0570-together.md Elapsed: 12:37 Slide: 00:02 🎤 Josh
  81. Josh continues. Crisis one. rdkafka segfaults. Crisis 1: rdkafka Segfaults

    セグフォルト Slide 81 of 170 0580-crisis1-section.md Elapsed: 12:39 Slide: 00:03 🎤 Josh
  82. The Kafka client gem wraps a C library called librdkafka.

    It handles all our event streaming. The Kafka client gem wraps a C library — librdkafka . Kafka クライアントgem はC ライブラリrdkafka をラップしている。 Slide 82 of 170 0590-kafka-c.md Elapsed: 12:42 Slide: 00:09 🎤 Josh
  83. And it was never designed for fibers. The C code

    has no concept of cooperative scheduling. It was never designed for fibers. ファイバーのために設計されたことは一度もなかった。 Slide 83 of 170 0600-kafka-fibers.md Elapsed: 12:51 Slide: 00:06 🎤 Josh
  84. This was a difficult investigation. Heap dumps were nearly useless, and it

    was basically impossible to reproduce. We eventually found out what was happening. High load ↓ rdkafka segfaults ↓ System protection → hard timeout ↓ 1 GB default buffer silently fills ↓ OOM killed · heap useless 高負荷時:セグフォルト、ハードタイムアウト、1GB バッファのサイレントメモリ消費、そしてOOM Kill 。 Slide 84 of 170 0610-kafka-symptoms.md Elapsed: 12:57 Slide: 00:10 🎤 Josh
  85. Under high load, we see segfaults. High load ↓ rdkafka

    segfaults ↓ System protection → hard timeout ↓ 1 GB default buffer silently fills ↓ OOM killed · heap useless 高負荷時、rdkafka がセグフォルトを起こす。 Slide 85 of 170 0611-kafka-symptoms-segfault.md Elapsed: 13:07 Slide: 00:03 🎤 Josh
  86. Our system protection kicks in with hard timeouts where we

    try to exit the process. High load ↓ rdkafka segfaults ↓ System protection → hard timeout ↓ 1 GB default buffer silently fills ↓ OOM killed · heap useless システム保護機構が発動し、ハードタイムアウトでプロセス終了を試みる。 Slide 86 of 170 0612-kafka-symptoms-timeout.md Elapsed: 13:10 Slide: 00:05 🎤 Josh
  87. A one gigabyte default buffer that silently eats memory until

    things start to corrupt. High load ↓ rdkafka segfaults ↓ System protection → hard timeout ↓ 1 GB default buffer silently fills ↓ OOM killed · heap useless デフォルト1GB のバッファが、気付かれないままメモリを食い潰していく。 Slide 87 of 170 0613-kafka-symptoms-buffer.md Elapsed: 13:15 Slide: 00:06 🎤 Josh
  88. Then we get OOM killed and we're left with a

    useless heap. High load ↓ rdkafka segfaults ↓ System protection → hard timeout ↓ 1 GB default buffer silently fills ↓ OOM killed · heap useless 最終的にOOM Kill され、残されたヒープは解析に使えない。 Slide 88 of 170 0614-kafka-symptoms-oom.md Elapsed: 13:21 Slide: 00:05 🎤 Josh
  89. We ship a flurry of fixes. We ship a flurry

    of fixes 修正を次々と投入します。 Slide 89 of 170 0620-kafka-fixes.md Elapsed: 13:26 Slide: 00:12 🎤 Josh
  90. Bumped gem versions. Bumped gem versions gem のバージョンを上げる。 Slide 92

    of 170 0623-kafka-fixes-3.md Elapsed: 14:02 Slide: 00:12 🎤 Josh
  91. Each fix peels back another layer. There always seems to

    be another problem underneath. Now I'll pass it to Marc. Each fix peels back another layer. 修正するたびに、別の問題が現れる。 Slide 93 of 170 0624-kafka-fixes-4.md Elapsed: 14:14 Slide: 00:12 🎤 Josh
  92. Marc-Andre takes over. Crisis two. Spanner deadlocks. Crisis 2: Spanner

    Deadlocks デッドロック Slide 94 of 170 0630-crisis2-section.md Elapsed: 14:26 Slide: 00:05 🎤 Marc-Andre
  93. Spanner is Google's distributed database — we use it for

    session data. Under load, Falcon workers were freezing inside a C call, event loop not running. The health check timed out. The controller sent SIGINT , then SIGTERM . The process stayed alive — completely unresponsive, stuck waiting for Spanner partitions that were never going to arrive. The health check timed out. The controller sent SIGINT . Then SIGTERM . Neither worked. ヘルスチェックがタイムアウト。コントローラーがSIGINT 、次にSIGTERM を送信。どちらも効果なし。 Slide 95 of 170 0642-supervisor-signals.md Elapsed: 14:31 Slide: 00:27 🎤 Marc-Andre
  94. Here is Async's event loop run_loop ... Async Scheduler —

    run_loop private def run_loop(&block) Thread.handle_interrupt(::SignalException => :never) do until self.interrupted? break unless yield end end rescue Interrupt => interrupt Thread.handle_interrupt(::SignalException => :never) do self.stop end retry end Async のイベントループは安全なタイミングまでシグナル処理を意図的に遅延させる。 Slide 96 of 170 0642a-async-run-loop.md Elapsed: 14:58 Slide: 00:11 🎤 Marc-Andre
  95. ...it deliberately defers signal handling until it's safe to do

    so. Async Scheduler — run_loop private def run_loop(&block) Thread.handle_interrupt(::SignalException => :never) do until self.interrupted? break unless yield end end rescue Interrupt => interrupt Thread.handle_interrupt(::SignalException => :never) do self.stop end retry end Async のイベントループは安全なタイミングまでシグナル処理を意図的に遅延させる。 Slide 97 of 170 0642b-async-run-loop.md Elapsed: 15:09 Slide: 00:11 🎤 Marc-Andre
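    The deferral mechanism that run_loop relies on can be seen in a few lines of plain Ruby. This is a minimal sketch, not Async's actual code: inside a Thread.handle_interrupt(... => :never) block, an asynchronous exception (here simulated with Thread#raise and RuntimeError, standing in for SignalException) is queued rather than delivered, and only fires once the protected block exits.

    ```ruby
    # Sketch: inside handle_interrupt(... => :never), async exceptions
    # are deferred until the protected block completes.
    events = []

    t = Thread.new do
      Thread.handle_interrupt(RuntimeError => :never) do
        events << :critical_section_entered
        sleep 0.3 # the interrupt arrives here, but stays pending
        events << :critical_section_completed
      end
    rescue RuntimeError
      events << :interrupt_delivered_after_block
    end

    sleep 0.1 # let the thread enter the protected block first
    t.raise(RuntimeError, "simulated interrupt")
    t.join
    ```

    The critical section runs to completion, and the interrupt is only observed afterwards, which is exactly why a signal alone cannot break a worker out of the wrong C-level loop.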
  96. Here is gRPC's blocking operation loop, used by Spanner. The

    loop only exits if the operation completes successfully or times out explicitly. If an interrupt occurs, it tries again. gRPC's Completion Queue do { next_call.interrupted = 0; rb_thread_call_without_gvl( grpc_rb_completion_queue_pluck_no_gil, (void*)&next_call, unblock_func, /* ← SIGINT calls this. Sets interrupted = 1. */ (void*)&next_call ); if (next_call.event.type != GRPC_QUEUE_TIMEOUT) break; } while (next_call.interrupted); gRPC のブロッキングループ。割り込みが発生しても成功またはタイムアウトまでリトライを続ける。 Slide 98 of 170 0643-grpc-sigint.md Elapsed: 15:20 Slide: 00:16 🎤 Marc-Andre
  97. Now normally, rb_thread_call_without_gvl checks Ruby's internal pending interrupts and can

    raise an exception on SIGINT or SIGTERM , but in our case, Async has deferred exception handling. So no user-space signal can exit this loop. The process won't respond. gRPC's Completion Queue do { next_call.interrupted = 0; rb_thread_call_without_gvl( grpc_rb_completion_queue_pluck_no_gil, (void*)&next_call, unblock_func, /* ← SIGINT calls this. Sets interrupted = 1. */ (void*)&next_call ); if (next_call.event.type != GRPC_QUEUE_TIMEOUT) break; } while (next_call.interrupted); Async がシグナルを遅延しているため、ユーザー空間のシグナルではこのループを抜けられない。プロセスは応答しない。 Slide 99 of 170 0644-grpc-reset.md Elapsed: 15:36 Slide: 00:22 🎤 Marc-Andre
  98. We updated async-container to use SIGKILL as a final resort.

    SIGKILL tells the operating system to terminate the process immediately, regardless of what C code it's running. New workers spawn. The hung process is gone. And separately, we fixed the query itself, which eliminated the hang at the source. Now I'll hand it to Samuel to walk us through Crisis 3. The fix: send SIGKILL after SIGTERM is ignored. SIGKILL cannot be caught or ignored. 修正:SIGTERM が無視された後にSIGKILL を送信。SIGKILL はキャッチも無視もできない。 Slide 100 of 170 0646-sigkill-fix.md Elapsed: 15:58 Slide: 00:27 🎤 Marc-Andre
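    The escalation itself is straightforward to sketch in plain Ruby. This is an illustrative stand-in, not the async-container implementation: the child traps and ignores INT and TERM (playing the role of a worker stuck in C code), so only SIGKILL, which the kernel enforces unconditionally, reclaims it.

    ```ruby
    require "rbconfig"

    # A child that ignores SIGINT/SIGTERM, standing in for a hung worker.
    # It prints "ready" once its traps are installed, so the parent
    # doesn't race the signal delivery.
    r, w = IO.pipe
    pid = Process.spawn(
      RbConfig.ruby, "-e",
      "trap('INT'){}; trap('TERM'){}; puts 'ready'; $stdout.flush; loop { sleep 1 }",
      out: w
    )
    w.close
    r.gets # wait until the traps are in place

    def alive?(pid)
      Process.kill(0, pid) # signal 0: existence check, nothing delivered
      true
    rescue Errno::ESRCH
      false
    end

    Process.kill(:TERM, pid)
    sleep 0.5
    term_ignored = alive?(pid) # still running: SIGTERM was trapped away

    Process.kill(:KILL, pid)   # cannot be caught or ignored
    Process.wait(pid)
    ```

    After the SIGKILL and the wait, the process slot is gone and the controller is free to spawn a replacement.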
  99. Samuel takes over. Crisis three. The silent death. Crisis 3:

    The Silent Death 静かな死 Slide 101 of 170 0680-crisis3-section.md Elapsed: 16:25 Slide: 00:05 🎤 Samuel
  100. The most insidious bug. Because you can't see it happening.

    The most insidious bug. 最も陰険なバグ。 Slide 102 of 170 0690-insidious.md Elapsed: 16:30 Slide: 00:05 🎤 Samuel
  101. Workers get killed for exceeding memory limits. That's normal. The

    supervisor kills them, the async-container controller spawns new ones. Workers get killed for exceeding memory limits — normal. メモリ制限を超えてワーカーが終了される ー これは正常。 Slide 103 of 170 0700-memory-kill.md Elapsed: 16:35 Slide: 00:09 🎤 Samuel
  102. But in this case, the controller process hangs and is

    unable to replace them. But the controller process hangs instead of restarting them. しかし、スーパーバイザープロセスが代わりを作る代わりにハングする。 Slide 104 of 170 0710-supervisor-hangs.md Elapsed: 16:44 Slide: 00:06 🎤 Samuel
  103. So the pod slowly bleeds workers. Five workers become four.

    Then three. Then two. Each one that dies is never replaced. The pod slowly bleeds workers. Five → four → three → two. ポッドがゆっくりワーカーを失う。5 → 4 → 3 → 2 。 Slide 105 of 170 0720-bleed.md Elapsed: 16:50 Slide: 00:10 🎤 Samuel
  104. And no alerts fire. There's no alarm. Just a slow,

    silent degradation until the pod can't handle any more traffic and starts failing every request. No alerts. Just a slow, silent degradation. アラートなし。静かな劣化だけ。 Slide 106 of 170 0730-no-alerts.md Elapsed: 17:00 Slide: 00:09 🎤 Samuel
  105. We found the root cause: A write wasn't atomic —

    SIGINT could land between the data and the newline. The newline never arrives. The controller hangs forever. Let me show you the code. Root cause: a torn pipe write in async-container . 根本原因: async-container のパイプ書き込みの分断。 Slide 107 of 170 0740-root-cause.md Elapsed: 17:09 Slide: 00:15 🎤 Samuel
  106. Workers communicate with the controller through a notification pipe —

    it's how they signal readiness, health, and status. Let's look at the channel implementation. async/container/channel.rb class Async::Container::Channel def initialize @in, @out = ::IO.pipe end def receive if data = @in.gets return JSON.parse(data, symbolize_names: true) end end end コントローラーはノーティファイパイプで子プロセスと通信する。 Slide 108 of 170 0750-race-explain.md Elapsed: 17:24 Slide: 00:11 🎤 Samuel
  107. We create a pipe for the communication, the child process

    inherits this pipe and the container controller reads from it. async/container/channel.rb class Async::Container::Channel def initialize @in, @out = ::IO.pipe end def receive if data = @in.gets return JSON.parse(data, symbolize_names: true) end end end タイムアウトなし。パイプの読み取りは永遠にブロックする可能性がある。 Slide 109 of 170 0751-race-explain.md Elapsed: 17:35 Slide: 00:08 🎤 Samuel
  108. The receive method calls gets , which reads until it

    sees a newline. async/container/channel.rb class Async::Container::Channel def initialize @in, @out = ::IO.pipe end def receive if data = @in.gets return JSON.parse(data, symbolize_names: true) end end end gets は改行文字が届くまでブロックし続ける。コントローラー全体が止まる。 Slide 110 of 170 0752-race-explain.md Elapsed: 17:43 Slide: 00:05 🎤 Samuel
  109. On the other side — the worker. It writes status

    messages to the pipe using puts . And puts is not always one write — it can be two: the data, then the newline as a separate system call. Between any two Ruby system calls, a pending signal can fire. async/container/notify/pipe.rb class Async::Container::Notify::Pipe def send(**message) data = ::JSON.dump(message) @io.puts(data) # write 1: data ← arrives # write 2: "\n" ← arrives (usually) @io.flush end end ワーカーはノーティファイパイプにメッセージを書き込む。 puts は2 つのRuby 命令で書く。 Slide 111 of 170 0760-race-diagram.md Elapsed: 17:48 Slide: 00:16 🎤 Samuel
  110. SIGINT fires between the two writes. The JSON data arrives.

    The newline doesn't. The controller process hangs waiting for data it will never receive. async/container/notify/pipe.rb class Async::Container::Notify::Pipe def send(**message) data = ::JSON.dump(message) @io.puts(data) # write 1: data ← arrives # ← SIGINT fires here # write 2: "\n" ← never arrives @io.flush end end SIGINT がRuby の命令と命令の間に割り込み、改行が届かない。子プロセスもコントローラーもハングする。 Slide 112 of 170 0761-race-diagram.md Elapsed: 18:04 Slide: 00:13 🎤 Samuel
  111. Fix one: add a timeout. If a complete line hasn't

    arrived within a second, gets gives up and returns nil. The controller moves on. It can restart workers. It can do its job. async/container/channel.rb — fixed class Async::Container::Channel def initialize(timeout: 1.0) @in, @out = ::IO.pipe @in.timeout = timeout end def receive if data = @in.gets return JSON.parse(data, symbolize_names: true) end end end タイムアウトを追加:1 秒後にブロック解除。コントローラーは前に進める。 Slide 113 of 170 0765-fix-code.md Elapsed: 18:17 Slide: 00:12 🎤 Samuel
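    The unblocking behaviour is easy to demonstrate with a bare pipe. One caveat in this sketch: in plain Ruby 3.2+, a timed-out gets surfaces as IO::TimeoutError rather than returning nil, so the caller rescues it and treats it as "no message" — the shape async-container's receive path wraps.

    ```ruby
    # Read-side sketch (Ruby 3.2+): with IO#timeout set, a gets that
    # never sees a newline raises IO::TimeoutError instead of blocking
    # the controller forever.
    r, w = IO.pipe
    r.timeout = 0.2 # give up if no complete line arrives in time

    w.write('{"status":"ready"}') # data, but no trailing newline: a torn write

    timed_out = begin
      r.gets # blocks waiting for "\n" that never comes
      false
    rescue IO::TimeoutError
      true   # controller unblocks and can carry on restarting workers
    end
    ```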
  112. Fix two: make the write atomic. Append the newline to

    the data string before writing, so it goes out in a single syscall. There's no gap for the SIGINT to land in. Either the whole message arrives, or nothing does. async/container/notify/pipe.rb — fixed class Async::Container::Notify::Pipe def send(**message) data = ::JSON.dump(message) << "\n" @io.write(data) # one atomic syscall @io.flush end end 改行をデータに追加してから1 回の書き込み。アトミック。OOM キラーが割り込む隙間がない。 Slide 114 of 170 0766-fix-code.md Elapsed: 18:29 Slide: 00:14 🎤 Samuel
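    The write-side pattern can be shown end to end with an ordinary pipe. A minimal sketch, mirroring the fixed code above: the newline is appended to the buffer first, the message goes out in a single write, and the reader always gets a complete line.

    ```ruby
    require "json"

    # Write-side sketch: newline appended before writing, so the whole
    # message leaves in one write call. Small pipe writes are atomic
    # (up to PIPE_BUF), so there is no gap for a signal to tear them.
    r, w = IO.pipe

    message = { status: "ready" }
    data = JSON.dump(message) << "\n" # newline is part of the one buffer
    w.write(data)                      # one write: all or nothing
    w.flush

    line = r.gets
    parsed = JSON.parse(line, symbolize_names: true)
    ```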
  113. I shipped the fix and deployed it. Watched the pods.

    Workers that died were replaced. The silent death stopped. I shipped the fix. 修正をアップストリームに送信してデプロイした。ポッドを監視した。死んだワーカーは置き換えられた。静かな出血が止まった。 Slide 115 of 170 0770-fix.md Elapsed: 18:43 Slide: 00:07 🎤 Samuel
  114. Marc-Andre takes over. Three crises. Fixes shipped. But then... Act

    III: The Darkest Week 最も暗い一週間 Slide 116 of 170 0780-act3-section.md Elapsed: 18:50 Slide: 00:05 🎤 Marc-Andre
  115. October 14th. I meet with our infra leads. We made

    the call to roll back all major regions to Unicorn, as we fix Falcon. BFCM is six weeks away. We set a deadline — fix Falcon by next Monday, or we stay on Unicorn. October 14th — Roll Back to Unicorn 10 月14 日 ー Unicorn にロールバック Slide 117 of 170 0790-october-14.md Elapsed: 18:55 Slide: 00:21 🎤 Marc-Andre
  116. One week. Two blockers remain. One has a fix that's

    ready but unproven. The other has no solution yet. One Week — Two Blockers 残り一週間。ブロッカーが二つ。 Slide 118 of 170 0830-one-week.md Elapsed: 19:16 Slide: 00:08 🎤 Marc-Andre
  117. Blocker number 1: workers not restarting due to silent death,

    like Samuel has shown. The fix is ready — but it hasn't been tested at scale. We don't know if it holds under real BFCM traffic. 1. Workers not restarting due to silent death (fix ready but unproven) 1. OOM 後にワーカーが再起動しない(修正は準備済みだが未検証) Slide 119 of 170 0840-blocker1.md Elapsed: 19:24 Slide: 00:18 🎤 Marc-Andre
  118. Blocker number 2: Kafka connection cardinality. Now I'll hand it

    to Josh to walk us through this one. 2. Kafka connection cardinality (no solution yet) 2. Kafka 接続のカーディナリティ(まだ解決策なし) Slide 120 of 170 0842-blocker2.md Elapsed: 19:42 Slide: 00:07 🎤 Marc-Andre
  119. Josh takes over. We had one week. One blocker with

    a fix that might work, and one with nothing. This is the story of how we found the nothing. The Outbox アウトボックス Slide 121 of 170 0850-outbox-section.md Elapsed: 19:49 Slide: 00:09 🎤 Josh
  120. Not the segfaults — those we'd fixed. This was a

    different, more structural problem. And if we couldn't solve it, Falcon couldn't ship. The Kafka problem is existential. Kafka の問題は死活問題だ。 Slide 122 of 170 0860-kafka-problem.md Elapsed: 19:58 Slide: 00:07 🎤 Josh
  121. Under Unicorn, each process has one Kafka connection that only

    connects on the first message. One process, one connection. You have twenty workers per pod, you have twenty connections. Under Unicorn: each process has one Kafka connection that lazily connects. Simple. Unicorn では、各プロセスがKafka 接続を1 つだけ持ち、遅延的に接続する。シンプル。 Slide 123 of 170 0870-unicorn-kafka.md Elapsed: 20:05 Slide: 00:11 🎤 Josh
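    The lazy, one-connection-per-process shape is simple to sketch. This is a hypothetical illustration, not the real Kafka client: LazyProducer and its connector block are invented names, and an array stands in for the broker connection. Nothing connects until the first message is delivered.

    ```ruby
    # Hypothetical sketch of the per-process lazy connection pattern:
    # the connection is established on the first message only, so idle
    # workers never show up at the broker.
    class LazyProducer
      def initialize(&connector)
        @connector = connector
        @connection = nil
      end

      def connected?
        !@connection.nil?
      end

      def deliver(message)
        @connection ||= @connector.call # connect on first message only
        @connection << message
      end
    end

    connections_opened = 0
    producer = LazyProducer.new do
      connections_opened += 1
      [] # stand-in for a real Kafka connection
    end

    idle_before = producer.connected? # false: nothing sent yet
    producer.deliver("event-1")
    producer.deliver("event-2")       # reuses the same connection
    ```

    Under Unicorn's skewed traffic, many workers stay on the left side of that `||=` forever, which is why the per-pod connection count stayed low.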
  122. In our experience with Unicorn, low PIDs tend to handle

    more requests, meaning that with twenty workers only the first few handle the majority of traffic and stay very warm. We had depended on that Unicorn behaviour for years and planned around it. Unicorn workers W1 W2 W3 W4 W5 │ │ ╎ ╎ ╎ Kafka 5 つのUnicorn ワーカー。W1 とW2 が接続を確立し、Kafka にメッセージ送信。W3 〜W5 はまだ接続していない。 Slide 124 of 170 0871-unicorn-kafka-diagram.md Elapsed: 20:16 Slide: 00:15 🎤 Josh
  123. From our experience Falcon will more evenly distribute the traffic.

    This is good as it means more workers are warm at a given time. Under Falcon we more evenly distribute traffic. Falcon ではトラフィックがより均等に分散される。 Slide 125 of 170 0880-falcon-kafka.md Elapsed: 20:31 Slide: 00:08 🎤 Josh
  124. The catch: we now establish connections on more of the

    workers, and more quickly. Falcon workers W1 W2 W3 W4 W5 │ │ │ │ │ Kafka Falcon では、より多くのワーカーが素早く接続を確立する。 Slide 126 of 170 0881-falcon-kafka-1.md Elapsed: 20:39 Slide: 00:05 🎤 Josh
  125. We're running thousands of pods. Each pod has around a

    hundred workers. Each worker has its own Kafka connections. The connection count explodes — and it grows linearly with every pod we add. Across thousands of pods: the connection count explodes. ~100 connections per pod. 数千のポッドで接続数が爆発。ポッドあたり約100 接続。 Slide 127 of 170 0890-connection-explode.md Elapsed: 20:44 Slide: 00:14 🎤 Josh
  126. We ran the numbers with the Kafka infrastructure team. At

    BFCM scale — with the traffic we were expecting — we would overwhelm the brokers. They were direct about it: this will not work. You cannot ship Falcon like this. At BFCM scale, we'd overwhelm the Kafka brokers. BFCM の規模では、Kafka ブローカーを圧倒してしまう。 Slide 128 of 170 0900-overwhelm.md Elapsed: 20:58 Slide: 00:11 🎤 Josh
  127. Ilya suggests something simple: What if only one process per

    pod talks to Kafka? All the workers send their messages to that one process. The one process talks to Kafka. The broker only ever sees one connection per pod. Ilya Grigorik suggests an idea: "What if only one process per pod talks to Kafka?" 「ポッドごとに一つのプロセスだけがKafka と通信したら?」 Slide 129 of 170 0910-ilya.md Elapsed: 21:09 Slide: 00:18 🎤 Josh
  128. Marc and I pair on a prototype and Marc gets

    it just ready enough to talk about. It's rough, but it works. Marc and Josh prototype. Josh takes it from prototype to production. Marc が週末でプロトタイプ。Josh がプロトタイプからプロダクションへ。 Slide 130 of 170 0920-prototype.md Elapsed: 21:27 Slide: 00:07 🎤 Josh
  129. I take it and build it into something we can

    actually ship — proper error handling, retry logic, load testing etc. We call it Outbox. And we have four weeks to test, scale test and roll it out everywhere before BFCM. Marc and Josh prototype. Josh takes it from prototype to production. Marc が週末でプロトタイプ。Josh がプロトタイプからプロダクションへ。 Slide 131 of 170 0921-prototype-2.md Elapsed: 21:34 Slide: 00:12 🎤 Josh
  130. Each Falcon worker sends its Kafka messages to a single

    Outbox process in the same pod — over HTTP. Outbox buffers them, then forwards one batched connection to Kafka. From the broker's perspective, it's one connection per pod. Not a hundred. Web workers W1 W2 W3 W4 ↓ ↓ ↓ ↓ HTTP Outbox msg ↓ batch Kafka 全ワーカーが一つのOutbox プロセスにHTTP で送信。Outbox がKafka にバッチ送信。 Slide 132 of 170 0930-outbox-diagram.md Elapsed: 21:46 Slide: 00:18 🎤 Josh
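    The shape of that fan-in can be sketched with threads and a queue. This is a toy illustration, not Shopify's Outbox: the real one receives over HTTP inside the pod and batches to Kafka, whereas here a Queue stands in for the HTTP hop, an array for the single broker connection, and the batch size of 10 is arbitrary.

    ```ruby
    # Toy Outbox: many worker threads hand messages to one forwarder,
    # and only the forwarder "talks to the broker".
    outbox = Queue.new
    delivered = [] # stand-in for the single per-pod Kafka connection

    forwarder = Thread.new do
      batch = []
      loop do
        msg = outbox.pop
        if msg == :shutdown
          delivered << batch unless batch.empty? # final flush
          break
        end
        batch << msg
        if batch.size >= 10 # flush in batches, not per message
          delivered << batch
          batch = []
        end
      end
    end

    # Four "workers", five messages each — 20 messages total.
    4.times.map { |i|
      Thread.new { 5.times { |n| outbox << "worker-#{i}-msg-#{n}" } }
    }.each(&:join)

    outbox << :shutdown
    forwarder.join
    ```

    Every message still arrives, but the broker-facing side sees a handful of batched sends from one place instead of a connection per worker.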
  131. This dramatically reduces the load on Kafka. Every worker's messages

    flow through one process per pod. Instead of hundreds of connections hitting the brokers, they see almost nothing. We test it on a small cluster first — and the numbers are exactly what we hoped. ~1 connection per worker → ~1 connection per pod ワーカーあたり約1 接続 → ポッドあたり約1 接続 Slide 133 of 170 0940-outbox-result.md Elapsed: 22:04 Slide: 00:14 🎤 Josh
  132. Here's a view of what that looked like. You

    can see the connection count drop the moment the Outbox rolls out. Before — a wall of connections. After — almost nothing. Kafka Connections Kafka 接続数 Slide 134 of 170 0941-outbox-observe.md Elapsed: 22:18 Slide: 00:10 🎤 Josh
  133. The go/no-go meeting arrives. We walk in with two things:

    Samuel's fix for the silent death deployed and holding. And Outbox, running on a small cluster with numbers we can actually show. We put them on the screen and walk through our changes. The go/no-go meeting arrives. We show the numbers. ゴー/ ノーゴー会議が来る。数字を見せる。 Slide 135 of 170 0950-go-nogo.md Elapsed: 22:28 Slide: 00:14 🎤 Josh
  134. We show the numbers. The fixes are holding. The Outbox

    works. We get the green light to go ahead. Green Light ✓ ゴーサイン ✓ Slide 136 of 170 0960-green-light.md Elapsed: 22:42 Slide: 00:06 🎤 Josh
  135. Four weeks. That's what we have between the green light

    and BFCM. We roll the Outbox out cluster by cluster — carefully, watching the connection numbers after each one. Every rollout confirms the same thing: it works. Connection count drops. Kafka team signs off on each region. The next 4 weeks are a sprint. Outbox rolls out cluster by cluster, phase by phase. 次の4 週間はスプリント。Outbox をクラスターごと、フェーズごとに展開。 Slide 137 of 170 0970-sprint.md Elapsed: 22:48 Slide: 00:19 🎤 Josh
  136. Mid-sprint, we ran Scale Test 5 and a Kafka regional

    failover test at full scale. Traffic more than doubled. Outbox handled it. Message drop rate was nearly zero (0.00001%). The Kafka team confirmed: it works. That was the moment the last doubt went away. Mid-sprint: Scale Test 5. Kafka failover. Traffic doubles. Almost nothing drops. スプリント中盤:スケールテスト5 。Kafka フェイルオーバー。トラフィックが倍増。ほぼ何もドロップしない。 Slide 138 of 170 0975-scale-test-5.md Elapsed: 23:07 Slide: 00:16 🎤 Josh
  137. November fourteenth. The Outbox is live on every cluster, every

    region. Two weeks before BFCM. For the first time since October, we feel like we're going to make it. November 14th: Outbox is live everywhere. 11 月14 日:Outbox がすべてのクラスターで稼働。 Slide 139 of 170 0980-nov14.md Elapsed: 23:23 Slide: 00:12 🎤 Josh
  138. The Kafka team sends us the real numbers. Real numbers:

    NA region 1 dropped from ~40M to ~27M connections; NA region 2 from ~34M to ~18M. That's not just "Outbox works." That's Outbox making the whole fleet leaner than it was even under Unicorn. NA region 1: ~1.5× fewer connections NA region 2: ~1.9× fewer connections Slide 140 of 170 0990-kafka-impact.md Elapsed: 23:35 Slide: 00:22 🎤 Josh
  139. Josh continues. Everything we built, everything we fixed, everything we

    shipped — it was all for this. Scale tests. Three crises. A rollback. Outbox. A four-week sprint. And now here it is. BFCM. Act IV: BFCM ブラックフライデー Slide 141 of 170 1000-act4-section.md Elapsed: 23:57 Slide: 00:11 🎤 Josh
  140. November twenty-eighth, twenty twenty-five. Black Friday. November 28, 2025 —

    Black Friday ブラックフライデー Slide 142 of 170 1010-date.md Elapsed: 24:08 Slide: 00:04 🎤 Josh
  141. Traffic ramps up. The dashboards light up. And we watch.

    Traffic ramps. The dashboards light up. We watch. トラフィックが増加。ダッシュボードが点灯。見守る。 Slide 143 of 170 1020-ramps.md Elapsed: 24:12 Slide: 00:03 🎤 Josh
  142. Eighty-five million requests per minute at peak. The dashboards are

    lit up. We're watching every number. 85,000,000 requests per minute at peak ピーク時の毎分8500 万リクエスト Slide 144 of 170 1030-85m.md Elapsed: 24:15 Slide: 00:08 🎤 Josh
  143. One hundred and twenty thousand long tasks per minute. Fibers

    picking up work while other requests wait. Exactly as designed. 120,000 long tasks per minute — fibers working exactly as designed 毎分12 万のロングタスク ー 設計通りにファイバーが動作 Slide 145 of 170 1040-120k-tasks.md Elapsed: 24:23 Slide: 00:10 🎤 Josh
  144. One hundred and ten million Kafka messages per minute —

    every single one flowing through the Outbox. Six weeks ago, this would have crushed the brokers. Today, the brokers barely notice. 110,000,000 Kafka messages per minute via Outbox Outbox 経由の毎分1 億1000 万Kafka メッセージ Slide 146 of 170 1050-outbox-msgs.md Elapsed: 24:33 Slide: 00:12 🎤 Josh
  145. All with fewer connections than Unicorn ever used. Outbox is

    working exactly as designed. 110,000,000 Kafka messages per minute via Outbox Outbox 経由の毎分1 億1000 万Kafka メッセージ Slide 147 of 170 1051-outbox-msgs-1.md Elapsed: 24:45 Slide: 00:06 🎤 Josh
  146. Response times: flat. Which is exactly what we expected. Response

    times: flat. レスポンスタイム:フラット。 Slide 148 of 170 1070-response-times.md Elapsed: 24:51 Slide: 00:04 🎤 Josh
  147. We see some P99 bumps on GraphQL when upstream services

    slow down, but that's expected. That's not us. That's upstream. Some P99 bumps on GraphQL when upstream services slow down. But that's expected. アップストリームサービスが遅くなるとGraphQL のP99 にバンプ。しかし、これは想定内。 Slide 149 of 170 1080-p99.md Elapsed: 24:55 Slide: 00:08 🎤 Josh
  148. Memory terminations: stable and consistent. The silent death bug is

    fixed. Workers die and are replaced, life goes on. Memory terminations: stable and consistent. メモリ終了:安定して一貫している。 Slide 150 of 170 1090-memory.md Elapsed: 25:03 Slide: 00:07 🎤 Josh
  149. The graphs stay flat. The graphs stay flat. グラフはフラットのまま。 Slide

    152 of 170 1110-graphs-flat.md Elapsed: 25:13 Slide: 00:02 🎤 Josh
  150. On December third, I posted to the channel: "Congrats folks

    on making it through BFCM on Falcon. We wouldn't have been able to do it without everyone here." I'll pass it to Marc to continue going through the numbers. "We wouldn't have been able to do it without everyone here." 「ここにいる全員なしでは成し遂げられなかった。 」 Slide 153 of 170 1120-josh-quote.md Elapsed: 25:15 Slide: 00:12 🎤 Josh
  151. 9.7% better throughput / core compared to Unicorn. 14 more

    requests/minute/core. 9.7% better throughput per core vs Unicorn 158 vs 144 req/min/core (+14) コアあたりのスループットがUnicorn より9.7% 向上。 Slide 154 of 170 1160-throughput.md Elapsed: 25:27 Slide: 00:14 🎤 Marc-Andre
  152. 14% better at P99 latency. The tail got shorter. 14%

    better at P99 latency P99 レイテンシが14% 改善。 Slide 155 of 170 1170-p99-improvement.md Elapsed: 25:41 Slide: 00:07 🎤 Marc-Andre
  153. Kafka connections down 35% fleet-wide. The Outbox was built for

    a crisis, but it turned out to be a straight-up improvement. Kafka connections reduced ~35% fleet-wide Kafka 接続をフリート全体で約35% 削減。 Slide 156 of 170 1190-kafka-reduction.md Elapsed: 25:48 Slide: 00:07 🎤 Marc-Andre
  154. And, again, 329 billion requests over the BFCM weekend. On

    a web server we'd never run at this scale. 329 billion requests over BFCM weekend BFCM 週末に3290 億リクエスト。 Slide 157 of 170 1191-329b-again.md Elapsed: 25:55 Slide: 00:08 🎤 Marc-Andre
  155. 5 things we learned from all this... Lessons 教訓 Slide

    158 of 170 1210-lessons-title.md Elapsed: 26:03 Slide: 00:02 🎤 Marc-Andre
  156. 1. Native code that doesn't know about cooperative scheduling will

    catch you off guard. Audit your dependencies before you go anywhere near production with Fibers. 1. C extensions are a big risk when adopting fibers. C 拡張はファイバー導入における大きなリスク。 Slide 159 of 170 1220-lesson1.md Elapsed: 26:05 Slide: 00:12 🎤 Marc-Andre
  157. 2. Scale tests save you. We ran 5 of them.

    Each one uncovered a different class of failure. Without them, we'd have discovered 3 bugs on BFCM day. If you have a high-stakes event coming — test early, and make each run harder than the last. 2. Scale tests save you. スケールテストが救ってくれる。 Slide 160 of 170 1230-lesson2.md Elapsed: 26:17 Slide: 00:17 🎤 Marc-Andre
  158. 3. Roll back without ego. We pulled Falcon from production

    six weeks before BFCM. It gave us time to focus on fixing things properly instead of reacting to fires. 3. Roll back without ego. プライドを捨ててロールバックする。 Slide 161 of 170 1240-lesson3.md Elapsed: 26:34 Slide: 00:15 🎤 Marc-Andre
  159. 4. Pressure creates invention. The Outbox was born from a

    crisis with a deadline. From prototype to production in 4 weeks. Sometimes the best solutions come when there's no time for the perfect one. 4. Pressure creates invention. プレッシャーが発明を生む。 Slide 162 of 170 1250-lesson4.md Elapsed: 26:49 Slide: 00:12 🎤 Marc-Andre
  160. 5. Contribute upstream. The Ruby ecosystem got better because we

    pushed Falcon this hard, and contributed the fixes back, for everyone to use. 5. Contribute upstream. アップストリームに貢献する。 Slide 163 of 170 1260-lesson5.md Elapsed: 27:01 Slide: 00:11 🎤 Marc-Andre
  161. What's next? The work doesn't stop after BFCM. What's Next?

    次のステップ Slide 164 of 170 1270-whats-next-title.md Elapsed: 27:12 Slide: 00:04 🎤 Marc-Andre
  162. Falcon unlocked Rack 3 support. Unicorn never supported it. Rack

    3 — finally possible. Unicorn doesn't support it. Falcon does. Rack 3 ー ついに可能に。Unicorn はRack 3 をサポートしない。Falcon はサポートする。 Slide 165 of 170 1281-rack3.md Elapsed: 27:16 Slide: 00:08 🎤 Marc-Andre
  163. Falcon lets us mix workloads in the same process —

    Liquid template rendering, GraphQL, and long I/O operations. This gives us better utilization, lower cost, and simplifies our infrastructure. Better concurrency controls open the door to combining workloads. Same container. Better utilization. Lower cost. より高度な並行制御により、ワークロードの統合が可能に。同じコンテナ。より高い利用率。コスト削減。 Slide 166 of 170 1285-combining-workloads.md Elapsed: 27:24 Slide: 00:15 🎤 Marc-Andre
  164. We've proven Falcon can handle Shopify's scale. That opens the

    door — both for other teams inside Shopify, and for anyone else in the Ruby world who's been waiting to make the jump. Falcon is proven at scale. The door is open to further adoption. Falcon は大規模で実証済み。さらなる採用への道が開かれた。 Slide 167 of 170 1290-further-adoption.md Elapsed: 27:39 Slide: 00:15 🎤 Marc-Andre
  165. If you want to go deeper — on Falcon, Async,

    the Outbox, fiber schedulers, or anything else we talked about today — we'd love to talk more. Come find us. Thank you! Oh and one more thing... Pass over to Samuel Questions? Samuel Williams (ioquatix) · Josh Teeter (whatisinternet) · Marc-Andre Cournoyer (macournoyer) 質問はありますか? Slide 168 of 170 1300-questions.md Elapsed: 27:54 Slide: 00:16 🎤 Marc-Andre
  166. Samuel gestures at the screen. This whole talk — live

    reloading, the animations, the diagrams — all running on presently , which is built on top of async and lively . Same stack, different application. This presentation runs on the same stack as Falcon — via presently . このプレゼンテーションはFalcon と同じスタックで動いています — presently を通して。 Slide 169 of 170 1305-presently.md Elapsed: 28:10 Slide: 00:13 🎤 Samuel
  167. If you'd like to go deeper on Falcon, Async, or

    building real-time web applications with Ruby, please come to the Async Cafe Workshop. Scan the QR code or visit the RubyKaigi event page for more information. Looking forward to hanging out! All bow Async Cafe Workshop — Ruby によるリアルタイムWeb アプリ開発 / https://connpass.com/event/390661/ Slide 170 of 170 1310-async-cafe.md Elapsed: 28:23 Slide: 00:17 🎤 Samuel