If You Didn't Record It, It Didn't Happen: Practical Observability for Rails

Your Rails app is in production. A user says it’s slow. Another says a feature just broke. You need answers fast.
This talk shows how to make your app observable: capturing metrics, logs, and traces, then feeding them into open source monitoring tools. You’ll see what’s worth recording, how to avoid drowning in noise, and how to set alerts that lead straight to the cause.
Walk away ready to see inside your app when it matters most and fix problems before everything is on fire.

Aaron Cruz

February 01, 2026

Transcript

  1. From: Sarah Chen <[email protected]>
     To: [email protected]
     Subject: Missing commits - invoice due today
     Tuesday, 9:47 AM

     Hi, I pushed commits yesterday but they're not showing up on my invoice. Last week everything synced fine. Now some commits appear within minutes, others never show up at all. I need to send this invoice today. Sarah
  2. sarahchen/marketly-frontend · 14 commits in January
     a1f Fix navbar responsive · 2 hours ago
     b92 Update dependencies · yesterday
     f71 Add dark mode toggle · yesterday
     ... (11 more)

     GitHub: 14 commits
     BillMyCommits: 12 commits
  3. $ tail -f log/production.log
     [09:41:23] INFO Started POST "/webhooks/github"
     [09:41:23] INFO Parameters: {"ref"=>"refs/heads/main"...
     [09:41:24] INFO Completed 200 OK in 142ms
     [09:41:52] INFO Started GET "/dashboard"
     [09:41:52] INFO User Load (0.4ms) SELECT "users"...
     [09:41:53] INFO Completed 200 OK in 89ms
     [... scrolling ...]
  4. What I know:
     • Something is wrong
     • No errors in logs
     • All jobs succeeded

     What I don't know:
     • What's actually failing
     • How many users affected
     • How long this has been happening
  5. How Prometheus Works

     Most monitoring: YOU push data out
     Your App ──(push)──▶ DataDog / New Relic / CloudWatch
     ✗ Configure endpoints, API keys, credentials
     ✗ Your app handles failures, retries, buffering

     Prometheus: IT pulls from you
     Your App ──(expose /metrics)   Prometheus scrapes every 15s
     ✓ Just expose a route, like any other endpoint
     ✓ No credentials, no external config
     ✓ Prometheus handles retries and failures
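
     Exposing that route is typically a one-liner. A minimal sketch of the routing side, assuming the yabeda-prometheus gem named later in the deck is installed:

     # config/routes.rb -- sketch, assuming yabeda-prometheus is in the Gemfile
     Rails.application.routes.draw do
       # Yabeda::Prometheus::Exporter is a Rack app that renders every registered
       # metric in the Prometheus text exposition format at GET /metrics.
       mount Yabeda::Prometheus::Exporter, at: "/metrics"
     end
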
  6. In Practice

     GET /metrics
     github_sync_cache_hits_total 1247
     github_sync_cache_misses_total 89
     rails_request_duration_seconds 0.127

     That's it. Prometheus config (not in your app):

     scrape_configs:
       - job_name: 'billmycommits'
         static_configs:
           - targets: ['localhost:3000']

     Your app doesn't know Prometheus exists.
  7. Prometheus aggregates at WRITE time

     LOGS (aggregate at read):
     [09:41:23] Request to /dashboard took 127ms
     [09:41:24] Request to /invoices took 843ms
     ...store everything, query later

     PROMETHEUS (aggregate at write):
     http_requests_total{path="/dashboard"} 1,247
     http_request_duration_sum 142.3
     ...store aggregates, query summaries
  8. What this means:

     ✗ Can't ask: "Show me all requests over 1 second"
       → Prometheus doesn't have individual request data
     ✗ Can't ask: "What's the average for user 847?"
       → Already aggregated across all users
     ✗ Can't ask: "What was the slowest request yesterday?"
       → No individual timing data stored

     If you didn't instrument it upfront, the data doesn't exist.
  9. What you CAN track:

     ✓ Counters: "How many?"
       http_requests_total{path="/dashboard"} 1,247
     ✓ Gauges: "What's the current value?"
       active_users_count 847
     ✓ Histograms: "How many fell into each bucket?"
       http_duration_bucket{le="100ms"} 823
       http_duration_bucket{le="500ms"} 1,201
       http_duration_bucket{le="1000ms"} 1,247
       → Approximate percentiles from buckets
  10. THE TRADEOFF

     ✓ UPSIDE: Scales to billions · Constant memory · Fast queries
     ✗ DOWNSIDE: Plan metrics ahead · Can't ask new questions · Less flexible

     For our GitHub sync problem? We need: cache hits, cache misses, API call rate.
     That's specific. That's measurable. That's perfect for Prometheus.
  11. Custom Metrics

     # config/initializers/yabeda.rb
     Yabeda.configure do
       group :github_sync do
         counter :cache_hits
         counter :cache_misses
         counter :api_calls, tags: [:status]
         histogram :cache_age_hours, buckets: [1, 6, 12, 24, 48]
       end
     end
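
     Once the metrics are declared, the sync job increments them wherever the cache decision happens. A sketch of that instrumentation: the metric names and SyncCommitsJob come from the deck, the helper methods and response object are hypothetical.

     # app/jobs/sync_commits_job.rb -- sketch of the Yabeda calls
     class SyncCommitsJob < ApplicationJob
       def perform(user)
         age_hours = (Time.current - user.commits_cached_at) / 1.hour
         Yabeda.github_sync.cache_age_hours.measure({}, age_hours)

         if cache_valid?(user)                        # hypothetical helper
           Yabeda.github_sync.cache_hits.increment({})
         else
           Yabeda.github_sync.cache_misses.increment({})
           response = fetch_commits_from_github(user) # hypothetical helper
           Yabeda.github_sync.api_calls.increment({ status: response.status.to_s })
         end
       end
     end
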
  12. Cache Hit Rate (24h)

     [Chart: cache hit rate by day, Mon through Fri, y-axis 0–100%; annotations "↑ Sarah's email" and "2 days since cache updated".]
  13. What I now know:
     • We're never fetching fresh data

     What I still don't know:
     • What's the cache invalidation logic?

     Metrics told me WHAT. I need traces to tell me WHY.
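
     Getting those traces out of a Rails app is mostly configuration. A minimal sketch, assuming the opentelemetry gems listed at the end of the deck plus an OTLP exporter pointing at a local Jaeger collector:

     # config/initializers/opentelemetry.rb -- sketch, assuming opentelemetry-sdk,
     # opentelemetry-exporter-otlp, and opentelemetry-instrumentation-all are installed
     require "opentelemetry/sdk"
     require "opentelemetry/exporter/otlp"
     require "opentelemetry/instrumentation/all"

     OpenTelemetry::SDK.configure do |c|
       c.service_name = "billmycommits"
       # Auto-instruments Rails, ActiveRecord, ActiveJob, Net::HTTP, Redis, ...
       c.use_all
     end
     # Spans ship over OTLP; point OTEL_EXPORTER_OTLP_ENDPOINT at your collector,
     # e.g. a local Jaeger instance listening on http://localhost:4318
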
  14. JAEGER — SyncCommitsJob traces

     Trace: abc123    8ms
     Trace: def456   12ms
     Trace: ghi789  847ms  ← This one called GitHub (847ms)
     Trace: jkl012    9ms

     8ms, 12ms, 9ms = cache hits. Let's click into one.
  15. Trace: abc123 — Total: 8ms

     SyncCommitsJob.perform — 8ms
       github.fetch_commits — 6ms
         Redis GET — 2ms

     Attributes:
       user.id: 847
       cache.hit: true
       cache.age_hours: 47

     47 hours old. Still treated as valid.
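
     Those user.id / cache.hit / cache.age_hours attributes have to be attached to the current span by the job itself. A sketch with the OpenTelemetry Ruby API; the attribute names come from the trace above, the local variables are hypothetical:

     # Inside the sync job, after the cache lookup -- sketch
     span = OpenTelemetry::Trace.current_span
     span.add_attributes(
       "user.id"         => user.id,
       "cache.hit"       => cache_hit,             # hypothetical local variable
       "cache.age_hours" => cache_age_hours.round  # hypothetical local variable
     )
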
  16. Trace: ghi789 — Total: 847ms

     SyncCommitsJob.perform — 847ms
       HTTP GET api.github.com — 834ms
         http.status_code: 304

     GitHub says "Not Modified" — but our cache is 2 days old!
     The bug: We trust GitHub's 304 without checking cache age.
  17. The Picture Emerges

     Metrics told us: Cache hit rate is 100% => never fetching fresh data
     Traces told us: When we DO call GitHub, it returns 304
     Cache is 47 hours old but we think it's "valid"

     Still need: WHO is affected? HOW MANY?
  18. Plain text:
     [09:41:23] INFO GitHub API returned 304...

     Structured JSON:
     {
       "timestamp": "2026-01-14T09:41:23Z",
       "user_id": 847,
       "trace_id": "abc123",
       "cache_hit": true,
       "cache_age_hours": 47,
       "github_status": 304
     }
  19. Query: cache_age_hours > 24

     Results: 847 log entries across 126 users
     User 847:  cache age 47h (Sarah Chen)
     User 291:  cache age 36h
     User 1042: cache age 29h
     ...

     126 users affected
  20. THE BUG

     ETags answer: "Has the CONTENT changed?"
     We assumed: "Is the CACHE still fresh?"

     GitHub's commits didn't change → 304 → we keep stale cache

     The fix: Check cache age FIRST, before trusting the ETag. If older than 24 hours, refresh anyway.
  21. The Fix (4 lines)

     def cache_valid?(user)
       return false if cache_expired?(user)  # NEW!
       return true if etag_matches?(user)
       false
     end

     def cache_expired?(user)
       user.commits_cached_at < 24.hours.ago
     end
  22. The Timeline

      9:47 AM  Sarah's email
     10:15 AM  Metrics: 100% cache hit
     10:30 AM  Traces: 304 + stale cache
     10:45 AM  Logs: 126 users affected
     11:00 AM  Root cause found
     11:20 AM  Fix deployed

     93 minutes from email to fix
  23. What Traditional Alerting Sees

     ✓ Error rate > 1%    Not firing (0%)
     ✓ p99 > 500ms        Not firing (127ms)
     ✓ Failed jobs > 5    Not firing (0)

     ALL GREEN ✓
     126 users affected
  24. The Shift

     SYMPTOM ALERTS (after the damage):
     "Errors are happening" · "Users are complaining" · "Jobs are failing"
     Reactive

     BEHAVIOR ALERTS (before the damage):
     "We stopped calling GitHub" · "Cache is getting stale" · "Hit rate too high"
     Proactive
  25. THE NEW TUESDAY

     3:47 AM  Alert: Cache hit rate 100% for 1 hour
     3:52 AM  Grafana → Jaeger → Logs: found the issue
     4:15 AM  Fix deployed
     9:47 AM  Sarah generates invoice. All commits present.

     She never emails. Because there's nothing wrong.
  26. Anti-pattern #1

     ✗ "No errors = no problems"
     ✓ Track SUCCESS indicators, not just failures
  27. Quick wins for Monday

     1. RED metrics on every endpoint (Rate, Errors, Duration)
     2. Cache age metrics
     3. Upstream call rate ("Are we talking to dependencies?")
     4. trace_id in structured logs (connects logs ↔ traces)

     Total: ~2 hours setup
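
     Point 4 is the glue between the logging and tracing setups sketched above. One way to stamp the current trace_id onto every structured log line inside the job; SemanticLogger's tagged block and the OpenTelemetry span context are real APIs, the placement is an assumption:

     # Around the work in the sync job -- sketch
     span_context = OpenTelemetry::Trace.current_span.context
     SemanticLogger.tagged(trace_id: span_context.hex_trace_id) do
       logger.info("Syncing commits", user_id: user.id)
       # ... cache lookup, GitHub call, etc. -- every log line in here carries trace_id
     end
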
  28. THIS WEEK

     Monday     yabeda-rails + prometheus            30 min
     Tuesday    opentelemetry-instrumentation-all    20 min
     Wednesday  semantic_logger                      15 min
     Thursday   One "silence" alert                  10 min

     Total: ~75 minutes to production observability
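
     In Gemfile terms the whole week is a handful of gems. A sketch; the companion gems (yabeda-prometheus, the OpenTelemetry SDK and OTLP exporter, rails_semantic_logger) are assumptions about how the libraries named in the deck are usually wired into Rails:

     # Gemfile -- sketch of the stack named in the deck
     gem "yabeda-rails"                      # RED metrics for every controller action
     gem "yabeda-prometheus"                 # serves /metrics in Prometheus format
     gem "opentelemetry-sdk"
     gem "opentelemetry-exporter-otlp"
     gem "opentelemetry-instrumentation-all" # auto-instrumentation for Rails, AR, Net::HTTP, ...
     gem "rails_semantic_logger"             # structured JSON logging (wraps semantic_logger)
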
  29. RESOURCES

     This talk + demo: github.com/mraaroncruz/observability-demo

     The gems:
     • yabeda-rails / yabeda-prometheus
     • opentelemetry-instrumentation-all
     • semantic_logger

     @mraaroncruz · billmycommits.com