If You Didn't Record It, It Didn't Happen: Practical Observability for Rails

Your Rails app is in production. A user says it’s slow. Another says a feature just broke. You need answers fast.
This talk shows how to make your app observable: capturing metrics, logs, and traces, then feeding them into open source monitoring tools. You’ll see what’s worth recording, how to avoid drowning in noise, and how to set alerts that lead straight to the cause.
Walk away ready to see inside your app when it matters most and fix problems before everything is on fire.

Aaron Cruz

February 01, 2026

Transcript

  1. From: Sarah Chen <[email protected]>
     To: [email protected]
     Subject: Missing commits - invoice due today
     Tuesday, 9:47 AM

     Hi, I pushed commits yesterday but they're not showing up on my invoice. Last week everything synced fine. Now some commits appear within minutes, others never show up at all. I need to send this invoice today. Sarah
  2. sarahchen/marketly-frontend · 14 commits in January
     a1f Fix navbar responsive · 2 hours ago
     b92 Update dependencies · yesterday
     f71 Add dark mode toggle · yesterday
     ... (11 more)

     GitHub: 14 commits
     BillMyCommits: 12 commits
  3. $ tail -f log/production.log
     [09:41:23] INFO Started POST "/webhooks/github"
     [09:41:23] INFO Parameters: {"ref"=>"refs/heads/main"...
     [09:41:24] INFO Completed 200 OK in 142ms
     [09:41:52] INFO Started GET "/dashboard"
     [09:41:52] INFO User Load (0.4ms) SELECT "users"...
     [09:41:53] INFO Completed 200 OK in 89ms
     [... scrolling ...]
  4. What I know:
     • Something is wrong
     • No errors in logs
     • All jobs succeeded

     What I don't know:
     • What's actually failing
     • How many users affected
     • How long this has been happening
  5. How Prometheus Works

     Most monitoring: YOU push data out
     Your App ──(push)──▶ DataDog / New Relic / CloudWatch
     ✗ Configure endpoints, API keys, credentials
     ✗ Your app handles failures, retries, buffering

     Prometheus: IT pulls from you
     Your App ──(expose /metrics)   Prometheus scrapes every 15s
     ✓ Just expose a route, like any other endpoint
     ✓ No credentials, no external config
     ✓ Prometheus handles retries and failures
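
     Exposing that route is typically a one-liner. A minimal sketch of the routing side, assuming the yabeda-prometheus gem named later in the deck is installed:

     # config/routes.rb -- sketch, assuming yabeda-prometheus is in the Gemfile
     Rails.application.routes.draw do
       # Yabeda::Prometheus::Exporter is a Rack app that renders every registered
       # metric in the Prometheus text exposition format at GET /metrics.
       mount Yabeda::Prometheus::Exporter, at: "/metrics"
     end
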
  6. In Practice

     GET /metrics
     github_sync_cache_hits_total 1247
     github_sync_cache_misses_total 89
     rails_request_duration_seconds 0.127

     That's it. Prometheus config (not in your app):

     scrape_configs:
       - job_name: 'billmycommits'
         static_configs:
           - targets: ['localhost:3000']

     Your app doesn't know Prometheus exists.
  7. Prometheus aggregates at WRITE time

     LOGS (aggregate at read):
     [09:41:23] Request to /dashboard took 127ms
     [09:41:24] Request to /invoices took 843ms
     ...store everything, query later

     PROMETHEUS (aggregate at write):
     http_requests_total{path="/dashboard"} 1,247
     http_request_duration_sum 142.3
     ...store aggregates, query summaries
  8. What this means:

     ✗ Can't ask: "Show me all requests over 1 second"
       → Prometheus doesn't have individual request data
     ✗ Can't ask: "What's the average for user 847?"
       → Already aggregated across all users
     ✗ Can't ask: "What was the slowest request yesterday?"
       → No individual timing data stored

     If you didn't instrument it upfront, the data doesn't exist.
  9. What you CAN track:

     ✓ Counters: "How many?"
       http_requests_total{path="/dashboard"} 1,247
     ✓ Gauges: "What's the current value?"
       active_users_count 847
     ✓ Histograms: "How many fell into each bucket?"
       http_duration_bucket{le="100ms"} 823
       http_duration_bucket{le="500ms"} 1,201
       http_duration_bucket{le="1000ms"} 1,247
       → Approximate percentiles from buckets
  10. THE TRADEOFF

     ✓ UPSIDE: Scales to billions · Constant memory · Fast queries
     ✗ DOWNSIDE: Plan metrics ahead · Can't ask new questions · Less flexible

     For our GitHub sync problem? We need: cache hits, cache misses, API call rate.
     That's specific. That's measurable. That's perfect for Prometheus.
  11. Custom Metrics

     # config/initializers/yabeda.rb
     Yabeda.configure do
       group :github_sync do
         counter :cache_hits
         counter :cache_misses
         counter :api_calls, tags: [:status]
         histogram :cache_age_hours, buckets: [1, 6, 12, 24, 48]
       end
     end
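
     Once the metrics are declared, the sync job increments them wherever the cache decision happens. A sketch of that instrumentation: the metric names and SyncCommitsJob come from the deck, the helper methods and response object are hypothetical.

     # app/jobs/sync_commits_job.rb -- sketch of the Yabeda calls
     class SyncCommitsJob < ApplicationJob
       def perform(user)
         age_hours = (Time.current - user.commits_cached_at) / 1.hour
         Yabeda.github_sync.cache_age_hours.measure({}, age_hours)

         if cache_valid?(user)                        # hypothetical helper
           Yabeda.github_sync.cache_hits.increment({})
         else
           Yabeda.github_sync.cache_misses.increment({})
           response = fetch_commits_from_github(user) # hypothetical helper
           Yabeda.github_sync.api_calls.increment({ status: response.status.to_s })
         end
       end
     end
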
  12. Cache Hit Rate (24h)

     [Chart: cache hit rate by day, Mon through Fri, y-axis 0–100%; annotations "↑ Sarah's email" and "2 days since cache updated".]
  13. What I now know:
     • We're never fetching fresh data

     What I still don't know:
     • What's the cache invalidation logic?

     Metrics told me WHAT. I need traces to tell me WHY.
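
     Getting those traces out of a Rails app is mostly configuration. A minimal sketch, assuming the opentelemetry gems listed at the end of the deck plus an OTLP exporter pointing at a local Jaeger collector:

     # config/initializers/opentelemetry.rb -- sketch, assuming opentelemetry-sdk,
     # opentelemetry-exporter-otlp, and opentelemetry-instrumentation-all are installed
     require "opentelemetry/sdk"
     require "opentelemetry/exporter/otlp"
     require "opentelemetry/instrumentation/all"

     OpenTelemetry::SDK.configure do |c|
       c.service_name = "billmycommits"
       # Auto-instruments Rails, ActiveRecord, ActiveJob, Net::HTTP, Redis, ...
       c.use_all
     end
     # Spans ship over OTLP; point OTEL_EXPORTER_OTLP_ENDPOINT at your collector,
     # e.g. a local Jaeger instance listening on http://localhost:4318
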
  14. JAEGER — SyncCommitsJob traces

     Trace: abc123    8ms
     Trace: def456   12ms
     Trace: ghi789  847ms  ← This one called GitHub (847ms)
     Trace: jkl012    9ms

     8ms, 12ms, 9ms = cache hits. Let's click into one.
  15. Trace: abc123 — Total: 8ms

     SyncCommitsJob.perform — 8ms
       github.fetch_commits — 6ms
         Redis GET — 2ms

     Attributes:
       user.id: 847
       cache.hit: true
       cache.age_hours: 47

     47 hours old. Still treated as valid.
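
     Those user.id / cache.hit / cache.age_hours attributes have to be attached to the current span by the job itself. A sketch with the OpenTelemetry Ruby API; the attribute names come from the trace above, the local variables are hypothetical:

     # Inside the sync job, after the cache lookup -- sketch
     span = OpenTelemetry::Trace.current_span
     span.add_attributes(
       "user.id"         => user.id,
       "cache.hit"       => cache_hit,             # hypothetical local variable
       "cache.age_hours" => cache_age_hours.round  # hypothetical local variable
     )
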
  16. Trace: ghi789 — Total: 847ms

     SyncCommitsJob.perform — 847ms
       HTTP GET api.github.com — 834ms
         http.status_code: 304

     GitHub says "Not Modified" — but our cache is 2 days old!
     The bug: We trust GitHub's 304 without checking cache age.
  17. The Picture Emerges

     Metrics told us: Cache hit rate is 100% => never fetching fresh data
     Traces told us: When we DO call GitHub, it returns 304
     Cache is 47 hours old but we think it's "valid"

     Still need: WHO is affected? HOW MANY?
  18. Plain text:
     [09:41:23] INFO GitHub API returned 304...

     Structured JSON:
     {
       "timestamp": "2026-01-14T09:41:23Z",
       "user_id": 847,
       "trace_id": "abc123",
       "cache_hit": true,
       "cache_age_hours": 47,
       "github_status": 304
     }
  19. Query: cache_age_hours > 24

     Results: 847 log entries across 126 users
     User 847:  cache age 47h (Sarah Chen)
     User 291:  cache age 36h
     User 1042: cache age 29h
     ...

     126 users affected
  20. THE BUG

     ETags answer: "Has the CONTENT changed?"
     We assumed: "Is the CACHE still fresh?"

     GitHub's commits didn't change → 304 → we keep stale cache

     The fix: Check cache age FIRST, before trusting the ETag. If older than 24 hours, refresh anyway.
  21. The Fix (4 lines)

     def cache_valid?(user)
       return false if cache_expired?(user)  # NEW!
       return true if etag_matches?(user)
       false
     end

     def cache_expired?(user)
       user.commits_cached_at < 24.hours.ago
     end
  22. The Timeline

      9:47 AM  Sarah's email
     10:15 AM  Metrics: 100% cache hit
     10:30 AM  Traces: 304 + stale cache
     10:45 AM  Logs: 126 users affected
     11:00 AM  Root cause found
     11:20 AM  Fix deployed

     93 minutes from email to fix
  23. What Traditional Alerting Sees

     ✓ Error rate > 1%    Not firing (0%)
     ✓ p99 > 500ms        Not firing (127ms)
     ✓ Failed jobs > 5    Not firing (0)

     ALL GREEN ✓
     126 users affected
  24. The Shift

     SYMPTOM ALERTS (after the damage):
     "Errors are happening" · "Users are complaining" · "Jobs are failing"
     Reactive

     BEHAVIOR ALERTS (before the damage):
     "We stopped calling GitHub" · "Cache is getting stale" · "Hit rate too high"
     Proactive
  25. THE NEW TUESDAY

     3:47 AM  Alert: Cache hit rate 100% for 1 hour
     3:52 AM  Grafana → Jaeger → Logs: found the issue
     4:15 AM  Fix deployed
     9:47 AM  Sarah generates invoice. All commits present.

     She never emails. Because there's nothing wrong.
  26. Anti-pattern #1

     ✗ "No errors = no problems"
     ✓ Track SUCCESS indicators, not just failures
  27. Quick wins for Monday

     1. RED metrics on every endpoint (Rate, Errors, Duration)
     2. Cache age metrics
     3. Upstream call rate ("Are we talking to dependencies?")
     4. trace_id in structured logs (connects logs ↔ traces)

     Total: ~2 hours setup
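
     Point 4 is the glue between the logging and tracing setups sketched above. One way to stamp the current trace_id onto every structured log line inside the job; SemanticLogger's tagged block and the OpenTelemetry span context are real APIs, the placement is an assumption:

     # Around the work in the sync job -- sketch
     span_context = OpenTelemetry::Trace.current_span.context
     SemanticLogger.tagged(trace_id: span_context.hex_trace_id) do
       logger.info("Syncing commits", user_id: user.id)
       # ... cache lookup, GitHub call, etc. -- every log line in here carries trace_id
     end
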
  28. THIS WEEK

     Monday     yabeda-rails + prometheus            30 min
     Tuesday    opentelemetry-instrumentation-all    20 min
     Wednesday  semantic_logger                      15 min
     Thursday   One "silence" alert                  10 min

     Total: ~75 minutes to production observability
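
     In Gemfile terms the whole week is a handful of gems. A sketch; the companion gems (yabeda-prometheus, the OpenTelemetry SDK and OTLP exporter, rails_semantic_logger) are assumptions about how the libraries named in the deck are usually wired into Rails:

     # Gemfile -- sketch of the stack named in the deck
     gem "yabeda-rails"                      # RED metrics for every controller action
     gem "yabeda-prometheus"                 # serves /metrics in Prometheus format
     gem "opentelemetry-sdk"
     gem "opentelemetry-exporter-otlp"
     gem "opentelemetry-instrumentation-all" # auto-instrumentation for Rails, AR, Net::HTTP, ...
     gem "rails_semantic_logger"             # structured JSON logging (wraps semantic_logger)
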
  29. RESOURCES

     This talk + demo: github.com/mraaroncruz/observability-demo

     The gems:
     • yabeda-rails / yabeda-prometheus
     • opentelemetry-instrumentation-all
     • semantic_logger

     @mraaroncruz · billmycommits.com