Slide 1

Slide 1 text

Dashboard Design using Information Architecture
@hashfyre, Staff Engineer @ Infracloud.io, X-Razorpay, X-Hotstar

Slide 2

Slide 2 text

Dashboards - Why? Global Overview of Information Data Functions Controls Leverage State Stakeholders Insight Use alter

Slide 3

Slide 3 text

Purpose
● Single Pane Of Glass
● Decision Making
● Goal Tracking
● Historical Review
● System Monitoring

Slide 4

Slide 4 text

Understanding Your Audience

Slide 5

Slide 5 text

Make Dashboards with available metrics Talk to Stakeholders

Slide 6

Slide 6 text

Design SRE Thinking Empathy Priority Prototype Test Validate Iterate Empathy Map J2BD User Journey No Sprint Time for Engineers Skills Abilities KPIs Attitude Knowledge Environment Urgency Water-cooler chats Slack Banter 1:1s Outages

Slide 7

Slide 7 text

Influencing Factors BIZ TECH Org Hierarchy & Culture Geographic Distribution Of Teams Business Vertical

Slide 8

Slide 8 text

The Stakeholders BIZ TECH CEO PMs CTO Dir. Eng EMs Engineers Impact feedback

Slide 9

Slide 9 text

Care
BIZ / TECH: CEO, CFO, PMs, CTO, Dir. Eng, EMs, Engineers
● Are things up?
● What’s the cost?
● When can we go live?
● Does it scale?
● Are we fast and secure?
● How’s our code quality? Performance?
● Will it break?
● Do we have enough compute?

Slide 10

Slide 10 text

Information Architecture & Hierarchy Are things up? What’s the cost? When can we go live? Does it scale? Are we fast and secure? Code Quality? Performance? Will it break? Enough compute? Grain & Precision ++ RED USE Latency Bug count CVEs / deploy Deploys/week OK 10K/mo Biz Critical ++

Slide 11

Slide 11 text

Information Hierarchy Grain & Precision ++ RED USE Latency Bug count CVEs / deploy Deploys/week OK 10K/mo KPIs & SLOs SLIs Debug, MTTR, MTBF

Slide 12

Slide 12 text

Principles of Information Architecture

Slide 13

Slide 13 text

Only the Most Important Information OK Gateway 200.5K RPS Time Req Gateway Time Req Payments Time Req Auth DOWN Payments 10.8K 5xx OK Auth 50.5K RPS

Slide 14

Slide 14 text

Data Ink Ratio
[Figure: two versions of a Gateway Req/s vs Time (Hr) chart; callouts 5k and 7.2K]
● X Gridlines
● X Icons
● X Colors
● X Labels

Slide 15

Slide 15 text

Precision vs Cognition
Users (Q1), precise vs rounded:
● USA: 1349745.667 -> 1.3M
● UK: 1009345.667 -> 1.0M
● EU: 901236.667 -> 0.9M
● ASIA: 801236.667 -> 0.8M
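The rounding on this slide can be sketched in a few lines of Python. This is only an illustration, not the speaker's tooling: the names `UNITS`, `pick_unit` and `humanize_group` are made up here, and the choice to scale the whole group by the unit of its largest value (so 901236.667 reads as 0.9M rather than 901.2K) is an assumption inferred from the slide's labels.

```python
# Hypothetical helper: round a group of raw counts to short, comparable labels.
UNITS = ((1e9, "B"), (1e6, "M"), (1e3, "K"))


def pick_unit(values):
    """Pick one divisor/suffix for the whole group, based on its largest value."""
    largest = max(abs(v) for v in values)
    for divisor, suffix in UNITS:
        if largest >= divisor:
            return divisor, suffix
    return 1.0, ""


def humanize_group(values, decimals=1):
    """Return {label: short string}, all expressed in the same unit."""
    divisor, suffix = pick_unit(values.values())
    return {
        key: f"{value / divisor:.{decimals}f}{suffix}"
        for key, value in values.items()
    }


# Raw figures taken from the slide.
users_q1 = {
    "USA": 1349745.667,
    "UK": 1009345.667,
    "EU": 901236.667,
    "ASIA": 801236.667,
}
rounded = humanize_group(users_q1)
```

Using one unit for the whole group keeps the bars comparable at a glance, which is the point the slide makes about cognition over precision.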

Slide 16

Slide 16 text

Right Visualization For Insights OK Gateway 200.5K RPS 200.5K 5xx Payments KPIs Status, Rounded Numbers 1.3M 1.0M Users (Q1) USA UK 0.9M EU 0.8M ASIA Comparisons Bar Graphs Composition Pie, Donut, Area charts Releases P95 latency Relationship Correlations Distributions

Slide 17

Slide 17 text

F/Z Reading Patterns OK Gateway 200.5K RPS DOWN Payments 10.8K 5xx OK Auth 50.5K RPS P95 Latency num(req) OK MySQL READS 100K WARN MySQL Slow Query 15 Uptime Gateway 99.99% Auth 99.99% payment 80.65% Cart 95.28% Review 99.28% OK MySQL WRITES 50K Scan Primary KPIs Skim Scan Secondary SLIs Scan Primary KPIs Skim Scan Secondary SLIs

Slide 18

Slide 18 text

Emphasize Reading Patterns OK Gateway 200.5K RPS DOWN Payments 10.8K 5xx OK Auth 50.5K RPS P95 Latency num(req) OK MySQL READS 100K WARN MySQL Slow Query 15 Uptime 99.99% Gateway OK MySQL WRITES 50K Color Size 80.65% Payment 95.28% Cart 99.28% Review Complexity We always look first for complex shapes Font Weight

Slide 19

Slide 19 text

Contextual Grouping OK Gateway 200.5K RPS DOWN Payments 10.8K 5xx OK Auth 50.5K RPS P95 Latency num(req) OK MySQL READS 100K WARN MySQL Slow Query 15 Uptime 99.99% Gateway OK MySQL WRITES 50K 80.65% Payment 95.28% Cart 99.28% Review Database performance Application performance User Experience

Slide 20

Slide 20 text

Raw Numbers vs Numbers with Context MySQL Slow Query / day -15% WARN 15 WARN 15
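One way to attach context to a raw number is to pair it with a status and a change against the previous period. The sketch below assumes a hypothetical `with_context` helper, a made-up warning threshold, and a made-up previous value; none of these come from the slide, which only shows the resulting tile.

```python
# Hypothetical helper: turn a raw metric into a "number with context" label.
def with_context(label, current, previous, warn_at):
    """Format a metric with a status and its percent change vs the previous period."""
    if previous:
        change = (current - previous) / previous * 100
        delta = f"{change:+.0f}%"
    else:
        # No baseline (or a zero baseline): a percent change is undefined.
        delta = "n/a"
    status = "WARN" if current >= warn_at else "OK"
    return f"{status} {label}: {current} ({delta} vs previous period)"


# Illustrative values only: 20 slow queries yesterday, 15 today, warn at 10.
tile = with_context("MySQL Slow Query / day", current=15, previous=20, warn_at=10)
```

The same value of 15 reads very differently once the viewer knows it is above the threshold but trending down.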

Slide 21

Slide 21 text

Applying the Concepts

Slide 22

Slide 22 text

Dashboards Targeting Stakeholders
BIZ / TECH: CEO, PMs, CTO, Dir. Eng, EMs, Engineers
● Executive Summary - Tech + Biz
● Feature Performance
● Global Status - KPIs, SLOs
● SLI
● RED, USE

Slide 23

Slide 23 text

Executive Summaries
● 1.8M Global Users +10%
● 4.5M Gateway Requests +20%
● 50K Signups +15%
● 10K Live Users +12%
● 30K Payments +10%
● 10.5K $ Cost +10%
● 1K Payment Failures +90%

Slide 24

Slide 24 text

Global Status 4.5M Gateway RPS +20% 10.8K 5xx Payments +30% 100K Auth RPS +30% P95 Latency num(req) Uptime 99.99% Gateway 80.65% Payment 95.28% Cart 99.28% Review 100K Payment RPS +20% 400K Cart RPS +30% 50K Review RPS +20% 1.5K 4xx Auth +20% 0 5xx Review -35% 0 5xx Cart 0% 99.87% Auth 1M Reads 100K Writes 1K Slow MySQL Queries Auth Cart Payment Review

Slide 25

Slide 25 text

Service Level Indicators 10.8K 5xx Payments +30% P95 Latency num(req) 99.99% Gateway 100K RPS +20% 99.87% Auth 1M Reads GET /refund/:id GET /payment/:id POST /create POST /refund 80.65% Uptime -20% 1.5hrs Error Budget -45% 100K MySQL READS 15 MySQL Slow Queries 50K MySQL WRITES 5.8K 5xx SlyPay +30% 0.1K 5xx SharpPay +0% Top K Errors Unable to Connect to MySQL Endpoint - 1K Query timed out - 500 Unable to resolve DNS payment.slypay.com - 4K
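The error budget tile on this slide is commonly derived from an availability SLO. The sketch below assumes the usual definition (allowed downtime = window x (1 - SLO target)); the function name `error_budget`, the 99.9% target, the 30-day window and the 20 minutes of downtime are illustrative assumptions, not values from the talk.

```python
from datetime import timedelta


# Hypothetical helper: availability error budget for a given SLO and window.
def error_budget(slo_target, window, downtime):
    """Return (allowed downtime, remaining budget) as timedeltas."""
    allowed = window * (1.0 - slo_target)
    remaining = allowed - downtime  # negative once the budget is overspent
    return allowed, remaining


# Illustrative numbers: 99.9% SLO over 30 days, 20 minutes of downtime so far.
allowed, remaining = error_budget(
    slo_target=0.999,
    window=timedelta(days=30),
    downtime=timedelta(minutes=20),
)
```

Plotting the remaining budget next to uptime gives the percentage a consequence: it shows how much room is left before the SLO is breached.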

Slide 26

Slide 26 text

Debugging & RCA
● RED Method
  ○ Rate
  ○ Errors
  ○ Duration
● USE Method
  ○ Utilization
  ○ Saturation
  ○ Errors
● Golden Signals
  ○ Latency
  ○ Traffic
  ○ Errors
  ○ Saturation
● Create an inverted Pyramid of abstractions
  ○ L7 Metrics
  ○ L4 Metrics
  ○ CPU, Memory, Disk, Network
  ○ Logs, traces, events
● Plot SLIs of dependent services & dependencies
  ○ SLIs for 3rd party APIs
  ○ SLIs for Cloud and K8s APIs
  ○ SLIs for DBs, Caches and queues
    ■ Separate Read / write paths
    ■ Slow queries filtered for the service
  ○ DNS calls
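As a rough illustration of the RED method listed above, the three signals can be computed from a window of request records. Everything named here is an assumption for the sketch: the `Request` record, the `red_summary` function, counting only 5xx responses as errors, and using a nearest-rank P95 for duration. A real setup would normally take these from a metrics backend instead of raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record: HTTP status and duration in milliseconds."""
    status: int
    duration_ms: float


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration (P95) for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx only
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1] if total else None
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


# Illustrative window: 20 requests in 10 seconds, the two slowest fail with 500.
sample = [Request(500 if i > 18 else 200, 10.0 * i) for i in range(1, 21)]
summary = red_summary(sample, window_seconds=10)
```

Computing the same summary per dependency (DB, cache, third-party API) gives the dependency SLIs that the last bullet group asks for.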

Slide 27

Slide 27 text

Key Takeaways Shape Cognition Complexity Talk to Stakeholders Only Emphasize Important Data Use Information Architecture Color Size Separate Dashboards Drill-Down Contextual Relationships Storytelling using correlation

Slide 28

Slide 28 text

Lastly… Get Stakeholders Run Metric Review Cadences Design Iterate Review Measure Efficacy MTTR MTBF
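To measure efficacy with MTTR and MTBF, both can be derived from an incident log. The sketch assumes the common definitions (MTTR = total downtime / number of incidents, MTBF = total uptime / number of incidents) and non-overlapping incidents inside the review period; `mttr_mtbf` and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


# Hypothetical helper: incidents are (started, resolved) datetime pairs.
def mttr_mtbf(incidents, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas, or (None, None) with no incidents."""
    if not incidents:
        return None, None
    downtime = sum(
        (resolved - started for started, resolved in incidents), timedelta()
    )
    uptime = (period_end - period_start) - downtime
    count = len(incidents)
    return downtime / count, uptime / count


# Illustrative 10-day review period with a 1-hour and a 3-hour outage.
start = datetime(2024, 1, 1)
end = start + timedelta(days=10)
incidents = [
    (start + timedelta(days=2), start + timedelta(days=2, hours=1)),
    (start + timedelta(days=6), start + timedelta(days=6, hours=3)),
]
mttr, mtbf = mttr_mtbf(incidents, start, end)
```

Tracking these two numbers across review cadences is one way to check whether a dashboard redesign actually shortened debugging time.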

Slide 29

Slide 29 text

References
● https://www.geckoboard.com/best-practice/dashboard-design/
● https://infovis-wiki.net/wiki/Data-Ink_Ratio
● https://www.sciencedirect.com/science/article/abs/pii/S0969698922001242
● https://youtu.be/UPHWsheepZo?si=mk6Quj-G7j84BFru
● https://grafana.com/go/webinar/getting-started-with-grafana-dashboard-design-emea/
● https://pastebin.com/rx208qpC

Slide 30

Slide 30 text

Thanks, Phew! @hashfyre joy.bhattacherjee@infracloud.io