
Software Performance: An Overview

What is Performance?
Algorithmic Complexity
Caching
Memory Hierarchy
Server Performance
Concurrency
Measuring Performance
Scalability

George V. Reilly

January 21, 2016

Transcript

  1. Performance:
    An Overview
    George V. Reilly
    Software Development Lead, MetaBrite


  2. Agenda
    ● What is Performance?
    ● Algorithmic Complexity
    ● Caching
    ● Memory Hierarchy
    ● Server Performance
    ● Concurrency
    ● Measuring Performance
    ● Scalability


  3. Not Discussing
    ● Performance Tuning
    ● Database Performance
    ● Web Performance
    ● Python Performance
    ● C/C++ Performance
    ● High-Performance Teams


  4. What is Performance?
    ● Efficiency: Doing More With Less
    ● It’s Better Than Before!
    ● Responsiveness


  5. Low Utilization of Resources
    ● Less Time (CPU Cycles)
    ● Less Memory (RAM)
    ● Less Storage (Disk)
    ● Less Power (Battery)
    ● Less Contention (Locks)
    ● Don’t do the same work more than once


  6. Layer Inefficiencies Compound
    ● Inner loops are often 10–20 layers deep in call stack
    ● If each layer has just 50% inefficiency, ten layers
    compound to a ~58x slowdown: 1.5^10 ≈ 57.7
    ○ 100% inefficiency for 10 layers ⇒ ~1000x slowdown: 2^10 = 1024
    ● If a lower level is inefficient, it affects all higher levels
    ● A higher level may be calling an efficient lower level far
    too often
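The compounding arithmetic is easy to verify; a minimal Python sketch (the function name is mine):

```python
# Each layer retains only a fraction of its ideal speed; slowdowns multiply.
def compound_slowdown(per_layer_inefficiency: float, layers: int) -> float:
    """Overall slowdown when every layer is (1 + inefficiency)x slower."""
    return (1.0 + per_layer_inefficiency) ** layers

# 50% inefficiency per layer, 10 layers deep:
print(round(compound_slowdown(0.5, 10), 1))   # 57.7
# 100% inefficiency per layer, 10 layers deep:
print(compound_slowdown(1.0, 10))             # 1024.0
```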


  7. Algorithmic Complexity: Big O
    ● Many algorithms operate on N items.
    ● Runtime (or space) is a polynomial function of N:
    c_k N^k + c_{k-1} N^{k-1} + … + c_1 N + c_0
    ● We take the highest term, drop the constant c_k, and say
    that the algorithm is O(N^k).
    ● As N grows large, runtime grows in proportion to N^k
    ● For smaller N, the constant factors matter.
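To see why only the highest term matters, and why constants dominate for small N, here is a quick sketch with a hypothetical runtime polynomial 3N^2 + 500N + 10^6 (the coefficients are invented for illustration):

```python
# Hypothetical runtime polynomial: the N^2 term eventually dominates,
# which is why we call such an algorithm O(N^2).
def runtime(n: int) -> int:
    return 3 * n**2 + 500 * n + 10**6

for n in (100, 10_000, 1_000_000):
    quadratic_share = 3 * n**2 / runtime(n)
    print(f"N={n:>9,}: quadratic term is {quadratic_share:.1%} of the total")
```

For N = 100 the quadratic term is under 3% of the total (the constant term dominates); by N = 10,000 it is already over 98%.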


  8. Algorithmic Complexity
    ● O(1): constant time—addition, hash table lookup, …
    ● O(log N): logarithmic—binary search, binary tree lookup
    ● O(N): linear—search unordered list, vector addition, …
    ● O(N log N): linearithmic—quicksort, etc
    ● O(N^2): quadratic—bubble sort, matrix addition, &c
    ● O(N^3): cubic—naive matrix multiplication
    ● O(c^N), c > 1: exponential—traveling salesman
    ● O(N!): factorial—permutations and combinations
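A quick illustration of how much these classes differ in practice: a worst-case O(N) linear scan of a Python list versus an average-O(1) hash lookup in a set (sizes and repetition counts are arbitrary choices of mine):

```python
import timeit

# Membership test: O(N) linear scan on a list vs. O(1) average
# hash lookup on a set.
n = 100_000
items_list = list(range(n))
items_set = set(items_list)
target = n - 1              # worst case for the linear scan

t_list = timeit.timeit(lambda: target in items_list, number=100)
t_set = timeit.timeit(lambda: target in items_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s  ratio: {t_list / t_set:.0f}x")
```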


  9. Big O Graph


  10. Caching
    ● Fundamental performance technique
    ● Trade Space for Time
    ● Store results to prevent recomputation/refetching
    ● Caches used at all levels throughout system
    ● Problems:
    ○ Low cache hit rate; i.e., many cache misses
    ○ Cache replacement policy; e.g., Least Recently Used
    ○ Cache invalidation
    ○ Using too much memory
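To make the replacement-policy and hit-rate ideas concrete, here is a toy bounded LRU cache. This is a sketch only; the class, its method names, and its hit/miss counters are my own invention, not from the slides:

```python
from collections import OrderedDict

class LRUCache:
    """A toy bounded cache: evicts the Least Recently Used entry when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key, compute):
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)      # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = compute(key)                 # cache miss: do the real work
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used
        return value

cache = LRUCache(capacity=2)
square = lambda k: k * k
cache.get(1, square)
cache.get(2, square)
cache.get(1, square)          # hit: 1 becomes most recently used
cache.get(3, square)          # evicts 2, the least recently used
print(2 in cache._data)       # False
```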


  11. Case Study: Fibonacci
    ● 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, …
    ● F_n = F_{n-1} + F_{n-2}, F_2 = F_1 = 1
    ● F_n ≈ φ^n / √5
    ○ exponential growth
    ○ φ = 1.618…, the Golden Ratio
    ● Naive recursive solution is prohibitively expensive
    ○ Computing F_n takes ~F_n additions.
    ● Iterative solution is linear
    ● Cached solution is constant time (after the initial calculation)
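The three Fibonacci strategies can be sketched directly in Python; `functools.lru_cache` from the standard library stands in for the cache, and the recursion uses F(1) = F(2) = 1 as above:

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    """Naive recursion: ~F(n) additions, i.e. exponential time."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n: int) -> int:
    """Same recursion, but each F(k) is computed once: linear time."""
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

def fib_iter(n: int) -> int:
    """Iterative: linear time, constant space."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_iter(10))  # 55
```

After the first call, repeated calls to `fib_cached(n)` are constant-time dictionary lookups.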


  12. Memory Hierarchy
    Cache hierarchy of the K8 core in the AMD Athlon 64 CPU—Wikipedia


  13. Memory Hierarchy
    ● L1, L2, L3 caches
    ● Main Memory: DRAM
    ● Virtual Memory—paging to disk when Working Set
    exceeded
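Locality effects are far more visible in C or NumPy than in pure Python, but the access-pattern idea can still be sketched: traversing a list-of-lists row by row touches memory more sequentially than hopping between rows on every access. Sizes are arbitrary and timings will vary by machine:

```python
import timeit

N = 500
matrix = [[1] * N for _ in range(N)]

def row_major() -> int:
    # Walk each row's backing array in order: good locality.
    return sum(matrix[i][j] for i in range(N) for j in range(N))

def col_major() -> int:
    # Hop to a different row on every access: poorer locality.
    return sum(matrix[i][j] for j in range(N) for i in range(N))

assert row_major() == col_major() == N * N
print(timeit.timeit(row_major, number=5))
print(timeit.timeit(col_major, number=5))
```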


  14. Event Latency, Scaled
    Event                                  Latency    Scaled
    1 CPU cycle                            0.3 ns     1 sec
    Level 1 cache access                   0.9 ns     3 secs
    Level 2 cache access                   2.8 ns     9 secs
    Level 3 cache access                   12.9 ns    43 secs
    Main memory access (DRAM, from CPU)    120 ns     6 mins
    Solid-state disk I/O                   50–150 µs  2–6 days
    Rotational disk I/O                    1–10 ms    1–12 months
    Internet: San Francisco to New York    40 ms      4 years
    Internet: SF to United Kingdom         81 ms      8 years
    Internet: SF to Australia              183 ms     19 years
    TCP packet retransmit                  1–3 s      105–317 years
    OS virtualization system reboot        4 s        423 years
    SCSI command timeout                   30 s       3 millennia
    Hardware virtualization system reboot  40 s       4 millennia
    Physical system reboot                 5 min      32 millennia
    Source: Brendan Gregg, Systems Performance


  15. Server Performance
    ● Throughput = # requests completed per unit time
    ● Latency = end-to-end time to process one request
    ● An assembly line is manufacturing cars.
    ○ It takes 8 hours to manufacture a car
    ○ The factory produces 120 cars per day
    ○ Latency: 8 hours
    ○ Throughput: 120 cars / day = 5 cars / hour
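The car-factory example also illustrates Little's Law (my addition; the slide does not name it): the average number of items in flight equals throughput × latency. A small sketch:

```python
def concurrency(throughput_per_hour: float, latency_hours: float) -> float:
    """Little's Law: average items in flight = throughput x latency."""
    return throughput_per_hour * latency_hours

# 5 cars/hour finished, 8 hours each on the line:
cars_in_progress = concurrency(throughput_per_hour=5, latency_hours=8)
print(cars_in_progress)  # 40.0 cars on the line at any moment
```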


  16. Concurrency, Throughput, Latency


  17. Concurrency
    ● Take advantage of multiple CPUs in a computer
    ● Multiple threads or multiple processes
    ● Problems
    ○ Non-determinism
    ○ Debugging
    ○ Synchronization
    ○ Serialization
    ○ Lock and Resource Contention
    ○ False Sharing
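A minimal sketch of the synchronization problem: several threads incrementing a shared counter, with a lock serializing the read-modify-write. (Note that CPython's GIL still prevents true CPU parallelism across threads; multiple processes are the usual route for CPU-bound work.)

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:          # synchronization: serialize read-modify-write
            counter += 1

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(increment, 25_000)

print(counter)  # 100000; without the lock, updates could be lost
```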


  18. Premature Optimization
    “Programmers waste enormous amounts of time thinking
    about, or worrying about, the speed of noncritical parts of
    their programs, and these attempts at efficiency actually have
    a strong negative impact when debugging and maintenance
    are considered. We should forget about small efficiencies, say
    about 97% of the time: premature optimization is the root of
    all evil. Yet we should not pass up our opportunities in that
    critical 3%.”
    — Donald Knuth


  19. Measuring Performance
    ● < 4% of code accounts for > 50% of runtime
    ● Changes with high expectations often disappoint
    ● Correctness is more important than efficiency
    ● Metrics & Instrumentation
    ● Logs
    ● Profiling
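Profiling with the standard library's cProfile can be sketched as follows; the two functions are invented stand-ins for hot and cold code paths:

```python
import cProfile
import io
import pstats

def slow_part():
    return sum(i * i for i in range(200_000))

def fast_part():
    return 42

def program():
    slow_part()
    fast_part()

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())   # slow_part dominates: measure before optimizing
```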


  20. Scalability
    ● Bigger Workloads
    ● Not synonymous with Performance
    ● Scale Up (Vertical)
    ○ More CPUs, more RAM
    ○ You can’t buy (or afford) a 1,000,000-core system with a petabyte of RAM
    ● Scale Out (Horizontal)
    ○ Add more and more commodity machines
    ● Bottlenecks
    ● Scale Down
    ○ Embedded systems
    ○ More containers per host
