George V. Reilly
January 21, 2016

# Software Performance: An Overview


## Transcript

1. Performance:
An Overview
George V. Reilly
Software Development Lead, MetaBrite

2. Agenda
● What is Performance?
● Algorithmic Complexity
● Caching
● Memory Hierarchy
● Server Performance
● Concurrency
● Measuring Performance
● Scalability

3. Not Discussing
● Performance Tuning
● Database Performance
● Web Performance
● Python Performance
● C/C++ Performance
● High-Performance Teams

4. What is Performance?
● Efficiency: Doing More With Less
● It’s Better Than Before!
● Responsiveness

5. Low Utilization of Resources
● Less Time (CPU Cycles)
● Less Memory (RAM)
● Less Storage (Disk)
● Less Power (Battery)
● Less Contention (Locks)
● Don’t do something more than once

6. Layer Inefficiencies Compound
● Inner loops are often 10–20 layers deep in the call stack
● If each layer has just 50% inefficiency, ten layers
experience a ~60x slowdown: 1.5^10 ≈ 57.7
○ 100% inefficiency for 10 layers ⇒ ~1000x slowdown: 2^10 = 1024
● If a lower level is inefficient, it affects all higher levels
● A higher level may be calling an efficient lower level far
too often
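
The compounding arithmetic on this slide is easy to check in a few lines of Python (a sketch; the 50% and 100% figures are the slide's illustrative numbers):

```python
# Each of `layers` call-stack layers multiplies runtime by (1 + inefficiency).
def compound_slowdown(inefficiency: float, layers: int) -> float:
    return (1 + inefficiency) ** layers

print(round(compound_slowdown(0.5, 10), 1))  # 50% waste x 10 layers -> 57.7
print(compound_slowdown(1.0, 10))            # 100% waste x 10 layers -> 1024.0
```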

7. Algorithmic Complexity: Big O
● Many algorithms operate on N items.
● Runtime (or space) is a polynomial function of N:
c_k·N^k + c_(k-1)·N^(k-1) + … + c_1·N + c_0
● We take the highest term, drop the constant c_k, and say
that the algorithm is O(N^k).
● As N grows large, runtime draws asymptotically close to N^k
● For smaller N, the constant factors matter.

8. Algorithmic Complexity
● O(1): constant time—addition, hash table lookup, …
● O(log N): logarithmic—binary search, binary tree lookup
● O(N): linear—search unordered list, vector addition, …
● O(N log N): linearithmic—quicksort, etc
● O(N^2): quadratic—bubble sort, matrix addition, &c
● O(N^3): cubic—matrix multiplication
● O(c^N), c > 1: exponential—traveling salesman
● O(N!): factorial—permutations and combinations
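
To make the O(N) vs O(log N) distinction concrete, here is a minimal sketch (not from the talk) contrasting a linear scan with a binary search built on Python's stdlib `bisect`:

```python
import bisect

def linear_search(items, target):
    # O(N): may have to examine every element.
    for i, x in enumerate(items):
        if x == target:
            return i
    return -1

def binary_search(sorted_items, target):
    # O(log N): halves the candidate range on each step.
    i = bisect.bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = list(range(1_000_000))
# Same answer; the linear scan does ~1,000,000 comparisons, the binary search ~20.
assert linear_search(data, 999_999) == binary_search(data, 999_999) == 999_999
```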

9. Big O Graph

10. Caching
● Fundamental performance technique
● Trade Space for Time
● Store results to prevent recomputation/refetching
● Caches used at all levels throughout system
● Problems:
○ Low cache hit rate; i.e., many cache misses
○ Cache replacement policy; e.g., Least Recently Used
○ Cache invalidation
○ Using too much memory
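
The hit-rate and replacement-policy problems above can be seen in miniature with Python's `functools.lru_cache` (a toy example, not from the talk): a `maxsize=2` cache evicts the least recently used entry.

```python
from functools import lru_cache

@lru_cache(maxsize=2)  # tiny cache with Least Recently Used eviction
def square(n):
    return n * n

square(1); square(2)   # two misses fill the cache
square(1)              # hit: 1 becomes most recently used
square(3)              # miss: evicts 2, the least recently used entry
square(2)              # miss: 2 was evicted, so it is recomputed
print(square.cache_info())  # hits=1, misses=4
```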

11. Case Study: Fibonacci
● 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, …
● F_n = F_(n-1) + F_(n-2), F_2 = F_1 = 1
● F_n ≈ φ^n / √5
○ exponential
○ φ = 1.618…, the Golden Ratio
● Naive recursive solution is prohibitively expensive
○ Computing F_n takes ~F_n additions.
● Iterative solution is linear
● Cache is constant time (after initial calculation)
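
The three cost profiles on this slide can be sketched in Python: naive recursion (exponential), memoization via `functools.lru_cache` (linear once, then constant), and iteration (linear):

```python
from functools import lru_cache

def fib_naive(n):
    # Exponential: recomputes subproblems, ~F(n) additions.
    if n <= 2:
        return 1
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    # Linear on first call; constant-time lookups thereafter.
    if n <= 2:
        return 1
    return fib_cached(n - 1) + fib_cached(n - 2)

def fib_iter(n):
    # Linear time, constant space.
    a, b = 1, 1
    for _ in range(n - 2):
        a, b = b, a + b
    return b

assert fib_naive(20) == fib_cached(20) == fib_iter(20) == 6765
```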

12. Memory Hierarchy
Cache hierarchy of the K8 core in the AMD Athlon 64 CPU—Wikipedia

13. Memory Hierarchy
● L1, L2, L3 caches
● Main Memory: DRAM
● Virtual Memory—paging to disk when the Working Set is
exceeded

14. Event Latency Scaled

| Event                               | Latency   | Scaled       |
|-------------------------------------|-----------|--------------|
| 1 CPU cycle                         | 0.3 ns    | 1 sec        |
| Level 1 cache access                | 0.9 ns    | 3 secs       |
| Level 2 cache access                | 2.8 ns    | 9 secs       |
| Level 3 cache access                | 12.9 ns   | 43 secs      |
| Main memory access (DRAM, from CPU) | 120 ns    | 6 mins       |
| Solid-state disk I/O                | 50–150 µs | 2–6 days     |
| Rotational disk I/O                 | 1–10 ms   | 1–12 months  |
| Internet: San Francisco to New York | 40 ms     | 4 years      |
| Internet: SF to United Kingdom      | 81 ms     | 8 years      |
| Internet: SF to Australia           | 183 ms    | 19 years     |
| TCP packet retransmit               | 1–3 s     | 105–317 yrs  |
| OS virtualization system reboot     | 4 s       | 423 years    |
| SCSI command timeout                | 30 s      | 3 millennia  |
| Hardware virt. system reboot        | 40 s      | 4 millennia  |
| Physical system reboot              | 5 min     | 32 millennia |

Source: Brendan Gregg, Systems Performance

15. Server Performance
● Throughput = # requests completed per unit of time
● Latency = end-to-end processing time
● An assembly line is manufacturing cars.
○ It takes 8 hours to manufacture a car
○ The factory produces 120 cars per day
○ Latency: 8 hours
○ Throughput: 120 cars / day = 5 cars / hour
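
The factory numbers are consistent with Little's Law, which relates the three quantities (work in progress = throughput × latency, assuming steady state); a quick check in Python:

```python
# Little's Law: work-in-progress = throughput * latency (steady state assumed)
latency_hours = 8.0        # time to build one car
throughput_per_hour = 5.0  # 120 cars/day
cars_in_progress = throughput_per_hour * latency_hours
print(cars_in_progress)    # 40.0 cars on the line at any moment
```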

16. Concurrency, Throughput, Latency

17. Concurrency
● Take advantage of multiple CPUs in a computer
● Multiple threads or multiple processes
● Problems
○ Non-determinism
○ Debugging
○ Synchronization
○ Serialization
○ Lock and Resource Contention
○ False Sharing
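
One common way to use multiple CPUs from Python is a process pool; a minimal sketch using the stdlib's `concurrent.futures` (process start-up, serialization of arguments, and result collection are among the costs behind the problems listed above):

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # A CPU-heavy task: sum of squares below n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # One worker process per CPU by default; the four tasks run in parallel.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_bound, [100_000] * 4))
    print(results[0])
```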

18. Premature Optimization
“Programmers waste enormous amounts of time thinking
about, or worrying about, the speed of noncritical parts of
their programs, and these attempts at efficiency actually have
a strong negative impact when debugging and maintenance
are considered. We should forget about small efficiencies, say
about 97% of the time: premature optimization is the root of
all evil. Yet we should not pass up our opportunities in that
critical 3%.”
— Donald Knuth

19. Measuring Performance
● < 4% of code takes > 50% runtime
● Changes with high expectations often disappoint
● Correctness is more important than efficiency
● Metrics & Instrumentation
● Logs
● Profiling
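
As a sketch of the profiling bullet, Python's stdlib `cProfile` can locate hot spots; the quadratic string-building function here is a hypothetical example of "< 4% of code takes > 50% runtime":

```python
import cProfile
import io
import pstats

def slow_concat(n):
    # O(N^2): each += may copy the whole string built so far.
    s = ""
    for i in range(n):
        s += str(i)
    return s

profiler = cProfile.Profile()
profiler.enable()
slow_concat(10_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top 5 functions by cumulative time
```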

20. Scalability
● Bigger Workloads
● Not synonymous with Performance
● Scale Up (Vertical)
○ More CPUs, more RAM.
○ You can’t buy (or afford) a 1,000,000-core system with petabyte RAM
● Scale Out (Horizontal)
○ Add more and more commodity machines
● Bottlenecks
● Scale Down
○ Embedded systems
○ More containers per host