
Software Performance: An Overview

What is Performance?
Algorithmic Complexity
Caching
Memory Hierarchy
Server Performance
Concurrency
Measuring Performance
Scalability

George V. Reilly

January 21, 2016

Transcript

  1. Performance:
    An Overview
    George V. Reilly
    Software Development Lead, MetaBrite


  2. Agenda
    ● What is Performance?
    ● Algorithmic Complexity
    ● Caching
    ● Memory Hierarchy
    ● Server Performance
    ● Concurrency
    ● Measuring Performance
    ● Scalability


  3. Not Discussing
    ● Performance Tuning
    ● Database Performance
    ● Web Performance
    ● Python Performance
    ● C/C++ Performance
    ● High-Performance Teams


  4. What is Performance?
    ● Efficiency: Doing More With Less
    ● It’s Better Than Before!
    ● Responsiveness


  5. Low Utilization of Resources
    ● Less Time (CPU Cycles)
    ● Less Memory (RAM)
    ● Less Storage (Disk)
    ● Less Power (Battery)
    ● Less Contention (Locks)
    ● Don’t do the same work more than once


  6. Layer Inefficiencies Compound
    ● Inner loops are often 10–20 layers deep in call stack
    ● If each layer has just 50% inefficiency, ten layers
    compound to a ~58x slowdown: 1.5^10 ≈ 57.7
    ○ 100% inefficiency for 10 layers ⇒ ~1000x slowdown: 2^10 = 1024
    ● If a lower level is inefficient, it affects all higher levels
    ● A higher level may be calling an efficient lower level far
    too often
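The compounding arithmetic is easy to verify; a minimal Python sketch (the function name is mine):

```python
# Each layer retains only a fraction of its ideal speed; slowdowns multiply.
def compound_slowdown(per_layer_inefficiency: float, layers: int) -> float:
    """Overall slowdown when every layer is (1 + inefficiency)x slower."""
    return (1.0 + per_layer_inefficiency) ** layers

# 50% inefficiency per layer, 10 layers deep:
print(round(compound_slowdown(0.5, 10), 1))   # 57.7
# 100% inefficiency per layer, 10 layers deep:
print(compound_slowdown(1.0, 10))             # 1024.0
```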


  7. Algorithmic Complexity: Big O
    ● Many algorithms operate on N items.
    ● Runtime (or space) is a polynomial function of N:
    c_k N^k + c_{k-1} N^{k-1} + … + c_1 N + c_0
    ● We take the highest term, drop the constant c_k, and say
    that the algorithm is O(N^k).
    ● As N grows large, runtime grows in proportion to N^k
    ● For smaller N, the constant factors matter.
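To see why only the highest term matters, and why constants dominate for small N, here is a quick sketch with a hypothetical runtime polynomial 3N^2 + 500N + 10^6 (the coefficients are invented for illustration):

```python
# Hypothetical runtime polynomial: the N^2 term eventually dominates,
# which is why we call such an algorithm O(N^2).
def runtime(n: int) -> int:
    return 3 * n**2 + 500 * n + 10**6

for n in (100, 10_000, 1_000_000):
    quadratic_share = 3 * n**2 / runtime(n)
    print(f"N={n:>9,}: quadratic term is {quadratic_share:.1%} of the total")
```

For N = 100 the quadratic term is under 3% of the total (the constant term dominates); by N = 10,000 it is already over 98%.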


  8. Algorithmic Complexity
    ● O(1): constant time—addition, hash table lookup, …
    ● O(log N): logarithmic—binary search, binary tree lookup
    ● O(N): linear—search unordered list, vector addition, …
    ● O(N log N): linearithmic—quicksort, etc
    ● O(N^2): quadratic—bubble sort, matrix addition, &c
    ● O(N^3): cubic—naive matrix multiplication
    ● O(c^N), c > 1: exponential—traveling salesman
    ● O(N!): factorial—permutations and combinations
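A quick illustration of how much these classes differ in practice: a worst-case O(N) linear scan of a Python list versus an average-O(1) hash lookup in a set (sizes and repetition counts are arbitrary choices of mine):

```python
import timeit

# Membership test: O(N) linear scan on a list vs. O(1) average
# hash lookup on a set.
n = 100_000
items_list = list(range(n))
items_set = set(items_list)
target = n - 1              # worst case for the linear scan

t_list = timeit.timeit(lambda: target in items_list, number=100)
t_set = timeit.timeit(lambda: target in items_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s  ratio: {t_list / t_set:.0f}x")
```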


  9. Big O Graph


  10. Caching
    ● Fundamental performance technique
    ● Trade Space for Time
    ● Store results to prevent recomputation/refetching
    ● Caches used at all levels throughout system
    ● Problems:
    ○ Low cache hit rate; i.e., many cache misses
    ○ Cache replacement policy; e.g., Least Recently Used
    ○ Cache invalidation
    ○ Using too much memory
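To make the replacement-policy and hit-rate ideas concrete, here is a toy bounded LRU cache. This is a sketch only; the class, its method names, and its hit/miss counters are my own invention, not from the slides:

```python
from collections import OrderedDict

class LRUCache:
    """A toy bounded cache: evicts the Least Recently Used entry when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key, compute):
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)      # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = compute(key)                 # cache miss: do the real work
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used
        return value

cache = LRUCache(capacity=2)
square = lambda k: k * k
cache.get(1, square)
cache.get(2, square)
cache.get(1, square)          # hit: 1 becomes most recently used
cache.get(3, square)          # evicts 2, the least recently used
print(2 in cache._data)       # False
```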


  11. Case Study: Fibonacci
    ● 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, …
    ● F_n = F_{n-1} + F_{n-2}, F_2 = F_1 = 1
    ● F_n ≈ φ^n / √5
    ○ exponential growth
    ○ φ = 1.618…, the Golden Ratio
    ● Naive recursive solution is prohibitively expensive
    ○ Computing F_n takes ~F_n additions.
    ● Iterative solution is linear
    ● Cached solution is constant time (after the initial calculation)
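The three Fibonacci strategies can be sketched directly in Python; `functools.lru_cache` from the standard library stands in for the cache, and the recursion uses F(1) = F(2) = 1 as above:

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    """Naive recursion: ~F(n) additions, i.e. exponential time."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n: int) -> int:
    """Same recursion, but each F(k) is computed once: linear time."""
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

def fib_iter(n: int) -> int:
    """Iterative: linear time, constant space."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_iter(10))  # 55
```

After the first call, repeated calls to `fib_cached(n)` are constant-time dictionary lookups.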


  12. Memory Hierarchy
    Cache hierarchy of the K8 core in the AMD Athlon 64 CPU—Wikipedia


  13. Memory Hierarchy
    ● L1, L2, L3 caches
    ● Main Memory: DRAM
    ● Virtual Memory—paging to disk when Working Set
    exceeded
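Locality effects are far more visible in C or NumPy than in pure Python, but the access-pattern idea can still be sketched: traversing a list-of-lists row by row touches memory more sequentially than hopping between rows on every access. Sizes are arbitrary and timings will vary by machine:

```python
import timeit

N = 500
matrix = [[1] * N for _ in range(N)]

def row_major() -> int:
    # Walk each row's backing array in order: good locality.
    return sum(matrix[i][j] for i in range(N) for j in range(N))

def col_major() -> int:
    # Hop to a different row on every access: poorer locality.
    return sum(matrix[i][j] for j in range(N) for i in range(N))

assert row_major() == col_major() == N * N
print(timeit.timeit(row_major, number=5))
print(timeit.timeit(col_major, number=5))
```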


  14. Event Latency, Scaled
    Event                                  Latency    Scaled
    1 CPU cycle                            0.3 ns     1 sec
    Level 1 cache access                   0.9 ns     3 secs
    Level 2 cache access                   2.8 ns     9 secs
    Level 3 cache access                   12.9 ns    43 secs
    Main memory access (DRAM, from CPU)    120 ns     6 mins
    Solid-state disk I/O                   50–150 µs  2–6 days
    Rotational disk I/O                    1–10 ms    1–12 months
    Internet: San Francisco to New York    40 ms      4 years
    Internet: SF to United Kingdom         81 ms      8 years
    Internet: SF to Australia              183 ms     19 years
    TCP packet retransmit                  1–3 s      105–317 years
    OS virtualization system reboot        4 s        423 years
    SCSI command timeout                   30 s       3 millennia
    Hardware virtualization system reboot  40 s       4 millennia
    Physical system reboot                 5 min      32 millennia
    Source: Brendan Gregg, Systems Performance


  15. Server Performance
    ● Throughput = # requests completed per unit time
    ● Latency = end-to-end time to process one request
    ● An assembly line is manufacturing cars.
    ○ It takes 8 hours to manufacture a car
    ○ The factory produces 120 cars per day
    ○ Latency: 8 hours
    ○ Throughput: 120 cars / day = 5 cars / hour
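The car-factory example also illustrates Little's Law (my addition; the slide does not name it): the average number of items in flight equals throughput × latency. A small sketch:

```python
def concurrency(throughput_per_hour: float, latency_hours: float) -> float:
    """Little's Law: average items in flight = throughput x latency."""
    return throughput_per_hour * latency_hours

# 5 cars/hour finished, 8 hours each on the line:
cars_in_progress = concurrency(throughput_per_hour=5, latency_hours=8)
print(cars_in_progress)  # 40.0 cars on the line at any moment
```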


  16. Concurrency, Throughput, Latency


  17. Concurrency
    ● Take advantage of multiple CPUs in a computer
    ● Multiple threads or multiple processes
    ● Problems
    ○ Non-determinism
    ○ Debugging
    ○ Synchronization
    ○ Serialization
    ○ Lock and Resource Contention
    ○ False Sharing
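A minimal sketch of the synchronization problem: several threads incrementing a shared counter, with a lock serializing the read-modify-write. (Note that CPython's GIL still prevents true CPU parallelism across threads; multiple processes are the usual route for CPU-bound work.)

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:          # synchronization: serialize read-modify-write
            counter += 1

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(increment, 25_000)

print(counter)  # 100000; without the lock, updates could be lost
```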


  18. Premature Optimization
    “Programmers waste enormous amounts of time thinking
    about, or worrying about, the speed of noncritical parts of
    their programs, and these attempts at efficiency actually have
    a strong negative impact when debugging and maintenance
    are considered. We should forget about small efficiencies, say
    about 97% of the time: premature optimization is the root of
    all evil. Yet we should not pass up our opportunities in that
    critical 3%.”
    — Donald Knuth


  19. Measuring Performance
    ● < 4% of code accounts for > 50% of runtime
    ● Changes with high expectations often disappoint
    ● Correctness is more important than efficiency
    ● Metrics & Instrumentation
    ● Logs
    ● Profiling
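Profiling with the standard library's cProfile can be sketched as follows; the two functions are invented stand-ins for hot and cold code paths:

```python
import cProfile
import io
import pstats

def slow_part():
    return sum(i * i for i in range(200_000))

def fast_part():
    return 42

def program():
    slow_part()
    fast_part()

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())   # slow_part dominates: measure before optimizing
```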


  20. Scalability
    ● Bigger Workloads
    ● Not synonymous with Performance
    ● Scale Up (Vertical)
    ○ More CPUs, more RAM
    ○ You can’t buy (or afford) a 1,000,000-core system with a petabyte of RAM
    ● Scale Out (Horizontal)
    ○ Add more and more commodity machines
    ● Bottlenecks
    ● Scale Down
    ○ Embedded systems
    ○ More containers per host
