PremDay - Performance measurement

Adrien Mahieux from Moji presents: 'Performance Assessment: who to ask, what to expect, how to measure'

PremDay

July 02, 2024

Transcript

  1. Performance measurement

    Adrien Mahieux, CTO & Performance Engineer
    moji - orness - mvconcept
    gh: github.com/Saruspete
    tw: @Saruspete
    em: [email protected]
  2. Adrien Mahieux

    CTO & Performance Engineer
    “From algorithm to silicon”
    Production first
    Experiences from many companies (~40)
    - Size: Big & Medium
    - Time: Long-run & Commando
    - Type: HFT / VM / HPC
  3. Small company but full of geeks

    - Fiber ISP: 1 Tbps internet
    - Datacenter: Free Cooling, PUE < 1.1
    - Services: packaged to business requirements, from design to run
  4. Performance Measurement

    - Who to ask “What is performance” ?
    - What to expect ?
    - How to measure ?
    - Benchmarking examples
    - Communication between actors
  5. What is performance ?

    Most of us being technical people, we usually understand “performance” as “technical efficiency”.
    “how well a person, machine, etc. does a piece of work or an activity”
    ⇒ Depending on who you ask, their activity will change, and so will their KPI.
    ⇒ If you have only 1 contact, you’ll see the company through their own prism.
    So the real question is: what is your activity ?
  6. “I’m in I.T.” - simplified

    Most only show interest in 1 level up/down.
    - Business needs app to work
    - Dev understands business, needs middleware to run their apps
    - Middleware understands apps, needs storage/compute/network
    - Housing understands hardware
  7. What is “performance” for business

    Referring to the added value of the company.
    - Revenue: needs to sell more / reduce TCO / higher margins
    - Competitive advantage: faster during the 5% most important time frame
    - Growth capacity: remove existing barriers
  8. What is “performance” for developers

    Developers create tools for business to work efficiently.
    - Time To Market: deploy new versions of code faster
    - Density: ship more code on equivalent resources
    - Security: focus on the main business activity
  9. What is “performance” for middleware

    Middleware handles the application runtime (System, DB, VM…)
    - Stability: system / hardware must be rock-solid, and detect or (at least) give details on failure
    - Automation: reduce time-to-deploy. Keep the same toolkit & UX across multiple revisions
    - Compatibility: drivers must be available OOB for the OS
    - Density: more workload for the same amount of management
  10. What is “performance” for compute

    Compute is usually provided by 1U/2U x86 servers.
    Network is usually single-vendor Ethernet.
    Storage is now servers with storage devices (HDD, SSD, NVMe).
    - Risk management: more density = more loss during outage or maintenance
    - Network: 24/32/48 ports, 10/25/40/100 Gbps
    - Storage: throughput or latency ? Depends on access type (large/small blocks + sequential/random)
    - Density: good, if tools follow for rebalancing
  11. What is “performance” for datacenter

    Datacenter is focused on raw power. 1 W electricity in = 1 W heat out.
    - Raw power efficiency: PUE / WUE
    - Maintainability: easy to manipulate, identify FRU
    - Compatibility: standard sockets (C13/C14), rack size
    - Environmental factors: airflow, power-factor, chassis
    - Water cooling: heat-reuse, durability (algae…)
    - Rack filling: too dense = useless vacant slots
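The PUE figure mentioned on the datacenter slide is, by its usual definition, total facility power divided by IT equipment power. The sketch below only illustrates that ratio; the function names and the sample figures are assumptions, not values from the talk.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(it_equipment_kw: float, pue_value: float) -> float:
    """Power spent on everything except IT (cooling, distribution losses...)."""
    return it_equipment_kw * (pue_value - 1.0)


# Hypothetical figures, not taken from the talk.
it_load_kw = 100.0
facility_kw = 108.0

ratio = pue(facility_kw, it_load_kw)
print(f"PUE = {ratio:.2f}")
print(f"Overhead = {overhead_kw(it_load_kw, ratio):.1f} kW")
```

Since every watt of IT load ends up as heat, the overhead term is roughly what the cooling and power distribution cost on top of the servers themselves.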
  12. We need to have profiles

    Depending on the use-case, you need to tailor your products.
    ⇒ One-size-fits-all is never efficient.
    We need application profiles:
    - Bound on: CPU / Memory / IO
    - Workload type: Latency / Throughput / Jitter
    - Risk acceptance: Security / Performance
    - Data lifetime: one-shot / short-lived / long-lived
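One way to make such application profiles concrete is a small typed record with one field per axis listed on the slide. This is a minimal sketch: the class names, the `wants_local_scratch` rule and the sample profile are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Bound(Enum):
    CPU = "cpu"
    MEMORY = "memory"
    IO = "io"


class Workload(Enum):
    LATENCY = "latency"
    THROUGHPUT = "throughput"
    JITTER = "jitter"


class RiskPriority(Enum):
    SECURITY = "security"
    PERFORMANCE = "performance"


class DataLifetime(Enum):
    ONE_SHOT = "one-shot"
    SHORT_LIVED = "short-lived"
    LONG_LIVED = "long-lived"


@dataclass(frozen=True)
class ApplicationProfile:
    name: str
    bound: Bound
    workload: Workload
    risk: RiskPriority
    data_lifetime: DataLifetime


def wants_local_scratch(profile: ApplicationProfile) -> bool:
    """Illustrative rule (an assumption, not from the talk):
    latency-sensitive data that is not long-lived fits local storage."""
    return (
        profile.workload is Workload.LATENCY
        and profile.data_lifetime is not DataLifetime.LONG_LIVED
    )


# Hypothetical profile for a cache-like workload.
vdi_cache = ApplicationProfile(
    name="vdi-cache",
    bound=Bound.IO,
    workload=Workload.LATENCY,
    risk=RiskPriority.PERFORMANCE,
    data_lifetime=DataLifetime.SHORT_LIVED,
)
print(vdi_cache.name, wants_local_scratch(vdi_cache))
```

A written profile like this gives every team the same vocabulary when a new product has to be validated against a workload.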
  13. How to measure ?

    Technical load: probes on OS & hardware, logs on application.
    - OS gives hints on what resource is consumed (CPU, memory, I/O)
    - Hardware provides usage efficiency (IPC, Mem BW, PCIe BW)
    - Logs provide detail on the multiple steps of the workload
    User-experience is hard to assess (VDI). End-user shadowing helps.
    Question: When to activate HT/SMT ?
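The hardware efficiency metrics named above are simple ratios over raw counters. The sketch below assumes the counter values have already been collected by some tool; the helper names and the numbers are made up for illustration.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions Per Cycle from two hardware counter readings."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def bandwidth_gb_per_s(bytes_transferred: int, seconds: float) -> float:
    """Average bandwidth in GB/s (1 GB = 1e9 bytes) over a sampling window."""
    if seconds <= 0:
        raise ValueError("sampling window must be positive")
    return bytes_transferred / seconds / 1e9


# Hypothetical counter deltas over a 2-second window.
print(f"IPC    = {ipc(2_200_000_000, 2_000_000_000):.2f}")
print(f"Mem BW = {bandwidth_gb_per_s(60_000_000_000, 2.0):.1f} GB/s")
```

A core can show 100% usage at the OS level and still have a very low IPC, which is why the OS and hardware views need to be read together.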
  14. System monitoring

    Netdata will gather all data: system and many applications
    - 1s sampling
    - Distributed and standalone
    - Very low overhead
    - Custom collectors easy to create
    https://netdata.cloud
  15. Hardware monitoring

    Many insights on fan-site wikichip.org & osdev.org
    Tools provided by vendors: Intel PCM & AMD uProfPcm
    (Figures: Intel PCM, AMD uProf)
  16. OGrEE: Bridge between software & hardware

    OGrEE: Datacenter Digital Twin
    Need to connect a server ?
    - Network sees “Et12/1”
    - Sysop sees “eth0”
    - Datacenter sees “PCI Slot 1”
    Works with browser / VR / AR
    Correlate the views of the different teams to reduce errors and time
    https://ogree.ditrit.io
  17. Example: Renderman - 1 Frame

    Quickly uses all CPUs in userspace
    Small dataset: ~10GB used
    Average efficiency:
    - IPC 1.1
    - L3 hit ratio 80%
    - RAM Bandwidth: 15GB/s write + 30GB/s read
  18. Example: monte-carlo

    Series of mono-thread processing for distribution and multi-thread for work.
    A lot of time is spent in mono-thread.
    You can throw as many cores at it as you want, the speed-up will not be linear.
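The non-linear scaling described for the monte-carlo case can be approximated with Amdahl's law, a standard model that the slide itself does not name. The sketch below assumes a fixed serial fraction; the 80% parallel share is a made-up figure.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speed-up over 1 core when only `parallel_fraction` of the run scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be within [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workload: 80% of the run is multi-threaded.
for cores in (1, 8, 32, 128):
    print(f"{cores:>3} cores -> x{amdahl_speedup(0.8, cores):.2f}")
```

With a serial fraction of 20%, the speed-up can never exceed 5x however many cores are added, which is the effect the slide describes.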
  19. Example: scheduler mess

    Standard HPC/HFT/Virtualization issue:
    When free cores are available, the kernel scheduler will mess up the placement, and thus the cache
  20. Example: CSP bench

    For HPC, many caveats and tricks:
    - Infiniband requires many tunings and has bugs (need Scaling Set, else bad partition by group of 5)
    - MPI has issues with a large number of NUMA nodes
    - When on VM, issues with noisy neighbours
  21. Communicate your needs and flexibility

    Communication is most important when trying to reduce TCO.
    Reducing TCO often requires a few actors to lose a bit, for the group to win a lot.
    - Owners: find “champions” within every team to be your point of contact to discuss possibilities, upgrades, requirements…
    - Vendors: understand what the main activity of your current contact is, and expand your contact circle.
  22. HPC - “software is using all grid”

    Fortran application using MPI
    Developer sees the application consuming 100% CPU on all cores
    New test servers with more cores are not faster
    Check with netdata: only 1 core is doing real work (yellow), all others are active-spinning (blue)
    Quick fixes:
    - Servers with fewer cores but higher frequency
    - Run jobs with fewer cores, but more different jobs
  23. HPC - Unusual setup but good efficiency

    1 Node = 96 Cores / 800W
    11 chassis 2U4N per rack
    - Application: AVX512 (good)
    - System: 4 NUMA nodes (good)
    - Power: 35 kW (high but good)
    - Network: 44 * 25G (good)
    - Cooling: DLC + Air (custom-built)
    - 3 PSU = 3 Phases (not standard)
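The rack figures above can be cross-checked with a few multiplications. In the sketch below, the per-node and per-rack inputs come from the slide, while one 25G link per node is an assumption inferred from the "44 * 25G" figure.

```python
# Inputs stated on the slide.
CHASSIS_PER_RACK = 11
NODES_PER_CHASSIS = 4      # "2U4N": 4 nodes in a 2U chassis
UNITS_PER_CHASSIS = 2
CORES_PER_NODE = 96
WATTS_PER_NODE = 800
LINK_GBPS = 25

# Assumption: one 25G link per node (44 nodes, 44 links).
LINKS_PER_NODE = 1


def rack_totals() -> dict:
    """Aggregate per-rack figures from the per-node inputs."""
    nodes = CHASSIS_PER_RACK * NODES_PER_CHASSIS
    return {
        "nodes": nodes,
        "rack_units": CHASSIS_PER_RACK * UNITS_PER_CHASSIS,
        "cores": nodes * CORES_PER_NODE,
        "power_kw": nodes * WATTS_PER_NODE / 1000,
        "network_gbps": nodes * LINKS_PER_NODE * LINK_GBPS,
    }


for key, value in rack_totals().items():
    print(f"{key}: {value}")
```

The 44 nodes at 800 W each land on the roughly 35 kW per rack quoted on the slide.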
  24. HFT - “server is frozen”

    Trading: server freezes randomly
    Netdata: not random, every 1300s (precisely) for 2s-8s.
    Check the logs… hmm, firmware upgrade yesterday. Let’s downgrade. No more issue.
    BMC generated a SCSI bus scan for monitoring, which froze the I/Os.
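Telling "random" freezes from periodic ones comes down to looking at the gaps between events. The sketch below assumes the freeze start times have already been extracted from monitoring as seconds; the function name, the tolerance and the sample timestamps are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def detect_period(timestamps: Sequence[float], tolerance_s: float = 1.0) -> Optional[float]:
    """Return the common interval between events if they are evenly spaced.

    Returns None when there are fewer than 3 events or when any interval
    deviates from the median interval by more than `tolerance_s`.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    period = median(intervals)
    if all(abs(interval - period) <= tolerance_s for interval in intervals):
        return period
    return None


# Hypothetical freeze start times, in seconds since the start of the capture.
freezes = [120.0, 1420.0, 2720.0, 4020.0]
print(detect_period(freezes))
```

A fixed period points at a timer-driven cause (a scheduled scan, a cron job, a watchdog) rather than at load, which narrows the search considerably.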
  25. VDI / HPC - Temporary storage / Short lived data

    End-user has short-lived data (AI/ML, VDI cache). They just need low-latency, local access to it for temporary data.
    Storage team proposes new network SSD/NVMe to keep data safe & still “fast”.
    Tests were “ok, but not that good”.
    At scale, it crushed the network and latency skyrocketed.
    For end-user experience, it’s unusable.
    ⇒ High-end storage, high cost, no use.
  26. Support experience - software - usual

    1) Open case through portal with useful data + memory dump
    2) Level 1 searches the public KB, gives proposals you already tested
    3) Ask for escalation
    4) Level 2 takes over, seeks bugs, tests things… but doesn’t know your business/company/env, so they ask for more details
    5) Shift is over, FTS takes the ticket, asks for details again, then copies/pastes everything into bugzilla
    6) Ticket is not trivial, so it may be processed in a few months.
    7) The product-owner deemed it not urgent, so it’ll wait for the next release.
    8) You contact an engineer through other ways (twitter, linkedin, github…)
    Lots of frustration: you lost time & energy in the ticket, the issue isn’t solved quickly & support was useless
  27. Support experience - software - efficient

    1) Open case through portal with useful data + memory dump
    2) TAM takes it. He knows you already tested public solutions, he just reminds them while searching for more insights.
    3) TAM prepares a test/lab env usable for engineering to step right in.
    4) Engineering works swiftly on analysis with the provided details. Indeed, they hadn’t thought of / had no experience of your corner-case. They propose a bugfix in the ticket.
    5) Bugfix is validated and added to mainline
    Nobody lost time in the process: the fix was tricky to validate, but support response was quick and efficient, and now all customers benefit from this fix
  28. Support experience - Firmware

    Working on many 2U4N chassis, the compute team asked the datacenter team for a system reset.
    They had issues identifying the physical node slot (bad SN or not available).
    - System team only had remote access: missed the physical view
    - Datacenter team only had physical view: missed confirmation / remote status
    We asked the vendor for a solution (e.g. LED blink, /dev/mem read, etc…)
    Firmware engineer provided an “ipmitool raw” command to get the slot from the BMC (not documented).
    Good enough, problem solved !
  29. Support nightmare: multiple actors to sync

    Context: PCIe-Switches + RDMA. Setup is battle-tested, working great.
    We test a new CPU model: app cannot init RDMA.
    Need to integrate many vendors: Chassis + CPU and Switch + ASIC
    Sent all hardware for tests to switch-vendor. Took almost 6 months.
    Final word: one manufacturer fixed bugs in PCIe, which previously allowed the partial PCIe implementation of another manufacturer to work as intended.
  30. Recipe for good support experience

    Need to create a bridge between engineers for an efficient experience.
    Engineers do not like support tickets, nor playing guessing games with customers.
    But they do like to track a valid customer-case with reproducible steps.
    - Customer: has the production experience and overview
    - Vendor: has the knowledge and ability to upgrade the product
    Question: How can we identify & tag these efficient people ?
  31. In a few words

    Select your comparison unit:
    - Hard to compare at same time: cost + efficiency + production
    - Usage Cost:
      - Price/request (application)
      - Price/core (system)
      - Price/Watt (datacenter)
    - Application Efficiency:
      - Instructions Per Cycle (IPC)
      - Performance / Watt
    - Production:
      - Man-days to integrate new model
      - Specific integration needed or not
      - Compatibility with existing toolchain
    Increase internal communication:
    - Identify champions in teams
      - They are curious and like to understand
    - Share your needs & flexibility
      - You’re all on the same team
    - Being honest increases trust and efficiency between actors
      - They might help you in many ways
    - Create application profiles to easily validate new products
      - Bound on: CPU / Memory / IO
      - Workload: Latency / Throughput / Jitter
      - Risk acceptance: Security / Performance
      - Data lifetime: one-shot / short / long
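The comparison units listed above are all plain ratios, so they can be computed side by side for each candidate. This is a minimal sketch: the `Candidate` record, its field names and the sample figures are hypothetical, and "performance" is taken here as requests served over the comparison period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    price: float            # total cost over the comparison period, any currency
    cores: int
    avg_watts: float        # average power draw over the same period
    requests_served: int    # application-level work done over the same period

    @property
    def price_per_request(self) -> float:
        return self.price / self.requests_served

    @property
    def price_per_core(self) -> float:
        return self.price / self.cores

    @property
    def price_per_watt(self) -> float:
        return self.price / self.avg_watts

    @property
    def requests_per_watt(self) -> float:
        """One possible 'performance / Watt' reading (an assumption)."""
        return self.requests_served / self.avg_watts


# Hypothetical candidates, not measurements from the talk.
candidates = [
    Candidate("many-cores", price=12_000.0, cores=96, avg_watts=800.0, requests_served=900_000),
    Candidate("high-freq", price=10_000.0, cores=32, avg_watts=500.0, requests_served=1_000_000),
]
for c in candidates:
    print(f"{c.name}: {c.price_per_request:.4f}/req, {c.price_per_core:.2f}/core, "
          f"{c.price_per_watt:.2f}/W, {c.requests_per_watt:.0f} req/W")
```

Which ratio matters depends on who is asking: the application owner reads price per request, the system team price per core, and the datacenter price per Watt.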