PremDay - Performance measurement

Performance measurement
Adrien Mahieux, CTO & Performance Engineer
moji - orness - mvconcept
gh: github.com/Saruspete
tw: @Saruspete
em: [email protected]

Adrien Mahieux
CTO & Performance Engineer
“From algorithm to silicon”
Production first
Experiences from many companies (~40)
- Size: Big & Medium
- Time: Long-run & Commando
- Type: HFT / VM / HPC

Small company but full of geeks
- Fiber ISP: 1 Tbps internet
- Datacenter: Free Cooling, PUE < 1.1
- Services: packaged to business requirements, from design to run

Performance Measurement
- Who to ask “What is performance”?
- What to expect?
- How to measure?
- Benchmarking examples
- Communication between actors

What is performance?

What is performance?
Most of us being technical people, we usually understand “performance” as “technical efficiency”.
“how well a person, machine, etc. does a piece of work or an activity”
⇒ Depending on who you ask, their activity will change, and so will their KPIs.
⇒ If you have only one contact, you’ll see the company through their own prism.
So the real question is: what is your activity?

“I’m in I.T.”
Map of technologies and associated teams.
The bigger the company, the more silos.

“I’m in I.T.” - simplified
Most only show interest in 1 level up/down.
- Business needs the app to work
- Dev understands business, needs middleware to run their apps
- Middleware understands apps, needs storage/compute/network
- Housing understands hardware

What to expect

What is “performance” for business
Referring to the added value of the company.
- Revenue: sell more / reduce TCO / higher margins
- Competitive advantage: faster during the 5% most important time frame
- Growth capacity: remove existing barriers

What is “performance” for developers
Developers create tools for business to work efficiently.
- Time To Market: Deploy new versions of code faster
- Density: ship more code on equivalent resources
- Security: focus on the main business activity

What is “performance” for middleware
Middleware handles the application runtime (System, DB, VM…)
- Stability: System / hardware must be rock-solid, and detect or (at least) give details on failure
- Automation: Reduce time-to-deploy. Keep the same toolkit & UX across multiple revisions
- Compatibility: Drivers must be available OOB for the OS
- Density: more workload for the same amount of management

What is “performance” for compute
Compute is usually provided by 1U/2U x86 servers.
Network is usually single-vendor Ethernet.
Storage is now servers with storage devices (HDD, SSD, NVMe).
- Risk management: more density = more loss during outage or maintenance
- Network: 24/32/48 ports, 10/25/40/100 Gbps
- Storage: Throughput or latency? Depends on access type (large/small blocks + sequential/random)
- Density: good, if tools follow for rebalancing

What is “performance” for datacenter
Datacenter is focused on raw power. 1 W electricity in = 1 W heat out.
- Raw power efficiency: PUE / WUE
- Maintainability: easy to manipulate, identify FRU
- Compatibility: standard sockets (C13/C14), rack size
- Environmental factors: airflow, power-factor, chassis
- Water cooling: heat-reuse, durability (algae…)
- Rack filling: too dense = useless vacant slots
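
As a rough illustration of the "raw power efficiency" metric above: PUE is commonly defined as total facility power divided by IT equipment power. The sketch below assumes that definition; the sample power figures are hypothetical and not from the talk.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical site: 1000 kW of IT load plus 80 kW of cooling and losses.
example_pue = pue(total_facility_kw=1080.0, it_equipment_kw=1000.0)
print(f"PUE = {example_pue:.2f}")
```

A value close to 1.0 means almost all incoming power reaches the IT equipment.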

We need to have profiles
Depending on the use-case, you need to tailor your products.
⇒ One-size-fits-all is never efficient.
We need application profiles:
- Bound on: CPU / Memory / IO
- Workload type: Latency / Throughput / Jitter
- Risk acceptance: Security / Performance
- Data lifetime: one-shot / short-lived / long-lived
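
One possible way to make such a profile machine-readable is a small typed record. This is only a sketch: the class and field names are hypothetical, and the example profile values are chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Bound(Enum):
    CPU = "cpu"
    MEMORY = "memory"
    IO = "io"


class Workload(Enum):
    LATENCY = "latency"
    THROUGHPUT = "throughput"
    JITTER = "jitter"


class RiskAcceptance(Enum):
    SECURITY = "security"
    PERFORMANCE = "performance"


class DataLifetime(Enum):
    ONE_SHOT = "one-shot"
    SHORT_LIVED = "short-lived"
    LONG_LIVED = "long-lived"


@dataclass(frozen=True)
class ApplicationProfile:
    """The four profile axes listed on the slide, for one application."""
    name: str
    bound: Bound
    workload: Workload
    risk: RiskAcceptance
    data_lifetime: DataLifetime


# Hypothetical profile for a VDI scratch cache.
vdi_cache = ApplicationProfile(
    name="vdi-cache",
    bound=Bound.IO,
    workload=Workload.LATENCY,
    risk=RiskAcceptance.PERFORMANCE,
    data_lifetime=DataLifetime.SHORT_LIVED,
)
print(vdi_cache)
```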

How to measure? Tools…

How to measure?
- Technical load: probes on OS & hardware, logs on application.
  - OS gives hints on which resource is consumed (CPU, memory, I/O)
  - Hardware provides usage efficiency (IPC, Mem BW, PCIe BW)
  - Logs provide detail on the multiple steps of the workload
- User experience is hard to assess (VDI). End-user shadowing helps.
Question: When to activate HT/SMT?

System monitoring
Many, many tools to monitor and manage subsystems
www.brendangregg.com/linuxperf.html

System monitoring
Netdata will gather all data: system and many applications
- 1s sampling
- Distributed and standalone
- Very low overhead
- Custom collectors are easy to create
https://netdata.cloud

Hardware monitoring
Many insights on fan sites wikichip.org & osdev.org
Tools provided by vendors: Intel PCM & AMD uProfPcm
[Figures: Intel PCM, AMD uProf]

Hardware monitoring
Turn raw data into useful, low-level monitoring: IPC, L2/L3 hit, mem bw…
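
A minimal sketch of how raw counters become the derived metrics named above. The counter values are hypothetical, and reading them from the hardware (for example through the vendor tools) is out of scope here.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions Per Cycle from two raw hardware counters."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def hit_ratio(hits: int, misses: int) -> float:
    """Cache hit ratio from raw hit and miss counters (0.0 if no accesses)."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical counter deltas collected over one sampling interval.
sample_ipc = ipc(instructions=11_000_000, cycles=10_000_000)
sample_l3 = hit_ratio(hits=800_000, misses=200_000)
print(f"IPC={sample_ipc:.2f} L3 hit ratio={sample_l3:.0%}")
```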

OGrEE: Bridge between software & hardware
OGrEE: Datacenter Digital Twin
Need to connect a server?
- Network sees “Et12/1”
- Sysop sees “eth0”
- Datacenter sees “PCI Slot 1”
Works with browser / VR / AR
Correlate the views of the different teams to reduce errors and time
https://ogree.ditrit.io

Benchmarking examples

Example: Renderman - 1 Frame
- Quickly uses all CPU in userspace
- Small dataset: ~10GB used
- Average efficiency: IPC 1.1, L3 hit ratio 80%
- RAM bandwidth: 15GB/s write + 30GB/s read

Example: monte-carlo
Series of mono-thread processing for distribution and multi-thread processing for work.
A lot of time is spent in mono-thread.
You can throw as many cores at it as you want, the results will not scale linearly.
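
The non-linear scaling described here matches what Amdahl's law predicts (the law is not named in the talk; it is used here as the standard model). A minimal sketch, assuming a hypothetical 20% of the runtime is mono-thread:

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical workload: 20% of the time is spent in mono-thread code.
for cores in (1, 8, 64, 512):
    print(f"{cores:4d} cores -> speedup x{amdahl_speedup(0.2, cores):.2f}")
```

With a 20% serial part, the speedup can never exceed 1 / 0.2 = 5, however many cores are added.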

Example: scheduler mess
Standard HPC/HFT/Virtualization issue: when free cores are available, the kernel scheduler will mess up the placement, and thus the cache.

Example: CSP bench
For HPC, many caveats and tricks:
- InfiniBand requires a lot of tuning and has bugs (need Scaling Set, else bad partition by group of 5)
- MPI has issues with a large number of NUMA nodes
- When on VM, issues with noisy neighbours

Communication between actors

Communicate your needs and flexibility
Communication is most important when trying to reduce TCO.
Reducing TCO often requires a few actors to lose a bit, for the group to win a lot.
- Owners: find “champions” within every team to be your point of contact to discuss possibilities, upgrades, requirements…
- Vendors: understand what the main activity of your current contact is, and expand your contact circle.

HPC - “software is using all grid”
- Fortran application using MPI
- Developer sees application consuming 100% CPU on all cores
- New test servers with more cores are not faster
- Check with netdata: only 1 core is doing real work (yellow), all others are active-spinning (blue)
Quick fixes:
- Servers with fewer cores but higher frequency
- Run jobs with fewer cores, but more different jobs

HPC - Unusual setup but good efficiency
1 Node = 96 Cores / 800W
11 chassis 2U4N per rack
- Application: AVX512 (good)
- System: 4 NUMA nodes (good)
- Power: 35kW (high but good)
- Network: 44 * 25G (good)
- Cooling: DLC + Air (custom-built)
- 3 PSU = 3 Phases (not standard)
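
The rack-level figures follow from the per-node numbers on the slide. A small worked check, assuming one 25G link per node (which is what "44 * 25G" suggests):

```python
CHASSIS_PER_RACK = 11
NODES_PER_CHASSIS = 4   # "2U4N": four nodes in one 2U chassis
CORES_PER_NODE = 96
WATTS_PER_NODE = 800
GBPS_PER_NODE = 25      # assumption: one 25G link per node

nodes = CHASSIS_PER_RACK * NODES_PER_CHASSIS
rack_cores = nodes * CORES_PER_NODE
rack_kw = nodes * WATTS_PER_NODE / 1000
rack_gbps = nodes * GBPS_PER_NODE

print(f"{nodes} nodes, {rack_cores} cores, {rack_kw} kW, {rack_gbps} Gbps per rack")
```

This gives 44 nodes and 35.2 kW, consistent with the roughly 35 kW quoted for the rack.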

HFT - “server is frozen”
- Trading: server freezes randomly
- Netdata: not random, every 1300s (precisely) for 2s-8s.
- Check logs… hmm, firmware upgrade yesterday. Let’s downgrade. No more issue.
- BMC generated a SCSI bus scan for monitoring, which froze the I/Os.
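
Telling "random" from "periodic" can be done by looking at the spacing between stall events. The sketch below is one simple way to do it, not the method used in the talk; the function name, the tolerance, and the sample timestamps are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def stall_period(timestamps: Sequence[float], tolerance: float = 0.01) -> Optional[float]:
    """Return the mean interval if stalls are evenly spaced, else None."""
    if len(timestamps) < 3:
        return None  # not enough events to judge regularity
    ordered = sorted(timestamps)
    intervals = [b - a for a, b in zip(ordered, ordered[1:])]
    avg = mean(intervals)
    if avg <= 0:
        return None
    # Coefficient of variation: close to 0 means near-constant spacing.
    if pstdev(intervals) / avg <= tolerance:
        return avg
    return None


# Hypothetical stall start times, in seconds since the start of monitoring.
periodic = [100.0, 1400.0, 2700.0, 4000.0, 5300.0]
irregular = [100.0, 950.0, 3100.0, 3300.0, 6000.0]
print(stall_period(periodic))
print(stall_period(irregular))
```

A constant period points toward a scheduled task (here, the BMC monitoring scan) rather than load-related behaviour.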

VDI / HPC - Temporary storage / Short-lived data
- End-user has short-lived data (AI/ML, VDI cache) and just needs low-latency, local access to it.
- Storage team proposes new network SSD/NVMe to keep data safe & still “fast”.
- Tests were “ok, but not that good”.
- At scale, it crushed the network and latency skyrocketed.
- For end-user experience, it’s unusable.
⇒ High-end storage, high cost, no use.

Support experience - software - usual
1) Open case through portal with useful data + memory dump
2) Level 1 searches the public KB, gives proposals you already tested
3) Ask for escalation
4) Level 2 takes over, looks for bugs, tests things… but doesn’t know your business/company/env, so they ask for more details
5) Shift is over, FTS (follow-the-sun) takes the ticket, asks for details again, then copy/pastes everything into bugzilla
6) Ticket is not trivial, so it may be processed in a few months.
7) The product owner deemed it not urgent, so it’ll wait for the next release.
8) You contact an engineer through other ways (twitter, linkedin, github…)
Lots of frustration: you lost time & energy on the ticket, the issue isn’t solved quickly & support was useless

Support experience - software - efficient
1) Open case through portal with useful data + memory dump
2) TAM takes it. They know you already tested public solutions, and just recall them while searching for more insights.
3) TAM prepares a test/lab env usable for engineering to step right in.
4) Engineering works swiftly on analysis with provided details. Indeed, they hadn’t thought of or experienced your corner case. They propose a bugfix in the ticket.
5) Bugfix is validated and added to mainline
Nobody lost time in the process: the fix was tricky to validate, but support response was quick and efficient, and now all customers benefit from this fix

Support experience - Firmware
Working on many 2U4N chassis, the compute team asked the datacenter team for system resets.
They had issues identifying the physical node slot (bad or unavailable SN).
- System team only had remote access: missed the physical view
- Datacenter team only had physical view: missed confirmation / remote status
We asked the vendor for a solution (e.g. LED blink, /dev/mem read, etc.)
Firmware engineer provided an “ipmitool raw” command to get the slot from the BMC (not documented).
Good enough, problem solved!

Support nightmare: multiple actors to sync
Context: PCIe switches + RDMA. Setup is battle-tested, working great.
We test a new CPU model: app cannot init RDMA.
Need to integrate many vendors: Chassis + CPU and Switch + ASIC
Sent all hardware for tests to the switch vendor. Took almost 6 months.
Final word: one manufacturer fixed bugs in its PCIe implementation; those bugs had previously allowed another manufacturer’s partial PCIe implementation to work as intended.

Recipe for good support experience
Need to create a bridge between engineers for an efficient experience.
Engineers do not like support tickets, nor playing guessing games with customers.
But they do like to track a valid customer case with reproducible steps.
- Customer: has the production experience and overview
- Vendor: has the knowledge and ability to upgrade the product
Question: How can we identify & tag these efficient people?

Finally…

In a few words
Select your comparison unit:
- Hard to compare at the same time: cost + efficiency + production
- Usage cost:
  - Price/request (application)
  - Price/core (system)
  - Price/Watt (datacenter)
- Application efficiency:
  - Instructions Per Cycle (IPC)
  - Performance / Watt
- Production:
  - Man-days to integrate a new model
  - Specific integration needed or not
  - Compatibility with existing toolchain
Increase internal communication:
- Identify champions in teams
  - They are curious and like to understand
- Share your needs & flexibility
  - You’re all on the same team
  - Being honest increases trust and efficiency between actors
  - They might help you in many ways
- Create application profiles to easily validate new products
  - Bound on: CPU / Memory / IO
  - Workload: Latency / Throughput / Jitter
  - Risk acceptance: Security / Performance
  - Data lifetime: one-shot / short / long
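
To illustrate why the comparison unit matters, the sketch below computes the three usage-cost units for two server offers. Everything in it is hypothetical (class name, field names, prices, and measurements), and "price per request" is approximated here as price per unit of measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOffer:
    name: str
    price: float           # purchase price, any currency
    cores: int
    watts: float           # power draw under load
    requests_per_s: float  # measured application throughput

    def price_per_core(self) -> float:
        return self.price / self.cores

    def price_per_watt(self) -> float:
        return self.price / self.watts

    def price_per_request(self) -> float:
        # Price per (request/second) of measured throughput.
        return self.price / self.requests_per_s


# Two hypothetical offers with made-up figures.
offer_a = ServerOffer("A", price=10_000, cores=64, watts=500, requests_per_s=20_000)
offer_b = ServerOffer("B", price=12_000, cores=96, watts=800, requests_per_s=24_000)

for offer in (offer_a, offer_b):
    print(offer.name, offer.price_per_core(), offer.price_per_watt(),
          offer.price_per_request())
```

In this made-up case, offer B is cheaper per core while both offers cost the same per unit of application throughput, so each team would reach a different conclusion from its own unit.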

Thank you!
