We usually understand "performance" as "technical efficiency": "how well a person, machine, etc. does a piece of work or an activity". ⇒ Depending on who you ask, their activity will change, and so will their KPI. ⇒ If you have only one contact, you will see the company through their own prism. So the real question is: what is your activity?
Each actor sees only one level up/down:
- Business needs the app to work
- Dev understands the business, needs middleware to run their apps
- Middleware understands the apps, needs storage/compute/network
- Housing understands the hardware
What is "performance" for the management of the company?
- Revenue: needs to sell more / reduce TCO / get higher margins
- Competitive advantage: be faster during the 5% most important time frames
- Growth capacity: remove existing barriers
What is "performance" for developers? They need their tools and platform to work efficiently.
- Time To Market: deploy new versions of the code faster
- Density: ship more code on equivalent resources
- Security: stay focused on the main business activity
System / hardware must be rock-solid, and detect failures or at least give details on them.
- Automation: reduce time-to-deploy; keep the same toolkit & UX across multiple hardware revisions
- Compatibility: drivers must be available OOB for the OS
- Density: more workload for the same amount of management
What is "performance" for middleware?
Network is usually Ethernet, single-vendor. Storage is now servers with storage devices (HDD, SSD, NVMe).
- Risk management: more density = more loss during an outage or maintenance
- Network: 24/32/48 ports, 10/25/40/100 Gbps
- Storage: throughput or latency? It depends on the access type (large/small blocks + sequential/random); see the sketch below
- Density: good, if the tools follow for rebalancing
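As a rough illustration of why "throughput or latency" depends on the access type, throughput is simply IOPS × block size, so the same device looks very different for large sequential blocks versus small random ones. The figures below are invented for the example, not measurements:

```python
# Back-of-envelope: throughput = IOPS x block size.
# Illustrative numbers only, not from any real device.

def throughput_mb_s(iops: int, block_size_kb: int) -> float:
    """MB/s delivered for a given IOPS budget and block size."""
    return iops * block_size_kb / 1024

profiles = {
    "large sequential (1 MiB blocks)": (5_000, 1024),   # few, big I/Os
    "small random (4 KiB blocks)":     (200_000, 4),    # many, tiny I/Os
}

for name, (iops, bs_kb) in profiles.items():
    print(f"{name:35s} {iops:>7} IOPS -> {throughput_mb_s(iops, bs_kb):8.1f} MB/s")
```

Same hardware budget, very different numbers: a latency-sensitive random workload will never show the headline throughput figure.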
What is "performance" for compute?
You need to tailor your products. ⇒ One-size-fits-all is never efficient. We need application profiles (a minimal sketch follows below):
- Bound on: CPU / Memory / IO
- Workload type: Latency / Throughput / Jitter
- Risk acceptance: Security / Performance
- Data lifetime: one-shot / short-lived / long-lived
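A minimal sketch of what such an application profile could look like in code; the field names and the example values are illustrative, not from the talk:

```python
from dataclasses import dataclass
from enum import Enum

class Bound(Enum):
    CPU = "cpu"
    MEMORY = "memory"
    IO = "io"

class Workload(Enum):
    LATENCY = "latency"
    THROUGHPUT = "throughput"
    JITTER = "jitter"

class DataLifetime(Enum):
    ONE_SHOT = "one-shot"
    SHORT_LIVED = "short-lived"
    LONG_LIVED = "long-lived"

@dataclass
class ApplicationProfile:
    """One profile per application family, used to validate new hardware."""
    name: str
    bound_on: Bound                # dominant resource
    workload_type: Workload        # what the app is sensitive to
    prefers_performance: bool      # risk acceptance: performance over security?
    data_lifetime: DataLifetime

# Example: a VDI cache is IO-bound, latency-sensitive, short-lived data.
vdi_cache = ApplicationProfile(
    name="vdi-cache",
    bound_on=Bound.IO,
    workload_type=Workload.LATENCY,
    prefers_performance=True,
    data_lifetime=DataLifetime.SHORT_LIVED,
)
```

With a handful of such profiles, validating a new product becomes "does it fit these profiles?" instead of an open-ended benchmark campaign.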
Metrics on the OS and hardware, logs on the application.
- The OS gives hints on which resource is consumed (CPU, memory, I/O); see the sketch below
- The hardware provides usage efficiency (IPC, memory bandwidth, PCIe bandwidth)
- Logs provide detail on the multiple steps of the workload
- User experience is hard to assess (VDI); end-user shadowing helps
Question: when to activate HT/SMT?
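For the "OS gives hints" part above, a quick sketch using the psutil library; the fields sampled here are an assumption about what is worth watching, just to show the idea:

```python
import psutil  # third-party: pip install psutil

# Sample coarse OS-level counters to see which resource is being consumed.
cpu_per_core = psutil.cpu_percent(interval=1, percpu=True)
mem = psutil.virtual_memory()
disk = psutil.disk_io_counters()
net = psutil.net_io_counters()

print(f"CPU per core (%): {cpu_per_core}")
print(f"Memory used: {mem.percent}%")
print(f"Disk: {disk.read_bytes >> 20} MiB read, {disk.write_bytes >> 20} MiB written")
print(f"Net:  {net.bytes_recv >> 20} MiB in, {net.bytes_sent >> 20} MiB out")

# These OS counters say *what* is consumed; hardware counters (IPC, memory
# bandwidth, PCIe bandwidth) are needed to say how *efficiently* it is used,
# e.g. via perf or vendor tools.
```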
Digital twin: need to connect a server?
- The network team sees "Et12/1"
- The sysop sees "eth0"
- The datacenter team sees "PCI Slot 1"
Works with browser / VR / AR. Correlate the views of the different teams to reduce errors and time.
https://ogree.ditrit.io
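The correlation itself can be as simple as one record per physical connection carrying each team's name for it (the identifiers are the ones quoted above; the structure is only an illustration of the idea, not OGrEE's data model):

```python
from dataclasses import dataclass

@dataclass
class PortView:
    """One physical connection, as each team names it."""
    network_name: str     # what the network team sees
    sysop_name: str       # what the sysop sees
    datacenter_name: str  # what the datacenter team sees

port = PortView(network_name="Et12/1", sysop_name="eth0", datacenter_name="PCI Slot 1")

# With such a shared record, "please check Et12/1" and "eth0 is down"
# unambiguously refer to the same object for everyone.
print(port)
```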
- InfiniBand requires many tunings and has bugs (need Scaling Set, otherwise bad partitioning by groups of 5)
- MPI has issues with a large number of NUMA nodes
- When running on VMs, issues with noisy neighbours
Everyone is trying to reduce TCO. Reducing TCO often requires a few actors to lose a bit, so that the group wins a lot.
- Owners: find "champions" within every team to be your point of contact to discuss possibilities, upgrades, requirements…
- Vendors: understand what the main activity of your current contact is, and expand your contact circle.
MPI: the developer sees the application consuming 100% CPU on all cores. New test servers with more cores are not faster.
Check with netdata: only 1 core is doing real work (yellow), all the others are active-spinning (blue); see the sketch below.
Quick fixes:
- servers with fewer cores but a higher frequency
- run jobs with fewer cores, but run more jobs
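A way to reproduce part of that observation from a script, using psutil. Note the assumption made explicit in the comments: plain utilization cannot distinguish real work from active spinning, which is exactly why the extra cores did not help; IPC counters or netdata's per-core view are still needed for the "real work" part:

```python
import psutil  # third-party: pip install psutil

# Per-core utilization split, sampled over one second.
per_core = psutil.cpu_times_percent(interval=1, percpu=True)

for i, core in enumerate(per_core):
    busy = 100.0 - core.idle
    print(f"core {i:2d}: busy {busy:5.1f}% (user {core.user:5.1f}%, system {core.system:5.1f}%)")

# Caveat: a core spinning in a busy-wait loop looks just as "busy" as a core
# doing useful work, so 100% everywhere is not proof of a well-parallelized job.
```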
The freezes are not random: every 1300 s (precisely), for 2 s to 8 s. Check the logs… hmm, a firmware upgrade yesterday. Let's downgrade. No more issue.
The BMC generated a SCSI bus scan for monitoring, which froze the I/Os.
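The "every 1300 s, precisely" observation is easy to confirm from timestamps: compute the gaps between consecutive freeze events and check how tight they are. The timestamps below are made-up placeholders, only the method matters:

```python
from statistics import mean, pstdev

# Epoch timestamps (seconds) of observed I/O freezes -- placeholder values.
freeze_times = [10_000, 11_300, 12_600, 13_900, 15_200]

gaps = [b - a for a, b in zip(freeze_times, freeze_times[1:])]
print(f"intervals: {gaps}")
print(f"mean period: {mean(gaps):.0f}s, jitter (stddev): {pstdev(gaps):.1f}s")

# Near-zero jitter around a fixed period points to a scheduled task
# (here: a periodic SCSI bus scan triggered by the BMC), not random load.
```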
End users have short-lived data (AI/ML, VDI cache): they just need low-latency, local access to temporary data.
The storage team proposes new networked SSD/NVMe to keep the data safe and still "fast". Tests were "OK, but not that good".
At scale, it crushed the network and latency skyrocketed. For the end-user experience, it's unusable.
⇒ High-end storage, high cost, no use.
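A back-of-envelope check that would have flagged the problem before the rollout. All the numbers below are hypothetical, only the reasoning matters: sum the per-client traffic the remote storage would generate and compare it with the uplink capacity:

```python
# Hypothetical figures, NOT from the talk -- they only illustrate the
# "at scale it crushed the network" effect.
clients = 500                 # e.g. VDI sessions hitting the remote NVMe
per_client_mb_s = 50          # temporary-data traffic per client
uplink_gbit_s = 2 * 25        # two 25 Gbps uplinks on the storage head

demand_gbit_s = clients * per_client_mb_s * 8 / 1000
print(f"demand: {demand_gbit_s:.0f} Gbps vs capacity: {uplink_gbit_s} Gbps")

if demand_gbit_s > uplink_gbit_s:
    print("oversubscribed -> queues build up and latency skyrockets")
```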
1) Open a ticket on the support portal with useful data + a memory dump
2) Level 1 searches the public KB and gives proposals you have already tested
3) Ask for escalation
4) Level 2 takes over, looks for bugs, tests things… but they don't know your business/company/environment, so they ask for more details
5) The shift is over, FTS takes the ticket, asks for the details again, then copy/pastes everything into Bugzilla
6) The ticket is not trivial, so it may be processed in a few months
7) The product owner deemed it not urgent, so it will wait for the next release
8) You contact an engineer through other channels (Twitter, LinkedIn, GitHub…)
Lots of frustration: you lost time & energy on the ticket, the issue isn't solved quickly, and support was useless.
1) Open a ticket on the support portal with useful data + a memory dump
2) The TAM takes it. He knows you have already tested the public solutions; he just recaps them while searching for more insights.
3) The TAM prepares a test/lab environment that engineering can step right into.
4) Engineering works swiftly on the analysis with the provided details. Indeed, they had not thought of / had no experience with your corner case. They propose a bugfix in the ticket.
5) The bugfix is validated and merged mainline.
Nobody lost time in the process; the fix was tricky to validate, but the support response was quick and efficient, and now all customers benefit from this fix.
The system team asked the datacenter team for a system reset. They had issues identifying the physical node slot (bad SN or not available).
- The system team only had remote access: it missed the physical view.
- The datacenter team only had the physical view: it missed confirmation / remote status.
We asked the vendor for a solution (e.g. LED blink, /dev/mem read, etc.). A firmware engineer provided an "ipmitool raw" command to get the slot from the BMC (not documented). Good enough, problem solved!
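The exact raw request stayed vendor-specific and undocumented, so the bytes below are pure placeholders; the sketch only shows how such a command could be wrapped so both teams can query the slot themselves:

```python
import subprocess

def get_physical_slot(bmc_host: str, user: str, password: str) -> str:
    """Ask the BMC for the physical slot via a raw IPMI command.

    NETFN/CMD below are placeholders: the real values came from the vendor's
    firmware engineer and are not documented publicly.
    """
    netfn, cmd = "0x30", "0x00"  # placeholder bytes, vendor-specific
    out = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-P", password,
         "raw", netfn, cmd],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```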
The application uses RDMA. The setup is battle-tested and working great. We test a new CPU model: the app cannot init RDMA.
Many vendors need to be integrated: chassis + CPU and switch + ASIC. We sent all the hardware to the switch vendor for tests. It took almost 6 months.
Final word: one manufacturer fixed bugs in its PCIe implementation; those bugs had previously allowed the partial PCIe implementation of another manufacturer to work as intended.
Direct contact between engineers makes for an efficient experience. Engineers do not like support tickets, nor playing guessing games with customers. But they do like to track a valid customer case with reproducible steps.
- Customer: has the production experience and overview
- Vendor: has the knowledge and the ability to upgrade the product
Question: how can we identify & tag these efficient people?
You need to compare, at the same time: cost + efficiency + production (a small comparison sketch follows after the lists below).
- Usage cost:
  - Price/request (application)
  - Price/core (system)
  - Price/Watt (datacenter)
- Application efficiency:
  - Instructions Per Cycle (IPC)
  - Performance / Watt
- Production:
  - Man-days to integrate a new model
  - Specific integration needed or not
  - Compatibility with the existing toolchain
Increase internal communication:
- Identify champions in teams
  - They are curious and like to understand
  - They might help you in many ways
- Share your needs & flexibility
  - You're all on the same team
  - Being honest increases trust and efficiency between actors
- Create application profiles to easily validate new products
  - Bound on: CPU / Memory / IO
  - Workload: Latency / Throughput / Jitter
  - Risk acceptance: Security / Performance
  - Data lifetime: one-shot / short / long
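To make the "compare cost + efficiency + production at the same time" point concrete, a small sketch that lines up candidate models on those three axes. Every number, name and weight here is invented for the illustration; the point is that a model that wins on one axis (price/core) can lose overall once integration effort is counted:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    price_per_core: float        # usage cost (system view)
    perf_per_watt: float         # application efficiency
    integration_days: float      # production: man-days to integrate the model

# Invented figures, just to show the comparison, not real products.
candidates = [
    Candidate("model-A", price_per_core=90.0, perf_per_watt=1.0, integration_days=2),
    Candidate("model-B", price_per_core=70.0, perf_per_watt=1.3, integration_days=20),
]

def score(c: Candidate) -> float:
    # Arbitrary weighting: better perf/W and cheaper cores are good,
    # long integration (new toolchain, specific work) is penalised.
    return c.perf_per_watt * 100 - c.price_per_core - c.integration_days * 2

for c in sorted(candidates, key=score, reverse=True):
    print(f"{c.name}: score {score(c):.1f}")
```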