
How do we do Benchmark? Impressions from Conversations with the Community

Stefan Marr

September 30, 2021

Transcript

  1. How do we do Benchmark? Impressions from Conversations with the Community. VMM’21, Virtual. Stefan Marr

  2. Got a Question? Feel free to interrupt me!

  3. There’s Work on What We Should be Doing

  4. There’s Work on What We Should be Doing

  5. There’s Work on What We Should be Doing

  6. There’s Work on What We Should be Doing
  7. What are we doing? And what are we struggling with? I had a different question.

  8. Outline: 1. My Interview Methodology 2. Some Anecdotes 3. Our Struggles 4. Our Best Practices

  9. MY METHODOLOGY

  10. My Methodology • 21 interviews • tiny sample

  11. My Methodology • 21 interviews • With groups in the field of “programming languages and systems” • tiny sample • not representative

  12. My Methodology • 21 interviews • With groups in the field of “programming languages and systems” • Semi-structured interviews • tiny sample • not representative • not the same for all interviews

  13. My Methodology • 21 interviews • With groups in the field of “programming languages and systems” • Semi-structured interviews • Ad hoc result analysis • tiny sample • not representative • not the same for all interviews • Interpretation biases

  14. This isn't Data. It's Anecdotes.

  15. Use of Automated Tests, Continuous Integration. Do you use some form of automated testing/CI? Using Zoom’s Reactions, likely at the bottom of the screen.

  16. Use of Automated Tests, Continuous Integration. Do you use some form of automated testing/CI? Using Zoom’s Reactions, likely at the bottom of the screen. 👏 yes ❤ sometimes 😮 no
  17. Use of Automated Tests, Continuous Integration. Out of 21 groups, >75% sometimes use CI.

  18. Use of Automated Tests, Continuous Integration. Out of 21 groups, >75% sometimes use CI. But • Can differ per student • Per project • …

  19. Use of Automated Tests, Continuous Integration. Out of 21 groups, >75% sometimes use CI. But • Can differ per student • Per project • … 😱 In academia, testing is not “standard practice”
  20. Benchmarking and Frequency. Do you run benchmarks for your day-to-day engineering? 👏 don’t need it ❤ yes 😮 only for a paper

  21. Benchmarking and Frequency. Do you run benchmarks for your day-to-day engineering? Out of 21 groups, do it at least for some projects: ≈30% for every pull request ≈30% at some interval ≈50% only for a paper

  22. Hardware Setup. Out of 21 groups, for at least some projects: >55% dedicated, self-hosted ≈15% bare-metal cloud ≈20% multi-tenant cloud ≈15% developer machine

  23. Hardware Setup. Out of 21 groups, for at least some projects: >55% dedicated, self-hosted ≈15% bare-metal cloud ≈20% multi-tenant cloud ≈15% developer machine. 60% of groups: high cost/effort of maintaining machines and tools
  24. Are the Machines Prepared in Some Way? Out of 21 groups: >70% do some preparation, <30% do no preparation. Preparation may include • disabling daemons, disk usage, Address Space Layout Randomization • disabling turbo boost, frequency scaling • NUMA-node pinning, thread pinning

  25. Are the Machines Prepared in Some Way? Out of 21 groups: >70% do some preparation, <30% do no preparation. Preparation may include • disabling daemons, disk usage, Address Space Layout Randomization • disabling turbo boost, frequency scaling • NUMA-node pinning, thread pinning. 👍 for awareness. But it requires expertise and is not trivial.
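
The preparation steps listed on slides 24 and 25 mostly come down to a handful of operating-system knobs. Below is a minimal sketch, assuming a Linux host with the intel_pstate driver and root privileges; the exact files and values differ per machine and are shown only as an illustration.

```python
#!/usr/bin/env python3
"""Minimal sketch of benchmark-machine preparation on Linux.
Assumptions: root privileges, Intel CPU with the intel_pstate driver."""
import glob

def write_knob(path, value):
    # Write a single value to a sysfs/procfs setting.
    with open(path, "w") as f:
        f.write(value)

# Disable Address Space Layout Randomization for more deterministic memory layouts.
write_knob("/proc/sys/kernel/randomize_va_space", "0")

# Disable turbo boost (intel_pstate only; other drivers expose different knobs).
write_knob("/sys/devices/system/cpu/intel_pstate/no_turbo", "1")

# Fix the CPU frequency governor to 'performance' to avoid frequency scaling.
for governor in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
    write_knob(governor, "performance")
```

Thread and NUMA-node pinning are usually applied per benchmark process instead, for example with `os.sched_setaffinity` inside the harness or by launching the VM under `taskset`/`numactl`.
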
  26. Data Provenance. Did you ever have an issue like: – Unsure what was measured? – Mixed up data from experiments? – Unsure which parameters were used? 👏 yes ❤ no

  27. Data Provenance. Out of 21 groups, for some projects: <50% track it systematically, >60% do not track it.

  28. Data Provenance. Out of 21 groups, for some projects: <50% track it systematically, >60% do not track it. Common issues named: • Comparing wrong data, only noticed by inconsistencies • Losing track of what’s what • Parameters/setup details not recorded
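
The provenance problems named on slide 28 can often be avoided with very little automation: write a sidecar file next to every result that records what was measured, with which parameters, on which machine, and when. The sketch below is one way to do that in Python; the file layout, field names, and example parameters are assumptions, not part of the talk.

```python
import json, platform, subprocess
from datetime import datetime, timezone
from pathlib import Path

def record_provenance(result_file: Path, params: dict) -> None:
    """Write a JSON sidecar next to a result file that records what was
    measured, with which parameters, on which machine, and when."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    meta = {
        "result_file": result_file.name,
        "git_commit": commit,
        "parameters": params,  # e.g. iteration counts, heap size, VM flags
        "hostname": platform.node(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    result_file.with_suffix(".meta.json").write_text(json.dumps(meta, indent=2))

# Hypothetical usage, right after the harness wrote results/richards.csv:
record_provenance(Path("results/richards.csv"), {"iterations": 1000, "warmup": 100})
```
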
  29. Data Processing. The 21 groups named the following: >71% Python, >40% Matplotlib, ≈40% R, ≈33% Spreadsheets and other things

  30. Data Processing. The 21 groups named the following: >71% Python, >40% Matplotlib, ≈40% R, ≈33% Spreadsheets and other things. Concerns: • Too much time spent analyzing data • Often the same, but no reuse
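
One way to address the “often the same, but no reuse” concern from slide 30 is to keep the recurring analysis steps in small shared helpers. The sketch below assumes a CSV produced by the benchmark harness with `benchmark` and `runtime_ms` columns; the column names and file paths are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

def runtime_boxplot(csv_path: str, out_path: str) -> None:
    """Reusable helper: one boxplot of runtimes per benchmark.
    Assumes columns named 'benchmark' and 'runtime_ms'."""
    df = pd.read_csv(csv_path)
    fig, ax = plt.subplots(figsize=(6, 3))
    df.boxplot(column="runtime_ms", by="benchmark", ax=ax)
    ax.set_ylabel("runtime (ms)")
    fig.suptitle("")  # remove pandas' automatic group title
    fig.tight_layout()
    fig.savefig(out_path)

runtime_boxplot("results/latest.csv", "runtimes.pdf")
```
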
  31. Data Processing. Of the 21 groups, >88% do something manual, >70% have some things scripted. 2 groups automate everything, including generating LaTeX macros.
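
Slide 31 mentions that two groups automate everything up to generating LaTeX macros. A hedged sketch of what that last step can look like: summary numbers are computed from the results and written as `\newcommand` definitions that the paper inputs, so no number is ever copied by hand. The column names, the speedup metric, and the file names are assumptions for illustration.

```python
import pandas as pd

def write_latex_macros(csv_path: str, macro_file: str) -> None:
    """Compute summary numbers from the results and emit LaTeX macros,
    so numbers in the paper are generated rather than copied by hand."""
    df = pd.read_csv(csv_path)
    speedup = df["baseline_ms"].mean() / df["optimized_ms"].mean()
    with open(macro_file, "w") as f:
        f.write(f"\\newcommand{{\\MeanSpeedup}}{{{speedup:.2f}}}\n")
        f.write(f"\\newcommand{{\\NumBenchmarks}}{{{df['benchmark'].nunique()}}}\n")

# The paper then uses \input{macros.tex} and refers to \MeanSpeedup in the text.
write_latex_macros("results/latest.csv", "macros.tex")
```
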
  32. STRUGGLES AND BEST PRACTICES

  33. Our Struggles • Finding good benchmarks • Setting up and maintaining machines, minimizing measurement error • Tracking data provenance • Keeping historic data available/useful • Standard analyses, data processing, and statistics
  34. Best Practices • Use CI/Automated Testing – At the very least, check that benchmarks produce correct results • Use same setup for day-to-day engineering as for producing data for papers – The setup is already debugged! • Most CI systems can store artifacts – Basic provenance tracking for results! • Automate data handling – Spreadsheets can import data from external data sources – Avoid manually copying data around • Define workflow that works for your group – And teach it!
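
The first best practice, checking that benchmarks still produce correct results, fits naturally into the CI setup as an ordinary test. The sketch below uses pytest; `run_benchmark` stands in for whatever entry point a group's harness provides, and both the benchmark names and expected values are purely illustrative.

```python
import pytest

# Stand-in for the real harness entry point; in practice this would run the
# benchmark on the VM under development (name and values are hypothetical).
def run_benchmark(name: str, iterations: int) -> int:
    reference = {"richards": 9, "deltablue": 18000}  # illustrative result values
    return reference[name]

EXPECTED = {"richards": 9, "deltablue": 18000}

@pytest.mark.parametrize("name,expected", EXPECTED.items())
def test_benchmark_produces_correct_result(name, expected):
    # A single fast iteration in CI: only correctness is checked, not performance.
    assert run_benchmark(name, iterations=1) == expected
```
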
  35. Questions? Our Struggles: • Finding good benchmarks • Setting up and maintaining machines, minimizing measurement error • Tracking data provenance • Keeping historic data available/useful • Standard analyses, data processing, and statistics. Best Practices: • Use CI/Automated Testing • Use same setup for day-to-day engineering as for producing data for papers • Most CI systems can store artifacts • Automate data handling • Define workflow that works for your group