
Lessons from 12 Years of Progress Chef Development

Joe Nuspl
October 21, 2023

Transcript

  1. I am the Principal Engineer in Operations Engineering at Workday.

    We oversee the Chef tooling, infrastructure, and cookbooks. We operate at scale, with over 70,000 servers across our private data centers, AWS, Azure, and Google Cloud.
  2. Infrastructure is... thankless work

    When was the last time you came home from a stressful day of work, went to the fridge for a cold beverage, found that, lo and behold, it was cold, and were so thankful that you called up your electric company to thank them for keeping the power on? Probably never.
 Any sort of infrastructure work, whether it be servers, networking, build systems, is thankless work. Usually people come to you when things are broken. Get used to it.
  3. Infrastructure is... hidden work

    It is often difficult to describe infrastructure work. It is like a "force" in physics. You cannot describe gravity, just the effects of it. When I was in grad school, the board of trustees came around the School of Computer Science to see what the different departments did. I had to demo something after the Graphics department. Their demo was a ball bouncing across multiple screens, compressing and elongating. I have to admit, it was pretty cool for 1993. I had helped them with the networking code, so when it was my turn, I had them restart the demo. I then pulled a network cable. The demo stopped. I talked about networking, the physical layer, protocols, etc. I then plugged the cable back in and the demo resumed. "We make THAT happen."
  4. If I'm doing my job, I should have nothing to do

    There are two types of work: reactive and proactive. Reactive work is often seen as busy: running around herding people to fix a problem NOW. "Oh, that person is busy, they must be doing a good job." The opposite is often true. Ideally, if you are doing a great job, you are operating in the proactive space: thinking about the edges, future needs, and trends, and solving issues before they become problems that require people to drop what they're doing and react. I've often said, "If I'm doing my job, I should have nothing to do." Or more precisely, "I should have nothing that must be done NOW."
  5. Reactive work is easy to measure. Proactive is not

    Since proactive work is often unseen, it is hard to measure. This makes it difficult for upper management to see what your team does, which can lead to questions about your necessity. Early in my career I discovered a scaling issue. I knew we'd hit the issue when students returned from summer break and everyone tried to log into a particular system. I had a fix in hand but my manager would not allow me to apply it. "Let it break and be ready with the fix to show how responsive we are." For the longest time this left me with a bad taste in my mouth. Why wouldn't we just fix the issue beforehand? Measuring reactive work is easy: just count the incidents. It is much harder to count the number of incidents that were avoided.
  6. chef 0.8.2 to 0.9.4

    When I joined Workday in 2011, there was a small POC running on chef-0.8.2. One of the first tasks was to upgrade to 0.9.4. This seemed like a trivial task but it wasn't. Well, the upgrade of chef was, but chef-client would regularly fail with a SEGV in ruby itself. Digging in, there was a stack variable that was not marked as such, so when the ruby garbage collector ran, it corrupted the stack. I'll just rebuild the ruby RPM...
  7. Yak Shaving

    And this led to a huge yak shaving exercise. For CentOS, there was the RBEL repository, Ruby Enterprise Linux. Well, the ruby RPM was built on an earlier version of CentOS 5.x that had an earlier version of automake. Fine, I'll just rebuild automake. But that required a different version of Perl. So I had to rebuild Perl just so I could rebuild Ruby. There will always be yaks to shave.
  8. Tech Debt

    Tech debt is often yak shaving. It is the work that needs to be done BEFORE you can do the work at hand. Imagine needing to build a hotfix but the build is broken on that branch. While the work may seem "nice to have", it can often become "must have", usually at the most inopportune time. Don't put off the proactive work of today; it will become the reactive work of tomorrow.
  9. Poised for Success

    When you embark on a transformation, you know it is the right thing to do. Others will see it as "more work". Be ready to demonstrate the power of the transformation. Prior to chef managed systems, Workday used NFS diskless servers. When the Shellshock exploit was disclosed, there was a big meeting where teams responsible for different environments were present, giving status on how long it would take to fix things. I started working on a cookbook change to upgrade bash. I tested the change, got it code reviewed, merged, and deployed to the new datacenter we were building. When the PM got to me and asked for my estimate, I said I was done. "Ok, then what is your estimate?" "No, I'm done. All the servers in my environment have been upgraded."
 Many people who had been skeptical of chef, the "you're going to automate me out of a job" crowd, now wanted to learn about chef. When an opportunity presents itself, be ready to show the power of your initiative. Be "Poised for Success".
  10. Share the Knowledge (Creatively)

    Writing blog posts is good, as long as people read them. Doing demos is good, as long as people attend them. You need to share knowledge somehow. You cannot rely on water cooler conversations. You need to get creative. I work in a remote office, and every time I'm at HQ I will arrange to "run into" someone in the hall (usually near the team I want to drum up interest in) and talk about the new thing I'm working on so others can eavesdrop and become interested. I also like to leave diagrams or bullet points on whiteboards in conference rooms before big meetings. Maybe people will read them while waiting for the meeting to start and it will spark some interest.
  11. The Simplicity Principle

    A former VP of mine, Ralph Hedberg, bestowed what he referred to as "the simplicity principle" upon me early in my career and it has resonated with me ever since. When designing anything, it should be easy to do the right thing and difficult to do the wrong thing. Don't take away people's ability to follow suboptimal or bad practices, just don't make it easy for them. Make it so easy for them to do the right thing that they have no reason to do anything else.
  12. No one cares about your work as much as you

    You'll always encounter developers or engineers that think infrastructure work is beneath them. I get it: application teams are judged by the features they deliver to their application, not by migrating to, say, a new CI system. It may be easier for you to maintain, but as long as it doesn't impact them, it doesn't really matter to them. A devious person might slow performance on the old system, so the speed-up you get from the new system is the motivating factor for them. Unless there is a benefit to them, they will not care.
  13. How can I make your life suck less?

    I am very pragmatic. I know I can't make everything better. Some people lead with "How can I help?" I like "How can I make your life suck less?" This usually gets a chuckle and helps break the ice. Sometimes just listening is enough; people just need to vent. Or saying, "Yeah, we've encountered that as well." Knowing others are in the same situation helps. Maybe they felt isolated or singled out. You may have just made another ally in the collective fight.
  14. Lead by Example

    Publicly ask questions. No one knows everything. Admit mistakes. Take ownership of them. Don't blame people. Most people are not malicious; they weren't trying to break something. People, teams, and organizations learn from mistakes. You can only learn when all the facts are present. If people feel like they will get blamed, they tend to downplay or withhold information. Instead of "Why did you make that decision?", I like "What pieces of information were you missing to make a different decision?" Tooling and documentation can only help the decision making process. People don't want to look stupid. Make sure you never make them feel that way.
  15. Code Less / Less Code

    Not everything has to be solved with code. Documentation in the form of screenshots is perfectly acceptable. Don't over-engineer code. It can be daunting for others coming into the code base. Say there is a 1% bug rate based on lines of code. 1000 lines of code means 10 bugs, but 100 lines means only 1 bug. Less code, less potential for bugs.
 There is elegance and genius in simple code. The goal is for someone else, including your future self, to glance at the code and get the gist of what is going on. You shouldn't need 3 monitors and 8 code windows to comprehend the code. If you do, it will be too difficult to make a change. That "simple" change is no longer "simple".
  16. Make smaller commits... more often

    Bugs happen. Oftentimes you need to go back to figure out exactly when the bug was introduced. "Well, it worked in 1.5 and broke in 2.0." You may need to check out 1.6 to test and then further subdivide from there. If the commit you land on changes 1000 lines, it will be more difficult to determine where the bug is than if the commit only changed 10 lines. Smaller commits can make the debugging process easier.
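
    The "check out a version in the middle and subdivide" search is a binary search over the commit history, which is what `git bisect` automates. The following is a minimal sketch of that idea in plain Ruby, not the tooling used in the talk. The `first_bad_commit` helper and the `commit-N` names are hypothetical, and the sketch assumes the newest commit is known to be bad and that every commit before the culprit is good.

```ruby
# Find the first commit where a check starts failing, assuming every commit
# before it passes and every commit from it onward fails.
# The block must return true when the given commit is good.
def first_bad_commit(commits)
  low = 0                    # lowest index that could still be the culprit
  high = commits.length - 1  # index of a commit known to be bad (the newest)
  while low < high
    mid = (low + high) / 2
    if yield(commits[mid])
      low = mid + 1          # culprit is after mid
    else
      high = mid             # culprit is mid or earlier
    end
  end
  commits[low]
end

# Hypothetical history of 16 commits where commit-11 introduced the bug.
commits = (1..16).map { |n| "commit-#{n}" }
checks = 0
culprit = first_bad_commit(commits) do |commit|
  checks += 1
  commit.delete_prefix('commit-').to_i < 11
end
puts "#{culprit} found after #{checks} checks"
```

    The search only narrows things down to a single commit. How long it takes to find the bug inside that commit still depends on how many lines it changed, which is the case for keeping commits small.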
  17. Simpler code allows others to fix it

    If code is very complicated, people may say, "I don't understand the code, we'll need Joe to fix it." Well, if I'm on vacation, that means it is broken until I return or management is interrupting my time off. This is often referred to as the bus factor: if Joe gets hit by a bus, someone else needs to become the subject matter expert. There is an onboarding cost associated and this often affects velocity. I like the vacation analogy better. If I get hit by a bus, I'm dead. The company has no other choice but to assign the work to someone else. The vacation analogy is more self-serving. As a developer you are more motivated because it directly affects you, either by your vacation plans being interrupted or by returning from vacation to a hot mess. You work so hard to get caught back up that you don't feel like you even had a vacation. Or if it is a regular occurrence, you may begin to dread taking time off.
  18. Build Often Deploy Often

    Most code bases have dependencies. Run your build often (even if nothing has changed) and run your tests. You'll find out that something downstream broke you closer to the time of the breaking change. Don't wait until you have to deploy a change to find out things are broken. If there was only 1 downstream change, it is pretty easy to figure out the culprit. If there have been 100 over the last 6 months, it is much harder to pinpoint. This is especially true for deploys. "Has anyone made a change that could have affected this in the last six months?" I can barely remember last sprint's work. Building often also means whoever made the change still has context. Otherwise they may need to relearn the code change from 6 months ago. Or worse yet, the person who made the change is no longer with the company.
  19. Put the Engineering back in Software Engineering. (Data Wins Arguments)

    Let's get back to the scientific method. Make a hypothesis and then test that hypothesis. Before coding up a remediation for the latest CVE, we leverage InSpec to determine what servers in our fleet are affected. If a CVE only affects machines in the hardware eval lab, remediation is lower priority than if it is more widespread. I'm not saying it shouldn't be remedied, just that you can make an INFORMED decision based on data instead of speculation.
  20. Don't say no

    By saying no, you can seem like an obstructionist. Even if the ask is far-fetched, I don't say no. For example, I might say, "Well, we can do that but it will take 2 years." And then offer alternatives, like "We can do X, which has some aspects of Z, and we can do that today. Will that help with the immediate problem?" A lot of times people ask for what they perceive as the solution instead of leading with the problem. There may be other solutions to the problem that they have not thought of.
  21. Don't codify/test everything

    There is only so much time in a day (or a schedule). Don't try to codify everything or every possible future use case. Focus on the biggest pain points, whether that be things that touch every server or things that would be the most disastrous if you lost them. In my 2021 ChefConf talk I talked about testing your code, not the underlying functionality. For example, if you install zsh, your InSpec test should just be "Is package zsh installed", not "is /bin/zsh mode 755". Trust that the underlying layers work.
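
    The zsh example can be sketched as an InSpec control. This is a minimal sketch rather than the test from the talk: the control name `zsh-installed` is hypothetical, and it assumes the cookbook under test does nothing more than install the package. It uses InSpec's `package` resource, and the commented-out `file` check shows the kind of assertion the advice says to leave out.

```ruby
# Hypothetical control name; run with `inspec exec` against a target node.
control 'zsh-installed' do
  title 'zsh package is installed'

  # Tests what the cookbook is responsible for: the package is present.
  describe package('zsh') do
    it { should be_installed }
  end

  # Deliberately omitted: file modes are the package's job, not the cookbook's.
  # describe file('/bin/zsh') do
  #   its('mode') { should cmp '0755' }
  # end
end
```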
  22. Maintain golden examples

    Programmers plagiarize. In publishing it is a crime. In software (as long as it is from your own code base) it is encouraged; there's even a term for it, "code reuse". Why re-invent the flat tire when there is something working that can be copied? In my experience, if there are multiple examples, people will inevitably pick the worst possible one. And the tech debt grows... When people ask how to do something, point them at the golden example if it exists. Or work with them on creating a new golden example.
  23. Code only what you need

    When people search for examples, they often choose the most complicated example. "They already figured the complicated stuff out, so I won't have to. I can just copy what they did." This is often wrong and leads to code bloat. If there was a bug in the original code, now it needs to be fixed in multiple places.
  24. Code only what you need TODAY

    Say what you will about "agile" development. The biggest benefit, in my opinion, is the change in mindset. 10 years ago we as an industry were still operating under the mindset of "since it is a year/18 months/2 years between releases, this release needs to hold us over until the next release." Predicting the future is error prone. The further in the future, the less certain the prediction will be. Code for what you need in the short term. Use ongoing evidence to refine the plan. Some new disruptive technology may drastically change your 2 year roadmap and you need to be able to pivot.
  25. Stop. Listen. Collaborate. –Vanilla Yoda Ice

    Simple math: you have two ears and one mouth. You should listen twice as much as you speak. If I leave you with only one takeaway, it would be this. Reordered as Yoda would: you should stop and listen, then collaborate. Thank you and have a great ChefConf.