The Two Types of Being "At Scale"

When you hear someone say "we're at scale," they could mean one of two things: the team, development process, and codebase are at scale, or the number of active users is at scale.

As a mobile team grows in headcount, it must tackle new types of challenges, including scaling development. I will be sharing the problems companies like Google and Uber encounter and how they solve them, from modularization to frameworks to different build tooling. I will also discuss why many of these companies are often slow to adopt new tooling such as Jetpack Compose, Coroutines, and Kotlin.

An app can also be at scale based purely on the number of daily active users. This requires an entirely different approach: making sure monitoring is in place for performance and reliability, and dealing with the long tail of OEM and regional differences.

Kurt Nelson

June 03, 2022

Transcript

  1. Active Users
    When and where are your users? Timezone spread. Installations != users. Installs driven by the public store or by business development.
  2. More than just physical form factor
    OEM oddities. Battery optimizers. Language (RTL, screen readers). Operating system version. Old binaries. Network connectivity.
  3. Next Billion Users
    Devices that are impossible for you to get. Carriers no one has heard of. Android Go and other low-end devices. Language can make bug reproduction difficult. Payment frameworks. Usable with no connectivity. https://nextbillionusers.google/
  4. Feature Flags
    These allow toggling features in your code on and off easily, with no Play Store push required. Developers have to put them in place in the code. Extremely helpful when merging incomplete code. Enables controlled early access for developers, internal users, or power users.
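A minimal sketch of the idea, assuming a hypothetical `Flag` enum with compiled-in defaults and an override map that a real app would fill from its own flag service or remote config. Incomplete code can ship "dark" behind a default-off flag, and internal users get early access through an override.

```kotlin
// Hypothetical flags; each carries the value used when no override is present.
enum class Flag(val default: Boolean) {
    NEW_CHECKOUT(default = false),    // incomplete feature merged behind the flag
    COMPOSE_SETTINGS(default = false)
}

// Resolves a flag: an override (remote or local) wins, otherwise the compiled-in default applies.
class FeatureFlags(private val overrides: Map<Flag, Boolean> = emptyMap()) {
    fun isEnabled(flag: Flag): Boolean = overrides[flag] ?: flag.default
}

// Call sites branch on the flag instead of on a build variant, so no store push is needed to flip it.
fun checkoutScreenName(flags: FeatureFlags): String =
    if (flags.isEnabled(Flag.NEW_CHECKOUT)) "NewCheckout" else "LegacyCheckout"
```

A public build would construct `FeatureFlags()` and see `LegacyCheckout`, while a dogfood build with `mapOf(Flag.NEW_CHECKOUT to true)` would see `NewCheckout`.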
  5. Feature Flags
    Allow targeted kill switching. Slow rollout of a new code path. Device-specific flags. When you have many engineers, the quickest mitigation is often flipping someone else’s flag while root-causing happens. Leads to happier on-calls: flip a flag, go back to sleep.
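Kill switching, slow rollout, and device-specific flags can all live in one check. The sketch below assumes a hypothetical `RolloutConfig` delivered by remote config; it hashes the flag name plus user ID with CRC32 so each user lands in a stable bucket from 0 to 99, meaning users who got the new path at 5% keep it when the rollout grows to 20%.

```kotlin
import java.util.zip.CRC32

// Hypothetical remote config for one flag.
data class RolloutConfig(
    val killed: Boolean = false,                 // kill switch for on-call
    val percent: Int = 0,                        // 0..100 of users on the new path
    val blockedModels: Set<String> = emptySet(), // device models known to misbehave
)

// Stable bucket in 0..99 for a (flag, user) pair.
fun bucketOf(flagName: String, userId: String): Int {
    val crc = CRC32()
    crc.update("$flagName:$userId".toByteArray(Charsets.UTF_8))
    return (crc.value % 100).toInt()
}

fun isNewPathEnabled(flagName: String, config: RolloutConfig, userId: String, deviceModel: String): Boolean =
    when {
        config.killed -> false                               // flip a flag, go back to sleep
        deviceModel in config.blockedModels -> false         // device-specific off switch
        else -> bucketOf(flagName, userId) < config.percent  // slow rollout by bucket
    }
```

Salting the hash with the flag name keeps different flags from always enabling the same slice of users.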
  6. Experimentation
    Often fully automated. Coupled to analytics. Business driven. Often based on a flag. Isolated from other experiments. Might involve a backend code path change. How is it different from a flag?
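One way to see the difference from a plain flag: an experiment assigns users to arms and records that assignment in analytics. This is a sketch under assumptions; `Experiment` and `ExposureLogger` are hypothetical, arms are split evenly, and SHA-256 over the experiment name plus user ID is just one reasonable way to get assignments that are independent across experiments.

```kotlin
import java.security.MessageDigest

// Hypothetical experiment definition; `enabled` plays the role of the underlying flag.
data class Experiment(val name: String, val arms: List<String>, val enabled: Boolean = true)

// Hypothetical analytics hook: each assignment emits an exposure event.
fun interface ExposureLogger {
    fun logExposure(experiment: String, arm: String, userId: String)
}

// Hashing "<experiment>:<user>" isolates experiments from each other: a user's arm in
// one experiment carries no information about their arm in another.
fun assignArm(experiment: Experiment, userId: String, logger: ExposureLogger): String? {
    if (!experiment.enabled) return null  // flag off: nobody is enrolled
    val digest = MessageDigest.getInstance("SHA-256")
        .digest("${experiment.name}:$userId".toByteArray(Charsets.UTF_8))
    // First four bytes as a non-negative Int (top bit masked off).
    val value = ((digest[0].toInt() and 0x7f) shl 24) or
        ((digest[1].toInt() and 0xff) shl 16) or
        ((digest[2].toInt() and 0xff) shl 8) or
        (digest[3].toInt() and 0xff)
    val arm = experiment.arms[value % experiment.arms.size]
    logger.logExposure(experiment.name, arm, userId)
    return arm
}
```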
  7. How many is many?
    Do you have more than one app shipping? Is there a monorepo? Is upgrading a library like playing crash roulette?
  8. A Healthy Engineering Culture
    Communication. Collaborative bug backlog. Code review SLA. Up-to-date tickets. Internal open source. Healthy on-call. Post-mortems.
  9. Firebase
    It can get quite clunky with a bunch of developers in one console! Every developer or program manager should have access to Crashlytics. App Distribution can really help you get production-equivalent binaries out to external QA teams.
  10. Firebase
    It’s always good to glance at these graphs after a new binary goes out: an easy way to make sure you didn’t break the world. Give your product and program managers access so they don’t have to ask you!
  11. Big teams mean big repos
    Slow to take up the new hotness. Productivity must be a goal. Automate away bikeshedding. Migrations will be a thing (I hope?).
  12. Invest in Productivity
    Ensure that developer productivity is measured. If it starts tanking, ask leadership to allocate dedicated headcount for developer productivity. Build time, CI time, and time-to-land matter. Gradle Enterprise is helpful if your build goes sideways.
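Measuring time-to-land can start small. This sketch assumes a hypothetical `Change` record (opened for review, landed on main) exported from your code review tool, and reports the median and 90th percentile using the nearest-rank method, computed with integer math to avoid floating-point rounding at the rank boundary.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical record of one change.
data class Change(val opened: Instant, val landed: Instant) {
    val timeToLand: Duration get() = Duration.between(opened, landed)
}

// Nearest-rank percentile: rank = ceil(p / 100 * n), computed in integers.
fun percentile(sorted: List<Duration>, p: Int): Duration {
    require(sorted.isNotEmpty() && p in 1..100)
    val rank = (p * sorted.size + 99) / 100
    return sorted[rank - 1]
}

// Returns (p50, p90) of time-to-land across the given changes.
fun timeToLandReport(changes: List<Change>): Pair<Duration, Duration> {
    val sorted = changes.map { it.timeToLand }.sorted()
    return percentile(sorted, 50) to percentile(sorted, 90)
}
```

Tracking p90 alongside the median matters because slow CI or review queues tend to show up in the tail first.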
  13. Automatic the Bike Shed Door
    Set lint to break the build. Use tools like Detekt and KTLint. Spotless can automatically format everything according to a config file. This minimizes noise in pull requests. Consistent imports make wide renames less painful. Share IDE configuration.
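A sketch of what this can look like in a module's `build.gradle.kts`, assuming the Android, Spotless, and Detekt Gradle plugins are already on the build classpath with versions managed in settings or a version catalog. The detekt config path is hypothetical; point it at wherever your team keeps its overrides.

```kotlin
// app/build.gradle.kts (sketch)
plugins {
    id("com.android.application")
    id("org.jetbrains.kotlin.android")
    id("com.diffplug.spotless")
    id("io.gitlab.arturbosch.detekt")
}

android {
    // namespace, compileSdk, defaultConfig, etc. omitted from this sketch
    lint {
        abortOnError = true      // lint errors fail the build
        warningsAsErrors = true  // stricter: warnings fail it too
    }
}

spotless {
    kotlin {
        target("src/**/*.kt")
        ktlint()                 // format Kotlin sources with ktlint rules
    }
    kotlinGradle {
        target("*.gradle.kts")
        ktlint()
    }
}

detekt {
    buildUponDefaultConfig = true                               // start from detekt defaults
    config.setFrom(files("$rootDir/config/detekt/detekt.yml"))  // hypothetical override file
}
```

With this in place, `./gradlew spotlessApply` rewrites formatting locally, while CI runs `spotlessCheck`, `detekt`, and `lint` so style debates end in a failed check rather than a review thread.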
  14. Conformance Testing
    Excellent for users of internal libraries. Ban known-bad APIs and classes. Block bad patterns at compile time. Write an IDE plugin to complement these checks and catch issues even earlier. Enforce style or conventions via a test.
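The simplest form of "enforcing a convention via a test" is a scan over source text that fails when it finds a banned API. This is a deliberately naive sketch (plain substring matching, hypothetical banned list); in practice the same rule is usually expressed as a custom Android Lint check or a detekt rule such as `ForbiddenImport`, which understand code rather than text.

```kotlin
// Hypothetical banned APIs, each with the message shown to whoever trips the check.
val bannedApis = mapOf(
    "android.util.Log" to "Use the shared logger so messages also reach Crashlytics.",
    "GlobalScope" to "Launch coroutines in a lifecycle-aware scope instead.",
)

data class Violation(val file: String, val line: Int, val api: String, val reason: String)

// Scans each file's text line by line; a unit test can assert the result is empty.
fun findViolations(sources: Map<String, String>): List<Violation> =
    sources.flatMap { (file, text) ->
        text.lines().withIndex().flatMap { (index, line) ->
            bannedApis.filterKeys { it in line }
                .map { (api, reason) -> Violation(file, index + 1, api, reason) }
        }
    }
```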
  15. Catch issues with automation
    Human processes will miss issues. With CI, you can be confident that you will not break the build for other engineers; even with zero tests, CI is beneficial. With CD, you can push your latest code to your own engineers.
  16. Local Build: Everyone likely already does this: building the APK on your machine and pushing it to a test emulator or device.
    Feature Branch: Also known as a review build. If you have designers or product managers involved in the feature, automating this build allows them to play with it as part of review. Generally points at some sort of staging environment.
    Main Branch: The latest “done” version of the app. This build is useful for spot debugging of issues that are not in production but have been caught by an internal resource. Often signed and able to be used against production.
    Nightly: Ships to QA and hopefully all engineers. This build can be pushed to the internal or alpha channel of the Play Store, and is a candidate for promotion to the public.
  17. Benefits of an Automated Pipeline
    Debugging production-only issues does not require building an APK for every commit SHA you need to test. You can write a shell script to git bisect using pre-built APKs! Any engineer could theoretically release an APK, which is especially important for on-calls. Non-engineer stakeholders can easily test new features or flags with minimal SWE assistance. All employees can automatically get nightlies or weeklies after they pass QA. Eliminates thrashiness from tooling changes that are only noticed upon a release build.
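The bisect trick works because CI already produced an APK per commit, so each step is "install and check" rather than "build, install, and check." With `git bisect run`, the script you pass is run at each step and its exit code decides the outcome (0 is good, 125 skips, other codes from 1 to 127 mark the commit bad). The Kotlin sketch below shows the same binary search over an ordered list of SHAs; `isBad` is a hypothetical callback that would fetch and install the prebuilt APK for a SHA and try to reproduce the issue.

```kotlin
// Binary search for the first bad commit, the same idea `git bisect` uses.
// `commits` is ordered oldest to newest; the first is assumed good and the last bad.
fun firstBadCommit(commits: List<String>, isBad: (String) -> Boolean): String {
    require(commits.size >= 2) { "need at least a known-good and a known-bad commit" }
    var good = 0                 // index known to be good
    var bad = commits.lastIndex  // index known to be bad
    while (bad - good > 1) {
        val mid = good + (bad - good) / 2
        if (isBad(commits[mid])) bad = mid else good = mid
    }
    return commits[bad]
}
```

Each probe halves the range, so a week of nightly SHAs needs only a handful of installs.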
  18. What is the long tail?
    Events that occur for only a small percentage of users but are critical: severe device- or OS-specific crashes, obscure screens in your app that few people use, major accessibility issues, and hard-to-repro performance issues.
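Long-tail problems hide in aggregate numbers: a device model with a tiny share of sessions can crash constantly without moving the overall crash-free rate. This sketch assumes hypothetical per-model counts exported from your crash and analytics tools, and flags models that are both rare (at most `maxShare` of sessions) and severe (at least `minCrashRate`).

```kotlin
// Hypothetical aggregated counts for one device model.
data class DeviceStats(val model: String, val sessions: Long, val crashedSessions: Long) {
    val crashRate: Double get() = if (sessions == 0L) 0.0 else crashedSessions.toDouble() / sessions
}

// Rare-but-severe models, worst first.
fun longTailOffenders(stats: List<DeviceStats>, maxShare: Double, minCrashRate: Double): List<DeviceStats> {
    val total = stats.sumOf { it.sessions }.toDouble()
    return stats
        .filter { it.sessions / total <= maxShare && it.crashRate >= minCrashRate }
        .sortedByDescending { it.crashRate }
}
```

For example, a model with 0.2% of sessions and a 20% crash rate would be flagged here even if the app-wide crash rate looks healthy.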
  19. Logging & Instrumentation
    Logs are pretty much free. Instrument each core user journey via eventing. Set up alerts on events and logged errors. Even if you can recover in a try/catch, consider logging the exception.
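"Recover, but still log" can be wrapped in a small helper so call sites don't have to remember it. In this sketch `NonFatalReporter` is a hypothetical interface; on Android its implementation might forward to `FirebaseCrashlytics.getInstance().recordException(e)`. `CancellationException` is rethrown so coroutine cancellation is not swallowed and reported as an error.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical reporting hook for exceptions the app recovered from.
fun interface NonFatalReporter {
    fun report(throwable: Throwable, context: String)
}

// Runs [block]; on failure, reports the exception as a non-fatal and returns [fallback].
inline fun <T> recoverAndReport(
    context: String,
    reporter: NonFatalReporter,
    fallback: T,
    block: () -> T,
): T = try {
    block()
} catch (e: CancellationException) {
    throw e  // cancellation is control flow, not an error
} catch (e: Exception) {
    reporter.report(e, context)
    fallback
}
```

The user still sees the fallback (say, an empty list), but the failure shows up in your dashboards and can be alerted on.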
  20. Firebase Basics
    The free stuff! Use a logger like Lumber to send logs to both logcat and Firebase. Crashlytics! If you use third-party analytics, consider plumbing some events through to Firebase. The Performance SDK can be a pain to set up, but it is worth playing with.
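The "log once, send everywhere" pattern is a fan-out over sinks. This sketch does not reproduce Lumber's API; `LogSink`, `Priority`, and `FanOutLogger` are hypothetical. On Android, one sink might wrap `android.util.Log` for logcat and another `FirebaseCrashlytics.getInstance().log(...)`, which attaches breadcrumbs to crash reports; giving each sink a minimum priority keeps noisy debug logs out of Crashlytics.

```kotlin
enum class Priority { DEBUG, INFO, WARN, ERROR }

// Hypothetical destination for log lines (logcat, Crashlytics breadcrumbs, a file, ...).
interface LogSink {
    fun log(priority: Priority, tag: String, message: String)
}

// Sends each message to every sink whose minimum priority it meets.
class FanOutLogger(private val sinks: List<Pair<LogSink, Priority>>) {
    fun log(priority: Priority, tag: String, message: String) {
        for ((sink, minPriority) in sinks) {
            if (priority >= minPriority) sink.log(priority, tag, message)  // enums compare by declaration order
        }
    }

    fun d(tag: String, message: String) = log(Priority.DEBUG, tag, message)
    fun w(tag: String, message: String) = log(Priority.WARN, tag, message)
}
```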
  21. When it gets weird
    You will absolutely encounter extremely mysterious crashes that occur at a very low rate and that you have no idea how to reproduce. (I’m looking at you, NDK.) Have a plan if you don’t have that drawer of old phones. Communicate to business stakeholders what the policy for dropping support will be.
  22. Cheat Codes
    Forcing an APK upgrade, either via the official library or an internal soft nag. Disabling distribution to known-broken devices. Developing a whole side-channel system for recovering from bad state. Falling back to a WebView and your mobile site. Feature-flag-powered walled gardens. Buying users new devices.
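The decision behind a forced upgrade versus a soft nag can be driven by two version thresholds from remote config. This is a sketch with a hypothetical `UpgradeConfig`; how each action is carried out is up to the app. With Google Play's in-app updates library, for instance, a forced upgrade maps naturally to `AppUpdateType.IMMEDIATE` and a soft nag to `AppUpdateType.FLEXIBLE` or an internal dialog.

```kotlin
// Hypothetical thresholds fetched from remote config at startup.
data class UpgradeConfig(
    val minSupportedVersionCode: Long,  // below this, the app refuses to continue
    val recommendedVersionCode: Long,   // below this, show a dismissible nag
)

enum class UpgradeAction { NONE, SOFT_NAG, FORCE_UPGRADE }

fun upgradeAction(config: UpgradeConfig, installedVersionCode: Long): UpgradeAction =
    when {
        installedVersionCode < config.minSupportedVersionCode -> UpgradeAction.FORCE_UPGRADE
        installedVersionCode < config.recommendedVersionCode -> UpgradeAction.SOFT_NAG
        else -> UpgradeAction.NONE
    }
```

Keeping the thresholds server-side means an old binary with a known-bad bug can be pushed off without shipping anything new.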
  23. In Conclusion
    These ideas can be useful in smaller teams or apps too! Ensure you aren’t putting roadblocks in place. Pick your biggest pain points first. Get monitoring in place early.