The Two Types of Being "At Scale"

Kurt Nelson @kurtisnelson
For-hire Software Engineer

Who considers their mobile work to be at scale?

QPS
Backend and frontend engineers generally use queries per second (QPS) when discussing scale.

Mobile apps can be at scale in very different ways.

(M|W|D)AUs
Monthly, weekly, or daily active users.

Device Diversity
Mobile software runs differently on different devices.

Developers
Mobile apps can have hundreds of engineers working on one monolithic app.

Scale is a vague term on mobile.

Active Users

Active Users
- When and where are your users?
- Timezone spread
- Installations != users
- Public store or business-development-driven installs

Device Diversity

More than just physical form factor.
- OEM oddities
- Battery optimizers
- Language (RTL, screen readers)
- Operating system version
- Old binaries
- Network connectivity

Next Billion Users
- Devices that are impossible for you to get.
- Carriers no one has heard of.
- Android Go, low-end devices.
- Language can make bug reproduction difficult.
- Payment frameworks.
- Usable with no connectivity.
- https://nextbillionusers.google/

Device diversity correlates with a DAU increase.

The Two Scales (simplified)
- Device count
- Developer count

Common practices at scale

Feature Flags
- These allow toggling features in your code on and off easily.
- No Play Store push is required.
- Requires developers to put them in place.
- Extremely helpful when merging incomplete code.
- Enables controlled early access for developers, internal users, or power users.

Feature Flags
- Allow targeted kill switching
- Slow rollout of a new code path
- Device-specific flags
- When you have many engineers, the quickest mitigation is often flipping someone else's flag while root-cause analysis happens
- Leads to happier on-calls: flip a flag, go back to sleep.
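
The three uses above (kill switch, slow rollout, device-specific flags) can be sketched as one evaluation function. This is a minimal illustration in Python rather than app code, and not any particular flag service's API: the `Flag` fields, the `bucket` helper, and the flag and device names are all hypothetical. It assumes a remote config delivers the flag definition and that each user has a stable ID.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical flag definition, as a remote config service might deliver it."""
    name: str
    enabled: bool = True        # kill switch: False turns the feature off everywhere
    rollout_percent: int = 100  # slow rollout: share of users who get the new path
    blocked_devices: set = field(default_factory=set)  # device-specific opt-outs


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: Flag, user_id: str, device_model: str) -> bool:
    """Evaluate the kill switch first, then the device list, then the rollout."""
    if not flag.enabled:
        return False
    if device_model in flag.blocked_devices:
        return False
    return bucket(flag.name, user_id) < flag.rollout_percent
```

Because the bucket is derived from a hash instead of a random draw, a user stays in the same bucket between sessions, so raising `rollout_percent` only adds users and never flips existing ones back.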

Experimentation
- Often fully automated
- Coupled to analytics
- Business driven
- Often based on a flag
- Isolates from other experiments
- Might involve a backend code path change
- How is it different from a flag?
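
One common way to get the isolation and analytics coupling listed above is to salt the assignment hash with the experiment name and record an exposure event. The sketch below is an assumption about how such a system could work, not a description of a specific experimentation platform; `assign_variant`, `exposure_event`, and the event fields are hypothetical names.

```python
import hashlib


def assign_variant(experiment: str, user_id: str, variants: list[str]) -> str:
    """Deterministically pick a variant; the experiment name acts as a salt."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


def exposure_event(experiment: str, user_id: str, variant: str) -> dict:
    """Shape of a hypothetical analytics event recording who saw what."""
    return {
        "event": "experiment_exposure",
        "experiment": experiment,
        "user_id": user_id,
        "variant": variant,
    }
```

Salting by experiment name means a user's bucket in one experiment says nothing about their bucket in another, which is what keeps two concurrent experiments from contaminating each other's results.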

Experimentation is scalable user feedback.

We’ve got a ton of developers, send help!

How many is many?
- Do you have more than one app shipping?
- Is there a monorepo?
- Is upgrading a library like playing crash roulette?

A Healthy Engineering Culture
- Communication
- Collaborative bug backlog
- Code review SLA
- Up-to-date tickets
- Internal open source
- Healthy on-call
- Post-mortems

Firebase is your (free) valuable but annoying tool.

Firebase
- It can get quite clunky with a bunch of developers in one console!
- Every developer or program manager should have access to Crashlytics.
- App Distribution can really help you get production-equivalent binaries out to external QA teams.

Firebase
- Always good to glance at these graphs after a new binary goes out.
- Easy way to make sure you didn't break the world.
- Give your product and program managers access so they don't have to ask you!

Big teams mean big repos
- Slow to adopt the new hotness
- Productivity must be a goal
- Automate away bikeshedding
- Migrations will be a thing (I hope?)

Update Uptake
- Kotlin
- Compose
- Jetpack
- And of course, ye olde SDK bumps

Invest in Productivity
- Ensure that developer productivity is measured
- Ask leadership to start allocating dedicated headcount for developer productivity if it starts tanking
- Build time, CI time, and time-to-land matter
- Gradle Enterprise is helpful if your build goes sideways

Automate the Bike Shed Door
- Set lint to break the build
- Use tools like Detekt and ktlint
- Spotless can automatically format everything according to a config file
- Minimizes noise in pull requests
- Consistent imports make wide renames less painful
- Share IDE configuration

Conformance Testing
- Excellent for users of internal libraries
- Ban known-bad APIs and classes
- Block bad patterns at compile time
- Write an IDE plugin to complement this and catch issues even earlier
- Enforce style or convention via a test
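
The last bullet, enforcing a convention via a test, can be approximated with a coarse text scan run in CI. The sketch below is that weaker substitute, not the compile-time or lint-based check the slide mentions: it is a Python script, the ban list and its messages are made-up examples, and a plain regex will also match banned names inside comments or strings.

```python
import re
from pathlib import Path

# Hypothetical ban list: regex pattern -> message shown to the developer.
BANNED_PATTERNS = {
    r"\bThread\.sleep\(": "Thread.sleep is banned; use a coroutine delay instead.",
    r"\bSystem\.out\.println\(": "Use the shared logger instead of println.",
}


def find_violations(source: str, filename: str = "<memory>") -> list[str]:
    """Return one message per banned call found in the given source text."""
    violations = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, message in BANNED_PATTERNS.items():
            if re.search(pattern, line):
                violations.append(f"{filename}:{line_number}: {message}")
    return violations


def check_tree(root: Path) -> list[str]:
    """Scan every Kotlin and Java file under root."""
    violations = []
    for path in sorted(root.rglob("*")):
        if path.suffix in {".kt", ".java"} and path.is_file():
            text = path.read_text(encoding="utf-8")
            violations.extend(find_violations(text, str(path)))
    return violations
```

A CI step could fail the build whenever `check_tree` returns a non-empty list, printing the messages so the author knows what to use instead.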

Continuous delivery and integration are mandatory

Catch issues with automation
- Human processes will miss issues
- With CI, you can be confident that you will not break the build for other engineers.
- Even with zero tests, CI is beneficial.
- With CD, you can push your latest code to your own engineers.

Local Build
Everyone likely already does this: building the APK on your machine and pushing it to a test emulator or device.

Feature Branch
Also known as a review build. If you have designers or product managers involved in the feature, automating this allows them to play with it as part of review. Generally points at some sort of staging environment.

Main Branch
The latest "done" version of the app. This build is useful for spot debugging of issues that are not in production but have been caught by an internal resource. Often signed and able to be used against production.

Nightly
Ships to QA and hopefully all engineers. This build can be pushed to the internal or alpha channel of the Play Store, and is a candidate for promotion to the public.

Benefits of an Automated Pipeline
- Debugging production-only issues does not require building an APK for every commit SHA you need to test. You can write a shell script to git bisect using pre-built APKs!
- Any engineer could theoretically release an APK, which is especially important for on-calls
- Non-engineer stakeholders can easily test new features or flags with minimal SWE assistance
- All employees can automatically receive nightly or weekly builds after they pass QA.
- Eliminates thrash caused by tooling changes that are only noticed upon a release build
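
The bisect idea in the first bullet boils down to a binary search over commits that already have a pre-built APK. The slide suggests a shell script around git bisect; the sketch below shows the same search in Python instead, with the "is this build bad?" check left as a caller-supplied function. `first_bad_commit` and `is_bad` are hypothetical names, and the search assumes the oldest commit is good, the newest is bad, and the bug stays present once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    low, high = 0, len(commits) - 1  # commits[low] is good, commits[high] is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid  # the regression is at mid or earlier
        else:
            low = mid   # the regression is after mid
    return commits[high]
```

In practice `is_bad` would fetch the CI artifact for that SHA, install it on a device, and run the reproduction steps, so each step costs a download and an install rather than a full build.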

We’ve got a ton of users, send help!

How many is many?
This definition is up to you.

The challenge is handling the long tail.

What is the long tail?
Events that affect only a small percentage of users but are critical.
- Severe device- or OS-specific crashes
- Obscure screens in your app that few people use
- Major accessibility issues
- Hard-to-repro performance issues

Log, log, log. Even if “it should never happen”.

Logging & Instrumentation
- Logs are pretty much free
- For a core user journey, instrument it via eventing
- Set up alerts on events and logged errors
- Even if you can recover in a try/catch, consider logging the exception
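
The last bullet is easy to skip when the recovery path works, so here is a minimal sketch of it. It is written in Python with the standard `logging` module purely for illustration; an Android app would do the same with its own logger. The `parse_profile` function and the cached-profile scenario are hypothetical.

```python
import json
import logging

logger = logging.getLogger("app.profile")


def parse_profile(raw: str) -> dict:
    """Parse a cached profile, falling back to an empty one if it is corrupt."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The user never sees this failure, but the log makes it visible to us.
        # logger.exception records the message at ERROR level with the traceback.
        logger.exception("Corrupt cached profile; falling back to defaults")
        return {}
```

With the error logged, an alert on its rate can reveal a bad release or a corrupting code path long before anyone files a bug about missing profile data.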

Firebase is your best friend now.

Firebase Basics
- The free stuff!
- Use a logger like Lumber to send to both logcat and Firebase.
- Crashlytics!
- If you use 3rd-party analytics, consider plumbing some events through to Firebase
- Performance SDK can be a pain to set up, but worth playing with.

Ensure you have visibility before you need it.

When it gets weird
- You will absolutely encounter extremely mysterious crashes that are at a very low rate and you have no idea how to reproduce. (I'm looking at you, NDK)
- Have a plan if you don't have that drawer of old phones.
- Communicate with business stakeholders what the policy will be for dropping support.

Cheat Codes
- Forcing an APK upgrade, either via the official library or an internal soft nag.
- Disable distribution to known-broken devices
- Develop a whole side-channel system for recovering from bad state
- Fall back to a WebView and your mobile site
- Feature-flag-powered walled gardens
- Buying users new devices

In Conclusion
- These ideas can be useful in smaller teams or apps too!
- Ensure you aren't putting roadblocks in place
- Pick your biggest pain points first
- Get monitoring in place early
