[Dmitriy Novakovskiy, Ivan Prisyazhnyy] Making every minute count: How Nimses scales its mobile platform on Google Cloud

Presentation from GDG DevFest Ukraine 2018 - the biggest community-driven Google tech conference in Central and Eastern Europe (CEE).

Learn more at: https://devfest.gdg.org.ua


In this session we will share practical “battle stories” - real ones, and lots of them. This is a retrospective of a journey of more than a year, over the course of which Nimses engineers adopted Google Cloud Platform services such as:

Google Kubernetes Engine
Google Cloud Datastore
Google Cloud Spanner
Google BigQuery
Google Cloud SQL
Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc

Nimses (http://nimses.com) is a location-based social mobile platform that turns every minute of a user’s time into a unit of value. Each minute of a person's life within Nimses generates a single unit of digital currency called the Nim. Nimses users can spend Nims to interact with each other and make purchases from local vendors, based on geolocation. Nimses helps people and businesses connect with each other in a meaningful way, with the value of each action measured in Nims.

As of today over 5 million people are using Nimses across 20 countries.

In July 2017 Nimses decided to migrate all of its services to Google Cloud Platform. The main driver was the rapid growth of the user base, which raised infrastructure scalability and data storage requirements to seemingly unreachable levels.

Together with the Google Cloud team, Nimses engineers were able to migrate the main application services to Google Kubernetes Engine in a matter of weeks. Later on, Nimses gradually adopted other platform services (Datastore, BigQuery, Dataflow), allowing the team to analyze large amounts of data and develop new features faster.

Most recently, Nimses launched the Nimses Blockchain (https://nimses.com/posts/blockchain-beta/) payment platform, with a ledger of transactions powered by Google Cloud Spanner.


Google Developers Group Lviv

October 12, 2018

Transcript

  1. Making every minute count: How Nimses scales its mobile platform on Google Cloud
  2. (image-only slide)
  3. Who are we? Dmitriy Novakovskiy, Cloud Sales Engineer @Google, novakovskiy@google.com; Ivan Prisyazhnyy, Software Engineer @Nimses Core, ivan@nimses.com
  4. So, what is Nimses? Let’s take a look

  5. What is Nimses?
     • Mobile app that turns every minute of life into a unit of value
     • Geolocation-based
     • >6 million users and growing globally
     • >100 microservices
     • Private blockchain, NIM exchange
     • Ads, Marketplace, ML
  6. Original architecture
     • A handful of microservices
     • Written in Go, gRPC, go-kit
     • EC2 w/ Docker Compose
     • RDS w/ PostgreSQL
     • Memcached, ELK
     • Amazon CDN (CloudFront)
  7. Scaling up while staying small
     • Managed infrastructure: CI/CD & microservices, autoscaling, Ops automation
     • Distributed transactions: blockchain, global scale
     • Horizontally scalable NoSQL: store app data, handle >100M users, no data management toil
  8. Step 1: Managed Kubernetes. “We lost a few nodes in one Zone last night. No human impact, all things recovered on their own.”
  9. (image-only slide)
  10. Kubernetes gives us:
      • Container scheduling
      • Lifecycle and health
      • Autoscaling (vertical and horizontal)
      • Service discovery
      • Load balancing
      • Logging and monitoring
      • Storage management
      • Network connectivity
  11. But running Kubernetes “your way” is not easy:
      • Provisioning Infrastructure: create VMs and bootstrap core components for the Kubernetes control plane
      • Implementing Networking: IP ranges for Pods and Services, configuring an overlay
      • Bootstrapping Cluster Services: provision kube-dns, logging daemons, monitoring
      • Ensuring Node Resiliency: at master & node level, ensure uptime, configure kernel upgrades and OS updates
  12. Kubernetes

  13. Lessons learned from our GKE migration

  14. This is a deployment.yaml for Kubernetes

  15. You need a package manager - Helm
      • Reusable templates for service deployments (see the template sketch below)
      • Expose only required parameters to developers
      • Smaller manifests -> fewer errors
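
A minimal Helm template in the spirit of this slide might look as follows. This is an illustrative sketch, not Nimses’ actual chart; the values keys (replicas, image, resources) are hypothetical.

```yaml
# templates/deployment.yaml - hypothetical Helm template sketch.
# Ops own this file; developers only ever touch values.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: {{ .Release.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources:
            limits:
              cpu: {{ .Values.resources.cpu }}
              memory: {{ .Values.resources.memory }}
```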
  16. Decouple operations from development
      • Developers don’t like operations :)
      • Let them care about values.yaml only (sketch below)
      • Leave Ingresses, ReplicaSets and other k8s constructs to Ops
      (diagram: values.yaml on the Dev side; Ingresses, ReplicaSets, Services... on the Ops side)
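
The matching values.yaml - the only file a developer would then edit - could be as small as this (same hypothetical keys as the template sketch above):

```yaml
# values.yaml - hypothetical developer-facing parameters, nothing else.
replicas: 3
image:
  repository: gcr.io/my-project/profile
  tag: "1.4.2"
resources:
  cpu: 250m
  memory: 256Mi
```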
  17. Security still matters
      • There is no security “out of the box”
      • Containerized applications may have “defaults” like a root account and no resource limits (see the hardening sketch below)
      • Use audit tools and “best practices”, for example: https://kubesec.io/
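
The kinds of defaults such audits flag can be fixed at the container-spec level. A hedged sketch (not Nimses’ actual configuration) of a non-root container with explicit resource limits:

```yaml
# Hypothetical container-spec hardening: non-root user, no privilege
# escalation, read-only root filesystem, and resource limits - typical
# findings of audit tools like kubesec.
spec:
  containers:
    - name: app
      image: gcr.io/my-project/app:1.0.0
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```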
  18. Be careful with your first Production rollout
      • On Production Day 1 we switched 100% of user traffic to the newly deployed GKE cluster
      • Users started receiving auth errors right away
      • We had to debug and fix the Ingress configuration “on the go”
      • Probably we should’ve done a Canary :)
  19. What’s next? Service mesh with Istio
      • Intelligent routing: dynamic route configuration, A/B tests, canaries (sketch below), gradual version upgrades
      • Resilience: timeouts, retries, health checks, circuit breakers
      • Security & policy: mutual TLS, organizational policy, access policies, rate limiting
      • Telemetry: service dependencies, traffic flow, distributed tracing
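
As one concrete example of the routing column, a weighted Istio VirtualService can implement the canary pattern mentioned on slide 18. The service name and subsets below are hypothetical; the subsets would be defined in a matching DestinationRule.

```yaml
# Hypothetical canary: 90% of traffic to the stable subset, 10% to the canary.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: profile
spec:
  hosts:
    - profile
  http:
    - route:
        - destination:
            host: profile
            subset: stable
          weight: 90
        - destination:
            host: profile
            subset: canary
          weight: 10
```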
  20. Step 2: Cloud Datastore. A database with no DBA required

  21. What Datastore gives us
      • NoSQL document-oriented database: Row -> Entity / Entity group, Table -> Kind, Field -> Property (see the Go sketch below)
      • NoOps DB that actually works
      • Horizontal scale with no DBA effort
      • “Pay-per-use” (storage and operations)
      • No capacity management required
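
In Go - the language the Nimses backend is written in - that mapping translates roughly as below, using the official cloud.google.com/go/datastore client. The Profile kind, its fields and the project ID are hypothetical.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/datastore"
)

// Profile is a hypothetical Kind (Datastore's analogue of a table);
// each struct field becomes a Property (the analogue of a column).
type Profile struct {
	Name    string
	Country string
}

func main() {
	ctx := context.Background()
	client, err := datastore.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A key identifies one Entity (the analogue of a row).
	key := datastore.NameKey("Profile", "user-123", nil)
	if _, err := client.Put(ctx, key, &Profile{Name: "Alice", Country: "UA"}); err != nil {
		log.Fatal(err)
	}

	var p Profile
	if err := client.Get(ctx, key, &p); err != nil {
		log.Fatal(err)
	}
	log.Println(p.Name, p.Country)
}
```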
  22. We had to learn the “Managed NoSQL way”
      • Isolation and consistency depend on operation and index type
      • “Hotspotting” on the storage layer impacts key choices, indexing and ingestion (see the key-design sketch below)
      • Datastore autopartitions key ranges - hotspots can take time to split
      • 1 update per second per Entity Group
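
A common way to dodge key-range hotspots - our own sketch, not code from the talk - is to avoid monotonically increasing key names such as timestamps:

```go
package keys

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// hotKeyName is the anti-pattern: keys sort by creation time, so every new
// entity lands on the same "hot" tail of the key range until it splits.
func hotKeyName(userID string) string {
	return fmt.Sprintf("%d-%s", time.Now().UnixNano(), userID)
}

// distributedKeyName spreads writes evenly across key ranges by using a
// random prefix instead of a monotonically increasing one.
func distributedKeyName(userID string) (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b) + "-" + userID, nil
}
```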
  23. Software bugs can impact your DB costs
      • Datastore charges for bulks of operations (reads, writes, etc.)
      • It’s useful to have a caching layer in front of Datastore for cost savings (sketch below)
      • It’s not always easy to spot a bug in your code that is “wasting” your reads, or a suboptimal “background” script
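
A caching layer of the kind this slide suggests could be as simple as a read-through wrapper. The sketch below is entirely illustrative (a single hypothetical Profile type, no invalidation or TTL): repeat reads are served from memory, so they are never billed as Datastore operations.

```go
package profilecache

import (
	"context"
	"sync"

	"cloud.google.com/go/datastore"
)

// Profile is a hypothetical entity type.
type Profile struct {
	Name    string
	Country string
}

// ProfileCache is a read-through cache in front of Datastore.
type ProfileCache struct {
	DS *datastore.Client

	mu    sync.RWMutex
	items map[string]Profile // keyed by the Datastore key's string form
}

func (c *ProfileCache) Get(ctx context.Context, key *datastore.Key) (Profile, error) {
	c.mu.RLock()
	p, ok := c.items[key.String()]
	c.mu.RUnlock()
	if ok {
		return p, nil // cache hit: no Datastore read is charged
	}
	if err := c.DS.Get(ctx, key, &p); err != nil { // cache miss: one billed read
		return Profile{}, err
	}
	c.mu.Lock()
	if c.items == nil {
		c.items = make(map[string]Profile)
	}
	c.items[key.String()] = p
	c.mu.Unlock()
	return p, nil
}
```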
  24. Tooling around Managed services can be limited
      • We needed visibility into operational stats and index usage
      • UI stats are limited, but Google Support can help with granular reports
      • We needed backups to protect against bugs that can corrupt data
      • Managed Backups are NOT incremental and charge for every read
  25. Take care of your indexes and data model
      • Indexes consume space and performance
      • Reindexing is an involved operation
      • Visibility into index usage can be limited
      • Migrating the data model requires a strategy; remember backward and forward compatibility
      • Distribute load evenly
      • Don’t index monotonically increasing properties (see the sketch below)
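
With the Go client, keeping a monotonically increasing property out of the indexes is a struct tag away. A hypothetical entity:

```go
package model

import "time"

// Event is a hypothetical entity. The noindex tag keeps the monotonically
// increasing CreatedAt out of the built-in indexes (avoiding index hotspots)
// and also excludes the large Payload blob to save index space.
type Event struct {
	UserID    string
	Payload   string    `datastore:",noindex"`
	CreatedAt time.Time `datastore:",noindex"`
}
```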
  26. We still use multiple DB services
      • We moved most of the backend services to Datastore: user profiles, chats, lobby, authentication, feed, events, music
      • Geospatial queries are still on PostgreSQL (Cloud SQL) because of PostGIS
      • Our blockchain runs on Cloud Spanner because of stricter consistency and higher throughput
      • All analytics are in BigQuery
  27. What’s next? Firestore
      • Runs on top of Spanner (replacing Megastore)
      • Removes Datastore restrictions:
        ◦ No more eventual consistency - all queries become strongly consistent
        ◦ Transactions are no longer limited to 25 entity groups
        ◦ Writes to an entity group are no longer limited to 1 per second
      • Has a Datastore API compatibility mode
      • Currently in Beta; automatic upgrade of existing DBs in 2019
  28. Step 3: Google Cloud Spanner. Going global with a relational database

  29. Nimses Blockchain
      • Financial component to support transactions in Nims
      • Trust & Transparency
      • Governance
      • Performance & Scalability
  30. What Cloud Spanner gives us
      • Serializable transactions with external consistency
      • Synchronous data replication across multiple locations
      • 99.999% availability
      • Horizontally scalable Reads and Writes, automated sharding and re-sharding
      • Fully managed, no maintenance downtime
  31. Performance optimization requires iterations
      • Whitepapers and “best practice” docs are useful
      • Understanding of core concepts (Splits, Interleaving) is mandatory (see the schema sketch below)
      • Data layout will evolve along with business requirements
      • Minimize the number of Splits (a.k.a. shards) participating in a single transaction
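
To make interleaving concrete: the sketch below (our guess at a shape such a ledger could take, not Nimses’ actual schema) co-locates child rows with their parent, so a transaction scoped to one account touches a single Split.

```go
package schema

// DDL is a hypothetical interleaved Spanner schema. Transactions rows are
// stored physically under their parent Accounts row, so account-scoped
// reads and writes stay within one Split.
const DDL = `
CREATE TABLE Accounts (
  AccountId STRING(36) NOT NULL,
  Balance   INT64      NOT NULL
) PRIMARY KEY (AccountId);

CREATE TABLE Transactions (
  AccountId STRING(36) NOT NULL,
  TxId      STRING(36) NOT NULL,
  Amount    INT64      NOT NULL
) PRIMARY KEY (AccountId, TxId),
  INTERLEAVE IN PARENT Accounts ON DELETE CASCADE
`
```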
  32. It’s a Managed DB, but with some Ops
      • Use the native Spanner API to gather profiling and stats on transactions
      • Spanner does data maintenance in the background - keep CPU load at 70-75%
      • You can always scale up by adding a node, with no downtime
      • Backups are easy, with Dataflow integrated in the UI
  33. SQL is there, but the native API is better
      • No DML support yet with SQL
      • The query execution plan is visible, but hard to optimize
      • You need to force index usage and specify JOIN types (see the sketch below)
      • The native API is much better for optimizing complex transactions
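
Forcing an index from the Go client looks roughly like this. The table, index and column names are hypothetical, but the @{FORCE_INDEX=...} table hint is standard Spanner SQL.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := spanner.NewClient(ctx, "projects/p/instances/i/databases/d")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Without the hint the planner may choose a base-table scan; the hint
	// pins the (hypothetical) TransactionsByAccount secondary index.
	stmt := spanner.Statement{
		SQL: `SELECT TxId, Amount
		        FROM Transactions@{FORCE_INDEX=TransactionsByAccount}
		       WHERE AccountId = @account`,
		Params: map[string]interface{}{"account": "acc-42"},
	}
	iter := client.Single().Query(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		var txID string
		var amount int64
		if err := row.Columns(&txID, &amount); err != nil {
			log.Fatal(err)
		}
		log.Println(txID, amount)
	}
}
```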
  34. The coolest bug we had - External state
      • A transaction can be aborted or restarted a few times before it is committed
      • Transaction state must be valid in between these retry attempts
      • Try not to use external state in between transaction retries (see the sketch below)
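
In the Go client this bites because ReadWriteTransaction may invoke your closure several times before a successful commit. A minimal sketch of the failure mode (names are ours, not the actual Nimses code):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
)

func main() {
	ctx := context.Background()
	client, err := spanner.NewClient(ctx, "projects/p/instances/i/databases/d")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	total := int64(0) // external state: survives across retry attempts

	_, err = client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		// BUG: if the transaction aborts and the closure is retried,
		// this increment runs again and total drifts.
		total += 100

		// Safe pattern: keep all state local to one attempt and express
		// changes as buffered mutations, which apply only on commit.
		return txn.BufferWrite([]*spanner.Mutation{
			spanner.Update("Accounts",
				[]string{"AccountId", "Balance"},
				[]interface{}{"acc-42", total}),
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```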
  35. Ending notes
      • Running on “someone else’s technology” can help product teams move faster
      • Forcing “the old way” on new technology leads to lost time
      • Learning “the right way” leads to results
      • Every minute counts :)
  36. Thank you! Questions?