[Dmitriy Novakovskiy, Ivan Prisyazhnyy] Making every minute count: How Nimses scales its mobile platform on Google Cloud

Presentation from GDG DevFest Ukraine 2018 - the biggest community-driven Google tech conference in Central and Eastern Europe (CEE).

Learn more at: https://devfest.gdg.org.ua


In this session we will share practical “battle stories” - real ones, and lots of them. This is a retrospective of a journey of more than a year, over the course of which Nimses engineers adopted Google Cloud Platform services such as:

Google Kubernetes Engine
Google Cloud Datastore
Google Cloud Spanner
Google BigQuery
Google Cloud SQL
Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc

Nimses (http://nimses.com) is a location-based social mobile platform that turns every minute of a user’s time into a unit of value. Each minute of a person's life within Nimses generates a single unit of digital currency called the Nim. Nimses users can spend Nims to interact with each other and make purchases from local vendors, based on geolocation. Nimses helps people and businesses connect with each other in a meaningful way, with the value of each action measured in Nims.

As of today over 5 million people are using Nimses across 20 countries.

In July 2017 Nimses decided to migrate all of its services to Google Cloud Platform. The main driver was the rapid growth of the user base, which raised infrastructure scalability and data storage requirements to seemingly unreachable levels.

Together with the Google Cloud team, Nimses engineers were able to migrate the main application services to Google Kubernetes Engine in a matter of weeks. Later on, Nimses gradually adopted other platform services (Datastore, BigQuery, Dataflow), allowing the team to analyze large amounts of data and develop new features faster.

Most recently, Nimses launched the Nimses Blockchain (https://nimses.com/posts/blockchain-beta/) payment platform, with a ledger of transactions powered by Google Cloud Spanner.


Google Developers Group Lviv

October 12, 2018

Transcript

  1. Making every minute count: How Nimses scales its mobile platform on Google Cloud
  2. (image-only slide)
  3. Who are we? Dmitriy Novakovskiy, Cloud Sales Engineer @Google, novakovskiy@google.com; Ivan Prisyazhnyy, Software Engineer @Nimses Core, ivan@nimses.com
  4. So, what is Nimses? Let’s take a look

  5. What is Nimses?
     • Mobile app that turns every minute of life into a unit of value
     • Geolocation-based
     • >6 million users and growing globally
     • >100 microservices
     • Private blockchain, NIM exchange
     • Ads, Marketplace, ML
  6. Original architecture
     • A handful of microservices
     • Written in Go, gRPC, go-kit
     • EC2 w/ Docker Compose
     • RDS w/ PostgreSQL
     • Memcached, ELK
     • Amazon CDN (CloudFront)
  7. Scaling up while staying small
     • Managed infrastructure: CI/CD & microservices, autoscaling, Ops automation
     • Distributed transactions: blockchain, global scale
     • Horizontally scalable NoSQL: store app data, handle >100M users, no data management toil
  8. Step 1: Managed Kubernetes. “We lost a few nodes in one Zone last night. No human impact, all things recovered on their own.”
  9. (image-only slide)
  10. Kubernetes gives us:
      • Container scheduling
      • Lifecycle and health
      • Autoscaling (vertical and horizontal)
      • Service discovery
      • Load balancing
      • Logging and monitoring
      • Storage management
      • Network connectivity
  11. But running Kubernetes “your way” is not easy:
      • Provisioning Infrastructure: create VMs and bootstrap core components for the Kubernetes control plane
      • Implementing Networking: IP ranges for Pods and Services, configuring an overlay
      • Bootstrapping Cluster Services: provision kube-dns, logging daemons, monitoring
      • Ensuring Node Resiliency: at master & node level, ensure uptime, configure kernel upgrades and OS updates
  12. Kubernetes

  13. Lessons learned from our GKE migration

  14. This is a deployment.yaml for Kubernetes

  15. You need a package manager - Helm
      • Reusable templates for service deployments (see the template sketch below)
      • Expose only required parameters to developers
      • Smaller manifests -> fewer errors
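
A minimal Helm template in the spirit of this slide might look as follows. This is an illustrative sketch, not Nimses’ actual chart; the values keys (replicas, image, resources) are hypothetical.

```yaml
# templates/deployment.yaml - hypothetical Helm template sketch.
# Ops own this file; developers only ever touch values.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: {{ .Release.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources:
            limits:
              cpu: {{ .Values.resources.cpu }}
              memory: {{ .Values.resources.memory }}
```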
  16. Decouple operations from development
      • Developers don’t like operations :)
      • Let them care about values.yaml only (sketch below)
      • Leave Ingresses, ReplicaSets and other k8s constructs to Ops
      (diagram: values.yaml on the Dev side; Ingresses, ReplicaSets, Services... on the Ops side)
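
The matching values.yaml - the only file a developer would then edit - could be as small as this (same hypothetical keys as the template sketch above):

```yaml
# values.yaml - hypothetical developer-facing parameters, nothing else.
replicas: 3
image:
  repository: gcr.io/my-project/profile
  tag: "1.4.2"
resources:
  cpu: 250m
  memory: 256Mi
```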
  17. Security still matters
      • There is no security “out of the box”
      • Containerized applications may have “defaults” like a root account and no resource limits (see the hardening sketch below)
      • Use audit tools and “best practices”, for example: https://kubesec.io/
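
The kinds of defaults such audits flag can be fixed at the container-spec level. A hedged sketch (not Nimses’ actual configuration) of a non-root container with explicit resource limits:

```yaml
# Hypothetical container-spec hardening: non-root user, no privilege
# escalation, read-only root filesystem, and resource limits - typical
# findings of audit tools like kubesec.
spec:
  containers:
    - name: app
      image: gcr.io/my-project/app:1.0.0
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```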
  18. Be careful with your first Production rollout
      • On Production Day 1 we switched 100% of user traffic to the newly deployed GKE cluster
      • Users started receiving auth errors right away
      • We had to debug and fix the Ingress configuration “on the go”
      • Probably we should’ve done a Canary :)
  19. What’s next? Service mesh with Istio
      • Intelligent routing: dynamic route configuration, A/B tests, canaries (sketch below), gradual version upgrades
      • Resilience: timeouts, retries, health checks, circuit breakers
      • Security & policy: mutual TLS, organizational policy, access policies, rate limiting
      • Telemetry: service dependencies, traffic flow, distributed tracing
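
As one concrete example of the routing column, a weighted Istio VirtualService can implement the canary pattern mentioned on slide 18. The service name and subsets below are hypothetical; the subsets would be defined in a matching DestinationRule.

```yaml
# Hypothetical canary: 90% of traffic to the stable subset, 10% to the canary.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: profile
spec:
  hosts:
    - profile
  http:
    - route:
        - destination:
            host: profile
            subset: stable
          weight: 90
        - destination:
            host: profile
            subset: canary
          weight: 10
```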
  20. Step 2: Cloud Datastore. A database with no DBA required

  21. What Datastore gives us
      • NoSQL document-oriented database: Row -> Entity / Entity group, Table -> Kind, Field -> Property (see the Go sketch below)
      • NoOps DB that actually works
      • Horizontal scale with no DBA effort
      • “Pay-per-use” (storage and operations)
      • No capacity management required
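
In Go - the language the Nimses backend is written in - that mapping translates roughly as below, using the official cloud.google.com/go/datastore client. The Profile kind, its fields and the project ID are hypothetical.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/datastore"
)

// Profile is a hypothetical Kind (Datastore's analogue of a table);
// each struct field becomes a Property (the analogue of a column).
type Profile struct {
	Name    string
	Country string
}

func main() {
	ctx := context.Background()
	client, err := datastore.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A key identifies one Entity (the analogue of a row).
	key := datastore.NameKey("Profile", "user-123", nil)
	if _, err := client.Put(ctx, key, &Profile{Name: "Alice", Country: "UA"}); err != nil {
		log.Fatal(err)
	}

	var p Profile
	if err := client.Get(ctx, key, &p); err != nil {
		log.Fatal(err)
	}
	log.Println(p.Name, p.Country)
}
```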
  22. We had to learn the “Managed NoSQL way”
      • Isolation and consistency depend on operation and index type
      • “Hotspotting” on the storage layer impacts key choices, indexing and ingestion (see the key-design sketch below)
      • Datastore autopartitions key ranges - hotspots can take time to split
      • 1 update per second per Entity Group
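
A common way to dodge key-range hotspots - our own sketch, not code from the talk - is to avoid monotonically increasing key names such as timestamps:

```go
package keys

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// hotKeyName is the anti-pattern: keys sort by creation time, so every new
// entity lands on the same "hot" tail of the key range until it splits.
func hotKeyName(userID string) string {
	return fmt.Sprintf("%d-%s", time.Now().UnixNano(), userID)
}

// distributedKeyName spreads writes evenly across key ranges by using a
// random prefix instead of a monotonically increasing one.
func distributedKeyName(userID string) (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b) + "-" + userID, nil
}
```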
  23. Software bugs can impact your DB costs
      • Datastore charges for bulks of operations (reads, writes, etc.)
      • It’s useful to have a caching layer in front of Datastore for cost savings (sketch below)
      • It’s not always easy to spot a bug in your code that is “wasting” your reads, or a suboptimal “background” script
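
A caching layer of the kind this slide suggests could be as simple as a read-through wrapper. The sketch below is entirely illustrative (a single hypothetical Profile type, no invalidation or TTL): repeat reads are served from memory, so they are never billed as Datastore operations.

```go
package profilecache

import (
	"context"
	"sync"

	"cloud.google.com/go/datastore"
)

// Profile is a hypothetical entity type.
type Profile struct {
	Name    string
	Country string
}

// ProfileCache is a read-through cache in front of Datastore.
type ProfileCache struct {
	DS *datastore.Client

	mu    sync.RWMutex
	items map[string]Profile // keyed by the Datastore key's string form
}

func (c *ProfileCache) Get(ctx context.Context, key *datastore.Key) (Profile, error) {
	c.mu.RLock()
	p, ok := c.items[key.String()]
	c.mu.RUnlock()
	if ok {
		return p, nil // cache hit: no Datastore read is charged
	}
	if err := c.DS.Get(ctx, key, &p); err != nil { // cache miss: one billed read
		return Profile{}, err
	}
	c.mu.Lock()
	if c.items == nil {
		c.items = make(map[string]Profile)
	}
	c.items[key.String()] = p
	c.mu.Unlock()
	return p, nil
}
```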
  24. Tooling around Managed services can be limited
      • We needed visibility into operational stats and index usage
      • UI stats are limited, but Google Support can help with granular reports
      • We needed backups to protect against bugs that can corrupt data
      • Managed Backups are NOT incremental and charge for every read
  25. Take care of your indexes and data model
      • Indexes consume space and performance
      • Reindexing is an involved operation
      • Visibility into index usage can be limited
      • Migrating the data model requires a strategy; remember backward and forward compatibility
      • Distribute load evenly
      • Don’t index monotonically increasing properties (see the sketch below)
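
With the Go client, keeping a monotonically increasing property out of the indexes is a struct tag away. A hypothetical entity:

```go
package model

import "time"

// Event is a hypothetical entity. The noindex tag keeps the monotonically
// increasing CreatedAt out of the built-in indexes (avoiding index hotspots)
// and also excludes the large Payload blob to save index space.
type Event struct {
	UserID    string
	Payload   string    `datastore:",noindex"`
	CreatedAt time.Time `datastore:",noindex"`
}
```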
  26. We still use multiple DB services
      • We moved most of the backend services to Datastore: user profiles, chats, lobby, authentication, feed, events, music
      • Geospatial queries are still on PostgreSQL (Cloud SQL) because of PostGIS
      • Our blockchain runs on Cloud Spanner because of stricter consistency and higher throughput
      • All analytics are in BigQuery
  27. What’s next? Firestore
      • Runs on top of Spanner (replacing Megastore)
      • Removes Datastore restrictions:
        ◦ No more eventual consistency - all queries become strongly consistent
        ◦ Transactions are no longer limited to 25 entity groups
        ◦ Writes to an entity group are no longer limited to 1 per second
      • Has a Datastore API compatibility mode
      • Currently in Beta; automatic upgrade of existing DBs in 2019
  28. Step 3: Google Cloud Spanner. Going global with a relational database

  29. Nimses Blockchain
      • Financial component to support transactions in Nims
      • Trust & Transparency
      • Governance
      • Performance & Scalability
  30. What Cloud Spanner gives us
      • Serializable transactions with external consistency
      • Synchronous data replication across multiple locations
      • 99.999% availability
      • Horizontally scalable Reads and Writes, automated sharding and re-sharding
      • Fully managed, no maintenance downtime
  31. Performance optimization requires iterations
      • Whitepapers and “best practice” docs are useful
      • Understanding of core concepts (Splits, Interleaving) is mandatory (see the schema sketch below)
      • Data layout will evolve along with business requirements
      • Minimize the number of Splits (a.k.a. shards) participating in a single transaction
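
To make interleaving concrete: the sketch below (our guess at a shape such a ledger could take, not Nimses’ actual schema) co-locates child rows with their parent, so a transaction scoped to one account touches a single Split.

```go
package schema

// DDL is a hypothetical interleaved Spanner schema. Transactions rows are
// stored physically under their parent Accounts row, so account-scoped
// reads and writes stay within one Split.
const DDL = `
CREATE TABLE Accounts (
  AccountId STRING(36) NOT NULL,
  Balance   INT64      NOT NULL
) PRIMARY KEY (AccountId);

CREATE TABLE Transactions (
  AccountId STRING(36) NOT NULL,
  TxId      STRING(36) NOT NULL,
  Amount    INT64      NOT NULL
) PRIMARY KEY (AccountId, TxId),
  INTERLEAVE IN PARENT Accounts ON DELETE CASCADE
`
```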
  32. It’s a Managed DB, but with some Ops
      • Use the native Spanner API to gather profiling and stats on transactions
      • Spanner does data maintenance in the background - keep CPU load at 70-75%
      • You can always scale up by adding a node, with no downtime
      • Backups are easy, with Dataflow integrated in the UI
  33. SQL is there, but the native API is better
      • No DML support yet with SQL
      • The query execution plan is visible, but hard to optimize
      • You need to force index usage and specify JOIN types (see the sketch below)
      • The native API is much better for optimizing complex transactions
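
Forcing an index from the Go client looks roughly like this. The table, index and column names are hypothetical, but the @{FORCE_INDEX=...} table hint is standard Spanner SQL.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := spanner.NewClient(ctx, "projects/p/instances/i/databases/d")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Without the hint the planner may choose a base-table scan; the hint
	// pins the (hypothetical) TransactionsByAccount secondary index.
	stmt := spanner.Statement{
		SQL: `SELECT TxId, Amount
		        FROM Transactions@{FORCE_INDEX=TransactionsByAccount}
		       WHERE AccountId = @account`,
		Params: map[string]interface{}{"account": "acc-42"},
	}
	iter := client.Single().Query(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		var txID string
		var amount int64
		if err := row.Columns(&txID, &amount); err != nil {
			log.Fatal(err)
		}
		log.Println(txID, amount)
	}
}
```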
  34. The coolest bug we had - External state
      • A transaction can be aborted or restarted a few times before it is committed
      • Transaction state must be valid in between these retry attempts
      • Try not to use external state in between transaction retries (see the sketch below)
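
In the Go client this bites because ReadWriteTransaction may invoke your closure several times before a successful commit. A minimal sketch of the failure mode (names are ours, not the actual Nimses code):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
)

func main() {
	ctx := context.Background()
	client, err := spanner.NewClient(ctx, "projects/p/instances/i/databases/d")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	total := int64(0) // external state: survives across retry attempts

	_, err = client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		// BUG: if the transaction aborts and the closure is retried,
		// this increment runs again and total drifts.
		total += 100

		// Safe pattern: keep all state local to one attempt and express
		// changes as buffered mutations, which apply only on commit.
		return txn.BufferWrite([]*spanner.Mutation{
			spanner.Update("Accounts",
				[]string{"AccountId", "Balance"},
				[]interface{}{"acc-42", total}),
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```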
  35. Ending notes
      • Running on “someone else’s technology” can help product teams move faster
      • Forcing “the old way” on new technology leads to lost time
      • Learning “the right way” leads to results
      • Every minute counts :)
  36. Thank you! Questions?