
[Dmitriy Novakovskiy, Ivan Prisyazhnyy] Making every minute count: How Nimses scales its mobile platform on Google Cloud

Presentation from GDG DevFest Ukraine 2018 - the biggest community-driven Google tech conference in the CEE.

Learn more at: https://devfest.gdg.org.ua


In this session we will share practical “battle stories”. Real ones, lots of them. It will be a retrospective of a journey of more than a year, over the course of which Nimses engineers adopted Google Cloud Platform services such as:

Google Kubernetes Engine
Google Cloud Datastore
Google Cloud Spanner
Google BigQuery
Google Cloud SQL
Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc

Nimses (http://nimses.com) is a location-based social mobile platform that turns every minute of a user’s time into a unit of value. Each minute of a person’s life within Nimses generates one unit of digital currency, called a Nim. Nimses users can spend Nims to interact with each other and make geolocation-based purchases from local vendors. Nimses helps people and businesses connect with each other in a meaningful way, with the value of each action measured in Nims.

As of today over 5 million people are using Nimses across 20 countries.

In July 2017 Nimses decided to migrate all of its services to Google Cloud Platform. The main driver was the rapid growth of its user base, which pushed requirements for infrastructure scalability and data storage to seemingly unreachable levels.

Together with the Google Cloud team, Nimses engineers were able to migrate the main application services to Google Kubernetes Engine in a matter of weeks. Later on, Nimses gradually tapped into other platform services (Datastore, BigQuery, Dataflow), allowing the team to analyze large amounts of data and develop new features faster.

Most recently, Nimses launched the Nimses Blockchain (https://nimses.com/posts/blockchain-beta/) payment platform, with its ledger of transactions powered by Google Cloud Spanner.

Google Developers Group Lviv

October 12, 2018

Transcript

  1. What is Nimses? • Mobile app that turns every minute

    of life into a unit of value • Geolocation-based • > 6 million users and growing globally • > 100 microservices • Private blockchain, NIM exchange • Ads, Marketplace, ML
  2. Original architecture • A handful of microservices • Written in

    Go, GRPC, go-kit • EC2 w/ Docker-compose • RDS w/ PostgreSQL • Memcached, ELK • Amazon CDN (CloudFront)
  3. Scaling up while staying small • Managed infrastructure: CI/CD &

    microservices, Autoscaling, Ops automation • Distributed transactions: Blockchain, Global scale • Horizontally scalable NoSQL: Store app data, Handle >100M users, No data management toil
  4. Step 1: Managed Kubernetes “We lost a few nodes in

    one Zone last night. No human impact, all things recovered on their own”.
  5. • Container scheduling • Lifecycle and health • Autoscaling (vertical

    and horizontal) • Service discovery • Load balancing • Logging and monitoring • Storage management • Network connectivity Kubernetes gives us
  6. • Provision Infrastructure: Create VMs and bootstrap core components for

    the Kubernetes control plane • Implement Networking: IP ranges for Pods and Services, configuring an overlay • Bootstrap Cluster Services: Provision kube-dns, logging daemons, monitoring • Ensure Node Resiliency: At the master and node level, ensure uptime, configure kernel upgrades and OS updates But running Kubernetes “your way” is not easy
  7. You need package manager - Helm • Reusable templates for

    service deployments • Expose only required parameters to developers • Smaller manifests -> fewer errors
  8. • Developers don’t like operations :) • Let them care

    about values.yaml only • Leave Ingresses, ReplicaSets and other k8s constructs to Ops Decouple operations from development values.yaml Ingresses, ReplicaSets, Services...
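The split described in slides 7 and 8 can be sketched as a minimal Helm chart. The chart, service name, and parameters below are illustrative, not Nimses' actual manifests: developers edit only values.yaml, while Ops maintain the templates that expand it into full Kubernetes objects.

```yaml
# values.yaml -- the only file a service team touches (names are hypothetical)
image:
  repository: gcr.io/my-project/profile-service
  tag: "1.4.2"
replicaCount: 3
resources:
  requests:
    cpu: 100m
    memory: 128Mi

# templates/deployment.yaml (excerpt) -- owned by Ops, reused by every service
# spec:
#   replicas: {{ .Values.replicaCount }}
#   template:
#     spec:
#       containers:
#         - image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
#           resources: {{- toYaml .Values.resources | nindent 12 }}
```

Because the template only exposes a handful of parameters, a typo in values.yaml fails fast at `helm upgrade` time instead of producing a malformed Ingress or ReplicaSet.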
  9. • There is no security “out of the box” •

    Containerized applications may have “defaults” like root account and no resource limits • Use audit tools and “best practices”, for example: https://kubesec.io/ Security still matters
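A sketch of the kind of defaults an audit tool like kubesec.io flags, and how to tighten them in a Pod spec. The image and names are hypothetical; the fields shown (`securityContext`, `resources.limits`) are standard Kubernetes:

```yaml
# A container that would otherwise run as root with unbounded resources
apiVersion: v1
kind: Pod
metadata:
  name: profile-service
spec:
  containers:
    - name: app
      image: gcr.io/my-project/profile-service:1.4.2
      securityContext:
        runAsNonRoot: true               # refuse to start as UID 0
        readOnlyRootFilesystem: true     # no writes to the container FS
        allowPrivilegeEscalation: false  # block setuid-style escalation
      resources:
        limits:
          cpu: 250m
          memory: 256Mi
```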
  10. • On Production Day 1 we switched 100% of user

    traffic to the newly deployed GKE cluster • Users started receiving auth errors right away • We had to debug and fix the Ingress configuration “on the go” • We probably should have done a canary rollout :) Be careful with your first Production rollout
  11. What’s next? Service mesh with Istio Intelligent routing • Dynamic

    route configuration • A/B tests • Canaries • Gradually upgrade versions Resilience • Timeouts • Retries • Health checks • Circuit breakers Security & policy • Mutual TLS • Organizational policy • Access policies • Rate Limiting Telemetry • Service Dependencies • Traffic Flow • Distributed Tracing
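The canary routing mentioned on the slide could look like the following Istio VirtualService, a sketch with a hypothetical service name that splits traffic 90/10 between a stable and a canary subset:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: profile-service
spec:
  hosts:
    - profile-service
  http:
    - route:
        - destination:
            host: profile-service
            subset: stable
          weight: 90      # most traffic stays on the known-good version
        - destination:
            host: profile-service
            subset: canary
          weight: 10      # small slice validates the new version
```

Shifting weights gradually (10 → 50 → 100) is exactly the safety net that was missing in the Production Day 1 story above.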
  12. • NoSQL document-oriented database ◦ Row -> Entity / Entity

    group ◦ Table -> Kind ◦ Field -> Property • NoOps DB that actually works • Horizontal scale with no DBA effort • “Pay-per-use” (storage and operations) • No capacity management required What Datastore gives us
  13. • Isolation and consistency depends on operation and index type

    • “Hotspotting” on storage layer impacts key choices, indexing and ingestion • Datastore does autopartitioning of key ranges - hotspots can take time to split • 1 update per second per Entity Group We had to learn the “Managed NoSQL way”
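One common way to avoid the hotspotting the slide describes is to stop writing monotonically increasing keys (timestamps, counters) in order, by scattering them with a short hash prefix. This is a minimal sketch of the idea in Go; the key layout and function are an illustrative convention, not part of the Datastore API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// shardedKey spreads sequential IDs across the keyspace by prefixing a
// short hash of the ID, so consecutive writes land on different key
// ranges instead of piling onto one "hot" tablet.
func shardedKey(kind string, seqID int64) string {
	h := sha256.Sum256([]byte(fmt.Sprintf("%d", seqID)))
	prefix := hex.EncodeToString(h[:2]) // 4 hex chars -> 65536 buckets
	return fmt.Sprintf("%s/%s-%d", kind, prefix, seqID)
}

func main() {
	// Sequential IDs no longer produce adjacent keys.
	for _, id := range []int64{1001, 1002, 1003} {
		fmt.Println(shardedKey("Event", id))
	}
}
```

The trade-off: range scans over the original ordering now require querying every bucket, so this suits write-heavy, point-read workloads.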
  14. • Datastore charges for bulks of operations (reads, writes, etc)

    • It’s useful to have caching layer in front of Datastore for cost saving • It’s not always easy to spot a bug in your code that is “wasting” your reads, or a suboptimal “background” script Software bugs can impact your DB costs
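The caching layer the slide recommends can be sketched as a read-through cache. The store below is an in-memory stub standing in for the Datastore client, with a counter in place of the billing meter:

```go
package main

import "fmt"

// Entity stands in for a Datastore entity.
type Entity struct{ Name string }

// Store is a stub backend; Reads counts billable read operations.
type Store struct {
	data  map[string]Entity
	Reads int
}

func (s *Store) Get(key string) (Entity, bool) {
	s.Reads++ // every call here would be a charged Datastore read
	e, ok := s.data[key]
	return e, ok
}

// CachedStore answers repeat lookups from memory, so only the first
// read per key reaches the (billed) backend.
type CachedStore struct {
	store *Store
	cache map[string]Entity
}

func (c *CachedStore) Get(key string) (Entity, bool) {
	if e, ok := c.cache[key]; ok {
		return e, true
	}
	e, ok := c.store.Get(key)
	if ok {
		c.cache[key] = e
	}
	return e, ok
}

func main() {
	s := &Store{data: map[string]Entity{"user:1": {Name: "nim"}}}
	c := &CachedStore{store: s, cache: map[string]Entity{}}
	for i := 0; i < 5; i++ {
		c.Get("user:1")
	}
	fmt.Println("billable reads:", s.Reads) // 1, not 5
}
```

The same counter is also a cheap way to catch the "wasting reads" bugs the slide mentions: a background script that bypasses the cache shows up immediately in the read count.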
  15. • We needed visibility into operational stats and index usage

    • UI stats are limited, but Google Support can help with granular reports • We needed backups to protect against bugs that can corrupt data • Managed Backups are NOT incremental and charge for every read Tooling around Managed services can be limited
  16. Take care of your indexes and data model • Indexes

    consume space and performance • Reindexing is an involved operation • Visibility into index usage can be limited • Migration of data model requires strategy, remember about backward and forward compatibility • Distribute load evenly • Don’t index monotonically increasing properties
  17. We still use multiple DB services • We moved most

    of our backend services to Datastore: user profiles, chats, lobby, authentication, feed, events, music • Geospatial queries are still on PostgreSQL because of PostGIS (Cloud SQL) • Our blockchain works on Cloud Spanner because of stricter consistency and higher throughput • All analytics are in BigQuery
  18. What’s next? Firestore • Runs on top of Spanner (replacing

    Megastore) • Removes Datastore restrictions: ◦ Eventual consistency: all queries become strongly consistent ◦ Transactions are no longer limited to 25 entity groups ◦ Writes to an entity group are no longer limited to 1 per second • Has a Datastore API compatibility mode • Currently in Beta; automatic upgrade of existing DBs in 2019
  19. Nimses Blockchain • Financial component to support transactions in Nims

    • Trust & Transparency • Governance • Performance & Scalability
  20. • Serializable transactions with external consistency • Synchronous data replication

    across multiple locations • 99.999% availability • Horizontally scalable Reads and Writes, automated sharding and re-sharding • Fully managed, no maintenance downtime What Cloud Spanner gives us
  21. Performance optimization requires iterations • Whitepapers and “best practice” docs

    are useful • Understanding of core concepts (Splits, Interleaving) is mandatory • Data layout will evolve along with business requirements • Minimize the number of Splits (a.k.a. shards) participating in a single transaction
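Interleaving, one of the core concepts the slide names, can be sketched in Spanner DDL. Table and column names here are illustrative, not the actual Nimses schema: interleaving stores each account's rows physically next to the parent row, so a single-account transaction stays within one Split:

```sql
CREATE TABLE Accounts (
  AccountId STRING(36) NOT NULL,
  Balance   INT64 NOT NULL
) PRIMARY KEY (AccountId);

-- Child rows share the parent's key prefix and live in the same Split,
-- so debiting an account and recording its transfer is a local commit.
CREATE TABLE Transfers (
  AccountId  STRING(36) NOT NULL,
  TransferId STRING(36) NOT NULL,
  AmountNims INT64 NOT NULL
) PRIMARY KEY (AccountId, TransferId),
  INTERLEAVE IN PARENT Accounts ON DELETE CASCADE;
```

A transfer between two accounts still spans two Splits, which is why the slide's advice is to minimize, not eliminate, cross-Split participation.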
  22. It’s a Managed DB, but with some Ops • Use

    native Spanner API to gather profiling and stats on transactions • Spanner does data maintenance in the background. Keep CPU load at 70-75% • You can always scale up by adding a node, no downtime • Backups are easy, with Dataflow integrated on UI
  23. • No DML support yet with SQL • Query execution

    plan is visible, but hard to optimize • You need to force index usage and specify JOIN types • Native API is much better for optimization of complex transactions SQL is there, but native API is better
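Forcing index usage, as the slide describes, is done with a table hint in Spanner SQL. The table and index names below are hypothetical; the `@{FORCE_INDEX=...}` syntax is Spanner's:

```sql
-- Without the hint, Spanner may scan the base table; the hint pins the
-- query to a specific secondary index.
SELECT t.TransferId, t.AmountNims
FROM Transfers@{FORCE_INDEX=TransfersByAmount} AS t
WHERE t.AmountNims > 1000;
```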
  24. • A transaction can be aborted or restarted a few

    times before it is committed • Transaction state must be valid in between these retry attempts • Try not to use external state in between transaction retries The coolest bug we had - External state
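The "coolest bug" generalizes to any retried transaction closure. This Go sketch imitates a transaction runner that, like Spanner's, may invoke the closure several times before committing (here: a fixed three runs); the runner itself is a stand-in, not the Spanner client:

```go
package main

import "fmt"

// runTransaction imitates a retrying transaction runner: the closure is
// executed three times, simulating two aborts before the final commit.
func runTransaction(fn func() error) error {
	for attempt := 0; attempt < 3; attempt++ {
		if err := fn(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// BUG: total lives outside the closure, so every retry adds again.
	total := 0
	runTransaction(func() error {
		total += 10
		return nil
	})
	fmt.Println("buggy total:", total) // 30, not 10

	// FIX: derive state from scratch inside each attempt, so a retry
	// overwrites rather than accumulates.
	var result int
	runTransaction(func() error {
		result = 10 // computed only from data read in this attempt
		return nil
	})
	fmt.Println("correct total:", result) // 10
}
```

The rule the slide states falls out directly: anything the closure touches must be recomputable from the transaction's own reads, because the runtime owns how many times the closure runs.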
  25. • Running on “someone else’s technology” can help product teams

    to move faster • Forcing “the old way” on new technology leads to lost time • Learning “the right way” leads to results • Every minute counts :) Ending notes