
Scalability @ Sale Stock


At Sale Stock, we are trying to give everybody easy access to great-quality clothing at an affordable price. During the past year, we experienced explosive growth across a wide array of metrics: user base, revenue, user traffic, and team size, to name a few. For the engineering team, this meant we had to scale quickly on all dimensions. In this talk, we share a technical deep dive into the strategies we employ to meet these scalability challenges: backend infrastructure, developer tooling, platform abstractions, deployment workflow, monitoring, data infrastructure, and other practices that allow our engineers to move quickly and efficiently.

About the speakers:

Garindra Prahandono
Garindra is the Chief Technology Officer at Sale Stock Indonesia. Previously, he worked at Sony America on core products such as the PlayStation 4 and PlayStation Now. His work, used by tens of millions of people around the world, spans server-side infrastructure, core user interface abstractions, and internal test automation infrastructure.

Thomas Diong
Thomas is the Chief Data Officer at Sale Stock Indonesia. He was previously at Yahoo!, where he handled global tech initiatives for Yahoo! Messenger, Yahoo! Application Platform, Yahoo! Games, and the like. He then went on to Apple, where he worked on business process improvements and streamlining with automation, and subsequently to Spuul (a movie streaming company), where he led growth and data efforts before moving to Veritrans.

Wilson Lauw
Wilson is a Data Engineer at Sale Stock Indonesia, working on data infrastructure for analytics and machine learning. Previously, he worked as a data scientist at Healint, a big data analytics company in the healthcare industry, on both data analysis and data infrastructure.

Sale Stock Engineering

March 29, 2016



Transcript

  1. Scalability @ Sale Stock
    March 29th, 2016


  2. Welcome!


  3. Who are we?


  4. Who are we?
    ● Tech startup that sells mid-to-low-priced women’s fashion
    ● Engineering team started ~1 year ago
    ● Launched our in-house website ~8 months ago


  5. Increase in various metrics
    ● Revenue
    ● Team Size
    ● User base
    ● Traffic


  6. Scalability Problems:
    ● Iteration Speed
    ● Code Quality
    ● Backend Infrastructure
    ● etc.


  7. Iteration Speed Scalability


  8. GitFlow


  9. View Slide

  10. GitFlow
    ● Dual main branches: master & develop
    ● Long-lived feature branches


  11. GitFlow downsides
    ● Isolated feature branches
    ● Horribly painful merges
    ● Horribly risky deploys


  12. Trunk-based Development


  13. Trunk-based Development
    ● Single main branch (master-only)
    ● Long-lived feature branches are discouraged


  14. View Slide

  15. Trunk-based Development gives us:
    ● Fewer merge conflicts
    ● Less risky deploys
    ● Faster iteration speed
    ● Fewer dedicated non-prod environments


  16. More frequent merges directly to master… that’s scary.


  17. What we’re doing:
    ● Automated test suite
    ● Feature gating


  18. Automated Test Suite


  19. Automated Test Suite
    1. Core Test Suite
    2. Comprehensive Test Suite
    3. Continuous Production Smoke Test


  20. Core Test Execution
    ● Runs on every merge cycle of our www codebase
    ● Results decide whether we auto-deploy the latest merge
    ● Optimized for the best coverage-over-speed investment ratio
    ● Consists of hundreds of functional test cases
    ● Runs on a 20-node test cluster for speedy execution
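The bullets above can be sketched in a few lines. This is only an illustration, not the team's actual runner: the round-robin sharding strategy and the names `shard_tests` and `should_auto_deploy` are assumptions; only the 20-node cluster size and the "results gate the auto-deploy" rule come from the slide.

```python
from typing import Dict, List

NODE_COUNT = 20  # cluster size mentioned on the slide


def shard_tests(test_names: List[str], node_count: int = NODE_COUNT) -> List[List[str]]:
    """Spread test cases round-robin so every node gets a similar share."""
    shards: List[List[str]] = [[] for _ in range(node_count)]
    for index, name in enumerate(sorted(test_names)):
        shards[index % node_count].append(name)
    return shards


def should_auto_deploy(results: Dict[str, bool]) -> bool:
    """Deploy only if at least one test ran and every test passed."""
    return bool(results) and all(results.values())


# Hypothetical test names, for illustration only.
shards = shard_tests([f"test_case_{i}" for i in range(45)])
print([len(shard) for shard in shards])
print(should_auto_deploy({"test_case_0": True, "test_case_1": False}))
```

Treating an empty result set as "do not deploy" is a deliberate choice in this sketch: a merge whose tests never ran should not be shipped automatically.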


  21. Comprehensive Test Execution
    ● Ultra-complete test coverage: covers all user paths
    ● Runs on multiple devices and browsers
    ● Runs periodically, outside the merge cycle


  22. Continuous Prod Smoke Test
    ● Runs continuously against prod environment
    ● Simulates real users
    ● More sane, useful, accurate form of continuous monitoring compared to regular uptime alerting.
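As a rough illustration of why a simulated user journey says more than an uptime ping, the sketch below runs a sequence of checks and reports the first one that fails. The step names and the `run_journey` helper are hypothetical; in a real smoke test each check would drive the production site rather than return a constant.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]  # (step name, check returning True on success)


def run_journey(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            return name
    return None


# Hypothetical journey: the homepage is up, but checkout is broken.
journey: List[Step] = [
    ("open homepage", lambda: True),
    ("add item to cart", lambda: True),
    ("checkout", lambda: False),
]
print(run_journey(journey))
```

An uptime check would only cover the first step here, while the journey pinpoints the step a real user would get stuck on.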


  23. Feature Gating


  24. Feature Gating
    ● Allows code paths to be activated for a subset of users, or for employees only
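A minimal sketch of such a gate, assuming a percentage-based rollout with stable hashing. The `Gate` class, its fields, and the rule that employees always see gated code are assumptions made for illustration, not the system described in the talk.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    rollout_percent: int = 0      # 0-100: share of regular users who get the feature
    employees_only: bool = False  # if True, regular users never get it


def bucket(gate_name: str, user_id: str) -> int:
    """Map a (gate, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{gate_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(gate: Gate, user_id: str, is_employee: bool = False) -> bool:
    if is_employee:
        return True  # assumption: employees always see gated code paths
    if gate.employees_only:
        return False
    return bucket(gate.name, user_id) < gate.rollout_percent


# Hypothetical gate: 10% of users, plus all employees.
new_checkout = Gate(name="new_checkout", rollout_percent=10)
print(is_enabled(new_checkout, "user-42"))
print(is_enabled(new_checkout, "user-42", is_employee=True))
```

Hashing the gate name together with the user ID keeps each user's experience stable across requests while decorrelating which users fall into the rollout of different features.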


  25. View Slide

  26. Codebase Scalability


  27. SOA / Microservice Architecture
    ● One domain → One service
    ● Clear engineer / team ownership
    ● Downside:
    ○ Increasing number of features and services makes for complex development & deployment


  28. Problems:
    ● No standards for developing a many-service cluster
    ● No standards for deploying a many-service cluster to production


  29. View Slide

  30. View Slide

  31. Development Requirements
    ● Download the software needed for each service / stack type
    ● Run each service (preferably in topological order)
    ● Run dependency processes (MySQL / Redis / Kafka)
    ● Connect the services & databases properly (through env vars)


  32. Deployment Requirements
    ● Create & run containers for each service
    ● Run each service (preferably in topological order)
    ● Scale the services properly
    ● Connect the services & databases properly (through env vars)


  33. ClusterGraph


  34. ClusterGraph
    A data structure describing how a cluster is formed from different services.


  35. View Slide

  36. How do we build this?


  37. ClusterGraph
    ● Monorepo
    ● Each microservice lives in its own top-level folder
    ● In each top-level folder, define a service.yaml, which contains:
    ○ name
    ○ stack
    ○ dependency list (list of other services’ names)
    ○ database dependencies
    ○ etc.
    ● The service.yaml files of all the services are then used to statically build the cluster graph
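A minimal sketch of how a cluster graph could be built statically from such files. The service names, stacks, and dictionary keys below are hypothetical stand-ins for parsed service.yaml contents (YAML parsing is omitted to keep the example dependency-free); `graphlib.TopologicalSorter` from the Python standard library (3.9+) provides the topological order in which services would be started.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical contents of three service.yaml files, already parsed into dicts.
SERVICES = [
    {"name": "www", "stack": "nodejs", "dependencies": ["cart", "catalog"], "databases": ["redis"]},
    {"name": "cart", "stack": "go", "dependencies": ["catalog"], "databases": ["mysql"]},
    {"name": "catalog", "stack": "go", "dependencies": [], "databases": ["mysql"]},
]


def build_cluster_graph(services: List[dict]) -> Dict[str, Set[str]]:
    """Map each service name to the set of services it depends on."""
    graph = {s["name"]: set(s.get("dependencies", [])) for s in services}
    known = set(graph)
    for name, deps in graph.items():
        unknown = deps - known
        if unknown:
            raise ValueError(f"{name} depends on unknown services: {sorted(unknown)}")
    return graph


def start_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return services so that every dependency comes before its dependents."""
    return list(TopologicalSorter(graph).static_order())


print(start_order(build_cluster_graph(SERVICES)))
```

Because the graph is derived purely from files in the repository, a typo in a dependency name or a dependency cycle (which `static_order` reports by raising `graphlib.CycleError`) can be caught before anything is started.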


  38. View Slide

  39. View Slide

  40. View Slide

  41. View Slide

  42. ClusterGraph
    ● This also means the cluster graph is versioned with every git commit
    ● Can technically do an atomic graph refactoring in a single commit


  43. ssi


  44. ssi
    ● Internal command-line program
    ● Constructs the cluster graph from our source code
    ● Executes it locally for development
    ● Instantiates databases


  45. View Slide

  46. komandan
    ● Production-stage executor of ClusterGraph
    ● Uses Kubernetes under the hood


  47. Kubernetes


  48. View Slide

  49. komandan
    ● Stores multiple cluster graph versions
    ● Can deploy a complete cluster in ~15 seconds
    ● Can revert in the same amount of time
    ● Handles service discovery through env var injection
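Service discovery through environment variables can be sketched as follows. The function name and the endpoint table are hypothetical; the `<NAME>_SERVICE_HOST` / `<NAME>_SERVICE_PORT` naming mirrors the convention Kubernetes itself uses for service environment variables, but whether komandan uses the same names is an assumption.

```python
from typing import Dict, List, Tuple


def env_for_service(
    dependencies: List[str],
    endpoints: Dict[str, Tuple[str, int]],
) -> Dict[str, str]:
    """Build the env vars a service needs to reach each of its dependencies."""
    env: Dict[str, str] = {}
    for dep in dependencies:
        host, port = endpoints[dep]
        prefix = dep.upper().replace("-", "_")
        env[f"{prefix}_SERVICE_HOST"] = host
        env[f"{prefix}_SERVICE_PORT"] = str(port)
    return env


# Hypothetical endpoints for one deployed cluster graph version.
ENDPOINTS = {"catalog": ("10.0.0.5", 8080), "user-profile": ("10.0.0.6", 9000)}
for key, value in env_for_service(["catalog", "user-profile"], ENDPOINTS).items():
    print(f"{key}={value}")
```

The service code then only reads its environment, so the same build can run on a laptop, in a transient test cluster, or in production without changes.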


  50. komandan
    ● Since it’s so cheap (and fast) to create new clusters, it’s possible to do:
    ○ Transient clusters for test suite executions
    ○ Transient clusters for open PRs


  51. Why is this important?
    ● Development of complex clusters is more productive
    ● Deployment of complex clusters is simpler and more robust
    ● Allows us to build more features, quicker


  52. Thomas Diong


  53. Scaling Sale Stock with Products


  54. Machine Learning & AI Products
    ● NLP
    ● Recommender System


  55. NLP


  56. Customer Behavior
    ● Customers are mostly outside of cities
    ● They don’t own a desktop or laptop
    ● Their first computer is a low-end Android phone on a terrible internet connection
    ● Buying still happens mostly in offline shops; customers are risk-averse
    ● Customers come to understand a purchase through conversation


  57. AI Needs to be Able To
    ● Use the Indonesian language
    ● Sound natural
    ● Understand e-commerce context


  58. Usual Customer’s Chat


  59. AI Needs to be Able To
    ● Use the Indonesian language
    ● Sound natural
    ● Understand e-commerce context
    ● Speak Alay (informal Indonesian internet slang)


  60. Process
    Preprocessing
    - Tokenize
    - Vectorize
    Learning
    - Deep learning (TensorFlow)
    Output
    - Word-by-word generation until end of line
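The pipeline above can be illustrated without a deep learning framework. In this sketch the tokenizer, the vocabulary layout, and the `<eol>` / `<unk>` tokens are assumptions, and `next_word` is a plain callable standing in for the trained TensorFlow model, which the slides do not describe in detail.

```python
import re
from typing import Callable, Dict, List

EOL = "<eol>"  # assumed end-of-line marker
UNK = "<unk>"  # assumed placeholder for words outside the vocabulary


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Assign an integer ID to every token seen in the training sentences."""
    vocab = {UNK: 0, EOL: 1}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Turn text into the list of token IDs a model would consume."""
    return [vocab.get(token, vocab[UNK]) for token in tokenize(text)]


def generate(next_word: Callable[[List[str]], str], max_words: int = 20) -> List[str]:
    """Ask the model for one word at a time until it emits the end-of-line token."""
    words: List[str] = []
    while len(words) < max_words:
        word = next_word(words)
        if word == EOL:
            break
        words.append(word)
    return words


vocab = build_vocab(["halo kak, ready stok?"])
print(vectorize("halo sis", vocab))
```

The `max_words` cap is a safety net so that a model which never emits the end-of-line token cannot loop forever.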


  61. Usual Customer’s Chat


  62. Current Limitations


  63. Recommender System


  64. Personalization
    ● Over 20k SKUs and increasing
    ● Different types of items: Muslim wear, dresses, skirts, tops, bottoms, bags, shoes, accessories, etc.
    ● Different people have very different tastes
    ● Customers complain about not finding things they like


  65. Recommender System
    ● Many ways to do it
    ● Costly and time-consuming to experiment and iterate with different methods


  66. Recommender System Ideals
    ● Add new models from new data points
    ● Improve existing models
    ● Continuously A/B Test


  67. Modular Design
    W1(item-to-item similarity score) + W2(interest in item based on views) + W3(interest in item based on historical transactions) + …
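Reading each W as a weight applied to one module's score, the formula above is a weighted sum, which can be sketched as below. The `Scorer` signature, the example scores, and the weights are all hypothetical; real modules would compute their scores from catalog and behavioral data.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (user_id, item_id) -> module score


def combined_score(user_id: str, item_id: str, modules: List[Tuple[float, Scorer]]) -> float:
    """Weighted sum of the scores produced by each independent module."""
    return sum(weight * scorer(user_id, item_id) for weight, scorer in modules)


def recommend(user_id: str, item_ids: List[str], modules: List[Tuple[float, Scorer]], top_n: int = 3) -> List[str]:
    """Rank candidate items by combined score, best first."""
    ranked = sorted(item_ids, key=lambda item: combined_score(user_id, item, modules), reverse=True)
    return ranked[:top_n]


# Hypothetical module outputs, keyed by item.
SIMILARITY = {"dress": 0.9, "bag": 0.2, "skirt": 0.5}
VIEW_INTEREST = {"dress": 0.1, "bag": 0.8, "skirt": 0.5}

modules = [
    (0.7, lambda user, item: SIMILARITY[item]),     # W1 * item-to-item similarity
    (0.3, lambda user, item: VIEW_INTEREST[item]),  # W2 * interest based on views
]
print(recommend("user-1", ["dress", "bag", "skirt"], modules))
```

Since the weights live in data rather than in the modules, an experiment can compare different weight sets, or add a module, without touching the existing ones.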


  68. Advantages
    1) Each individual module can be used to build other interesting projects outside of the Recommender System
    - “Produk Menarik Lain” (“Other Interesting Products”)
    - Marketing Push
    2) Modules can be improved or added independently of each other
    3) A/B test aggressively and continuously without having to rebuild


  69. Next On Recommender
    ● Online learning


  70. SALESTOCK DATA INFRASTRUCTURE


  71. 1. File Storage: HDFS
    - Scalable, fault-tolerant distributed file system with fast reads and writes.
    - Data locality for faster access.


  72. 2. Data Management & ETL: Hive
    [Stack diagram: File Storage, Data Management & ETL]
    - Defines tables, partitions, bucketing, and file formats for specific requirements.
    - Translates SQL into MapReduce jobs.
    - Supports writing UDFs for custom requirements.


  73. 3. Random Read / Write: HBase
    [Stack diagram: File Storage, Data Management & ETL, Random Read / Write]
    - Consistent random reads and writes on top of HDFS.
    - Flexibility in key distribution and column design.
    - Apache Phoenix as a SQL skin.
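One common way to influence key distribution in HBase is to prefix row keys with a hash-derived salt, so that sequential natural keys spread across regions. The sketch below illustrates that general technique only; the key format, the bucket count, and the function name are assumptions, and the slides do not say which key design the team uses.

```python
import hashlib


def salted_row_key(natural_key: str, buckets: int = 16) -> str:
    """Prefix the key with a stable salt so sequential keys spread across buckets."""
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}|{natural_key}"


# Hypothetical order IDs that would otherwise sort next to each other.
for order_id in ("order-000001", "order-000002", "order-000003"):
    print(salted_row_key(order_id))
```

The trade-off is that range scans over the natural key now have to fan out over every salt bucket.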


  74. 4. SQL Query & ETL: Impala
    [Stack diagram: File Storage, Data Management & ETL, Random Read / Write, SQL Query & ETL (Impala)]
    - Translates SQL into MPP jobs.
    - Uses the Hive Metastore and Hive UDFs.
    - Does not use MapReduce to process queries.
    - Can read files from HDFS, HBase, and S3.


  75. 5. Complex ETL + Machine Learning: Spark
    [Stack diagram: File Storage, Data Management & ETL, Random Read / Write, SQL Query & ETL (Impala), Complex ETL + Machine Learning]
    - In-memory processing: faster than MapReduce, and parallel processing is easier to express.
    - Can read from and write to multiple sources: HDFS, HBase, S3.


  76. 6. Front End Portal: Hue
    [Stack diagram: File Storage, Data Management & ETL, Random Read / Write, SQL Query & ETL (Impala), Complex ETL + Machine Learning, Front End Portal]
    - Impala is used a lot by non-developers, so we need a good GUI to help them use it easily.
    - Also has a decent HDFS/HBase explorer.
    - Can query an RDBMS if needed.


  77. 7. Job Scheduling: Azkaban
    [Stack diagram: File Storage, Data Management & ETL, Random Read / Write, SQL Query & ETL (Impala), Complex ETL + Machine Learning, Front End Portal, Job Scheduling]
    - Good DAG visualization.
    - Simple job configuration.
    - Easy to inspect logs when an exception occurs.


  78. 8. Archiving: AWS S3
    [Stack diagram: File Storage, Data Management & ETL, Random Read / Write, SQL Query & ETL (Impala), Complex ETL + Machine Learning, Front End Portal, Job Scheduling, Archiving]


  79. 9. Data Ingestion
    [Diagram label: Impala]
    - MySQL + Sqoop: import MySQL tables into Hive tables
    - Kafka + Spark Streaming: real-time data stream


  80. View Slide

  81. We’re Hiring!


  82. We’re Hiring!
    Some of our team members hail from:


  83. We’re Hiring!
    Positions:
    DevOps Engineer
    Front-end Engineer
    Back-end Engineer
    Quality Assurance Engineer
    Data Scientist
    Data Infrastructure Engineer
    Business Intelligence Analyst


  84. We’re Hiring!
    Competitive salary
    Company shares
    Option for working remotely
    Relocation support
    Skills development support
    Health benefits cover family
    Lunch and meals provided
    Flexible working hours
    Career development
    Periodic team gathering
    Regular company hackathon


  85. Reach us at: [email protected]


  86. Thanks for your time!
