Scalability @ Sale Stock

At Sale Stock, we are trying to solve the problem of providing everybody with easy access to great-quality clothing at an affordable price. During the past year, we experienced explosive growth on a wide array of metrics: user base, revenue, user traffic, and team size, to name a few. As an engineering team, this meant we had to scale quickly on all dimensions. In this talk, we share a technical deep dive into the strategies we employ to meet these scalability challenges: backend infrastructure, developer tooling, platform abstractions, deployment workflow, monitoring, data infrastructure, and more, all of which allow our engineers to move quickly and efficiently.

About the speakers:

Garindra Prahandono
Garindra is the Chief Technology Officer at Sale Stock Indonesia. He previously worked at Sony America on core products such as the PlayStation 4 and PlayStation Now. His work, used by tens of millions of people around the world, spans server-side infrastructure, core user-interface abstractions, and internal test-automation infrastructure.

Thomas Diong
Thomas is the Chief Data Officer at Sale Stock Indonesia. He was previously at Yahoo!, where he handled global tech initiatives on Yahoo! Messenger, Yahoo! Application Platform, Yahoo! Games, and the like. He then went on to Apple, where he worked on business-process improvement and streamlining through automation, and subsequently to Spuul (a movie-streaming company), where he led growth and data efforts before moving to Veritrans.

Wilson Lauw
Wilson is a Data Engineer at Sale Stock Indonesia, working on data infrastructure for analytics and machine learning. Previously, he worked as a data scientist at Healint, a big-data analytics company in the healthcare industry, on both data analysis and data infrastructure.


Sale Stock Engineering

March 29, 2016

Transcript

  1. Scalability @ Sale Stock March 29th, 2016

  2. Welcome!

  3. Who are we?

  4. Who are we? • Tech startup that sells mid-low women’s fashion • Engineering team started ~1 year ago • Launched our in-house website ~8 months ago
  5. Increase in various metrics • Revenue • Team Size • User base • Traffic
  6. Scalability Problems: • Iteration Speed • Code Quality • Backend Infrastructure • etc.
  7. Iteration Speed Scalability

  8. GitFlow

  10. GitFlow • Dual main branches: master & develop • Long-living feature branches
  11. GitFlow downsides • Isolated feature branches • Horribly painful merges • Horribly risky deploys
  12. Trunk-based Development

  13. Trunk-based Development • Single main branch (master-only) • Discouragement of long-living feature branches
  15. Trunk-based Development gives us: • Fewer merge conflicts • Less risky deploys • Faster iteration speed • Fewer dedicated non-prod environments
  16. More frequent merges directly to master… that’s scary.

  17. What we’re doing: • Automated test suite • Feature gating

  18. Automated Test Suite

  19. Automated Test Suite 1. Core Test Suite 2. Comprehensive Test Suite 3. Continuous Production Smoke Test
  20. Core Test Execution • Runs on every merge cycle of our www codebase • Results decide whether we auto-deploy the latest merge • Optimized for the best coverage-over-speed investment ratio • Consists of hundreds of functional test cases • Runs on a 20-node test cluster for speedy execution
  21. Comprehensive Test Execution • Ultra-complete test coverage -- covers all user usage paths • Runs on multiple devices and browsers • Runs periodically, outside the merge cycle
  22. Continuous Prod Smoke Test • Runs continuously against the prod environment • Simulates real users • A saner, more useful, and more accurate form of continuous monitoring than regular uptime alerting
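
As a rough illustration of what such a smoke test can look like, here is a minimal Python sketch; the endpoints and the alerting hook are hypothetical, not Sale Stock’s actual harness:

    # A minimal sketch of a continuous production smoke test: walk a real
    # user journey against prod in a loop and alert on any failure.
    import time

    import requests

    BASE_URL = "https://www.salestock.id"

    def alert(message):
        # Stub: in practice this would page the on-call (PagerDuty, Slack, etc.).
        print("ALERT:", message)

    def simulate_user_journey():
        """Browse the homepage, a category page, and a search, like a real user."""
        session = requests.Session()
        for path in ("/", "/c/dress", "/search?q=dress"):  # hypothetical paths
            resp = session.get(BASE_URL + path, timeout=10)
            resp.raise_for_status()  # any HTTP error fails the journey

    while True:  # continuous, unlike one-shot uptime checks
        try:
            simulate_user_journey()
        except Exception as exc:
            alert("prod smoke test failed: {}".format(exc))
        time.sleep(60)
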
  23. Feature Gating

  24. Feature Gating • Allows code paths to be activated for a subset of users, or for employees only
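
As a rough illustration of the gating idea, a minimal sketch (assuming a user object with a stable numeric id and an is_employee flag; not the actual implementation):

    # A minimal feature-gating sketch: gate a code path to employees only,
    # or to a stable percentage bucket of users.
    import hashlib

    def is_feature_enabled(feature, user, rollout_percent=0, employee_only=False):
        """Decide whether a gated code path is active for this user."""
        if employee_only:
            return user.is_employee
        # Hash feature + user id so each user lands in a stable bucket per
        # feature, independent of other features' rollouts.
        digest = hashlib.md5("{}:{}".format(feature, user.id).encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_percent

    # Usage: ship the code dark, then widen the gate gradually.
    # if is_feature_enabled("new-checkout", user, rollout_percent=5): ...
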
  26. Codebase Scalability

  27. SOA / Microservice Architecture • One domain → one service • Clear engineer / team ownership • Downside: ◦ An increasing number of features and services makes for complex development & deployment
  28. Problems: • No standards around development of a many-services cluster • No standards around production deployment of a many-services cluster
  31. Development Requirements • Download the software needed for each service / stack type • Run each service (preferably in topological order) • Run dependency processes (MySQL / Redis / Kafka) • Connect the services & databases properly (through env vars)
  32. Deployment Requirements • Create & run containers for each service • Run each service (preferably in topological order) • Scale the services properly • Connect the services & databases properly (through env vars)
  33. ClusterGraph

  34. ClusterGraph A data structure describing how a cluster is formed from different services.
  36. How do we build this?

  37. ClusterGraph • Monorepo • One microservice per top-level folder • In each top-level folder, define a service.yaml, which contains: ◦ name ◦ stack ◦ dependency list (names of other services) ◦ database dependencies ◦ etc. • The service.yamls of all the services are then used to statically build the cluster graph
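
A sketch of what this static build step could look like, assuming the monorepo layout and the service.yaml fields named above (the exact schema is an assumption):

    # Build a ClusterGraph from service.yaml files, one per top-level folder,
    # then order services so each starts after its dependencies.
    from pathlib import Path

    import yaml  # PyYAML

    def load_cluster_graph(repo_root):
        """Map service name -> parsed service.yaml for each top-level folder."""
        graph = {}
        for spec in Path(repo_root).glob("*/service.yaml"):
            service = yaml.safe_load(spec.read_text())
            graph[service["name"]] = service
        return graph

    def topological_order(graph):
        """Order services dependency-first. Assumes the graph is acyclic."""
        ordered, seen = [], set()

        def visit(name):
            if name in seen:
                return
            seen.add(name)
            for dep in graph[name].get("dependencies", []):
                visit(dep)
            ordered.append(name)

        for name in graph:
            visit(name)
        return ordered
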
  42. ClusterGraph • This also means the cluster graph is versionable per git commit • Can technically do atomic graph refactoring in a single commit
  43. ssi

  44. ssi • Internal command-line program • Able to construct the cluster graph from our source code • Executes services locally for development • Instantiates databases
  46. komandan • Production-stage executor of ClusterGraph • Uses Kubernetes under the hood

  47. Kubernetes

  49. komandan • Stores multiple cluster graph versions • Can deploy a complete cluster in ~15 seconds • Can revert in the same amount of time • Handles service discovery through env var injection
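
For the service-discovery part, a sketch following the standard Kubernetes convention of <NAME>_SERVICE_HOST/_PORT environment variables (whether komandan injects exactly these names is an assumption):

    # Resolve a dependency's address from env vars injected at deploy time.
    import os

    def service_url(name):
        """Look up the host/port Kubernetes (or komandan) injected for a service."""
        prefix = name.upper().replace("-", "_")
        host = os.environ[prefix + "_SERVICE_HOST"]
        port = os.environ[prefix + "_SERVICE_PORT"]
        return "http://{}:{}".format(host, port)

    # e.g. service_url("order-service") -> "http://10.3.0.12:8080"
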
  50. komandan • Since it’s so cheap (and fast) to create new clusters, it’s possible to do: ◦ Transient clusters for test suite executions ◦ Transient clusters for open PRs
  51. Why is this important? • Development of complex clusters is more productive • Deployment of complex clusters is simpler and more robust • Allows us to build more features, quicker
  52. Thomas Diong

  53. Scaling Sale Stock with Products

  54. Machine Learning & AI Products • NLP • Recommender System

  55. NLP

  56. Customer Behavior • Customers are mostly outside the cities • Don’t own a desktop or laptop • First computer is a low-end Android with a terrible internet connection • Buying behavior is still rooted in offline shops; risk-averse • A purchase is understood through a conversation
  57. AI Needs to be Able To • Speak Indonesian • Sound natural • Understand the eCommerce context
  58. A Typical Customer Chat

  59. AI Needs to be Able To • Speak Indonesian • Sound natural • Understand the eCommerce context • Speak Alay
  60. Process • Preprocessing: tokenize, vectorize • Learning: deep learning (TensorFlow) • Output: word-by-word generation until end of line
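
A minimal sketch of that generation loop; the model interface (predict_next_word) and the end-of-line token are assumptions, since the talk only says the model is built with deep learning on TensorFlow:

    # Tokenize the incoming chat and emit a reply one word at a time
    # until the model produces an end-of-line token.
    END_OF_LINE = "<eol>"

    def tokenize(text):
        # Real preprocessing would also normalize slang ("alay") spellings.
        return text.lower().split()

    def generate_reply(model, customer_message, max_len=30):
        context = tokenize(customer_message)
        reply = []
        while len(reply) < max_len:
            word = model.predict_next_word(context + reply)  # assumed interface
            if word == END_OF_LINE:
                break
            reply.append(word)
        return " ".join(reply)
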
  61. A Typical Customer Chat

  62. Current Limitations

  63. Recommender System

  64. Personalization • Over 20k SKUs and increasing • Different types of items: Muslim wear, dresses, skirts, tops, bottoms, bags, shoes, accessories, etc. • Different people have very different tastes • Customers complain about not finding things they like
  65. Recommender System • Many ways to do it • Costly and time-consuming to experiment and iterate with different methods
  66. Recommender System Ideals • Add new models from new data points • Improve existing models • Continuously A/B test
  67. Modular Design score = ∑ W_i · f_i = W1 (item-to-item similarity score) + W2 (interest in item based on views) + W3 (interest in item based on historical transactions) + …
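
In code, the same idea, with module names and weights purely illustrative:

    # Weighted-sum scoring: each module f_i is an independent scorer over a
    # (user, item) pair and W_i is its weight.
    def recommend_score(user, item, modules, weights):
        """score(user, item) = sum_i W_i * f_i(user, item)"""
        return sum(w * f(user, item) for f, w in zip(modules, weights))

    # Swapping entries in `modules`/`weights` is what makes modules easy to
    # add, improve, or A/B test independently, e.g.:
    # modules = [item_similarity, view_interest, transaction_interest]
    # weights = [0.5, 0.3, 0.2]
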
  68. Advantages 1) Each individual module can be used to build other interesting projects outside the Recommender System - “Produk Menarik Lain” (“Other Interesting Products”) - Marketing push 2) Modules can be improved or added independently of each other 3) Aggressive, continuous A/B testing without having to rebuild
  69. Next On Recommender • Online learning

  70. SALESTOCK DATA INFRASTRUCTURE

  71. 1. FILE STORAGE: HDFS - Scalable distributed file system for fast read/write and fault tolerance. - Data locality for faster access.
  72. 2. DATA MANAGEMENT & ETL: Hive - Defines tables, partitions, bucketing, and the file formats used for specific requirements. - Translates SQL into MapReduce jobs. - Can write UDFs for custom requirements.
  73. 3. RANDOM READ / WRITE: HBase - Consistent random read/write on top of HDFS. - Flexibility in key distribution and column design. - Apache Phoenix for a SQL skin.
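
For illustration, random read/write from Python via the happybase client (one common option; the talk does not name a client, and the table and row-key design here are illustrative):

    # Random read/write against HBase; the row key design controls key
    # distribution, as noted above.
    import happybase

    connection = happybase.Connection("hbase-host")  # hypothetical host
    table = connection.table("user_events")

    # Write a cell under a composite row key.
    table.put(b"user123|20160329", {b"event:type": b"view", b"event:sku": b"SKU-001"})

    # Consistent random read by row key.
    row = table.row(b"user123|20160329")
    print(row)
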
  74. 4. SQL QUERY & ETL: Impala - Translates SQL into MPP jobs. - Uses the Hive Metastore & UDFs. - Does not use MapReduce to process queries. - Can read files from HDFS/HBase/S3.
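
For illustration, an Impala query from Python via the impyla client (host and table names are illustrative):

    # Run an interactive SQL query against Impala.
    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)  # 21050 is Impala's default port
    cursor = conn.cursor()
    cursor.execute("SELECT sku, COUNT(*) AS views FROM page_views GROUP BY sku LIMIT 10")
    for sku, views in cursor.fetchall():
        print(sku, views)
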
  75. 5. COMPLEX ETL + MACHINE LEARNING: Spark - In-memory processing; faster and easier to express parallel processing compared to MapReduce. - Can read/write from multiple sources: HDFS/HBase/S3.
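
For illustration, a small PySpark ETL step in that spirit (paths and columns are illustrative):

    # Read from HDFS, transform in memory, write out to S3.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("order-etl").getOrCreate()

    orders = spark.read.parquet("hdfs:///warehouse/orders")     # read from HDFS
    daily = orders.groupBy("order_date").count()                # in-memory transform
    daily.write.mode("overwrite").parquet("s3a://bucket/daily_orders")  # write to S3
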
  76. 6. FRONT END PORTAL: Hue - Since Impala is used a lot by non-developers, we need a good GUI to help them use it easily. - It also has a decent HDFS/HBase explorer. - Can query RDBMSs if needed.
  77. 7. JOB SCHEDULING: Azkaban - Good DAG visualization. - Simple job configuration. - Easier to inspect logs when an exception happens.
  78. 8. ARCHIVING: AWS S3
  79. 9. DATA INGESTION: Kafka + Spark Streaming, MySQL + Sqoop - Imports MySQL tables into Hive tables - Real-time data streams
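
A sketch of the streaming half (topic, broker, and path names are illustrative; the MySQL half is driven by Sqoop from the command line):

    # Consume events from Kafka with the Spark Streaming direct API
    # (era-appropriate for 2016) and land them raw on HDFS.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="event-ingestion")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["user-events"], {"metadata.broker.list": "kafka:9092"})
    # Each record is a (key, value) pair; keep the value and land it raw.
    stream.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///raw/events")

    ssc.start()
    ssc.awaitTermination()
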
  81. We’re Hiring!

  82. We’re Hiring! Some of our team members hail from:

  83. We’re Hiring! Positions: • DevOps Engineer • Front-end Engineer • Back-end Engineer • Quality Assurance Engineer • Data Scientist • Data Infrastructure Engineer • Business Intelligence Analyst
  84. We’re Hiring! • Competitive salary • Company shares • Option to work remotely • Relocation support • Skills development support • Health benefits covering family • Lunch and meals provided • Flexible working hours • Career development • Periodic team gatherings • Regular company hackathons
  85. Reach us at: joinus@salestock.id

  86. Thanks for your time!