Scalability @ Sale Stock

At Sale Stock, we are trying to solve the problem of providing easy access to great-quality clothing at an affordable price for everybody. During the past year, we experienced explosive growth across a wide array of metrics: user base, revenue, user traffic, and team size, to name a few. As an engineering team, this meant we had to scale quickly on all dimensions. In this talk, we share a technical deep dive into the strategies we employ to meet these scalability challenges: backend infrastructure, developer tooling, platform abstractions, deployment workflow, monitoring, data infrastructure, and more, all of which allow our engineers to move quickly and efficiently.

About the speakers:

Garindra Prahandono
Garindra is the Chief Technology Officer at Sale Stock Indonesia. Previously, he worked at Sony America on core products such as the PlayStation 4 and PlayStation Now. His work, used by tens of millions of people around the world, spans server-side infrastructure, core user-interface abstractions, and internal test-automation infrastructure.

Thomas Diong
Thomas is the Chief Data Officer at Sale Stock Indonesia. He was previously at Yahoo!, where he handled global tech initiatives working on Yahoo! Messenger, Yahoo! Application Platform, Yahoo! Games, and the like. He then went on to Apple, where he worked on business-process improvement and streamlining with automation, and subsequently Spuul (a movie-streaming company), where he led growth and data efforts before moving to Veritrans.

Wilson Lauw
Wilson is a Data Engineer at Sale Stock Indonesia, working on data infrastructure for analytics and machine learning. Previously, he worked at Healint, a big-data analytics company in the healthcare industry, as a data scientist, working on both data analysis and data infrastructure.

Sale Stock Engineering

March 29, 2016
Transcript

  1. 4.

    Who are we? • Tech startup that sells mid-to-low-end women’s fashion • Engineering team started ~1 year ago • Launched our in-house website ~8 months ago
  2. 8.
  3. 9.
  4. 14.
  5. 15.

    Trunk-based Development gives us: • Fewer merge conflicts • Less risky deploys • Faster iteration speed • Fewer dedicated non-prod environments
  6. 19.

    Automated Test Suite 1. Core Test Suite 2. Comprehensive Test Suite 3. Continuous Production Smoke Test
  7. 20.

    Core Test Execution • Runs on every merge cycle of our www codebase • Results decide whether we auto-deploy the latest merge • Optimized for the best coverage-over-speed investment ratio • Consists of hundreds of functional test cases • Runs on a 20-node test cluster for speedy execution
  8. 21.

    Comprehensive Test Execution • Ultra-complete test coverage: covers all user usage paths • Runs on multiple devices and browsers • Runs periodically, outside the merge cycle
  9. 22.

    Continuous Prod Smoke Test • Runs continuously against the prod environment • Simulates real users • A saner, more useful, and more accurate form of continuous monitoring than regular uptime alerting
  10. 24.

    Feature Gating • Allows code paths to be activated for a subset of users, or for employees only
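A minimal sketch of the kind of gating the slide describes: always on for employees, else on for a deterministic percentage bucket of users. The function name, hashing scheme, and parameters are illustrative, not Sale Stock's actual implementation.

```python
import hashlib

def is_feature_enabled(feature, user_id, rollout_pct=0, employee_ids=frozenset()):
    """Gate a code path: on for employees, else on for a stable
    percentage bucket of users (0-100)."""
    if user_id in employee_ids:
        return True
    # Hash feature+user together so each feature rolls out to a
    # different subset of users, but the same user gets a stable answer.
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct
```

Because the bucket is derived from a hash rather than stored state, no database lookup is needed to evaluate a gate.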
  11. 25.
  12. 27.

    SOA / Microservice Architecture • One domain → one service • Clear engineer / team ownership • Downside: ◦ An increasing number of features and services makes for complex development & deployment
  13. 28.

    Problems: • No standards around development of a many-service cluster • No standards around production deployment of a many-service cluster
  14. 29.
  15. 30.
  16. 31.

    Development Requirements • Download the software needed for each service / stack type • Run each service (preferably in topological order) • Run dependency processes (MySQL / Redis / Kafka) • Connect the services & databases properly (through env vars)
  17. 32.

    Deployment Requirements • Create & run containers for each service • Run each service (preferably in topological order) • Scale the services properly • Connect the services & databases properly (through env vars)
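"Run each service in topological order" reduces to a topological sort of the service dependency graph. A sketch using Kahn's algorithm (the graph shape and service names are hypothetical):

```python
from collections import deque

def startup_order(deps):
    """Topologically sort services so each starts after its dependencies.
    `deps` maps a service name to the list of services it depends on."""
    indegree = {svc: len(needs) for svc, needs in deps.items()}
    dependents = {svc: [] for svc in deps}
    for svc, needs in deps.items():
        for dep in needs:
            dependents[dep].append(svc)
    # Start with services that depend on nothing.
    ready = deque(sorted(svc for svc, n in indegree.items() if n == 0))
    order = []
    while ready:
        svc = ready.popleft()
        order.append(svc)
        for nxt in dependents[svc]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

For example, `startup_order({"www": ["api"], "api": ["mysql"], "mysql": []})` starts `mysql` before `api`, and `api` before `www`.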
  18. 35.
  19. 37.

    ClusterGraph • Monorepo • One microservice per top-level folder • In each top-level folder, define a service.yaml, which contains: ◦ name ◦ stack ◦ dependency list (the names of other services) ◦ database dependencies ◦ etc. • The service.yamls of all the services are then used to statically build the cluster graph
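A hypothetical service.yaml following the fields the slide lists (the actual schema and field names may differ):

```yaml
# order/service.yaml -- illustrative only
name: order
stack: nodejs
dependencies:        # names of other services in the monorepo
  - user
  - inventory
databases:
  - mysql
```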
  20. 38.
  21. 39.
  22. 40.
  23. 41.
  24. 42.

    ClusterGraph • This also means the cluster graph is versionable per git commit • Can technically do atomic graph refactoring in a single commit
  25. 43.

    ssi

  26. 44.

    ssi • Internal command-line program • Able to construct the cluster graph out of our source code • Executes services locally for development • Instantiates databases
  27. 45.
  28. 48.
  29. 49.

    komandan • Stores multiple cluster graph versions • Can deploy a complete cluster in ~15 seconds • Reverts in the same amount of time • Handles service discovery through env var injection
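Service discovery through env var injection can be sketched as follows: for each dependency of a service, inject host/port variables derived from the dependency's name. The naming convention here is illustrative, not necessarily komandan's:

```python
def discovery_env(service, deps, endpoints):
    """Build the env vars injected into `service` so it can reach its
    dependencies, e.g. USER_API_SERVICE_HOST / USER_API_SERVICE_PORT.
    `deps` maps service -> dependency names; `endpoints` maps
    service -> (host, port) as assigned by the scheduler."""
    env = {}
    for dep in deps[service]:
        host, port = endpoints[dep]
        prefix = dep.upper().replace("-", "_") + "_SERVICE"
        env[prefix + "_HOST"] = host
        env[prefix + "_PORT"] = str(port)
    return env
```

Because the variables are injected at deploy time, the same container image runs unchanged in local, transient, and production clusters.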
  30. 50.

    komandan • Since it’s so cheap (and fast) to create new clusters, it’s possible to do: ◦ Transient clusters for test suite executions ◦ Transient clusters for open PRs
  31. 51.

    Why is this important? • Development of complex clusters is more productive • Deployment of complex clusters is simpler and more robust • Allows us to build more features, more quickly
  32. 55.

    NLP

  33. 56.

    Customer Behavior • Customers are mostly outside of cities • Don’t own a desktop or laptop • Their first computer is a low-end Android phone, on a terrible internet connection • Buying behavior still happens in offline shops, and is risk-averse • Understanding of a purchase comes through a conversation
  34. 57.

    AI Needs to be Able To • Speak the Indonesian language • Sound natural • Understand the eCommerce context
  35. 59.

    AI Needs to be Able To • Speak the Indonesian language • Sound natural • Understand the eCommerce context • Speak Alay (informal Indonesian internet slang)
  36. 60.

    Process • Preprocessing: tokenize, vectorize • Learning: deep learning (TensorFlow) • Output: word-by-word generation until end of line
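The preprocessing steps (tokenize, vectorize) can be sketched like this; the vocabulary scheme, special tokens, and example sentences are illustrative, not the actual pipeline:

```python
def tokenize(text):
    """Lowercase whitespace tokenization; a real pipeline would also
    normalize informal "alay" spellings before tokenizing."""
    return text.lower().split()

def build_vocab(corpus):
    """Assign an integer id to every token seen in the corpus,
    reserving ids for unknown words and the end-of-line marker."""
    vocab = {"<unk>": 0, "<eol>": 1}
    for sentence in corpus:
        for tok in tokenize(sentence):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Map tokens to ids, appending the end-of-line marker that
    word-by-word generation stops on."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokenize(text)] + [vocab["<eol>"]]
```

The resulting id sequences are what a model (e.g. in TensorFlow) would consume for training and emit during generation.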
  37. 64.

    Personalization • Over 20k SKUs, and increasing • Different types of items: Muslim wear, dresses, skirts, tops, bottoms, bags, shoes, accessories, etc. • Different people have very different tastes • Customers complain about not finding things they like
  38. 65.

    Recommender System • Many ways to do it • Costly and time-consuming to experiment and iterate with different methods
  39. 66.

    Recommender System Ideals • Add new models from new data points • Improve existing models • Continuously A/B test
  40. 67.

    Modular Design Score = ∑ Wi × Modulei = W1(item-to-item similarity score) + W2(interest in item based on views) + W3(interest in item based on historical transactions) + …
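The weighted sum above can be sketched directly; the module functions and weights below are toy stand-ins for the real signals:

```python
def recommend_score(user, item, modules, weights):
    """Weighted sum of independent scoring modules:
    score(user, item) = sum_i W_i * module_i(user, item)."""
    return sum(w * m(user, item) for w, m in zip(weights, modules))

# Toy modules -- illustrative stand-ins for the real signals:
item_similarity = lambda user, item: 0.5   # item-to-item similarity score
view_interest = lambda user, item: 1.0     # interest based on views
```

Because each module is an independent function, one can be improved, added, or reweighted (e.g. for an A/B test) without touching the others.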
  41. 68.

    Advantages 1) Each individual module can be used to build other interesting projects outside of the Recommender System - “Produk Menarik Lain” (“Other Interesting Products”) - Marketing push 2) Modules can be improved or added independently of each other 3) Aggressive, continuous A/B testing without having to rebuild
  42. 71.

    1. FILE STORAGE: HDFS - Scalable, fault-tolerant distributed file system for fast reads/writes. - Data locality for faster access.
  43. 72.

    2. DATA MANAGEMENT & ETL: Hive - Defines tables, partitions, bucketing, and the file formats used for specific requirements. - Translates SQL into MapReduce jobs. - Custom UDFs can be written for custom requirements.
  44. 73.

    3. RANDOM READ / WRITE: HBase - Consistent random reads/writes on top of HDFS. - Flexibility in key distribution and column design. - Apache Phoenix as a SQL skin.
  45. 74.

    4. SQL QUERY & ETL: Impala - Translates SQL into MPP jobs. - Uses the Hive metastore & UDFs. - Does not use MapReduce to process queries. - Can read files from HDFS/HBase/S3.
  46. 75.

    5. COMPLEX ETL + MACHINE LEARNING: Spark - In-memory processing; faster, and easier to express parallel processing in, than MapReduce. - Can read/write from multiple sources: HDFS/HBase/S3.
  47. 76.

    6. FRONT END PORTAL: Hue - Since Impala is used a lot by non-developers, we need a good GUI to help them use it easily. - Besides that, it also has a decent HDFS/HBase explorer. - Can query RDBMSs if needed.
  48. 77.

    7. JOB SCHEDULING: Azkaban - Good DAG visualization. - Simple job configuration. - Easy to inspect logs when an exception happens.
  49. 78.

    8. ARCHIVING: AWS S3
  50. 79.

    9. DATA INGESTION: Kafka + Spark Streaming - Real-time data streams. MySQL + Sqoop - Imports MySQL tables into Hive tables.
  51. 80.
  52. 83.

    We’re Hiring! Positions: DevOps Engineer • Front-end Engineer • Back-end Engineer • Quality Assurance Engineer • Data Scientist • Data Infrastructure Engineer • Business Intelligence Analyst
  53. 84.

    We’re Hiring! • Competitive salary • Company shares • Option to work remotely • Relocation support • Skills development support • Health benefits covering family • Lunch and meals provided • Flexible working hours • Career development • Periodic team gatherings • Regular company hackathons