Spark Summit East 2015: Finding Shoe Stores in more than 100k Merchants: Using Apache Spark to group all things!

ABSTRACT:
Shopify is the world’s fastest growing e-commerce platform with more than 120k active stores and surpassing $5,000,000,000 in Gross Merchandise Volume. At Shopify, we help emerging small businesses get off the ground and grow into successful companies. Individual businesses can sell a variety of products and services, both online and offline in a single store. The data team at Shopify provides business intelligence (BI) reporting as well as data-driven insights using machine learning and statistical analysis that go beyond BI reports to help merchants be more successful. I joined Shopify’s data team a year ago and what started as a personal project for me, finding stores that sell shoes, turned into a bigger problem: What types of products are sold by our stores? The question became critical when we needed to select eligible stores for a business partnership and our data warehouse kept timing out running data mining queries. Around the same time, we started using Spark and moved our operational data to HDFS. We were able to use the power of distributed computing to mine millions of records: Spark was able to process 67 million records while I went to get a fresh cup of coffee. We were able to successfully categorize our stores based on their products and get the list of partnership-eligible stores. Later on, using Spark’s machine learning library, together with classification data obtained through Amazon’s Mechanical Turk, we were able to form clusters of similar stores and provide industry specific success metrics to them. The ease of use and accessibility of Spark has made me an avid fan and I’d like to share this data journey and the experience that exceeded my expectations.

BIO:
Solmaz is a data analyst at Shopify, providing data-driven insights to the world’s fastest growing e-commerce platform. With multiple graduate degrees in machine learning and computer science, she has employed her skills in cancer research, finance and e-commerce for the past 8 years. At Shopify she used Apache Spark to categorize more than 100k stores based on their products and provided industry specific success metrics to stores. She has a passion for building high quality data warehouses that ensure accuracy and agility of analysis. Prior to Shopify, she worked at Morgan Stanley as an analyst and a developer.

Solmaz Shahalizadeh

March 18, 2015

Transcript

  1. Finding Shoe Stores in >100k Merchants: Using Spark to Group All Things!
     Solmaz Shahalizadeh (@solmaz_sh), Shopify
  2. About me
     Currently:
     •  Finance Data @Shopify
     Previous Lives:
     •  Playing with data in Finance/Bioinformatics/Cancer Research
  3. Will Talk about:
     •  Finding a needle in a haystack
     •  Trying all the wrong tools for getting insights out of data and course correction on the way
     •  Having fun during the process
  4. Where are the shoes?
     •  Started ~1 year ago @Shopify
     •  Wanted desperately to buy shoes from our merchants
  5. Problems:
     •  No distributed data
     •  Not enough processing power
     •  Not very smart filters
     •  Not many examples of actual stores selling shoes
  6. Shopify + Pinterest
     “Can you create a whitelist of eligible stores for a collaboration with Pinterest, stores not selling ammunition, adult material, cigarettes, etc.? It keeps timing out for me, but here is the SQL query that needs to be run.”
  7. •  SELECT shops.domain FROM customers join shopify.products products on customers.shop_id = products.shop_id
        where ( description not ilike '%ammunition%' and description not ilike '%cigarette%' and description not ilike '%bong%'
        and description not ilike '%ecstasy%' and description not ilike '%heroin%' and description not ilike '%opium%'
        and description not ilike '%cocaine%' and description not ilike '%amphetamine%' and description not ilike '%mdma%'
        and description not ilike '%ghb%' and description not ilike '%ketamine%' and description not ilike '%pcp%'
        and description not ilike '%LSD%' and description not ilike '%steroid%' and description not ilike '%mescaline%'
        and description not ilike '%vaporizer%' and description not ilike '%hashish%' and description not ilike '%nicotine%'
        and description not ilike '%viagra%' and description not ilike '%cialis%' and description not ilike '%THC%'
        and description not
  8. Distribute Code and Data
     •  Get data in distributed file system
     •  Use better-than-SQL tools for analysis
  9. Something was still missing
     We had filtered around 30k merchants: too much!!
     “The Buckshots, the world’s first ammunition for your thighs”
  10. Mechanical Turk
      Amazon’s Mechanical Turk (MTurk) is a crowdsourcing marketplace that enables individuals or businesses to coordinate the use of human intelligence to perform tasks that computers are currently unable to do.
      •  classification
      •  sentiment analysis
      •  data cleaning
  11. Let’s try this!
      •  Show the images of the 4 top-selling products of each store to Turkers
      •  Allow for selection of multiple categories
  12. Finding “Similar” stores
      Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some way or another) to each other than to those in other groups (clusters).
  13. Lessons Learned
      •  Need data: use Mechanical Turk
      •  Need processing power: Spark is easy to get started with
      •  Happy Hacking!
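
Slides 7 to 9 describe whitelisting stores by excluding product descriptions that match a keyword blacklist, and the false positives that substring matching produced. A minimal plain-Python sketch of that idea follows. It is not the code used at Shopify: the sample records, the shortened `BLACKLIST`, and the names `is_flagged` and `whitelist_shops` are hypothetical, and the choice to drop a store when any one of its products is flagged is an assumption. In Spark the same grouping would run over a distributed dataset.

```python
from collections import defaultdict

# Hypothetical, shortened blacklist; the query on slide 7 lists many more terms.
BLACKLIST = ["ammunition", "cigarette", "nicotine"]


def is_flagged(description, blacklist=BLACKLIST):
    """Case-insensitive substring match, similar to `ilike '%term%'` in SQL."""
    text = description.lower()
    return any(term in text for term in blacklist)


def whitelist_shops(products, blacklist=BLACKLIST):
    """Return the sorted shop domains that have no flagged product description.

    `products` is an iterable of (shop_domain, description) pairs.
    Assumption: one flagged product is enough to exclude the whole shop.
    """
    flagged = defaultdict(bool)
    for domain, description in products:
        flagged[domain] = flagged[domain] or is_flagged(description, blacklist)
    return sorted(domain for domain, bad in flagged.items() if not bad)


# Made-up sample data; the second description is the quote from slide 9.
products = [
    ("shoes.example", "Leather running shoes"),
    ("shorts.example", "The Buckshots, the world's first ammunition for your thighs"),
    ("smoke.example", "Menthol cigarette pack"),
    ("shoes.example", "Canvas sneakers"),
]

print(whitelist_shops(products))  # ['shoes.example']
```

On this sample the apparel store is dropped along with the cigarette seller, which is the kind of false positive that a keyword filter cannot avoid and that human labels from Mechanical Turk can catch.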
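
Slide 12 defines clustering, and the abstract mentions using Spark's machine learning library on the Mechanical Turk labels to group similar stores. The deck does not say which algorithm or features were used, so the following is only a sketch under assumptions: each store is represented by a hypothetical vector of Turker votes per category, and plain k-means (Lloyd's algorithm) written in standard-library Python stands in for whatever the library provided. The store names, vote counts, and initial centers are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center update until stable."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each center to the mean of its members.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = mean(members)
    return assignment, centers


# Hypothetical Turker vote counts per store: (shoes, jewelry, electronics).
stores = {
    "store_a": (3, 0, 0),
    "store_b": (2, 1, 0),
    "store_c": (0, 3, 0),
    "store_d": (0, 2, 1),
}
names = sorted(stores)
points = [stores[name] for name in names]

# Fixed initial centers keep this sketch deterministic.
labels, centers = kmeans(points, initial_centers=[points[0], points[2]])

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)  # {0: ['store_a', 'store_b'], 1: ['store_c', 'store_d']}
```

Allowing Turkers to select multiple categories, as on slide 11, fits this representation naturally, since a store's vector can carry votes in more than one dimension.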