

mercari
October 04, 2018

MTC2018 - Leveraging billions of data items and machine learning to map the future

Speaker: Takuma Yamaguchi

With over one billion total listings on the Mercari app to date and with the remarkable advancements observed in recent years in the fields of machine learning and deep learning, there is increasingly greater focus on, and expectations for, data utility. There are a number of projects underway which utilize data on items and user behavior, and several features, including image recognition for listings and automatic detection of fraudulent listings, are already in operation on the service. This session will introduce those algorithms and execution platforms and will also introduce a number of features currently in development.



Transcript

  1. Leveraging billions of data items and machine learning to map

    the future Takuma Yamaguchi Software Engineer (Machine Learning)
  2. Leveraging billion-scale data to map the future A dataset that

    contains billions of data items is huge (Logs ≠ Data)
  3. It is already possible to create various datasets, including: •

    Item image data (with item information) • Item views / search → history of items purchased/not purchased • etc... However, a large dataset is not necessarily a good dataset, and billions of items of data are not currently being used. Large Datasets We are eager to utilize large amounts of data in the future (Either we use it or dispose of it. Otherwise it’s an unnecessary cost) Open datasets ・ ImageNet (ILSVRC 2012): 1.2 million images / 1000 classes ・ Open Images V4: 9 million images / 20,000 classes ・ YouTube-8M Dataset: 8 million videos / 4,000 classes ・ Amazon Product Reviews: 80 million reviews Mercari’s image-related projects ・Classification: 8.5 million images / 14,000 classes ・Image recognition for listing: 50 million images As a matter of fact, ImageNet is relatively small by today’s standards; it “only” contains a million images that cover the specific domain of object classification https://arxiv.org/abs/1807.05520, Facebook AI Research
  4. Leveraging billion-scale data and machine learning to map the future

    Why are huge datasets needed? What are we trying to achieve?
  5. 2017 2016 2018 Mercari’s Machine Learning History Item image recognition

    (Kando Listing) Kaggle: Mercari Price Suggestion Challenge Detection of listings that violate terms of use v1 Detection of age-restricted items Item weight suggestion Detection of transactions that violate terms of use v1 Classification of inquiries Identification of reported items v2 Detection of listings that violate terms of use v2 Identification of reported items v1 Kando Listing v2 Detection of transactions that violate terms of use v2 Kando Listing (US) Price suggestion ・ Simple transactions ・ Safe transactions There are other projects not listed here that were not implemented due to such issues as precision In 2016, we started to see the limitations of handling various operations completely manually. We’re not utilizing machine learning because it’s a trend, we’re utilizing it simply because it’s necessary.
  6. 2017 2016 2018 Mercari’s Machine Learning History Item image recognition

    (Kando Listing) Kaggle: Mercari Price Suggestion Challenge Detection of listings that violate terms of use v1 Detection of age-restricted items Item weight suggestion Detection of transactions that violate terms of use v1 Classification of inquiries Identification of reported items v2 Detection of listings that violate terms of use v2 Identification of reported items v1 Kando Listing v2 Detection of transactions that violate terms of use v2 Kando Listing (US) Price suggestion There are other projects not listed here that were not implemented due to such issues as precision In 2016, we started to see the limitations of handling various operations completely manually. We’re not utilizing machine learning because it’s a trend, we’re utilizing it simply because it’s necessary. ・ Simple transactions ・ Safe transactions
  7. Detection of Listings/Transactions that Violate Our Terms of Use Responding

    to terms of use violations Responding to reports Verify reports from users regarding violations of our terms of use and respond by removing problematic listings and suspending offending users Monitoring Extract listings and users that may be in violation of our terms of use, and remove potentially problematic listings and suspend potentially offending users
  8. Detection of Listings/Transactions that Violate Our Terms of Use Responding

    to terms of use violations Responding to reports Verify reports from users regarding violations of our terms of use and respond by removing problematic listings and suspending offending users Monitoring Extract listings and users that may be in violation of our terms of use, and remove potentially problematic listings and suspend potentially offending users Working towards a service where automatic detection would make reports unnecessary Working towards a scalable service through automatic detection Moving forward with automation is not about reducing costs; it is about delivering a speedy and scalable service. If our service grew tenfold, it would be unrealistic to increase our human resources by the same.
  9. Utilizing Multimodal Models Item name Description (Text data) Image Category

    Price (Categorical and numerical data) Make decision A machine learning model using multiple types of data as input Aiming to speedily generate a baseline classification This is already released in the US and the precision is currently being verified in Japan (The structure of the model differs based on differences between the English and Japanese languages)
  10. This is already released in the US and the precision

    is currently being verified in Japan (The structure of the model differs based on differences between the English and Japanese languages) Utilizing Multimodal Models Item name Description (Text data) Image Category Price (Categorical and numerical data) Make decision A machine learning model using multiple types of data as input Aiming to speedily generate a baseline classification • Even if you make it so that multiple types of data can be input, it doesn’t necessarily go as expected and a lot of trial and error is required. • There is little information on the challenges of and unsuccessful attempts at multimodal models and ways to resolve those. • There is plenty of room for improvements and we are continuing to work on this.
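The multimodal decision described on slides 9-10 can be sketched as a single model that concatenates per-modality features before one shared classifier. This is only an illustrative sketch: the dimensions, the placeholder weights, and the log-transform of the price are assumptions, not Mercari's actual model (which the talk notes required much trial and error).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the production model).
IMG_DIM, TXT_DIM, N_CATEGORIES, N_CLASSES = 8, 6, 4, 2

# Placeholder weights; in practice all modalities are trained jointly.
W = rng.normal(size=(N_CLASSES, IMG_DIM + TXT_DIM + N_CATEGORIES + 1))
b = np.zeros(N_CLASSES)

def decide(image_feat, text_feat, category_id, price):
    """Concatenate image, text, categorical, and numerical features,
    then classify with a single linear layer (argmax over logits)."""
    category_onehot = np.eye(N_CATEGORIES)[category_id]
    # log1p keeps the price on a scale comparable to the other features.
    x = np.concatenate([image_feat, text_feat, category_onehot,
                        [np.log1p(price)]])
    logits = W @ x + b
    return int(np.argmax(logits))
```

The point of the sketch is the fusion step: once every modality is reduced to a vector, the "make decision" part is an ordinary classifier over their concatenation.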
  11. 2017 2016 2018 Mercari’s Machine Learning History Item image recognition

    (Kando Listing) Kaggle: Mercari Price Suggestion Challenge Detection of listings that violate terms of use v1 Detection of age-restricted items Item weight suggestion Detection of transactions that violate terms of use v1 Classification of inquiries Identification of reported items v2 Detection of listings that violate terms of use v2 Identification of reported items v1 Kando Listing v2 Detection of transactions that violate terms of use v2 Kando Listing (US) Price suggestion There are other projects not listed here that were not implemented due to such issues as precision In 2016, we started to see the limitations of handling various operations completely manually. We’re not utilizing machine learning because it’s a trend, we’re utilizing it simply because it’s necessary. ・Simple transactions ・Safe transactions
  12. Listing Information Suggestion Feature (Kando Listing) Apple wireless keyboard Consumer

    electronics/smartphones and cameras > PC/tablets > PC peripherals Apple This feature was labelled Kando Listing based on the strong impression we want to leave on our users (“kando” can refer to “impression” in Japanese)
  13. Kando Listing (US) Louis Vuitton Wallet Men > Men's accessories

    > Wallets Louis Vuitton Although there are some differences in the data, the Kando Listing feature in the US uses the same algorithm structure as Kando Listing in Japan.
  14. Kando Listing Algorithm CNN (Inception-v3) Image feature vector Similar items

    Title Ralph Lauren polo shirt Category Baby/kids > baby clothes (boys) up to 95cm > tops Brand Ralph Lauren Color Red Price 800 yen - 1,200 yen It’s simple but highly flexible and easy to operate (k-nearest neighbors algorithm) Image feature pool of approx. 50 million images
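The k-nearest-neighbors step on slide 14 can be sketched as below. The Inception-v3 feature extractor is replaced by precomputed vectors, and the pool is tiny rather than the ~50 million images mentioned on the slide; names like `suggest_listing_info` are illustrative.

```python
import numpy as np

def suggest_listing_info(query_vec, pool_vecs, pool_meta, k=3):
    """Find the k items in the image-feature pool closest to the query
    vector (cosine similarity) and return their listing metadata
    (title, category, brand, color, price range)."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                  # cosine similarity to every pool item
    top = np.argsort(-sims)[:k]   # indices of the k most similar items
    return [pool_meta[i] for i in top]
```

At a 50-million-vector scale an exhaustive scan like this would normally be replaced by an approximate nearest-neighbor index, but the slide's point stands: plain k-NN over CNN features is simple, highly flexible, and easy to operate.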
  15. Listing by Scanning Barcode (Non-ML) If the item can be

    identified, the information will be input automatically (except the condition of the item) (We have master data for some books, CDs, and DVDs)
  16. Kando Listing v2 If the item can be identified, the

    information will be input automatically (except the condition of the item) The front cover image is used instead of the barcode This saves having to turn the book over to take a picture of the barcode
  17. Kando Listing v2 Simply take a picture of the item,

    select the condition of the item, and press the listing button
  18. Infrastructure for Kando Listing As long as a Dockerfile is

    written, any system can be deployed perfectly #mercari_mlops
  19. Infrastructure for Kando Listing Kando Listing infrastructure, provided in one

    week with a Dockerfile Updates to models are carried out with AWS, where the images are saved, and features are provided using GCP (GKE), which can run Kubernetes. Microservices and machine learning models work well together, and provide many benefits in terms of the following: ・Resource management ・Model updates
  20. Infrastructure for Kando Listing (Blue-Green Deployment) Persistent Volume is generated

    each time the model is updated Used as is if only updating the code (Redis is shared) Although the ML model is not included in the Docker Image due to the file size, Immutable Infrastructure is achieved using Persistent Volume (ReadOnlyMany) Deploy (Rollback)
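The arrangement on slide 20 (model files kept out of the Docker image, mounted read-only from a Persistent Volume generated per model version) might look roughly like the fragment below. All names (`kando-listing`, `model-pvc-v2`, the `track` label) are hypothetical; only the ReadOnlyMany mount and the per-update volume are from the slide.

```yaml
# Sketch: mount the ML model from a pre-built Persistent Volume
# (ReadOnlyMany) so the container image stays small and the
# deployment remains immutable. Blue-green switch is done by
# repointing the Service selector between tracks.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kando-listing-green
spec:
  replicas: 2
  selector:
    matchLabels: {app: kando-listing, track: green}
  template:
    metadata:
      labels: {app: kando-listing, track: green}
    spec:
      containers:
      - name: api
        image: kando-listing:latest
        volumeMounts:
        - name: model
          mountPath: /models
          readOnly: true
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: model-pvc-v2   # new claim per model update
          readOnly: true
```

Because the pods never write to the model volume, rolling back is just deploying the previous track, which still points at the previous claim.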
  21. Item Weight Suggestion Feature (US) 1. Upload a picture of

    the item 2. Item information is suggested with Kando Listing 3. User adds/supplements item information 4. Weight is suggested when entering shipping information ◦ Weight class is automatically selected in advance ◦ A warning is given if the user chooses a lighter weight class • User does not have to enter as much shipping information manually • Helps prevent issues related to shipping
  22. Search for Items with Images (Currently in Development) Solves issues

    such as not knowing the right keywords or the hassle of having to input keywords
  23. Solves issues such as not knowing the right keywords or

    the hassle of having to input keywords Search for Items with Images (Currently in Development)
  24. Search for items even if you don’t know the pattern

    name, brand name, or name of the character featured Search for Items with Images (Currently in Development)
  25. Search for items even if you don’t know the pattern

    name, brand name, or name of the character featured Search for Items with Images (Currently in Development)
  26. Also effective if searching for a specific model that you

    don’t know the name of Search for Items with Images (Currently in Development)
  27. Leveraging billion-scale data and machine learning to map the future

    Why are huge datasets needed? What are we trying to achieve?
  28. Software 1.0:

    • Based on traditional programming
    • Humans provide the logic
    Software 2.0:
    • Not programming
    • Expressed with neural networks
    • Provide data (input/expected output) for learning
    • If there is an error, you don’t fix the code; instead you increase the amount of data until it behaves as expected
    • Doesn’t replace Software 1.0
    “Software 2.0”, Andrej Karpathy (Director of AI at Tesla), https://medium.com/@karpathy/software-2-0-a64152b37c35
  29. Software 2.0 When Mercari first implemented image recognition, 1. The

    dataset was created using images of items 2. We had it learn several well-known convolutional neural networks 3. We increased the amount of data for classes that did not produce results as expected 4. We wrote a REST API in order to use that (Software 1.0) Software 2.0 has already begun
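Step 4 above, the Software 1.0 part that wraps the trained network in a REST API, can be sketched with only the standard library. The `classify` function here is a trivial stand-in for the actual CNN, and the endpoint shape is an assumption for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(pixels):
    # Stand-in for the trained CNN: a trivial brightness rule.
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, run the model, return the label as JSON.
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        resp = json.dumps({"label": classify(body["pixels"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(resp)))
        self.end_headers()
        self.wfile.write(resp)

    def log_message(self, *args):
        pass  # silence per-request logging

def serve(port=0):
    """Create the server; port=0 picks a free ephemeral port."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

In production this serving layer is exactly the boundary between Software 2.0 (the learned model) and Software 1.0 (the hand-written plumbing around it).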
  30. Software 1.0:

    • Based on traditional programming
    • Humans provide the logic
    Software 2.0:
    • Not programming
    • Expressed with neural networks
    • Provide data (input/expected output) for learning
    • If there is an error, you don’t fix the code; instead you increase the amount of data until it behaves as expected
    • Doesn’t replace Software 1.0
    “Software 2.0”, Andrej Karpathy (Director of AI at Tesla), https://medium.com/@karpathy/software-2-0-a64152b37c35
  31. One Model To Learn Them All (Google Brain) “Can we

    create a unified deep learning model to solve tasks across multiple domains?” “One model to learn them all”, Kaiser et al., https://arxiv.org/abs/1706.05137, 2017
  32. • Eight different tasks, including speech recognition, image recognition, and

    machine translation • Trained using a single neural network • Better results were obtained compared to training for each individual task ◦ Joint positive effects were realized for completely different tasks One Model To Learn Them All (Google Brain)
  33. AutoML / NASNet (Google Brain) “AutoML for large scale image

    classification and object detection”, https://ai.googleblog.com/2017/11/automl-for-large-scale-image.html • Machine learning models generating machine learning models • NASNet - generated for tasks related to image recognition • Higher performance achieved compared to existing models
  34. Leveraging Billion-Scale Data and Machine Learning to Map the Future

    • Currently, generic machine learning models < task-specific machine learning models
    • Possible to mass-produce practical-level models from generic machine learning models/frameworks?
      ◦ Software 2.0
      ◦ One Model To Learn Them All
      ◦ AutoML
      ◦ etc...
    • How well will transfer learning go with public models that have already been trained?
      ◦ Training a large model from scratch is extremely costly
    • In the case of a large dataset (hundreds of millions of data items or greater), it may actually be possible to achieve: generic machine learning models > task-specific machine learning models
  35. Leveraging Billion-Scale Data and Machine Learning to Map the Future

    It is evident that modelling and model training using large datasets containing hundreds of millions to billions of data items is extremely costly. However, we want to see what the future holds. Many services could be operated with Software 2.0.