MTC2018 - Leveraging billions of data items and machine learning to map the future

Leveraging billions of data items and machine learning to map the future
Takuma Yamaguchi, Software Engineer (Machine Learning)

Leveraging billion-scale data to map the future
A dataset that contains billions of data items is huge (Logs ≠ Data)

Large Datasets
It is already possible to create various datasets, including:
• Item image data (with item information)
• Item views / search → history of items purchased/not purchased
• etc.
However, a large dataset is not necessarily a good dataset, and billions of data items are not currently being used.
We are eager to utilize large amounts of data in the future (Either we use it or dispose of it. Otherwise it’s an unnecessary cost)
Open datasets
・ ImageNet (ILSVRC 2012): 1.2 million images / 1,000 classes
・ Open Images V4: 9 million images / 20,000 classes
・ YouTube-8M Dataset: 8 million videos / 4,000 classes
・ Amazon Product Reviews: 80 million reviews
Mercari’s image-related projects
・ Classification: 8.5 million images / 14,000 classes
・ Image recognition for listing: 50 million images
As a matter of fact, ImageNet is relatively small by today’s standards; it “only” contains a million images that cover the specific domain of object classification (https://arxiv.org/abs/1807.05520, Facebook AI Research)

Leveraging billion-scale data and machine learning to map the future
Why are huge datasets needed? What are we trying to achieve?

Mercari’s Machine Learning History (2016, 2017, 2018)
• Item image recognition (Kando Listing)
• Kaggle: Mercari Price Suggestion Challenge
• Detection of listings that violate terms of use v1
• Detection of age-restricted items
• Item weight suggestion
• Detection of transactions that violate terms of use v1
• Classification of inquiries
• Identification of reported items v2
• Detection of listings that violate terms of use v2
• Identification of reported items v1
• Kando Listing v2
• Detection of transactions that violate terms of use v2
• Kando Listing (US)
• Price suggestion
・ Simple transactions
・ Safe transactions
There are other projects not listed here that were not implemented due to issues such as precision.
In 2016, we started to see the limitations of handling various operations completely manually. We’re not utilizing machine learning because it’s a trend; we’re utilizing it simply because it’s necessary.

Kaggle: Mercari Price Suggestion Challenge

Detection of Listings/Transactions that Violate Our Terms of Use

Responding to terms of use violations
• Responding to reports: Verify reports from users regarding violations of our terms of use, and respond by removing problematic listings and suspending offending users
• Monitoring: Extract listings and users that may be in violation of our terms of use, remove potentially problematic listings, and suspend potentially offending users
• Working towards a service where automatic detection would make reports unnecessary
• Working towards a scalable service through automatic detection
Moving forward with automation is not about reducing costs; it is about delivering a speedy and scalable service. If our service grew tenfold, it would be unrealistic to increase our human resources by the same factor.

Utilizing Multimodal Models
A machine learning model using multiple types of data as input:
• Item name, description (text data)
• Image
• Category, price (categorical and numerical data)
→ Make decision
Aiming to speedily generate a baseline classification
This is already released in the US, and the precision is currently being verified in Japan (The structure of the model differs based on differences between the English and Japanese languages)
• Even if you make it possible to input multiple types of data, it doesn’t necessarily go as expected, and a lot of trial and error is required.
• There is little information on the challenges of, and unsuccessful attempts at, multimodal models, or on ways to resolve them.
• There is plenty of room for improvement, and we are continuing to work on this.
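The multimodal input described on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions, not Mercari's model: the slide does not describe the architecture, so the hashed bag-of-words text encoder, the simple concatenation of features, the logistic scoring step, and every name and constant (`TEXT_DIM`, `NUM_CATEGORIES`, `fuse`, `violation_score`) are hypothetical.

```python
import math
import zlib

import numpy as np

TEXT_DIM = 64        # hypothetical size of the hashed bag-of-words vector
NUM_CATEGORIES = 8   # hypothetical number of item categories


def text_features(text: str) -> np.ndarray:
    """Hashed bag-of-words: a stand-in for a learned text encoder."""
    vec = np.zeros(TEXT_DIM)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % TEXT_DIM] += 1.0
    return vec


def fuse(name, description, image_vec, category_id, price) -> np.ndarray:
    """Concatenate the per-modality features into one input vector."""
    category = np.zeros(NUM_CATEGORIES)
    category[category_id] = 1.0                 # one-hot categorical feature
    numeric = np.array([math.log1p(price)])     # log-scaled numerical feature
    return np.concatenate([
        text_features(name),
        text_features(description),
        np.asarray(image_vec, dtype=float),     # e.g. a CNN image feature vector
        category,
        numeric,
    ])


def violation_score(features, weights, bias=0.0) -> float:
    """Logistic score in (0, 1); the weights would come from training."""
    return float(1.0 / (1.0 + np.exp(-(features @ weights + bias))))


# Usage with made-up inputs and untrained (all-zero) weights.
x = fuse("wireless keyboard", "used, works fine", np.zeros(16), 3, 1000.0)
score = violation_score(x, np.zeros(x.shape[0]))
```

Concatenation is only the simplest way to combine modalities; as the slide notes, making this work in practice takes a lot of trial and error.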

Kando Listing (Item Image Recognition)

Listing Information Suggestion Feature (Kando Listing)
• Title: Apple wireless keyboard
• Category: Consumer electronics/smartphones and cameras > PC/tablets > PC peripherals
• Brand: Apple
This feature was named Kando Listing after the strong impression we want to leave on our users (“kando” can refer to “impression” in Japanese)

Kando Listing (US)
• Title: Louis Vuitton Wallet
• Category: Men > Men's accessories > Wallets
• Brand: Louis Vuitton
Although there are some differences in the data, the Kando Listing feature in the US uses the same algorithm structure as Kando Listing in Japan.

Kando Listing Algorithm
CNN (Inception-v3) → Image feature vector → Similar items (k-nearest neighbors algorithm)
Image feature pool of approx. 50 million images
• Title: Ralph Lauren polo shirt
• Category: Baby/kids > baby clothes (boys) up to 95cm > tops
• Brand: Ralph Lauren
• Color: Red
• Price: 800 yen - 1,200 yen
It’s simple but highly flexible and easy to operate
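The k-nearest-neighbors step on this slide can be illustrated with a minimal sketch. It assumes the CNN feature vectors have already been computed, uses brute-force cosine similarity, and derives the suggestion by majority vote over the neighbors. The voting rule, the price-range rule, the field names, and the function `suggest_listing` are assumptions for illustration; the slide does not say how neighbors are turned into suggestions.

```python
from collections import Counter

import numpy as np


def suggest_listing(query_vec, pool_vecs, pool_items, k=5):
    """Suggest listing fields from the k most similar items in the pool."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    pool = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = pool @ q
    nearest = np.argsort(-sims)[:k]          # indices of the k most similar items
    neighbours = [pool_items[i] for i in nearest]

    suggestion = {}
    for field in ("category", "brand"):
        # Majority vote over the neighbours (an assumed rule).
        votes = Counter(item[field] for item in neighbours)
        suggestion[field] = votes.most_common(1)[0][0]
    prices = [item["price"] for item in neighbours]
    suggestion["price_range"] = (min(prices), max(prices))
    return suggestion


# Tiny made-up feature pool standing in for the real image features.
pool_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pool_items = [
    {"category": "tops", "brand": "Ralph Lauren", "price": 800},
    {"category": "tops", "brand": "Ralph Lauren", "price": 1200},
    {"category": "PC peripherals", "brand": "Apple", "price": 5000},
]
result = suggest_listing(np.array([1.0, 0.05]), pool_vecs, pool_items, k=2)
```

With a pool of tens of millions of vectors, an approximate nearest-neighbor index would likely replace the brute-force search shown here.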

Listing by Scanning Barcode (Non-ML)
If the item can be identified, the information is entered automatically (except for the condition of the item)
(We have master data for some books, CDs, and DVDs)

Kando Listing v2
If the item can be identified, the information is entered automatically (except for the condition of the item)
The front cover image is used instead of the barcode. This saves having to turn the book over to take a picture of the barcode.

Kando Listing v2
Simply take a picture of the item, select the condition of the item, and press the listing button

Kando Listing v2
Some books only have a barcode on the outer packaging

Infrastructure for Kando Listing
As long as a Dockerfile is written, any system can be deployed perfectly #mercari_mlops

Infrastructure for Kando Listing
Kando Listing infrastructure, provided in one week with a Dockerfile
Model updates are carried out on AWS, where the images are stored, and features are provided on GCP (GKE), which can run Kubernetes.
Microservices and machine learning models work well together and provide many benefits in terms of the following:
・ Resource management
・ Model updates

Infrastructure for Kando Listing (Blue-Green Deployment)
• A Persistent Volume is generated each time the model is updated
• It is used as is if only the code is updated (Redis is shared)
• Deploy (Rollback)
Although the ML model is not included in the Docker image due to its file size, immutable infrastructure is achieved using a Persistent Volume (ReadOnlyMany)

Item Weight Suggestion Feature (US)

1. Upload a picture of the item
2. Item information is suggested with Kando Listing
3. User adds/supplements item information
4. Weight is suggested when entering shipping information
  ◦ Weight class is automatically selected in advance
  ◦ A warning is given if the user chooses a lighter weight class
• User does not have to enter as much shipping information manually
• Helps prevent issues related to shipping
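Step 4 above (pre-selecting a weight class and warning on a lighter choice) might look like the following sketch. The weight-class boundaries, the unit, and the function names are hypothetical, and how the weight itself is predicted is not described in the slides, so the predicted weight is simply taken as an input here.

```python
from bisect import bisect_left

# Hypothetical upper bounds of the weight classes, in pounds.
WEIGHT_CLASSES_LB = [0.5, 1, 2, 5, 10]


def suggest_weight_class(predicted_lb: float) -> int:
    """Return the index of the lightest class that can hold the predicted weight."""
    index = bisect_left(WEIGHT_CLASSES_LB, predicted_lb)
    return min(index, len(WEIGHT_CLASSES_LB) - 1)   # clamp to the heaviest class


def check_user_choice(suggested: int, chosen: int):
    """Return a warning message if the user picked a lighter class, else None."""
    if chosen < suggested:
        return (
            f"Selected class (up to {WEIGHT_CLASSES_LB[chosen]} lb) is lighter "
            f"than the suggested class (up to {WEIGHT_CLASSES_LB[suggested]} lb)."
        )
    return None


suggested = suggest_weight_class(1.4)        # pre-selected for the user
warning = check_user_choice(suggested, 1)    # user switches to a lighter class
no_warning = check_user_choice(suggested, 2)
```

Warning only on lighter choices matches the slide: an underestimated weight is what causes shipping problems, while a heavier choice is left to the user.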

Search for Items with Images (Currently in Development)

Solves issues such as not knowing the right keywords or the hassle of having to input keywords

Search for Items with Images (Currently in Development)
Search for items even if you don’t know the pattern name, brand name, or name of the character featured

Search for Items with Images (Currently in Development)
Also effective when searching for a specific model whose name you don’t know

Leveraging billion-scale data and machine learning to map the future
Why are huge datasets needed? What are we trying to achieve?

Software 2.0
Andrej Karpathy (Director of AI at Tesla)
Software 1.0:
• Based on traditional programming
• Humans provide the logic
Software 2.0:
• Not programming
• Expressed with neural networks
• Provide data (input/expected output) for learning
• If there is an error, you don’t fix the code but instead increase the amount of data so that it goes well
• Doesn’t replace Software 1.0
“Software 2.0”, Andrej Karpathy, https://medium.com/@karpathy/software-2-0-a64152b37c35

Software 2.0
When Mercari first implemented image recognition:
1. The dataset was created using images of items
2. We trained several well-known convolutional neural networks on it
3. We increased the amount of data for classes that did not produce the expected results
4. We wrote a REST API in order to use the model (Software 1.0)
Software 2.0 has already begun
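Step 3 of this workflow (adding data for classes that underperform) presupposes a way to find those classes. One possible way, shown purely as an assumption since the slides do not say how such classes were identified, is to compute per-class recall on a validation set and flag the classes below a threshold. The function name and the 0.8 threshold are hypothetical.

```python
from collections import defaultdict


def classes_needing_data(y_true, y_pred, min_recall=0.8):
    """Return the classes whose recall on the validation set is below min_recall."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, predicted in zip(y_true, y_pred):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    recall = {label: correct[label] / total[label] for label in total}
    return sorted(label for label, value in recall.items() if value < min_recall)


# Made-up validation labels and predictions.
y_true = ["shoes", "shoes", "bag", "bag", "bag", "watch"]
y_pred = ["shoes", "shoes", "bag", "shoes", "shoes", "watch"]
weak_classes = classes_needing_data(y_true, y_pred)
```

In the Software 2.0 framing, the response to the flagged classes is to collect more examples for them and retrain, rather than to edit code.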

One Model To Learn Them All (Google Brain)
“Can we create a unified deep learning model to solve tasks across multiple domains?”
“One model to learn them all”, Kaiser et al., https://arxiv.org/abs/1706.05137, 2017

One Model To Learn Them All (Google Brain)
• Eight different tasks, including speech recognition, image recognition, and machine translation
• Trained using a single neural network
• Better results were obtained compared to training for each individual task
  ◦ Joint positive effects were realized for completely different tasks

AutoML / NASNet (Google Brain)
• Machine learning models generating machine learning models
• NASNet: generated for tasks related to image recognition
• Higher performance achieved compared to existing models
“AutoML for large scale image classification and object detection”, https://ai.googleblog.com/2017/11/automl-for-large-scale-image.html

• Currently: generic machine learning models < task-specific machine learning models
• Is it possible to mass-produce practical-level models from generic machine learning models/frameworks?
  ◦ Software 2.0
  ◦ One Model To Learn Them All
  ◦ AutoML
  ◦ etc.
• How well will transfer learning go with public models that have already been trained?
  ◦ Training a large model from scratch is extremely costly
• In the case of a large dataset (hundreds of millions of data items or more), it may actually be possible to achieve: generic machine learning models > task-specific machine learning models

It is evident that modelling and model training using large datasets containing hundreds of millions to billions of data items is extremely costly. However, we want to see what the future holds. Many services could be operated with Software 2.0.
