
“kurashiru” Developing Personalized Recommendation System with “Snowflake”

Tech-Verse 2022

November 17, 2022

Transcript

  1. Self-Introduction › dely, Inc., kurashiru Company › Data engineer

    In August 2021, I joined dely, Inc. as a data engineer responsible for building a new data platform for kurashiru › Currently, I work as a data engineer/PjM (Scrum Master) on the personalization team › In 2022, I was selected as a Snowflake Data Superhero. Yuya Harigae | harry @gappy50
  2. Agenda - Why Did We Build a New Data Platform?

    - Introduction of kurashiru / kurashiru in the Future - Data Platform in the Past - Why Snowflake? - Snowflake’s Near-Real-Time Data Pipeline - How We Use the Data Platform for kurashiru’s Personalization Features - External Functions to AWS / dbt / for Future ML - kurashiru in the Future
  3. Japan’s top recipe video service

    Recipe videos: 100,000 · App downloads: 56 million · Users: 37 million. With the mission of delivering happiness to eight billion people three times a day, we aim to create a service that is easy to use every day and conveys the warmth of people. Search recipes by ingredient or name of dish. [App screenshots: recipe cards such as “Pork Shabu-Shabu Topped With Green Onions” and “The AMAZING CHICKEN TATSUTA”]
  4. kurashiru in the Past › A recipe-posting function for general

    users was added in December 2020 › Recipe content is supervised by kurashiru’s chefs › We now have more content and more users
  5. kurashiru’s Approach to Personalization: Difficulties unique to kurashiru › People

    do not necessarily eat tomorrow the same food they ate yesterday › If leftovers are still in the refrigerator, people may eat them again › If leftovers are still in the refrigerator, it is more satisfying for the user to search for them › This makes it difficult to define the task for making recommendations › We have tried various approaches by building ML models so far › Can we solve this problem simply by building ML models to make inferences?
  7. kurashiru in the Future

    Personalized recommendation for each user, so that we can respond to “value diversification” and “globalization,” among other things. [Diagram: user needs such as savings, time-saving cooking, diversification vs. mass, microwave, mismatch loss, baby food, losing weight, entertainment]
  8. What kurashiru Needs in the Future

    [Diagram: Behavior logs → Server side → Data platform → Recommendation] The first thing we need is a real-time, reliable data platform (quality / data pipeline / scalability)
  9. Data Analytics Platform in the Past › For KPIs and

    effect verification, kurashiru could perform the necessary analysis to some extent › A culture was fostered where everyone writes queries for analysis and visualization, whether they are engineers, PdMs, or marketers › On the other hand, this also created data chaos › It required few data engineering resources. It was a good data platform, optimized to the limit for analysis purposes
  10. Data Analytics Platform in the Past

    [Diagram: Aurora / Kinesis Firehose → S3 → Glue → S3 → Athena → BI] It took 3 hours for behavior logs to arrive; guarantees on the number of concurrent executions and on performance were also required
  11. Issues of the Data Analytics Platform in the Past

    Real-time and performance aspects of the past ETL/data pipelines › It took about 3 hours before behavior logs could be analyzed › ETL over behavior data generated 300 million times a day took considerable time and cost. Scalability and responsiveness of the DWH › Assuming use by ML and apps, we need a DWH that can guarantee scalability and performance. Engineering resources › Data engineering resources in particular are limited; considering additional development and operation of the existing data pipelines, agility would be low › It could not support data quality or deal with data chaos such as scattered metrics
  12. Data Pipeline Ver. 1 Considered by kurashiru

    [Diagram: Aurora / Kinesis Firehose → S3 → Snowpipe → Stream + Serverless task → Staging table → Target table → BI]
  13. Expectations for / Characteristics of Snowflake

    Functional convenience for data engineering › Easy to migrate from the existing AWS data platform; existing assets and data can be used as they are › Virtually no infrastructure management or detailed per-workload tuning is needed › Storage separation brings benefits, and workload isolation enables multiple approaches › Computing resources scale instantly with almost no limit › There are many managed functions for ELT and for building data pipelines. Good compatibility with multi-cloud / data sharing / the modern data stack › Good compatibility with dbt (described later) and the data stacks that are mainstream overseas, with little risk of vendor lock-in › Machine learning on Snowflake is easy to deploy, and options with AI services are available › As data utilization advances, we can expect a future where sharing kurashiru log data becomes possible
  14. Data Pipeline Ver. 1 Considered by kurashiru

    [Diagram: Kinesis Firehose → S3 → Snowpipe → Stream + Serverless task → Staging table → Target table → BI]
  15. Data Pipeline Ver. 1 Considered by kurashiru

    [Diagram: Kinesis Firehose → S3 → Snowpipe → Stream + Serverless task → Staging table → Target table → BI] Latency can be reduced from 3 hours to 1–3 minutes; migration to be completed within 1 month
  16. Details of the near-real-time data pipeline: Snowpipe

    Data is automatically loaded when a file is stored in cloud storage (Data Pipeline Ver. 1 Considered by kurashiru)
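As an illustration of this setup, here is a minimal sketch of an auto-ingest pipe defined via Snowpark for Python; the stage, pipe, and table names, the bucket URL, and the storage integration are hypothetical placeholders, not kurashiru’s actual objects.

```python
# Minimal sketch: an auto-ingest Snowpipe, defined via Snowpark for Python.
# All names (LOG_STAGE, LOG_PIPE, RAW_LOGS, bucket, integration) are
# hypothetical placeholders.
from snowflake.snowpark import Session

# Assumes connection parameters are configured (e.g., connections.toml).
session = Session.builder.getOrCreate()

# External stage pointing at the S3 prefix that Kinesis Firehose writes to.
session.sql("""
    CREATE STAGE IF NOT EXISTS LOG_STAGE
      URL = 's3://my-bucket/behavior-logs/'
      STORAGE_INTEGRATION = MY_S3_INTEGRATION
""").collect()

# AUTO_INGEST = TRUE lets S3 event notifications trigger the load, so new
# files land in the staging table within minutes of arrival.
session.sql("""
    CREATE PIPE IF NOT EXISTS LOG_PIPE AUTO_INGEST = TRUE AS
      COPY INTO RAW_LOGS
      FROM @LOG_STAGE
      FILE_FORMAT = (TYPE = 'JSON')
""").collect()
```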
  17. Details of the near-real-time data pipeline: Stream

    A function that tracks and records the change history of tables; CDC (change data capture) can be implemented by using the tracked changes in subsequent ELT processing (Data Pipeline Ver. 1 Considered by kurashiru)
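For example, a stream on the staging table could look like the following sketch (table and stream names are hypothetical):

```python
# Minimal sketch: a stream that records changes on the staging table,
# enabling CDC-style incremental ELT. Names are hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

# The stream tracks inserts made by Snowpipe into RAW_LOGS.
session.sql("CREATE STREAM IF NOT EXISTS RAW_LOGS_STREAM ON TABLE RAW_LOGS").collect()

# Querying the stream returns only rows changed since it was last consumed,
# with metadata columns such as METADATA$ACTION describing each change.
new_rows = session.sql("SELECT * FROM RAW_LOGS_STREAM").collect()
```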
  18. Details of the near-real-time data pipeline: Serverless task

    The ability to run queries/scripts periodically; complex data pipelines can be implemented by combining tasks into a DAG; billed per second of operation, with compute resources automatically sized to fit the workload (Data Pipeline Ver. 1 Considered by kurashiru)
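A serverless task consuming that stream might be sketched as follows; the sizing parameter, schedule, table names, and columns are assumptions for illustration:

```python
# Minimal sketch: a serverless task that fires only when the stream has data.
# Consuming the stream inside the task advances its offset, completing the
# CDC loop. All names and columns are hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

session.sql("""
    CREATE TASK IF NOT EXISTS LOAD_TARGET_TASK
      USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- serverless: billed per second
      SCHEDULE = '1 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_LOGS_STREAM')       -- skip runs with no new data
    AS
      INSERT INTO TARGET_TABLE
      SELECT raw:user_id, raw:event, raw:event_ts FROM RAW_LOGS_STREAM
""").collect()

# Tasks are created suspended; resume to start the schedule.
session.sql("ALTER TASK LOAD_TARGET_TASK RESUME").collect()
```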
  20. Details of the near-real-time data pipeline

    By combining Snowpipe, Stream, and Serverless task, a near-real-time data pipeline can be implemented (Data Pipeline Ver. 1 Considered by kurashiru)
  21. Effects Obtained So Far

    Thanks to Snowflake, › With minimal data engineering resources, the migration from the existing analytics platform to the new data platform was completed within a few months, from initial investigation to production › Taking advantage of near-zero maintenance, we operate with almost no need for fault handling or workload monitoring (this has continued for about a year since the migration) › Operation is easy; we only pay a little attention to cost on a daily basis. Thanks to the near-real-time data pipeline, › We now have an environment where client behavior can be analyzed within minutes › Managed ELT processing with CDC and serverless tasks enabled cost-optimized pipeline operation
  22. kurashiru’s Approach to Personalization

    Personalized recommendation for each user: we want to respond to “value diversification” and “globalization,” among other things. [Diagram: user needs such as savings, time-saving cooking, diversification vs. mass, microwave, mismatch loss, baby food, losing weight, entertainment]
  23. kurashiru’s Approach to Personalization

    We want users to enjoy kurashiru more by recommending the diverse content unique to kurashiru › We want users to enjoy eating and living more by increasing that diverse content
  24. Data Pipeline Ver. 2 Considered by kurashiru

    [Diagram: Real-time recommendation JOB → Reverse ETL by external function to AWS → Server-side API; BI]
  25. Data Pipeline Ver. 2 Considered by kurashiru

    [Diagram: Modeling by dbt; Real-time recommendation JOB → Reverse ETL by external function to AWS → Server-side API; BI]
  26. Modeling by dbt

    Modeling user behavior and content data, and centralizing data for analysis › All data pipelines other than recommendations are developed with dbt (see the sketch after this slide) › Multiple people can develop data pipelines as long as they can write queries › The quality of data processing and collection has greatly improved through GitHub integration and testing during PRs and deployment!
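To give a flavor of what a dbt model on Snowflake can look like, here is a minimal sketch using the dbt Python model API (dbt models are usually plain SQL; the upstream model name and columns here are hypothetical):

```python
# Minimal sketch of a dbt Python model on Snowflake (dbt-snowflake).
# The upstream model name and columns are hypothetical; in practice, tests
# and documentation live alongside the model and run in CI on every PR.
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() resolves the upstream model and records lineage automatically.
    logs = dbt.ref("stg_behavior_logs")  # returns a Snowpark DataFrame

    # Aggregate behavior logs per user for downstream analysis models.
    return logs.group_by("user_id").count().with_column_renamed("COUNT", "event_count")
```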
  27. Modeling by dbt

    The threshold for data engineering is lowered, and the team can respond beyond individual specialties
  29. Modeling by dbt › Differences from the data pipelines implemented

    natively in Snowflake › Snowflake’s data pipelines › ELT processing where real-time performance and cost efficiency are required, closer to the staging layer › Benefit from the strengths of Snowpipe and serverless tasks › Used when an approach closer to data engineering is required › Data pipelines implemented in dbt › Used for ETL processing from the DWH layer onward › For data that does not require real-time properties and is fine with daily or hourly processing › For cases where data quality must be ensured. The threshold for data engineering is lowered, and the team can respond beyond individual specialties
  30. Real-Time Recommendation JOB

    A series of data pipeline events makes recommendations from user behavior possible › A pipeline on Snowflake implements everything from preprocessing through recommendation processing to reflection into the application › User behavior can now be reflected in recommendations in near real time › Currently, recommendation processing is rule based › In the future, we expect applications such as APIs for AI services, cloud managed services, and a machine learning foundation that learns and infers using Snowpark for Python › Pipeline steps: collect necessary data → execute preprocessing → recommendation processing → reflect to server (see the sketch below)
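One way to sketch such a job is a small task DAG, where the root task is scheduled and each step runs AFTER its predecessor; all names, columns, and the rule-based logic below are hypothetical placeholders, not kurashiru’s actual logic:

```python
# Minimal sketch: the recommendation JOB as a Snowflake task DAG.
# The root task is scheduled; children declare AFTER dependencies.
# Every name, column, and rule below is a hypothetical placeholder.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

steps = [
    # (task name, predecessor, SQL body)
    ("COLLECT_TASK", None,
     "INSERT INTO REC_INPUT SELECT user_id, video_id, event_ts FROM TARGET_TABLE"),
    ("PREPROCESS_TASK", "COLLECT_TASK",
     "INSERT INTO REC_FEATURES SELECT user_id, video_id, COUNT(*) AS views "
     "FROM REC_INPUT GROUP BY user_id, video_id"),
    ("RECOMMEND_TASK", "PREPROCESS_TASK",  # rule based: top 10 most-viewed per user
     "INSERT INTO REC_RESULTS SELECT user_id, video_id FROM REC_FEATURES "
     "QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY views DESC) <= 10"),
]

for name, predecessor, body in steps:
    trigger = f"AFTER {predecessor}" if predecessor else "SCHEDULE = '5 MINUTE'"
    session.sql(f"""
        CREATE TASK IF NOT EXISTS {name}
          USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
          {trigger}
        AS {body}
    """).collect()

# Resume children before the root so the whole DAG becomes active.
for name, _, _ in reversed(steps):
    session.sql(f"ALTER TASK {name} RESUME").collect()
```

The final “reflect to server” step corresponds to the external function described on slide 35.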
  35. Reverse ETL by External Function to AWS

    (https://docs.snowflake.com/ja/sql-reference/external-functions-introduction.html#what-is-an-external-function) External APIs can be used as SQL functions, and data is delivered to apps by a query from the data pipeline › Behaves like a normal function › Can be called from a query › Calls external code › For example, calling a string-translation API › Here, Reverse ETL on the data pipeline is realized by writing inference-related results to DynamoDB via an API Gateway endpoint hosted on AWS (see the sketch below)
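A minimal sketch of that setup, assuming an API Gateway endpoint that forwards rows to DynamoDB; the integration name, role ARN, URL, and function signature are all hypothetical:

```python
# Minimal sketch: Reverse ETL via a Snowflake external function. The API
# integration, role ARN, endpoint URL, and function signature are hypothetical
# placeholders; the endpoint behind API Gateway is assumed to write each row
# to DynamoDB.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

session.sql("""
    CREATE OR REPLACE API INTEGRATION AWS_API_INTEGRATION
      API_PROVIDER = AWS_API_GATEWAY
      API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-extfunc-role'
      API_ALLOWED_PREFIXES = ('https://abc123.execute-api.ap-northeast-1.amazonaws.com/')
      ENABLED = TRUE
""").collect()

session.sql("""
    CREATE OR REPLACE EXTERNAL FUNCTION WRITE_RECOMMENDATION(user_id VARCHAR, video_ids ARRAY)
      RETURNS VARIANT
      API_INTEGRATION = AWS_API_INTEGRATION
      AS 'https://abc123.execute-api.ap-northeast-1.amazonaws.com/prod/recommendations'
""").collect()

# Reverse ETL: calling the function from a query pushes each user's
# recommendations out to the serving store.
session.sql("""
    SELECT WRITE_RECOMMENDATION(user_id, ARRAY_AGG(video_id))
    FROM REC_RESULTS
    GROUP BY user_id
""").collect()
```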
  36. Data Pipeline Ver. 2 Considered by kurashiru

    [Diagram: Modeling by dbt; Real-time recommendation JOB → Reverse ETL by external function to AWS → Server-side API; BI]
  37. kurashiru in the Future

    A new phase: utilizing machine learning on the data from the pipeline we have built so far › We want to fulfill the goal of recommending kurashiru’s diverse content › We are at a phase where continuing rule-based changes feels difficult › We will build a better recommendation system by extending the idea of the existing pipeline › Options: managed machine learning services in the cloud › AI services from third parties › a recommendation system that learns and infers on Snowflake using Snowpark and the dbt Python model, building a machine learning pipeline
  38. kurashiru in the Future

    [Diagram: BI, Server-side API, ML, AI Service] Building, training, and deploying ML models using cloud services, as well as collaboration with managed AI services, are also possible
  39. kurashiru in the Future

    [Diagram: BI, Server-side API, ML, AI Service] By running Snowpark (or the dbt Python model) on Snowflake, a machine learning pipeline can be built without data leaving the platform (see the sketch below)
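As one hypothetical shape of that pipeline, training could run inside a Snowpark stored procedure so raw data never leaves Snowflake; the model choice, feature table, and column names below are illustrative assumptions, not kurashiru’s actual approach:

```python
# Minimal sketch: training inside Snowflake via a Snowpark stored procedure,
# so the data never leaves the platform. Model choice, feature table, and
# column names are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

session = Session.builder.getOrCreate()

@sproc(name="train_recommender", replace=True,
       packages=["snowflake-snowpark-python", "scikit-learn"])
def train_recommender(session: Session) -> str:
    from sklearn.linear_model import LogisticRegression

    # Pull the (hypothetical) feature table; execution happens in Snowflake.
    df = session.table("REC_TRAINING_SET").to_pandas()
    model = LogisticRegression().fit(df[["VIEWS", "RECENCY_DAYS"]], df["CLICKED"])
    score = model.score(df[["VIEWS", "RECENCY_DAYS"]], df["CLICKED"])
    return f"trained on {len(df)} rows, accuracy={score:.3f}"

# The procedure runs server-side; only the summary string comes back.
print(session.call("train_recommender"))
```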
  41. kurashiru’s Approach to Personalization

    We want users to enjoy kurashiru more by recommending the diverse content unique to kurashiru › We want users to enjoy eating and living more by increasing that diverse content › A new, even more fun phase lies ahead!