RedshiftML in Cookpad

Slides used at the Redshift ML Hands-on + re:Invent re:Cap Analytics event (https://awsbasics.connpass.com/event/230126/) held on 2021/12/11.

Reference URLs
- Queuery: https://techlife.cookpad.com/entry/2021/12/03/093000
- Bricolage: https://techlife.cookpad.com/entry/2015/06/27/154407

Yusuke Fukasawa

December 11, 2021

Transcript

  1. RedshiftML in Cookpad
     2021/12/11, Redshift ML Hands-on + re:Invent re:Cap Analytics
     Cookpad R&D, Yusuke Fukasawa
  2. Agenda
     • Self-introduction
     • How Cookpad uses RedshiftML
     • What Cookpad did to make the most of RedshiftML
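  3. Self-introduction
     • Yusuke Fukasawa (@fufufukakaka)
     • R&D department. Interested in tabular data, natural language processing, and recommendation.
     • Previous job: the company with the blue "R"
     • For the past year, also in charge of new-graduate engineer hiring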
  4. About Cookpad: https://speakerdeck.com/cookpadhr/cookpad-introduction
  5. R&D at Cookpad: https://research.cookpad.com/
  6. How Cookpad uses RedshiftML

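  7. Cookpad and RedshiftML
     • RedshiftML assumes tabular (structured) data.
     • However, most ML tasks at Cookpad target unstructured data (language, images), so the example cases do not apply as-is.
     • Can RedshiftML be operated on unstructured data, language in particular?
       ◦ Which kinds of tasks would be good candidates to move to RedshiftML?
     • As a first step, we verified whether the recipe category classification pipeline can be built on RedshiftML.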
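  8. About recipe category classification
     • At Cookpad, recipe data is classified in many different places.
     • e.g. the feature that filters the recipes saved in MY Folder
       ◦ called "おまかせ整理" (Omakase Seiri, roughly "leave-it-to-us organizing")
     • The recipe title is the main input, and there is one binary classifier per category.
  9. About recipe category classification
     • The tasks are easy but numerous, so RedshiftML, which lets us do everything in SQL alone, is attractive!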

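  10. Vectorization
     • A recipe title cannot be fed to RedshiftML as-is.
     • Strictly speaking, a string can be used as input, but there is no text-specific preprocessing.
       ◦ No word2vec vectorization, no bag-of-words, and so on
     • So the text is converted to tabular data beforehand.
     • Illustration: the recipe title "Super healthy! Sandwich" ↓ 0.23, 0.45, 0.67, 0.91, ...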
  11. Vectorization
     • Tokenize by MeCab (ipadic)
     • Vectorize by fasttext (trained on Cookpad recipe title texts)
     • 1 sentence → 100-dim vector
     • Write the result to CSV

        import gzip
        from typing import List

        import fasttext
        import numpy as np

        # `logger`, `ids`, `segmented_texts`, and `output_file_name` are defined elsewhere.
        logger.info("Load latest fasttext")
        embeddings_model = fasttext.FastText.load_model("fasttext.model")

        logger.info("Write vectors to CSV.GZ")
        with gzip.open(output_file_name, "wb") as file:
            for id, segmented_text in zip(ids, segmented_texts):
                tokens = segmented_text.split()
                # Sentence vector = mean of the token vectors.
                embed = np.zeros(embeddings_model.get_dimension())
                for token in tokens:
                    embed += embeddings_model[token]
                if tokens:  # avoid dividing by zero for an empty title
                    embed /= len(tokens)
                _embed: List[float] = embed.tolist()
                _embed.insert(0, id)  # first CSV column is the id
                file.write(",".join([str(dim) for dim in _embed]).encode("utf-8"))
                file.write("\n".encode("utf-8"))
  12. Vectorization
     • Create a table to hold the vectors.
     • The vectors have 100 dimensions here, so there are 100 columns.

        CREATE TABLE research.recipe_title_vectors (
            dim_0 float,
            dim_1 float,
            ...,
            dim_99 float,
            label int
        )
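The table definition above elides dim_2 through dim_98. As a minimal sketch, the full statement could be generated rather than typed by hand. The helper name `build_vector_table_ddl` is hypothetical and not from the talk; the column list simply mirrors the one shown on the slide.

```python
def build_vector_table_ddl(
    table: str = "research.recipe_title_vectors", dim: int = 100
) -> str:
    """Build a CREATE TABLE statement with dim_0 .. dim_{dim-1} float columns.

    Hypothetical helper: the trailing `label int` column mirrors the slide.
    """
    columns = [f"dim_{i} float" for i in range(dim)]
    columns.append("label int")
    body = ",\n    ".join(columns)
    return f"CREATE TABLE {table} (\n    {body}\n);"


ddl = build_vector_table_ddl()
print(ddl.splitlines()[0])  # first line of the generated statement
```

Changing `dim` keeps the table in step with the dimension of the fasttext model used for vectorization.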
  13. Vector table
     • Load the CSV prepared earlier into the table.
     • With this, text can be classified.

  14. Model training
     • Fill in the required parts of CREATE MODEL.
       ◦ It can also run in Auto mode with almost no settings written.
     • bread_prediction_examples
       ◦ A table of labeled training data: whether or not a recipe is a bread dish

        DROP MODEL IF EXISTS research.bread_prediction_model;

        CREATE MODEL research.bread_prediction_model
        from (
            select label, dim_0, dim_1, ..., dim_99
            from research.recipe_title_vectors as v
            join (
                select recipe_id, label
                from research.bread_prediction_examples
                where is_train = TRUE
            ) as id_label
            on id_label.recipe_id = v.recipe_id
        )
        TARGET label
        function redshiftml_fn_bread_prediction
        IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSystemAccess'
        MODEL_TYPE XGBOOST
        PROBLEM_TYPE BINARY_CLASSIFICATION
        OBJECTIVE 'Accuracy'
        SETTINGS (
            S3_BUCKET 'xxxxx'
        );
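Once training has finished, the function named in the CREATE MODEL statement above is called from ordinary SQL. The sketch below only builds such a query as a string. The helper name `build_inference_sql` is hypothetical, and it assumes the vector table exposes a `recipe_id` column, as the join in the CREATE MODEL statement implies.

```python
def build_inference_sql(
    function_name: str = "redshiftml_fn_bread_prediction",
    table: str = "research.recipe_title_vectors",
    dim: int = 100,
    id_column: str = "recipe_id",  # assumption: the id column of the vector table
) -> str:
    """Build a SELECT that applies the prediction function to every row."""
    args = ", ".join(f"dim_{i}" for i in range(dim))
    return (
        f"select {id_column}, {function_name}({args}) as prediction\n"
        f"from {table};"
    )


sql = build_inference_sql()
print(sql[:60])  # beginning of the generated query
```

The generated text can be handed to whatever SQL batch runner is in use; no database connection is needed to build it.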
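  15. Checking accuracy
     • The model currently running in production is word frequency + SVC, with an F1 of 90%.
     • RedshiftML (fasttext + xgboost) reaches an F1 of 95%.
     • Accuracy is about the same (slightly better), so we confirmed that replacing the production model should be fine.
     [Figure: svc vs. redshiftml]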
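  16. Batch diagram
     • Because the number of concurrent training jobs is limited to 4, training of the 9 models is scheduled in groups of 3.
     • Training runs monthly; inference runs daily.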
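The grouping described above (9 models, 3 per run, under a concurrency limit of 4) can be sketched as follows. The function and the model names are hypothetical placeholders; the talk does not show how the schedule is actually defined.

```python
from typing import List, Sequence

MAX_CONCURRENT_TRAININGS = 4  # limit stated on the slide


def split_into_batches(models: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the models into consecutive groups of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(models[i:i + batch_size]) for i in range(0, len(models), batch_size)]


# Placeholder names standing in for the 9 category classifiers.
models = [f"category_model_{i}" for i in range(9)]
batches = split_into_batches(models, batch_size=3)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Groups of 3 stay below the limit of 4, which leaves room for one extra training job started outside the schedule.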

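  17. What Cookpad did to make the most of RedshiftML
  18. What RedshiftML cannot do
     • Advanced models for unstructured data cannot be built.
     • For example, building a CNN for images, or a translation model or NER model for text
     • Because unstructured data must first be flattened into table form, what can be done is limited to building fairly simple models.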
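  19. What is good about RedshiftML
     • All the work of building a model is done in SQL.
     • Inference incurs no cost and is quite fast.
       ◦ Inference over 3 million recipes completes in about one minute.
     • It also runs the model parameter search for you.
     • Anyone can use a trained model by writing SQL.
       ◦ Permissions must be granted with a grant statement.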
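  20. Making RedshiftML work somehow
     • What we dealt with to run RedshiftML in production:
       1. We want to track offline-test metrics to check whether a model has degraded.
       2. Before running the inference batch, we want to be able to check whether this RedshiftML model can be used.
  21. Tracking RedshiftML metrics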

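  22. Checking the metrics of a RedshiftML model
     • With RedshiftML, SageMaker only returns the metric specified as the optimization objective.
     • For example, if Accuracy is passed to AutoPilot as the objective, that is the only metric you get (multiple metrics cannot be specified).
     [Figure annotation: "this"]
  23. Checking the metrics of a RedshiftML model
     • To solve this problem, we wrote a simple app ↓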

  24. MetricsTracer
     • Originally built to record and aggregate offline-test metrics for ordinary machine learning projects
     • Extended slightly for RedshiftML this time
     • NextJS + Chakra-UI
  25. MetricsTracer: RedshiftML model details

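  26. What MetricsTracer does
     • A page that lists all models
     • Details for each model
       ◦ Accumulates and displays metrics on the test data
       ◦ Detects when a model has been retrained, then computes and stores its metrics
       ◦ Shows sample SQL that demonstrates how to run inference
     • An internal tool meant to provide what RedshiftML alone cannot cover: tracking metrics from offline tests
       ◦ A batch that runs the RedshiftML offline tests runs alongside it.
     • So far it has been running without problems.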
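  27. Offline testing of RedshiftML
     1. Periodically query the model state (once every 2 to 3 hours).
     2. The model state is returned.
     3. If the model state was NOT READY last time and is now READY (i.e. it was retrained), submit a query that computes predictions on the test data.
     4. Compute metrics from the predictions.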
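Steps 3 and 4 above can be sketched in plain Python. This is an illustration under assumptions, not the MetricsTracer implementation: both function names are hypothetical, labels and predictions are assumed to be 0/1 integers, and any state other than "READY" is treated as not ready.

```python
from typing import Dict, Sequence


def became_ready(previous_state: str, current_state: str) -> bool:
    """True when the model was not READY at the last check and is READY now."""
    return previous_state != "READY" and current_state == "READY"


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A retrain was detected, so metrics are recomputed on the test predictions.
if became_ready("TRAINING", "READY"):
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

Computing several metrics from the raw predictions sidesteps the limitation that only the optimization objective comes back from training.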
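  28. Controlling the RedshiftML inference batch
  29. Checking the state of a RedshiftML model: background
     • Cookpad has a convenient platform for running SQL batches.
     • Through systems called Bricolage and kuroko2, running SQL as a batch job takes almost no effort.
     • This fits RedshiftML very well: everything from model training to inference is done just by putting it on Bricolage.
     • However, state management takes some extra work.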
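  30. Checking the state of a RedshiftML model
     • When a CREATE MODEL statement is executed, the query completes as soon as the training data has been sent to SageMaker, before it is known whether training will succeed.
     • The training job runs asynchronously, so it is a little difficult to express a dependency such as "run the full inference batch once the training job has finished".
  31. Checking the state of a RedshiftML model
     • If someone unexpectedly starts model training right before the full inference batch runs (which should be rare), and the batch has no status check, the full inference batch fails unexpectedly.
     • That is bad for peace of mind (and, of course, for operations).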

  32. Checking the state of a RedshiftML model
     • So we wrote an application that checks the state of the RedshiftML model and runs the inference batch only if it is OK.
     • "Redshiftml-Batch"
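The control flow described above (check the model state, run the inference batch only when it is OK) might look like the sketch below. It is not the Redshiftml-Batch code, which the transcript does not include. The function name and both callables are hypothetical: `fetch_state` stands in for a query against Redshift (for example, reading the model state from the output of SHOW MODEL), and `run_batch` stands in for submitting the inference SQL.

```python
import time
from typing import Callable


def run_if_model_ready(
    fetch_state: Callable[[], str],
    run_batch: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 60.0,
) -> bool:
    """Run the batch only when the model reports READY; return whether it ran."""
    for attempt in range(max_attempts):
        if fetch_state() == "READY":
            run_batch()
            return True
        # Any other state is treated as not ready: wait, then check again.
        if attempt < max_attempts - 1:
            time.sleep(wait_seconds)
    return False


# Stand-in callables; real ones would talk to Redshift.
states = iter(["TRAINING", "READY"])
executed = []
ran = run_if_model_ready(
    lambda: next(states),
    lambda: executed.append("inference"),
    wait_seconds=0.0,
)
print(ran, executed)
```

Passing the state lookup and the batch as callables keeps the gate testable without a database, and a `False` return lets the scheduler mark the run as skipped instead of failed.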

  33. What Redshiftml-Batch does
     • StatusChecker.py
  34. What Redshiftml-Batch does
     • BatchRunner.py
  35. What Redshiftml-Batch does
     • run.py
     [Diagram labels: s3, s3]
  36. Summary

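  37. RedshiftML in Cookpad: Summary
     • Main target at Cookpad: we tried applying RedshiftML to text data.
     • A separate feature-generation pipeline is needed, but the accuracy turned out better than expected.
     • On the other hand, we also found gaps that need filling, such as tracking offline-test metrics and checking the model state when the inference batch runs.
     • Cookpad implements internal tools to fill those gaps, aiming for quality that holds up in day-to-day operation.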