RedshiftML in Cookpad

Slides used at "Redshift ML Hands-on + re:Invent re:Cap Analytics Edition" (https://awsbasics.connpass.com/event/230126/), held on 2021/12/11.

Reference URLs
- Queuery: https://techlife.cookpad.com/entry/2021/12/03/093000
- Bricolage: https://techlife.cookpad.com/entry/2015/06/27/154407

Yusuke Fukasawa

December 11, 2021

Transcript

  1. RedshiftML in Cookpad
    2021/12/11 Redshift ML Hands-on + re:Invent re:Cap Analytics

    Cookpad R&D

    Yusuke Fukasawa

  2. Agenda
    ● Self-introduction
    ● How Cookpad uses RedshiftML
    ● Techniques for getting the most out of RedshiftML at Cookpad

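  3. Self-introduction
    ● Yusuke Fukasawa (@fufufukakaka)
    ● R&D department. Interested in tabular data, natural language processing, and recommendation
    ● Previously worked at "the company with the blue R"
    ● For the past year, also served as a recruiter for new-graduate engineers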

  4. About Cookpad
    https://speakerdeck.com/cookpadhr/cookpad-introduction

  5. R&D at Cookpad
    https://research.cookpad.com/

  6. How Cookpad uses RedshiftML

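  7. Cookpad and RedshiftML
    ● RedshiftML is designed for tabular (structured) data
    ● However, most ML tasks at Cookpad target unstructured data (language, images), so the standard example cases do not apply directly
    ● Can RedshiftML be operated on unstructured data, especially language?
    ○ Which kinds of tasks are good candidates to move to RedshiftML?
    ● As a first step, we tested whether a recipe category classification pipeline could be implemented on RedshiftML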

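  8. About recipe category classification
    ● At Cookpad, recipe data is classified in many different contexts
    ● e.g. a feature for filtering the recipes saved in the MY Folder
    ○ called "おまかせ整理" ("leave-it-to-us organizing")
    ● The recipe title is the main input, and there is one binary classifier per category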

  9. About recipe category classification
    ● The tasks are low in difficulty but large in number, so RedshiftML, which can be run end to end with SQL alone, is attractive!

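  10. Vectorizing
    Example recipe title: 超ヘルシー！サンドイッチ ("Super healthy! Sandwich")
    ● Recipe titles cannot be fed to RedshiftML as they are
    ● Strictly speaking, strings can be used as input, but there is no text-specific preprocessing
    ○ No word2vec vectorization, bag-of-words, etc.
    ● So the text is converted to tabular data in advance
    ● Illustration ↓
    0.23, 0.45, 0.67, 0.91, ...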

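  11. Vectorizing
    ● Tokenize with MeCab (ipadic)
    ● Vectorize with fasttext (trained on Cookpad recipe title texts)
    ● 1 sentence → 100-dim vector
    ● Write the result to CSV

    import gzip
    import logging
    from typing import List

    import fasttext
    import numpy as np

    logger = logging.getLogger(__name__)

    # ids, segmented_texts (MeCab output, space-separated tokens) and
    # output_file_name are defined elsewhere and are not shown on the slide.
    logger.info("Load latest fasttext")
    embeddings_model = fasttext.FastText.load_model("fasttext.model")

    logger.info("Write vectors to CSV.GZ")
    with gzip.open(output_file_name, "wb") as file:
        for recipe_id, segmented_text in zip(ids, segmented_texts):
            tokens = segmented_text.split()
            # Sentence vector = mean of the token vectors.
            embed = np.zeros(embeddings_model.get_dimension())
            for token in tokens:
                embed += embeddings_model[token]
            if tokens:  # avoid dividing by zero for an empty title
                embed /= len(tokens)
            # One CSV row: the recipe id followed by the vector dimensions.
            row: List[object] = [recipe_id, *embed.tolist()]
            file.write(",".join(str(value) for value in row).encode("utf-8"))
            file.write("\n".encode("utf-8"))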

  12. Vectorizing
    ● Create a table to hold the vectors
    ● The vectors are 100-dimensional this time, so there are 100 columns

    CREATE TABLE research.recipe_title_vectors (
        dim_0 float,
        dim_1 float,
        ...,
        dim_99 float,
        label int
    )
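
The 100 column definitions elided on the slide could be generated rather than typed by hand. A minimal Python sketch, assuming the column layout shown on this slide (dim_0 through dim_99 plus label); the helper name build_vector_table_ddl is hypothetical and not from the talk:

```python
def build_vector_table_ddl(table: str, dims: int) -> str:
    """Return a CREATE TABLE statement with dim_0 .. dim_{dims-1} plus label."""
    if dims <= 0:
        raise ValueError("dims must be positive")
    # One float column per embedding dimension, then the label column.
    columns = [f"    dim_{i} float" for i in range(dims)]
    columns.append("    label int")
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n)"


ddl = build_vector_table_ddl("research.recipe_title_vectors", 100)
print(ddl.splitlines()[0])
```

Generating the statement keeps the table definition in step with the embedding size if the dimension ever changes.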

  13. Vector table
    ● Load the CSV prepared earlier into the table
    ● With this, text can be classified

  14. Model training
    ● Fill in the required parts of CREATE MODEL
    ○ It can also run in Auto mode with almost no settings written
    ● bread_prediction_examples
    ● A table of training labels indicating whether each recipe is a bread dish

    DROP MODEL IF EXISTS research.bread_prediction_model;

    CREATE MODEL research.bread_prediction_model
    FROM (
        SELECT label, dim_0, dim_1, ..., dim_99
        FROM research.recipe_title_vectors AS v
        JOIN (
            SELECT recipe_id, label
            FROM research.bread_prediction_examples
            WHERE is_train = TRUE
        ) AS id_label ON id_label.recipe_id = v.recipe_id
    )
    TARGET label
    FUNCTION redshiftml_fn_bread_prediction
    IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSystemAccess'
    MODEL_TYPE XGBOOST
    PROBLEM_TYPE BINARY_CLASSIFICATION
    OBJECTIVE 'Accuracy'
    SETTINGS (
        S3_BUCKET 'xxxxx'
    );
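
Once training has finished, the function named in CREATE MODEL can be called from an ordinary SELECT. The sketch below builds such a query in Python. It assumes the vector table exposes a recipe_id column (as the join in the training query implies) and that the function takes the 100 dimensions in order; the helper build_inference_sql and the prediction alias are hypothetical.

```python
def build_inference_sql(function_name: str, table: str, id_column: str, dims: int) -> str:
    """Build a SELECT that applies the prediction function to every row."""
    if dims <= 0:
        raise ValueError("dims must be positive")
    # Pass the dimensions in the same order used for training.
    args = ", ".join(f"dim_{i}" for i in range(dims))
    return (
        f"SELECT {id_column}, {function_name}({args}) AS prediction\n"
        f"FROM {table}"
    )


sql = build_inference_sql(
    "redshiftml_fn_bread_prediction",
    "research.recipe_title_vectors",
    "recipe_id",
    100,
)
print(sql[:80])
```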

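  15. Checking accuracy
    ● The model currently running in production is word frequency + SVC, with an F1 of 90%
    ● In contrast, RedshiftML (fasttext + xgboost) reaches an F1 of 95%
    ● Accuracy is roughly the same (slightly better), confirming that replacing the model should not be a problem
    [Figure: results for svc and redshiftml]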
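
For reference, the F1 score quoted on this slide is the harmonic mean of precision and recall. A minimal Python sketch of that computation for a binary classifier follows; the toy labels are made up for illustration and are not Cookpad data.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for binary labels where 1 is the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
score = binary_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(round(score, 3))
```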

  16. Batch diagram
    ● Because at most 4 models can be trained concurrently, training of the 9 models is scheduled in groups of 3
    ● Training runs monthly, inference runs daily
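
The grouping described on this slide amounts to splitting the list of models into fixed-size batches that stay under the concurrency limit. A small Python sketch of that idea; the model names are placeholders, and how each batch is actually submitted is not shown in the talk.

```python
from typing import List, Sequence

MAX_CONCURRENT_TRAININGS = 4  # limit stated on the slide


def schedule_training(models: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split models into consecutive batches of at most batch_size."""
    if not 0 < batch_size <= MAX_CONCURRENT_TRAININGS:
        raise ValueError("batch_size must be between 1 and the concurrency limit")
    return [list(models[i:i + batch_size]) for i in range(0, len(models), batch_size)]


# Placeholder names for the 9 category models.
model_names = [f"model_{n}" for n in range(1, 10)]
batches = schedule_training(model_names, 3)
for index, batch in enumerate(batches, start=1):
    print(index, batch)
```

Using groups of 3 rather than 4 leaves one training slot free, which may be useful if another job needs to train at the same time.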

  17. Techniques for getting the most out of RedshiftML

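  18. What RedshiftML cannot do
    ● Advanced models for unstructured data cannot be built
    ● For example, building a CNN for images, or a translation model or NER model for text
    ● Because unstructured data must first be flattened into tabular form, what can be done is limited to building fairly simple models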

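  19. What is good about RedshiftML
    ● All model-building work is done in SQL
    ● Inference incurs no cost and is also quite fast
    ○ Inference over 3 million recipes completes in about one minute
    ● It also performs the model parameter search
    ● Anyone can use a trained model by writing SQL
    ○ Permissions must be opened up with a grant statement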

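  20. Making RedshiftML work in practice
    ● What we addressed in order to run RedshiftML in production
    1. We want to track offline-test metrics to check whether a model has degraded
    2. Before running an inference batch, we want to be able to check whether the RedshiftML model in question is usable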

  21. Tracking RedshiftML metrics

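  22. Checking RedshiftML model metrics
    ● With RedshiftML, SageMaker returns only the metric specified as the optimization objective
    ● For example, if Accuracy is passed to AutoPilot as the metric, only Accuracy is available (multiple metrics cannot be specified)
    [Figure annotation: "This"]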

  23. Checking RedshiftML model metrics
    ● To solve this problem, we wrote a simple app ↓

  24. MetricsTracer
    Originally built to record and aggregate offline-test metrics for ordinary machine learning projects.
    Extended slightly for RedshiftML this time.
    NextJS + Chakra-UI

  25. MetricsTracer: RedshiftML model details

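  26. What MetricsTracer does
    ● A page listing all models
    ● A detail view for each model
    ○ Accumulates and displays metrics on the test data
    ○ Detects when a model has been retrained, then computes and stores the metrics
    ○ Shows sample SQL illustrating how to run inference
    ● An in-house tool intended to provide the offline-test metric tracking that RedshiftML alone cannot cover
    ○ A batch that runs the RedshiftML offline tests runs alongside it
    ● So far it has been running without problems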

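  27. Offline tests for RedshiftML
    1. Periodically query the model state (once every 2 to 3 hours)
    2. The model state is returned
    3. If the model state was NOT READY last time and has newly become READY (retrain), submit a query that computes predictions on the test data
    4. Compute metrics from the predictions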
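
The retrain detection in step 3 boils down to noticing a transition into READY between two consecutive polls. A hedged Python sketch of that logic follows. The state lookup and the metric computation are injected as callables because the talk does not show how they are implemented (one option might be to parse the output of SHOW MODEL); the state name "TRAINING" is only illustrative, and all function names are hypothetical.

```python
from typing import Callable, Optional

READY = "READY"


def became_ready(previous: Optional[str], current: str) -> bool:
    """True only when the model moved from a non-READY state to READY."""
    return previous is not None and previous != READY and current == READY


def poll_once(
    previous: Optional[str],
    fetch_state: Callable[[], str],
    on_retrained: Callable[[], None],
) -> str:
    """Fetch the current state and run the offline test if a retrain finished."""
    current = fetch_state()
    if became_ready(previous, current):
        on_retrained()
    return current


# Simulated sequence of polls: the model is retrained once.
observed = iter(["READY", "TRAINING", "READY", "READY"])
evaluations = []
state: Optional[str] = None
for _ in range(4):
    state = poll_once(state, lambda: next(observed), lambda: evaluations.append("metrics computed"))
print(evaluations)
```

In this sketch the very first observation never triggers an evaluation, since there is no earlier state to compare against; whether that is the desired behavior is a design choice.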

  28. Controlling RedshiftML inference batches

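  29. Checking RedshiftML model state: background
    ● Cookpad has a convenient platform for running SQL batches
    ● Through the Bricolage and kuroko2 systems, running SQL as a batch job takes almost no effort
    ● This pairs very well with RedshiftML: everything from model training to inference can be handled just by putting it on Bricolage
    ● However, state management requires some extra work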

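  30. Checking RedshiftML model state
    ● When a CREATE MODEL statement is executed, the query completes as soon as the training data has been sent to SageMaker, before it is known whether training will succeed
    ● Because the training job runs asynchronously, it is somewhat hard to express a dependency such as "run the full inference batch once the training job finishes"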

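  31. Checking RedshiftML model state
    ● If someone unexpectedly kicks off model training just before the full inference batch runs (unlikely, but possible), and the batch has no status check, the full inference batch fails unexpectedly
    ● This is not good for peace of mind (or, of course, for operations)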

  32. Checking RedshiftML model state
    ● So we wrote an application that checks the RedshiftML model state and runs the inference batch only if the state is OK
    ● "Redshiftml-Batch"
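
The control flow described here, check the model state and run inference only when it is OK, can be sketched as a small gate. This is an illustrative Python sketch under stated assumptions, not the Redshiftml-Batch implementation: the function name, the retry parameters, and the injected callables are all hypothetical.

```python
import time
from typing import Callable


def run_when_ready(
    fetch_state: Callable[[], str],
    run_batch: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the batch once the model reports READY; give up after max_attempts."""
    for attempt in range(max_attempts):
        if fetch_state() == "READY":
            run_batch()
            return True
        if attempt < max_attempts - 1:
            sleep(wait_seconds)  # wait before checking the state again
    return False


# Simulated run: the model is still training on the first check.
states = iter(["TRAINING", "READY"])
runs = []
started = run_when_ready(
    lambda: next(states),
    lambda: runs.append("inference"),
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(started, runs)
```

Returning False instead of raising lets the caller decide whether a model that never became READY should fail the job or simply skip that day's inference.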

  33. What Redshiftml-Batch does
    ● StatusChecker.py

  34. What Redshiftml-Batch does
    ● BatchRunner.py

  35. What Redshiftml-Batch does
    ● run.py
    [Diagram labels: s3, s3]

  36. Summary

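  37. RedshiftML in Cookpad: Summary
    ● Main target at Cookpad: tried applying RedshiftML to text data
    ● A separate feature-generation pipeline is needed, but accuracy turned out better than expected
    ● On the other hand, gaps that need filling also became clear, such as tracking offline-test metrics and checking model state when running inference batches
    ● Cookpad implements in-house tools to fill those gaps, aiming to secure quality that holds up in day-to-day operations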