RedshiftML in Cookpad

Slides used at "Redshift ML Hands-on + re:Invent re:Cap Analytics Edition" (https://awsbasics.connpass.com/event/230126/), held on 2021/12/11.

Reference URLs
- Queuery: https://techlife.cookpad.com/entry/2021/12/03/093000
- Bricolage: https://techlife.cookpad.com/entry/2015/06/27/154407

Yusuke Fukasawa

December 11, 2021
Transcript

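  1. Cookpad and RedshiftML
     • RedshiftML expects tabular (structured) data.
     • However, most ML tasks at Cookpad target unstructured data (language, images), so the example cases do not apply as-is.
     • Can RedshiftML be operated on unstructured data, especially language?
       ◦ Which kinds of tasks are good candidates to move to RedshiftML?
     • As a first step, we verified whether a recipe category classification pipeline can be built on RedshiftML.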
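  2. Vectorizing
     Example recipe title: "Super healthy! Sandwich"
     • A recipe title cannot be used as RedshiftML input as-is.
     • Strictly speaking, strings are accepted as input, but there is no text-specific preprocessing.
       ◦ No word2vec vectorization, no bag-of-words, and so on.
     • So the text is converted to tabular data in advance.
     • Illustration ↓ 0.23, 0.45, 0.67, 0.91, ...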
  3. Vectorizing
     • Tokenize with MeCab (ipadic)
     • Vectorize with fasttext (trained on Cookpad recipe title texts)
     • 1 sentence → 100-dim vector
     • Write the result to CSV

        import gzip
        import logging
        from typing import List

        import fasttext
        import numpy as np

        logger = logging.getLogger(__name__)

        # ids, segmented_texts and output_file_name are prepared earlier in the script.
        logger.info("Load latest fasttext")
        embeddings_model = fasttext.FastText.load_model("fasttext.model")
        logger.info("Write vectors to CSV.GZ")
        with gzip.open(output_file_name, "wb") as file:
            for id, segmented_text in zip(ids, segmented_texts):
                tokens = segmented_text.split()
                # Sentence vector = mean of the token vectors.
                embed = np.zeros(embeddings_model.get_dimension())
                for token in tokens:
                    embed += embeddings_model[token]
                if tokens:  # avoid dividing by zero for an empty title
                    embed /= len(tokens)
                _embed: List[float] = embed.tolist()
                _embed.insert(0, id)
                file.write(",".join([str(dim) for dim in _embed]).encode("utf-8"))
                file.write("\n".encode("utf-8"))
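The averaging step on this slide can be tried without a trained fasttext model. The following is a minimal sketch, assuming a toy dictionary in place of the real embeddings; `average_embedding`, `to_csv_row`, and `toy_vectors` are hypothetical names introduced only for illustration, and treating unknown tokens as zero vectors is an assumption of this sketch, not the behavior of the slide's code.

```python
from typing import Dict, List, Sequence


def average_embedding(
    tokens: Sequence[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Return the mean of the token vectors, or a zero vector for no tokens."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        # Assumption of this sketch: tokens missing from the toy lookup
        # contribute a zero vector.
        vec = vectors.get(token, [0.0] * dim)
        for i, value in enumerate(vec):
            total[i] += value
    return [value / len(tokens) for value in total]


def to_csv_row(recipe_id: int, embedding: List[float]) -> str:
    """Format one CSV line: the id first, then one column per dimension."""
    return ",".join([str(recipe_id)] + [str(value) for value in embedding])


# Toy 2-dimensional vectors standing in for the 100-dimensional model.
toy_vectors = {"healthy": [1.0, 0.0], "sandwich": [0.0, 1.0]}
row = to_csv_row(42, average_embedding(["healthy", "sandwich"], toy_vectors, dim=2))
```

Each CSV row then has the layout the training table needs: an id column followed by one numeric column per dimension.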
  4. Model training
     • Fill in the required parts of CREATE MODEL.
       ◦ It can also run in Auto mode with almost no settings written.
     • bread_prediction_examples
     • A table collecting the labeled training data on whether a recipe is a bread dish or not.

        DROP MODEL IF EXISTS research.bread_prediction_model;
        CREATE MODEL research.bread_prediction_model
        from (
          select label, dim_0, dim_1, ..., dim_99
          from research.recipe_title_vectors as v
          join (
            select recipe_id, label
            from research.bread_prediction_examples
            where is_train = TRUE
          ) as id_label on id_label.recipe_id = v.recipe_id
        )
        TARGET label
        function redshiftml_fn_bread_prediction
        IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSystemAccess'
        MODEL_TYPE XGBOOST
        PROBLEM_TYPE BINARY_CLASSIFICATION
        OBJECTIVE 'Accuracy'
        SETTINGS ( S3_BUCKET 'xxxxx' );
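Once training finishes, the function named in CREATE MODEL is called from SQL with the feature columns as its arguments. Writing out 100 column names by hand is tedious, so the sketch below generates them. It is only an illustration: `feature_columns` and `build_inference_sql` are hypothetical helpers, and the `predicted_label` alias is an assumption; the function, table, and column names are taken from the slide.

```python
from typing import List


def feature_columns(dim: int = 100) -> List[str]:
    """Column names dim_0 .. dim_{dim-1}, matching the vector table layout."""
    return [f"dim_{i}" for i in range(dim)]


def build_inference_sql(
    function_name: str, table: str, id_column: str, dim: int = 100
) -> str:
    """Build a query that applies the prediction function to every row."""
    args = ", ".join(feature_columns(dim))
    return (
        f"select {id_column}, {function_name}({args}) as predicted_label "
        f"from {table};"
    )


inference_sql = build_inference_sql(
    "redshiftml_fn_bread_prediction",
    "research.recipe_title_vectors",
    "recipe_id",
)
```

The same `feature_columns` list could also fill in the select list of the training query, so the training and inference columns stay in the same order.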
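  5. Checking accuracy
     • The model currently running in production uses word frequency + SVC, with an F1 of 90%.
     • In comparison, RedshiftML (fasttext + xgboost) reaches an F1 of 95%.
     • Accuracy is almost unchanged (slightly better), so we confirmed that replacing the current model should not cause problems.
     (Figure labels: svc, redshiftml)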
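  6. What is good about RedshiftML
     • All the work of building a model is done in SQL.
     • Inference incurs no extra cost and is quite fast.
       ◦ Inference over 3 million recipes completes in about one minute.
     • It also runs the model parameter search for you.
     • Anyone can use a trained model by writing SQL.
       ◦ Permissions have to be opened up with a grant statement.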
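  7. Checking RedshiftML model metrics
     • With RedshiftML, SageMaker returns only the metric specified as the optimization target.
     • For example, if the job is submitted to AutoPilot with Accuracy as the metric, that is the only value you get (multiple metrics cannot be specified).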
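  8. What MetricsTracer does
     • A page listing every model
     • Details for each model
       ◦ Stores and displays metrics computed on test data.
       ◦ Detects when a model has been retrained, then computes and stores its metrics.
       ◦ Shows a sample SQL query illustrating how to run inference.
     • An internal tool built to provide "metric tracking in offline tests", which RedshiftML alone does not cover.
       ◦ A batch that runs the RedshiftML offline tests runs alongside it.
     • So far it has been running without problems.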
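  9. RedshiftML offline test
     1. Query the model state periodically (once every 2 to 3 hours).
     2. The model state is returned.
     3. If the state was NOT READY last time and is now READY (retrain), submit a query that computes predictions on the test data.
     4. Compute metrics from the predictions.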
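Steps 3 and 4 of this flow can be sketched in plain Python. This is a minimal illustration, not MetricsTracer's actual code: `became_ready` and `f1_score` are hypothetical names, and treating any state other than READY as "not ready" is an assumption about how the previous state is recorded.

```python
from typing import Sequence


def became_ready(previous_state: str, current_state: str) -> bool:
    """True only on the transition into READY, i.e. right after a retrain."""
    # Assumption: every state other than "READY" counts as not ready.
    return previous_state != "READY" and current_state == "READY"


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for 0/1 labels, computed from paired true/predicted values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Run the offline test only when the model has just become READY.
if became_ready("TRAINING", "READY"):
    score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Checking the transition rather than the current state alone keeps the test from being rerun on every poll while the model stays READY.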
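  10. Checking RedshiftML model state: background
     • Cookpad has a convenient platform for running SQL batches.
     • Through the Bricolage and kuroko2 systems, running SQL as a batch job takes almost no effort.
     • This fits RedshiftML very well: everything from model training to inference is done just by putting it on Bricolage.
     • State management, however, needs some extra work.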
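  11. Checking RedshiftML model state
     • When a CREATE MODEL statement is executed, the query completes as soon as the training data has been sent to SageMaker, before training succeeds or fails.
     • Because the training job runs asynchronously, it is somewhat difficult to set up a dependency such as "run the full inference batch once the training job has finished".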
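One common way to express such a dependency is to poll the model state and start the inference batch only once it reports READY. The sketch below shows that idea under stated assumptions: `wait_until_ready` and the injected `fetch_state` callable are hypothetical (the state might be read from Redshift's SHOW MODEL output, for instance), and the "TRAINING" and "FAILED" state names are assumptions; only READY appears in the slides.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_state: Callable[[], str],
    poll_seconds: float,
    max_polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the model state; return True once READY, False on failure or timeout."""
    for attempt in range(max_polls):
        state = fetch_state()
        if state == "READY":
            return True
        if state == "FAILED":  # assumed name for a failed training job
            return False
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False


# Simulated state sequence standing in for real state lookups.
states = iter(["TRAINING", "TRAINING", "READY"])
ready = wait_until_ready(
    lambda: next(states), poll_seconds=0, max_polls=5, sleep=lambda _: None
)
```

A batch job could call this before the inference step and skip the run when it returns False. Injecting `sleep` and `fetch_state` lets the loop be tested without waiting or connecting to a cluster.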
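  12. RedshiftML in Cookpad: Summary
     • Main target at Cookpad: we tried applying RedshiftML to text data.
     • A separate feature-generation pipeline is needed, but the accuracy obtained was better than expected.
     • We also found points that need supplementing, such as metric tracking in offline tests and state checks when running inference batches.
     • At Cookpad, we are implementing internal tools to cover those gaps, aiming for quality that can withstand day-to-day operation.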