
Building Machine Learning Pipelines with Cloud Composer

2kyym
August 21, 2020

This deck is the talk given at Discovery DataScience Meet up (DsDS) #0, with a few slides added.
https://scramble.connpass.com/event/171602/


Transcript

  1. Building Machine Learning Pipelines with Cloud Composer. DsDS #0 / 2020.08.21, Masao Tsukiyama

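  2. Masao Tsukiyama (築山将央) | Mobility Technologies Co., Ltd. | MLOps Engineer
       - Joined DeNA as a new graduate in [year and month missing].
       - Seconded to Mobility Technologies Co., Ltd.; member of ML Engineering Group 1.
       - As a student, did research in computer vision while doing web development at startups and elsewhere.
       - From before the secondment until now, has worked on developing and operating ML systems in the automotive field.
       - Focused on MLOps, with a particular interest in cloud-native architecture and automation.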

  3. 

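  4. New taxi app "GO": release planned for [month missing]

  5. DsDS theme: MLOps, a fairly broad area

  6. DS work efficiency, model accuracy assurance, infrastructure setup, automation, CI/CD, data platform, etc…

  7. DS work efficiency, model accuracy assurance, infrastructure setup, automation, CI/CD, data platform, etc… (with the areas covered in this talk highlighted)

  8. Topics covered in this talk
       - Cloud Composer tutorial
       - Policy for developing ML pipelines on Composer
       - CI and automated deployment in real-world operation and development

  9. Topics covered in this talk
       - Cloud Composer tutorial
       - Policy for developing ML pipelines on Composer
       - CI and automated deployment in real-world operation and development
     The slides will be uploaded, so please take a look later!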

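  10. Driver-facing feature "お客様探索ナビ" (customer-search navigation)

  11. Predicts demand and proposes the optimal operating route

  12. The machine learning pipeline behind "お客様探索ナビ"

  13. Automating everything from preprocessing to deployment with Cloud Composer

  14. Deployment pipeline overview. Note: the arrows show task dependencies, not data flow.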

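  15. Cloud Composer
       - Built on Airflow, the go-to workflow engine
       - Pipelines are written in Python
       - Easy to integrate with other GCP services
       - Cluster operation and log handling are managed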
  16.  Composer Tutorial

  17. A minimal DAG: define the pipeline (DAG), create tasks from operators, then define the task execution order.

        import datetime
        import logging

        from airflow.models import DAG
        from airflow.operators.bash_operator import BashOperator
        from airflow.operators.python_operator import PythonOperator


        def greeting():
            logging.info("Hello World!")


        DEFAULT_ARGS = {
            "start_date": datetime.datetime(2018, 1, 1),
            "retries": 5,
        }

        # Pipeline (DAG) definition
        dag = DAG(
            dag_id="test_dag",
            schedule_interval=datetime.timedelta(days=1),
            default_args=DEFAULT_ARGS,
        )

        # Create tasks from operators
        hello_python = PythonOperator(task_id="hello", python_callable=greeting, dag=dag)
        goodbye_bash = BashOperator(task_id="bye", bash_command="echo Goodbye.", dag=dag)

        # Define the task execution order
        hello_python >> goodbye_bash

  18. For a very simple machine learning pipeline:

        profiler_task >> [preprocess_a_task, preprocess_b_task] >> trainer_task
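
The chained `>>` expression in slide 18 fans out from one task to a list and then back in to a single task. The following is a minimal sketch of how such chaining can record dependencies. The `Task` class is a hypothetical toy written for illustration, not Airflow's implementation, and the task ids are made up.

```python
class Task:
    """Toy stand-in for an Airflow task; it only records downstream task ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `self >> other`: other may be a single task or a list of tasks.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.append(target.task_id)
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, others):
        # `[t1, t2] >> self`: lists do not define `>>`, so Python falls back here.
        for task in others:
            task >> self
        return self


profiler_task = Task("profiler")
preprocess_a_task = Task("preprocess_a")
preprocess_b_task = Task("preprocess_b")
trainer_task = Task("trainer")

profiler_task >> [preprocess_a_task, preprocess_b_task] >> trainer_task

for task in (profiler_task, preprocess_a_task, preprocess_b_task, trainer_task):
    print(task.task_id, "->", task.downstream)
```

In this sketch both preprocessing tasks depend on the profiler, and the trainer depends on both of them, which is the shape the slide describes.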

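  19. Commonly used Operators (an Operator is the unit of a task)
       - BashOperator: runs a shell command. gcloud and kubectl are available.
       - PythonOperator: runs a Python function inside the Composer environment.
       - PythonVirtualenvOperator: runs a function in a Python virtual environment.
       - GKEPodOperator: runs an arbitrary container on a specified GKE cluster.
       - BigQueryOperator: issues a BigQuery job.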
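  20. How to pass information to tasks
      Variables
       - Something like environment variables, shared across the whole Composer environment.
       - Store static values here. Take care not to reuse the same key in different DAGs.
       - Examples: GCS input/output paths, GKE cluster name, GCR image digest.
      XCom (cross communication)
       - A dictionary shared only within a DagRun and handed from one task to another.
       - Use it for values that change on every run. It is accessible from the TaskInstance object.
       - Examples: the aggregation period of the training data, a hash value appended to file names.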
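  21. Adding and referencing a Variable

        # Adding a Variable.
        # c is assumed to be an Invoke-style context that exposes run();
        # PROJECT, COMPOSER_NAME and LOCATION are constants defined elsewhere.
        def set_env_variables(c, key, value):
            c.run(
                f"gcloud --project {PROJECT} composer environments run {COMPOSER_NAME} "
                f"--location {LOCATION} variables -- -s {key} {value}"
            )

        # Referencing a Variable.
        from airflow.models import Variable

        VALUE = Variable.get(key)
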
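
Slide 21 builds the gcloud command as a single string. As a hedged sketch, the same command can be assembled as an argument list so that keys or values containing spaces do not need manual quoting. The helper name `build_set_variable_command` and the sample value are hypothetical; the subcommand layout (`variables -- -s KEY VALUE`) is taken from the slide.

```python
import shlex


def build_set_variable_command(project, composer_name, location, key, value):
    """Return the argv list for setting one Airflow Variable through gcloud."""
    return [
        "gcloud", "--project", project,
        "composer", "environments", "run", composer_name,
        "--location", location,
        "variables", "--", "-s", key, value,
    ]


command = build_set_variable_command(
    project="dummy-gcp",
    composer_name="dummy-composer",
    location="asia-northeast1",
    key="bucket_name",
    value="gs://dummy bucket",  # contains a space on purpose
)

# Print a shell-safe rendering; the list itself could be passed to subprocess.run.
print(" ".join(shlex.quote(part) for part in command))
```

Passing the list form to a process runner avoids a shell round trip, which is one way to keep unusual values from being split into several arguments.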
  22. Variables can also be viewed and edited from the Airflow web UI

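  23. Example of adding an XCom value. When a function runs in a PythonOperator, the execution date and the task instance are available through the keyword arguments (ti is short for task_instance). Push the value with task_instance.xcom_push in the first task.

        from datetime import timedelta


        def create_args(**kwargs):
            execution_date = kwargs["execution_date"]
            preprocess_start_datetime = execution_date - timedelta(days=PREPROCESS_DIFF)
            kwargs["ti"].xcom_push(
                key="preprocess_start_datetime",
                value=preprocess_start_datetime.strftime("%Y-%m-%dT%H:%M:%S"),
            )


        # On Airflow 1.10.x, PythonOperator also needs provide_context=True
        # for kwargs to carry execution_date and ti.
        create_args_task = PythonOperator(
            task_id="create_args", python_callable=create_args, dag=dag
        )
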
  24. Example of referencing an XCom value. As with the push, the task instance is available through the keyword arguments (ti is short for task_instance); read the value with task_instance.xcom_pull.

        def preprocess(**kwargs):
            preprocess_start_datetime = kwargs["ti"].xcom_pull(key="preprocess_start_datetime")
            # ……

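
The push and pull functions from slides 23 and 24 can be exercised without a running Airflow environment. This is a minimal sketch under stated assumptions: `FakeTaskInstance` is a hypothetical stand-in backed by a plain dict, and `PREPROCESS_DIFF = 7` is an assumed value that the slides do not give.

```python
from datetime import datetime, timedelta

PREPROCESS_DIFF = 7  # assumed look-back window in days (not given in the slides)


class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, backed by a dict."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, key):
        return self._store.get(key)


def create_args(**kwargs):
    # Derive the preprocessing start from the run's execution date and push it.
    start = kwargs["execution_date"] - timedelta(days=PREPROCESS_DIFF)
    kwargs["ti"].xcom_push(
        key="preprocess_start_datetime",
        value=start.strftime("%Y-%m-%dT%H:%M:%S"),
    )


def preprocess(**kwargs):
    # A later task reads the value pushed by create_args.
    return kwargs["ti"].xcom_pull(key="preprocess_start_datetime")


ti = FakeTaskInstance()
context = {"execution_date": datetime(2020, 8, 21), "ti": ti}
create_args(**context)
print(preprocess(**context))  # 2020-08-14T00:00:00
```

Because the value is derived from `execution_date`, each run gets its own start timestamp, which is the kind of per-run value the slides assign to XCom rather than Variables.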
  25. Extracting the common parts and writing one Operator wrapper per task keeps the DAG file tidy. main_dag.py imports the wrapper modules such as preprocess_operator.py.

      main_dag.py

        create_args_task = PythonOperator(
            task_id="create_args", python_callable=create_args, dag=dag
        )
        # Note: these calls pass fewer arguments than the wrapper signature shown below.
        profiler_task = profiler_operator.create_operator(dag)
        preprocess_a_task = preprocess_operator.create_operator(dag, "a")
        preprocess_b_task = preprocess_operator.create_operator(dag, "b")
        train_task = train_operator.create_operator(dag)

        create_args_task >> profiler_task >> [
            preprocess_a_task,
            preprocess_b_task,
        ] >> train_task

      preprocess_operator.py

        def create_operator(dag, task_id, create_args_task_id):
            container_arguments = [
                "--bucket_name",
                BUCKET_NAME,
                "preprocess",
                "--start_datetime",
                "{{ ti.xcom_pull(task_ids='"
                + create_args_task_id
                + "', key='preprocess_start_datetime') }}",
                "--bq_dataset_name",
                BQ_DATASET_NAME,
                "--gcs_path",
                GCS_PATH,
            ]
            operator = GKEPodOperator(
                task_id=task_id,
                project_id=PROJECT,
                location=CLUSTER_LOCATION,
                cluster_name=CLUSTER_NAME,
                namespace="default",
                image=IMAGE,
                arguments=container_arguments,
                dag=dag,
            )
            return operator

  26. Let's build a simple machine learning pipeline

  27. Training with PythonOperator? Preprocessing with BigQueryOperator?

  28. Basic policy: do everything with GKEPodOperator

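  29. Do everything with GKEPodOperator
      Reasons
       - We want the ML implementation and the pipeline implementation to be as independent as possible.
       - We do not want the data scientists to have to think about the pipeline-side implementation.
       - The Composer environment is poorly suited to complex processing, for example because of the PythonOperator constraints described later.
      Concretely
       - Create a separate repository and manage the ML algorithm implementation there.
       - Enclose preprocessing, training, and other miscellaneous processing entirely in a Docker image.
       - Even when BigQuery is used, enclose the SQL and the job submission logic in that same image.
       - This approach also pays off for analysis, experiments, and local development.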
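  30. Constraints of PythonOperator
      PythonOperator
       - Probably the best choice for simple processing that needs no PyPI packages.
       - Passing values with Variable get/set is very easy, and XCom is easy to use as well.
       - However, processing that needs PyPI packages pollutes the Composer environment.
       - Reproducibility is poor, for example because of conflicts with Airflow's own package dependencies.
      PythonVirtualenvOperator
       - It looked like the best of both worlds but turned out to be the worst of both.
       - Neither Variables nor XCom can be used, so it is less convenient than GKEPodOperator.
       - Everything has to be packed into a single callable, which makes it very hard to use.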
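  31. Packages have to be installed through the web UI or gcloud, which pollutes the whole environment. Note: they are also very likely to conflict with Airflow's dependencies. Note: when a conflict occurs, the Airflow worker dies and DAGs stop running.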

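  32. Caveats when doing everything with GKEPodOperator
      Drawbacks
       - Variables cannot be referenced from the code at task execution time.
       - xcom_push and xcom_pull cannot be used either.
      Solutions
       - Pass everything in through the container arguments of GKEPodOperator.
       - Installing the gcloud SDK and kubectl in the Dockerfile makes almost anything possible.
       - When permissions for another environment are needed, encrypt the service account key file and pass it in.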
  33. Passing arguments to GKEPodOperator. Make input/output paths, the aggregation period, and everything else controllable through command-line arguments. Note: outside the scope of this talk, but Python Fire or Invoke (Fabric) makes this easy.

        container_arguments = [
            "preprocess",
            "--bucket_name",
            BUCKET_NAME,
            "--start_datetime",
            PREPROCESS_START_DATETIME,
            "--bq_dataset_name",
            BQ_DATASET_NAME,
            "--gcs_export_path",
            GCS_EXPORT_PATH,
        ]

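
On the container side, the entry point has to accept the arguments listed in slide 33. The slide suggests Python Fire or Invoke; the sketch below swaps in the standard-library `argparse` so that it runs without extra packages. The program name `runner` and all sample values are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a CLI with a `preprocess` subcommand mirroring the slide's flags."""
    parser = argparse.ArgumentParser(prog="runner")
    subparsers = parser.add_subparsers(dest="command", required=True)

    preprocess = subparsers.add_parser("preprocess")
    preprocess.add_argument("--bucket_name", required=True)
    preprocess.add_argument("--start_datetime", required=True)
    preprocess.add_argument("--bq_dataset_name", required=True)
    preprocess.add_argument("--gcs_export_path", required=True)
    return parser


# The same list shape that would be handed to the operator's arguments parameter.
container_arguments = [
    "preprocess",
    "--bucket_name", "dummy-bucket",
    "--start_datetime", "2020-08-14T00:00:00",
    "--bq_dataset_name", "dummy_dataset",
    "--gcs_export_path", "gs://dummy-bucket/export",
]

args = build_parser().parse_args(container_arguments)
print(args.command, args.bucket_name, args.start_datetime)
```

Keeping every input on the command line means the same image can be run locally with hand-written arguments and in the pipeline with values taken from Variables or XCom.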
  34. Passing arguments to GKEPodOperator: fetch the required Variables in advance.

        from airflow.models import Variable

        BUCKET_NAME = Variable.get("bucket_name")
        CLUSTER_NAME = Variable.get("cluster_name")
        CLUSTER_LOCATION = Variable.get("cluster_location")
        IMAGE = f"gcr.io/{PROJECT}/test-image@{Variable.get('test_image_digest')}"
        BQ_PROFILE_DATASET_NAME = Variable.get("bq_dataset_name")

  35. XCom values can be referenced with a Jinja template, because the arguments parameter is subject to template substitution. Note: strings wrapped in curly braces are substituted just before the task runs.

        # Airflow 1.10 import path for the operator
        from airflow.contrib.operators.gcp_container_operator import GKEPodOperator


        def create_operator(dag, task_id, create_args_task_id):
            container_arguments = [
                "preprocess",
                "--bucket_name",
                BUCKET_NAME,
                "--start_datetime",
                # Jinja template: resolved to the XCom value at run time
                "{{ ti.xcom_pull(task_ids='"
                + create_args_task_id
                + "', key='preprocess_start_datetime') }}",
                "--bq_dataset_name",
                BQ_DATASET_NAME,
                "--gcs_path",
                GCS_PATH,
            ]
            operator = GKEPodOperator(
                task_id=task_id,
                project_id=PROJECT,
                location=CLUSTER_LOCATION,
                cluster_name=CLUSTER_NAME,
                namespace="default",
                image=IMAGE,
                arguments=container_arguments,
                dag=dag,
            )
            return operator

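
The template string in slide 35 is built by concatenation, which is easy to get subtly wrong. As a small sketch, a helper can produce the same Jinja expression with an f-string. The helper name `xcom_template` is hypothetical; the task id and key are the ones used in the slides.

```python
def xcom_template(task_id, key):
    """Build the Jinja expression that pulls one XCom value at run time."""
    # Doubled braces in an f-string produce literal "{{" and "}}".
    return f"{{{{ ti.xcom_pull(task_ids='{task_id}', key='{key}') }}}}"


template = xcom_template("create_args", "preprocess_start_datetime")
print(template)
```

The printed string is only a template. Nothing is pulled when the DAG file is parsed; the substitution happens when the task instance is rendered.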
  36.  https://cloud.google.com/composer/docs/how-to/using/writing-dags

  37. A real-world operation example

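  38. Automatically reflecting ML algorithm updates
      Goal
       - Ideally, changes in the ML repository require no changes on the pipeline side.
       - All that should be needed is to update the image that GKEPodOperator runs.
      Concretely
       - When a change is merged into the master branch of the ML repository, CircleCI builds the image automatically.
       - The digest of the built GCR image is written to an Airflow Variable through gcloud composer.
       - The update is picked up on its own at the next pipeline run.
       - When command-line arguments change or features are added, the pipeline has to be modified after all.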
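
The mechanism in slide 38 relies on the image being referenced by digest, with the digest stored in a Variable. The sketch below imitates that with a plain dict standing in for the Variables store. The helper `image_reference`, the dict, and the digest values are hypothetical, and the `sha256:` prefix check is an assumption about the digest format.

```python
def image_reference(project, name, digest):
    """Return a digest-pinned image reference of the form gcr.io/<project>/<name>@<digest>."""
    if not digest.startswith("sha256:"):
        raise ValueError(f"unexpected digest format: {digest!r}")
    return f"gcr.io/{project}/{name}@{digest}"


# Toy stand-in for the Airflow Variables store.
variables = {"test_image_digest": "sha256:" + "a" * 64}
before = image_reference("dummy-gcp", "test-image", variables["test_image_digest"])

# What the CI job would do after building a new image: overwrite the digest.
variables["test_image_digest"] = "sha256:" + "b" * 64
after = image_reference("dummy-gcp", "test-image", variables["test_image_digest"])

print(before)
print(after)
```

Since the pipeline code only reads the Variable, replacing the digest is enough to switch images without touching the DAG definition.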
  39. The CircleCI config.yml placed in the ML repository. The last command registers the GCR image digest in Variables.

        build_dev:
          docker:
            - image: google/cloud-sdk
          environment:
            GCP_PROJECT: dummy-gcp
            COMPOSER_NAME: dummy-composer
            IMAGE_TAG: dummy-tag
          steps:
            - checkout
            - setup_remote_docker:
                docker_layer_caching: true
            - attach_workspace:
                at: .
            - run:
                name: build
                command: &build |
                  TAG=gcr.io/${GCP_PROJECT}/test-image:${IMAGE_TAG}
                  docker build -t ${TAG} -f images/runner/Dockerfile .
                  docker push ${TAG}
                  IMAGE_DIGEST=$(gcloud container images describe gcr.io/${GCP_PROJECT}/test-image:${IMAGE_TAG} --format='value(image_summary.digest)')
                  gcloud composer environments run ${COMPOSER_NAME} --location asia-northeast1 variables -- -s pipeline_image_digest ${IMAGE_DIGEST}

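  40. The inference pipeline for "お客様探索ナビ". It is extremely simple: it just runs an inference batch every [number missing] minutes. The inference image is the image used by the deployment pipeline with the model's pickle added.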

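  41. Model evaluation and automated deployment
      Automated model deployment
       - It is enough to overwrite the Variable that holds the inference image digest.
       - In other words, it is almost the same as updating the deployment pipeline image from CircleCI.
       - The final task compares the new and old models and overwrites the Variable.
      Model evaluation
       - Details are omitted, but the final model evaluation is done with a taxi driver simulator.
       - Batch inference and simulation run in parallel for both the new model and the existing model.
       - Update criterion: the new model must outperform the existing model for two consecutive weeks.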
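
The update criterion in slide 41 (outperform the existing model for two consecutive weeks) can be expressed as a streak check over evaluation scores. This is a sketch under stated assumptions: the function name `should_promote`, one evaluation per day, and "higher score is better" are all assumptions, since the slides do not describe the metric or the evaluation cadence.

```python
def should_promote(new_scores, current_scores, required_streak):
    """Return True if the new model beat the current one in each of the
    last `required_streak` evaluations (higher score is assumed better)."""
    if required_streak < 1:
        raise ValueError("required_streak must be at least 1")
    if len(new_scores) != len(current_scores):
        raise ValueError("score histories must have the same length")
    if len(new_scores) < required_streak:
        return False  # not enough history yet
    recent = zip(new_scores[-required_streak:], current_scores[-required_streak:])
    return all(new > current for new, current in recent)


# Assumption: one evaluation per day, so two weeks correspond to 14 evaluations.
new_model = [0.70] + [0.82] * 14
current_model = [0.75] + [0.80] * 14

print(should_promote(new_model, current_model, required_streak=14))  # True
print(should_promote(new_model, current_model, required_streak=15))  # False
```

In a pipeline shaped like the one in the slides, a check of this kind would sit in the final task, which overwrites the image digest Variable only when the criterion holds.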
  42. Build the inference image; run batch inference in parallel with the new and existing models; decide on deployment and overwrite the Variable.

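  43. Cost considerations
       - The minimum resources Composer demands from its cluster are on the heavy side.
       - Preparing a Composer environment per project gets fairly expensive.
       - Unless the products are completely different, let the DAGs coexist in the same environment.
       - Use AI Platform Jobs to cut the fixed cost of GKE.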

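  44. Summary

  45. Summary
       - Cloud Composer is a workflow engine that is easy to build, operate, and implement on.
       - The pipeline is only the outer shell. Leave all of the actual processing to GKE and AI Platform.
       - Ideally, updates to the ML repository are picked up automatically and the pipeline implementation needs no changes at all.
       - Used well, Composer makes it possible to automate and stably operate everything from preprocessing to production deployment of the model.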