Google_Cloud_for_Machine_Learning_-_Sungmin_Han__Cloud_Next_.pdf

Sungmin Han
October 15, 2023

Transcript

  1. Google Cloud for Machine Learning
    Sungmin Han
    MLOps lead, Riiid
    Google Developers Expert for ML

  2. 01 GPUs in Google Cloud
    02 Cloud TPU

  3. GPUs in Google Cloud
    Training resources

  4. 2018 Oct: BERT, 340M parameters
    2019 Oct: T5, 770M parameters
    2019 Feb: GPT-2, 1.5B parameters
    2020 Jun: GPT-3, 175B parameters
    2022 Apr: DALL-E 2, 3.5B parameters
    2021 Jan: DALL-E, 12B parameters
    And more...

  5. GPU on Google Cloud
    Compute Engine
    In preview

  6. NVIDIA A100: HBM2 40GB, 80GB
    NVIDIA T4: GDDR6 16GB
    NVIDIA V100: HBM2 16GB
    NVIDIA P100: HBM2 16GB
    NVIDIA P4: GDDR5 8GB
    NVIDIA K80: GDDR5 12GB
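
To relate the parameter counts on the timeline slide to the GPU memory sizes listed here, the sketch below estimates how much memory the model weights alone would occupy. It assumes 2 bytes per parameter (fp16) and counts 1 GB as 10^9 bytes; both are assumptions, not figures from the talk, and real training needs considerably more memory for gradients, optimizer state, and activations.

```python
import math

# Assumption: weights stored as fp16, i.e. 2 bytes per parameter.
BYTES_PER_PARAM = 2

# Parameter counts as given on the timeline slide.
MODEL_PARAMS = {
    "BERT": 340_000_000,
    "T5": 770_000_000,
    "GPT-2": 1_500_000_000,
    "DALL-E 2": 3_500_000_000,
    "DALL-E": 12_000_000_000,
    "GPT-3": 175_000_000_000,
}


def weights_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def gpus_needed(num_params, gpu_memory_gb):
    """Minimum number of GPUs whose combined memory holds the weights."""
    return math.ceil(weights_gb(num_params) / gpu_memory_gb)


if __name__ == "__main__":
    for name, params in MODEL_PARAMS.items():
        print(f"{name}: {weights_gb(params):.1f} GB of weights, "
              f"at least {gpus_needed(params, 80)} x 80GB A100")
```

Under these assumptions, the smaller models fit on a single 16GB card, while GPT-3-scale weights have to be spread over several 80GB A100s before any training state is counted.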

  7. A100
    Multi-region support

    View full-size slide

  8. https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga

  9. https://developer.nvidia.com/blog/introducing-hgx-a100-most-powerful-accelerated-server-platform-for-ai-hpc/

  10. https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga

  11. GDPR: General Data Protection Regulation (Europe)
    CCPA: California Consumer Privacy Act (USA)
    COPPA: Children's Online Privacy Protection Act
    SOC 1, 2, 3: General / Global System Security Certification
    ISO/IEC 27701: General / Global Privacy Certification
    ISO/IEC 27018: General / Global Cloud Privacy Certification
    HIPAA: Health Insurance Portability and Accountability Act

  12. Cloud Billing Monitoring
    With BigQuery
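
As a minimal sketch of what billing monitoring with BigQuery can look like, the code below builds a SQL query that sums daily cost per service from a Cloud Billing export table. The table name is a placeholder, the function names are hypothetical, and the column names (`usage_start_time`, `service.description`, `cost`) assume the standard usage cost export schema; none of this is taken from the talk itself.

```python
def build_daily_cost_query(table, days=30):
    """Return SQL that sums cost per service per day for the last `days` days."""
    return f"""
SELECT
  DATE(usage_start_time) AS usage_date,
  service.description AS service_name,
  ROUND(SUM(cost), 2) AS total_cost
FROM `{table}`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
GROUP BY usage_date, service_name
ORDER BY usage_date, total_cost DESC
""".strip()


def run_query(sql):
    """Run the query; needs the google-cloud-bigquery package and credentials."""
    from google.cloud import bigquery  # imported lazily so the builder works without it

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    # Placeholder table name; replace with your own billing export table.
    print(build_daily_cost_query("my-project.my_dataset.my_billing_export_table"))
```

The query result can then back a dashboard or a scheduled cost alert.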

  13. 33% cheaper
    by using preemptible (Spot) instances

  14. 01 On-demand
    02 Multi-region support
    03 Dashboard support (BigQuery)
    04 No setup required (IPMI, firewall devices, etc.)

  15. 01 On-demand
    02 Multi-region support
    03 Dashboard support (BigQuery)
    04 No setup required (IPMI, firewall devices, etc.)

  16. 01 On-demand
    02 Multi-region support
    03 Dashboard support (BigQuery)
    04 No setup required (IPMI, firewall devices, etc.)

  17. 01 On-demand
    02 Multi-region support
    03 Dashboard support (BigQuery)
    04 No setup required (IPMI, firewall devices, etc.)

  18. Cloud TPU
    Tensor Processing Unit

  19. https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-v4-mlperf-2-0-results

  20. TPU v3 consists of two systolic arrays,
    each with 128x128 ALUs.
    https://cloud.google.com/tpu/docs/intro-to-tpu?hl=ko
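
To illustrate the idea of a systolic array, here is a toy cycle-by-cycle simulation of a matrix multiply on a grid of multiply-accumulate cells. It uses an output-stationary dataflow chosen for simplicity (each cell keeps one output value while skewed inputs stream past); this is an assumption made for illustration and does not model the actual TPU hardware, and the function name is hypothetical.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of cells.

    Row i of `a` enters from the left delayed by i cycles and column j of
    `b` enters from the top delayed by j cycles, so cell (i, j) sees the
    pair (a[i][s], b[s][j]) at cycle t = s + i + j.
    Returns the product and the number of cycles the array was active.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]   # one accumulator per cell
    cycles = n + m + k - 2              # last pair reaches cell (n-1, m-1) at t = cycles - 1
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j           # index of the operand pair arriving now
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, cycles


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, "in", cycles, "cycles")
```

The point of the structure is that every cell does one multiply-accumulate per cycle using values passed from its neighbors, so a 128x128 array can keep thousands of ALUs busy without each one fetching operands from memory.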

  21. Switching to TPU
    Cost savings: 38%
    (compared with V100)

  22. Suitable for long-term training of several weeks or longer

  23. Vertex AI

  24. https://codelabs.developers.google.com/vertex-pipelines-intro#1

  25. NVIDIA A100
    NVIDIA T4
    NVIDIA V100
    NVIDIA P100
    NVIDIA P4
    NVIDIA K80

  26. https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-train-and-tune-pytorch-models-vertex-ai

  27. Vertex AI Pipelines

  28. https://www.kubeflow.org/docs/components/pipelines/v1/introduction/
    Kubeflow Pipelines

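  29. import os
    from tfx import v1 as tfx

    # The slide shows only a function body ending in `return`; the wrapper
    # name and signature below are assumed for illustration.
    def create_pipeline(pipeline_name, pipeline_root, data_root):
        # Path to the Keras trainer module, which must define run_fn().
        # MODULE_ROOT and _trainer_module_file are defined elsewhere.
        module_file = os.path.join(MODULE_ROOT, _trainer_module_file)

        # Ingest CSV files from data_root as training examples.
        example_gen = tfx.components.CsvExampleGen(input_base=data_root)

        # Train with the Google Cloud AI Platform Trainer extension.
        trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
            module_file=module_file,
            examples=example_gen.outputs['examples'],
            train_args=tfx.proto.TrainArgs(num_steps=100),
            eval_args=tfx.proto.EvalArgs(num_steps=5),
        )

        # Push the trained model for serving.
        pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
            model=trainer.outputs['model']
        )

        components = [example_gen, trainer, pusher]
        return tfx.dsl.Pipeline(
            pipeline_name=pipeline_name,
            pipeline_root=pipeline_root,
            components=components)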

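  30. import numpy as np
    from google.cloud import aiplatform

    # GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION and ENDPOINT_ID are defined elsewhere.
    # The prediction client must target the regional API endpoint.
    client_options = {
        'api_endpoint': GOOGLE_CLOUD_REGION + '-aiplatform.googleapis.com'
    }
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

    # One instance with normalized penguin features.
    instances = [{
        'culmen_length_mm': [0.71],
        'culmen_depth_mm': [0.38],
        'flipper_length_mm': [0.98],
        'body_mass_g': [0.78],
    }]

    # Full resource name of the deployed endpoint.
    endpoint = client.endpoint_path(
        project=GOOGLE_CLOUD_PROJECT,
        location=GOOGLE_CLOUD_REGION,
        endpoint=ENDPOINT_ID,
    )
    response = client.predict(endpoint=endpoint, instances=instances)

    # The predicted species is the index of the highest score.
    print('species:', np.argmax(response.predictions[0]))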
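
For readers without a deployed endpoint, the following standalone sketch shows the two strings the prediction code above relies on: the regional API host and the endpoint resource name. The helper names and the sample project, region, and endpoint ID are hypothetical; the resource-name format mirrors what `endpoint_path` returns.

```python
def regional_api_endpoint(region):
    """Host name of the regional Vertex AI API, e.g. for client_options."""
    return f"{region}-aiplatform.googleapis.com"


def endpoint_resource_name(project, location, endpoint_id):
    """Resource name in the same format that client.endpoint_path() produces."""
    return f"projects/{project}/locations/{location}/endpoints/{endpoint_id}"


def argmax(scores):
    """Index of the largest score, like np.argmax on a flat list."""
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(regional_api_endpoint("us-central1"))
    print(endpoint_resource_name("my-project", "us-central1", "1234567890"))
    print("species:", argmax([0.1, 0.7, 0.2]))
```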

  31. Serving (Endpoint)
    Registry (Model Registry)
