Google_Cloud_for_Machine_Learning_-_Sungmin_Han__Cloud_Next_.pdf

Sungmin Han, MLOps lead at Riiid and Google Developers Expert for ML
Google Cloud for Machine Learning

01 GPUs in Google Cloud
02 Cloud TPU
03 Vertex AI

GPUs in Google Cloud Training resources

2018 Oct: BERT, 340M parameters
2019 Feb: GPT-2, 1.5B parameters
2019 Oct: T5, 770M parameters
2020 Jun: GPT-3, 175B parameters
2021 Jan: DALL-E, 12B parameters
2022 Apr: DALL-E 2, 3.5B parameters
And more...

Compute Engine In preview GPU on Google Cloud

GPU          Memory
NVIDIA A100  HBM2 40GB, 80GB
NVIDIA T4    GDDR6 16GB
NVIDIA V100  HBM2 16GB
NVIDIA P100  HBM2 16GB
NVIDIA P4    GDDR5 8GB
NVIDIA K80   GDDR5 12GB
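To relate the parameter counts listed earlier to these memory sizes, here is a rough back-of-the-envelope sketch. It assumes weights are stored in 16 bits (2 bytes per parameter) and counts the weights only; gradients, optimizer state, and activations are ignored, so real training needs considerably more. The 2-byte figure and the helper names are assumptions for illustration, not part of the talk.

```python
import math

# Parameter counts as listed on the slide.
MODELS = {
    "BERT": 340e6,
    "T5": 770e6,
    "GPT-2": 1.5e9,
    "DALL-E 2": 3.5e9,
    "DALL-E": 12e9,
    "GPT-3": 175e9,
}

BYTES_PER_PARAM = 2   # assumption: 16-bit weights
GPU_MEMORY_GB = 80    # largest A100 memory size listed on the slide


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Lower bound on GPU count: weights only, nothing else resident in memory."""
    return math.ceil(weight_memory_gb(num_params) / gpu_memory_gb)


for name, params in MODELS.items():
    print(f"{name:9s} {weight_memory_gb(params):8.2f} GB of weights, "
          f"at least {min_gpus_for_weights(params)} x 80GB GPU")
```

Under these assumptions the GPT-3 weights alone come to 350 GB, which already exceeds a single 80GB card several times over.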

A100 Multi-region support

https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga

https://developer.nvidia.com/blog/introducing-hgx-a100-most-powerful-accelerated-server-platform-for-ai-hpc/


GDPR: General Data Protection Regulation (Europe)
CCPA: California Consumer Privacy Act (USA)
COPPA: Children's Online Privacy Protection Act
SOC 1, 2, 3: General / Global System Security Certification
ISO/IEC 27701: General / Global Privacy Certification
ISO/IEC 27018: General / Global Cloud Privacy Certification
HIPAA: Health Insurance Portability and Accountability Act

Cloud Billing Monitoring With BigQuery
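As a rough illustration of billing monitoring with BigQuery, the sketch below builds a query that sums Compute Engine GPU cost per SKU from a Cloud Billing export table. It assumes billing export to BigQuery is already enabled and that the table follows the standard usage cost export schema (`service.description`, `sku.description`, `cost`, `usage_start_time`). The table ID is a placeholder, the function names are made up for this example, and matching SKUs on the substring "gpu" is a heuristic.

```python
def build_gpu_cost_query(billing_table: str, days: int = 30) -> str:
    """Return SQL that sums Compute Engine GPU cost per SKU over the last `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    # The table ID is interpolated directly, so pass only trusted values.
    return f"""
SELECT
  sku.description AS sku_description,
  ROUND(SUM(cost), 2) AS total_cost
FROM `{billing_table}`
WHERE service.description = 'Compute Engine'
  AND LOWER(sku.description) LIKE '%gpu%'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
GROUP BY sku_description
ORDER BY total_cost DESC
""".strip()


def run_gpu_cost_query(billing_table: str, days: int = 30):
    """Run the query; needs google-cloud-bigquery and application credentials."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    return list(client.query(build_gpu_cost_query(billing_table, days)).result())


# Placeholder table ID: replace with your own billing export table.
print(build_gpu_cost_query("my-project.my_dataset.gcp_billing_export_v1_XXXXXX"))
```

The same query can back a dashboard chart; only the query builder runs here, since executing it requires a real project.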

33% cheaper with preemptible (Spot) instances
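The saving can be checked with simple arithmetic. In the sketch below, only the 33% figure comes from the slide; the hourly price and the run length are hypothetical placeholders, not Google Cloud list prices. Preemptible VMs can be stopped at any time, so a real job would also need checkpointing, which this calculation does not account for.

```python
def discounted_cost(on_demand_hourly: float, hours: float, discount: float) -> float:
    """Total cost after a fractional discount (0.33 means 33% cheaper)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_hourly * hours * (1.0 - discount)


HYPOTHETICAL_HOURLY_USD = 3.00   # placeholder price, not a real quote
TRAINING_HOURS = 100             # placeholder run length

on_demand = discounted_cost(HYPOTHETICAL_HOURLY_USD, TRAINING_HOURS, 0.0)
preemptible = discounted_cost(HYPOTHETICAL_HOURLY_USD, TRAINING_HOURS, 0.33)

print(f"on-demand:   ${on_demand:.2f}")
print(f"preemptible: ${preemptible:.2f}")
print(f"saved:       ${on_demand - preemptible:.2f}")
```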

01 On-demand
02 Multi-region support
03 Dashboard support (BigQuery)
04 No setup required (IPMI, firewall device, ...)

Cloud TPU Tensor Processing Unit

https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-v4-mlperf-2-0-results

Each TPU v3 core contains 2 systolic arrays, each with 128x128 ALUs. https://cloud.google.com/tpu/docs/intro-to-tpu?hl=ko
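To give a feel for what a systolic array does, here is a small cycle-by-cycle simulation in plain Python. It is a sketch of an output-stationary design chosen for brevity: each processing element (PE) keeps one output value while operands hop to the neighbour on the right or below every cycle. It is not a model of the TPU hardware, whose dataflow and scale (128x128) differ; the function name is made up for this example.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of PEs.

    Row i of A enters from the left edge delayed by i cycles, and column j
    of B enters from the top edge delayed by j cycles, so matching operands
    meet in PE (i, j). Returns (product, number_of_cycles).
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]   # operand each PE forwards to the right
    b_reg = [[0] * m for _ in range(n)]   # operand each PE forwards downward
    cycles = n + m + k - 2                # last operands reach the far corner PE

    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    s = t - i             # skewed injection at the left edge
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j             # skewed injection at the top edge
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate in place
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)   # [[19, 22], [43, 50]] 4
```

The point of the structure is that every operand is read from memory once and then reused as it moves through the grid, which is why such arrays suit dense matrix multiplication.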

Switching to TPU saves 38% in cost (compared with V100)

Suitable for long-term training runs of several weeks or longer

Vertex AI

https://codelabs.developers.google.com/vertex-pipelines-intro#1

NVIDIA A100, NVIDIA T4, NVIDIA V100, NVIDIA P100, NVIDIA P4, NVIDIA K80

https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-train-and-tune-pytorch-models-vertex-ai

Vertex AI Pipelines

Kubeflow Pipelines: https://www.kubeflow.org/docs/components/pipelines/v1/introduction/

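import os

from tfx import v1 as tfx


# The function name and signature are assumed; the slide shows only the body.
# MODULE_ROOT and _trainer_module_file are not defined on the slide.
def create_pipeline(pipeline_name, pipeline_root, data_root):
    # Path of the Python module that contains the Keras model and run_fn().
    module_file = os.path.join(MODULE_ROOT, _trainer_module_file)

    # Ingest the CSV files under data_root as examples.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Train with the Google Cloud AI Platform Trainer extension.
    trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
        module_file=module_file,
        examples=example_gen.outputs['examples'],
        train_args=tfx.proto.TrainArgs(num_steps=100),
        eval_args=tfx.proto.EvalArgs(num_steps=5),
    )

    # Push the trained model for serving.
    pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
        model=trainer.outputs['model']
    )

    components = [example_gen, trainer, pusher]
    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components)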

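import numpy as np
from google.cloud import aiplatform

# GOOGLE_CLOUD_REGION, GOOGLE_CLOUD_PROJECT and ENDPOINT_ID are not defined on the slide.
# The API endpoint is regional.
client_options = {
    'api_endpoint': GOOGLE_CLOUD_REGION + '-aiplatform.googleapis.com'
}
client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

# A single instance of feature values to score.
instances = [{
    'culmen_length_mm': [0.71],
    'culmen_depth_mm': [0.38],
    'flipper_length_mm': [0.98],
    'body_mass_g': [0.78],
}]

# Fully qualified resource name of the deployed endpoint.
endpoint = client.endpoint_path(
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_REGION,
    endpoint=ENDPOINT_ID,
)
response = client.predict(endpoint=endpoint, instances=instances)

# Report the index of the highest-scoring class.
print('species:', np.argmax(response.predictions[0]))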

Serving (Endpoint) Registry (Model registry)
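The link between the Model Registry and an Endpoint can be sketched with the Vertex AI SDK for Python (`google-cloud-aiplatform`): upload a trained model to the registry, then deploy it to an endpoint for online serving. This is a minimal sketch, not the talk's own code. The function name, all argument values, and the default machine type are assumptions, and running it requires a Google Cloud project, credentials, and a model artifact in Cloud Storage.

```python
def register_and_deploy(project, location, display_name, artifact_uri,
                        serving_image_uri, machine_type="n1-standard-4"):
    """Upload a model to the Vertex AI Model Registry and deploy it to an endpoint."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)

    # Registry: create a model resource from saved artifacts.
    model = aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,                      # e.g. a gs:// directory
        serving_container_image_uri=serving_image_uri,  # serving container image
    )

    # Serving: deploy the registered model; a new endpoint is created
    # because no existing endpoint is passed.
    endpoint = model.deploy(machine_type=machine_type)
    return model, endpoint


# Example call (placeholders, not run here):
# model, endpoint = register_and_deploy(
#     project="my-project", location="us-central1",
#     display_name="penguin-model",
#     artifact_uri="gs://my-bucket/model/",
#     serving_image_uri="<serving container image URI>",
# )
```

Keeping registration and deployment as separate steps means the same registered model version can later be deployed to another endpoint or rolled back.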

Thank you!
