Slide 1

Slide 1 text

Open Source ML: from pretrained models to production
Run state-of-the-art open source LLMs in production

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

1. Models: what exists out there?

Slide 4

Slide 4 text

The Hugging Face Hub
Models: access over 200k models shared by the community.
Spaces: build ML apps and demos to showcase how models work.
Datasets: share, access, and collaborate on over 45k datasets.

Slide 5

Slide 5 text

The Hugging Face Hub: growth
Models: 99k -> 200k
Spaces: 19k -> 60k
Datasets: 16k -> 45k

Slide 6

Slide 6 text

The Model Hub
● Models across modalities (Computer Vision, NLP, Audio, multimodal, RL, tabular)
● Multiple libraries (PyTorch, Keras, fastai, SpaCy, NeMo, PaddlePaddle, Stanza, timm)
● 180+ supported languages
● Model cards for documentation
○ Metrics reporting
○ CO2 emissions
○ TensorBoard hosting
○ Interactive widgets
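As a rough illustration of programmatic Hub access, here is a minimal sketch using the huggingface_hub library; it assumes a recent release, and the task filter and repo id are just illustrative examples.

# Sketch: browse and download from the Hub with huggingface_hub.
from huggingface_hub import list_models, hf_hub_download

# List a few text-classification models, sorted by downloads (descending)
for model in list_models(filter="text-classification", sort="downloads", direction=-1, limit=5):
    print(model.id)

# Download a single file (here, gpt2's config) into the local cache
config_path = hf_hub_download(repo_id="gpt2", filename="config.json")
print(config_path)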

Slide 7

Slide 7 text

2. Inference: how to run inference with LLMs?

Slide 8

Slide 8 text

Recent popular models

StarCoder
● Code generation
● 15.5B parameters
● OpenRAIL License
● 80+ languages
● 1 trillion tokens

LLaMA
● Large ecosystem
● 7B to 65B parameters
● Non-commercial
● 1-1.4 trillion tokens

Falcon
● Best OS model
● 7B to 40B parameters
● Apache 2.0
● Multilingual
● 1 trillion tokens
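As a quick, hedged sketch of trying one of these models, the snippet below runs text generation with the transformers pipeline. Falcon-7B is used only as an example checkpoint; it assumes a GPU with enough memory, and trust_remote_code may or may not be needed depending on your transformers version.

# Sketch: text generation with an open model via transformers.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b",     # example checkpoint; swap in the model you want
    torch_dtype=torch.bfloat16,   # halve memory vs. float32
    device_map="auto",            # spread layers across available devices
    trust_remote_code=True,       # Falcon originally shipped custom modeling code
)

output = generator("Open source LLMs are", max_new_tokens=40, do_sample=True)
print(output[0]["generated_text"])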

Slide 9

Slide 9 text

Challenges
Evaluation: existing benchmarks don't fully capture real-world use cases (e.g., multi-turn conversations).
Customizability: users want models tuned to their own data or use cases while preserving privacy.
Model size: LLMs require lots of memory, might not fit on a single machine, and require complex parallelism and communication.
Optimization: because of model size, latency and throughput suffer, which makes optimized models necessary.

Slide 10

Slide 10 text

Some things you can do
Loading: load in 4-bit or 8-bit mode (bitsandbytes, accelerate). Falcon 40B fits in 45GB (8-bit) or 27GB (4-bit) of RAM.
Multi-GPU: distribute among GPUs (accelerate). Set device_map="auto" or even offload layers to CPU (slow).
Inference libraries: use tools optimized for LLMs (text-generation-inference). Used by HF in production!
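A hedged sketch of the quantized-loading option above, assuming recent transformers, accelerate, and bitsandbytes installs; the checkpoint is an example and actual memory use depends on your hardware.

# Sketch: load a large model in 4-bit (or 8-bit) and let accelerate place it.
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"  # example checkpoint from the previous slide

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # shard across GPUs, offloading to CPU if needed (slow)
)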

Slide 11

Slide 11 text

Text-generation-inference (TGI)
● Tensor Parallelism
● Token Streaming
● Quantization
● Optimizations
● Security
● Metrics and monitoring
TGI supports most popular LLMs, such as StarCoder and SantaCoder, Falcon, LLaMA, Galactica and OPT, and GPT-NeoX.
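To make the serving flow concrete, here is a minimal sketch of querying a TGI server over HTTP. It assumes the server was already started (for example with the official Docker image) and that the generation parameters shown are just illustrative.

# Sketch: query a running TGI server.
# Start the server separately, e.g.:
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference \
#       --model-id tiiuae/falcon-7b-instruct
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Write a haiku about open source.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])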

Slide 12

Slide 12 text

Some users: HuggingChat, OpenAssistant, nat.dev

Slide 13

Slide 13 text

3. Training: how to adjust models to your own use cases?

Slide 14

Slide 14 text

Training
● $$$
● Lots and lots of data
● Lots of expertise

Fine-tuning
● $$
● Much less data and compute

PEFT (Parameter Efficient Fine-Tuning)
● $
● Even less compute

You can fine-tune Whisper or Falcon-7B in a free Colab.
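A minimal PEFT sketch with the peft library, attaching LoRA adapters to a small causal LM; the model id and hyperparameters below are illustrative choices, not a recipe from the slides.

# Sketch: attach LoRA adapters to a causal LM with peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # small example model

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters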

Slide 15

Slide 15 text

Example: Whisper
Full fine-tuning: results in OOM.
LoRA:
● 1% of trainable params, 5x larger batch size
● Fine-tune a 1.6B parameter model with less than 8GB of GPU VRAM
● The resulting checkpoints were less than 1% of the size of the original model
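The small-checkpoint point follows from only the adapter weights being saved; a hedged sketch of loading such a checkpoint back on top of the base model (the local adapter directory name here is hypothetical):

# Sketch: a LoRA checkpoint stores only the adapter weights, not the full model.
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
# "./whisper-large-v2-lora" is a hypothetical adapter directory produced by training
model = PeftModel.from_pretrained(base, "./whisper-large-v2-lora")  # adds a few MB of adapter weights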

Slide 16

Slide 16 text

Example: Stable Diffusion: a “dog” adapter, a “toy” adapter, and the combined “toy” + “dog” adapter
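A hedged sketch of combining two such adapters with diffusers; this assumes a recent diffusers release with the PEFT integration, and the adapter repo ids are hypothetical placeholders.

# Sketch: combine two LoRA adapters in a Stable Diffusion pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("user/dog-lora", adapter_name="dog")  # hypothetical repo
pipe.load_lora_weights("user/toy-lora", adapter_name="toy")  # hypothetical repo
pipe.set_adapters(["dog", "toy"], adapter_weights=[0.8, 0.8])

image = pipe("a toy dog in a garden").images[0]
image.save("toy_dog.png")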

Slide 17

Slide 17 text

QLoRA
● 4-bit Quantization: 4-bit quantized pretrained LM
● RLHF: base model with multiple adapters
● Efficient: fine-tune a 65B parameter model on a single 48GB GPU
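A QLoRA-style sketch combining a 4-bit NF4 base model with trainable LoRA adapters; it assumes recent transformers, bitsandbytes, and peft releases, and the checkpoint and hyperparameters are illustrative, not from the slide.

# Sketch: QLoRA-style setup, 4-bit NF4 frozen base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as introduced by QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))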

Slide 18

Slide 18 text

4. Building demos: how to build and share my ML apps?

Slide 19

Slide 19 text

Why demos?
● Easily present to a wide audience
● Increase reproducibility of research
● Diverse users can identify and debug failure points

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Gradio: typical usage

import gradio

app = gradio.Interface(
    classify_image,
    inputs="image",
    outputs="label",
)
app.launch()
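Here classify_image stands for your own prediction function (image in, label out). To put the same app in front of other people, a small hedged addition to the snippet above:

# Create a temporary public URL for the demo; for permanent hosting,
# the same app can be pushed to a Hugging Face Space.
app.launch(share=True)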

Slide 22

Slide 22 text

A turning point in the usage of ML: from ML/software engineers to anyone who can use a GUI or browser.

Slide 23

Slide 23 text

Thanks! [email protected] Omar Sanseviero @osanseviero
CREDITS: This presentation template was created by Slidesgo, and includes icons by Flaticon, infographics & images by Freepik and illustrations by Storyset and Chunte Lee