CLIP Indonesian

PyCon ID 2021

Galuh Sahid

December 04, 2021

Transcript

  1. Galuh Sahid | Dec 4, 2021
    CLIP-Indonesian: Contrastive Language–Image Pre-training model trained on Indonesian data
  2. Outline
    • High-level overview of CLIP
    • Building CLIP-Indonesian
    • Introducing JAX
    • Environment setup
    • Datasets
    • Code
    • Monitoring
    • Experiments
    • Demo
  3. Slides: https://bit.ly/pycon-clip-indonesian
    GitHub repository: https://github.com/galuhsahid/clip-indonesian
    Still a work in progress, so may not give the best performance (yet) :)
  4. High-level overview of CLIP: how to connect images and text with CLIP
  5. https://openai.com/blog/clip/

  6. What CLIP does Image classification task https://openai.com/blog/clip/

  7. What CLIP does Image classification task https://openai.com/blog/clip/

  8. What CLIP does Image search https://cloud.google.com/blog/topics/developers-practitioners/image-search-natural-language-queries

  9. How does CLIP work? Encoders
    The CLIP model consists of dual encoders:
    • a text encoder that embeds text into a vector (embedding) space
      • Examples: BERT, RoBERTa
    • an image encoder that embeds images into a vector (embedding) space
      • Examples: Vision Transformer (ViT)
    https://openai.com/blog/clip/
  10. How does CLIP work? Measuring how good our model is

    https://openai.com/blog/clip/
  11. How does CLIP work? Measuring how good our model is

    https://openai.com/blog/clip/
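
As a rough illustration of how a CLIP-style model is scored on a batch (a sketch based on the general description of contrastive pre-training, not code from this project): for N matching image-text pairs, an N×N similarity matrix is built, and the loss is low when each image is most similar to its own caption and vice versa. The toy embeddings, the function names, and the temperature value are assumptions made for this example.

```python
import math

def normalize(vector):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(l - largest) for l in logits))
    return log_sum - logits[target]

def clip_loss(image_embeds, text_embeds, temperature=0.07):
    # temperature=0.07 is an illustrative value, not taken from the slides.
    images = [normalize(v) for v in image_embeds]
    texts = [normalize(v) for v in text_embeds]
    n = len(images)
    # logits[i][j]: scaled similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image -> text: each row should pick its own column (the diagonal).
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should pick its own row.
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy 2-dimensional embeddings (hypothetical values).
image_embeds = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(image_embeds, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_loss(image_embeds, [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned pairs: {aligned:.4f}, shuffled pairs: {shuffled:.4f}")
```

Correctly paired captions give a much lower loss than shuffled ones, which is the signal the training process optimizes.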
  12. How does CLIP work? Zero-shot prediction https://openai.com/blog/clip/
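
Zero-shot prediction can be sketched as a nearest-neighbour lookup in the embedding space: embed one text prompt per candidate label, embed the image, and pick the label whose embedding is most similar. In the minimal sketch below, the three-dimensional vectors stand in for real encoder outputs, and the labels and helper names are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embed, label_embeds):
    # Score the image against every label prompt and return the best match.
    scores = {label: cosine(image_embed, emb) for label, emb in label_embeds.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for the text encoder output of one prompt per label
# (for example a caption-like sentence that mentions the label).
label_embeds = {
    "kucing": [0.9, 0.1, 0.0],
    "anjing": [0.1, 0.9, 0.0],
    "mobil": [0.0, 0.1, 0.9],
}
# Toy embedding standing in for the image encoder output.
image_embed = [0.8, 0.2, 0.1]

best_label, scores = zero_shot_classify(image_embed, label_embeds)
print(best_label)
```

No classifier is trained for these labels; swapping in a new label set only requires embedding new prompts.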

  13. Fun 😭 fact: the original CLIP was trained on 400 million image-text pairs, and training its largest ResNet variant took 18 days on 592 V100 GPUs.
  14. Fun 😭 fact: the original CLIP was trained on 400 million image-text pairs, and training its largest ResNet variant took 18 days on 592 V100 GPUs.
    So... how can we build our own CLIP?
  15. Building CLIP-Indonesian

  16. Building CLIP-Indonesian: What we need
    • Computing resources
    • Dataset
    • Code, compute-intensive NLP+CV
    • Monitoring system
  17. Building CLIP-Indonesian: What we need
    • Computing resources → TPU Research Cloud
    • Dataset → A large image-text pairs dataset in Indonesian
    • Code, compute-intensive NLP+CV → Flax/JAX + HuggingFace
    • Monitoring system → Weights & Biases
  18. Computing resources

  19. Computing resources: Signing up to TPU Research Cloud (https://sites.research.google/trc/about/)
    • Free TPU v2-8 and v3-8 device(s)!
    • Participants in the TRC program will be expected to share their TRC-supported research with the world through peer-reviewed publications, open source code, blog posts, or other means.
  20. Computing resources Signing up to TPU Research Cloud (https://sites.research.google/trc/about/)

  21. Computing resources: Setting up the TPU VM
    • There are two ways to set up the TPU VM: GUI (https://console.cloud.google.com/) or CLI
    • Tip: for projects requiring large datasets, you might need to set up persistent disks
      • Calculate how much data you'll need (in GB)
      • The zone for the disks must be the same as the zone of the TPU VM
      • These will not be covered by TRC, but you can use your GCP free trial credits ($300)
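
The disk-sizing tip above comes down to simple arithmetic. A minimal sketch follows, assuming you know roughly how many images you will store and their average size; the function name, the 100 KB average, and the 20% headroom are hypothetical values chosen for illustration, not measurements from this project.

```python
def estimate_disk_gb(num_images, avg_image_kb, headroom_percent=20):
    # Total size in KB, including headroom for annotations, checkpoints, etc.
    total_kb = num_images * avg_image_kb * (100 + headroom_percent) // 100
    # Ceiling division; 1 GB is treated as 1,000,000 KB.
    return -(-total_kb // 1_000_000)

# Hypothetical example: one million images at about 100 KB each.
needed_gb = estimate_disk_gb(num_images=1_000_000, avg_image_kb=100)
print(needed_gb)  # 120
```

Measuring the average size on a small sample of downloaded images gives a better estimate than guessing.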
  22. Computing resources: Setting up the TPU VM
    1. Creating the first persistent disk
        $ gcloud compute disks create clip-indonesian-disk-1 \
            --size 300GB \
            --zone europe-west4-a \
            --type pd-balanced
    2. Creating the second persistent disk
        $ gcloud compute disks create clip-indonesian-disk-2 \
            --size 300GB \
            --zone europe-west4-a \
            --type pd-balanced
  23. Computing resources: Setting up the TPU VM
    3. Setting up the TPU VM
        $ gcloud alpha compute tpus tpu-vm create clip-indonesian \
            --zone=europe-west4-a \
            --version=v2-alpha \
            --accelerator-type="v3-8" \
            --data-disk source=projects/clip-indonesian/zones/europe-west4-a/disks/clip-indonesian-disk-1 \
            --data-disk source=projects/clip-indonesian/zones/europe-west4-a/disks/clip-indonesian-disk-2
       Complete instructions on setting up the TPU VM + persistent disks can be found here. (Don't forget to mount your disks as described in the instructions!)
    4. SSH to your TPU VM
        $ gcloud alpha compute tpus tpu-vm ssh clip-indonesian \
            --zone=europe-west4-a
  24. Dataset

  25. https://github.com/galuhsahid/clip-indonesian/tree/master/data

  26. Building the Indonesian dataset
    The original CLIP model was trained on 400M image-text pairs. Is such data a) available to the public and b) in Indonesian?
    • Answer: it wasn't readily available, but with a little bit of work we can get some decent data :)
  27. What datasets are we using to build CLIP-Indonesian?
    Name       Count (Train)*  Count (Validation)*  Original Dataset Link  Translated Annotations Link
    CC12M           9,480,140              650,000  Link                   Link
    CC3M            2,520,816              300,000  Link                   Link
    COCO 2017         108,285               10,000  Link                   Link
    Flickr8k            5,670                  800  Link                   Link
    WiT                89,610                9,000  Link                   Dataset is already in Indonesian (filter by lang = id)
    Total          12,204,521              969,800
    *) excludes broken images, SVGs, and images that cannot be downloaded. For WiT, also excludes image-text pairs whose captions consist of 80% or more proper nouns.
  28. Building the Indonesian dataset: What are the readily available image-text pairs datasets?
    • Conceptual 12M (CC12M): ~12 million image-text pairs; covers a much more diverse set of visual concepts than CC3M. Language: English
    • Conceptual Captions (CC3M): 3 million images, paired with natural-language captions, collected from the web. The raw descriptions were extracted from the alt-text HTML attribute of each image. Language: English
    • COCO (Microsoft Common Objects in Context): 328K images. A large-scale object detection, segmentation, key-point detection, and captioning dataset. Language: English
  29. Building the Indonesian dataset: What are the readily available image-text pairs datasets?
    • Wikipedia-based Image Text (WIT): A large multimodal multilingual dataset (including Indonesian!)
    • Flickr 8k: 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. Language: English
  30. Building the Indonesian dataset General step-by-step

  31. Building the Indonesian dataset: General step-by-step
    1. Translate the dataset to Indonesian (except for the WiT dataset)
    File name | Original caption | Indonesian caption
    1000092795.jpg | Two blonde haired youths looked at their hands while hanging out in the courtyard. | Dua pemuda berambut lusuh melihat tangan mereka sambil nongkrong di halaman.
    1000092795.jpg | Two young, white boys were outside near a bunch of bushes. | Dua anak muda, laki-laki kulit putih berada di luar dekat banyak semak.
    1000092795.jpg | Two men in green shirts are standing in the courtyard. | Dua pria berkemeja hijau berdiri di halaman.
    1000092795.jpg | A man in a blue shirt stands in the park. | Seorang pria dengan kemeja biru berdiri di taman.
    1000092795.jpg | Two friends enjoying time spent together. | Dua teman menikmati waktu yang dihabiskan bersama.
  32. Building the Indonesian dataset: General step-by-step
    1. Translate the dataset to Indonesian (except for the WiT dataset)
    Repository: https://github.com/acul3/translated-dataset
    Available datasets in Indonesian:
    • Flickr30
    • Coco (2017 train)
    • Sub Caption
    • VizWiz train
    • CC3M
    • CC12M
  33. Building the Indonesian dataset: General step-by-step
    1. Translate the dataset to Indonesian (except for the WiT dataset)
        $ pip install mariantranslate

        from mariantranslate import Translator

        lang_from = "en"  # source language
        lang_to = "id"  # target language
        en_id_translator = Translator(lang_from, lang_to)
        en_id_translator.translate("Due to the limited vegetation cover of the Faroe Islands, it is relatively easy to follow the history of geology.")
        >>> Karena tumbuhan terbatas menutupi Kepulauan Faroe, relatif mudah untuk mengikuti sejarah geologi.
  34. Building the Indonesian dataset General step-by-step 2. Download the images

  35. Building the Indonesian dataset: General step-by-step
    2. Download the images (link to complete code)
        import os
        import sys

        import contexttimer
        import pandas as pd
        import requests
        from PIL import Image

        # Load data: collect the URLs that need to be downloaded
        with contexttimer.Timer(prefix="Loading from tsv"):
            df = pd.read_csv(sys.argv[1], delimiter='\t', header=None)
        url_to_idx_map = {url: index for index, caption, url in df.itertuples()}
        base_dir = os.path.join(os.getcwd(), sys.argv[2])

        # Download a single image and save it as an RGB JPEG
        def process(item):
            url, image_id = item
            base_url = os.path.basename(url)  # extract base url
            stem, ext = os.path.splitext(base_url)  # split into stem and extension
            filename = f'{image_id:08d}---{stem}.jpg'  # create filename
            filepath = os.path.join(base_dir, filename)  # concat to get filepath
            if not os.path.isfile(filepath):
                req = requests.get(url, stream=True, timeout=1, verify=False).raw
                image = Image.open(req).convert('RGB')
                image.save(filepath)  # save PIL image
  36. Building the Indonesian dataset: General step-by-step
    2. Download the images (link to complete code)
    • Tip: since downloading all the images might take a while, it can be beneficial to implement multiprocessing
    • Multiprocessing speeds up the download, but approximately 20% of the images will be lost.
    • This might not be a problem for large datasets such as CC3M and CC12M, but it is a problem for WiT, which only has ~100k image-caption pairs in Indonesian.
    • Thus for WiT, we download the images without multiprocessing in order to preserve as many images as possible.
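
A minimal sketch of parallel downloading is shown here. It uses a thread pool from the standard library rather than separate processes (an assumption made for this example, since downloads are I/O-bound), and it records failed items so they can be retried instead of silently lost. The slides do not say why images go missing, so treat this as one possible mitigation. `download_all` and `fake_fetch` are hypothetical names; in practice `fetch` would wrap a function like the `process` function shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(items, fetch, max_workers=8):
    """Run fetch(url, image_id) concurrently and return (succeeded, failed)."""
    def safe_fetch(item):
        url, image_id = item
        try:
            fetch(url, image_id)
            return item, None
        except Exception as error:  # keep going; remember what failed
            return item, error

    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the items.
        for item, error in pool.map(safe_fetch, items):
            if error is None:
                succeeded.append(item)
            else:
                failed.append(item)
    return succeeded, failed

def fake_fetch(url, image_id):
    # Stand-in for a real download so the example runs offline.
    if "broken" in url:
        raise IOError(f"cannot download {url}")

items = [
    ("http://example.com/a.jpg", 0),
    ("http://example.com/broken.jpg", 1),
    ("http://example.com/c.jpg", 2),
]
succeeded, failed = download_all(items, fake_fetch)
print(len(succeeded), "downloaded;", len(failed), "to retry")
```

The `failed` list can be fed back into `download_all`, possibly with fewer workers or a longer timeout, before giving up on those images.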
  37. Building the Indonesian dataset: General step-by-step
    2. Download the images (link to complete code)
        python downloaders/cc12m.py <tsv file> <output folder>
        python downloaders/cc12m.py datasets/cc12m/cc12m.tsv datasets/cc12m/images
    • Tip: build a command-line interface for your script to make it easier to run; you just need to define the input and output
    • This way you can make a shell script to replicate the procedures automatically
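
The command-line tip can be sketched with `argparse` from the standard library. The downloader shown earlier reads `sys.argv` directly; the parser below is an alternative illustration, and the argument names and help texts are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Download the images listed in a TSV file."
    )
    # Two positional arguments, mirroring: <tsv file> <output folder>
    parser.add_argument("tsv_file", help="TSV file with captions and image URLs")
    parser.add_argument("output_folder", help="folder where images are saved")
    return parser

# Parsing an explicit list here so the example runs without a real command line;
# a script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(
    ["datasets/cc12m/cc12m.tsv", "datasets/cc12m/images"]
)
print(args.tsv_file, args.output_folder)
```

A parser like this also gives the script a `--help` message for free, which helps when the same steps are replayed from a shell script.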
  38. Building the Indonesian dataset: General step-by-step
    3. Preprocess the dataset (link to complete code)
    • We need to process the datasets so that all datasets will have the same format
    • The code that we will be using accepts JSON lines (jsonl) files as input
    • The scripts in the `/preprocessors` folder convert JSON or .tsv files (depending on the dataset) into JSON lines files.
    • Each dataset will have a separate `train` and `val` dataset.
    • So in summary what the preprocessing script does is:
      • Separate training and validation dataset
      • Convert the original dataset (still in .csv, or .tsv) into a common jsonl file
    • At the end we will have files like cc12m_train.json, cc12m_val.json, cc3m_train.json, cc3m_val.json, etc. all following the same format.
        {"image_path": "29374927984.jpeg", "captions": "Buah di atas meja"}
        {"image_path": "34875339282.jpeg", "captions": "Orang sedang berlari di pantai"}
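
The two preprocessing steps (split, then convert to JSON lines) can be sketched for a TSV source as follows. This is an illustrative sketch, not one of the project's preprocessors: the column order (caption, then image file name), the function name, and the sample rows are assumptions.

```python
import csv
import io
import json

def tsv_to_jsonl(tsv_text, images_dir, num_val):
    """Convert caption/file-name TSV rows to JSON lines and split train/val."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    lines = [
        json.dumps(
            {"image_path": f"{images_dir}/{file_name}", "captions": caption},
            ensure_ascii=False,
        )
        for caption, file_name in rows
    ]
    # The last num_val examples become the validation set.
    split = len(lines) - num_val
    return lines[:split], lines[split:]

sample_tsv = (
    "Buah di atas meja\t29374927984.jpeg\n"
    "Orang sedang berlari di pantai\t34875339282.jpeg\n"
)
train_lines, val_lines = tsv_to_jsonl(sample_tsv, "images", num_val=1)
print("\n".join(train_lines))
print("\n".join(val_lines))
```

Each output line is a complete JSON object, so the files can later be concatenated or streamed line by line.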
  39. Building the Indonesian dataset: General step-by-step
    3. Preprocess the dataset (link to complete code). Sample: COCO dataset
        import collections
        import json
        import sys

        # Assumed argument order, following the script's command-line usage:
        # <coco json file> <coco images folder> <output file name>
        annotation_file, images_dir, output_file = sys.argv[1:4]

        # Parse the caption and image path
        with open(annotation_file, "r") as f:
            annotations = json.load(f)["annotations"]

        image_path_to_caption = collections.defaultdict(list)
        for element in annotations:
            caption = f"{element['caption'].lower().rstrip('.')}"
            image_path = images_dir + "/%012d.jpg" % (element["image_id"])
            image_path_to_caption[image_path].append(caption)

        # Convert to the JSON lines format
        lines = []
        for image_path, captions in image_path_to_caption.items():
            lines.append(json.dumps({"image_path": image_path, "captions": captions}))

        # Separate into training and validation
        train_lines = lines[:-10_001]
        valid_lines = lines[-10_001:]

        # Write into separate training and validation files
        with open(output_file + "_train.json", "w") as f:
            f.write("\n".join(train_lines))
        with open(output_file + "_val.json", "w") as f:
            f.write("\n".join(valid_lines))
  40. Building the Indonesian dataset: General step-by-step
    3. Preprocessing (sample: COCO dataset)
        python preprocessors/coco.py <coco json file> <coco images folder> <output file name (without extension)>
        python preprocessors/coco.py datasets/coco/coco_captions_train2017.json datasets/coco/images datasets/coco/coco_dataset
    • The script will output two files: coco_dataset_train.json and coco_dataset_val.json
  41. Building the Indonesian dataset: Additional steps for the WiT dataset
    • Tip: There are many different kinds of preprocessing that you can do to get a high quality dataset
  42. Building the Indonesian dataset: Additional steps for the WiT dataset (source code)
    3. Remove image-text pairs whose captions consist mostly of proper nouns
        import sys

        import pandas as pd
        from nltk.tag import CRFTagger

        # Load the part-of-speech (POS) tagger
        ct = CRFTagger()
        ct.set_model_file('all_indo_man_tag_corpus_model.crf.tagger')

        # Load data
        df = pd.read_csv(sys.argv[1], delimiter='\t')
        df = df[["caption_reference_description", "image_url"]]

        # Command-line arguments are strings, so convert the threshold (e.g. 0.8)
        # to a float before comparing it with the proper-noun ratio
        threshold = float(sys.argv[3])

        def drop_propn(text):
            try:
                if len(text) == 0:
                    return True
                text = text.split()
                result = ct.tag_sents([text])
                # Calculate the share of proper nouns (NNP)
                nnp_cnt = 0
                total = len(result[0])
                for x in result[0]:
                    if x[1] == "NNP":
                        nnp_cnt += 1
                if (nnp_cnt / total) >= threshold:
                    return True
                return False
            except Exception as e:
                print(e)
                return True

        # Only keep image-caption pairs where to_drop is False
        df["to_drop"] = df["caption_reference_description"].apply(drop_propn)
        df = df[df["to_drop"] == False]
        df = df.drop("to_drop", axis=1)
        df.to_csv(sys.argv[2], sep='\t')
  43. Building the Indonesian dataset: Merging them all together
        awk 1 cc12m_dataset_disk1_train.json cc12m_dataset_disk2_train.json \
            cc3m_dataset_train.json coco_dataset_train.json \
            flickr8k_dataset_train.json wit_dataset_train.json > train_dataset_v6.json
        awk 1 cc12m_dataset_disk1_val.json cc12m_dataset_disk2_val.json \
            cc3m_dataset_val.json coco_dataset_val.json \
            flickr8k_dataset_val.json wit_dataset_val.json > val_dataset_v6.json
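
`awk 1` prints every input line followed by a newline, so files that end without a trailing newline (as the preprocessing output does, since it joins lines with "\n") are still concatenated one record per line. A Python equivalent might look like the sketch below; `merge_jsonl` and the temporary file names are hypothetical.

```python
import os
import tempfile

def merge_jsonl(input_paths, output_path):
    """Concatenate JSON lines files, making sure every record ends with a newline."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as source:
                for line in source:
                    out.write(line if line.endswith("\n") else line + "\n")

# Small demonstration with temporary files that lack a trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a_train.json")
    second = os.path.join(tmp, "b_train.json")
    merged = os.path.join(tmp, "train_dataset.json")
    with open(first, "w", encoding="utf-8") as f:
        f.write('{"image_path": "1.jpg"}\n{"image_path": "2.jpg"}')
    with open(second, "w", encoding="utf-8") as f:
        f.write('{"image_path": "3.jpg"}')
    merge_jsonl([first, second], merged)
    with open(merged, encoding="utf-8") as f:
        merged_lines = f.read().splitlines()

print(len(merged_lines))  # 3
```

A plain `cat` would glue the last record of one file to the first record of the next in this situation, which is why the newline handling matters.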
  44. Code

  45. https://github.com/galuhsahid/clip-indonesian

  46. https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip

  47. Code: Overview
    • Uses the JAX/Flax backend
    • Is a vision-text dual encoder model using a pre-trained vision and text encoder
    • For the image encoder, we use Vision Transformer (ViT), more specifically openai/clip-vit-base-patch32.
    • For the text encoder, we experimented with two models: IndoBERT Large (indobenchmark/indobert-base-p2) and Indonesian RoBERTa Base (flax-community/indonesian-roberta-base).
    • The CLIP-Indonesian model uses modified code from the HybridCLIP code
  48. Code Intro to JAX/Flax: JAX motivating example
    Classic Numpy
    • JAX: a framework that is specifically suited for Machine Learning Research
    • What's missing?
      • Running on accelerated hardware (GPU/TPU)
      • Fast optimization via automatic differentiation
      • Parallelization of data and computation
    https://www.youtube.com/watch?v=WdTeDXsOSj4
  49. Code Intro to JAX/Flax: JAX motivating example Classic Numpy Replace

    numpy with jax.numpy https://www.youtube.com/watch?v=WdTeDXsOSj4
  50. Code Intro to JAX/Flax: JAX motivating example Classic Numpy https://www.youtube.com/watch?v=WdTeDXsOSj4

    Apply jax.grad
  51. Code Intro to JAX/Flax: JAX motivating example Classic Numpy https://www.youtube.com/watch?v=WdTeDXsOSj4

    Apply jax.vmap
  52. Code Intro to JAX/Flax: JAX motivating example Classic Numpy https://www.youtube.com/watch?v=WdTeDXsOSj4

    Apply Just in Time (JIT) compilation
  53. Code Intro to JAX/Flax: JAX motivating example Classic Numpy https://www.youtube.com/watch?v=WdTeDXsOSj4

    Apply pmap
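
The slide images for this walkthrough are not part of the transcript, so here is a small stand-in (not the example from the talk or the linked video) that applies the same transformations to a toy squared-error loss: `jax.numpy` in place of NumPy, `jax.grad` for gradients, `jax.vmap` for batching, and `jax.jit` for compilation. The function names and data are made up for illustration, and `jax.pmap` is left out because it needs multiple devices to be meaningful.

```python
import jax
import jax.numpy as jnp

def predict(w, x):
    # Linear model written with jax.numpy instead of numpy.
    return jnp.dot(x, w)

def loss(w, x, y):
    # Squared error for a single example.
    return (predict(w, x) - y) ** 2

# jax.grad: gradient of the loss with respect to its first argument (w).
grad_loss = jax.grad(loss)

# jax.vmap: apply the per-example gradient across a batch of (x, y) pairs
# while sharing the same w (in_axes=None means "do not map over this argument").
batched_grad = jax.vmap(grad_loss, in_axes=(None, 0, 0))

# jax.jit: compile the batched function with XLA.
fast_batched_grad = jax.jit(batched_grad)

w = jnp.array([1.0, 2.0])
xs = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = jnp.array([1.0, 0.0, 3.0])

grads = fast_batched_grad(w, xs, ys)
print(grads.shape)  # one gradient vector per example: (3, 2)
```

The transformations compose: the same Python function is differentiated, batched, and compiled without being rewritten.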
  54. Code Intro to JAX/Flax: JAX examples in the wild https://www.youtube.com/watch?v=WdTeDXsOSj4

  55. Code Intro to JAX/Flax: What is Flax?
    • Flax: a deep learning framework built on top of JAX
    • Contains all the usual elements you usually encounter in deep learning frameworks:
      • Neural network API (flax.linen): Dense, Conv, {Batch|Layer|Group} Norm, Attention, Pooling, {LSTM|GRU} Cell, Dropout
      • Optimizers (flax.optim): SGD, Momentum, Adam, LARS, Adagrad, LAMB, RMSprop
      • Utilities and patterns: replicated training, serialization and checkpointing, metrics, prefetching on device
      • And more!
    https://github.com/google/flax
  56. Code FlaxHybridCLIP code from HuggingFace https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip

  57. Code: FlaxHybridCLIP code from HuggingFace
    https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip
        python run_hybrid_clip.py \
            --output_dir ${MODEL_DIR} \
            --text_model_name_or_path="roberta-base" \
            --vision_model_name_or_path="openai/clip-vit-base-patch32" \
            --tokenizer_name="roberta-base" \
            --train_file="coco_dataset/train_dataset.json" \
            --validation_file="coco_dataset/validation_dataset.json" \
            --do_train --do_eval \
            --num_train_epochs="40" --max_seq_length 96 \
            --per_device_train_batch_size="64" \
            --per_device_eval_batch_size="64" \
            --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
            --overwrite_output_dir \
            --preprocessing_num_workers 32 \
            --push_to_hub
  58. Code Prior work: clip-italian https://arxiv.org/pdf/2108.08688.pdf

  59. Code Modifications: Image augmentation
    1. Code for image augmentation (source code; docs for torchvision transforms; based on clip-italian)
        import torch
        from torchvision.transforms import (
            ColorJitter, ConvertImageDtype, Normalize, RandomAffine,
            RandomAutocontrast, RandomCrop, RandomEqualize,
            RandomHorizontalFlip, RandomPerspective, Resize,
        )
        from torchvision.transforms.functional import InterpolationMode

        # Inside the transform module's __init__; image_size is the target resolution
        self.transforms = torch.nn.Sequential(
            Resize([image_size], interpolation=InterpolationMode.BICUBIC),
            RandomCrop([image_size], pad_if_needed=True, padding_mode="edge"),
            ColorJitter(hue=0.1),
            RandomHorizontalFlip(),
            RandomAffine(
                degrees=15,
                translate=(0.1, 0.1),
                scale=(0.8, 1.2),
                shear=(-15, 15, -15, 15),
                interpolation=InterpolationMode.BILINEAR,
                fill=127,
            ),
            RandomPerspective(
                distortion_scale=0.3,
                p=0.3,
                interpolation=InterpolationMode.BILINEAR,
                fill=127,
            ),
            RandomAutocontrast(p=0.3),
            RandomEqualize(p=0.3),
            ConvertImageDtype(torch.float),
            Normalize(
                (0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711),
            ),
        )
  60. Code Modifications: Better optimizer
    2. Optimizer (source code; based on clip-italian)
        optimizer = optax.chain(
            optax.adaptive_grad_clip(0.01, eps=0.001),
            optax.scale_by_belief(),
            optax.scale_by_schedule(decay_lr_schedule_fn),
            optax.scale(-1.0),
        )
  61. Code Modifications: Backbone freezing
    3. Backbone freezing (source code; based on clip-italian)
        image_embeds = vision_outputs[1]
        if self.freeze_backbones:
            image_embeds = jax.lax.stop_gradient(image_embeds)
        image_embeds = self.visual_projection(image_embeds)

        text_embeds = text_outputs[1]
        if self.freeze_backbones:
            text_embeds = jax.lax.stop_gradient(text_embeds)
        text_embeds = self.text_projection(text_embeds)
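
To see what `jax.lax.stop_gradient` does in isolation, here is a toy sketch (not the project's model): a scalar "backbone" weight feeds a "projection" weight, and with freezing enabled the gradient with respect to the backbone becomes zero while the projection still receives a gradient. The parameter names and values are hypothetical.

```python
import jax
import jax.numpy as jnp

def score(params, x, freeze_backbone):
    # "Backbone": produces features from the input.
    features = x * params["backbone"]
    if freeze_backbone:
        # Treat the features as constants during differentiation.
        features = jax.lax.stop_gradient(features)
    # "Projection": the part that should keep training.
    return jnp.sum(features * params["projection"])

params = {"backbone": jnp.array(2.0), "projection": jnp.array(3.0)}
x = jnp.array([1.0, 2.0])

grads_frozen = jax.grad(lambda p: score(p, x, True))(params)
grads_free = jax.grad(lambda p: score(p, x, False))(params)

print(grads_frozen["backbone"], grads_frozen["projection"])
print(grads_free["backbone"], grads_free["projection"])
```

With the backbone frozen only the projection is updated, which matches the intent of the `--freeze_backbones` flag in the training script.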
  62. Code: Running the script
        #!/bin/bash
        SCRIPT_DIR=clip-indonesian
        MODEL_DIR=/mnt/disks/data-1/models/training_indobert
        IMAGE_ENCODER="openai/clip-vit-base-patch32"
        TEXT_ENCODER="indobenchmark/indobert-base-p2"

        python ${SCRIPT_DIR}/run_hybrid_clip.py \
            --output_dir ${MODEL_DIR} \
            --overwrite_output_dir \
            --tokenizer_name=${TEXT_ENCODER} \
            --train_file="../data/train_dataset_v6.json" \
            --validation_file="../data/val_dataset_v6.json" \
            --do_train --do_eval \
            --num_train_epochs="10" --max_seq_length 96 \
            --per_device_train_batch_size="64" \
            --per_device_eval_batch_size="64" \
            --learning_rate="0.00005" --warmup_ratio 0.1 --weight_decay 0.0 \
            --preprocessing_num_workers 16 \
            --exp_name training_v3 \
            --text_model_name_or_path=${TEXT_ENCODER} \
            --vision_model_name_or_path=${IMAGE_ENCODER} \
            --eval_steps 500 \
            --logging_steps 100 \
            --save_steps 500 \
            --save_total_limit 5 \
            --adabelief \
            --freeze_backbones
  63. Monitoring Setting up Weights & Biases

  64. Monitoring Setting up Weights & Biases

  65. Monitoring: Setting up Weights & Biases
    Enabling wandb (source code)
        # Enable wandb
        if jax.process_index() == 0 and args.log_wandb:
            try:
                wandb.init(
                    name=args.exp_name,
                    entity="galuh",
                    project="clip-Indonesian",
                    sync_tensorboard=True
                )
                wandb.config.update(training_args)
                wandb.config.update(model_args)
                wandb.config.update(data_args)
            except ImportError as e:
                print(e)
  66. Experiments
    • All image encoders are OpenAI ViT
    • Interestingly, in terms of validation loss, using IndoBERT Large vs. RoBERTa Base does not differ much
  67. Demo
    • Zero-shot image classification
    • Image search on Unsplash25k dataset

  68. Future improvements
    • Text/caption augmentation
    • Quantitative evaluation (e.g. MRR, accuracy)
    • Web app demo
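
For reference, mean reciprocal rank (MRR) for image search averages 1/rank of the correct image over all queries. The sketch below uses made-up search results to show the computation; the function name and data are hypothetical and not results from this project.

```python
def mean_reciprocal_rank(ranked_results, relevant_items):
    """Average of 1/rank of the relevant item; a query scores 0 if it is missing."""
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_items):
        if relevant in results:
            total += 1.0 / (results.index(relevant) + 1)
    return total / len(ranked_results)

# Three hypothetical text queries; the correct image is "img_a" each time.
ranked = [
    ["img_a", "img_b", "img_c"],  # rank 1 -> 1
    ["img_b", "img_a", "img_c"],  # rank 2 -> 1/2
    ["img_c", "img_b", "img_a"],  # rank 3 -> 1/3
]
relevant = ["img_a", "img_a", "img_a"]

mrr = mean_reciprocal_rank(ranked, relevant)
print(round(mrr, 4))  # 0.6111
```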
  69. Takeaway
    It's possible to do a project that requires a lot of computation and data on your own
    • Computing resources -> TPU Research Cloud
    • Dataset -> A large image-text pairs dataset in Indonesian
    • Code, compute-intensive NLP+CV -> Flax/JAX + HuggingFace
    • Monitoring system -> Weights & Biases
  70. References
    • Bianchi, F., Attanasio, G., Pisoni, R., Terragni, S., Sarti, G., Lakshmi, S. (2021). Contrastive Language-Image Pre-training for the Italian Language. arXiv preprint arXiv:2108.08688.
    • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
    • Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., Li, X., Lim, Z. Y., ... & Purwarianti, A. (2020). IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. arXiv preprint arXiv:2009.05387.
    • Hybrid CLIP by the HuggingFace team
    • Indonesian Roberta Base by Wilson Wongso, Steven Limcorn, Samsul Rahmadani, and Chew Kok Wah
    • Indonesian Translated Datasets by Samsul Rahmadani
  71. Thank you! email: galuh.sahid@protonmail.com linkedin: linkedin.com/in/galuhsahid