CLIP Indonesian

Galuh Sahid
December 04, 2021

PyCon ID 2021


  1. Outline

    • High-level overview of CLIP
    • Building CLIP-Indonesian
    • Introducing JAX
    • Environment setup
    • Datasets
    • Code
    • Monitoring
    • Experiments
    • Demo
  2. How does CLIP work? Encoders

    The CLIP model consists of dual encoders:
    • a text encoder that embeds text into a vector space (examples: BERT, RoBERTa)
    • an image encoder that embeds images into a vector space (example: Vision Transformer, ViT)
    https://openai.com/blog/clip/
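To make the dual-encoder idea concrete, here is a minimal sketch of how CLIP-style training compares the two sets of embeddings: every text embedding is scored against every image embedding with cosine similarity, and a symmetric cross-entropy loss rewards matching pairs. The function names and the toy vectors are illustrative assumptions, not code from the talk; real encoders produce much larger vectors.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity_matrix(text_embeds, image_embeds):
    """Cosine similarity between every text (rows) and every image (columns)."""
    texts = [normalize(t) for t in text_embeds]
    images = [normalize(i) for i in image_embeds]
    return [[sum(a * b for a, b in zip(t, i)) for i in images] for t in texts]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def clip_loss(text_embeds, image_embeds, temperature=0.07):
    """Symmetric contrastive loss: text-to-image plus image-to-text, averaged."""
    sims = similarity_matrix(text_embeds, image_embeds)
    logits = [[s / temperature for s in row] for row in sims]
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed))

# Toy 2-dimensional embeddings (made up for illustration).
texts = [[1.0, 0.0], [0.0, 1.0]]
matched_images = [[2.0, 0.1], [0.1, 3.0]]      # image k points the same way as text k
mismatched_images = [[0.1, 3.0], [2.0, 0.1]]   # pairs are swapped

loss_matched = clip_loss(texts, matched_images)
loss_mismatched = clip_loss(texts, mismatched_images)
```

The loss is low when each caption is most similar to its own image and high when the pairs are swapped, which is the signal that pulls the two encoders into a shared space.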
  4. Fun 😭 fact

    The original CLIP was trained on 400 million image-text pairs, and the training process took 30 days across 592 V100 GPUs. So... how can we build our own CLIP?
  5. Building CLIP-Indonesian: What we need

    • Computing resources
    • Dataset
    • Code, compute-intensive NLP+CV
    • Monitoring system
  6. Building CLIP-Indonesian: What we need

    • Computing resources → TPU Research Cloud
    • Dataset → A large image-text pairs dataset in Indonesian
    • Code, compute-intensive NLP+CV → Flax/JAX + HuggingFace
    • Monitoring system → Weights & Biases
  7. Computing resources: Signing up to TPU Research Cloud (https://sites.research.google/trc/about/)

    • Free TPU v2-8 and v3-8 device(s)!
    • Participants in the TRC program are expected to share their TRC-supported research with the world through peer-reviewed publications, open source code, blog posts, or other means.
  8. Computing resources: Setting up the TPU VM

    • There are two ways to set up the TPU VM: the GUI (https://console.cloud.google.com/) or the CLI
    • Tip: for projects requiring large datasets, you might need to set up persistent disks
      • Calculate how much data you'll need (in GB)
      • The zone for the disks must be the same as the zone of the TPU VM
      • Persistent disks are not covered by TRC, but you can use your GCP free trial credits ($300)
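As a rough aid for the "calculate how much data you'll need" step, the sketch below estimates disk space from an image count and an average file size. Both input numbers are hypothetical placeholders rather than measurements from this project; measure the average size on a sample of your own images before sizing a disk.

```python
def estimate_disk_gb(n_images, avg_image_kb, headroom=1.2):
    """Estimate disk space in GB (decimal units: 1 GB = 1,000,000 KB).

    headroom is a safety multiplier for annotations, checkpoints, and temp files.
    """
    return n_images * avg_image_kb * headroom / 1_000_000

# ASSUMPTION: both values below are made up for illustration.
n_images = 10_000_000   # hypothetical number of images to store
avg_image_kb = 50       # hypothetical average size of one saved JPEG

needed_gb = estimate_disk_gb(n_images, avg_image_kb)
print(f"Plan for roughly {needed_gb:.0f} GB of persistent disk")
```

If the estimate exceeds what a single disk offers, the data can be split across several disks in the same zone as the TPU VM.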
  9. Computing resources: Setting up the TPU VM

    1. Creating the first persistent disk

        $ gcloud compute disks create clip-indonesian-disk-1 \
            --size 300GB \
            --zone europe-west4-a \
            --type pd-balanced

    2. Creating the second persistent disk

        $ gcloud compute disks create clip-indonesian-disk-2 \
            --size 300GB \
            --zone europe-west4-a \
            --type pd-balanced
  10. Computing resources: Setting up the TPU VM

    3. Setting up the TPU VM

        $ gcloud alpha compute tpus tpu-vm create clip-indonesian \
            --zone=europe-west4-a \
            --version=v2-alpha \
            --accelerator-type="v3-8" \
            --data-disk source=projects/clip-indonesian/zones/europe-west4-a/disks/clip-indonesian-disk-1 \
            --data-disk source=projects/clip-indonesian/zones/europe-west4-a/disks/clip-indonesian-disk-2

    Complete instructions on setting up the TPU VM and persistent disks can be found here. (Don't forget to mount your disks as described in the instructions!)

    4. SSH to your TPU VM

        $ gcloud alpha compute tpus tpu-vm ssh clip-indonesian \
            --zone=europe-west4-a
  11. Building the Indonesian dataset

    The original CLIP model was trained on 400M image-text pairs. Is such data a) available to the public and b) in Indonesian?
    • Answer: it wasn't readily available, but with a little bit of work we can get some decent data :)
  12. What datasets are we using to build CLIP-Indonesian?

    Name       Count (Train)*  Count (Validation)*  Original dataset  Translated annotations
    CC12M           9,480,140              650,000  Link              Link
    CC3M            2,520,816              300,000  Link              Link
    COCO 2017         108,285               10,000  Link              Link
    Flickr8k            5,670                  800  Link              Link
    WiT                89,610                9,000  Link              Dataset is already in Indonesian (filter by lang = id)
    Total          12,204,521              969,800

    *) Excludes broken images, SVGs, and images that could not be downloaded. For WiT, also excludes image-text pairs whose captions are 80% or more proper nouns.
  13. Building the Indonesian dataset: What are the readily available image-text pairs datasets?

    • Conceptual 12M (CC12M): ~12 million image-text pairs; covers a much more diverse set of visual concepts than CC3M. Language: English
    • Conceptual Captions (CC3M): 3 million images paired with natural-language captions, collected from the web. The raw descriptions were extracted from the alt-text HTML attribute of each image. Language: English
    • COCO (Microsoft Common Objects in Context): 328K images. A large-scale object detection, segmentation, key-point detection, and captioning dataset. Language: English
  14. Building the Indonesian dataset: What are the readily available image-text pairs datasets?

    • Wikipedia-based Image Text (WIT): a large multimodal, multilingual dataset (including Indonesian!)
    • Flickr 8k: 8,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events. Language: English
  15. Building the Indonesian dataset: General step-by-step

    1. Translate the dataset to Indonesian (except for the WiT dataset)

    File name | Original caption | Indonesian caption
    1000092795.jpg | Two blonde haired youths looked at their hands while hanging out in the courtyard. | Dua pemuda berambut lusuh melihat tangan mereka sambil nongkrong di halaman.
    1000092795.jpg | Two young, white boys were outside near a bunch of bushes. | Dua anak muda, laki-laki kulit putih berada di luar dekat banyak semak.
    1000092795.jpg | Two men in green shirts are standing in the courtyard. | Dua pria berkemeja hijau berdiri di halaman.
    1000092795.jpg | A man in a blue shirt stands in the park. | Seorang pria dengan kemeja biru berdiri di taman.
    1000092795.jpg | Two friends enjoying time spent together. | Dua teman menikmati waktu yang dihabiskan bersama.
  16. Building the Indonesian dataset: General step-by-step

    1. Translate the dataset to Indonesian (except for the WiT dataset)

    Repository: https://github.com/acul3/translated-dataset

    Available datasets in Indonesian:
    • Flickr30
    • Coco (2017 train)
    • Sub Caption
    • VizWiz train
    • CC3M
    • CC12M
  17. Building the Indonesian dataset: General step-by-step

    1. Translate the dataset to Indonesian (except for the WiT dataset)

        $ pip install mariantranslate

        from mariantranslate import Translator

        lang_from = "en"  # source language
        lang_to = "id"    # target language
        en_id_translator = Translator(lang_from, lang_to)
        en_id_translator.translate("Due to the limited vegetation cover of the Faroe Islands, it is relatively easy to follow the history of geology.")
        # Output: Karena tumbuhan terbatas menutupi Kepulauan Faroe, relatif mudah untuk mengikuti sejarah geologi.
  18. Building the Indonesian dataset: General step-by-step

    2. Download the images (link to complete code)

        import os
        import sys

        import contexttimer
        import pandas as pd
        import requests
        from PIL import Image

        # Load data and collect the URLs that need to be downloaded
        with contexttimer.Timer(prefix="Loading from tsv"):
            df = pd.read_csv(sys.argv[1], delimiter='\t', header=None)

        url_to_idx_map = {url: index for index, caption, url in df.itertuples()}
        base_dir = os.path.join(os.getcwd(), sys.argv[2])

        # Download a single image
        def process(item):
            url, image_id = item
            base_url = os.path.basename(url)  # extract base url
            stem, ext = os.path.splitext(base_url)  # split into stem and extension
            filename = f'{image_id:08d}---{stem}.jpg'  # create filename
            filepath = os.path.join(base_dir, filename)  # concat to get filepath
            if not os.path.isfile(filepath):
                req = requests.get(url, stream=True, timeout=1, verify=False).raw
                image = Image.open(req).convert('RGB')
                image.save(filepath)  # save PIL image
  19. Building the Indonesian dataset: General step-by-step

    2. Download the images (link to complete code)

    • Tip: since downloading all the images might take a while, it can be beneficial to implement multiprocessing
    • Multiprocessing makes downloading faster, but approximately 20% of the images will be lost.
    • This might not be a problem for CC3M and CC12M, which are large datasets, but it is a problem for WiT, which has only ~100k image-caption pairs for Indonesian.
    • Thus for WiT, we download the images without multiprocessing in order to preserve as many images as possible.
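One way to get the speed-up while keeping track of what was lost is to record every failed item so it can be retried sequentially afterwards. The sketch below uses a thread pool from the standard library rather than the multiprocessing approach named on the slide, since downloading is mostly waiting on the network. `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates failures instead of touching the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(items, fetch, max_workers=8):
    """Run fetch(item) concurrently and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()  # re-raises any exception from fetch
                succeeded.append(item)
            except Exception as exc:
                failed.append((item, repr(exc)))
    return succeeded, failed

def fake_fetch(item):
    """Stand-in for a real download: every fifth image 'times out'."""
    url, image_id = item
    if image_id % 5 == 0:
        raise TimeoutError(f"timed out: {url}")

# Hypothetical (url, image_id) pairs, in the same shape the download script uses.
items = [(f"https://example.com/{i}.jpg", i) for i in range(10)]
ok, bad = download_all(items, fake_fetch)
failed_ids = sorted(item[1] for item, _ in bad)
```

The `bad` list can then be passed through a plain loop with a longer timeout, which keeps the slow path limited to the images that actually need it.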
  20. Building the Indonesian dataset: General step-by-step

    2. Download the images (link to complete code)

        python downloaders/cc12m.py <tsv file> <output folder>
        python downloaders/cc12m.py datasets/cc12m/cc12m.tsv datasets/cc12m/images

    • Tip: build a command-line interface for your script to make it easier to run; you just need to define the input and output
    • This way you can write a shell script to replicate the procedure automatically
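A command-line interface like this can be sketched with the standard library's `argparse`. The two positional arguments mirror the usage line above; the `--timeout` option is a hypothetical extra added only to show an optional flag and is not part of the original script.

```python
import argparse

def build_parser():
    """Build the argument parser for a hypothetical downloader script."""
    parser = argparse.ArgumentParser(
        description="Download the images listed in a TSV file."
    )
    parser.add_argument("tsv_file", help="path to the TSV file with captions and URLs")
    parser.add_argument("output_dir", help="folder where the images will be saved")
    # ASSUMPTION: --timeout is illustrative and not part of the original script.
    parser.add_argument("--timeout", type=float, default=1.0,
                        help="seconds to wait for each download")
    return parser

# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments so that it reads sys.argv.
args = build_parser().parse_args(
    ["datasets/cc12m/cc12m.tsv", "datasets/cc12m/images"]
)
print(args.tsv_file, args.output_dir, args.timeout)
```

Named arguments also give the script a free `--help` message, which helps when the same commands are later collected into a shell script.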
  21. Building the Indonesian dataset: General step-by-step

    3. Preprocess the dataset (link to complete code)

    • We need to process the datasets so that they all have the same format
    • The code that we will be using accepts JSON lines (jsonl) files as input
    • The scripts in the `/preprocessors` folder convert JSON or .tsv files (depending on the dataset) into JSON lines files.
    • Each dataset will have a separate `train` and `val` dataset.
    • So in summary, the preprocessing script does the following:
      • Separates the training and validation datasets
      • Converts the original dataset (still in .csv or .tsv) into a common jsonl file
    • At the end we will have files like cc12m_train.json, cc12m_val.json, cc3m_train.json, cc3m_val.json, etc., all following the same format:

        {"image_path": "29374927984.jpeg", "captions": "Buah di atas meja"}
        {"image_path": "34875339282.jpeg", "captions": "Orang sedang berlari di pantai"}
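For a .tsv source, the conversion described above can be sketched as follows. This is a simplified illustration rather than the project's preprocessing script: the column order (caption, then URL), the way the file name is derived from the URL, and the helper names are all assumptions.

```python
import csv
import io
import json

def tsv_to_jsonl_lines(tsv_text, images_dir):
    """Turn 'caption<TAB>url' rows into JSON lines with image_path and captions."""
    lines = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for caption, url in reader:
        filename = url.rsplit("/", 1)[-1]  # ASSUMPTION: image saved under its URL basename
        record = {"image_path": f"{images_dir}/{filename}", "captions": caption}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines

def split_train_val(lines, n_val):
    """Keep the last n_val lines for validation and the rest for training."""
    cut = len(lines) - n_val
    return lines[:cut], lines[cut:]

# Hypothetical three-row TSV, reusing the example captions from the slide.
sample_tsv = (
    "Buah di atas meja\thttps://example.com/29374927984.jpeg\n"
    "Orang sedang berlari di pantai\thttps://example.com/34875339282.jpeg\n"
    "Kucing tidur di sofa\thttps://example.com/555.jpeg\n"
)
train_lines, val_lines = split_train_val(tsv_to_jsonl_lines(sample_tsv, "images"), n_val=1)
```

Each element of `train_lines` and `val_lines` is one JSON object, so writing them joined by newlines yields the jsonl files the training code expects.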
  22. Building the Indonesian dataset: General step-by-step

    3. Preprocess the dataset (link to complete code). Sample: COCO dataset

        import collections
        import json
        import sys

        # Arguments, assumed to follow the script's command-line usage:
        # <coco json file> <coco images folder> <output file name (without extension)>
        annotation_file, images_dir, output_file = sys.argv[1:4]

        # Parse the caption and image path
        with open(annotation_file, "r") as f:
            annotations = json.load(f)["annotations"]

        image_path_to_caption = collections.defaultdict(list)
        for element in annotations:
            caption = f"{element['caption'].lower().rstrip('.')}"
            image_path = images_dir + "/%012d.jpg" % (element["image_id"])
            image_path_to_caption[image_path].append(caption)

        # Convert to the JSON lines format
        lines = []
        for image_path, captions in image_path_to_caption.items():
            lines.append(json.dumps({"image_path": image_path, "captions": captions}))

        # Separate into training and validation
        train_lines = lines[:-10_001]
        valid_lines = lines[-10_001:]

        # Write into separate training and validation files
        with open(output_file + "_train.json", "w") as f:
            f.write("\n".join(train_lines))
        with open(output_file + "_val.json", "w") as f:
            f.write("\n".join(valid_lines))
  23. Building the Indonesian dataset: General step-by-step

    3. Preprocessing (sample: COCO dataset)

        python preprocessors/coco.py <coco json file> <coco images folder> <output file name (without extension)>
        python preprocessors/coco.py datasets/coco/coco_captions_train2017.json datasets/coco/images datasets/coco/coco_dataset

    • The script will output two files: coco_dataset_train.json and coco_dataset_val.json
  24. Building the Indonesian dataset: Additional steps for the WiT dataset

    • Tip: There are many different kinds of preprocessing that you can do to get a high-quality dataset
  25. Building the Indonesian dataset: Additional steps for the WiT dataset

    3. Remove image-text pairs whose captions consist mostly of proper nouns (source code)

        import sys

        import pandas as pd
        from nltk.tag import CRFTagger

        # Load the part-of-speech (POS) tagger
        ct = CRFTagger()
        ct.set_model_file('all_indo_man_tag_corpus_model.crf.tagger')

        # Load data
        df = pd.read_csv(sys.argv[1], delimiter='\t')
        df = df[["caption_reference_description", "image_url"]]

        # Proper-noun ratio at or above which a caption is dropped.
        # Command-line arguments are strings, so convert before comparing.
        threshold = float(sys.argv[3])

        def drop_propn(text):
            try:
                if len(text) == 0:
                    return True
                tokens = text.split()
                result = ct.tag_sents([tokens])
                # Calculate the percentage of proper nouns (NNP)
                nnp_cnt = 0
                total = len(result[0])
                for x in result[0]:
                    if x[1] == "NNP":
                        nnp_cnt += 1
                if (nnp_cnt / total) >= threshold:
                    return True
                return False
            except Exception as e:
                print(e)
                return True

        # Only keep image-caption pairs where to_drop is False
        df["to_drop"] = df["caption_reference_description"].apply(drop_propn)
        df = df[df["to_drop"] == False]
        df = df.drop("to_drop", axis=1)
        df.to_csv(sys.argv[2], sep='\t')
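The filtering rule itself can be tried without a tagger model by feeding in tokens that are already tagged. In this sketch the `(token, tag)` pairs are written by hand for illustration, and the helper names and the 0.8 default (taken from the 80% figure mentioned for WiT) are assumptions about how one might package the logic.

```python
def proper_noun_ratio(tagged_tokens):
    """Fraction of tokens tagged NNP; an empty caption counts as all proper nouns."""
    if not tagged_tokens:
        return 1.0
    nnp_count = sum(1 for _, tag in tagged_tokens if tag == "NNP")
    return nnp_count / len(tagged_tokens)

def should_drop(tagged_tokens, threshold=0.8):
    """Drop a caption when its proper-noun ratio reaches the threshold."""
    return proper_noun_ratio(tagged_tokens) >= threshold

# Hand-tagged examples (hypothetical; a real run would get tags from the POS tagger).
mostly_names = [("Gunung", "NNP"), ("Bromo", "NNP")]
descriptive = [("Seorang", "NND"), ("pria", "NN"), ("berdiri", "VB"),
               ("di", "IN"), ("Jakarta", "NNP")]

drop_names = should_drop(mostly_names)       # ratio 2/2
drop_descriptive = should_drop(descriptive)  # ratio 1/5
```

Captions that are little more than a place or person name say almost nothing about what the image looks like, which is why they are filtered out before training.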
  26. Building the Indonesian dataset: Merging them all together

        awk 1 cc12m_dataset_disk1_train.json cc12m_dataset_disk2_train.json \
            cc3m_dataset_train.json coco_dataset_train.json \
            flickr8k_dataset_train.json wit_dataset_train.json > train_dataset_v6.json

        awk 1 cc12m_dataset_disk1_val.json cc12m_dataset_disk2_val.json \
            cc3m_dataset_val.json coco_dataset_val.json \
            flickr8k_dataset_val.json wit_dataset_val.json > val_dataset_v6.json
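`awk 1` prints every input line followed by a newline, so a file that ends without a trailing newline does not get its last record fused with the first record of the next file. The sketch below is a Python equivalent of that behavior; the function name and the temporary demo files are illustrative only.

```python
import tempfile
from pathlib import Path

def merge_jsonl(input_paths, output_path):
    """Concatenate JSON lines files, ensuring each record ends with a newline."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    out.write(line if line.endswith("\n") else line + "\n")
                    count += 1
    return count

# Demo with two small temporary files; the first has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a_train.json"
    second = Path(tmp) / "b_train.json"
    merged = Path(tmp) / "train_merged.json"
    first.write_text('{"id": 1}\n{"id": 2}', encoding="utf-8")
    second.write_text('{"id": 3}\n', encoding="utf-8")
    n_records = merge_jsonl([first, second], merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()
```

A plain `cat` of the same two files would produce `{"id": 2}{"id": 3}` on one line, which a JSON lines reader cannot parse.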
  27. Code: Overview

    • Uses the JAX/Flax backend
    • Is a vision-text dual encoder model using a pre-trained vision and text encoder
    • For the image encoder, we use Vision Transformer (ViT), more specifically openai/clip-vit-base-patch32.
    • For the text encoder, we experimented with two models: IndoBERT Large (indobenchmark/indobert-base-p2) and Indonesian RoBERTa Base (flax-community/indonesian-roberta-base).
    • The CLIP-Indonesian model uses code modified from the HybridCLIP code
  28. Code: Intro to JAX/Flax: JAX motivating example

    • JAX: a framework that is specifically suited for machine learning research
    • Classic NumPy: what's missing?
      • Running on accelerated hardware (GPU/TPU)
      • Fast optimization via automatic differentiation
      • Parallelization of data and computation
    https://www.youtube.com/watch?v=WdTeDXsOSj4
  29. Code: Intro to JAX/Flax: JAX motivating example

    Classic NumPy → replace numpy with jax.numpy
    https://www.youtube.com/watch?v=WdTeDXsOSj4
  30. Code: Intro to JAX/Flax: What is Flax?

    • Flax: a deep learning framework built on top of JAX
    • Contains all the elements you usually encounter in deep learning frameworks:
      • Neural network API (flax.linen): Dense, Conv, {Batch|Layer|Group} Norm, Attention, Pooling, {LSTM|GRU} Cell, Dropout
      • Optimizers (flax.optim): SGD, Momentum, Adam, LARS, Adagrad, LAMB, RMSprop
      • Utilities and patterns: replicated training, serialization and checkpointing, metrics, prefetching on device
      • And more!
    https://github.com/google/flax
  31. Code: FlaxHybridCLIP code from HuggingFace

    https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip

        python run_hybrid_clip.py \
            --output_dir ${MODEL_DIR} \
            --text_model_name_or_path="roberta-base" \
            --vision_model_name_or_path="openai/clip-vit-base-patch32" \
            --tokenizer_name="roberta-base" \
            --train_file="coco_dataset/train_dataset.json" \
            --validation_file="coco_dataset/validation_dataset.json" \
            --do_train --do_eval \
            --num_train_epochs="40" --max_seq_length 96 \
            --per_device_train_batch_size="64" \
            --per_device_eval_batch_size="64" \
            --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
            --overwrite_output_dir \
            --preprocessing_num_workers 32 \
            --push_to_hub
  32. Code: Modifications: Image augmentation

    1. Code for image augmentation (source code; docs for torchvision transforms; based on clip-italian)

        import torch
        from torchvision.transforms import (
            ColorJitter, ConvertImageDtype, Normalize, RandomAffine, RandomAutocontrast,
            RandomCrop, RandomEqualize, RandomHorizontalFlip, RandomPerspective, Resize,
        )
        from torchvision.transforms.functional import InterpolationMode

        # Inside the transform module's __init__
        self.transforms = torch.nn.Sequential(
            Resize([image_size], interpolation=InterpolationMode.BICUBIC),
            RandomCrop([image_size], pad_if_needed=True, padding_mode="edge"),
            ColorJitter(hue=0.1),
            RandomHorizontalFlip(),
            RandomAffine(
                degrees=15,
                translate=(0.1, 0.1),
                scale=(0.8, 1.2),
                shear=(-15, 15, -15, 15),
                interpolation=InterpolationMode.BILINEAR,
                fill=127,
            ),
            RandomPerspective(
                distortion_scale=0.3,
                p=0.3,
                interpolation=InterpolationMode.BILINEAR,
                fill=127,
            ),
            RandomAutocontrast(p=0.3),
            RandomEqualize(p=0.3),
            ConvertImageDtype(torch.float),
            # Per-channel mean and standard deviation
            Normalize(
                (0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711),
            ),
        )
  33. Code: Modifications: Better optimizer

    2. Optimizer (source code; based on clip-italian)

        optimizer = optax.chain(
            optax.adaptive_grad_clip(0.01, eps=0.001),
            optax.scale_by_belief(),
            optax.scale_by_schedule(decay_lr_schedule_fn),
            optax.scale(-1.0),
        )
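`optax.scale_by_belief` rescales updates according to the AdaBelief algorithm. To show the idea, here is a simplified single-parameter sketch of an AdaBelief step in plain Python. It is not optax's implementation: the function name, the hyperparameter defaults, and the toy objective are assumptions chosen for illustration, and the gradient clipping and learning-rate schedule from the chain above are left out.

```python
import math

def adabelief_step(param, grad, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-16):
    """One AdaBelief update for a single scalar parameter.

    m tracks the running mean of the gradient; s tracks how far each gradient
    deviates from that mean (the 'belief'), instead of the raw squared gradient
    that Adam uses. t is the 1-based step count for bias correction.
    """
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    param = param - lr * m_hat / (math.sqrt(s_hat) + eps)
    return param, m, s

# Toy objective f(x) = x**2, whose gradient is 2 * x.
param, m, s = 5.0, 0.0, 0.0
grad = 2 * param
param, m, s = adabelief_step(param, grad, m, s, t=1, lr=0.1)
```

When successive gradients agree with the running mean, `s` stays small and the step is larger; when they disagree, `s` grows and the step shrinks.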
  34. Code: Modifications: Backbone freezing

    3. Backbone freezing (source code; based on clip-italian)

        image_embeds = vision_outputs[1]
        if self.freeze_backbones:
            image_embeds = jax.lax.stop_gradient(image_embeds)
        image_embeds = self.visual_projection(image_embeds)

        text_embeds = text_outputs[1]
        if self.freeze_backbones:
            text_embeds = jax.lax.stop_gradient(text_embeds)
        text_embeds = self.text_projection(text_embeds)
  35. Code: Running the script

        #!/bin/bash
        SCRIPT_DIR=clip-indonesian
        MODEL_DIR=/mnt/disks/data-1/models/training_indobert
        IMAGE_ENCODER="openai/clip-vit-base-patch32"
        TEXT_ENCODER="indobenchmark/indobert-base-p2"

        python ${SCRIPT_DIR}/run_hybrid_clip.py \
            --output_dir ${MODEL_DIR} \
            --overwrite_output_dir \
            --tokenizer_name=${TEXT_ENCODER} \
            --train_file="../data/train_dataset_v6.json" \
            --validation_file="../data/val_dataset_v6.json" \
            --do_train --do_eval \
            --num_train_epochs="10" --max_seq_length 96 \
            --per_device_train_batch_size="64" \
            --per_device_eval_batch_size="64" \
            --learning_rate="0.00005" --warmup_ratio 0.1 --weight_decay 0.0 \
            --preprocessing_num_workers 16 \
            --exp_name training_v3 \
            --text_model_name_or_path=${TEXT_ENCODER} \
            --vision_model_name_or_path=${IMAGE_ENCODER} \
            --eval_steps 500 \
            --logging_steps 100 \
            --save_steps 500 \
            --save_total_limit 5 \
            --adabelief \
            --freeze_backbones
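To see what `--learning_rate="0.00005"` together with `--warmup_ratio 0.1` could mean in practice, here is a sketch of a learning-rate schedule with linear warmup followed by linear decay. The shape of the schedule is an assumption made for illustration (the slides do not show how `decay_lr_schedule_fn` is defined), and the function name and step counts are hypothetical.

```python
def linear_warmup_decay(step, total_steps, peak_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given step: ramp up linearly, then decay linearly to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps

# Hypothetical run of 1,000 optimizer steps: warmup covers the first 100.
total_steps = 1000
lr_start = linear_warmup_decay(0, total_steps)
lr_mid_warmup = linear_warmup_decay(50, total_steps)
lr_peak = linear_warmup_decay(100, total_steps)
lr_end = linear_warmup_decay(total_steps, total_steps)
```

Warmup keeps the first updates small while the freshly initialized projection layers settle, and the decay reduces the step size as training approaches its end.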
  36. Monitoring: Setting up Weights & Biases

    Enabling wandb (source code)

        # Enable wandb
        if jax.process_index() == 0 and args.log_wandb:
            try:
                wandb.init(
                    name=args.exp_name,
                    entity="galuh",
                    project="clip-Indonesian",
                    sync_tensorboard=True
                )
                wandb.config.update(training_args)
                wandb.config.update(model_args)
                wandb.config.update(data_args)
            except ImportError as e:
                print(e)
  37. Experiments

    • All image encoders are OpenAI ViT
    • Interestingly, in terms of validation loss, using IndoBERT Large vs. RoBERTa Base makes little difference
  38. Takeaway

    It's possible to do a project that requires a lot of computation and data on your own:
    • Computing resources → TPU Research Cloud
    • Dataset → A large image-text pairs dataset in Indonesian
    • Code, compute-intensive NLP+CV → Flax/JAX + HuggingFace
    • Monitoring system → Weights & Biases
  39. References

    • Bianchi, F., Attanasio, G., Pisoni, R., Terragni, S., Sarti, G., & Lakshmi, S. (2021). Contrastive Language-Image Pre-training for the Italian Language. arXiv preprint arXiv:2108.08688.
    • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
    • Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., Li, X., Lim, Z. Y., ... & Purwarianti, A. (2020). IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. arXiv preprint arXiv:2009.05387.
    • Hybrid CLIP by the HuggingFace team
    • Indonesian RoBERTa Base by Wilson Wongso, Steven Limcorn, Samsul Rahmadani, and Chew Kok Wah
    • Indonesian Translated Datasets by Samsul Rahmadani