An Image-Text Dataset for Training Hyperscale Models

COYO-700M: Large-scale Image-Text Pair Dataset
An Image-Text Dataset for Training Hyperscale Models
Minwoo Byeon (dylan.m), Kakao Brain
if(kakao)2022
Copyright 2022. Kakao Corp. All rights reserved. Redistribution or public display is not permitted without written permission from Kakao.

Large-scale Image-Text Models

CLIP: Connecting Text and Images 1) (400M Image-Text Pairs)
DALL·E: Creating Images from Text 2) (250M Image-Text Pairs)
1) https://openai.com/blog/clip/
2) https://openai.com/blog/dall-e/

ALIGN (Google) 1) (1.8B Image-Text Pairs)
Florence (Microsoft) 2) (900M Image-Text Pairs)
1) https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
2) https://www.microsoft.com/en-us/research/publication/florence-a-new-foundation-model-for-computer-vision/

Image-Text Pair Dataset

Image-Text Pair
- 2016 Gyeongbokgung Palace Night View Travel to beautiful history (https://en.wikipedia.org/wiki/Gyeongbokgung)
- Picasso in front of his painting (https://en.wikipedia.org/wiki/Pablo_Picasso)
- A panoramic view of Mauritius Island (https://en.wikipedia.org/wiki/Mauritius)
- Van Gogh's Starry Night Over the Rhône, 1888, oil on canvas (https://en.wikipedia.org/wiki/The_Starry_Night)
- Various fruits arranged at a stall in the Municipal Market of São Paulo (https://en.wikipedia.org/wiki/Fruit)

Image Tag & Alternative Text
When alternative text is entered as a description of an image, screen readers can convey that information to the user. 1)
<img src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Jacques-Louis_David_017.jpg/170px-Jacques-Louis_David_017.jpg" alt="Painting of Napoleon Bonaparte in His Study at the Tuileries" decoding="async" width="170" height="280" class="thumbimage" data-file-width="1576" data-file-height="2596">
1) https://en.m.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Accessibility/Alternative_text_for_images
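The pairing of an image URL with its alt text can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the pipeline used to build the dataset; the class and function names (`ImgAltParser`, `extract_image_text_pairs`) and the rule of skipping images with empty alt text are assumptions made for the example.

```python
from html.parser import HTMLParser


class ImgAltParser(HTMLParser):
    """Collects (src, alt) pairs from <img> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Assumption: an image without alt text yields no image-text pair.
        if src and alt:
            self.pairs.append((src, alt))


def extract_image_text_pairs(html):
    """Return every (image URL, alternative text) pair found in an HTML string."""
    parser = ImgAltParser()
    parser.feed(html)
    return parser.pairs


sample_html = (
    '<p><img src="napoleon.jpg" alt="Painting of Napoleon Bonaparte">'
    '<img src="spacer.gif"></p>'
)
pairs = extract_image_text_pairs(sample_html)
```

Here `pairs` holds only the first image, because the second has no alt attribute.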

Common Crawl
“We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.” (Common Crawl 1))
- 230 billion web pages
- 10 years total collection period 2)
- 6.8 PiB data size 3)
1) Common Crawl, https://commoncrawl.org/
2) From 2013, as of August 2022
3) Pebibyte (PiB): 1 PiB = 2^50 bytes = 1,125,899,906,842,624 bytes = 1024 TiB (tebibytes)

Common Crawl
- 22 billion web pages
- 532 TiB data size
- 1 year: October 2020 to August 2021

Data Filtering
- Image Filtering
- Text Filtering
- Image-Text Filtering
- Deduplication

Image Filtering
JPEG, PNG, BMP, WEBP, ...

Image Filtering
- aspect ratio < 3.0
- ≥ 200px
- > 5KB
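The three image thresholds can be combined into a single predicate. The sketch below rests on assumptions the slide does not state: the aspect ratio is taken as longer side divided by shorter side, the 200px limit is applied to the shorter side, and 5KB is read as 5 × 1024 bytes. The function name `passes_image_filter` is hypothetical.

```python
def passes_image_filter(width, height, size_bytes,
                        max_aspect_ratio=3.0, min_side=200, min_bytes=5 * 1024):
    """Return True if an image satisfies the size and shape thresholds."""
    if width <= 0 or height <= 0:
        return False
    shorter, longer = min(width, height), max(width, height)
    # Assumption: aspect ratio = longer side / shorter side.
    if longer / shorter >= max_aspect_ratio:
        return False
    # Assumption: the 200px limit applies to the shorter side.
    if shorter < min_side:
        return False
    # Assumption: 5KB means 5 * 1024 bytes; the file must be strictly larger.
    return size_bytes > min_bytes


keep = passes_image_filter(640, 480, 20_000)       # ordinary photo
drop_banner = passes_image_filter(1000, 200, 20_000)  # ratio 5.0, too elongated
```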

Text Filtering
- 5 ≤ Length ≤ 1000
- English 1)
- 3 ≤ # Words ≤ 256
- # Noun ≥ 1
1) https://github.com/google/cld3
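A rough sketch of how the text rules might be chained follows. Length and word count use only built-ins; language identification and noun counting are left as optional callables (`is_english`, `count_nouns`, both hypothetical parameters), since they need external tools such as the CLD3 library cited on the slide and a part-of-speech tagger. Treating "Length" as a character count and "# Words" as a whitespace split is an assumption.

```python
def passes_text_filter(text, is_english=None, count_nouns=None):
    """Return True if alt text satisfies the length, word, language and noun rules."""
    # Assumption: "Length" is the number of characters.
    if not (5 <= len(text) <= 1000):
        return False
    # Assumption: words are whitespace-separated.
    if not (3 <= len(text.split()) <= 256):
        return False
    # Optional checks: callers plug in a language detector and a noun counter.
    if is_english is not None and not is_english(text):
        return False
    if count_nouns is not None and count_nouns(text) < 1:
        return False
    return True


ok = passes_text_filter("A panoramic view of Mauritius Island")
too_short = passes_text_filter("cat")
```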

Deduplication
- Remove duplicate images based on Perceptual Image Hash 1)
- Remove duplicate (pHash, Text) samples within the dataset
  - Images with the same hash value may still be paired with different texts.
- Remove images that duplicate those in external public datasets used for testing
  - ImageNet-1K/21K
  - Flickr-30K
  - MS-COCO
  - CC-3M / CC-12M
- Remove all samples whose text appears 10 or more times
  - “Image of”, “photo of”, “jpeg”, …
1) https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
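The deduplication rules can be sketched over precomputed hashes. This example assumes each sample is already a `(phash, text)` tuple, that benchmark image hashes are supplied as a set, and that text frequency is counted before pair deduplication; the order of these steps and the name `deduplicate` are assumptions, and computing the perceptual hash itself is out of scope here.

```python
from collections import Counter


def deduplicate(samples, blocked_phashes=None, max_text_count=10):
    """Filter (phash, text) samples according to the deduplication rules."""
    blocked = set(blocked_phashes or ())
    samples = list(samples)
    # Count how often each text occurs across the whole collection.
    text_counts = Counter(text for _, text in samples)
    seen = set()
    kept = []
    for phash, text in samples:
        if phash in blocked:                      # image overlaps a benchmark dataset
            continue
        if text_counts[text] >= max_text_count:   # boilerplate text such as "photo of"
            continue
        if (phash, text) in seen:                 # exact (pHash, Text) duplicate
            continue
        seen.add((phash, text))
        kept.append((phash, text))
    return kept


samples = (
    [("h1", "red car"), ("h1", "red car"), ("h1", "vintage automobile")]
    + [(f"x{i}", "photo of") for i in range(10)]
    + [("bench", "a dog")]
)
result = deduplicate(samples, blocked_phashes={"bench"})
```

The same hash `h1` survives twice because it is paired with two different texts, which matches the note that one image hash can carry several captions.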

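NSFW Filtering 🔞
- Building a safe dataset
  - Remove samples whose images are flagged by a pornography image classification model
  - Remove samples containing profanity, slang, or pornographic words
  - Complete removal is nevertheless not possible, so inappropriate data may still be included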

Image-Text Metadata

Image-Text Similarity
- OpenAI CLIP Model (ViT-B/32, ViT-L/14)
[Figure: the Image Encoder maps the image to I1 and the Text Encoder maps "Van Gogh's Starry Night Over the Rhône, 1888, oil on canvas" to T1; Cosine(I1, T1) = ?]
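Cosine(I1, T1) is the dot product of the two embeddings divided by the product of their norms. The sketch below shows only that final step on made-up three-dimensional vectors; real CLIP embeddings come from the image and text encoders and have hundreds of dimensions, and the vectors here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings standing in for I1 and T1.
image_embedding = [0.2, 0.9, 0.1]
text_embedding = [0.4, 1.8, 0.2]   # same direction, different length
similarity = cosine_similarity(image_embedding, text_embedding)
```

Because the score depends only on direction, scaling either embedding leaves it unchanged, which is why the two vectors above score approximately 1.0.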

Watermark Score
[Figure: two example images with watermark scores 0.981 and 0.215]

Watermark Score
- CNN: RegNetY-16GF 1)
- Watermark Score: [0.0, 1.0]
- Watermark Images: Shutterstock, Getty Images, ...
- Non-watermark Images: OpenImages, ...
1) https://github.com/facebookresearch/SWAG

Aesthetic Score
- LAION Aesthetics Predictor V2 (https://laion.ai/blog/laion-aesthetics/)
- 7.0: https://en.wikipedia.org/wiki/A_Little_Coaxing
- 4.5: https://en.wikipedia.org/wiki/American_Football_Conference
- 5.1: https://en.wikipedia.org/wiki/Capital_One_Tower_(Virginia)
- 6.1: https://en.wikipedia.org/wiki/Aoraki_/_Mount_Cook_National_Park

Faces
- SCRFD: Sample and Computation Redistribution for Efficient Face Detection
  - https://insightface.ai/scrfd
- 0: https://en.wikipedia.org/wiki/Rookery_Building
- 1: https://en.wikipedia.org/wiki/Charli_XCX
- 2: https://en.wikipedia.org/wiki/The_Carpenters
- 11: https://en.wikipedia.org/wiki/Sweden_national_football_team

Word & Tokens
- len(text)
- len(text.split())
- transformers.BertTokenizer 1)
- transformers.GPT2TokenizerFast 2)
1) https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer
2) https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast
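The text statistics can be gathered in one helper. In this sketch the character and word counts use the two built-in expressions from the slide, while tokenizers are passed in as plain callables so the example runs without downloading any model; the function name `text_stats`, the `tokenizers` parameter, and the whitespace stand-in tokenizer are all assumptions.

```python
def text_stats(text, tokenizers=None):
    """Return character count, word count, and one token count per tokenizer.

    tokenizers: optional dict mapping a label to a callable that turns a
    string into a list of tokens (hypothetical interface for this example).
    """
    stats = {
        "length": len(text),
        "word_count": len(text.split()),
    }
    for name, tokenize in (tokenizers or {}).items():
        stats[f"num_tokens_{name}"] = len(tokenize(text))
    return stats


# Stand-in tokenizer so the example needs no external packages.
stats = text_stats("Picasso in front of his painting",
                   tokenizers={"whitespace": str.split})
```

With the transformers package installed, the `tokenize` method of a loaded `BertTokenizer` or `GPT2TokenizerFast` could be passed in the same way. Which pretrained checkpoints the dataset used, and whether special tokens are counted, is not stated on the slide.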

COYO-700M
https://github.com/kakaobrain/coyo-dataset

COYO-700M
- 656M Unique Images
- 566M Unique Texts
- 747M Unique Image-Text Pairs

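Attribute | Data type | Description
ID | Long | 64-bit integer ID
URL | String | Image URL taken from the src attribute of the <img> tag
TEXT | String | Alternative text of the image, taken from the alt attribute of the <img> tag
WIDTH | Integer | Image width
HEIGHT | Integer | Image height
IMAGE_PHASH | String | Image hash value
WORD_COUNT | Integer | Number of whitespace-separated words
NUM_TOKENS_BERT | Integer | Number of tokens produced by BertTokenizer
NUM_TOKENS_GPT | Integer | Number of tokens produced by GPT2TokenizerFast
NUM_FACES | Integer | Number of faces in the image
CLIP_SIMILARITY_VITB32 | Float | Image-text cosine similarity from the CLIP ViT-B/32 model
CLIP_SIMILARITY_VITL14 | Float | Image-text cosine similarity from the CLIP ViT-L/14 model
WATERMARK_SCORE | Float | Predicted score for whether the image contains a watermark
AESTHETIC_SCORE_LAION_V2 | Float | Predicted score for the aesthetic quality of the image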

COYO Examples
Non-face
• width > 256
• height > 256
• face_count == 0
High-resolution
• width > 1024
• height > 1024
High-quality
• width > 320
• height > 320
• word_count > 10
• clip_similarity_vitb32 > 0.3
Image Generation
• width > 512
• height > 512
• aesthetic_score_laion_v2 > 5.0
• watermark_score < 0.5
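The COYO Examples thresholds can be expressed as named presets applied to a metadata row. This sketch assumes rows are plain dictionaries keyed by the lowercase names shown on the slide (so `face_count` rather than the `NUM_FACES` column name), and it reads the bullets as four groups: Non-face, High-resolution, High-quality, and Image Generation. The preset names and the `matches` helper are hypothetical.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

# Each preset is a list of (column, comparison, threshold) rules.
PRESETS = {
    "non_face": [("width", ">", 256), ("height", ">", 256),
                 ("face_count", "==", 0)],
    "high_resolution": [("width", ">", 1024), ("height", ">", 1024)],
    "high_quality": [("width", ">", 320), ("height", ">", 320),
                     ("word_count", ">", 10),
                     ("clip_similarity_vitb32", ">", 0.3)],
    "image_generation": [("width", ">", 512), ("height", ">", 512),
                         ("aesthetic_score_laion_v2", ">", 5.0),
                         ("watermark_score", "<", 0.5)],
}


def matches(row, preset):
    """Return True if the row satisfies every rule of the named preset."""
    return all(OPS[op](row[column], value)
               for column, op, value in PRESETS[preset])


# A made-up metadata row for illustration.
row = {"width": 800, "height": 600, "face_count": 0, "word_count": 12,
       "clip_similarity_vitb32": 0.34, "aesthetic_score_laion_v2": 5.6,
       "watermark_score": 0.1}
selected = [name for name in PRESETS if matches(row, name)]
```

Keeping the rules as data rather than as four separate functions makes it easy to adjust a threshold or add another subset.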

Experiments

ALIGN 1) + 1.8B Image-Text Pairs
unCLIP 2) + 250M Image-Text Pairs
1) C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv:2102.05918, 2021.
2) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.

ALIGN
[Figure: contrastive learning on 1.8B noisy image-text pairs. The Image Encoder produces embeddings I1, I2, ..., IN and the Text Encoder produces embeddings T1, T2, ..., TN for Text1 ... TextN; the N×N grid holds the products Ii•Tj, from I1•T1 to IN•TN.]
1) https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
2) https://openai.com/blog/clip/
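The grid of image-text products is what a contrastive objective operates on: each image should score highest against its own text, and each text against its own image. Below is a minimal pure-Python sketch of a symmetric softmax cross-entropy over such a matrix. It assumes matched pairs lie on the diagonal and uses a fixed temperature of 0.07, which is an illustrative value rather than one taken from the talk.

```python
import math


def contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; pair (i, i) is the match.
    """
    n = len(sim)

    def mean_cross_entropy(matrix):
        total = 0.0
        for i in range(n):
            logits = [matrix[i][j] / temperature for j in range(n)]
            peak = max(logits)  # subtract the max for numerical stability
            log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
            total += log_sum_exp - logits[i]  # negative log-probability of the match
        return total / n

    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    # Image-to-text term plus text-to-image term.
    return mean_cross_entropy(sim) + mean_cross_entropy(transposed)


aligned = [[1.0, 0.0], [0.0, 1.0]]    # every image closest to its own text
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # every image closest to the wrong text
loss_aligned = contrastive_loss(aligned)
loss_shuffled = contrastive_loss(shuffled)
```

The loss is small when the diagonal dominates and large when it does not, so minimizing it pulls matched pairs together and pushes mismatched pairs apart.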

ALIGN + COYO-700M
[Figure: contrastive learning on 747M noisy image-text pairs. The Image Encoder produces embeddings I1, I2, ..., IN and the Text Encoder produces embeddings T1, T2, ..., TN; the N×N grid holds the products Ii•Tj, from I1•T1 to IN•TN.]
Example texts:
- Tornado moving over a field
- Stunning Locations Photog…
- Feel Calmly, Feel The Warmth
- Changing of the Tides
- Secluded Bardsey Island Mounted Print
- Photo of a Person Standing on Rice Terraces
- Great-horned owl on old tree
- Fruit Oil Painting 06

ALIGN + COYO-700M: Zero-shot Learning
Model | Image Encoder | Text Encoder | ImageNet Top-1 | Flickr30K I2T R@1 | Flickr30K T2I R@1 | MSCOCO I2T R@1 | MSCOCO T2I R@1
CLIP-L/14 (OpenAI) | 307M (ViT-L/14@336) | 117M (GPT-2) | 76.2 | 88.0 | 68.7 | 58.4 | 37.8
ALIGN-L2 (Google) | 480M (EffNet-L2) | 335M (BERT-large) | 76.4 | 88.6 | 75.7 | 58.6 | 45.6
ALIGN-B7 (Google) | 66M (EffNet-B7) | 110M (BERT-base) | 69.3 | - | - | 55.4 | 41.7
ALIGN-B7 (Kakao Brain) | 66M (EffNet-B7) | 110M (BERT-base) | 68.6 (-0.7) | 88.1 (-) | 73.2 (-) | 61.2 (+5.8) | 43.1 (+1.4)
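The I2T R@1 and T2I R@1 columns report recall at rank 1: the fraction of queries whose top-ranked candidate is a correct match. The sketch below computes both directions from one similarity matrix under a simplifying assumption that each image has exactly one matching text at the same index; the actual Flickr30K and MSCOCO benchmarks provide several captions per image, so this is an illustration of the metric, not the evaluation protocol behind the table.

```python
def recall_at_1(sim):
    """Fraction of rows whose largest entry is on the diagonal.

    sim[i][j] is the similarity of query i to candidate j; candidate i is
    assumed to be the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(sim)


# Made-up similarities: rows are images, columns are texts.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],   # image 2 is wrongly closest to text 0
]
i2t = recall_at_1(sim)                                  # image-to-text retrieval
t2i = recall_at_1([list(col) for col in zip(*sim)])     # text-to-image retrieval
```

The two directions can differ, as they do here, because ranking texts for a given image is a different question from ranking images for a given text.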

unCLIP (DALL·E 2) 1)
1) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.

unCLIP (DALL·E 2)
- Trained a text-to-image generation model on 100 million samples selected from COYO-700M
- For more details, please see the next session, “Kakao Brain's Text-based Image Generation Technology.”
Example prompts:
- Goryeo celadon in the shape of darth vader
- A pencil drawing of an astronaut riding a horse
- A high quality picture of a medieval knight with golden armor

Future works

Safe, high-quality training data

Q&A
https://github.com/kakaobrain/coyo-dataset
