

AI App Dojo #3: An Introduction to Tuning Hugging Face Transformers Models

Amid the growing attention on generative AI, engineers will have more and more opportunities to consider AI applications. In this session, using Hugging Face Transformers, which is widely used for natural language understanding and natural language generation on computers, we take up concrete examples to learn what is needed to tune and customize pre-trained models, for example by giving them additional training.
We will use the environment covered in AI App Dojo #1 and #2, so please have Python, PyTorch, and Hugging Face Transformers ready. If possible, prepare a PC or Mac with a GPU supported by PyTorch.

The flow of fine-tuning a pre-trained model
Understanding the classes related to the transformers.Trainer class
Hugging Face Datasets and its relationship to Apache Arrow
Performance of Hugging Face Datasets
Running fine-tuning of a BERT model
The flow for publishing the finished custom model

Versions used:
Windows 11 Pro 22H2 (22621.2215): Python 3.10.6, CUDA V11.7.64, PyTorch 2.0.1+cu117
macOS 13.5.1 (22G90): Python 3.11.4, PyTorch 2.0.1 (without MPS support)
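
Before starting, it can help to confirm that the local environment matches the versions above. A minimal sanity-check sketch (not part of the deck), assuming PyTorch is installed:

# Illustrative sketch: check Python, PyTorch, CUDA, and MPS availability.
import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())         # True on the Windows/CUDA setup above
print("MPS available:", torch.backends.mps.is_available())  # False on the macOS setup above (no MPS support)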

Akira Onishi (IBM)

August 30, 2023



Transcript

1. AI App Dojo #3: An Introduction to Tuning Hugging Face Transformers Models
Akira Onishi, IBM Japan, Ltd., Technology Business Unit, Customer Success, Principal Manager, and Windows / .NET / Container Porting Program lead.
Contact: AkiraOnishi@ibm.com, Twitter: @oniak3, https://www.facebook.com/akiraonishi, https://www.linkedin.com/in/oniak3
2. Self-introduction (Property / Value)
Name: Akira Onishi
Twitter / LinkedIn: oniak3
Years in the IT industry: …
Current obsession: a gentle, gradual diet
Hashtag: いいねぇ静岡生活 (loving life in Shizuoka)
Motto: like roadside grass that stands back up even after being stepped on
Go-to technique: staying positive through mental reframing
https://www.facebook.com/akiraonishi (search for「おにあく」on Facebook)
3. Today's topic
Local hardware (Windows / Linux / Mac), Python, PyTorch, CUDA, and an app for fine-tuning with Hugging Face Transformers and a BERT model.
Setting the finer theory aside, we experience fine-tuning a BERT model using Python together with Hugging Face Transformers, Datasets, and related libraries.
On the same kind of local hardware (Windows / Linux / Mac) with Python, PyTorch, and CUDA, an AI inference app then runs test inference using the custom model and the model published to the Hugging Face Hub.
4. Overall picture
Flow: run the fine-tuning → test in the local environment → private publication to the Hugging Face Hub → test the private model → public publication to the Hugging Face Hub → use of the published model → evaluation of accuracy.
What you will experience in today's session: extracting data from Hugging Face Datasets, fine-tuning with Trainer and PyTorch, local testing of the tuned model, publishing the model to the Hugging Face Hub, and testing the published model (Hosted inference API).
Sharing pre-trained models: Hugging Face NLP course, https://huggingface.co/learn/nlp-course/ja/chapter4/3
5. 🤗 Transformers (https://huggingface.co)
Customizing an AI model: adjustments and additional training of the model, AI models and datasets, additional data (for training and for validation), a computer suited to AI inference and machine learning, an OS and runtime suited to AI inference and machine learning, and computation using the AI model.
AI inference: computation using a pre-trained model, a computer suited to AI inference, an OS and runtime suited to AI inference, and computation using the AI model.

from transformers import pipeline
detector = pipeline(task="object-detection")
preds = detector("URL of an image")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
6. BERT (Bidirectional Encoder Representations from Transformers)
Figure cited from https://arxiv.org/abs/1810.04805.
BERT is a general-purpose model for natural language processing. Starting from the pre-trained model, you can create a model that performs a different task (fine-tuning). Examples: natural language inference (MNLI: Multi-Genre Natural Language Inference), named entity recognition (NER), and question answering (SQuAD: The Stanford Question Answering Dataset).
Leaving the finer details of BERT aside, let's first experience fine-tuning.
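
Before any fine-tuning, the pre-trained bert-base-cased checkpoint can already be exercised on its masked-language-modeling objective. A minimal sketch (not from the deck) using the fill-mask pipeline:

# Illustrative sketch: querying the pre-trained BERT checkpoint before fine-tuning.
from transformers import pipeline

unmasker = pipeline(task="fill-mask", model="bert-base-cased")
# BERT was pre-trained with masked language modeling, so it predicts the [MASK] token.
print(unmasker("The food at this restaurant was really [MASK]."))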
7. Reference: an example computed on CPU only (1,000 training samples, 1,000 evaluation samples)
oniak3@AkiranoiMac py % python3 trainsample1.py
load_dataset('yelp_review_full'):
AutoTokenizer.from_pretrained('bert-base-cased'):
dataset.map(tokenize_function, batched=True):
Map: 100%|██████████████████████████| 650000/650000 [02:27<00:00, 4395.82 examples/s]
Map: 100%|████████████████████████████| 50000/50000 [00:11<00:00, 4389.09 examples/s]
tokenized_datasets['train'].select(range(1000)):
tokenized_datasets['test'].select(range(1000)):
AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5):
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
TrainingArguments(output_dir='test_trainer', evaluation_strategy='epoch'):
evaluate.load('accuracy'):
Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset, compute_metrics=compute_metrics):
trainer.train():
{'eval_loss': 1.4513660669326782, 'eval_accuracy': 0.399, 'eval_runtime': 928.2999, 'eval_samples_per_second': 1.077, 'eval_steps_per_second': 0.135, 'epoch': 1.0}
{'eval_loss': 1.0377055406570435, 'eval_accuracy': 0.55, 'eval_runtime': 925.9615, 'eval_samples_per_second': 1.08, 'eval_steps_per_second': 0.135, 'epoch': 2.0}
79%|██████████████████████████████████▉ | 298/375 [2:30:31<31:06, 24.24s/it]
{'eval_loss': 1.0231441259384155, 'eval_accuracy': 0.592, 'eval_runtime': 922.4306, 'eval_samples_per_second': 1.084, 'eval_steps_per_second': 0.136, 'epoch': 3.0}
{'train_runtime': 11808.8493, 'train_samples_per_second': 0.254, 'train_steps_per_second': 0.032, 'train_loss': 1.072725830078125, 'epoch': 3.0}
100%|████████████████████████████████████████████| 375/375 [3:16:48<00:00, 31.49s/it]
oniak3@AkiranoiMac py %
8. # Fine-tuning with Hugging Face Transformers, CPU-only case — page 1 (note: the computation takes a long time)
from datasets import load_dataset
print ("load_dataset('yelp_review_full'):")
dataset = load_dataset("yelp_review_full")
from transformers import AutoTokenizer
print ("AutoTokenizer.from_pretrained('bert-base-cased'):")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
print ("dataset.map(tokenize_function, batched=True):")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print ("tokenized_datasets['train'].select(range(1000)):")
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
print ("tokenized_datasets['test'].select(range(1000)):")
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
from transformers import AutoModelForSequenceClassification
print ("AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5):")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)  # 1 to 5 stars
model = model.to("cpu")  # Explicitly place the model on the CPU; without this, CUDA or MPS is used for the computation
from transformers import TrainingArguments, Trainer
print ("TrainingArguments(output_dir='test_trainer', evaluation_strategy='epoch'):")
# Explicitly request the CPU in TrainingArguments as well; without this, CUDA or MPS is used for the computation
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", use_cpu=True)
# continued on the next page
9. # Fine-tuning with Hugging Face Transformers, CPU-only case — page 2
import numpy as np
import evaluate
print ("evaluate.load('accuracy'):")
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
print ("Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset, compute_metrics=compute_metrics):")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
print ("trainer.train():")
trainer.train()
print ("trainer.evaluate():")
eval_results = trainer.evaluate()
import pprint
pprint.pprint(eval_results)
tokenizer.save_pretrained("test_trainer")
trainer.save_model()
10. The main classes related to Trainer
class transformers.Trainer: train method, evaluate method, save_model method
class datasets.Dataset: train_dataset (for training), eval_dataset (for evaluation)
class transformers.PreTrainedModel: model
class transformers.TrainingArguments: args
class transformers.EvalPrediction: passed to compute_metrics (logits, labels)
class evaluate.EvaluationModule: compute method; the evaluate.load method returns the metric

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # labels are the ground-truth labels
    predictions = np.argmax(logits, axis=-1)  # predictions made by the model
    return metric.compute(predictions=predictions, references=labels)
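
The compute_metrics callback receives a transformers.EvalPrediction object. A minimal sketch (not from the deck) of the same accuracy metric written with explicit attribute access instead of tuple unpacking:

# Illustrative sketch: compute_metrics using the EvalPrediction attributes directly.
import numpy as np
import evaluate
from transformers import EvalPrediction

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred: EvalPrediction):
    # eval_pred.predictions holds the logits, eval_pred.label_ids the ground-truth labels
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)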
11. Performance of Datasets: memory use when loading

import os; import psutil; import timeit
from datasets import load_dataset

mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
wiki = load_dataset("wikipedia", "20220301.en", split="train")
mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
print(f"RAM memory used: {(mem_after - mem_before)} MB")

Wikipedia (en): 20.3 GB of data. Results on an iMac (macOS Ventura, Intel Core i-series CPU, SSD):
# First run
Downloading: 100%|█████████████████████████| 15.3k/15.3k [00:00<00:00, 21.0MB/s]
Downloading: 100%|█████████████████████████| 20.3G/20.3G [15:29<00:00, 21.8MB/s]
RAM memory used: 52.98046875 MB
# Second run and later
RAM memory used: 18.48046875 MB
(requires: pip install apache_beam)
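
The low RAM figures come from Datasets memory-mapping its Apache Arrow cache on disk instead of loading the data into RAM. A minimal sketch (not from the deck) for inspecting that cache, assuming the wiki dataset from the snippet above:

# Illustrative sketch: the dataset is backed by on-disk Arrow files, not held in RAM.
print(wiki.num_rows)                   # number of rows (articles) in the split
print(wiki.dataset_size >> 20, "MB")   # logical size of the Arrow data
for cache_file in wiki.cache_files:    # the memory-mapped .arrow files backing the dataset
    print(cache_file["filename"])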
12. Performance of Datasets: iteration throughput

import os; import psutil; import timeit
from datasets import load_dataset

mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
wiki = load_dataset("wikipedia", "20220301.en", split="train")
mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
print(f"RAM memory used: {(mem_after - mem_before)} MB")

s = """batch_size = 1000
for batch in wiki.iter(batch_size):
    ...
"""
time = timeit.timeit(stmt=s, number=1, globals=globals())
print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {time:.1f} sec, "
      f"ie. {float(wiki.dataset_size >> 27)/time:.1f} Gb/s")

Results on the iMac (macOS Ventura, Intel Core i-series CPU, SSD, PCIe attached):
Time to iterate over the 18 GB dataset: 156.3 sec, ie. 1.0 Gb/s (first run)
Time to iterate over the 18 GB dataset: 46.5 sec, ie. 3.2 Gb/s (second run)
Results on Windows 11 Pro 22H2 (AMD Ryzen 9 5950X, 16-core CPU, SSD, PCIe attached):
RAM memory used: 20.3125 MB
Time to iterate over the 18 GB dataset: 30.2 sec, ie. 5.0 Gb/s
13. Running the fine-tuning
Windows 11 Pro, CPU: AMD Ryzen 9 5950X (16 cores), Memory: 128GB, GPU: NVIDIA GeForce RTX 4090 (Memory 24GB)
Training samples: 30,000; test samples: 10,000.
We build a model that takes an English review as input and infers how many stars the review rating would be.
14. Sample fine-tuning using a GPU
from datasets import load_dataset
print ("load_dataset('yelp_review_full'):")
dataset = load_dataset("yelp_review_full")
from transformers import AutoTokenizer
print ("AutoTokenizer.from_pretrained('bert-base-cased'):")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
print ("dataset.map(tokenize_function, batched=True):")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print ("tokenized_datasets['train']")
small_train_dataset = tokenized_datasets["train"].shuffle(seed=182).select(range(30000))
large_train_dataset = tokenized_datasets["train"]
print ("tokenized_datasets['test']")
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=182).select(range(10000))
large_eval_dataset = tokenized_datasets["test"]
from transformers import AutoModelForSequenceClassification
print ("AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5):")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
#model = model.to("cuda")
from transformers import TrainingArguments, Trainer
print ("TrainingArguments(output_dir='test_finetune1', evaluation_strategy='epoch'):")
output_foldername = "test_finetune1"
training_args = TrainingArguments(output_dir=output_foldername, evaluation_strategy="epoch",
                                  per_device_train_batch_size=32, per_device_eval_batch_size=24)
# continued on the next page
15. # Sample fine-tuning using a GPU, continued
import numpy as np
import evaluate
print ("evaluate.load('accuracy'):")
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    print("logits: {} label: {}".format(logits, labels))
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
print ("Trainer(…):")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
print ("trainer.train():")
trainer.train()
#print ("trainer.evaluate():")
eval_results = trainer.evaluate()
print("eval results:")
import pprint
pprint.pprint(eval_results)
tokenizer.save_pretrained(output_foldername)
trainer.save_model()
16. Local test example of the custom model, using pipeline

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
# Load the model created by fine-tuning from local storage
model = AutoModelForSequenceClassification.from_pretrained("test_finetune1")
# Prepare the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("test_finetune1")
# Run AI inference with the Hugging Face Transformers pipeline
classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
result = classifier("This place is really great ! I'd highly recommend.")
print (result)
---
[{'label': 'LABEL_4', 'score': 0.9601454138755798}]
This assumes the model and tokenizer are stored in a subfolder (the Trainer output folder) under the folder where Python is being run.
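
The pipeline prints a generic label such as LABEL_4 because the fine-tuned config keeps the default id2label mapping. A minimal sketch (not from the deck, with assumed label wording) of attaching human-readable star labels when loading the model:

# Illustrative sketch: give the 5 classes readable names via id2label/label2id (assumed names "1 stars" .. "5 stars").
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

id2label = {i: f"{i + 1} stars" for i in range(5)}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "test_finetune1", id2label=id2label, label2id=label2id
)
tokenizer = AutoTokenizer.from_pretrained("test_finetune1")
classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(classifier("This place is really great ! I'd highly recommend."))
# Expected shape of the output: [{'label': '5 stars', 'score': ...}]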
17. Local test example of the custom model, using the tokenizer and model directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("test_finetune1")
model = AutoModelForSequenceClassification.from_pretrained("test_finetune1")
# Use the tokenizer and model explicitly instead of the transformers pipeline
inputs = tokenizer("This place is really great ! I'd highly recommend.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
import pprint
pprint.pprint(logits)
# Get the inference result
predicted_class_id = logits.argmax().item()
result = model.config.id2label[predicted_class_id]
print(predicted_class_id)
print(result)
---
tensor([[-2.0764, -3.4568, -1.9678, 2.1805, 5.3953]])
4
LABEL_4
18. Reference: an example computed on the full dataset using a GPU
PS D:\Learn\transformers\finetune> python tune2.py
184789.6056312
load_dataset('yelp_review_full'):
AutoTokenizer.from_pretrained('bert-base-cased'):
dataset.map(tokenize_function, batched=True):
Map: 100%|█████████████████████████████████████████████| 650000/650000 [02:31<00:00, 4295.43 examples/s]
Map: 100%|███████████████████████████████████████████████| 50000/50000 [00:11<00:00, 4254.34 examples/s]
tokenized_datasets['train']
tokenized_datasets['test']
AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5):
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
TrainingArguments(output_dir='test_trainer3', evaluation_strategy='epoch'):
evaluate.load('accuracy'):
Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset, compute_metrics=compute_metrics):
trainer.train():
{'loss': 1.0243, 'learning_rate': 4.958975368811435e-05, 'epoch': 0.02}
# (omitted) …
{'loss': 0.556, 'learning_rate': 3.601962618356061e-07, 'epoch': 2.98}
{'eval_loss': 0.734175443649292, 'eval_accuracy': 0.69484, 'eval_runtime': 247.19, 'eval_samples_per_second': 202.274, 'eval_steps_per_second': 8.431, 'epoch': 3.0}
{'train_runtime': 30651.3372, 'train_samples_per_second': 63.619, 'train_steps_per_second': 1.988, 'train_loss': 0.6722067566574003, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████| 60939/60939 [8:30:51<00:00, 1.99it/s]
100%|███████████████████████████████████████████████████████████████| 2084/2084 [04:06<00:00, 8.45it/s]
eval results:
{'epoch': 3.0, 'eval_accuracy': 0.69484, 'eval_loss': 0.734175443649292, 'eval_runtime': 246.7769, 'eval_samples_per_second': 202.612, 'eval_steps_per_second': 8.445}
215861.5289557
total time:31071.923324500007
Environment: Windows 11 Pro, CPU: AMD Ryzen 9 5950X (16 cores), Memory: 128GB, GPU: NVIDIA GeForce RTX 4090 (Memory 24GB). Training data: 650,000 samples; evaluation data: 50,000 samples. Total computation time: about 31,072 seconds (roughly 8.6 hours).
19. from datasets import load_dataset
import time
start = time.perf_counter()
print(start)
print ("load_dataset('yelp_review_full'):")
dataset = load_dataset("yelp_review_full")
from transformers import AutoTokenizer
print ("AutoTokenizer.from_pretrained('bert-base-cased'):")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
print ("dataset.map(tokenize_function, batched=True):")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print ("tokenized_datasets['train']")
large_train_dataset = tokenized_datasets["train"]
print ("tokenized_datasets['test']")
large_eval_dataset = tokenized_datasets["test"]
from transformers import AutoModelForSequenceClassification
print ("AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5):")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
#model = model.to("cuda")
from transformers import TrainingArguments, Trainer
print ("TrainingArguments(output_dir='test_trainer4large', evaluation_strategy='epoch'):")
training_args = TrainingArguments(output_dir="test_trainer4large", evaluation_strategy="epoch",
                                  per_device_train_batch_size=32, per_device_eval_batch_size=24)
# continued on the next page
20. # continued
import numpy as np
import evaluate
print ("evaluate.load('accuracy'):")
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    print("logits: {} label: {}".format(logits, labels))
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
print ("Trainer(…):")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=large_train_dataset,
    eval_dataset=large_eval_dataset,
    compute_metrics=compute_metrics,
)
print ("trainer.train():")
trainer.train()
eval_results = trainer.evaluate()
print("eval results:")
import pprint
pprint.pprint(eval_results)
trainer.save_model()
tokenizer.save_pretrained("test_trainer4large")
end = time.perf_counter()
print(end)
print("total time:"+str(end-start))
21. Sending the model with push_to_hub()

from transformers import AutoTokenizer, AutoModelForSequenceClassification

local_name = "test_trainer4large"  # the model trained on all of the data in the dataset
hub_name = "oniak3ai/bert_classification_yelp_review_full"
tokenizer = AutoTokenizer.from_pretrained(local_name)
tokenizer.push_to_hub(hub_name, private=True)
model = AutoModelForSequenceClassification.from_pretrained(local_name)
model.push_to_hub(hub_name, private=True)

If the tokenizer is not uploaded, the test run inside the model card fails with an error, so send both the tokenizer and the model to the Hugging Face Hub, each with its own push_to_hub method.
Parameters of the push_to_hub method:
"oniak3ai/bert_classification_yelp_review_full" is organization name / model name; the model name may differ from the local model name.
private=True uploads the model as private. If this parameter is not specified, the uploaded model becomes public.
22. How to test a model that is still private
Log in with huggingface-cli from a terminal, PowerShell, etc., then run the test code from Python.

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
# Load the fine-tuned model and tokenizer from the Hugging Face Hub
hub_name = "oniak3ai/bert_classification_yelp_review_full"
model = AutoModelForSequenceClassification.from_pretrained(hub_name)
tokenizer = AutoTokenizer.from_pretrained(hub_name)
# Run AI inference with the Hugging Face Transformers pipeline
classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
result = classifier("This place is really great ! I'd highly recommend.")
print (result)

oniak3@AkiranoiMac py % python3 privatetest.py
Downloading (…)lve/main/config.json: 100%|█████| 958/958 [00:00<00:00, 2.30MB/s]
Downloading pytorch_model.bin: 100%|█████████| 433M/433M [00:19<00:00, 21.7MB/s]
Downloading (…)okenizer_config.json: 100%|█████| 314/314 [00:00<00:00, 1.66MB/s]
Downloading (…)solve/main/vocab.txt: 100%|███| 232k/232k [00:00<00:00, 1.43MB/s]
Downloading (…)/main/tokenizer.json: 100%|███| 711k/711k [00:00<00:00, 1.11MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████| 125/125 [00:00<00:00, 763kB/s]
[{'label': 'LABEL_4', 'score': 0.6090189218521118}]
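
Authentication for the private repository can also be done from Python instead of the huggingface-cli. A minimal sketch (not from the deck), assuming an access token created on the Hugging Face site:

# Illustrative sketch: programmatic login with an access token (read scope is enough for downloading).
from huggingface_hub import login

login()  # prompts for the token interactively; alternatively login(token="hf_...")
# After logging in, from_pretrained("oniak3ai/bert_classification_yelp_review_full") can reach the private repo.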
23. Recap
Flow: run the fine-tuning → test in the local environment → private publication to the Hugging Face Hub → test the private model → public publication to the Hugging Face Hub → use of the published model → evaluation of accuracy.
What we experienced in today's session: extracting data from Hugging Face Datasets, fine-tuning with Trainer and PyTorch, local testing of the tuned model, publishing the model to the Hugging Face Hub, and testing the published model (Hosted inference API).
Sharing pre-trained models: Hugging Face NLP course, https://huggingface.co/learn/nlp-course/ja/chapter4/3
24. Workshops, sessions, and materials are prepared by IBM or the session presenters and reflect their own views. They are provided for informational purposes only and are not intended to constitute, nor do they constitute, legal or other guidance or advice to any participant. While efforts were made to ensure the completeness and accuracy of the information in this presentation, it is provided "as is" without any warranty, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing in this presentation is intended to, nor shall have the effect of, creating any warranty or representation from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and capabilities referenced in this presentation may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing in this presentation states or implies that any activities undertaken by participants will result in any specific sales, revenue growth, or other results.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending on many factors, including the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
IBM, the IBM logo, ibm.com, IBM Cloud, and IBM Cloud Paks are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available at www.ibm.com/legal/copytrade.shtml.
Microsoft, Windows, Windows Server, .NET Framework, .NET, and .NET Core are trademarks or registered trademarks of Microsoft Corporation. NVIDIA, the NVIDIA logo, and NVIDIA CUDA are trademarks or registered trademarks of NVIDIA Corporation. Hugging Face is a trademark of Hugging Face, Inc. (registration pending).
The models registered on Hugging Face that are used in this material can be operated under the license specified by each model. The AI inference code shown in this material is sample code, not complete code; it was prepared for learning purposes to give IT engineers more hands-on opportunities. When embedding an AI model in a real system, check the model's license agreement, prepare an AI inference runtime environment that meets your system requirements, add the necessary exception handling and otherwise write production-ready code, and perform sufficient debugging and testing. When using a fine-tuned model commercially, check the license agreement of the dataset you use.
For technical problem resolution and feedback on Hugging Face Transformers, please work with the open source community through GitHub Issues and Pull Requests at https://github.com/huggingface/transformers.