
AIアプリ Dojo #2 Hugging Face Transformers入門


As generative AI draws growing attention, engineers will have more and more opportunities to consider building AI applications. In this session we learn about Hugging Face Transformers, a library widely used for natural language understanding and natural language generation, working through everything from installation to simple execution.
It is not required, but if you install git and Python in advance, you will be able to try Hugging Face Transformers yourself while following along.

Akira Onishi (IBM)

June 28, 2023

Transcript

  1. IBM Japan, Ltd.
    Technology business unit
    Customer Success, Principal Manager
    Also lead of the Windows / .NET Container Porting Program
    Akira Onishi  AkiraOnishi@ibm.com  Twitter: @oniak3

    https://www.facebook.com/akiraonishi
    https://www.linkedin.com/in/oniak3
    AI Apps Dojo #2
    Hugging Face Transformers: An Introduction

  2. Self-introduction
    Property / Value
    Name: Akira Onishi
    Twitter / LinkedIn: oniak3
    Years in the IT industry: …
    Current obsession: dieting
    Hashtag: #いいねぇ静岡生活
    Motto: "Roadside grass that stands back up even after being trampled"
    Favorite technique: staying positive by reframing things in my head
    https://www.facebook.com/akiraonishi
    Search Facebook for「おにあく」

  3. The AI community building the future.
    https://huggingface.co/


  4. Scope of AI Apps Dojo #2
    To build toward AI adoption inside the enterprise,
    we start from the standpoint of using existing models as they are,
    and run inference with AI models using Transformers.
    * This session is hands-on oriented; it does not
    cover each technology exhaustively.

  5. Reference book: 機械学習エンジニアのためのTransformers
    (the Japanese edition of "Natural Language Processing with Transformers")
    https://www.oreilly.co.jp/books/…
    A reference book for learning 🤗 Transformers in depth

  6. Today's topics
    Local hardware
    Windows / Linux / Mac
    Python, PyTorch, CUDA

    AI inference app
    Hugging Face Transformers
    Models published on Hugging Face
    Setting the fine details aside, we run AI inference
    using Python and Hugging Face Transformers.
    What you will try in today's session:
    Text summarization
    Named entity recognition (NER)
    Text generation
    Classification, sentiment analysis of text
    Transcription from audio files
    Object detection in images
    Question answering
    Translation
    Source code generation

  7. Review: structure of an app involving AI inference
    Web browser
    App
    Web site
    Web API
    Service

    AI model
    Computation
    Today's talk is about this part:
    this computation is also called
    AI inference processing.

  8. Review: think of AI training and AI inference separately
    AI training:
    creating and improving models
    AI inference:
    computation that uses a model
    Training data
    Deep learning (computation)
    Backed by theoretical hypotheses, research,
    and experimental evidence
    Large-scale computing resources:
    HPC (High Performance
    Computing)
    AI model
    A computer suited to AI inference
    Required investment
    An OS and runtime suited to AI inference
    Computation using the AI model
    Experiments and validation with the model
    Inference can be started on a single machine
    using the AI community's libraries
    Development and operations should ideally
    assume frequent updates
    Feedback
    Training requires investment in huge data lakehouses,
    data scientists,
    and HPC environments
    https://huggingface.co/

  9. Review: execution environment for AI inference
    https://pypi.org/
    https://huggingface.co/
    Computer
    Windows / Linux OS
    Python, PyTorch, etc.
    CPU, memory
    GPU, GPU memory
    NVMe
    SSD
    GPU driver
    GPGPU computation libraries
    App that uses the AI model
    Hardware Abstraction Layer

    Chipset
    Network
    Interface
    Existing AI models
    The inference code that we implement
    Computation using the AI model
    Existing libraries
    https://www.nvidia.com/ja-jp/
    Power supply unit

  10. Review: Python
    Interpreted language
    Dynamic typing
    Frameworks and libraries that raise development efficiency
    https://www.python.org/downloads/release/python-3106/
    Today we use
    Python 3.10.6
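As a tiny illustration of the dynamic typing mentioned above (a sketch of the language feature, not code from the deck):

```python
# Dynamic typing: a variable is just a name; its type follows the value it holds.
x = 42
print(type(x).__name__)    # int

x = "Hugging Face"         # rebinding the same name to a str is fine
print(type(x).__name__)    # str

# Functions are duck-typed: any argument supporting "+" works unchanged.
def double(v):
    return v + v

print(double(21))          # 42
print(double("ab"))        # abab
```

This is part of why the short model-inference scripts later in the deck stay so compact.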

  11. Setting up the PyTorch environment
    https://pytorch.org/get-started/locally/
    A note when using CUDA:
    the latest CUDA Toolkit is not supported;
    install one of the CUDA versions listed
    on the page above.

  12. Hugging Face Model Hub
    https://huggingface.co/models
    239,989 models public
    as of 2023/6/28 21:54
    Search by name
    Filter by task
    A hosted test feature lets you
    try a model's inference in the browser

  13. Hugging Face Transformers
    https://github.com/huggingface/transformers/blob/main/README_ja.md
    Preparation, assuming we will run Japanese models:
    pip install transformers
    pip install sentencepiece
    Natural language understanding
    Natural language generation
    Inference from a model
    Additional training on a model

  14. 🤗 Transformers
    Customizing AI models:
    tuning and additional training of a model
    AI models and
    datasets
    A computer suited to AI inference / machine learning
    An OS and runtime suited to AI inference / machine learning
    Computation using the AI model
    https://huggingface.co/
    AI inference:
    computation using
    a pre-trained model
    A computer suited to AI inference
    An OS and runtime suited to AI inference
    Computation using the AI model
    Additional data
    (for training and for validation)
    from transformers import pipeline
    detector = pipeline(task="object-detection")
    preds = detector("<image URL>")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    * The training side will be covered
    in a future advanced session.

  15. Reference: how the Transformer works
    June 2017: the paper "Attention Is All You Need" is published.
    The base is the Transformer model for translation tasks.
    https://huggingface.co/learn/nlp-course/ja/chapter1/4
    Encoder:
    generates
    features
    Decoder:
    generates the
    target
    sequence
    Input (English)
    Predicted output (Japanese)
    Output predicted so far (Japanese)
    The rough point:
    it parallelizes more easily than the "recurrent" neural
    networks (RNNs) with feedback loops used until then,
    opening the way to much larger trained models.
    Many language models have appeared since this paper,
    falling into three broad families (GPT-like, BERT-like, BART/T5-like).
    The flip side of easy parallelization is higher memory consumption.
    (conceptual diagram)
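To make the parallelization point concrete, here is a toy scaled dot-product attention, the core operation of the paper, in plain Python. Every output position is computed independently from the whole input sequence, with no step-by-step loop over time as in an RNN (an illustrative sketch, not the deck's code):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence at once.
    Q, K, V are lists of token vectors; each output row depends only
    on the inputs, so all rows could be computed in parallel."""
    d = len(K[0])
    out = []
    for q in Q:                      # each query attends to ALL keys
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)    # weights over the sequence, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Self-attention over three toy 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
print(y)
```

Because each row of `y` is independent, a GPU can compute them all at once, which is exactly what made larger models practical.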

  16. import torch
    import time
    from transformers import AutoTokenizer, AutoModelForCausalLM

    prompt_base = "ユーザー: {}システム: "
    start = time.perf_counter()
    tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", use_fast=False)
    end = time.perf_counter()
    print("Tokenizer loaded:" + str(end - start))
    start = time.perf_counter()
    model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft")
    # With 12-16 GB of GPU memory, float16 just barely fits:
    #model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", torch_dtype=torch.float16)
    end = time.perf_counter()
    print("CausalLM loaded:" + str(end - start))
    if torch.cuda.is_available():
        model = model.to("cuda")
        print("cuda is available")

    def inferencing(prompt):
        start = time.perf_counter()
        token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(
                token_ids.to(model.device),
                do_sample=True,
                max_new_tokens=256,
                temperature=0.9,
                top_k=50,
                repetition_penalty=1.0,
                pad_token_id=tokenizer.pad_token_id,
                bos_token_id=tokenizer.bos_token_id,
                eos_token_id=tokenizer.eos_token_id
            )
        output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
        output = output.replace("<NL>", "\n")  # the model uses "<NL>" as its newline token
        end = time.perf_counter()
        print("Inferencing completed:" + str(end - start))
        return output

    # continued
    def do_conversation():
        text = input("Neox-3.6b>")
        if text == "end":
            return False
        prompt = prompt_base.format(text)
        result = inferencing(prompt)
        print(result)
        return True

    while True:
        res = do_conversation()
        if res == False:
            break

    Looking back at the previous AI Apps Dojo session
    Model used: rinna/japanese-gpt-neox-3.6b-instruction-sft
    https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft

  17. Encode / Model / Decode
    Interaction with the AI happens in "information humans can understand";
    inside, machine-learning computation is all "numeric operations".
    Text entered by a person or system
    Encoder / Decoder
    Model used
    for inference
    Input / output
    numeric vectors e1 … en into the model,
    numeric vectors r1 … rn out of it
    Generated
    text
    https://huggingface.co/docs/transformers/main_classes/tokenizer
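The Encode, Model, Decode flow above can be sketched with a toy vocabulary (purely illustrative: the real tokenizer on the page linked above does subword tokenization, not whole-word lookup):

```python
# Toy sketch of Encode -> Model -> Decode: text in, numbers inside, text out.
vocab = {"hello": 0, "world": 1, "transformers": 2}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    """Encoder: human-readable text -> numeric IDs (the vectors e1..en)."""
    return [vocab[w] for w in text.split()]

def run_model(ids):
    """Stand-in 'model': some numeric computation on the IDs.
    Here it just echoes them; a real model predicts new IDs (r1..rn)."""
    return list(ids)

def decode(ids):
    """Decoder: numeric IDs -> human-readable text."""
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("hello transformers")
print(ids)                        # [0, 2]
print(decode(run_model(ids)))     # hello transformers
```

The next slide shows the same round trip with the real `AutoTokenizer`.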

  18. Reference: tokenizing a string
    Save it as t.py (never as token.py; explained on the next page)

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    encoding = tokenizer("We are very happy to show you the 🤗 Transformers library")

    print(encoding)

    Result:
    {'input_ids': […],
    'token_type_ids': […], 'attention_mask': […]}

  19. Sharing a failure: never create token.py
    File "/Users/oniak…/token.py", line …, in <module>
    from transformers import AutoTokenizer
    ImportError: cannot import name 'AutoTokenizer' from partially initialized module
    'transformers' (most likely due to a circular import)

    (/Library/Frameworks/Python.framework/Versions/…/lib/python…/site-
    packages/transformers/__init__.py)

    File "D:\learn\transformers\samples\token.py", line …, in <module>
    from transformers import AutoTokenizer
    ImportError: cannot import name 'AutoTokenizer' from partially initialized module
    'transformers' (most likely due to a circular import)

    (C:\Users\oniak…\AppData\Local\Programs\Python\Python…\lib\site-
    packages\transformers\__init__.py)

    Caution: if you save your own Python code as token.py in the folder you run from,
    it is always picked up when other Python programs run, causing unexpected errors
    (it is taken as overriding Python's built-in token constants).

    Searching for the displayed error message makes it hard to notice what is actually wrong,
    and the error persists as long as a file named token.py exists, so reinstalling transformers, Python or PyTorch cannot fix it.
    https://docs.python.org/ja/3/library/token.html
    Your Python environment looks broken!?
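One way to catch this kind of shadowing early is to ask Python where a module name actually resolves before importing it (a small diagnostic sketch; `importlib.util.find_spec` is standard library):

```python
import importlib.util
import os

# Resolve the module name "token" WITHOUT importing it.
spec = importlib.util.find_spec("token")
print(spec.origin)   # should point into the standard library

# If it resolves to your own working folder instead, a local token.py
# is shadowing the standard library module -- rename your file.
if spec.origin and os.path.dirname(os.path.abspath(spec.origin)) == os.getcwd():
    print("WARNING: a local token.py is shadowing the standard library")
```

The same check works for any stdlib name you might accidentally reuse (token.py, code.py, email.py, and so on).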

  20. Transformers: typical use cases
    Text summarization
    Named entity recognition (NER)

    Text generation
    Classification, sentiment analysis of text
    Transcription from audio files
    Object detection in images
    Question answering
    Translation
    Source code generation
    I have collected simple code samples
    matching these use cases,
    and verified that they run both on Windows with a GPU
    and on macOS without GPU support.

  21. Text summarization
    #textsum.py  (quotes the sample code on the model's page)
    from transformers import pipeline
    seq2seq = pipeline("summarization", model="tsmatz/mt5_summarize_japanese")
    sample_text = "サッカーのワールドカップカタール大会、世界ランキング24位でグループEに属する日本は、23日の1次リーグ初戦において、世界11位で過去4回の優勝を誇るドイツと対戦しました。試合は前半、ドイツの一方的なペースではじまりましたが、後半、日本の森保監督は攻撃的な選手を積極的に動員して流れを変えました。結局、日本は前半に1点を奪われましたが、途中出場の堂安律選手と浅野拓磨選手が後半にゴールを決め、2対1で逆転勝ちしました。ゲームの流れをつかんだ森保采配が功を奏しました。"
    result = seq2seq(sample_text)
    print(result)
    Downloading (…)lve/main/config.json: 100%|█████| 867/867 [00:00<00:00, 2.00MB/s]
    Downloading pytorch_model.bin: 100%|███████| 1.20G/1.20G [00:15<00:00, 79.1MB/s]
    Downloading (…)okenizer_config.json: 100%|█████| 399/399 [00:00<00:00, 3.74MB/s]
    Downloading spiece.model: 100%|████████████| 4.31M/4.31M [00:00<00:00, 26.9MB/s]
    Downloading tokenizer.json: 100%|██████████| 16.3M/16.3M [00:00<00:00, 17.0MB/s]
    Downloading (…)cial_tokens_map.json: 100%|████| 74.0/74.0 [00:00<00:00, 612kB/s]
    Your max_length is set to 128, but your input_length is only 126. Since this is a summarization task, where
    outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g.
    summarizer('...', max_length=63)
    [{'summary_text': 'サッカーのワールドカップカタール大会は23日、1次リーグ初戦があり、世界ランキング24位でグループEに属する日本は、
    ドイツと対戦した。'}]
    Using https://huggingface.co/tsmatz/mt5_summarize_japanese  (model size about 1.2 GB)

  22. Named entity recognition
    #ner.py  (the sample code on the model's page, partly modified)
    from transformers import pipeline
    import pandas as pd
    model_name = "tsmatz/xlm-roberta-ner-japanese"
    classifier = pipeline("token-classification", model=model_name)
    result = classifier("田中は4月の陽気の良い日に、鈴をつけて熊本県の阿蘇山に登った。熊本の米焼酎「白岳しろ」を飲んだ。")
    df = pd.DataFrame(result)
    print(df)
       entity     score  index  word  start  end
    0     PER  0.999310      1     ▁      0    1
    1     PER  0.999407      2     田      0    1
    2     PER  0.999074      3     中      1    2
    3     LOC  0.998935     14    熊本     19   21
    4     LOC  0.997582     15     県     21   22
    5     LOC  0.998968     17     阿     23   24
    6     LOC  0.998960     18     蘇     24   25
    7     LOC  0.998147     19     山     25   26
    8     LOC  0.990043     24    熊本     31   33
    9     PRD  0.997916     30     白     38   39
    10    PRD  0.998629     31     岳     39   40
    11    PRD  0.998314     32     ▁     41   42
    12    PRD  0.997710     33     し     41   42
    13    PRD  0.998055     34     ろ     42   43
    Using https://huggingface.co/tsmatz/xlm-roberta-ner-japanese  (the first run takes a while to download the model)
    pip install pandas

  23. Text generation
    #tg.py
    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium",
        use_fast=False, padding_side='left')
    tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading
    model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
    data = "こんにちは、"
    input = tokenizer.encode(data, return_tensors="pt")
    output = model.generate(input, do_sample=True, max_length=300,
        num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.batch_decode(output))
    pip install protobuf
    https://huggingface.co/rinna/japanese-gpt2-medium
    Uses an AutoClass rather than a pipeline

  24. Sentiment analysis
    # sentiment.py
    from transformers import pipeline
    model_name = "Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime"
    classifier = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)
    data = [
        "昨日遅くまで仕事して大変だった。しかも褒められず悲しかった",
        "昨日遅くまで仕事して大変だったけど、褒められたので元気いっぱい",
        "カラオケで大声で声が枯れるまで熱唱した",
        "今朝から大雨が降ってどこにも外出できなくて最悪だった"]
    ret = classifier(data)
    emotions = ['うれしい', '悲しい', '期待', '驚き', '怒り', '恐れ', '嫌悪', '信頼']
    for i, r in enumerate(ret):
        print(f"'{data[i]}' は {float(r['score']) * 100:.2f}%のスコアで `{emotions[int(r['label'][-1])]}` と判定されました")
    '昨日遅くまで仕事して大変だった。しかも褒められず悲しかった' は 97.88%のスコアで `悲しい` と判定されました
    '昨日遅くまで仕事して大変だったけど、褒められたので元気いっぱい' は 98.29%のスコアで `うれしい` と判定されました
    'カラオケで大声で声が枯れるまで熱唱した' は 98.48%のスコアで `うれしい` と判定されました
    '今朝から大雨が降ってどこにも外出できなくて最悪だった' は 97.69%のスコアで `悲しい` と判定されました
    Using https://huggingface.co/Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime

  25. Transcription from audio
    #trans.py : quotes the sample code in the Transformers documentation
    from transformers import pipeline
    generator = pipeline(model="openai/whisper-large")
    text = generator(
    [
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac",
    ]
    )
    print(text)
    [{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'},
    {'text': ' He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton
    pieces to be ladled out in thick, peppered flour-fattened sauce.'}]
    Using https://huggingface.co/openai/whisper-large  (6.17 GB, so the first download takes time)
    oniak3@AkiranoiMac py % python3 stt.py
    Downloading (…)lve/main/config.json: 100%|█| 1.96k/1.96k [00:00<00:00, 5.31MB/s]
    Downloading pytorch_model.bin: 100%|███████| 6.17G/6.17G [05:06<00:00, 20.2MB/s]
    Downloading (…)neration_config.json: 100%|█| 3.51k/3.51k [00:00<00:00, 25.6MB/s]
    Downloading (…)okenizer_config.json: 100%|█████| 842/842 [00:00<00:00, 8.18MB/s]
    Downloading (…)olve/main/vocab.json: 100%|█| 1.04M/1.04M [00:00<00:00, 1.64MB/s]
    Downloading (…)/main/tokenizer.json: 100%|█| 2.20M/2.20M [00:00<00:00, 11.7MB/s]
    Downloading (…)olve/main/merges.txt: 100%|███| 494k/494k [00:00<00:00, 34.7MB/s]
    Downloading (…)main/normalizer.json: 100%|██| 52.7k/52.7k [00:00<00:00, 353kB/s]
    Downloading (…)in/added_tokens.json: 100%|█| 2.08k/2.08k [00:00<00:00, 18.7MB/s]
    Downloading (…)cial_tokens_map.json: 100%|█| 2.08k/2.08k [00:00<00:00, 10.2MB/s]
    Downloading (…)rocessor_config.json: 100%|███| 185k/185k [00:00<00:00, 59.9MB/s]
    ffmpeg is required
    https://ffmpeg.org/download.html
    Exercise:
    what happens if you point it at an audio file
    containing Japanese conversation?
    Give it a try.

  26. Object detection in images
    #obd.py : quotes the code in the Transformers documentation
    import requests
    from PIL import Image
    from transformers import pipeline
    # Download an image with cute cats
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
    image_data = requests.get(url, stream=True).raw
    image = Image.open(image_data)
    # Allocate a pipeline for object detection
    object_detector = pipeline('object-detection')
    result = object_detector(image)
    print (result)
    [{'score': 0.9982201457023621, 'label': 'remote', 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax':
    117}}, {'score': 0.9960021376609802, 'label': 'remote', 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368,
    'ymax': 187}}, {'score': 0.9954745173454285, 'label': 'couch', 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639,
    'ymax': 473}}, {'score': 0.9988006353378296, 'label': 'cat', 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314,
    'ymax': 470}}, {'score': 0.9986783862113953, 'label': 'cat', 'box': {'xmin': 345, 'ymin': 23, 'xmax':
    640, 'ymax': 368}}]
    https://huggingface.co/docs/transformers/tasks/object_detection
    pip install timm


  27. Question answering
    #qanda.py : quotes the sample code on the model's page
    from transformers import pipeline
    model_name = "tsmatz/roberta_qa_japanese"
    qa_pipeline = pipeline("question-answering", model=model_name, tokenizer=model_name)
    result = qa_pipeline(
        question = "決勝トーナメントで日本に勝ったのはどこでしたか。",
        context = "日本は予選リーグで強豪のドイツとスペインに勝って決勝トーナメントに進んだが、クロアチアと対戦して敗れた。",
        align_to_words = False,
    )
    print(result)
    oniak3@AkiranoiMac py % code qanda.py
    oniak3@AkiranoiMac py % python3 qanda.py
    Downloading (…)lve/main/config.json: 100%|█████| 731/731 [00:00<00:00, 1.71MB/s]
    Downloading pytorch_model.bin: 100%|█████████| 440M/440M [00:21<00:00, 20.7MB/s]
    Downloading (…)okenizer_config.json: 100%|█████| 540/540 [00:00<00:00, 2.86MB/s]
    Downloading spiece.model: 100%|██████████████| 806k/806k [00:00<00:00, 1.14MB/s]
    Downloading (…)/main/tokenizer.json: 100%|█| 2.41M/2.41M [00:00<00:00, 2.67MB/s]
    Downloading (…)cial_tokens_map.json: 100%|█████| 170/170 [00:00<00:00, 1.03MB/s]
    {'score': 0.4740956723690033, 'start': 38, 'end': 43, 'answer': 'クロアチア'}
    Using https://huggingface.co/tsmatz/roberta_qa_japanese  (an extractive question-answering model:
    it extracts an answer contained in the given context)

  28. Translation (English to Japanese)

    #e2j.py
    #pip install sacremoses is required
    from transformers import pipeline
    model_name = "staka/fugumt-en-ja"
    translator = pipeline("translation", model=model_name)
    text = ["I have a pen. I have an Apple. How's your translation in Japanese?",
    "Watsonx is our upcoming enterprise-ready AI and data platform designed to multiply the impact of AI
    across your business. The platform comprises three powerful components: the watsonx.ai studio for new
    foundation models, generative AI and machine learning; the watsonx.data fit-for-purpose store for the
    flexibility of a data lake and the performance of a data warehouse; plus the watsonx.governance toolkit,
    to enable AI workflows that are built with responsibility, transparency and explainability."]
    ret = translator(text)
    print(ret)
    oniak3@AkiranoiMac py % python3 trans.py
    [{'translation_text': '私はペンを持っています。私はアップルを持っています。あなたの翻訳は日本語でどうですか。'},
    {'translation_text': 'Watsonxは、エンタープライズ対応のAIとデータプラットフォームで、ビジネス全体のAIの影響を掛け合わせ
    るように設計されています。プラットフォームは、3つの強力なコンポーネントで構成されています。新しい基盤モデルのための
    watsonx.aiスタジオ、ジェネレーティブAIと機械学習、データレイクの柔軟性とデータウェアハウスのパフォーマンスのための
    watsonx.data fit-for-purposeストア、責任、透明性、説明可能性を備えたAIワークフローを実現するためのwatsonx.governance
    ツールキットです。'}]
    Using https://huggingface.co/staka/fugumt-en-ja  (see https://staka.jp/wordpress/?p=…)
    pip install sacremoses

  29. Source code generation
    #codeg.py: based on https://blog.salesforceairesearch.com/codegen/
    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-1B")
    model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-1B", trust_remote_code=True,
        revision="main")
    text = "Solve the two sum problem with python."
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=192)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=False)[len(text):])
    import math
    def two_sum(nums, target):
        """
        :type nums: List[int]
        :type target: int
        :rtype: List[int]
        """
        nums.sort()
        for i in range(len(nums)):
            if nums[i] == target:
                return [i, len(nums)]
        return []
    if __name__ == '__main__':
        print(two_sum([2, 7, 11, 15], 9))
    <|endoftext|><|python|>#
    https://huggingface.co/Salesforce/codegen2-1B  (a large download, so be patient; the paper is linked there)
    For trial purposes you can run inference on a CPU, but it is not practical.
    Inference time / system configuration:
    … seconds: Windows … Pro,
    AMD Ryzen … X / NVIDIA GeForce RTX …
    … seconds: macOS Ventura,
    iMac, Intel Core i…, … GHz …-core CPU

  30. Summary
    We used Transformers to
    run inference with AI models.
    Call to Action:
    share today's experience with others;
    try Transformers yourself.

  31. The workshops, sessions and materials were prepared by IBM or the session presenters and reflect their own
    views. They are provided for informational purposes only, are not intended as legal or other guidance or
    advice to any participant, and shall not have such an effect. While every effort has been made for
    completeness and accuracy, the information in this material is provided "as is" with no warranty of any
    kind, express or implied. IBM accepts no responsibility for any damage arising from the use of this
    material or other materials, or otherwise in connection with them. Nothing in this material is intended
    to, nor shall have the effect of, creating any warranty or representation from IBM or its suppliers or
    licensors, or altering the terms of the applicable license agreement governing the use of IBM software.
    References in this material to IBM products, programs or services do not imply that they are available in
    all countries in which IBM operates. Product release dates and capabilities mentioned may be changed at
    any time at IBM's sole discretion based on market opportunities or other factors, and are not a commitment
    to future product or feature availability. Nothing in this material states or implies that activities
    undertaken by participants will result in any specific sales, revenue growth or other results. Performance
    is based on measurements and projections using standard IBM benchmarks in a controlled environment. Actual
    throughput or performance that a user experiences will vary depending on many factors, including the
    amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration and
    the workload processed; therefore, no assurance is given that an individual user will achieve results
    similar to those stated here.
    All customer examples described are presented as illustrations of how those customers have used IBM
    products and the results they may have achieved. Actual environmental costs and performance
    characteristics may vary by customer.
    IBM, the IBM logo, ibm.com, IBM Cloud and IBM Cloud Paks are trademarks of International Business Machines
    Corporation, registered in many jurisdictions worldwide. Other product and service names may be trademarks
    of IBM or other companies. For a current list of IBM trademarks, see www.ibm.com/legal/copytrade.shtml.
    Microsoft, Windows, Windows Server, .NET Framework, .NET and .NET Core are trademarks or registered
    trademarks of Microsoft Corporation.
    NVIDIA, the NVIDIA logo and NVIDIA CUDA are trademarks or registered trademarks of NVIDIA Corporation.
    Hugging Face is a trademark of Hugging Face, Inc. (registration pending).
    The models registered on Hugging Face that are used in this material can be used under the license each
    model specifies.
    The AI-inference code shown in this material consists of samples, not complete code; it was prepared for
    learning purposes, to give IT engineers more hands-on opportunities. When embedding AI models in a real
    system, check each model's license terms, prepare an AI inference environment that meets your system
    requirements, add the necessary exception handling and other production-ready code, and debug and test
    thoroughly.
    For technical issues with, and feedback on, Hugging Face Transformers, work with the open-source community
    through GitHub Issues and Pull Requests at https://github.com/huggingface/transformers.