
All We Need Is Prompting on a Pre-trained Japanese Large Language Model

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. None
  2. Toshinori Sato (@overlast) • Senior Software Engineer / Manager •

    Natural Language Processing • Information Retrieval • LINE CLOVA • Japanese NLU system • HyperCLOVA • Japanese Corpus / Evaluation • OSS: main contributor of the NEologd project • mecab-ipadic-NEologd
  3. LINE NLP team and contributors Toshinori Sato Takashi Uemura Wataru

    Sakata Akifumi Nakamachi Kenta Shinzato Takuto Asakura Tatsuya Uchiyama Masahiko Higashiyama Tung Nguyen Shengzhe Li Koga Kobayashi Takato Yamazaki Seiichi Inoue Yoshifumi Kondo Jumon Nozaki et al.
  4. Attention, please! The target audience is mainly engineers who

    are interested in natural language processing (NLP), and there is also a part for NLP professionals. Detailed information on the following is omitted from this session: building language models, and tuning methods for language models. See the paper below for more.* This is a 40-minute session; please enjoy listening to it over a cup of coffee or something! * What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers, Boseop Kim et al., EMNLP 2021 (https://arxiv.org/abs/2109.04650)
  5. None
  6. Application example of HyperCLOVA A dialogue system with role-playing functions

  7. Large-scale general-purpose language models + α: automatic evaluation

    with the 39B JP Model on a QA task
  8. None
  9. Application example of HyperCLOVA A dialogue system with role-playing functions

  10. Application example of HyperCLOVA A dialogue system with role-playing functions

    Purely auto-generated text
  11. E.g. A spoken dialogue system applying HyperCLOVA. A user says "Hello."

  12. E.g. A spoken dialogue system applying HyperCLOVA. The user's voice ("Hello.") reaches the Client App.

  13. E.g. A spoken dialogue system applying HyperCLOVA. The Client App passes the voice ("Hello.") to Speech To Text, which feeds the Dialog App.

  14. E.g. A spoken dialogue system applying HyperCLOVA. The Dialog App sends query text to HyperCLOVA.

  15. E.g. A spoken dialogue system applying HyperCLOVA. HyperCLOVA, which combines the Large-scale Language Model with components such as Knowledge Base Search, returns result text to the Dialog App.

  16. E.g. A spoken dialogue system applying HyperCLOVA. The Dialog App turns the result text into response text.

  17. E.g. A spoken dialogue system applying HyperCLOVA. Text To Speech converts the response text into sounds: "Long time, no see."

  18. E.g. A spoken dialogue system applying HyperCLOVA. The full pipeline: voice ("Hello.") → Client App → Speech To Text → Dialog App → query text → HyperCLOVA (Large-scale Language Model, Knowledge Base Search, ...) → result text → response text → Text To Speech → sounds ("Long time, no see.") back to the user.
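To make the data flow of slides 11-18 concrete, here is a minimal sketch of one conversational turn in Python. Every function below (speech_to_text, query_hyperclova, text_to_speech) is a hypothetical placeholder for the corresponding component in the diagram, not an actual LINE API.

```python
# Minimal sketch of one turn of the spoken dialogue pipeline in
# slides 11-18. All component functions are hypothetical placeholders.

def speech_to_text(voice: bytes) -> str:
    """Speech To Text: transcribe the user's voice into query text."""
    raise NotImplementedError  # stand-in for an STT service

def query_hyperclova(query_text: str) -> str:
    """HyperCLOVA: return result text for the query text. Internally the
    Large-scale Language Model is combined with components such as
    Knowledge Base Search."""
    raise NotImplementedError  # stand-in for the HyperCLOVA API

def text_to_speech(response_text: str) -> bytes:
    """Text To Speech: synthesize the sounds the client app will play."""
    raise NotImplementedError  # stand-in for a TTS service

def handle_turn(voice: bytes) -> bytes:
    """Dialog App: voice in ("Hello."), sounds out ("Long time, no see.")."""
    query_text = speech_to_text(voice)
    result_text = query_hyperclova(query_text)
    response_text = result_text  # the Dialog App may post-process here
    return text_to_speech(response_text)
```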
  19. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  20. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  21. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  22. What is a Language Model (LM)? - HyperCLOVA includes an

    unsupervised autoregressive language model - An autoregressive language model … - is capable of calculating probability distributions over text - provides maximum-likelihood estimation of parameters for a sample - can generate future text based on the text seen so far - In short, a model that gives the probability of a certain sequence of words - E.g. P(It’s sunny today) > P(Sunny is today)
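The last two bullets can be made concrete with the chain rule: an autoregressive LM scores a word sequence as the product of per-token conditional probabilities, P(w1…wn) = Πi P(wi | w1…wi-1). A minimal sketch with invented numbers (the probabilities below are purely illustrative, not from any real model):

```python
import math

# Toy illustration of autoregressive scoring: the probability of a
# sequence is the product of per-token conditional probabilities.
# All probability values below are invented for illustration.

def sequence_log_prob(tokens, cond_prob):
    log_p = 0.0
    for i, tok in enumerate(tokens):
        log_p += math.log(cond_prob(tuple(tokens[:i]), tok))
    return log_p

probs = {
    ((), "It's"): 0.20, (("It's",), "sunny"): 0.10,
    (("It's", "sunny"), "today"): 0.30,
    ((), "Sunny"): 0.05, (("Sunny",), "is"): 0.02,
    (("Sunny", "is"), "today"): 0.01,
}
cond = lambda ctx, tok: probs[(ctx, tok)]

# The fluent ordering scores higher, mirroring the slide's example:
assert sequence_log_prob(["It's", "sunny", "today"], cond) > \
       sequence_log_prob(["Sunny", "is", "today"], cond)
```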
  23. Our policy for corpus data collection: we do not use

    any data from our conversation services - no messages on LINE - no posts on OpenChat. We maintain this corpus with the utmost consideration for the rights held by various customers. - Add versatility to this corpus - Make a subset of this corpus available for use outside of LINE
  24. LINE LM Corpus (for HyperCLOVA’s LMs): NO DATA from LINE or

    LINE OpenChat is used to build our LMs - Developed based on a corpus built for training BERT models since 2019 - Used data crawled for LINE search - Eliminated data that can easily be extracted as "non-public personal information" - Covered sites important for learning Japanese expressions - Purchased and used external content after resolving rights issues!
  25. Current status of LINE LM Corpus for the 82B JP Model:

    Samples: 10B | Tokens: 500B | Bytes: 1.8T
  26. None
  27. Extensive use of pre-trained large-scale LMs

  28. Modeling status of HyperCLOVA - JP Model: 1.3B → 6.7B → 13B →

    39B - Multi-lingual Model: 13B → 39B - Large model (JP / Multi-lingual): 82B - Hyper-scale JP Model: 204B~ (in 2022) - Work in progress
  29. Architecture of HyperCLOVA: Eco System / Infra / Model / Data

  30. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  31. Methods for applying LMs to a target task - Method

    for HyperCLOVA etc. - Few-shot: give “a description of the task” and “some demonstration cases” - One-shot: give “a description of the task” and “one demonstration case” - Zero-shot: give “a description of the task” only - Pros: a task can possibly be solved from brief instructions or short examples - Cons: may not reach the performance a SOTA model achieves through fine-tuning - Method for BERT etc. - Fine-tuning: supervised learning on a dataset of the target task, starting from a general-purpose pre-trained model - Pros: excellent performance in benchmarks - Cons: needs training for each target task / possible loss of generalization ability
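The three prompting regimes differ only in how many demonstration cases precede the query; the model weights never change. A minimal sketch with a made-up sentiment task (the task, labels, and formatting are illustrative assumptions, not taken from the talk):

```python
# Sketch of zero-/one-/few-shot prompts for a made-up sentiment task.
# Only the prompt string changes between regimes; the LM stays fixed.

description = "Classify the sentiment of the review as positive or negative."
demonstrations = [
    ("The food was great.", "positive"),
    ("The service was slow.", "negative"),
]

def build_prompt(query, n_shots):
    lines = [description]
    for text, label in demonstrations[:n_shots]:  # 0 = zero-shot, 1 = one-shot, ...
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")  # the model completes the label
    return "\n".join(lines)

print(build_prompt("I loved the dessert.", n_shots=2))  # few-shot (2 shots)
```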
  32. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Output / Title (description of the task) / Other information / Samples / Query (in some cases, the output is given as a suffix after inference) / Shot / Shot / ….
  33. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio)

  34. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio)

  35. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField
  36. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt
  37. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline
  38. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task)
  39. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information
  40. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Shot / Shot / ….
  41. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Shot / Shot / ….
  42. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples
  43. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Shot
  44. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Shot / Shot / ….
  45. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Shot / Shot / ….
  46. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Query (in some cases, the output is given as a suffix after inference) / Shot / Shot / ….
  47. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Output / Title (description of the task) / Other information / Samples / Query (in some cases, the output is given as a suffix after inference) / Shot / Shot / ….
  48. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Output / Title (description of the task) / Other information / Samples / Query (in some cases, the output is given as a suffix after inference) / Shot / Shot / ….
  49. Task Outlines and Few Shots for individual tasks: Playground (HyperCLOVA Studio).

    TextField / Prompt / Task outline / Title (description of the task) / Other information / Samples / Query (in some cases, the next output is given as a suffix after inference) / Shot / Previous output / …. / Output
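The anatomy built up in slides 32-49 (a task outline consisting of a title and other information, sample shots, a query, and an output that can be fed back as a suffix) can be captured in two small helpers. This is a sketch under the assumption that the parts are plain strings joined by newlines; the slides do not specify HyperCLOVA Studio's exact separators.

```python
# Sketch of the prompt anatomy from slides 32-49. The newline separators
# are assumptions; the slides do not specify them.

def assemble_prompt(title, other_info, shots, query):
    task_outline = f"{title}\n{other_info}"       # description of the task
    samples = "\n".join(shots)                    # few-shot demonstrations
    return f"{task_outline}\n{samples}\n{query}"  # the model completes the output

def extend_prompt(prompt, previous_output, next_query):
    # Slide 49: the previous output is appended as a suffix after
    # inference, and the next query continues the same prompt.
    return f"{prompt}{previous_output}\n{next_query}"
```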
  50. Response to individual tasks with a task outline and few-shot examples: supporting haiku writing with 2 shots.

    The (Japanese) prompt says "Generate a haiku from its explanation." Shot 1: IN: an explanation of Basho's verse about the sound of a frog jumping into an old pond, the stillness of its surroundings, and its season word; OUT: 古池や蛙飛こむ水のおと ("an old pond / a frog jumps in / the sound of water"). Shot 2: IN: an explanation of the verse about cicadas singing at Risshakuji temple in Yamagata Prefecture, whose exact interpretation is unsettled and whose seemingly contradictory pairing of "stillness" and "cicadas' voices" invites reflection; OUT: 閑さや岩にしみ入る蝉の声 ("stillness / seeping into the rocks / the cicadas' voices"). Query: IN: an explanation of a salaryman engineer heading into an afternoon meeting on a hot end-of-month day to discuss an ambitious system-development project, hoping that realizing this technology will open up the future of AI despite various concerns; OUT: 月の熱い砂に埋めたるわが魂 ← With a little editing, it becomes a haiku, e.g. 熱き月砂に埋めたるわが魂.
  51. Extensive use of pre-trained large-scale LMs

  52. Document generation using the Playground, e.g. from a product summary to a description.

    - Product packages always contain a product summary, but in many cases an advertising description is written up by a salesperson - In this demo, the parameters of HyperCLOVA Studio (Playground) are adjusted to generate an attractive description - From a food product's title and summary, HyperCLOVA Studio generates an attractive description to advertise it
  53. Demo Movie 60sec

  54. (Demo input and output, in Japanese.) Product name: すっぱ かにせん.

    Summary: ○ snack food ○ ingredients: wheat flour (made in Japan), vegetable oil, starch, crab meat, powdered vinegar (contains wheat), sugar, salt, crab powder, kombu powder / leavening agent, sweetener ○ net weight: 60 g. Generated description: "A crab-flavored cracker with a gentle sour tang. The way it crumbles in your mouth is irresistible. Enjoy it as a snack or with drinks." Prompt structure: Base + Prefix + Output, i.e. "generate a description that makes people want to buy the product from its summary", with a one-shot demonstration for a bitter chocolate product (overwhelmingly strong bitterness and a sharp aroma with a faint sweetness behind them, in a large bag that also suits year-end-party punishment games). One-shot => increased the Temperature (randomness) and lowered the Repetition penalty (for control of repetition) to make the text contain the appeal.
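The tuning described in the last line (raise the Temperature for more varied, appealing wording; lower the Repetition penalty so catchy phrasing may recur) can be pictured as a request payload. The parameter names and values below follow common LM-API conventions and are assumptions, not a confirmed HyperCLOVA Studio schema.

```python
# Illustrative request payload for the demo's settings. Parameter names
# and values follow common LM-API conventions, not a confirmed
# HyperCLOVA Studio schema.

one_shot_prompt = (
    "Generate a description that makes people want to buy the product.\n"
    "<one demonstration: product name, summary, description>\n"
    "<new product name and summary>\n"
    "Description:"
)
request = {
    "prompt": one_shot_prompt,
    "temperature": 0.8,         # raised: more varied, "appealing" wording
    "repetition_penalty": 1.0,  # lowered: permit catchy repetition
    "max_tokens": 120,
}
```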
  55. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  56. Application development using the Inference API: Eco System / Infra / Model / Data

  57. Subjective evaluation of a dialogue system using HyperCLOVA with the 6.7B/13B/39B

    JP Models on 4 tasks: 1. Understanding of basic vocabulary; 2. Tracking different multiple topics; 3. Reacting to user sentiment on a topic; 4. Free chatting. Annotations were made for all task/model combinations, by the same 5 annotators. Each session consists of N round-trip conversational pairs; the user receives a list of N topics for evaluation, and each session consumes one vocabulary item from the list. Conducted in the Playground.
  58. Common evaluation criteria for all tasks! (Exception: the free-chat task did not evaluate

    achievement of the goal.) Natural response - Q: Was it a natural reaction? Are there any breakdowns or inconsistencies in the conversation history? Following a topic - Q: Did it stay on topic? Did it lose track of the topic (in this case, of what it was being asked about)? Was it able to switch topics (in this case, to pull back to the previous question)? Providing a topic or asking a question - Q: Did it provide a topic? Was it able to get the speaker to talk during the answer (most likely not)? Achievement of goals - Q: Did it achieve the objective?
  59. 1. Understanding of basic vocabulary. Vocabulary items at the elementary and

    secondary levels include: 小学校, 金づち, 中学校, 鉛筆, 大人, チューリップ, 先生, ヒマワリ, ライオン, 机, キリン, 椅子, 電車, 靴, 車, サンダル, セーター, りんご, スカート, みかん, キャベツ, サンマ, きゅうり, マグロ, スズメ, ハーモニカ, インコ, ピアノ, トンボ, アリ (source: https://repository.ninjal.ac.jp/ …). To do: ask HyperCLOVA questions in Level 1 and Level 2 form for each vocabulary item.
  60. 1. Understanding of basic vocabulary

  61. 1. Understanding of basic vocabulary. Does the system accurately answer

    word meanings (Level 1) and emotions (Level 2)?
    Metric                                 | 6.7B | 13B  | 39B
    Natural response                       | 0.55 | 0.66 | 0.98
    Following a topic                      | 0.63 | 0.84 | 0.98
    Providing a topic or asking a question | 0.00 | 0.01 | 0.00
    Achievement of goals                   | 0.55 | 0.68 | 0.84
  62. 2. Tracking different multiple topics. Topic pairs (A / B):

    COVID-19 / inbound tourism; Ichiro / Shohei Ohtani; state-of-emergency declaration / COVID-19 vaccine; AR (augmented reality) / autonomous-driving technology; YouTuber / VTuber; Leonardo da Vinci / Claude Monet; Heisei / Reiwa; the Internet / 5G; deflationary economy / super-aging society; overseas travel / domestic travel; electric vehicles / the Linear Chuo Shinkansen. To do: start a conversation about topic A and switch to topic B within 10 round trips.
  63. 2. Tracking different multiple topics. Evaluation: can the system move from topic A to topic

    B during a conversation?
    Metric                                 | 6.7B | 13B  | 39B
    Natural response                       | 0.66 | 0.53 | 0.91
    Following a topic                      | 0.71 | 0.61 | 0.95
    Providing a topic or asking a question | 0.04 | 0.01 | 0.02
    Achievement of goals                   | 0.66 | 0.55 | 0.91
  64. 3. Reacting to user sentiment on a topic. Topic | sentiment A | sentiment B:

    COVID-19 | "let's hang in there" | "I'm anxious"; inbound tourism | "it will come back" | "it won't come back"; COVID-19 vaccine | "let's wait" | "when will it ever arrive"; YouTuber | "I'd like to do it" | "I wouldn't want to do it"; Shohei Ohtani | "I want him to do well" | "I want him to strike out"; AR (augmented reality) | "it's interesting" | "I'm tired of it"; super-aging society | "it will be fine" | "I'm worried"; overseas travel | "I want to go" | "I don't want to go"; electric vehicles | "I want to ride one" | "I don't want to ride one"; the Linear Chuo Shinkansen | "I want to ride it" | "I don't want to ride it". To do: have a conversation of 15 back-and-forth turns about the topic, speaking with sentiment A at first and then with sentiment B.
  65. 3. Reacting to user sentiment on a topic. Evaluation: was the system able to agree with the user

    when he or she felt sentiment A about the topic?
    Metric                                 | 6.7B | 13B  | 39B
    Natural response                       | 0.69 | 0.45 | 0.90
    Following a topic                      | 0.74 | 0.52 | 0.95
    Providing a topic or asking a question | 0.04 | 0.02 | 0.03
    Achievement of goals                   | 0.68 | 0.46 | 0.90
  66. 3. Reacting to user sentiment on a topic. Evaluation: when the user felt sentiment B about a

    topic, could the system disagree?
    Metric                                 | 6.7B | 13B  | 39B
    Natural response                       | 0.61 | 0.40 | 0.87
    Following a topic                      | 0.67 | 0.45 | 0.93
    Providing a topic or asking a question | 0.09 | 0.02 | 0.03
    Achievement of goals                   | 0.46 | 0.36 | 0.50
  67. 4. Free chatting

  68. 4. Free chatting. Evaluation: facilitate a free dialogue with the system.

    Metric                                 | 6.7B | 13B  | 39B
    Natural response                       | 0.65 | 0.40 | 0.92
    Following a topic                      | 0.76 | 0.40 | 0.94
    Providing a topic or asking a question | 0.12 | 0.04 | 0.09
    Achievement of goals                   | -    | -    | -
  69. Summary: subjective evaluation of the 39B JP Model.

    Metric                                 | 1. Basic vocabulary | 2. Topic tracking | 3. Positive sentiment | 3. Negative sentiment | 4. Free chatting
    Natural response                       | 0.978 | 0.908 | 0.908 | 0.872 | 0.925
    Following a topic                      | 0.984 | 0.952 | 0.951 | 0.930 | 0.935
    Providing a topic or asking a question | 0.003 | 0.023 | 0.033 | 0.035 | 0.086
    Achievement of goals                   | 0.835 | 0.907 | 0.899 | 0.505 | -
  70. Summary: subjective evaluation of the 39B JP Model.

    Metric                                 | 1. Basic vocabulary | 2. Topic tracking | 3. Positive sentiment | 3. Negative sentiment | 4. Free chatting
    Natural response                       | 0.978 | 0.908 | 0.908 | 0.872 | 0.925
    Following a topic                      | 0.984 | 0.952 | 0.951 | 0.930 | 0.935
    Providing a topic or asking a question | 0.003 | 0.023 | 0.033 | 0.035 | 0.086
    Achievement of goals                   | 0.835 | 0.907 | 0.899 | 0.505 | -
  71. The goal of the LINE NLP team is to achieve high-quality

    and safe output
  72. Difficulties with the Japanese language. Hard-to-learn writing: Japanese speakers

    use hiragana, katakana, kanji, romaji, etc. to write a single document. Large amount of essential vocabulary: daily conversation requires over 8,000 words, and one needs to know many homonyms, honorifics, and dialects. Omission of words: Japanese speakers may omit the subject or object of a sentence, and the omitted words may not be uniquely inferable.
  73. Conducting joint research using HyperCLOVA. We provide HyperCLOVA's APIs to

    universities, research institutes, and companies, collaborating to dramatically improve system performance and to detect and eliminate bias in language models with - Osaka University Graduate School - Tokyo Metropolitan University - Waseda University. We hope to collaborate with more research institutions and companies in the future.
  74. Difficulties in text generation. Potential risks of generated text: the

    following technologies need to be developed - improving the content bias of a corpus and its notation - ensuring the truthfulness and safety of an output text. Implementation of AI ethics: various ethical considerations need to be taken into account for input and output texts - toxic - sexual - offensive - profane - intimidating - attacking identity. Automation of intrinsic evaluation: metrics are needed that can be applied to dynamic text-generation results - accuracy of topical content - consistency of generated text - determination of achievement of objectives.
  75. Automatic evaluation for the 39B JP model: Eco System / Infra / Model / Data
  76. Automatic evaluation with the 39B JP Model on a QA task.

    TASK: RCQA* possible-only - unanswerable questions were removed from the dataset of the normal RCQA task. For each inference, few-shots were created by randomly extracting contexts from the RCQA possible-only dev set; each shot consists of a context, a question, and an answer. If the correct answer was contained in, and easily extracted from, the inference result, we judged it correct. * 解答可能性付き読解データセット (a reading-comprehension dataset with answerability annotations): http://www.cl.ecei.tohoku.ac.jp/rcqa/
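The procedure on this slide amounts to a simple loop: rebuild the few-shot prompt from randomly drawn dev-set items for every inference, append the test context and question, and count the inference as correct when the gold answer appears in the output. A sketch, where the prompt labels and the generate() call are assumptions (the slide does not show the exact prompt format):

```python
import random

def evaluate_rcqa(dev_set, test_set, generate, n_shots=4):
    """Sketch of the slide's automatic evaluation: fresh few-shots per
    inference, with a simplified containment check standing in for
    'contained and easily extracted'."""
    correct = 0
    for item in test_set:
        shots = random.sample(dev_set, n_shots)  # random dev-set contexts
        prompt = "".join(
            # Prompt labels are assumptions; the slide does not show them.
            f"context: {s['context']}\nquestion: {s['question']}\n"
            f"answer: {s['answer']}\n\n"
            for s in shots
        )
        prompt += f"context: {item['context']}\nquestion: {item['question']}\nanswer:"
        output = generate(prompt)  # hypothetical call to the 39B JP model
        if item["answer"] in output:
            correct += 1
    return correct / len(test_set)  # the "answer match" rate on slide 77
```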
  77. Result of automatic evaluation with the 39B JP Model on the RCQA

    possible-only task:
    model / few-shot     | shots | temperature | top_p | answer match
    6.7B / contextual    | 0     | 0.5         | 0.8   | -
                         | 4     | 0.1         | 0.9   | 66.52
    13B / contextual     | 0     | 0.5         | 0.8   | -
                         | 4     | 0.4         | 0.1   | 70.28
    39B / contextual     | 0     | 0.4         | 0.5   | 80.51
                         | 1     | 0.4         | 0.5   | 89.18
                         | 2     | 0.4         | 0.5   | 89.31
                         | 3     | 0.4         | 0.5   | 89.09
                         | 4     | 0.4         | 0.5   | 89.83
    39B / non-contextual | 0     | 0.4         | 0.5   | 69.50
                         | 1     | 0.4         | 0.5   | 76.97
                         | 2     | 0.4         | 0.5   | 79.08
                         | 3     | 0.4         | 0.5   | 79.38
                         | 4     | 0.4         | 0.5   | 80.51
  78. HyperCLOVA’s LM vs BERT-large. TASK: RCQA possible-only (unanswerable

    questions removed from the normal RCQA task) - BERT may achieve higher results with fine-tuning on specific tasks - HyperCLOVA can reach the same level of performance with prompting and a rough parameter search.
    Model         | test acc | test F1 | memo
    HyperCLOVA    | 85.03    | 89.95   | JP 39B, 2-shots, temperature=0.4, top_p=0.5
    BERT-jp-large | 86.68    | 90.49   | using a subset of the LINE LM corpus
  79. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  80. Application example: HyperCLOVA Friends Talk with any adjustable character using

    HyperCLOVA
  81. Demo Movie 60sec

  82. Application example: HyperCLOVA Friends Talk with any adjustable character using

    HyperCLOVA HyperCLOVA allows for some role-playing
  83. HyperCLOVA allows for generic role-playing, starting with NO character set.

    Conversation is smooth and the meaning of what is said is understood. Challenge: features other than smooth conversation and topic tracking. The truth of what it says should be verified before it responds. Some responses are ambiguous (e.g. the temperature of the hot water during washing). Data bias has an effect (e.g. the persona's gender was unsettled but drifted to female). The consistency of its persona is a bit suspect.
  84. Is HyperCLOVA really necessary for NLP? - YES!! - If you're on a budget … -

    The history of NLP is strongly linked to the development of AI-related technologies, progressing from rule-only systems through traditional-ML-only and small-LM-only approaches and DNNs to large-scale general-purpose LMs - LINE wants to move in the direction of building our own models and having customers use them
  85. Agenda - What’s HyperCLOVA - Inside of HyperCLOVA - Application

    development by Prompting - Evaluation of HyperCLOVA’s JP LMs - Application to Dialogue Systems - The future of LINE and NLP
  86. The future of LINE and NLP

  87. LINE was released to the public 10 years ago. FAQ from

    NLPers: isn’t there a challenge left to tackle?
  88. FAQ from NLPers: isn’t there a challenge left to tackle?

    No, not yet!! It's not over yet!!
  89. Various issues related to HyperCLOVA • Building models and using

    those models to make inferences - the biggest challenge of all • Fine-tuning and other parameter-efficient transfer-learning methods, as well as compact models • Responding to new topics/events that have arisen since a model was built • Implementing AI ethics • Filtering according to the application, and stating the reason • Building a Web corpus • Removing duplicate data • Realizing accountability for each entry used • Responding to deletion requests on a URL/ID basis • Detecting and anonymizing personal information
  90. None
  91. None
  92. None
  93. LINE has more than 50 service brands

  94. LINE’s NLP journey is still in its early stages. Let's

    take on the challenge together at LINE. LINE's various services need essential improvements using NLP technology! • Large-scale general-purpose LMs • “High Road” NLP • Information Retrieval • String processing • Data creation • Evaluation tasks, etc.
  95. None
  96. HyperCLOVA Hands-on

  97. HyperCLOVA Hands-on, to be held during 2021: hands-on with HyperCLOVA

    Studio and its APIs for engineers. Please wait for information from LINE Developers (@LINE_DEV). A Python SDK for using the HyperCLOVA API will be provided.
  98. None
  99. Open & Share

  100. LINE’s LMs for OSS start in FY2021. Of course, for

    models “other than” HyperCLOVA. Performance target: LINE's LMs for OSS > other OSS LMs. We would like to update them a few times a year, if possible!! They are trained using a subset of the corpus for HyperCLOVA (the LINE LM Corpus)!
  101. Summary - Updated the current status of HyperCLOVA at LINE

    - Reported on large-scale general-purpose LMs and prompting, using several topics as examples - There are cases where surprisingly high quality can be achieved - There are issues that cannot be solved ad hoc - At LINE, we can work on all layers of NLP R&D, not only for HyperCLOVA - Please stay tuned for the next NLP news from LINE