Slide 28
Slide 28 text
• Check how the data you type into a tool is handled (read the privacy policy)
- A manuscript you are still polishing is typically unpublished information, so handle it with care
- If input data is reused, then beyond the risk that the system's developers can read your input, the possibility that the model later happens to output the information you entered cannot be ruled out (discussed later)
• The case of DeepL and Langsmith
- Free plan (the UTokyo plan): input data may be used to improve the service.
- Paid plan: input data is not retained on the provider's servers, i.e., it is not used for model development.
• The case of OpenAI (ChatGPT)
- "Communication Information: If you communicate with us, we may collect your name, contact information, and the contents of any messages you send…"
The risk of entering information from an unpublished paper into a tool
https://openai.com/privacy/
[Screenshot excerpt from Ouyang et al. (2022): fragments of the API use-case distribution table (Other 3.5%, Closed QA 2.6%, Extract 1.9%) and of an illustrative summarization prompt ending "…{summary}… This is the outline of the commercial for that play:"]
3 Methods and experimental details
3.1 High-level methodology
Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied
it in the stylistic continuation and summarization domains. We start with a pretrained language
model (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al.,
2022), a distribution of prompts on which we want our model to produce aligned outputs, and a team
of trained human labelers (see Sections 3.4 for details). We then apply the following three steps
(Figure 2).
Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.
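To make Step 1 concrete, here is a minimal sketch of supervised fine-tuning on demonstration data. It is an illustration, not the authors' code: gpt2 stands in for GPT-3, and the two-example in-memory dataset and hyperparameters are made up.

```python
# Minimal sketch of Step 1 (supervised fine-tuning on demonstrations).
# Assumptions: gpt2 stands in for GPT-3; the tiny in-memory dataset is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (prompt, human demonstration) pairs -- placeholders, not the real dataset.
demonstrations = [
    ("Summarize: The play opens with ...", "A short summary of the play ..."),
    ("Translate to French: Good morning.", "Bonjour."),
]

model.train()
for prompt, demo in demonstrations:
    # Concatenate prompt and demonstration; the LM loss here covers the whole sequence
    # (the paper does not spell out masking details, so this is a simplification).
    batch = tokenizer(prompt + "\n" + demo, return_tensors="pt", truncation=True)
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```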
Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons
between model outputs, where labelers indicate which output they prefer for a given input. We then
train a reward model to predict the human-preferred output.
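Step 2 can be sketched with the standard pairwise ranking loss, -log σ(r(x, y_preferred) − r(x, y_rejected)), taken over labeler comparisons. In the toy sketch below the reward model scores fixed-size feature vectors instead of real (prompt, response) text, a simplification for brevity.

```python
# Sketch of the Step 2 pairwise comparison loss: -log sigmoid(r(x, y_preferred) - r(x, y_rejected)).
# The reward model here is a toy MLP over fixed-size feature vectors standing in for a transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)  # scalar reward per (prompt, response)

rm = RewardModel()
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Placeholder features for (prompt, preferred response) and (prompt, rejected response) pairs.
preferred = torch.randn(32, 128)
rejected = torch.randn(32, 128)

loss = -F.logsigmoid(rm(preferred) - rm(rejected)).mean()
loss.backward()
optimizer.step()
```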
Step 3: Optimize a policy against the reward model using PPO. We use the output of the
RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO
algorithm (Schulman et al., 2017).
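A faithful PPO implementation is too long to show here, so the sketch below substitutes a single vanilla policy-gradient update with a KL penalty toward the frozen supervised policy, which captures the "optimize a policy against the reward model" idea in simplified form. gpt2, the prompt, the beta value, and the constant reward are placeholders, not the paper's setup.

```python
# Simplified sketch of Step 3: push the policy toward higher reward-model scores.
# A single plain policy-gradient update with a KL penalty toward the frozen SFT model is used
# as a stand-in for full PPO (no ratio clipping, no value function); gpt2 stands in for GPT-3,
# and the scalar reward is a placeholder for the Step 2 reward model's score.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")   # would be the SFT model from Step 1
sft_ref = copy.deepcopy(policy).eval()                  # frozen reference for the KL penalty
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.02                                             # KL coefficient (illustrative value)

prompt = tokenizer("Explain the moon landing to a 6 year old.", return_tensors="pt")
with torch.no_grad():
    response = policy.generate(**prompt, max_new_tokens=30, do_sample=True,
                               pad_token_id=tokenizer.eos_token_id)

def response_logprob(model, ids):
    # Sum of log-probabilities assigned to the sampled response tokens (after the prompt).
    logits = model(ids).logits[:, :-1].log_softmax(-1)
    token_logps = logits.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_logps[:, prompt["input_ids"].shape[1] - 1:].sum()

reward = torch.tensor(1.0)                  # placeholder for the reward model's scalar score
logp_policy = response_logprob(policy, response)
with torch.no_grad():
    logp_ref = response_logprob(sft_ref, response)

# KL-penalized reward optimized with a vanilla policy-gradient step (PPO would clip the ratio).
loss = -(reward - beta * (logp_policy.detach() - logp_ref)) * logp_policy
loss.backward()
optimizer.step()
```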
Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best
policy, which is used to train a new RM and then a new policy. In practice, most of our comparison
data comes from our supervised policies, with some coming from our PPO policies.
3.2 Dataset
Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically
those using an earlier version of the InstructGPT models (trained via supervised learning on a subset
of our demonstration data) on the Playground interface.4 Customers using the Playground were
informed that their data could be used to train further models via a recurring notification any time
InstructGPT models were used. In this paper we do not use data from customers using the API in
production. We heuristically deduplicate prompts by checking for prompts that share a long common
prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation,
and test splits based on user ID, so that the validation and test sets contain no data from users whose
data is in the training set. To avoid the models learning potentially sensitive customer details, we […]
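The prompt hygiene described in Section 3.2 (deduplication by long common prefix, a cap of 200 prompts per user ID, and train/validation/test splits that are disjoint by user) is easy to sketch. The helper below is an illustrative reconstruction, not OpenAI's pipeline; the prefix length and split fractions are assumptions the paper does not specify.

```python
# Illustrative sketch of the Section 3.2 dataset hygiene: heuristic prefix deduplication,
# a cap of 200 prompts per user ID, and splits that are disjoint by user.
# The prefix length and split fractions are assumptions, not values given in the paper.
import hashlib
import random
from collections import defaultdict

PREFIX_LEN = 200        # "long common prefix" threshold (assumed)
MAX_PER_USER = 200      # per-user prompt cap stated in the paper

def prepare_splits(records, seed=0):
    """records: iterable of (user_id, prompt) pairs."""
    seen_prefixes = set()
    per_user = defaultdict(list)
    for user_id, prompt in records:
        # Hashing the fixed-length prefix approximates "prompts that share a long common prefix".
        key = hashlib.sha256(prompt[:PREFIX_LEN].encode()).hexdigest()
        if key in seen_prefixes:
            continue
        seen_prefixes.add(key)
        if len(per_user[user_id]) < MAX_PER_USER:
            per_user[user_id].append(prompt)

    users = sorted(per_user)
    random.Random(seed).shuffle(users)
    n = len(users)
    train_users = set(users[: int(0.8 * n)])
    valid_users = set(users[int(0.8 * n): int(0.9 * n)])
    splits = {"train": [], "valid": [], "test": []}
    for user_id, prompts in per_user.items():
        name = "train" if user_id in train_users else "valid" if user_id in valid_users else "test"
        splits[name].extend(prompts)       # every prompt from a given user lands in exactly one split
    return splits
```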
For GPT-3.5 (the predecessor of ChatGPT), a dataset was built from examples of user input, and it was used to train the model and even shown as examples in the paper.
"Please polish the English wording of the following paper abstract"
[Ouyang+, 2022]
🔍