Implementing an Advanced Recommendation Function Using the OpenAI Embedding API
By Atsumine Kondo - @sugoikondo

Atsumine Kondo (@sugoikondo)
● Backend / Frontend / Infra
● Scala, Kotlin, Python, etc.
● Vue.js, Nuxt.js, Next.js, etc.
Group Management Solution dept., Product development div.

What you can learn from this presentation

● You can learn what a text vector is
● Once you understand that, you can build solutions such as the following
  ● Recommendation, outlier detection, classification problems, search, etc.

Introduction

We've implemented an account recommendation function in our application
● We released an AI-based consolidated account recommendation function in August 2023
● It uses OpenAI's Embedding API
※ Patent applied for
ref: 20230804-mf-press-1/

● For each individual company's account, the function suggests the top three parent-company consolidated accounts that are semantically close to it.
ref: 20230804-mf-press-1/

We have had a lot of media coverage
● Because the topic drew a lot of attention, the feature was covered by many media outlets.
● news/1522209.html
● In total, about 8 articles...

What is an account / consolidated account?
● Accounts are the names or headings used when recording transactions of assets and the like.
e.g. You paid 50,000 yen in rent via direct debit.
  Debit:  Rent expenses     50,000
  Credit: Ordinary deposit  50,000
● Here, "Rent expenses" and "Ordinary deposit" are each an account.

● In consolidated accounting, there is a step that collects the accounts and balances of the companies in the group and consolidates them into a single account (the consolidated account) of the parent company.
[Diagram: the accounts "Ordinary Deposit" and "Bank A" of subsidiaries Company A and Company B are consolidated into the parent company's "Cash & Deposit"]

Account names vary considerably from company to company.

How to achieve better account recommendations?
● Many patterns cannot be handled by simple fuzzy search, edit distance, and the like.
  ● e.g. "XX Bank" vs. "Ordinary deposit" vs. "Cash and deposits"
  ● When there are overseas subsidiaries, names in languages other than Japanese, such as "XX Bank", may also come in.
● We need to recommend the consolidated account closest to each individual company's account, taking closeness in meaning into account as well.

And there are still many barriers to solving this...

Barriers to achieving account recommendation
● Not long after release, there was not yet enough data available for training.
  ● There were at most a few hundred to a thousand account conversion records.
  ● Training a model would call for at least tens to hundreds of thousands of records.
● Securing the maintenance budget and the people for an ML model is difficult.
  ● From having people who can respond when problems occur to re-training the model, the costs cannot be ignored.

So I was researching and came across…

OpenAI Embedding API

But before we talk about the Embedding API, let's first learn about text vectors / distributed representations.

About Embedding / Word2Vec

● A technique for converting text into numbers / vectors.
  'Cash and deposits' → [[-0.03455162],[-0.01306203], [ 0.01672893],…, [-0.00129271], [ 0.00694819],[-0.01055199]]
● The result is also called a vector, a distributed representation, or an embedding.

e.g. "Cash" and "Liabilities" become [0.4, 0.8] and [-0.3, 0.9], respectively.
Because each is an array of floating-point numbers, it can represent coordinates or a vector.
[Figure: 2-D plot with "Cash" and "Liabilities" drawn as points]

What do we gain by quantifying / vectorizing text?

What you can do once text is vectorized
1. Calculate the similarity between vectors
2. Perform numerical operations such as addition and subtraction on vectors

What you can do once text is vectorized
1. Calculate the similarity between vectors
2. Perform numerical operations such as addition and subtraction on vectors
← This time we mainly talk about this one.

1. Can calculate similarity between vectors

● By finding the angle between two vectors, we can compute how similar their directions are.
● Cosine similarity is commonly used.
  ● Positive values mean positive correlation; negative values mean negative correlation.
  cos('Cash', 'Bank A') = 0.85
  cos('Cash', 'Rent expenses') = 0.05
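As a minimal sketch of the cosine similarity described on this slide: the function below uses only the standard library, and the 2-D vectors are made-up illustrative values, not real embeddings of these account names.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values for illustration only).
cash = [0.6, 0.8]
bank_a = [0.8, 0.6]
rent_expenses = [-0.8, 0.6]

print(cosine_similarity(cash, bank_a))         # high: similar direction
print(cosine_similarity(cash, rent_expenses))  # about zero: orthogonal
```

The score depends only on direction, not on vector length, which is why it suits comparing embeddings of texts of different sizes.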

Since the similarity between texts can be computed, the following solutions become possible:
• Recommendation (items with high similarity)
• Outlier detection (items with low similarity)
• Classification (grouping items whose similarity is close)
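The three uses listed above can be sketched with one similarity function. This is an illustrative sketch only: the account vectors are toy values, and the function names and the `threshold` of 0.3 are assumptions, not something taken from the talk.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recommend(query_vec, candidates, top_k=3):
    """Recommendation: the top_k (name, score) pairs most similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def is_outlier(query_vec, candidates, threshold=0.3):
    """Outlier detection: even the best match is below the (assumed) threshold."""
    return max(cosine(query_vec, vec) for vec in candidates.values()) < threshold


def classify(query_vec, candidates):
    """Classification: assign the label of the single nearest candidate."""
    return recommend(query_vec, candidates, top_k=1)[0][0]


# Toy 2-D vectors standing in for embeddings of consolidated accounts.
consolidated_accounts = {
    "Cash & Deposit": [0.9, 0.1],
    "Rent expenses": [0.1, 0.9],
    "Liabilities": [-0.7, 0.2],
}
ordinary_deposit = [0.8, 0.2]

print(recommend(ordinary_deposit, consolidated_accounts, top_k=2))
print(classify(ordinary_deposit, consolidated_accounts))
print(is_outlier([0.0, -1.0], consolidated_accounts))
```

In practice the threshold would have to be tuned on real data, since typical similarity ranges differ between embedding models.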

Recommendations that are almost impossible with edit distance or LIKE search…

But vector comparison can do it.

Let's also look at the Google search menu as an example.
Ref:

In fact, the order of the menu items changes with each search term.
Ref:

So, let's quickly try making a recommendation.

We can retrieve reasonably close results.

2. Can perform numerical operations such as addition and subtraction on vectors

● A vector is simply an array of numbers, so vectors can be added or subtracted (combined) as long as their numbers of dimensions match.
● Combining vectors produces a single vector that retains the meaning of all of them.
  e.g. IT + Orange + Fintech → MoneyForward

So, with IT + Orange + Fintech, something close to this can be achieved.
Ref:

Of course, we can also subtract: MoneyForward - Fintech - Orange.
Something close to this can be achieved as well (in the previous example, "IT" would be the hit).
Ref:

Using vector composition, we can do the following:
1. Compute the vectors of target_words and add them together
2. Then compare the resulting vector with the vector of each of the candidate_words
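The two steps above can be sketched as follows. The names `target_words` and `candidate_words` come from the slide; everything else is an assumption for illustration, including the `embed` parameter (any function mapping a word to a vector) and the toy lookup table that stands in for a real Embedding API call.

```python
import math


def add_vectors(vectors):
    """Element-wise sum of vectors that all have the same number of dimensions."""
    return [sum(components) for components in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_candidates(target_words, candidate_words, embed):
    """Step 1: sum the target vectors. Step 2: score every candidate against the sum."""
    combined = add_vectors([embed(word) for word in target_words])
    scored = [(word, cosine(combined, embed(word))) for word in candidate_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-D "embeddings" (made-up values, not output of any real model).
TOY_VECTORS = {
    "IT": [1.0, 0.0, 0.0],
    "Orange": [0.0, 1.0, 0.0],
    "Fintech": [0.0, 0.0, 1.0],
    "MoneyForward": [0.6, 0.5, 0.6],
    "Rent expenses": [0.9, -0.3, -0.2],
}

ranking = rank_candidates(
    target_words=["IT", "Orange", "Fintech"],
    candidate_words=["MoneyForward", "Rent expenses"],
    embed=TOY_VECTORS.__getitem__,
)
print(ranking)
```

Subtraction works the same way: negate a vector's components before summing.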

The result looks like this.

Incidentally, changing 'Macbook' to 'Mac' gives this result.

The same technique can be used to build a character association game.

And of course, it guesses correctly.

About OpenAI's Embedding API

● An API for obtaining the vector / distributed representation of text (described above)
● ref:
● Prices were revised in June 2023: the ada model became $0.0001 / 1K tokens, 75% off the previous price.

● In January 2024, an even cheaper model, text-embedding-3-small, was released at a further 80% off.
● text-embedding-3-large, a model more accurate than the previous one (ada), was also released.
● Reducing the number of dimensions is now officially supported, which should speed up vector calculations and reduce storage requirements.
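A hedged sketch of what using the dimension reduction could look like. `embed_texts` assumes the `openai` Python package (v1 client) is installed and `OPENAI_API_KEY` is set, and it passes the `dimensions` parameter that the text-embedding-3 models accept. `shorten_embedding` is a hypothetical offline helper showing the general idea of truncating a stored vector and re-normalizing it; it is not claimed to reproduce the API's output exactly.

```python
import math


def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def embed_texts(texts, model="text-embedding-3-small", dimensions=None):
    """Request embeddings from the API, optionally with fewer dimensions."""
    from openai import OpenAI  # lazy import so the helper above works offline

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    kwargs = {"model": model, "input": list(texts)}
    if dimensions is not None:
        kwargs["dimensions"] = dimensions
    response = client.embeddings.create(**kwargs)
    return [item.embedding for item in response.data]


# Offline illustration with a made-up 3-D vector.
print(shorten_embedding([3.0, 4.0, 12.0], 2))
```

Smaller vectors make every similarity computation and every stored row cheaper, at some cost in accuracy that is worth measuring on your own data.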

● A Batch API is also available at an even lower cost.
  ● Responses are slower (within 24 hours), but it costs half the price.
● It suits workloads with low real-time requirements, initial data construction, and the like.
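As a rough sketch of preparing such a batch: the Batch API takes a JSONL file with one request per line. The field names below (`custom_id`, `method`, `url`, `body`) reflect my understanding of that input format and should be checked against the current documentation; `build_batch_lines` and the `account-N` id scheme are hypothetical.

```python
import json


def build_batch_lines(texts, model="text-embedding-3-small"):
    """Return one JSON string per text, each describing an embeddings request."""
    lines = []
    for index, text in enumerate(texts):
        request = {
            "custom_id": f"account-{index}",  # hypothetical id used to match results later
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


batch_lines = build_batch_lines(["Ordinary deposit", "Bank A"])
print("\n".join(batch_lines))  # this text would be saved as a .jsonl file and uploaded
```

This fits the initial data construction case mentioned above, where nobody is waiting on the response.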

● Approximate token counts can be checked at the link below.
●
● You need to send roughly 1 million to 50 million characters to spend $1, so cost is not much of a concern.
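The arithmetic behind this estimate, as a small sketch. The $0.0001 / 1K tokens price and the further 80% discount are the figures quoted on the earlier slides and may be outdated. How many characters fit in one token depends on the language and the tokenizer, so the code works in tokens only.

```python
ADA_USD_PER_1K_TOKENS = 0.0001                          # price quoted in the slides
SMALL_USD_PER_1K_TOKENS = ADA_USD_PER_1K_TOKENS * 0.2   # "a further 80% off"


def estimate_cost_usd(total_tokens, usd_per_1k_tokens):
    """Cost in USD of embedding `total_tokens` tokens at the given price."""
    return total_tokens / 1000 * usd_per_1k_tokens


def tokens_per_dollar(usd_per_1k_tokens):
    """How many tokens can be embedded for one US dollar."""
    return 1000 / usd_per_1k_tokens


print(round(tokens_per_dollar(ADA_USD_PER_1K_TOKENS)))
print(round(tokens_per_dollar(SMALL_USD_PER_1K_TOKENS)))
print(estimate_cost_usd(2_000_000, ADA_USD_PER_1K_TOKENS))
```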

Account recommendation accuracy also improved significantly compared with the model evaluated beforehand.
[Figure: ranking accuracy rates (overall and without English) for the pre-reviewed model vs. OpenAI text-embedding-3-large]

● The response for the same text is always the same, so building a vector cache reduces the number of requests.
● OpenAI's responses can take several seconds (depending on the request), so a vector cache is also worthwhile for improving response time.
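A minimal sketch of such a vector cache, assuming an in-memory dict as the store. The class name, the key scheme, and the injected `embed` function are all hypothetical; a production version would swap the dict for an external store.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function so each distinct text is embedded only once."""

    def __init__(self, embed, model="text-embedding-3-small", store=None):
        self._embed = embed          # function: text -> list of floats
        self._model = model          # part of the key: vectors differ per model
        self._store = {} if store is None else store

    def _key(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"embedding:{self._model}:{digest}"

    def get(self, text):
        key = self._key(text)
        if key not in self._store:   # cache miss: call the (slow, billed) API once
            self._store[key] = self._embed(text)
        return self._store[key]


# Stand-in embedding function that records how often it is called.
api_calls = []


def fake_embed(text):
    api_calls.append(text)
    return [float(len(text))]


embedder = CachedEmbedder(fake_embed)
embedder.get("Cash")
embedder.get("Cash")      # served from the cache
embedder.get("Bank A")
print(len(api_calls))     # number of real embedding calls made
```

Hashing the text keeps keys short and uniform even for long inputs, and including the model name avoids mixing vectors from different models.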

Notes on using the Embedding API
● The API has ordinary rate limits.
  ● They currently vary by Tier.
● It depends on request volume, but with a cache in place you rarely hit them.

1. There are two kinds of rate limits
● Apart from that, there also appears to be some kind of rate limit applied at the system-wide level.
● At its worst, I hit this limit about once every few requests.

How to deal with rate limits
Solution: introduce a retry mechanism
● OpenAI also recommends exponential backoff and publishes sample code for several Python libraries.
● Ref: rate-limits/retrying-with-exponential-backoff
It may seem a questionable way to deal with a rate limit, but since introducing it the number of occurrences has dropped to almost zero.
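A minimal, library-free sketch of a retry with exponential backoff and jitter. The function name, the default delays, and the injectable `sleep` are assumptions for illustration; this is not the sample code OpenAI publishes. With the `openai` package, `retry_on` would typically be set to its rate-limit exception class.

```python
import random
import time


def call_with_backoff(func, *, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait 1x, 2x, 4x... base_delay (plus jitter)."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads out clients that were throttled at the same time.
            sleep(delay * (1 + random.random()))
```

Usage would look like `call_with_backoff(lambda: embed("Cash"))`, where `embed` is whatever function performs the API request.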

What you can achieve with text vectors

Example 1: Recommendation between accounts
• Redis is used as the vector cache store.
• Thanks to this mechanism, tens of millions of accounts have been vectorized in total so far at almost no cost.
• Because the Embedding API is available, expensive machines packed with GPUs or large amounts of CPU and memory are unnecessary.

Example 2: Free-word search for receipts
• Vectorize the text content of each receipt in advance
  • OCR, sending it to ChatGPT, etc.
• Store the result in a vector DB or similar
[Diagram: User → Upload (Receipt.pdf, jpg, etc.) → Vectorization → [[-0.03455162],[-0.01306203],…, [ 0.00694819],[-0.01055199]] → Vector DB]

Example 2: Free-word search for receipts
• Vectorize the search words entered by the user and pick the closest entries from the values in the DB.
[Diagram: User query "A receipt for 10,000 yen on December 1" → Vectorization → [[-0.03455162],[-0.01306203],…, [ 0.00694819],[-0.01055199]] → Search in Vector DB → Receipt_Dec_1.pdf]
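The search flow above can be sketched with a tiny in-memory index. Everything here is illustrative: `InMemoryVectorIndex` stands in for a real vector DB, `toy_embed` is a keyword-count stand-in for the Embedding API, and the file names and receipt texts are made up.

```python
import math


class InMemoryVectorIndex:
    """Stores (doc_id, unit vector) pairs and returns the nearest ones by cosine."""

    def __init__(self):
        self._items = []

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return list(vector) if norm == 0.0 else [x / norm for x in vector]

    def add(self, doc_id, vector):
        self._items.append((doc_id, self._normalize(vector)))

    def search(self, query_vector, top_k=3):
        query = self._normalize(query_vector)
        # For unit vectors the dot product equals the cosine similarity.
        scored = [(doc_id, sum(q * v for q, v in zip(query, vec)))
                  for doc_id, vec in self._items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


VOCAB = ["december", "yen", "10000", "taxi", "lunch"]


def toy_embed(text):
    """Toy embedding: count vocabulary words (a real system would call the API)."""
    words = text.lower().replace(",", "").split()
    return [float(words.count(term)) for term in VOCAB]


index = InMemoryVectorIndex()
index.add("Receipt_Dec_1.pdf", toy_embed("Receipt December 1 10,000 yen hotel"))
index.add("Receipt_Nov_5.pdf", toy_embed("Receipt November 5 1,200 yen taxi"))
index.add("Receipt_Dec_12.pdf", toy_embed("Receipt December 12 980 yen lunch"))

results = index.search(toy_embed("10,000 yen receipt on December 1"), top_k=2)
print(results)
```

Searching by an existing receipt instead of typed words only changes where the query text comes from: extract the receipt's text, embed it, and call `search` in the same way.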

Example 3: Find receipts similar to a given receipt
• If Text to File can be implemented, File to File can of course be implemented as well.
[Diagram: "I need a receipt similar to this one." → Text extraction ("Receipt, Dec. 1, …") → Vectorization → [[-0.03455162],[-0.01306203],…, [ 0.00694819],[-0.01055199]] → Search in Vector DB]

In short…

What becomes possible once you can generate text vectors
Anything that can be converted to text can be used to implement recommendation and similar features.
Moreover, thanks to ChatGPT, the barrier to converting images and other data into text has become lower.

Summary

● By using OpenAI's Embedding API, even a team without ML engineers was able to implement an AI solution easily and inexpensively.
● The Embedding API can be used to implement a variety of solutions such as recommendation, outlier detection, and text classification.

