- I hope this sparks many related discussions in the Korean community.
- My skills feel too modest to present to an audience, but I would like to share the short insights I have collected.
- Even I have no clear alternative on this topic, but I have organized the facts worth noting from academic and community research so far.
- Let's step away for a moment from the hectic recent trends and start by briefly retracing things from the beginning.
About Presentation
- Before high-performance LLMs: using AI as RecSys (Shopping), Classification (Finance), OCR, …
  - AI: a small component of the system, in a secondary role
- After high-performance LLMs (2023~):
  - AI: the service domain itself, rich platform features, enhanced RecSys and Classification
  - AI: a core component of the system
- Three perspectives on AI for business success:
  - Performance and efficiency of the model
    - High barrier: resources, researchers, engineers, cost, …
  - Serving the model
    - High barrier: resources, researchers, engineers, cost, …
  - Utilizing the model: the target of this presentation
    - Low barrier: just junior engineers and API cost
LLMs guarantee the lower bound of service performance.
Background History
LLM as Black-box system:
- A core component of the chat interface, query classification, …
- Focus on broad capabilities like fluency and domain-knowledge understanding
- Service quality depends on the performance of the LLM
LLM as Cooperative system component:
- A core component of the chat interface, query classification, …
- Focus on specific capabilities like decision making and code generation
- Service quality depends on the system architecture & the performance of the LLM
Current practice of LLM usage: Mixture of both.
- LLM as Completion method
  - AI community: try to make a better LLM.
  - 👍 Decreasing cost of LLM APIs: OpenAI, Claude, Together.ai, …
  - 👍 Better LLM APIs: OpenAI, Claude, Together.ai, …
  - 👍 Better public LLMs: Korean continual-learning Llama2, Yi, Qwen, …
- LLM as Cooperative system component
  - AI community: try to make a better system architecture for LLM usage.
  - 👍 Modularized AI constructs for easy creation: Langchain, LlamaIndex, …
  - 👍 Well-crafted research and projects: TaskWeaver, AutoGPT, …
  - ❓ Building a productizable AI architecture for the business field: the enterprise's role
    - Easy to serve and maintain, low cost/latency, multi-modal I/O
    - Langchain and LlamaIndex are libraries; they don't provide an architecture
    - Maybe agent frameworks: TaskWeaver, AutoGPT? ➡ Hard to use in various scenarios.
    - Hidden practice
- … includes many of these.
- I want to share several insights while recognizing that the goals companies pursue differ in nature from research (especially the hard work of open-source contributors).
- Nevertheless, I have organized various thoughts in the hope that this becomes a starting point for diverse discussions on AI system architecture in the Korean field.
About Presentation
- Plan: a sequence of actions
  - e.g., 1. Eat, 2. Run, 3. Sleep
- Reasoning: to infer about a problem
  - e.g., A is B, B is C. So, A is C.
- LLM performance: fluency, domain-knowledge understanding, mathematical and logical thinking, …, every ability of the LLM.
- Plan: sequence of actions
- Decomposition: action candidates for each step of the sequence
- ADaPT: planning and decomposition with iterative error handling
  - e.g., in the left figure: Step 1b ➡ Step 2 ➡ Step 3
  - Maybe the most natural way to integrate a planner architecture
  - Tries the iterative fail/success scenario (see the sketch below)
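To make the ADaPT pattern concrete, here is a minimal Python sketch of the recursion: try the task, and on failure let a planner decompose it and recurse. `execute` and `decompose` are hypothetical stand-ins for the executor and planner LLM calls, stubbed so the snippet runs.

```python
# Minimal sketch of ADaPT-style "try, decompose on failure, retry" recursion.
# `execute` / `decompose` are hypothetical stubs for the two LLM calls.

def execute(task: str) -> bool:
    """Executor LLM tries the task directly; returns success/failure."""
    return len(task.split("/")) >= 3  # stub: succeeds once decomposed enough

def decompose(task: str) -> list[str]:
    """Planner LLM splits a failed task into smaller sub-tasks."""
    return [f"{task} / step 1", f"{task} / step 2"]  # stub

def adapt(task: str, depth: int = 0, max_depth: int = 3) -> bool:
    if execute(task):          # 1) try the task as-is
        return True
    if depth >= max_depth:     # 2) stop below a depth budget
        return False
    # 3) on failure, decompose and handle each sub-task iteratively
    return all(adapt(sub, depth + 1, max_depth) for sub in decompose(task))

print(adapt("book a flight to Seoul"))  # True once sub-tasks become tractable
```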
Voyager (…): an LLM-powered embodied lifelong learning agent in Minecraft
- Continuously explores the world
- Acquires diverse skills
- Makes novel discoveries without human intervention
- An ever-growing skill library of executable code for storing and retrieving complex behaviors (sketched below)
- A new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement
- The environment (mineflayer-collectblock) is key!
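As a toy illustration of the ever-growing skill library: executable code is stored under a natural-language description and retrieved later. Voyager embeds the descriptions and does vector search; plain word overlap stands in for that here, and all names are illustrative.

```python
# Toy sketch of Voyager's skill library: store executable skills with a
# description, retrieve them later. Retrieval is naive keyword overlap
# (Voyager itself uses embedding similarity).

class SkillLibrary:
    def __init__(self):
        self.skills: dict[str, str] = {}  # description -> skill source code

    def add(self, description: str, source: str) -> None:
        self.skills[description] = source  # grows over the agent's lifetime

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Rank stored skills by word overlap with the query.
        q = set(query.lower().split())
        ranked = sorted(
            self.skills.items(),
            key=lambda kv: len(q & set(kv[0].lower().split())),
            reverse=True,
        )
        return [src for _, src in ranked[:k]]

lib = SkillLibrary()
lib.add("mine iron ore with stone pickaxe", "async function mineIron(bot) {...}")
print(lib.retrieve("how to mine iron"))
```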
… Reasoner (http://arxiv.org/abs/2305.14992)
Shortcomings of the plain LLM:
- Hard to predict the world state
- Hard to simulate long-term outcomes of actions
- Hard to iteratively refine existing reasoning steps
Use Monte Carlo Tree Search!
- Open-loop planning as in RL (discrete case)
- LLM as action generator, state generator, and reward function (see the sketch below)
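A compact sketch of that idea, with the three LLM roles (action generator, world model, reward function) replaced by stubs so the search itself stays visible. This is a simplified MCTS, not the paper's exact algorithm.

```python
import math, random

# LLM stand-ins: propose candidate actions, predict the next state, score it.
def propose_actions(state): return [state + "a", state + "b"]  # action generator
def next_state(state, action): return action                   # world model
def reward(state): return random.random()                      # reward function

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, iters=100):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend via UCT while every child has been visited.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # Expansion: the LLM proposes actions / successor states.
        if not node.children:
            node.children = [Node(next_state(node.state, a), node)
                             for a in propose_actions(node.state)]
        leaf = next((ch for ch in node.children if not ch.visits),
                    node.children[0])
        # Evaluation: the LLM scores the predicted state; then backpropagate.
        r = reward(leaf.state)
        while leaf:
            leaf.visits += 1; leaf.value += r; leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).state

print(mcts("plan:"))
```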
- RestGPT
- Handling the response as a system component
  - Expect a response type, like type checking in a programming language
    - e.g., the system expects a valid JSON response from the LLM, like {"thought": "something..", "action": "something…"}
  - Believe in the LLM's capability and performance
    - e.g., the LLM can produce valid JSON under regex constraints (a sketch follows below).
  - Utilize APIs or Python functions as tools
    - e.g., the LLM can call the search function.
- Papers and open-source projects provide lots of ideas!
  - Thanks to the researchers and contributors.
- (Optional) The code generation task is a core component
Common Points of Previous Works
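A minimal sketch of treating the response as a typed component: expect a {"thought", "action"} JSON object, validate it, and retry on failure. `call_llm` is a hypothetical placeholder for a real API call, stubbed so the snippet runs.

```python
import json, re

EXPECTED_KEYS = {"thought", "action"}

def call_llm(prompt: str) -> str:
    return '{"thought": "search first", "action": "web_search"}'  # stub

def parse_response(raw: str) -> dict:
    # Tolerate surrounding prose by extracting the first {...} block.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found")
    obj = json.loads(match.group())
    if not EXPECTED_KEYS <= obj.keys():
        raise ValueError(f"missing keys: {EXPECTED_KEYS - obj.keys()}")
    return obj

def ask(prompt: str, retries: int = 3) -> dict:
    for _ in range(retries):
        try:
            return parse_response(call_llm(prompt))
        except (ValueError, json.JSONDecodeError):
            prompt += "\nRespond with valid JSON only."  # nudge and retry
    raise RuntimeError("LLM never produced a valid response")

print(ask("What should we do next?"))
```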
An engineer is sitting in front of his notebook. His goal is making an impressive chat system. But he found the limitations of the previous works.
- Hard to generalize
  - e.g., in a common conversation service, MCTS isn't easy to adapt.
  - e.g., much research targets benchmarks to prove the ability of the proposed method.
- Evaluation metrics (accuracy) ≠ service metrics (MOU, retention)
  - It's still necessary to evaluate using benchmarks.
  - But system metrics and real effectiveness are different.
- Too complicated to maintain
  - Hard coupling with the old modules…
- Hard to find a best-practice solution for utilizing the LLM
- Objective definition: set the target use case and overcome or solve it
- Use LLMs as components
  - Systematic features should be considered, e.g., JSON mode, latency, cost, … (a sketch follows below)
Systematic Solution
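One way to make those systematic features first-class is to measure them at the component boundary. A sketch, with illustrative names and a stubbed provider call:

```python
import time

# Wrap one LLM call as a system component so latency and cost are
# observable wherever the component is used. Names are illustrative.

class LLMComponent:
    def __init__(self, name: str, price_per_1k_tokens: float):
        self.name, self.price = name, price_per_1k_tokens

    def __call__(self, prompt: str) -> dict:
        start = time.perf_counter()
        text, tokens = self._call_api(prompt)          # your provider call
        return {
            "text": text,
            "latency_s": time.perf_counter() - start,  # systematic feature: latency
            "cost_usd": tokens / 1000 * self.price,    # systematic feature: cost
        }

    def _call_api(self, prompt: str) -> tuple[str, int]:
        return "stubbed answer", 42                    # replace with a real API

print(LLMComponent("classifier", 0.01)("Is this spam?"))
```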
…limitations - Good. AutoGPT can run for a few weeks to solve a problem. This was an amazing experience at the time (2023.03).
- ✅ Objective definition: set the target use case and overcome or solve it
  - Good.
- ✅ Use LLMs as components
  - Yes. Typical practice of using LLMs as components in the system.
Forge of AutoGPT supports this philosophy.
- ❌ Handle the performance of AutoGPT
  - Hard to control the quality of the answer
  - Iterative reasoning can lead to a wrong answer.
- ❌ Good to use for general purposes like ChatGPT
  - Can't handle errors in commands.
  - Too expensive.
  - Too slow.
  - Hard to change the behavior of AutoGPT.
1. Execute the first task: the result can be just text or the execution of tools
2. Store the solution
3. Create new tasks and reprioritize the task list
4. Go to 1.
BabyAGI (https://github.com/yoheinakajima/babyagi?tab=readme-ov-file), BabyAGI review
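The loop above fits in a few lines. A minimal sketch, with the LLM-backed steps (`execute_task`, `create_tasks`, `reprioritize`) stubbed out so it runs:

```python
from collections import deque

# Stand-ins for BabyAGI's three LLM calls.
def execute_task(objective, task): return f"result of {task}"   # step 1
def create_tasks(objective, result, done): return []            # step 3 (LLM)
def reprioritize(objective, tasks): return deque(tasks)         # step 3 (LLM)

def babyagi(objective: str, first_task: str, max_steps: int = 5):
    tasks, memory = deque([first_task]), []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        result = execute_task(objective, task)      # 1. execute (text or tools)
        memory.append((task, result))               # 2. store the solution
        tasks.extend(create_tasks(objective, result, memory))  # 3. new tasks
        tasks = reprioritize(objective, tasks)      # 3. reprioritize
    return memory                                   # 4. loop back to 1.

print(babyagi("write a report", "outline the report"))
```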
1. The Planner decides the reply target.
2. If the reply target is "User", just reply to the user.
3. If the reply target is "CodeInterpreter", the CodeInterpreter will try to solve the problem with tools.
4. Repeat.
TaskWeaver (https://github.com/microsoft/TaskWeaver), TaskWeaver Paper (http://arxiv.org/abs/2311.17541)
- A "Role" is the problem solver.
  - User
  - CodeInterpreter
- The two role objects interact with each other to solve the problem.
- Two patterns of tool usage:
  - The CodeInterpreter calls the tools
  - Function calling calls the tools
- Simple ReAct
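The round-trip above as a rough sketch. This is not TaskWeaver's real API, just the shape of the planner/CodeInterpreter loop, with the planner decision stubbed:

```python
# Planner routes messages between the two roles until the target is the user.

def planner(history: list[str]) -> tuple[str, str]:
    # LLM decides the reply target and the message (stubbed here).
    return ("User", "Here is the answer.") if history else ("CodeInterpreter", "solve it")

def code_interpreter(message: str) -> str:
    return f"executed code for: {message}"  # generates + runs code with tools

def chat_round(user_query: str) -> str:
    history = [user_query] if False else []
    while True:
        target, message = planner(history)          # 1. decide the reply target
        if target == "User":                        # 2. reply to the user
            return message
        history.append(code_interpreter(message))   # 3. CodeInterpreter solves

print(chat_round("plot my sales data"))
```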
…limitations - Good. TaskWeaver covers a lot of use cases.
- ✅ Objective definition: set the target use case and overcome or solve it
  - Good.
- ✅ Use LLMs as components
  - Yes. Typical practice of using LLMs as components in the system.
  - Planner prompt
    - JSON response schema: init_plan, plan, current_plan_step, send_to, message (example below)
  - Experience module
    - Forced not to repeat mistakes in the Code Interpreter
    - e.g., tool-argument generation, function-usage scenarios
  - LLMs in the plugins (e.g., summary, speech-to-text)
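For concreteness, a Planner response in that schema might look like the following; the keys come from the slide, the values are purely illustrative:

```python
planner_response = {
    "init_plan": "1. load the csv  2. compute stats  3. report to the user",
    "plan": "1. load the csv  2. compute stats and report",
    "current_plan_step": "1. load the csv",
    "send_to": "CodeInterpreter",
    "message": "Load ./sales.csv with pandas and show the first rows.",
}
```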
✅ Good to use for general purposes like ChatGPT
- Yes. It provides a very similar experience to using the ChatGPT Code Interpreter.
- It provides SerpAPI and a web_search_plugin, like Web Browsing in ChatGPT.
✅ Good for studying the codebase
- Well-crafted code architecture compared to other Python agent frameworks
❌ Good to adapt to a business service
- Hard to customize the framework.
- Hard to adapt the framework to an online server. It is only a framework for researching and finding the possibilities/limitations.
❌ Good for making your own agent
- No. It's very hard to implement a new planner.
- Very hard to modify the architecture.
❌ Handle the performance of TaskWeaver
- Same as AutoGPT: iterative reasoning can lead to a wrong answer.
- ✅ Good for making your own agent
- ✅ Good to use for general purposes like ChatGPT
- ✅ Good to adapt to a business service
- ✅ Well-crafted code architecture
- ✅ Faster, cheaper, simpler, easy prompt-debugging
- ✅ Handles performance
There is no public solution that meets all of these.
- ❓ Easy-debugging for service and prompt results
- ❓ Good to evaluate the system performance offline
  - Simple evaluation: just use public benchmarks to approximate the service performance
- ❓ Good to unit test the LLM component
  - Simple unit tests over a very small dataset: see the examples from Liner (and the sketch below)
- ❓ Easy to utilize the larger context of tool responses
  - How to manage the large crawling response?
- ❓ Good to log the service performance online
  - If a component's response produces a probability, record and log it.
- ❓ More domain-specific requirements …
  - Understanding domain knowledge
Unit test reference: https://speakerdeck.com/huffon/autonomous-agent-in-production?slide=81
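A sketch of such a unit test over a very small dataset, in the spirit of the referenced Liner examples. `classify_intent` is a hypothetical component under test, stubbed so the test runs:

```python
import pytest

SMALL_DATASET = [
    ("What's the weather in Seoul?", "weather"),
    ("Translate 'hello' to Korean", "translation"),
]

def classify_intent(query: str) -> str:
    # Stub for an LLM-backed component; replace with the real call.
    return "weather" if "weather" in query.lower() else "translation"

@pytest.mark.parametrize("query,expected", SMALL_DATASET)
def test_intent_component(query, expected):
    # The assertion encodes the contract the rest of the system relies on.
    assert classify_intent(query) == expected
```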
(… "system" slides)
- ✅ Easy-debugging for service and prompt results
  - Promptflow doesn't provide debugging features per se, but its visualization and modularization of the flow lead to easy debugging. ➡ Developers should divide the features into a graph flow.
- 🏋 Good to evaluate the system performance offline
  - ✅ Simple evaluation: just use public benchmarks to approximate the service performance
    - Promptflow provides a simple evaluation feature.
  - 🏋 Better in-house benchmark systems
- ✅ Good to unit test the LLM component
  - ✅ Simple unit tests over a very small dataset
    - Promptflow provides flow-driven development, which makes unit testing easy.
- ✅ Easy to utilize the larger context of tool responses
  - ✅ How to manage the large crawling response? ➡ "wrtn search" was trying to solve this problem and accomplished it.
Promptflow isn't the ultimate solution, but it is very useful.
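The "divide features into a graph flow" point looks like this in code: each feature becomes a small node function and the flow wires them together, so every edge is an inspectable, unit-testable intermediate. This is a generic sketch of the idea, not the actual Promptflow API:

```python
# Three nodes of a hypothetical RAG flow; each is trivially testable alone.

def rewrite_query(user_query: str) -> str:
    return user_query.strip()                     # node 1: LLM query rewrite

def search(query: str) -> list[str]:
    return [f"doc about {query}"]                 # node 2: tool call

def answer(query: str, docs: list[str]) -> str:
    return f"Answer to '{query}' using {len(docs)} docs"  # node 3: LLM answer

def flow(user_query: str) -> str:
    q = rewrite_query(user_query)
    docs = search(q)
    return answer(q, docs)   # each edge is an inspectable intermediate

print(flow("  best korean llm benchmarks "))
```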
- ❓ Good to log the service performance online (from the perspective of an AI service)
  - It is hard to define the score of an AI service.
  - How do you define the score of an answer in the online service?
  - Maybe other engineering and service management are the solution.
  - e.g., a RAG service can log the similarity score, but this isn't user satisfaction (a logging sketch follows below).
- ❓ More domain-specific requirements …
  - 🏋 How do you define what is meaningful to the user, and how will the system capture it?
  - Areas the company will always need to address and prove
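A sketch of that logging idea: whenever a component emits a probability or score (e.g., RAG similarity), record it with the request so offline analysis becomes possible, while remembering the caveat above that the score is not user satisfaction. Names are illustrative:

```python
import json, logging, time

logger = logging.getLogger("ai_service")

def log_component_score(request_id: str, component: str, score: float) -> None:
    # Structured log line per component response that carries a score.
    logger.info(json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "component": component,
        "score": score,   # e.g., retrieval similarity, classifier probability
    }))

log_component_score("req-42", "rag_retriever", 0.87)
```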
👍 Pros
- Easy and natural methods of building an AI service
- Easy to compose complex functional requirements as a service
- Easy to overcome the limits of LLM performance
👎 Cons
- Hard to accomplish a good architecture design
- Hard to debug; hard to do cooperative engineering
- Complicated metrics and evaluation for research and service
- The whole quality of the system depends on the responses of the tools
- Depends on too many outer services, which makes the service expensive.
…engineering and LLM components
- Always follow up on new methods, and decide whether to adopt them
- An LLM library isn't always the solution. It can be a burden.
- Easy evaluation and easy scoring for fast prototyping.
- But we still need a Korean agent benchmark.
… there were many parts that were difficult.
- Nevertheless, just as many people watched open-source models grow, use cases emerge, and generation-tool architectures improve during the Stable Diffusion boom, I hope diverse attempts from the LLM project-architecture perspective, like TaskWeaver and AutoGen, arise in the open-source camp.
- Personally, since I benefit from the work of many contributors at home and abroad, I will try to contribute more myself.
- As this talk was lacking in many ways, I hope the community supplements it with good perspectives through further discussions on composing AI systems.
Another Conclusion