Building a Data Pond for LLMOps

Building a Giant Data Pond for LLMOps
Sungmin Han | MLOps Lead, Riiid

Speaker
Sungmin Han (한성민), MLOps Lead at Riiid
• Google Developer Experts (GDE) for ML
• Google Developer Groups (GDG) for Go
• F-Lab Python Mentor
• Former Research Engineer at Naver Clova
• Former Software Engineer at IGAWorks
• Former Software Engineer at SimSimi (심심이)

Topics covered in this session
1. Lakehouse
2. Large Language Model (LLM)
3. ETL vs ELT
4. Medallion architecture
5. Hands-on LLM training
6. And more…

Lakehouse

Limitations of the Data Warehouse
Diagram: Data is tableized on its way into the Data Warehouse, so loss inevitably occurs.

Characteristics of LLM data
• Unstructured text data
• Embedding vectors of the text (Array)
• Images / audio linked to the text (multimodal)
• Text label data (Complex)
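As a rough illustration of why data with these characteristics resists a flat table, the sketch below models one record with free text, an embedding array, linked media, and a nested label, then keeps only the scalar columns the way a strictly tabular schema would. The class, field names, and sample values are hypothetical and do not come from the talk.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class LLMRecord:
    """One hypothetical LLM data record (all field names are assumptions)."""
    text: str                                               # unstructured text
    embedding: List[float] = field(default_factory=list)    # embedding vector (array)
    media_uris: List[str] = field(default_factory=list)     # linked images / audio
    labels: Dict[str, Any] = field(default_factory=dict)    # nested (complex) label


def flatten_for_table(record: LLMRecord) -> Dict[str, Any]:
    """Keep only scalar fields, as a strictly tabular schema would."""
    return {k: v for k, v in asdict(record).items()
            if isinstance(v, (str, int, float))}


record = LLMRecord(
    text="The keynote was excellent.",
    embedding=[0.1, -0.3, 0.7],
    media_uris=["s3://example-bucket/clip.wav"],  # made-up URI
    labels={"sentiment": {"value": "positive", "annotators": 3}},
)

flat = flatten_for_table(record)                   # arrays and nested labels are dropped
lossless = json.loads(json.dumps(asdict(record)))  # an open, nested format keeps everything

print(flat)
print(sorted(lossless))
```

Only the `text` column survives the flattening, while the JSON form keeps all four fields.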

LakeHouse
https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
a new architecture choice has emerged: the data lakehouse, which combines key benefits of data lakes and data warehouses. This architecture offers low-cost storage in an open format accessible by a variety of processing engines like Spark while also providing powerful management and optimization features.

Large Language Model (LLM) Data Processing

Large Language Model (LLM)
Diagram: Large Unlabeled Corpus -> Large Language Model (LLM)

Large Language Model (LLM)
Diagram:
• Pre-Training: Large Unlabeled Corpus -> Large Language Model (LLM)
• Fine-Tuning: Small Labeled Corpus -> Large Language Model (LLM)

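LLM example (sentiment classification / few-shot)
From now on, you must classify the sentiment of reviews; leave out any explanation. Examples of sentiment classification are as follows.
INPUT: This event was a huge waste of time and, above all, it was no fun.
OUTPUT: Very negative
INPUT: Wow, I lost track of time watching the event. Recommended!!
OUTPUT: Very positive
INPUT: I went to the event and attended a technical session.
OUTPUT: Neutral
INPUT: The speakers' backgrounds were impressive, and the slides were really good.
OUTPUT: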

LLM example (sentiment classification / few-shot): result
INPUT: The speakers' backgrounds were impressive, and the slides were really good.
OUTPUT: Positive
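The INPUT/OUTPUT layout of this prompt can be assembled programmatically. The following is a minimal sketch: the function name is hypothetical, the examples are English renderings of the slide's reviews, and sending the prompt to an actual model is left out.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble instruction, labeled examples, and the unlabeled query."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f"INPUT: {text}")
        lines.append(f"OUTPUT: {label}")
    # The final OUTPUT is left empty for the model to complete.
    lines.append(f"INPUT: {query}")
    lines.append("OUTPUT:")
    return "\n".join(lines)


instruction = "Classify the sentiment of each review. Do not explain."
examples = [
    ("This event was a huge waste of time and no fun.", "Very negative"),
    ("I lost track of time watching the event. Recommended!!", "Very positive"),
    ("I went to the event and attended a technical session.", "Neutral"),
]
prompt = build_few_shot_prompt(
    instruction,
    examples,
    "The speakers' backgrounds were impressive, and the slides were really good.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap in a different label set or to reuse the same examples for many queries.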

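LLM example (sentiment classification / few-shot) => multiple inputs
• Even so, the content of some sessions was disappointing.
• Dataya Nolja 2023 (데이터야놀자 2023) was full of really useful sessions.
• I was impressed by the organizers' careful preparation and management.
• There did not seem to be enough networking time.
• It was great to see the latest data trends and technologies at a glance.
• The venue's location and transportation were inconvenient.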

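LLM example (sentiment classification / few-shot) => multiple inputs: result
• Even so, the content of some sessions was disappointing. => Negative
• Dataya Nolja 2023 (데이터야놀자 2023) was full of really useful sessions. => Very positive
• I was impressed by the organizers' careful preparation and management. => Positive
• There did not seem to be enough networking time. => Negative
• It was great to see the latest data trends and technologies at a glance. => Positive
• The venue's location and transportation were inconvenient. => Negative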

LLM example: actual behavior

Large Language Model (LLM)
Diagram: Prompt (Text) -> LLM -> Output (Text)
Example models: Llama, GPT, Falcon, PaLM2, Vicuna

ETL vs ELT

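ETL vs ELT
ETL
• Processed data (structured)
• Structured, Semi-Structured
• Easy to analyze immediately
• Text
• For analysis
ELT
• Source (raw) data
• Structured, Semi-Structured, Unstructured
• Includes all data
• Text, images, sound
• For analysis and research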

"ETL inevitably loses data"
The problem with conventional ETL
Diagram: Source -> Transformation -> Destination

“ELT” (Extract - Load - Transform)

"Load everything, whether it is used or not, then transform it later as needed"
The goal of the new ELT concept
Diagram: Source -> Load -> Destination -> Transformation
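The difference in ordering can be sketched in a few lines. In this toy example the event fields and the `transform` function are hypothetical: the ETL destination only ever receives the transformed rows, while the ELT destination receives the raw events and is transformed on demand.

```python
# Made-up source events; field names are assumptions for illustration.
raw_events = [
    {"user": "a", "review": "Great sessions!", "rating": 5,
     "audio_uri": "s3://example-bucket/a.wav"},
    {"user": "b", "review": "Too crowded.", "rating": 2, "audio_uri": None},
]


def transform(event):
    """Reduce an event to the structured columns one report needs."""
    return {"user": event["user"], "rating": event["rating"]}


# ETL: transform first, then load only the transformed rows.
etl_destination = [transform(e) for e in raw_events]

# ELT: load the raw events as they are, transform later when needed.
elt_destination = list(raw_events)
elt_report = [transform(e) for e in elt_destination]

# A later requirement (for example, text for LLM training) needs the review body.
etl_texts = [e.get("review") for e in etl_destination]  # already lost
elt_texts = [e.get("review") for e in elt_destination]  # still available

print(etl_texts)
print(elt_texts)
```

Both approaches produce the same report, but only the ELT destination can still serve the requirement that arrived after loading.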

Medallion Architecture

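Medallion Architecture
• Bronze: original data, normalized into a structured form
• Silver: filtering, data preprocessing, selecting only the data that is used
• Gold: processed for business purposes, final tables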

Pipeline on Medallion
Diagram: Quizium (Our Service) -> Fivetran -> Original -> Bronze -> Silver -> Gold
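The layering can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the talk's pipeline: the quiz-answer log, its field names, and the three functions are all hypothetical.

```python
import json
from collections import Counter

# "Original" source: raw JSON lines from a made-up quiz-answer log.
original = [
    '{"user": "u1", "quiz": "q1", "correct": true}',
    '{"user": "u2", "quiz": "q1", "correct": false}',
    '{"user": "u1", "quiz": "q2", "correct": true}',
    'not-json',
]


def to_bronze(lines):
    """Bronze: keep every record as received, with its position."""
    return [{"line_no": i, "raw": line} for i, line in enumerate(lines)]


def to_silver(bronze):
    """Silver: parse, drop unparseable rows, keep only the columns in use."""
    rows = []
    for rec in bronze:
        try:
            parsed = json.loads(rec["raw"])
        except json.JSONDecodeError:
            continue
        rows.append({"user": parsed["user"], "correct": parsed["correct"]})
    return rows


def to_gold(silver):
    """Gold: a business-level table, here correct answers per user."""
    counts = Counter()
    for row in silver:
        counts[row["user"]] += int(row["correct"])
    return dict(counts)


bronze = to_bronze(original)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Because Bronze keeps the unparseable line, a later fix to the parser can recover it without going back to the source system.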

Hands-on LLM Training

[Demonstration]

Demonstration Scenario
Diagram: Bronze -> Silver -> Gold -> Llama2 Fine-tuning (Llama2)

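Bronze

    import pandas as pd
    from pyspark.sql.types import StringType, StructField, StructType

    # Assumes `dataset` and `spark` are already defined (not shown on the slide).
    df = dataset["train"].to_pandas()

    # Expand the nested 'preference-suggestion' column into top-level columns.
    df = pd.concat(
        [df.drop(['preference-suggestion'], axis=1),
         df['preference-suggestion'].apply(pd.Series)],
        axis=1)

    # Expand each metadata column, prefixing new columns with the source column name.
    for col in ['preference-suggestion-metadata', 'correct-response-suggestion-metadata']:
        expanded = df[col].apply(pd.Series)
        expanded = expanded.rename(lambda x: col + "_" + x, axis=1)
        df = pd.concat([df.drop([col], axis=1), expanded], axis=1)

    # Convert array-like values to plain Python lists.
    df['rank'] = df['rank'].apply(lambda x: [int(i) for i in x])
    df['value'] = df['value'].apply(list)

    schema = StructType([
        StructField("request", StringType()),
        StructField("response-1", StringType()),
        ...
    ])
    spark_df = spark.createDataFrame(df, schema=schema)

    # Write the Bronze layer as a Delta table.
    bronze_path = "/dataya_nolja/bronze"
    spark_df.write.format("delta").mode("overwrite").save(bronze_path)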

Silver

    # Keep only the columns used for fine-tuning.
    silver_df = spark_df.select("request", "response-1")

    silver_path = "/dataya_nolja/silver"
    silver_df.write.format("delta").mode("overwrite").save(silver_path)

Gold

    from pyspark.sql.functions import concat, lit

    df = silver_df

    # Build one training string per row: [REQ]request[/REQ][RES]response[/RES]
    gold_df = df.select(concat(
        lit("[REQ]"), df['request'], lit("[/REQ]"),
        lit("[RES]"), df['response-1'], lit("[/RES]")
    ).alias("text"))

    gold_path = "/dataya_nolja/gold"
    gold_df.write.format("delta").mode("overwrite").save(gold_path)
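The string template produced by the `concat` expression can be checked without a Spark session. Here is a plain-Python sketch of the same template and its inverse: the function names and sample strings are hypothetical, and the parser assumes the markers do not appear inside the request itself.

```python
def to_gold_text(request: str, response: str) -> str:
    """Mirror the Gold concat: [REQ]request[/REQ][RES]response[/RES]."""
    return f"[REQ]{request}[/REQ][RES]{response}[/RES]"


def parse_gold_text(text: str):
    """Split a Gold string back into (request, response)."""
    if not text.startswith("[REQ]"):
        raise ValueError("text does not start with [REQ]")
    req_end = text.index("[/REQ]")
    res_start = text.index("[RES]", req_end) + len("[RES]")
    res_end = text.rindex("[/RES]")
    return text[len("[REQ]"):req_end], text[res_start:res_end]


sample = to_gold_text("How do I reset my password?",
                      "Open Settings and choose Reset.")
print(sample)
print(parse_gold_text(sample))
```

A round trip like this is a cheap sanity check that the training format and the prompt format used at inference time agree on the markers.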

Gold Dataset

Upload to HuggingFace

    import pandas as pd
    from datasets import Dataset

    # Assumes `huggingface_token` holds an access token (not shown on the slide).
    pdf = gold_df.toPandas()
    dataset = Dataset.from_pandas(pdf)
    dataset.push_to_hub("ken-sungmin/dataya-nolja-llama2-finetuning", token=huggingface_token)

HuggingFace Datasets
https://huggingface.co/datasets/ken-sungmin/dataya-nolja-llama2-finetuning

Fine-Tuning (w/ LoRA)

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Dataset
    data_name = "ken-sungmin/dataya-nolja-llama2-finetuning"
    training_data = load_dataset(data_name, split="train")

    # Model and tokenizer names
    base_model_name = "NousResearch/Llama-2-7b-chat-hf"
    refined_model = "llama2-finetuned-dataya-nolja"

    # Tokenizer
    llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
    llama_tokenizer.pad_token = llama_tokenizer.eos_token
    llama_tokenizer.padding_side = "right"  # Fix for fp16

    # Quantization Config
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False
    )

    # Model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=quant_config,
        device_map={"": 0}
    )
    base_model.config.use_cache = False
    base_model.config.pretraining_tp = 1

Fine-Tuning (w/ LoRA) #2

    from peft import LoraConfig
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # LoRA Config
    peft_parameters = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        ...
    )

    # Training Params
    train_params = TrainingArguments(
        output_dir="./results_modified",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        save_steps=25,
        ...
    )

    # Trainer
    fine_tuning = SFTTrainer(
        model=base_model,
        train_dataset=training_data,
        ...
    )

    # Training
    fine_tuning.train()

    # Save the fine-tuned weights and the tokenizer.
    fine_tuning.model.save_pretrained(refined_model, safe_serialization=True)
    llama_tokenizer.save_pretrained(refined_model)
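To give a feel for why LoRA keeps fine-tuning cheap, the sketch below counts trainable parameters for one weight matrix. LoRA freezes the original d_out x d_in matrix and trains two low-rank factors instead, scaling their product by lora_alpha / r. The rank `r = 8` is a hypothetical value (the slide elides it), and the 4096 x 4096 shape is an assumed projection size used only for illustration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


def lora_scale(lora_alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update B @ A."""
    return lora_alpha / r


d_out = d_in = 4096   # assumed projection size
r = 8                 # hypothetical rank; not given on the slide
lora_alpha = 16       # value from the slide

full = full_params(d_out, d_in)
low_rank = lora_params(d_out, d_in, r)
print(f"full: {full:,}  lora: {low_rank:,}  ratio: {low_rank / full:.4%}")
print(f"scale: {lora_scale(lora_alpha, r)}")
```

Under these assumptions the adapter trains well under one percent of the parameters of the matrix it modifies.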

Inferencing

    from transformers import pipeline

    # Generate text: wrap the query in the same [REQ]...[/REQ] markers used in the Gold dataset.
    query = "I want to send money, how long will it take?"
    text_gen = pipeline(task="text-generation", model=fine_tuning.model,
                        tokenizer=llama_tokenizer, max_length=200)
    output = text_gen(f"<s>[REQ]{query}[/REQ]")

Output

    [{'generated_text': "<s>[REQ]I want to send money, how long will it take?[/REQ] Hello there! Thank you for trusting the Bank of America with your financial needs. everybody here is happy to help you with your inquiry. To answer your question, the time it takes to send money through the Bank of America depends on the transfer method you choose.\nFor domestic transfers, typically the funds are available in the recipient's account within 1-2 business days. However, if you're sending money internationally, the time frame may vary depending on the country and the transfer method you choose.\nFor example, if you're using our Global Account Transfer service, the funds may take 2-3 business days to reach the recipient's account. Or, if you're using our Wire Transfer service, the funds can be available in the recipient's account within 24 hours.\nPlease note that the"}]

And more…

RAG (Retrieval Augmented Generation)
Diagram: Prompt -> Vector Search -> Context; Prompt + Context -> LLM
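The flow in the diagram can be sketched end to end with a toy in-memory store. This is an illustration only: the documents are made up, the three-dimensional vectors are hand-written stand-ins for real embeddings, and the function names are hypothetical. A real setup would use an embedding model and a vector store.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Vector search: return the k texts closest to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_distance(query_vec, item["embedding"]))
    return [item["text"] for item in ranked[:k]]


def build_rag_prompt(prompt, contexts):
    """Combine the retrieved context with the original prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {prompt}"


# Made-up documents with hand-written toy embeddings.
store = [
    {"text": "Document about transfer times.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Document about venue opening hours.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Document about wire transfers.", "embedding": [0.8, 0.3, 0.1]},
]
query = "I want to send money, how long will it take?"
query_vec = [1.0, 0.2, 0.0]  # stand-in for the query's embedding

contexts = retrieve(query_vec, store, k=2)
rag_prompt = build_rag_prompt(query, contexts)
print(rag_prompt)
```

The LLM then receives `rag_prompt` instead of the bare question, so its answer can draw on the retrieved documents.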

Vector Store (BigQuery)

    WITH query_embedding AS (
      SELECT *
      FROM ML.GENERATE_TEXT_EMBEDDING(
        MODEL text.embedding_model,
        (SELECT "TREES DOWN NEAR THE INTERSECTION OF HIGHWAY" AS content))
    )
    SELECT
      q.content AS query_text,
      c.content AS candidate_text,
      ML.DISTANCE(q.text_embedding, c.text_embedding, 'COSINE') AS distance
    FROM
      query_embedding AS q,
      semantic_search_tutorial.candidate_embedding_cluster9 AS c
    ORDER BY distance ASC
    LIMIT 20;

Vector Store (Databricks)

    import mlflow
    from mlflow import gateway

    gateway.set_gateway_uri(gateway_uri="databricks")

    mosaic_embeddings_route_name = "mosaicml-instructor-xl-embeddings"

    try:
        route = gateway.get_route(mosaic_embeddings_route_name)
    except Exception:
        # Any failure to fetch the route is treated as "route does not exist", so create it.
        print(f"Creating the route {mosaic_embeddings_route_name}")
        print(gateway.create_route(
            name=mosaic_embeddings_route_name,
            route_type="llm/v1/embeddings",
            model={
                "name": "instructor-xl",
                "provider": "mosaicml",
                "mosaicml_config": {
                    "mosaicml_api_key": dbutils.secrets.get(scope="dbdemos", key="mosaic_ml_api_key")
                }
            }
        ))

Vector Store (Databricks) / #2

    from time import sleep

    # Assumes `catalog`, `db`, `vsc`, `index_exists` and `set_index_permission`
    # are already defined (not shown on the slide).
    source_table_fullname = f"{catalog}.{db}.databricks_documentation"
    vs_index_fullname = f"vs_catalog.{db}.databricks_documentation_index"

    if not index_exists(vs_index_fullname):
        print(f'Creating a vector store index `{vs_index_fullname}` against the table `{source_table_fullname}`, using AI Gateway {mosaic_embeddings_route_name}')
        i = vsc.create_delta_sync_index(
            source_table_name=source_table_fullname,
            dest_index_name=vs_index_fullname,
            primary_key="id",
            column_to_embed="content",
            ai_gateway_route_name=mosaic_embeddings_route_name
        )
        sleep(3)
        spark.sql(f'ALTER SCHEMA vs_catalog.{db} OWNER TO `account users`')
        set_index_permission(f"vs_catalog.{db}.databricks_documentation_index", "ALL_PRIVILEGES", "account users")
        print(i)

Document AI Scraping

Spark NLP #1

    from pyspark.ml import Pipeline
    from sparknlp.annotator import LemmatizerModel, Normalizer, Tokenizer
    from sparknlp.base import DocumentAssembler, Finisher

    # Assumes `df` is a Spark DataFrame with a "text" column (not shown on the slide).
    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("tokens")
    normalizer = Normalizer().setInputCols(["tokens"]).setOutputCol("normalized")
    lemmatizer = LemmatizerModel.pretrained().setInputCols(["normalized"]).setOutputCol("lemmatized")
    finisher = Finisher().setInputCols(["lemmatized"]).setOutputCols(["finished_tokens"])

    nlp_pipeline = Pipeline(stages=[
        document_assembler,
        tokenizer,
        normalizer,
        lemmatizer,
        finisher
    ])

    nlp_model = nlp_pipeline.fit(df)
    processed_df = nlp_model.transform(df)

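Spark NLP #2

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import CountVectorizer, StringIndexer

    # Turn the finished tokens into count vectors and the string label into an index.
    count_vectorizer = CountVectorizer(inputCol="finished_tokens", outputCol="features")
    label_indexer = StringIndexer(inputCol="label", outputCol="label_index")
    lr = LogisticRegression(featuresCol="features", labelCol="label_index")

    classification_pipeline = Pipeline(stages=[
        count_vectorizer,
        label_indexer,
        lr
    ])

    # Train on 70% of the data and evaluate on the remaining 30%.
    train_df, test_df = processed_df.randomSplit([0.7, 0.3])
    classification_model = classification_pipeline.fit(train_df)
    predictions = classification_model.transform(test_df)
    predictions.select("text", "label", "prediction").show()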
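For readers without a Spark cluster, the idea behind the token-count features can be shown in plain Python. This is a simplified stand-in, not Spark NLP or Spark ML behavior: the regex tokenizer only approximates the tokenizer and normalizer stages, lemmatization is omitted, and the sample sentences are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def fit_vocabulary(token_lists):
    """Give each distinct token a column index, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ordered)}


def vectorize(tokens, vocab):
    """Count how often each vocabulary token occurs in one document."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


docs = ["Great sessions, great speakers!", "The venue was crowded."]
token_lists = [tokenize(d) for d in docs]
vocab = fit_vocabulary(token_lists)
features = [vectorize(toks, vocab) for toks in token_lists]

print(vocab)
print(features)
```

Each document becomes a fixed-length vector of counts, which is the kind of input a linear classifier such as logistic regression consumes.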

Question?

Thank you!

End of Doc.
