
Building a Data Pond for LLMOps (LLMOps를 위한 데이터 연못 만들기)

Sungmin Han
October 14, 2023

This deck walks step by step through everything from Databricks data preprocessing to Llama2 + LoRA fine-tuning.

It was presented on October 14 at 데이터야놀자 (Data-ya Nolja) under the title "Building a Giant Data Pond for LLMOps."

It covers the end-to-end flow: Spark preprocessing of Bronze -> Silver -> Gold data in the Databricks Medallion architecture, registering the dataset on Hugging Face, fine-tuning with Llama2 + LoRA, and running LLM inference with the Transformers pipeline.

Transcript

  1. Building a Giant Data Pond for LLMOps
    Sungmin Han (한성민) | Riiid MLOps Lead

  2. Speaker
    한성민 (Sungmin Han)
    MLOps Lead at Riiid
    Google Developer Experts (GDE) for ML
    Google Developer Groups (GDG) for Go
    F-Lab Python Mentor
    Former Research Engineer at Naver Clova
    Former Software Engineer at IGAWorks
    Former Software Engineer at SimSimi (심심이)


  3. Topics in This Session
    1. Lakehouse
    2. Large Language Model (LLM)
    3. ETL vs ELT
    4. Medallion Architecture
    5. Hands-on LLM Training
    6. And more…

  4. Lakehouse

  5. Limitations of the Data Warehouse
    Data Warehouse
    Data is forced into tables
    Loss inevitably occurs

  6. Characteristics of LLM Data
    ● Unstructured text data
    ● Embedding vector data for text (Array)
    ● Images / audio linked to text (multimodal)
    ● Text label data (Complex)

  7. LakeHouse
    https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
    a new architecture choice has emerged: the data
    lakehouse, which combines key benefits of data
    lakes and data warehouses. This architecture offers
    low-cost storage in an open format accessible by a
    variety of processing engines like Spark while also
    providing powerful management and optimization
    features.


  8. Large Language Model (LLM)
    Data Processing

  9. Large Language Model (LLM)
    Large Unlabeled Corpus → Large Language Model (LLM)

  10. Large Language Model (LLM)
    Pre-Training: Large Unlabeled Corpus → Large Language Model (LLM)
    Fine-Tuning: Small Labeled Corpus → Large Language Model (LLM)

  11. LLM Example (Sentiment Classification / Few-shot)
    From now on, you will classify the sentiment of reviews; leave out any explanation.
    Here are examples of the sentiment classification.
    INPUT:
    This event was a huge waste of time and, above all, it wasn't fun.
    OUTPUT:
    Very negative
    INPUT:
    Wow, I lost track of time watching the event. Recommended!!
    OUTPUT:
    Very positive
    INPUT:
    I went to the event and attended the technical sessions.
    OUTPUT:
    Neutral
    INPUT:
    The event speakers had impressive backgrounds, and the presentation materials were great.
    OUTPUT:

  12. LLM Example (Sentiment Classification / Few-shot) Result
    INPUT:
    The event speakers had impressive backgrounds, and the presentation materials were great.
    OUTPUT:
    Positive

  13. LLM Example (Sentiment Classification / Few-shot) => Multiple Inputs
    Even so, the content of a few sessions was disappointing.
    데이터야놀자 2023 was packed with really informative sessions.
    I was impressed by the organizers' careful preparation and management.
    I think there wasn't enough networking time.
    It was great to see the latest data trends and technologies at a glance.
    The venue location and transportation were inconvenient.

  14. LLM Example (Sentiment Classification / Few-shot) => Multiple Input Results
    Even so, the content of a few sessions was disappointing. → Negative
    데이터야놀자 2023 was packed with really informative sessions. → Very positive
    I was impressed by the organizers' careful preparation and management. → Positive
    I think there wasn't enough networking time. → Negative
    It was great to see the latest data trends and technologies at a glance. → Positive
    The venue location and transportation were inconvenient. → Negative

  15. LLM Example in Action
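A few-shot prompt like the one above can be run through a text-generation pipeline. A minimal sketch; the model name is an assumption (the base model named later in this deck), and the exact prompt wording is illustrative rather than what was used in the live demo.

    from transformers import pipeline

    # Illustrative model choice; the demo's actual setup is not shown on these slides.
    classifier = pipeline(task="text-generation", model="NousResearch/Llama-2-7b-chat-hf")

    few_shot_prompt = (
        "Classify the sentiment of each review. Answer with a single label, no explanation.\n"
        "INPUT: This event was a huge waste of time and, above all, it wasn't fun.\n"
        "OUTPUT: Very negative\n"
        "INPUT: Wow, I lost track of time watching the event. Recommended!!\n"
        "OUTPUT: Very positive\n"
        "INPUT: I went to the event and attended the technical sessions.\n"
        "OUTPUT: Neutral\n"
        "INPUT: The event speakers had impressive backgrounds, and the presentation materials were great.\n"
        "OUTPUT:"
    )

    result = classifier(few_shot_prompt, max_new_tokens=5)
    print(result[0]["generated_text"])  # expected to end with a label such as "Positive"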

  16. Large Language Model (LLM)
    Prompt (Text) → LLM → Output (Text)
    Examples: Llama, GPT, Falcon, PaLM2, Vicuna

  17. ETL vs ELT
    ETL: processed data (structured); Structured, Semi-Structured; easy to analyze immediately; text; for analysis purposes
    ELT: raw source data; Structured, Semi-Structured, Unstructured; includes all data; text, images, audio; for analysis and research purposes

  18. "ETL inevitably incurs loss"
    The problem with traditional ETL:
    Source → Transformation → Destination

  19. “ELT”
    (Extract - Load - Transform)


  20. "Load everything, whether it will be used or not; transform later as needed"
    The goal of the new ELT concept:
    Source → Load → Destination, with Transformation performed later inside the destination
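The "load first, transform later" idea can be sketched with PySpark and Delta Lake. The paths and column names below are illustrative assumptions; `spark` is the active SparkSession, as in the Databricks notebooks later in the deck.

    # Extract + Load: persist the source records untouched, so nothing is lost up front.
    raw_df = spark.read.json("/source/events/")  # hypothetical source location
    raw_df.write.format("delta").mode("append").save("/lake/raw/events")

    # Transform: done later, on demand, against the already-loaded raw data.
    events = spark.read.format("delta").load("/lake/raw/events")
    cleaned = events.selectExpr("event_id", "lower(event_type) AS event_type", "event_time")
    cleaned.write.format("delta").mode("overwrite").save("/lake/curated/events")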

  21. Medallion Architecture

  22. Medallion Architecture
    Bronze → Silver → Gold
    Bronze: raw data, structured on ingestion
    Silver: filtering and data preprocessing; keep only the data that will actually be used
    Gold: processed for business purposes; the final tables

  23. Pipeline on Medallion
    Quizium (Our Service) → Fivetran → Original → Bronze → Silver → Gold

  24. Hands-on LLM Training

  25. [Demonstration]


  26. Demonstration Scenario
    Bronze → Silver → Gold → Llama2 Fine-tuning → fine-tuned Llama2

  27. Bronze
    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType

    # `dataset` is the raw Hugging Face dataset loaded earlier; flatten it to pandas.
    df = dataset["train"].to_pandas()

    # Expand the nested 'preference-suggestion' struct into top-level columns.
    df = pd.concat(
        [df.drop(['preference-suggestion'], axis=1), df['preference-suggestion'].apply(pd.Series)],
        axis=1,
    )

    # Expand the metadata structs, prefixing each new column with its source column name.
    for col in ['preference-suggestion-metadata', 'correct-response-suggestion-metadata']:
        expanded = df[col].apply(pd.Series)
        expanded = expanded.rename(lambda x: col + "_" + x, axis=1)
        df = pd.concat([df.drop([col], axis=1), expanded], axis=1)

    # Normalize list-typed columns so Spark can infer consistent types.
    df['rank'] = df['rank'].apply(lambda x: [int(i) for i in x])
    df['value'] = df['value'].apply(list)

    # Explicit schema for the Bronze table (remaining fields elided on the slide).
    schema = StructType([
        StructField("request", StringType()),
        StructField("response-1", StringType()),
        ...
    ])

    # Write the raw, lightly structured data as the Bronze Delta table.
    spark_df = spark.createDataFrame(df, schema=schema)
    bronze_path = "/dataya_nolja/bronze"
    spark_df.write.format("delta").mode("overwrite").save(bronze_path)
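As a quick sanity check before building the Silver layer, the Bronze Delta table can be read back with standard Delta Lake APIs. A minimal sketch, reusing the bronze_path written above.

    # Read the Bronze table back and inspect its schema and a few rows.
    bronze_df = spark.read.format("delta").load("/dataya_nolja/bronze")
    bronze_df.printSchema()
    bronze_df.show(5, truncate=False)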

  28. Silver
    # Silver: keep only the columns that will actually be used downstream.
    silver_df = spark_df.select("request", "response-1")
    silver_path = "/dataya_nolja/silver"
    silver_df.write.format("delta").mode("overwrite").save(silver_path)

  29. Gold
    from pyspark.sql.functions import concat, lit

    # Gold: format each request/response pair into a single training string
    # wrapped in [REQ]...[/REQ][RES]...[/RES] markers.
    df = silver_df
    gold_df = df.select(
        concat(
            lit("[REQ]"),
            df['request'],
            lit("[/REQ]"),
            lit("[RES]"),
            df['response-1'],
            lit("[/RES]"),
        ).alias("text")
    )
    gold_path = "/dataya_nolja/gold"
    gold_df.write.format("delta").mode("overwrite").save(gold_path)

  30. Gold Dataset

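A preview like the Gold Dataset slide can be produced with standard Spark APIs. A minimal sketch, using the gold_path written above.

    # Each row holds one "[REQ]...[/REQ][RES]...[/RES]" training string.
    gold_preview = spark.read.format("delta").load("/dataya_nolja/gold")
    gold_preview.show(5, truncate=False)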

  31. Upload to HuggingFace
    import pandas as pd
    from datasets import Dataset

    # Convert the Gold Spark DataFrame to pandas and push it to the Hugging Face Hub.
    pdf = gold_df.toPandas()
    dataset = Dataset.from_pandas(pdf)
    dataset.push_to_hub("ken-sungmin/dataya-nolja-llama2-finetuning", token=huggingface_token)

  32. HuggingFace Datasets
    https://huggingface.co/datasets/ken-sungmin/dataya-nolja-llama2-finetuning


  33. Fine-Tuning (w/ LoRA)
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Dataset
    data_name = "ken-sungmin/dataya-nolja-llama2-finetuning"
    training_data = load_dataset(data_name, split="train")

    # Model and tokenizer names
    base_model_name = "NousResearch/Llama-2-7b-chat-hf"
    refined_model = "llama2-finetuned-dataya-nolja"

    # Tokenizer
    llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
    llama_tokenizer.pad_token = llama_tokenizer.eos_token
    llama_tokenizer.padding_side = "right"  # Fix for fp16

    # Quantization Config (4-bit NF4 so the 7B model fits on a single GPU)
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False,
    )

    # Model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=quant_config,
        device_map={"": 0},
    )
    base_model.config.use_cache = False
    base_model.config.pretraining_tp = 1

  34. Fine-Tuning (w/ LoRA) #2
    from peft import LoraConfig
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # LoRA Config
    peft_parameters = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        ...
    )

    # Training Params
    train_params = TrainingArguments(
        output_dir="./results_modified",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        save_steps=25,
        ...
    )

    # Trainer
    fine_tuning = SFTTrainer(
        model=base_model,
        train_dataset=training_data,
        ...
    )

    # Training
    fine_tuning.train()
    fine_tuning.model.save_pretrained(refined_model, safe_serialization=True)
    llama_tokenizer.save_pretrained(refined_model)
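The "..." placeholders above are elided on the slides. The sketch below fills them in with commonly used values for Llama2 + LoRA and trl's SFTTrainer purely for illustration; the rank, learning rate, max_seq_length, and other settings are assumptions, not the configuration used in the talk.

    # Illustrative LoRA config; r, bias, and task_type are assumed values.
    peft_parameters = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # Illustrative training arguments; learning rate, logging, and precision are assumptions.
    train_params = TrainingArguments(
        output_dir="./results_modified",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        save_steps=25,
        logging_steps=25,
        learning_rate=2e-4,
        fp16=True,
    )

    # SFTTrainer wires the model, dataset, LoRA config, and tokenizer together.
    fine_tuning = SFTTrainer(
        model=base_model,
        train_dataset=training_data,
        peft_config=peft_parameters,
        dataset_text_field="text",   # the Gold table's single "text" column
        tokenizer=llama_tokenizer,
        args=train_params,
        max_seq_length=512,          # assumed cap on tokenized sequence length
    )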

  35. Inferencing
    from transformers import pipeline

    # Generate Text
    query = "I want to send money, how long will it take?"
    text_gen = pipeline(task="text-generation", model=fine_tuning.model,
                        tokenizer=llama_tokenizer, max_length=200)
    output = text_gen(f"[REQ]{query}[/REQ]")

  36. Output
    [{'generated_text': "[REQ]I want to send money, how long will it
    take?[/REQ] Hello there! Thank you for trusting the Bank of America with
    your financial needs. everybody here is happy to help you with your
    inquiry. To answer your question, the time it takes to send money through
    the Bank of America depends on the transfer method you choose.\nFor
    domestic transfers, typically the funds are available in the recipient's
    account within 1-2 business days. However, if you're sending money
    internationally, the time frame may vary depending on the country and the
    transfer method you choose.\nFor example, if you're using our Global
    Account Transfer service, the funds may take 2-3 business days to reach
    the recipient's account. Or, if you're using our Wire Transfer service,
    the funds can be available in the recipient's account within 24
    hours.\nPlease note that the"}]

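Because the Gold formatting wraps answers in [RES]…[/RES], the reply can be cut out of the generated text with plain string handling. A minimal sketch; it assumes the fine-tuned model may emit the markers and falls back to everything after [/REQ], which matches the sample output above.

    # Extract the model's reply from the generated text.
    generated = output[0]["generated_text"]
    if "[RES]" in generated:
        reply = generated.split("[RES]", 1)[1].split("[/RES]", 1)[0]
    else:
        reply = generated.split("[/REQ]", 1)[1]
    print(reply.strip())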

  37. And more…

  38. RAG (Retrieval Augmented Generation)
    Prompt → Vector Search → Context
    Prompt + Context → LLM
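A minimal sketch of that flow in Python. The embedding model, the vector index helper, and the generation model named here are illustrative assumptions, not the stack used in the talk; the point is only the order of operations: embed the prompt, retrieve context, then generate with prompt + context.

    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    # 1) Embed the user prompt (embedding model is an illustrative choice).
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    prompt = "How long does an international transfer take?"
    query_vector = embedder.encode(prompt)

    # 2) Vector search against a prebuilt index (`vector_index` is a hypothetical helper
    #    standing in for BigQuery ML.DISTANCE or Databricks Vector Search on the next slides).
    context_chunks = vector_index.search(query_vector, top_k=3)
    context = "\n".join(context_chunks)

    # 3) Generate with prompt + retrieved context.
    generator = pipeline(task="text-generation", model="NousResearch/Llama-2-7b-chat-hf")
    answer = generator(f"Context:\n{context}\n\nQuestion: {prompt}\nAnswer:", max_new_tokens=200)
    print(answer[0]["generated_text"])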

  39. Vector Store (BigQuery)
    WITH query_embedding AS (
      SELECT *
      FROM ML.GENERATE_TEXT_EMBEDDING(
        MODEL text.embedding_model,
        (SELECT "TREES DOWN NEAR THE INTERSECTION OF HIGHWAY" AS content))
    )
    SELECT
      q.content AS query_text,
      c.content AS candidate_text,
      ML.DISTANCE(q.text_embedding, c.text_embedding, 'COSINE') AS distance
    FROM
      query_embedding AS q,
      semantic_search_tutorial.candidate_embedding_cluster9 AS c
    ORDER BY distance ASC
    LIMIT 20;

  40. Vector Store (Databricks)
    import mlflow
    from mlflow import gateway

    gateway.set_gateway_uri(gateway_uri="databricks")

    mosaic_embeddings_route_name = "mosaicml-instructor-xl-embeddings"
    try:
        # Reuse the AI Gateway route if it already exists.
        route = gateway.get_route(mosaic_embeddings_route_name)
    except:
        print(f"Creating the route {mosaic_embeddings_route_name}")
        print(gateway.create_route(
            name=mosaic_embeddings_route_name,
            route_type="llm/v1/embeddings",
            model={
                "name": "instructor-xl",
                "provider": "mosaicml",
                "mosaicml_config": {
                    "mosaicml_api_key": dbutils.secrets.get(scope="dbdemos", key="mosaic_ml_api_key")
                },
            },
        ))

  41. Vector Store (Databricks) / #2
    source_table_fullname = f"{catalog}.{db}.databricks_documentation"
    vs_index_fullname = f"vs_catalog.{db}.databricks_documentation_index"

    if not index_exists(vs_index_fullname):
        print(f'Creating a vector store index `{vs_index_fullname}` against the table `{source_table_fullname}`, '
              f'using AI Gateway {mosaic_embeddings_route_name}')
        i = vsc.create_delta_sync_index(
            source_table_name=source_table_fullname,
            dest_index_name=vs_index_fullname,
            primary_key="id",
            column_to_embed="content",
            ai_gateway_route_name=mosaic_embeddings_route_name
        )
        sleep(3)
        spark.sql(f'ALTER SCHEMA vs_catalog.{db} OWNER TO `account users`')
        set_index_permission(f"vs_catalog.{db}.databricks_documentation_index", "ALL_PRIVILEGES", "account users")
        print(i)

  42. Document AI Scraping


  43. Spark NLP #1
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler, Finisher
    from sparknlp.annotator import Tokenizer, Normalizer, LemmatizerModel

    # Text preprocessing pipeline: raw text -> tokens -> normalized -> lemmatized -> plain token arrays.
    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("tokens")
    normalizer = Normalizer().setInputCols(["tokens"]).setOutputCol("normalized")
    lemmatizer = LemmatizerModel.pretrained().setInputCols(["normalized"]).setOutputCol("lemmatized")
    finisher = Finisher().setInputCols(["lemmatized"]).setOutputCols(["finished_tokens"])

    nlp_pipeline = Pipeline(stages=[
        document_assembler,
        tokenizer,
        normalizer,
        lemmatizer,
        finisher,
    ])

    nlp_model = nlp_pipeline.fit(df)
    processed_df = nlp_model.transform(df)

  44. Spark NLP #2
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer, StringIndexer
    from pyspark.ml.classification import LogisticRegression

    # Classification on top of the finished tokens: bag-of-words features + logistic regression.
    count_vectorizer = CountVectorizer(inputCol="finished_tokens", outputCol="features")
    label_indexer = StringIndexer(inputCol="label", outputCol="label_index")
    lr = LogisticRegression(featuresCol="features", labelCol="label_index")

    classification_pipeline = Pipeline(stages=[
        count_vectorizer,
        label_indexer,
        lr,
    ])

    train_df, test_df = processed_df.randomSplit([0.7, 0.3])
    classification_model = classification_pipeline.fit(train_df)
    predictions = classification_model.transform(test_df)
    predictions.select("text", "label", "prediction").show()
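A natural follow-up is to score the predictions. A minimal sketch with pyspark.ml's evaluator; the accuracy metric is an assumed choice, since the slides do not show how the model was evaluated.

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Score the held-out predictions; accuracy is just one possible metric.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label_index",
        predictionCol="prediction",
        metricName="accuracy",
    )
    accuracy = evaluator.evaluate(predictions)
    print(f"Test accuracy: {accuracy:.3f}")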