A little bit about me
• 8 years of working experience as a research engineer at machine learning startup companies based in Myanmar and Thailand
• Conducting research in Natural Language Processing, Sentiment Analysis, Structure from Motion, and Visual Localization
• Published 4 international conference papers; one manuscript accepted for publication in a Scopus Q2 journal
• Thailand Education Hub for ASEAN Countries Scholarship, 2018
• Google Women Techmakers Scholarship APAC, 2018
• Kaggle BIPOC Grantee, 2021
through interaction - not only with their parents and other adults, but also with other children.
• All normal children who grow up in normal households, surrounded by conversation, will acquire the language that is being used around them. (Linguistic Society of America)

Knowledge Transfer
• knowledge about language (e.g. linguistic representations, structural similarities)
• Tasks can inform each other, e.g. syntax and semantics
• Annotated data is rare; make use of as much supervision as available
• Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, Q&A, etc.)
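As a quick, hedged illustration of this kind of knowledge transfer with the KerasNLP library used later in these slides: a classifier built on a pre-trained backbone can be fine-tuned with only a handful of labels. The preset name and the toy movie-review examples below are illustrative assumptions, not part of the original slides.

import keras_nlp

# A classifier whose backbone was pre-trained on large amounts of unlabeled text.
# ("bert_tiny_en_uncased" is just an illustrative preset choice.)
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=2,
)

# Fine-tune on a tiny labeled dataset: the linguistic knowledge captured during
# pre-training transfers, so far fewer labeled examples are needed than when
# training from scratch.
features = ["The movie was great!", "Terrible plot and acting."]
labels = [1, 0]
classifier.fit(x=features, y=labels, batch_size=2, epochs=1)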
based on patterns found in massive amounts of training data. They are used in applications such as language translation, chatbots, and content creation.

What is an LLM?
❏ A language model (LM) learns P_θ(text) or P_θ(text | some other text).
❏ Mathematically, a language model is a very simple and beautiful object.
❏ The ability to assign (meaningful) probabilities to all sequences requires extraordinary (but implicit) linguistic abilities and world knowledge.
❏ The probability intuitively tells us how "good" a sequence of tokens is. For example, if the vocabulary is V = {ate, ball, cheese, mouse, the}:
   • p(the, mouse, ate, the, cheese) = 0.02
   • p(the, cheese, ate, the, mouse) = 0.01
   • p(mouse, the, the, cheese, ate) = 0.0001
(https://stanford-cs324.github.io/winter2022)
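The chain rule view makes the "probability of a sequence" idea concrete: a language model factorizes p(x_1, ..., x_n) into per-token conditionals. The bigram probabilities in this sketch are made-up toy numbers, only meant to show that a fluent sequence scores higher than a scrambled one.

import math

# Made-up conditional probabilities p(current | previous) for a toy bigram LM.
toy_bigram = {
    ("the", "mouse"): 0.20, ("mouse", "ate"): 0.30, ("ate", "the"): 0.40,
    ("the", "cheese"): 0.25, ("cheese", "ate"): 0.05, ("the", "the"): 0.01,
}

def sequence_log_prob(tokens, start_prob=0.5, unseen_prob=1e-4):
    # Chain rule: log p(x1..xn) = log p(x1) + sum_i log p(x_i | x_{i-1}).
    log_p = math.log(start_prob)  # assumed probability of the first token
    for prev, curr in zip(tokens, tokens[1:]):
        log_p += math.log(toy_bigram.get((prev, curr), unseen_prob))
    return log_p

print(sequence_log_prob(["the", "mouse", "ate", "the", "cheese"]))  # higher score
print(sequence_log_prob(["mouse", "the", "the", "cheese", "ate"]))  # much lower score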
involves feeding the model massive amounts of text data, such as books, articles, and websites, and having the model learn the patterns and relationships between words and phrases in the text. The model is then fine-tuned on a specific task, such as language translation or text summarization.

How are LLMs trained?
❏ Doesn't require human annotation
❏ Many languages have enough text to learn a high-capacity model
❏ Versatile: can learn both sentence and word representations with a variety of objective functions
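One way to see why no human annotation is needed: in causal language-model pre-training, the training labels are simply the input text shifted by one token, so raw text supervises itself. The token ids below are arbitrary made-up values for illustration.

import tensorflow as tf

# One tokenized sentence (made-up token ids).
token_ids = tf.constant([[21, 904, 87, 341, 17, 2]])

inputs = token_ids[:, :-1]   # the model sees tokens x_1 ... x_{n-1}
labels = token_ids[:, 1:]    # and is trained to predict x_2 ... x_n

print(inputs.numpy())  # [[ 21 904  87 341  17]]
print(labels.numpy())  # [[904  87 341  17   2]]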
LLM Components
An LLM is typically built from an encoder, a decoder, and attention mechanisms. The encoder takes in input text and converts it into a set of hidden representations, while the decoder generates the output text. The attention mechanisms help the model focus on the most relevant parts of the input text.
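Since the slide mentions attention, here is a minimal sketch of scaled dot-product attention, the core operation behind those attention mechanisms. The shapes and random inputs are illustrative only; real LLMs use multi-head attention with learned projections.

import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)           # how relevant each key is to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ values                            # weighted sum of the values

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query positions, feature dim 8
k = rng.normal(size=(6, 8))   # 6 key/value positions
v = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 8)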
Word Embedding
Each token is replaced by a vector that represents its meaning in a continuous vector space. Common methods for word embeddings include Word2Vec, GloVe, and fastText.

Subword Embedding
Each token is broken down into smaller subword units (e.g., characters or character n-grams), and each subword is replaced by a vector that represents its meaning. This approach can handle out-of-vocabulary (OOV) words and can improve the model's ability to capture morphological and semantic similarities. Common methods for subword embeddings include Byte Pair Encoding (BPE), Unigram Language Model (ULM), and SentencePiece.

Positional Encoding
Since LLMs operate on sequences of tokens, they need a way to encode the position of each token in the sequence. Positional encodings are vectors that are added to the word or subword embeddings to provide information about the position of each token.
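A short sketch of how token and positional embeddings are combined in practice, using KerasNLP's TokenAndPositionEmbedding layer; the vocabulary size, sequence length, and embedding dimension below are arbitrary illustrative values.

import keras_nlp
import tensorflow as tf

embedding = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=30_000,   # size of the (sub)word vocabulary
    sequence_length=128,      # maximum number of token positions
    embedding_dim=64,         # dimensionality of each token vector
)

token_ids = tf.zeros((2, 128), dtype="int32")   # a dummy batch of token ids
vectors = embedding(token_ids)                  # token embedding + position embedding
print(vectors.shape)                            # (2, 128, 64)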
Applications of LLMs
• LLMs are used in applications such as language translation, chatbots, content creation, and text summarization.
• They can also be used to improve search engines, voice assistants, and virtual assistants.
PaLM 2 (Pathways Language Model) - Google
• Multilinguality: PaLM 2 is trained on multilingual text, spanning more than 100 languages. This has significantly improved its ability to understand, generate and translate nuanced text, including idioms, poems and riddles, across a wide variety of languages, a hard problem to solve. PaLM 2 also passes advanced language proficiency exams at the "mastery" level.
• Reasoning: PaLM 2's wide-ranging dataset includes scientific papers and web pages that contain mathematical expressions. As a result, it demonstrates improved capabilities in logic, common sense reasoning, and mathematics.
• Coding: PaLM 2 was pre-trained on a large quantity of publicly available source code datasets. This means that it excels at popular programming languages like Python and JavaScript, but can also generate specialized code in languages like Prolog, Fortran and Verilog.
KerasNLP API
Tokenizer
• What it does: Converts strings to sequences of token ids.
Preprocessor
• What it does: Converts strings to a dictionary of preprocessed tensors consumed by the backbone, starting with tokenization.
Backbone
• What it does: Converts preprocessed tensors to dense features. Does not handle strings; call the preprocessor first.
Task
• What it does: Converts strings to task-specific output (e.g., classification probabilities).
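A hedged end-to-end sketch of how these four components relate in code, using the BERT family with the "bert_tiny_en_uncased" preset as an illustrative choice (downloading a preset requires network access).

import keras_nlp

# Tokenizer: strings -> sequences of token ids.
tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_tiny_en_uncased")
token_ids = tokenizer(["The quick brown fox."])

# Preprocessor: strings -> dictionary of tensors ready for the backbone.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_tiny_en_uncased")
model_inputs = preprocessor(["The quick brown fox."])

# Backbone: preprocessed tensors -> dense features (no raw strings here).
backbone = keras_nlp.models.BertBackbone.from_preset("bert_tiny_en_uncased")
features = backbone(model_inputs)

# Task: strings -> task-specific output (here, classification logits).
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=2
)
logits = classifier.predict(["The quick brown fox."])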
import keras_nlp
import tensorflow as tf

# SEQ_LENGTH, MASK_RATE, PREDICTIONS_PER_SEQ, vocab_file, wiki_train_ds, and
# wiki_val_ds are assumed to be defined on earlier slides.

# Setting sequence_length will trim or pad the token outputs to shape
# (batch_size, SEQ_LENGTH).
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab_file,
    sequence_length=SEQ_LENGTH,
    lowercase=True,
    strip_accents=True,
)

# Setting mask_selection_length will trim or pad the mask outputs to shape
# (batch_size, PREDICTIONS_PER_SEQ).
masker = keras_nlp.layers.MaskedLMMaskGenerator(
    vocabulary_size=tokenizer.vocabulary_size(),
    mask_selection_rate=MASK_RATE,
    mask_selection_length=PREDICTIONS_PER_SEQ,
    mask_token_id=tokenizer.token_to_id("[MASK]"),
)


def preprocess(inputs):
    inputs = tokenizer(inputs)
    outputs = masker(inputs)
    # Split the masking layer outputs into a (features, labels, and weights)
    # tuple that we can use with keras.Model.fit().
    features = {
        "token_ids": outputs["token_ids"],
        "mask_positions": outputs["mask_positions"],
    }
    labels = outputs["mask_ids"]
    weights = outputs["mask_weights"]
    return features, labels, weights


# We use prefetch() to pre-compute preprocessed batches on the fly on the CPU.
pretrain_ds = wiki_train_ds.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
pretrain_val_ds = wiki_val_ds.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
http://ruder.io/transfer-learning/, 2017.
Abonia S., Ashish P. 2023. "Large Language Model (LLM) CheatSheet". Medium/GitHub.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155.
Stanford CS324: Large Language Models course. https://stanford-cs324.github.io/winter2022/lectures/introduction/
Matthew Watson, Chen Qian, Jonathan Bischof, François Chollet, and others. 2022. KerasNLP. https://keras.io/keras_nlp/