Applied NLP with Keras

This talk covers building NLP applications using the KerasNLP library.

Aye Hninn Khine

July 30, 2023

Transcript

  1. Applied Natural Language Processing with Keras
    Presented by: Aye Hninn Khine
    Ph.D. Candidate (Computer Science)
    Associate Research Manager (Computer Vision Lab)
    [email protected]
  2. A little bit about me
    • Ph.D. Candidate in the Department of Computational Science, PSU
    • Over 8 years of working experience as a research engineer in machine learning startup companies based in Myanmar and Thailand
    • Conducting research in Natural Language Processing, Sentiment Analysis, Structure from Motion, and Visual Localization
    • Published 4 international conference papers; one manuscript accepted for publication in a Scopus Q2 journal
    • Thailand Education Hub for ASEAN Countries Scholarship, 2018
    • Google Women Techmakers Scholarship APAC, 2018
    • Kaggle BIPOC Grantee, 2021
  3. How do humans learn language?
    • Children acquire language through interaction - not only with their parents and other adults, but also with other children.
    • All normal children who grow up in normal households, surrounded by conversation, will acquire the language that is being used around them. (Linguistic Society of America)
    Knowledge Transfer
  4. Transfer Learning in NLP
    • Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities).
    • Tasks can inform each other, e.g. syntax and semantics.
    • Annotated data is rare, so make use of as much supervision as is available.
    • Empirically, transfer learning has produced SOTA results for many supervised NLP tasks (e.g. classification, information extraction, Q&A, etc.).
  5. What is an LLM?
    LLMs are artificial intelligence models that can generate human-like text, based on patterns found in massive amounts of training data. They are used in applications such as language translation, chatbots, and content creation.
    ❏ A language model learns P_θ(text) or P_θ(text | some other text).
    Mathematically, a language model is a very simple and beautiful object. The ability to assign (meaningful) probabilities to all sequences requires extraordinary (but implicit) linguistic abilities and world knowledge.
    The probability intuitively tells us how “good” a sequence of tokens is. For example, if the vocabulary is V = {ate, ball, cheese, mouse, the}:
    • p(the, mouse, ate, the, cheese) = 0.02
    • p(the, cheese, ate, the, mouse) = 0.01
    • p(mouse, the, the, cheese, ate) = 0.0001
    https://stanford-cs324.github.io/winter2022
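    To make those probabilities concrete, a language model factorizes a sequence probability with the chain rule. A minimal sketch (the conditional probabilities below are invented purely for illustration, not from any trained model):

        # p(the, mouse, ate, the, cheese)
        #   = p(the) * p(mouse | the) * p(ate | the, mouse)
        #     * p(the | the, mouse, ate) * p(cheese | the, mouse, ate, the)
        # Invented conditional probabilities, for illustration only:
        conditionals = [0.4, 0.1, 0.5, 0.5, 0.5]
        p_sequence = 1.0
        for p in conditionals:
            p_sequence *= p
        print(p_sequence)  # 0.005: how "good" this toy model thinks the sequence is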
  6. How are LLMs trained?
    LLMs are trained using a process called unsupervised learning. This involves feeding the model massive amounts of text data, such as books, articles, and websites, and having the model learn the patterns and relationships between words and phrases in the text. The model is then fine-tuned on a specific task, such as language translation or text summarization.
    ❏ Doesn't require human annotation
    ❏ Many languages have enough text to learn a high-capacity model
    ❏ Versatile: can learn both sentence and word representations with a variety of objective functions
  7. Major themes: from shallow to deep
    Bengio et al. 2003, "A Neural Probabilistic Language Model": 1 layer
    Devlin et al. 2019, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding": 24 layers
  8. LLM Components
    LLMs typically consist of an encoder, a decoder, and attention mechanisms. The encoder takes in input text and converts it into a set of hidden representations, while the decoder generates the output text. The attention mechanisms help the model focus on the most relevant parts of the input text.
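    In KerasNLP these components are available as ready-made layers. A minimal sketch wiring an encoder and a decoder together (the layer sizes and input shapes here are arbitrary, chosen only for illustration):

        import keras_nlp
        from tensorflow import keras

        # Symbolic inputs standing in for already-embedded input and output tokens.
        encoder_embeddings = keras.Input(shape=(None, 64))
        decoder_embeddings = keras.Input(shape=(None, 64))

        # Encoder: self-attention turns the input into contextual hidden representations.
        hidden = keras_nlp.layers.TransformerEncoder(
            num_heads=4, intermediate_dim=256,
        )(encoder_embeddings)

        # Decoder: produces output representations while cross-attending to the encoder output.
        outputs = keras_nlp.layers.TransformerDecoder(
            num_heads=4, intermediate_dim=256,
        )(decoder_embeddings, hidden)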
  9. Word Embedding: Each token is replaced by a vector that represents its meaning in a continuous vector space. Common methods for word embeddings include Word2Vec, GloVe, and fastText.
    Subword Embedding: Each token is broken down into smaller subword units (e.g., characters or character n-grams), and each subword is replaced by a vector that represents its meaning. This approach can handle out-of-vocabulary (OOV) words and can improve the model's ability to capture morphological and semantic similarities. Common methods for subword embeddings include Byte Pair Encoding (BPE), Unigram Language Model (ULM), and SentencePiece.
    Positional Encoding: Since LLMs operate on sequences of tokens, they need a way to encode the position of each token in the sequence. Positional encodings are vectors that are added to the word or subword embeddings to provide information about the position of each token.
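    A small KerasNLP sketch of subword tokenization plus a combined token-and-position embedding (the tiny vocabulary and dimensions below are made up for illustration):

        import keras_nlp

        # Hand-written toy WordPiece vocabulary; "cheese" only exists as subwords.
        toy_vocab = ["[UNK]", "the", "mouse", "ate", "che", "##ese"]
        toy_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
            vocabulary=toy_vocab, sequence_length=8,
        )
        token_ids = toy_tokenizer(["the mouse ate cheese"])  # "cheese" -> "che" + "##ese"

        # One layer that adds positional information to the token embeddings.
        embedding = keras_nlp.layers.TokenAndPositionEmbedding(
            vocabulary_size=len(toy_vocab), sequence_length=8, embedding_dim=16,
        )
        vectors = embedding(token_ids)  # shape (1, 8, 16)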
  10. Choose Between LLMs
    When comparing different models, it's important to consider their architecture, the size of the model, the amount of training data used, and their performance on specific NLP tasks.
  11. Applications of LLMs
    • LLMs are used in a wide range of applications, including language translation, chatbots, content creation, and text summarization.
    • They can also be used to improve search engines, voice assistants, and virtual assistants.
  12. PaLM 2 (Pathways Language Model) - Google
    • Multilinguality: PaLM 2 is more heavily trained on multilingual text, spanning more than 100 languages. This has significantly improved its ability to understand, generate and translate nuanced text, including idioms, poems and riddles, across a wide variety of languages, a hard problem to solve. PaLM 2 also passes advanced language proficiency exams at the “mastery” level.
    • Reasoning: PaLM 2's wide-ranging dataset includes scientific papers and web pages that contain mathematical expressions. As a result, it demonstrates improved capabilities in logic, common sense reasoning, and mathematics.
    • Coding: PaLM 2 was pre-trained on a large quantity of publicly available source code datasets. This means that it excels at popular programming languages like Python and JavaScript, but can also generate specialized code in languages like Prolog, Fortran and Verilog.
  13. Installation
    !pip install -q --upgrade keras-nlp tensorflow

    import keras_nlp
    import tensorflow as tf
    from tensorflow import keras

    # Use mixed precision for optimal performance
    keras.mixed_precision.set_global_policy("mixed_float16")
  14. KerasNLP API
    • Tokenizer. What it does: Converts strings to tf.RaggedTensors of token ids.
    • Preprocessor. What it does: Converts strings to a dictionary of preprocessed tensors consumed by the backbone, starting with tokenization.
    • Backbone. What it does: Converts preprocessed tensors to dense features. Does not handle strings; call the preprocessor first.
    • Task. What it does: Converts strings to task-specific output (e.g., classification probabilities).
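    For example, the same BERT family of presets can be loaded at each of these levels. A minimal sketch (assuming the bert_tiny_en_uncased presets are available in the installed keras_nlp version):

        # Strings -> token ids
        tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_tiny_en_uncased")
        token_ids = tokenizer(["modular workflows"])

        # Strings -> dict of preprocessed tensors consumed by the backbone
        preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_tiny_en_uncased")
        tensors = preprocessor(["modular workflows"])

        # Preprocessed tensors -> dense features (no strings at this level)
        backbone = keras_nlp.models.BertBackbone.from_preset("bert_tiny_en_uncased")
        features = backbone(tensors)

        # Strings -> task output (classification logits), as on the next slides
        classifier = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")
        logits = classifier.predict(["modular workflows"])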
  15. IMDB Data
    !curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xf aclImdb_v1.tar.gz
    !# Remove unsupervised examples
    !rm -r aclImdb/train/unsup

    BATCH_SIZE = 16
    imdb_train = tf.keras.utils.text_dataset_from_directory(
        "aclImdb/train", batch_size=BATCH_SIZE,
    )
    imdb_test = tf.keras.utils.text_dataset_from_directory(
        "aclImdb/test", batch_size=BATCH_SIZE,
    )
  16. Inference with a pre-trained classifier
    classifier = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")
    # Note: batched inputs expected, so wrap the string in an iterable
    classifier.predict(["I love modular workflows in keras-nlp!"])
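    The preset returns per-class logits (shape (1, 2) here); if probabilities are wanted, a softmax can be applied, for example:

        logits = classifier.predict(["I love modular workflows in keras-nlp!"])
        probs = tf.nn.softmax(logits)  # two class probabilities (negative/positive for SST-2)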
  17. Train Custom Vocabulary from IMDB Data
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        imdb_train.map(lambda x, y: x),
        vocabulary_size=20000,
        lowercase=True,
        strip_accents=True,
        reserved_tokens=["[PAD]", "[START]", "[END]", "[MASK]", "[UNK]"],
    )
    tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
        vocabulary=vocab,
        lowercase=True,
        strip_accents=True,
        oov_token="[UNK]",
    )
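    Once the vocabulary is trained, the tokenizer can be tried on a sample review (the example string is arbitrary):

        sample_ids = tokenizer("An absolutely wonderful movie!")
        print(sample_ids)                        # WordPiece token ids
        print(tokenizer.detokenize(sample_ids))  # roundtrip back to (lowercased) text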
  18. Preprocess data with a custom tokenizer
    packer = keras_nlp.layers.StartEndPacker(
        start_value=tokenizer.token_to_id("[START]"),
        end_value=tokenizer.token_to_id("[END]"),
        pad_value=tokenizer.token_to_id("[PAD]"),
        sequence_length=512,
    )

    def preprocess(x, y):
        token_ids = packer(tokenizer(x))
        return token_ids, y

    imdb_preproc_train_ds = imdb_train.map(
        preprocess, num_parallel_calls=tf.data.AUTOTUNE
    ).prefetch(tf.data.AUTOTUNE)
    imdb_preproc_val_ds = imdb_test.map(
        preprocess, num_parallel_calls=tf.data.AUTOTUNE
    ).prefetch(tf.data.AUTOTUNE)
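    A quick sanity check is to pull one preprocessed batch and look at the shapes, for example:

        for token_ids, labels in imdb_preproc_train_ds.take(1):
            print(token_ids.shape)  # (16, 512): BATCH_SIZE x packer sequence_length
            print(labels.shape)     # (16,): one integer label per review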
  19. Design a tiny transformer
    token_id_input = keras.Input(
        shape=(None,), dtype="int32", name="token_ids",
    )
    outputs = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=len(vocab),
        sequence_length=packer.sequence_length,
        embedding_dim=64,
    )(token_id_input)
    outputs = keras_nlp.layers.TransformerEncoder(
        num_heads=2,
        intermediate_dim=128,
        dropout=0.1,
    )(outputs)
    # Use the "[START]" token position to classify
    outputs = keras.layers.Dense(2)(outputs[:, 0, :])
    model = keras.Model(
        inputs=token_id_input,
        outputs=outputs,
    )
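    Before compiling, a quick summary confirms the layers and output shapes are as intended:

        model.summary()  # embedding, transformer encoder, and the final 2-unit dense layer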
  20. Train the transformer directly on the classification objective
    model.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=keras.optimizers.experimental.AdamW(5e-5),
        metrics=keras.metrics.SparseCategoricalAccuracy(),
        jit_compile=True,
    )
    model.fit(
        imdb_preproc_train_ds,
        validation_data=imdb_preproc_val_ds,
        epochs=3,
    )
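    After fitting, the held-out IMDB split can be scored directly, for example:

        loss, accuracy = model.evaluate(imdb_preproc_val_ds)
        print(f"validation accuracy: {accuracy:.3f}")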
  21. Loading Wikipedia Data
    # Load wikitext-103 and filter out short lines.
    wiki_train_ds = (
        tf.data.TextLineDataset(wiki_dir + "wiki.train.raw")
        .filter(lambda x: tf.strings.length(x) > 100)
        .batch(PRETRAINING_BATCH_SIZE)
    )
    wiki_val_ds = (
        tf.data.TextLineDataset(wiki_dir + "wiki.valid.raw")
        .filter(lambda x: tf.strings.length(x) > 100)
        .batch(PRETRAINING_BATCH_SIZE)
    )
  22. # Setting sequence_length will trim or pad the token outputs to shape
    # (batch_size, SEQ_LENGTH).
    tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
        vocabulary=vocab_file,
        sequence_length=SEQ_LENGTH,
        lowercase=True,
        strip_accents=True,
    )
    # Setting mask_selection_length will trim or pad the mask outputs to shape
    # (batch_size, PREDICTIONS_PER_SEQ).
    masker = keras_nlp.layers.MaskedLMMaskGenerator(
        vocabulary_size=tokenizer.vocabulary_size(),
        mask_selection_rate=MASK_RATE,
        mask_selection_length=PREDICTIONS_PER_SEQ,
        mask_token_id=tokenizer.token_to_id("[MASK]"),
    )

    def preprocess(inputs):
        inputs = tokenizer(inputs)
        outputs = masker(inputs)
        # Split the masking layer outputs into a (features, labels, weights)
        # tuple that we can use with keras.Model.fit().
        features = {
            "token_ids": outputs["token_ids"],
            "mask_positions": outputs["mask_positions"],
        }
        labels = outputs["mask_ids"]
        weights = outputs["mask_weights"]
        return features, labels, weights

    # We use prefetch() to pre-compute preprocessed batches on the fly on the CPU.
    pretrain_ds = wiki_train_ds.map(
        preprocess, num_parallel_calls=tf.data.AUTOTUNE
    ).prefetch(tf.data.AUTOTUNE)
    pretrain_val_ds = wiki_val_ds.map(
        preprocess, num_parallel_calls=tf.data.AUTOTUNE
    ).prefetch(tf.data.AUTOTUNE)
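    To see what the masking pipeline produces, one pretraining batch can be inspected, for example:

        for features, labels, weights in pretrain_ds.take(1):
            print(features["token_ids"].shape)       # (batch, SEQ_LENGTH)
            print(features["mask_positions"].shape)  # (batch, PREDICTIONS_PER_SEQ)
            print(labels.shape)                      # (batch, PREDICTIONS_PER_SEQ): ids of the masked tokens
            print(weights.shape)                     # (batch, PREDICTIONS_PER_SEQ): 0 where a mask slot is padding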
  23. Ethics (CS 324 LLM Course)
    • Safety and Harms: Social Biases and Stereotypes, Toxicity, Disinformation
    • Legality: Data Protection Laws, Copyright Law, Fair Use
    • Environmental Impact: CO2 Emission
  24. References
    • Sebastian Ruder. "Transfer Learning - Machine Learning's Next Frontier". http://ruder.io/transfer-learning/, 2017.
    • Abonia S., Ashish P. "Large Language Model (LLM) CheatSheet". Medium/GitHub, 2023.
    • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
    • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
    • Stanford Large Language Model Course (CS 324): https://stanford-cs324.github.io/winter2022/lectures/introduction/
    • Watson, Matthew, Qian, Chen, Bischof, Jonathan, Chollet, François, and others. "KerasNLP". 2022. https://keras.io/keras_nlp/