Document Chunking
- Splitting documents to accommodate the LLM context window limit.
- Helps with ranking individual sections of documents.
- Each vector can embed only a limited amount of data per model.
- Packing a long passage covering multiple topics into a single vector can cause important nuance to get lost.
- Overlapping text between chunks can help preserve context across chunk boundaries.
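The bullets above can be sketched as a minimal fixed-size chunker with overlap; the chunk size and overlap values here are illustrative, not recommendations.

```python
# Minimal sketch of fixed-size chunking with overlap (sizes are illustrative).
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters,
    with each chunk overlapping the previous one by `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 100  # stand-in for a long document
chunks = chunk_text(doc.strip(), chunk_size=120, overlap=30)
print(len(chunks), len(chunks[0]))
```

Real pipelines usually split on token counts and sentence boundaries rather than raw characters, but the overlap idea is the same.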
Tokens and Tokenization
- A typical BPE tokenizer has a vocabulary of roughly 50K tokens.
- Example: a 60-character sentence becomes just 13 tokens, e.g. [464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13].
- Token counts do not track character counts (e.g., 76 chars may yield 17 tokens while 55 chars yields 24 tokens).
- Each token maps to an N-dimensional embedding vector (e.g., [0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422]), a continuous-space representation we can use as model input.
- Embeddings for similar concepts will be close to each other in N-dimensional space (e.g., the vectors for "dog" and "hound" will have a cosine similarity closer to 1 than "dog" and "chair").
- Less common words will tend to split into multiple tokens.
- There's a bias towards English in the BPE training corpus.
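The dog/hound/chair claim can be illustrated with cosine similarity; the 4-dimensional "embeddings" below are made-up toy vectors (real models use hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings", invented for illustration only.
dog   = [0.90, 0.10, 0.05, 0.30]
hound = [0.85, 0.15, 0.10, 0.25]
chair = [0.05, 0.80, 0.70, 0.02]

print(cosine_similarity(dog, hound))  # close to 1
print(cosine_similarity(dog, chair))  # much lower
```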
Self-Attention (Transformer Model)
Intuition:
• Each self-attention "head" learns relationships between a token and all other tokens in the context.
• Multiple heads in a layer focus on learning different relationships, including grammatical and semantic ones.
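The intuition above corresponds to scaled dot-product attention, sketched here for a single head in plain Python; the 2-dimensional token vectors are toy values, and real implementations also apply learned projection matrices for Q, K, and V.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V: lists of vectors (one per token), all of dimension d."""
    d = len(K[0])
    out = []
    for q in Q:
        # Score this token against every token in the context.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output is a weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three toy tokens with 2-dimensional vectors (illustrative numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(X, X, X))
```

Each output row mixes information from every token, weighted by how strongly the query attends to it.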
Fine-Tuning vs Embedding
GPT can learn knowledge in two ways:
• Via model weights (i.e., fine-tune the model on a training set): good for teaching specialized tasks, less reliable for factual recall. Fine-tuning is not base training; the knowledge diffuses into the weights like salt dissolved in water.
• Via model inputs (i.e., insert the knowledge into an input message): acts like short-term memory, bound by token limits.
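The second approach (knowledge via model inputs) can be sketched as prompt construction; the function name, prompt wording, and character budget below are illustrative assumptions, with characters standing in for the model's token limit.

```python
# Sketch of "knowledge via model inputs": retrieved text is pasted into the
# prompt. build_prompt and its wording are hypothetical, not a real API.
def build_prompt(question, retrieved_passages, max_chars=2000):
    """Concatenate retrieved passages into the prompt, respecting a crude
    character budget (a stand-in for the model's token limit)."""
    context = ""
    for passage in retrieved_passages:
        if len(context) + len(passage) > max_chars:
            break  # short-term memory is bounded: drop what no longer fits
        context += passage + "\n"
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\nQuestion: {question}")

prompt = build_prompt("When was the bridge built?",
                      ["The bridge opened in 1932.", "It spans the harbour."])
print(prompt)
```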
Fine-Tuning
• A type of "transfer learning".
• It's about teaching a new task, not new information or knowledge.
• It is not a reliable way to store knowledge as part of the model.
• Fine-tuning does not eliminate hallucination (confabulation).
• Slow, difficult, and expensive.
• Fine-tuning is roughly 1000x more difficult than prompt engineering.