2.1: Key Concepts in GenAI

GenAI Learning: https://genai.gitpull.in/2-intro-genai/2.1-key-concept/index.html

Key Concepts in Generative AI

Concept: Definition
Large Language Models (LLMs): AI models trained on vast amounts of text data. They use the Transformer architecture, which relies on attention mechanisms to process input data. Examples: GPT (Generative Pre-trained Transformer), BERT, T5.
Tokenization: Breaking down text into smaller units (tokens) for processing. Example: The sentence “Hello, world!” might be tokenized into ["Hello", ",", "world", "!"].
Embeddings: Representing tokens as numerical vectors in a high-dimensional space. Embeddings capture semantic meaning (e.g., “king” - “man” + “woman” ≈ “queen”).
Self-Attention/Attention Mechanism: Mechanism that lets the model weigh how relevant every other word is while it processes a given word.
Transformers: The deep learning architecture used in LLMs. Transformers are the backbone of most modern generative models. Key components: Encoder, Decoder, and Attention Mechanism.
Pre-training: Training a model on a large dataset (e.g., all of Wikipedia) to learn general language patterns.
Fine-tuning: Adapting the pre-trained model to a specific task (e.g., sentiment analysis, chatbot).
Prompt Engineering: Designing effective inputs to guide model responses.
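The “Hello, world!” tokenization shown in the overview can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes a simple regular-expression rule (words and single punctuation marks), whereas real LLM tokenizers use trained subword vocabularies. The function name simple_tokenize and the toy vocabulary are hypothetical, introduced here for illustration.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens (illustrative rule only)."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']

# Models consume integer IDs, not strings, so each token is looked up in a vocabulary.
# This toy vocabulary is built from the sentence itself (an assumption for the demo).
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [2, 1, 3, 0]
```

The second half shows why tokenization matters for the network: the list of strings becomes a list of integers, which is the form an embedding layer can look up.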
2.1.1: Tokenization
https://genai.gitpull.in/2-intro-genai/2.1-key-concept/1-tokenization/index.html

Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model “understand” the text by converting it into a format that can be fed into the neural network.

Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].
Subword-level tokenization (used in models like GPT) splits words into smaller parts, or subwords. This keeps the vocabulary compact and handles unknown words better, because a rare word can be built from pieces the model already knows. Example: “delightful” → [“delight”, “ful”].
Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].

Hands-On: Tokenization
Tokenization Example

2.1.2: Embeddings
https://genai.gitpull.in/2-intro-genai/2.1-key-concept/2-embeddings/index.html

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?
Words with similar meanings have similar embeddings.
Embeddings enable models to generalize and understand context.
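One common way to make “similar embeddings” concrete is cosine similarity, which compares the direction of two vectors. The sketch below is a hedged illustration: the three-dimensional vectors are made up by hand for the demo (real embeddings are learned and have hundreds of dimensions), and the names toy_embeddings and cosine_similarity are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (an assumption for illustration, not learned embeddings).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs dog: {sim_cat_dog:.3f}")  # higher: the vectors point in a similar direction
print(f"cat vs car: {sim_cat_car:.3f}")  # lower: the vectors point in different directions
```

With these toy values the related pair ("cat", "dog") scores higher than the unrelated pair ("cat", "car"), which is the pattern learned embeddings are expected to show.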
Example: The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship: king - man + woman ≈ queen

Hands-On: Embeddings

    from transformers import AutoTokenizer, AutoModel
    import torch

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    # Tokenize input text and return PyTorch tensors
    text = "Hello, how are you?"
    inputs = tokenizer(text, return_tensors="pt")

    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state holds one contextual vector per token:
    # shape is (batch_size, num_tokens, hidden_size)
    embeddings = outputs.last_hidden_state
    print("Embeddings shape:", embeddings.shape)

    # Index [0, 0] selects the first sequence and its first token ("Hello")
    print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

2.1.3: Attention Mechanism
https://genai.gitpull.in/2-intro-genai/2.1-key-concept/3-attention-mechanism/index.html

Self-Attention: The Core of Transformers

Imagine you’re reading a sentence: "The cat sat on the mat." Each word is important, but some words are more closely related to each other than others.
“cat” is related to “sat.”
“mat” is related to “sat.”
“on” is less important.
Self-attention helps the model decide which words to focus on!

How Self-Attention Works (Step-by-Step)
Self-attention is done in 4 steps:
1. Convert Words into Vectors (Embeddings)
Computers don’t understand words, so we convert them into numbers (word embeddings).

2.1.4: Transformers
https://genai.gitpull.in/2-intro-genai/2.1-key-concept/4-transformers/index.html

Transformer Architecture

The Transformer model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:

Key Components of the Transformer
Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word.
For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can use the surrounding context to determine whether “bank” refers to a financial institution or the side of a river.
Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words.
Positional Encoding: Since transformers don’t inherently understand the order of words (unlike sequential models such as RNNs), positional encoding is added to provide information about the position of each word in a sentence.
Encoder-Decoder Architecture:
Encoder: Processes the input data (e.g., a sentence).
Decoder: Generates the output data (e.g., a translation of the sentence).
Both the Transformer Encoder and Decoder consist of:

2.1.5: Recap
https://genai.gitpull.in/2-intro-genai/2.1-key-concept/5-recap/index.html

Summary: How Generative AI Processes Text

When you enter a prompt like:
➡️ "Explain black holes"
a Generative AI model follows these steps:

Step 1: Tokenization
Breaks text into smaller parts (tokens). Example: "Explain black holes" → ["explain", "black", "holes"]
Step 2: Embeddings
Each token is converted into a numerical vector for processing.
Step 3: Transformer Model Processing
Self-attention determines which words matter the most. Multiple layers refine understanding.
Step 4: Text Generation