2.1.1: Tokenization | GenAI Learning

https://genai.gitpull.in/2-intro-genai/2.1-key-concept/1-tokenization/index.html

Tokenization

Tokenization is the process of converting text into smaller units (typically words, subwords, or characters) that machine learning models can process. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model "understand" text by converting it into a sequence of discrete units that can be fed into the neural network.

Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].

Subword-level tokenization (used in models like GPT) splits words into smaller parts, or subwords. This keeps the vocabulary compact and handles unknown words better, because a rare or unseen word can still be represented as a sequence of known pieces. Example: "delightful" → ["delight", "ful"].

Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].

Hands-On: Tokenization

Tokenization Example
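
The three strategies described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how production tokenizers work: the function names and the `TOY_VOCAB` set are hypothetical, and the subword splitter uses a simple greedy longest-match against a hand-picked vocabulary rather than the learned byte-pair encoding that GPT models rely on.

```python
def word_tokenize(text):
    """Word-level: split the text on whitespace."""
    return text.split()


def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)


# Hypothetical toy vocabulary, hand-picked only for this illustration.
# Real subword vocabularies are learned from a large corpus.
TOY_VOCAB = {"delight", "ful", "love", "AI", "I"}


def subword_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of one word into vocabulary pieces.

    Falls back to a single character when no vocabulary piece matches,
    so any input word can still be tokenized.
    """
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


print(word_tokenize("I love AI"))      # ['I', 'love', 'AI']
print(subword_tokenize("delightful"))  # ['delight', 'ful']
print(char_tokenize("AI"))             # ['A', 'I']
```

With this toy vocabulary, a word that is not in the vocabulary as a whole, such as "delights", still tokenizes: the sketch returns the known piece "delight" followed by the single character "s". That fallback is the property that lets subword schemes cope with unseen words.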