From Text to Numbers: The Foundation of LLMs
Neural networks are mathematical functions that operate on numbers. But human language is symbolic: letters, words, and sentences. How do we bridge this gap?
The answer lies in two fundamental concepts: tokenization (converting text to discrete units) and embeddings (converting those units to meaningful numerical vectors). Together, they form the input pipeline that makes LLMs possible.
The Complete Pipeline
Text ("Hello, world!") β Tokenization (["Hello", ",", "world", "!"]) β Token IDs ([15496, 11, 995, 0]) β Embeddings (768-dim vectors) β Positional Encoding β Transformer Model
Why Not Just Use ASCII Codes?
You might wonder: why not just use ASCII codes (A=65, B=66, etc.)? Several reasons:
- No semantic meaning: ASCII is arbitrary. 'A' and 'B' are adjacent numbers, but have no semantic relationship.
- Inefficient: Word-level or subword-level tokenization reduces sequence length dramatically.
- No learned representations: LLMs need to learn that "king" and "queen" are related; embeddings enable this.
- Language-agnostic: Modern tokenization works across all languages, including emojis and special characters.
Historical Context
Early NLP systems used one-hot encoding (a vector with all zeros except one position). For a 50,000-word vocabulary, each word was a 50,000-dimensional sparse vector. This was:
- Memory intensive: Huge sparse vectors
- No semantic relationships: Every word was equally distant from every other
- Not scalable: Can't handle new words
Modern embeddings solve all these problems by learning dense, low-dimensional representations where semantic similarity = vector similarity.
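To make the contrast concrete, here is a tiny PyTorch sketch (the vocabulary size and dimension are illustrative, not tied to any specific model) comparing a one-hot vector with a dense embedding lookup:

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768  # illustrative sizes

# One-hot: a 50,000-dim sparse vector with a single 1
word_id = 2891
one_hot = torch.zeros(vocab_size)
one_hot[word_id] = 1.0
print(one_hot.shape)   # torch.Size([50000]) -- almost entirely zeros

# Dense embedding: a learned lookup table; row `word_id` is the word's vector
embedding = nn.Embedding(vocab_size, embed_dim)
dense = embedding(torch.tensor([word_id]))
print(dense.shape)     # torch.Size([1, 768]) -- compact, learned representation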
Tokenization: Breaking Text into Pieces
Tokenization is the process of splitting text into discrete units called tokens. A token can be a character, word, subword, or even byte. The choice dramatically affects model performance and efficiency.
1. Character-Level Tokenization
Split text into individual characters. Simple and universal.
Text: "Hello"
Tokens: ['H', 'e', 'l', 'l', 'o']
Vocabulary size: ~256 (ASCII/bytes) or 100,000+ (full Unicode)
Sequence length: Very long (one char per token)
Pros:
- Tiny vocabulary (can fit all possible characters)
- No out-of-vocabulary (OOV) issues
- Works for any language
Cons:
- Very long sequences (computationally expensive)
- Model must learn to compose characters into words (harder)
- Loses word-level semantic information
Used by: Some early models, byte-level models
2. Word-Level Tokenization
Split text by whitespace and punctuation. Intuitive and natural.
Text: "Hello world! This is NLP."
Tokens: ['Hello', 'world', '!', 'This', 'is', 'NLP', '.']
Vocabulary size: 50,000-500,000+ words
Sequence length: Moderate
Pros:
- Shorter sequences than character-level
- Semantically meaningful units
- Intuitive for humans
Cons:
- Huge vocabulary (every word needs an ID)
- Poor handling of rare/misspelled words (becomes UNK token)
- Can't handle morphology ("run", "running", "runs" are separate)
- Different vocab for each language
Used by: Early Word2Vec, GloVe embeddings
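Both of these strategies are easy to try yourself. A small sketch (the regex is just one reasonable way to split on words and punctuation) shows how they differ:

import re

text = "Hello world! This is NLP."

# Character-level: one token per character (tiny vocab, long sequences)
char_tokens = list(text)
print(char_tokens[:8])   # ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o']

# Word-level: split into words and punctuation (large vocab, shorter sequences)
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)       # ['Hello', 'world', '!', 'This', 'is', 'NLP', '.']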
3. Subword Tokenization (Modern Standard)
Split text into subword units: the best of both worlds. This is what GPT, BERT, and modern LLMs use.
Text: "unbelievable ChatGPT"
Tokens: ['un', 'believ', 'able', 'Chat', 'G', 'PT']
Vocabulary size: 30,000-100,000 subwords
Sequence length: Balanced
Pros:
- Moderate vocabulary size (32k-50k typical)
- Handles rare words by breaking into subwords
- Captures morphology (prefix/suffix patterns)
- No UNK tokens (can represent any word)
- Works across languages
Cons:
- Less intuitive for humans
- Requires training a tokenizer on corpus
Used by: GPT (BPE), BERT (WordPiece), T5 (SentencePiece)
Subword Algorithms Deep Dive
Byte Pair Encoding (BPE) - Used by GPT
BPE starts with characters and iteratively merges the most frequent pairs. Here's how it works step-by-step:
BPE Training Example
Corpus: "low low low lower lowest"
Step 1: Initialize with characters
Vocabulary: ['l', 'o', 'w', 'e', 'r', 's', 't']
Words: ['l o w', 'l o w', 'l o w', 'l o w e r', 'l o w e s t']
Step 2: Count all adjacent pairs
('l', 'o'): 5 times
('o', 'w'): 5 times
('w', 'e'): 2 times
('e', 'r'): 1 time
('e', 's'): 1 time
('s', 't'): 1 time
Step 3: Merge most frequent pair: ('l', 'o') → 'lo'
Vocabulary: ['l', 'o', 'w', 'e', 'r', 's', 't', 'lo']
Words: ['lo w', 'lo w', 'lo w', 'lo w e r', 'lo w e s t']
Step 4: Repeat: merge ('lo', 'w') → 'low'
Vocabulary: ['l', 'o', 'w', 'e', 'r', 's', 't', 'lo', 'low']
Words: ['low', 'low', 'low', 'low e r', 'low e s t']
Step 5: Continue until vocabulary reaches target size (e.g., 50,000)
During inference, BPE applies these merge rules in the same order to segment new text:
Input: "lower"
Step 1: ['l', 'o', 'w', 'e', 'r']
Apply merge 'lo': ['lo', 'w', 'e', 'r']
Apply merge 'low': ['low', 'e', 'r']
Final: ['low', 'e', 'r']
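If you want to see the merge loop in code, here is a minimal, self-contained BPE training sketch over the toy corpus above. It is a simplified illustration (real tokenizers add byte-level handling, end-of-word markers, and far larger vocabularies):

from collections import Counter

# Toy corpus from the walkthrough: "low low low lower lowest", words as tuples of symbols
corpus = Counter({("l", "o", "w"): 3,
                  ("l", "o", "w", "e", "r"): 1,
                  ("l", "o", "w", "e", "s", "t"): 1})

def count_pairs(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = Counter()
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] += freq
    return merged

merges = []
for _ in range(4):                        # target number of merges (tiny for the demo)
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]     # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)         # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...]
print(list(corpus))   # how the corpus words are segmented after the learned merges

The learned merge list is exactly what gets replayed, in order, when new text is segmented at inference time.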
WordPiece - Used by BERT
Similar to BPE, but merges based on likelihood rather than frequency. Adds "##" prefix to indicate continuation tokens:
"playing" β ['play', '##ing']
"unhappiness" β ['un', '##hap', '##pi', '##ness']
SentencePiece - Used by T5, Llama
Works directly on raw text (bytes) without pre-tokenization. Language-agnostic and handles whitespace as tokens:
"Hello world" β ['βHello', 'βworld'] # β represents space
Modern LLM Tokenizers:
- GPT-2/GPT-3: BPE, 50,257 tokens
- GPT-4: BPE (cl100k_base, ~100,000 tokens)
- BERT: WordPiece, 30,522 tokens
- Llama 2: SentencePiece, 32,000 tokens
- Mistral: SentencePiece, 32,000 tokens
Special Tokens
LLMs use special tokens to mark boundaries and roles:
- [PAD]: Padding token (for batching sequences of different lengths)
- [UNK]: Unknown token (rarely used with subword tokenization)
- [CLS]: Classification token (BERT uses for sentence-level tasks)
- [SEP]: Separator token (marks boundaries between segments)
- <|endoftext|>: End of text (GPT-2)
- <s>, </s>: Start/end of sequence (many models)
Practical Tokenization with HuggingFace
from transformers import AutoTokenizer
# Load different tokenizers
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Tokenize text
text = "Hello, world! How are you?"
print("=== GPT-2 (BPE) ===")
tokens_gpt = gpt2_tokenizer.tokenize(text)
print(f"Tokens: {tokens_gpt}")
print(f"Token IDs: {gpt2_tokenizer.encode(text)}")
print(f"Count: {len(tokens_gpt)}\n")
print("=== BERT (WordPiece) ===")
tokens_bert = bert_tokenizer.tokenize(text)
print(f"Tokens: {tokens_bert}")
print(f"Token IDs: {bert_tokenizer.encode(text)}")
print(f"Count: {len(tokens_bert)}\n")
# Decode back
decoded = gpt2_tokenizer.decode([15496, 11, 995, 0])
print(f"Decoded: {decoded}")
Output
=== GPT-2 (BPE) ===
Tokens: ['Hello', ',', 'Ġworld', '!', 'ĠHow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 995, 0, 1374, 389, 345, 30]
Count: 8
=== BERT (WordPiece) ===
Tokens: ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
Token IDs: [101, 7592, 1010, 2088, 999, 2129, 2024, 2017, 1029, 102]
Count: 8 (encode() also adds [CLS] and [SEP], so there are 10 token IDs)
Decoded: Hello, world!
Advanced Tokenization Features
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default; reuse <|endoftext|>
# Batch tokenization
texts = ["Hello world", "Tokenization is important"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(encoded)
# Returns: {'input_ids': tensor([[...], [...]]), 'attention_mask': tensor([[...], [...]])}
# Truncation and padding
long_text = "This is a very long text..." * 100
encoded = tokenizer(
    long_text,
    max_length=512,        # Limit to 512 tokens
    truncation=True,       # Cut off excess
    padding='max_length',  # Pad to 512
    return_tensors="pt"
)
# Get token count (useful for API limits)
token_count = len(tokenizer.encode(texts[0]))
print(f"Token count: {token_count}")
# Encode/decode special tokens
text_with_special = tokenizer.decode([50256]) # GPT-2 end-of-text token
print(f"Special token: {text_with_special}")
Token Count and Cost Estimation
Understanding token counts is crucial for API usage and costs:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Rule of thumb: 1 token ≈ 0.75 words (English)
text = "The quick brown fox jumps over the lazy dog"
tokens = tokenizer.encode(text)
print(f"Words: {len(text.split())}")
print(f"Tokens: {len(tokens)}")
print(f"Ratio: {len(text.split()) / len(tokens):.2f} words per token")
# OpenAI pricing example (as of 2023)
# GPT-3.5-turbo: $0.002 per 1K tokens
total_tokens = len(tokens)
cost = (total_tokens / 1000) * 0.002
print(f"Estimated cost: ${cost:.6f}")
Token Limit Constraints:
- GPT-3.5: 4,096 tokens (input + output)
- GPT-4: 8,192 or 32,768 tokens
- Claude 2: 100,000 tokens
- Llama 2: 4,096 tokens
Exceeding limits = truncation or API errors!
Embeddings: From IDs to Meaning
After tokenization, we have token IDs (integers). But neural networks need to understand semantic relationships. A token ID is just an arbitrary number; it doesn't capture that "king" and "queen" are related, or that "cat" and "dog" are both animals.
Embeddings solve this by converting token IDs into dense vectors (lists of numbers) where semantically similar tokens have similar vectors.
Token ID
A unique integer for each token
Example: 2891 for "king"
Embedding Vector
Dense vector capturing meaning
Example: [0.2, -0.5, 0.1, ..., -0.1] (768 dims)
Embedding Table
Lookup matrix: vocab_size × embedding_dim
Example: 50,000 × 768 = 38.4M parameters
Learning Process
Embeddings are learned during pre-training via backpropagation, not predefined
How Embeddings Work
An embedding layer is essentially a lookup table (matrix) with shape: [vocabulary_size, embedding_dimension]
Embedding Table Structure
Vocabulary size: 50,000 tokens
Embedding dimension: 768 (GPT-2), 12,288 (GPT-3 175B), 4,096 (Llama 2 7B)
Embedding Matrix Shape: [50,000 × 768]
Total parameters: 38,400,000
Token ID 2891 ("king") → row 2891 → [0.21, -0.53, 0.12, 0.31, ..., -0.08]
Token ID 2737 ("queen") → row 2737 → [0.19, -0.51, 0.15, 0.29, ..., -0.06]
Token ID 1234 ("apple") → row 1234 → [0.05, 0.23, -0.34, 0.01, ..., 0.42]
Notice "king" and "queen" have very similar vectors (small differences). "apple" is quite different. This similarity structure is learned automatically during training!
Why Embeddings Capture Meaning
During pre-training, the model learns to predict the next token. To do this well, it must learn that:
- "The king sat on the throne" and "The queen sat on the throne" are both valid
- So "king" and "queen" should have similar embeddings
- But "The apple sat on the throne" is unusual, so "apple" embeddings should differ
Through billions of training examples, embeddings naturally organize themselves to reflect semantic similarity. This is called distributional semantics: "words that appear in similar contexts have similar meanings."
Famous Word Analogy Property
One of the most fascinating properties of embeddings is vector arithmetic that captures relationships:
embedding("king") - embedding("man") + embedding("woman") β embedding("queen")
embedding("Paris") - embedding("France") + embedding("Italy") β embedding("Rome")
embedding("walking") - embedding("walk") + embedding("run") β embedding("running")
embedding("big") - embedding("bigger") + embedding("small") β embedding("smaller")
This works because embeddings capture relationships like:
- Gender: male ↔ female
- Geography: capital ↔ country
- Grammar: base form ↔ inflection
- Size/degree: comparative relationships
Key Insight: These relationships emerge naturally from training. They're not programmed; they're learned by observing how words co-occur in billions of sentences!
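One way to try the analogy test yourself is with gensim's pre-trained static GloVe vectors. This is a sketch, not part of this tutorial's stack: the first run downloads a sizeable model, and the exact similarity scores depend on which embedding model you load.

# Analogy test with pre-trained static embeddings (gensim)
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

# king - man + woman ~= ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically [('queen', ~0.77)]

# paris - france + italy ~= ?  (GloVe tokens are lowercase)
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# Typically [('rome', ...)]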
Visualizing Embeddings
Embeddings are high-dimensional (768D for GPT-2). We can use dimensionality reduction (t-SNE, PCA) to visualize them in 2D:
Conceptual Visualization (2D projection)
* queen
* king
* woman * prince
* girl * princess
* cat
* dog * Python
* lion * JavaScript
* tiger * Java
* apple
* banana
* orange
Similar words cluster together. Related concepts (royalty, animals, programming, fruits) form distinct regions.
Evolution of Word Embeddings
Static Embeddings (2013-2018)
Early approaches learned fixed embeddings for each word:
- Word2Vec (2013): CBOW and Skip-gram models. Each word = one vector.
- GloVe (2014): Global vectors based on co-occurrence statistics.
- FastText (2016): Uses character n-grams, handles rare words better.
Limitation: One embedding per word, regardless of context. "bank" (river) vs "bank" (money) have the same embedding.
Contextualized Embeddings (2018-Present)
Modern LLMs generate different embeddings based on context:
- ELMo (2018): Bidirectional LSTM, context-aware embeddings.
- BERT (2018): Transformer-based, deeply contextualized.
- GPT-3 (2020): Contextual embeddings in massive decoder-only model.
Example of Contextualization:
Sentence 1: "I went to the river bank."
"bank" embedding: [0.2, 0.5, -0.1, ...] (closer to "river", "water")
Sentence 2: "I deposited money at the bank."
"bank" embedding: [0.8, -0.2, 0.3, ...] (closer to "money", "financial")
Same word, different contexts β different embeddings!
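You can observe this effect directly by extracting the embedding of "bank" from BERT in different sentences and comparing them. A sketch follows; the exact similarity values will vary, but the same-sense pair should score higher than the cross-sense pair.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_embedding(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

river = bank_embedding("I went to the river bank.")
money = bank_embedding("I deposited money at the bank.")
same  = bank_embedding("I sat on the river bank.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))   # lower similarity: different senses
print(cos(river, same, dim=0))    # higher similarity: same sense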
Embedding Dimensions
Different models use different embedding dimensions:
| Model | Embedding Dimension |
|---|---|
| GPT-2 (small) / BERT-base | 768 |
| BERT-large | 1,024 |
| Llama 2 7B | 4,096 |
| Llama 2 70B | 8,192 |
| GPT-3 (175B) | 12,288 |
Higher dimensions = more capacity to represent nuanced meanings, but also more parameters to train.
Code: Extracting and Using Embeddings
import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "I love machine learning"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")
print(f"Token IDs: {inputs['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    # Last hidden state = final (contextualized) embeddings
    embeddings = outputs.last_hidden_state

print(f"\nEmbedding shape: {embeddings.shape}")
# Output: torch.Size([1, 6, 768])
# [batch_size, sequence_length, embedding_dim]

# Access individual token embeddings
for i, token in enumerate(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])):
    token_embedding = embeddings[0, i, :]
    print(f"Token '{token}': {token_embedding[:5]}...")  # First 5 dims
Output
Token IDs: tensor([[ 101, 1045, 2293, 3698, 4083,  102]])
Tokens: ['[CLS]', 'i', 'love', 'machine', 'learning', '[SEP]']

Embedding shape: torch.Size([1, 6, 768])

Token '[CLS]': tensor([-0.1234, 0.5678, -0.9012, 0.3456, -0.7890])...
Token 'i': tensor([0.2345, -0.6789, 0.0123, -0.4567, 0.8901])...
Token 'love': tensor([0.3456, 0.7890, -0.1234, 0.5678, -0.9012])...
...
Sentence Embeddings
For tasks like semantic search or clustering, we need a single vector for an entire sentence. Common approaches:
1. Mean Pooling
import torch

# Average all token embeddings
sentence_embedding = embeddings.mean(dim=1)  # Shape: [1, 768]
print(f"Sentence embedding shape: {sentence_embedding.shape}")
2. CLS Token (BERT-style)
# Use the [CLS] token embedding (first token)
sentence_embedding = embeddings[:, 0, :]  # Shape: [1, 768]
3. Max Pooling
# Take maximum value across sequence for each dimension
sentence_embedding = embeddings.max(dim=1)[0]  # Shape: [1, 768]
4. Specialized Sentence Embedders
from sentence_transformers import SentenceTransformer

# Models specifically trained for sentence similarity
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "I love machine learning",
    "Deep learning is fascinating",
    "I enjoy cooking pasta"
]

embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 384)

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(embeddings)
print(f"\nSimilarity matrix:\n{similarity}")
# Sentences 1 and 2 will have high similarity (both about ML)
# Sentence 3 will have low similarity to 1 and 2
Output
Embeddings shape: (3, 384)

Similarity matrix:
[[1.0  0.72 0.15]  ← "I love machine learning"
 [0.72 1.0  0.18]  ← "Deep learning is fascinating"
 [0.15 0.18 1.0 ]] ← "I enjoy cooking pasta"

ML sentences are 72% similar
ML vs cooking is only 15% similar
Applications of Embeddings
- Semantic Search: Convert queries and documents to embeddings; find nearest neighbors for relevant results
- Recommendation: Embed items and users; recommend items with embeddings similar to user preferences
- Clustering: Group similar documents/texts by clustering their embeddings (k-means, DBSCAN)
- Classification: Use embeddings as features for downstream classifiers (sentiment, topic, intent)
- Translation: Map embeddings across languages using aligned embedding spaces
- RAG Systems: Retrieve relevant documents by embedding similarity before LLM generation
Embedding Quality Matters: Different models produce embeddings of different quality. For semantic search, use models trained specifically for similarity (e.g., sentence-transformers), not general LLM embeddings.
Positional Embeddings: Adding Order Information
Token embeddings alone have a critical limitation: they don't capture position. The sentences "John loves Mary" and "Mary loves John" would have identical token embeddings (just in different order), but they mean completely different things!
Transformers process all tokens in parallel (unlike RNNs which process sequentially), so they need an explicit way to encode position information. This is where positional embeddings come in.
The Problem
Without positional information, these are indistinguishable to the model:
"The cat sat on the mat" = "sat on the The mat cat" (same tokens, different order)
How Positional Embeddings Work
The final input embedding is the sum of token embedding and positional embedding:
Input Embedding = Token Embedding + Positional Embedding
For "John loves Mary":
Token "John" at position 0:
Token embedding: [0.5, -0.2, 0.8, ...]
Positional embedding: [0.0, 0.1, 0.0, ...]
Final embedding: [0.5, -0.1, 0.8, ...]
Token "loves" at position 1:
Token embedding: [0.3, 0.7, -0.1, ...]
Positional embedding: [0.1, 0.05, 0.02, ...]
Final embedding: [0.4, 0.75, -0.08, ...]
Token "Mary" at position 2:
Token embedding: [-0.1, 0.4, 0.6, ...]
Positional embedding: [0.15, 0.03, 0.04, ...]
Final embedding: [0.05, 0.43, 0.64, ...]
Now the same token at different positions has different final embeddings, preserving word order!
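In code, this is just an element-wise sum of two lookup tables. A minimal sketch with illustrative sizes and token IDs, using learned (BERT/GPT-style) positional embeddings:

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50_000, 512, 768   # illustrative sizes

token_emb = nn.Embedding(vocab_size, d_model)      # one row per token ID
pos_emb   = nn.Embedding(max_len, d_model)         # one row per position

token_ids = torch.tensor([[2891, 3456, 2737]])                 # e.g. "John loves Mary" (IDs illustrative)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)       # [[0, 1, 2]]

input_embeddings = token_emb(token_ids) + pos_emb(positions)   # element-wise sum
print(input_embeddings.shape)                                  # torch.Size([1, 3, 768])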
Types of Positional Encodings
1. Sinusoidal Positional Encoding (Original Transformer)
The original "Attention Is All You Need" paper used fixed sinusoidal functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos = position in sequence (0, 1, 2, ...)
- i = dimension index (0 to d_model/2)
- d_model = embedding dimension (512, 768, etc.)
Advantages:
- Deterministic (no learning required)
- Works for any sequence length (can extrapolate)
- Smooth gradual change with position
- Model can learn to attend to relative positions
Used by: Original Transformer, many encoder-decoder models
2. Learned Positional Embeddings (BERT, GPT)
Modern LLMs learn positional embeddings during training, just like token embeddings:
Positional Embedding Table:
[
Position 0: [0.01, -0.05, 0.12, ...], # Learned vector for position 0
Position 1: [0.03, -0.02, 0.15, ...], # Learned vector for position 1
Position 2: [0.05, 0.01, 0.18, ...], # Learned vector for position 2
...
Position 511: [0.45, 0.32, -0.21, ...] # Max position for BERT
]
Advantages:
- Can be optimized for specific tasks during training
- Often performs slightly better than sinusoidal
- Simpler to implement
Limitations:
- Fixed maximum sequence length (can't extrapolate beyond training length)
- Requires learning additional parameters
Used by: BERT, GPT-2, GPT-3, RoBERTa
3. Relative Positional Encoding (T5, Transformer-XL)
Instead of absolute positions, encode relative distances between tokens:
Instead of: "token at position 5"
Encode: "token is 3 positions before the current token"
Benefits:
- Better generalization to longer sequences
- Captures relationships more naturally ("two words apart" matters more than "position 5 vs 7")
- Can handle unbounded sequence lengths
Used by: T5, Transformer-XL, DeBERTa
4. Rotary Position Embedding (RoPE) - Modern LLMs
Recent models use RoPE, which rotates embedding vectors based on position:
- Encodes absolute position information
- Also captures relative positions naturally
- Better extrapolation to longer sequences
- More efficient than learned embeddings
Used by: Llama, Llama 2, GPT-NeoX, PaLM
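For intuition, here is a minimal sketch of the rotation RoPE applies, using the interleaved even/odd dimension pairing of the original formulation. Real implementations apply this to the per-head query and key vectors inside attention, not to the input embeddings directly:

import torch

def rotary_embed(x, base=10000.0):
    """Apply a rotary position embedding to x of shape [seq_len, d_model] (d_model even).

    Minimal sketch: each adjacent dimension pair is rotated by an angle that
    grows with position, at a pair-specific frequency.
    """
    seq_len, d_model = x.shape
    half = d_model // 2
    # One rotation frequency per dimension pair
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * inv_freq  # [seq_len, half]
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[:, 0::2], x[:, 1::2]           # split into even/odd dimension pairs
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin    # 2-D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(6, 64)          # 6 positions, one 64-dim attention head
q_rot = rotary_embed(q)
print(q_rot.shape)              # torch.Size([6, 64])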
5. ALiBi (Attention with Linear Biases)
Adds position-dependent bias to attention scores (no position embeddings at all):
- Very simple: penalize attention based on distance
- Excellent extrapolation to longer sequences
- No additional parameters
Used by: BLOOM, MPT, some Llama variants
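A sketch of the ALiBi bias matrix (causal form) makes the idea concrete. The slope formula here is the geometric sequence the ALiBi paper uses for power-of-two head counts:

import torch

def alibi_bias(seq_len, num_heads=8):
    """Build the ALiBi bias added to causal attention scores (sketch).

    Each head has a fixed slope m; the bias is -m * (query_pos - key_pos),
    so attention to distant past tokens is linearly penalized. Nothing is learned.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions.unsqueeze(1) - positions.unsqueeze(0)).clamp(min=0)  # i - j for past tokens
    return -slopes.view(num_heads, 1, 1) * distance      # [num_heads, seq_len, seq_len]

bias = alibi_bias(seq_len=5)
print(bias[0])   # head 0: zeros on the diagonal, increasingly negative toward older tokens
# In attention: scores = (q @ k.transpose(-2, -1)) / sqrt(d_head) + bias, then causal mask + softmax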
Visualizing Positional Encodings
Sinusoidal Positional Encoding Pattern
Position Encoding Heatmap (first 8 dimensions, positions 0-16):
Dim Pos: 0 1 2 3 4 5 6 7 8 ...
0: [ 0.0 0.8 1.0 0.8 0.0 -0.8 -1.0 -0.8 0.0 ...]
1: [ 1.0 0.5 0.0 -0.5 -1.0 -0.5 0.0 0.5 1.0 ...]
2: [ 0.0 0.3 0.5 0.7 0.9 1.0 0.9 0.7 0.5 ...]
3: [ 1.0 0.9 0.8 0.6 0.3 0.0 -0.3 -0.6 -0.8 ...]
...
Notice: Each dimension has a different frequency
→ Model can learn to attend to positions using these patterns
Code: Adding Positional Embeddings
import torch
import math
def sinusoidal_positional_encoding(seq_len, d_model):
    """Generate sinusoidal positional encodings."""
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
    )
    pos_encoding = torch.zeros(seq_len, d_model)
    pos_encoding[:, 0::2] = torch.sin(position * div_term)
    pos_encoding[:, 1::2] = torch.cos(position * div_term)
    return pos_encoding
# Generate for sequence length 10, embedding dim 512
pos_enc = sinusoidal_positional_encoding(seq_len=10, d_model=512)
print(f"Positional encoding shape: {pos_enc.shape}") # [10, 512]
# Visualize first position
print(f"Position 0 encoding (first 10 dims): {pos_enc[0, :10]}\")
Practical Example with Transformers
from transformers import BertModel, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
text = "John loves Mary"
inputs = tokenizer(text, return_tensors="pt")
# BERT automatically adds positional embeddings
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Token embeddings (before positional encoding is added)
token_embeddings = model.embeddings.word_embeddings(inputs['input_ids'])
print(f"Token embeddings shape: {token_embeddings.shape}") # [1, 5, 768]
# Positional embeddings
position_ids = torch.arange(inputs['input_ids'].size(1)).unsqueeze(0)
position_embeddings = model.embeddings.position_embeddings(position_ids)
print(f"Position embeddings shape: {position_embeddings.shape}\") # [1, 5, 768]
# Combined embeddings (token + position + segment, after layer norm and dropout)
final_embeddings = outputs.hidden_states[0]
print(f"Combined embedding shape: {final_embeddings.shape}")  # [1, 5, 768]
Key Takeaway: Positional embeddings are crucial for Transformers. Without them, the model has no sense of word order, making it impossible to understand language properly!
Context Window Limits
Positional embeddings impose sequence length limits:
- BERT: 512 tokens (learned positional embeddings)
- GPT-2: 1,024 tokens (learned)
- GPT-3: 2,048 tokens (learned)
- GPT-3.5/4: 4k-32k tokens (advanced techniques)
- Claude 2: 100k tokens (extended context methods)
- Llama 2: 4,096 tokens (RoPE, can be extended)
Going Beyond Training Length: Models struggle with sequences longer than training length when using learned positional embeddings. Modern techniques like RoPE and ALiBi enable better extrapolation to longer contexts.
Practical Tips & Best Practices
Tokenization Best Practices
- Use the right tokenizer: Always use the tokenizer that matches your model (GPT-2 tokenizer for GPT-2, BERT tokenizer for BERT)
- Watch token counts: APIs charge per token. Use len(tokenizer.encode(text)) to estimate costs
- Handle truncation gracefully: Long documents should be chunked intelligently, not just cut off mid-sentence
- Test on your data: Check how your specific text tokenizes (especially domain-specific terms, code, non-English text)
- Be aware of special tokens: Models add [CLS], [SEP], etc. Factor these into token counts
Embedding Best Practices
- Use pre-trained embeddings: Don't train from scratch unless you have massive datasets
- Fine-tune when needed: For domain-specific tasks, fine-tune embeddings on your data
- Match embedding model to task: Use sentence-transformers for semantic search, not general LLM embeddings
- Normalize embeddings: For cosine similarity, normalize vectors to unit length (see the sketch after this list)
- Cache embeddings: Computing embeddings is expensive. Cache and reuse when possible
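A quick sketch of the normalization tip above; random vectors stand in for real cached embeddings:

import numpy as np

# Normalize embeddings to unit length so a dot product equals cosine similarity
embeddings = np.random.rand(5, 384).astype("float32")          # e.g. 5 cached sentence embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / norms

query = unit_embeddings[0]
scores = unit_embeddings @ query     # dot products = cosine similarities (all vectors unit length)
print(scores)                        # scores[0] == 1.0 (self-similarity)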
Common Pitfalls to Avoid
- Tokenizer mismatch: Using wrong tokenizer for a model causes gibberish output
- Ignoring special tokens: Forgetting to account for [CLS], [SEP] in token counts
- Exceeding context length: Going beyond model's max length causes errors or truncation
- Not handling edge cases: Emojis, code, URLs can tokenize unexpectedly
- Using wrong pooling: CLS token pooling doesn't work well for non-BERT models
Real-World Example: Building a Semantic Search System
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# 1. Load specialized embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Your document corpus
documents = [
\"Machine learning is a subset of AI\",
\"Deep learning uses neural networks\",
\"Python is a programming language\",
\"Natural language processing handles text\",
\"Computer vision processes images\"
]
# 3. Embed all documents (do this once, cache results)
doc_embeddings = model.encode(documents)
print(f\"Document embeddings shape: {doc_embeddings.shape}\") # (5, 384)
# 4. User query
query = \"Tell me about artificial intelligence\"
query_embedding = model.encode([query])
# 5. Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
# 6. Rank and retrieve
ranked_indices = np.argsort(similarities)[::-1] # Descending order
print(\"\\nSearch Results:\")
for i, idx in enumerate(ranked_indices[:3], 1):
    print(f"{i}. {documents[idx]} (similarity: {similarities[idx]:.3f})")
Output
Document embeddings shape: (5, 384)
Search Results:
1. Machine learning is a subset of AI (similarity: 0.652)
2. Deep learning uses neural networks (similarity: 0.531)
3. Natural language processing handles text (similarity: 0.398)
The system correctly retrieves AI-related documents first!
Chapter Summary
Key Concepts Mastered:
- Tokenization Fundamentals: Converting text into discrete units (tokens) that models can process
- Tokenization Strategies: Character-level (small vocab, long sequences), Word-level (intuitive but large vocab), Subword (optimal balance)
- BPE Algorithm: Iteratively merges the most frequent character pairs to build a subword vocabulary
- Modern Tokenizers: GPT uses BPE, BERT uses WordPiece, Llama uses SentencePiece
- Token IDs: Integer representation of tokens from the vocabulary (e.g., "king" → 2891)
- Embeddings: Dense vectors that capture semantic meaning, where similar words have similar vectors
- Embedding Learning: Automatically learned during pre-training, not manually programmed
- Vector Arithmetic: king - man + woman ≈ queen (relationships emerge naturally)
- Contextualized Embeddings: Modern LLMs generate different embeddings for the same word in different contexts
- Positional Embeddings: Add position information so models understand word order
- Positional Strategies: Sinusoidal (fixed), Learned (BERT/GPT), Relative (T5), RoPE (Llama), ALiBi (BLOOM)
Technical Specifications Learned:
| Component | Purpose | Example Dimensions |
|---|---|---|
| Vocabulary Size | Number of unique tokens | 30k-100k tokens |
| Token ID | Integer index in vocabulary | 0 to vocab_size-1 |
| Embedding Dimension | Vector size for each token | 768 (BERT/GPT-2), 4096 (Llama) |
| Embedding Table | Lookup matrix for embeddings | [50k × 768] = 38.4M params |
| Position Embedding | Encode token position in sequence | Same dim as token embeddings |
| Context Length | Max sequence length | 512-100k tokens (model-dependent) |
Code Skills Acquired:
- Load and use HuggingFace tokenizers for different models
- Tokenize text and decode token IDs back to text
- Count tokens for API cost estimation
- Handle truncation and padding for batch processing
- Extract token embeddings from pre-trained models
- Generate sentence embeddings (mean pooling, CLS token)
- Compute semantic similarity using cosine similarity
- Build semantic search systems with embeddings
- Implement sinusoidal positional encoding
Practical Applications:
- Semantic Search: Embed documents, find similar ones via cosine similarity
- Text Classification: Use embeddings as features for sentiment, topic, and intent classification
- Clustering: Group similar documents by clustering embeddings
- RAG Systems: Retrieve relevant context using embedding similarity before generation
What's Next?
Now that you understand how text becomes numbers (tokenization) and how those numbers capture meaning (embeddings), you're ready to learn how to communicate effectively with LLMs.
In the next tutorial, Prompt Engineering, you'll master:
- Crafting effective prompts for different tasks
- Zero-shot, one-shot, and few-shot learning techniques
- Chain-of-thought prompting for complex reasoning
- Prompt templates and best practices
- Common pitfalls and how to avoid them
Excellent Progress! You've completed Module 2 and now have a deep understanding of the input pipeline for LLMs. Tokenization and embeddings are the foundation; every LLM interaction starts here. This knowledge will be invaluable as we explore more advanced topics!
Quick Self-Check Questions
- Why do modern LLMs use subword tokenization instead of word-level?
- How does BPE decide which character pairs to merge?
- What does it mean that embeddings are "learned"?
- Why do we need positional embeddings in Transformers?
- What's the difference between static and contextualized embeddings?
Answers: (1) Balances vocab size and sequence length, handles rare words; (2) Merges most frequent pairs iteratively; (3) Updated during training via backpropagation, not predefined; (4) Transformers process in parallel, no inherent position info; (5) Static = same vector always, contextualized = different based on surrounding words
Test Your Knowledge
Q1: What is the main advantage of Byte-Pair Encoding (BPE) tokenization?
Q2: How does BPE build its vocabulary?
Q3: How are learned embeddings different from predefined embeddings?
Q4: Why do Transformers need positional embeddings?
Q5: What is the difference between static and contextualized embeddings?