From Text to Numbers: The Foundation of LLMs
Neural networks are mathematical functions that operate on numbers. But human language is symbolic: letters, words, and sentences. How do we bridge this gap?
The answer lies in two fundamental concepts: tokenization (converting text to discrete units) and embeddings (converting those units to meaningful numerical vectors). Together, they form the input pipeline that makes LLMs possible.
The Complete Pipeline
Text ("Hello, world!") β Tokenization (["Hello", ",", "world", "!"]) β Token IDs ([15496, 11, 995, 0]) β Embeddings (768-dim vectors) β Positional Encoding β Transformer Model
Why Not Just Use ASCII Codes?
You might wonder: why not just use ASCII codes (A=65, B=66, etc.)? Several reasons:
- No semantic meaning: ASCII is arbitrary. 'A' and 'B' are adjacent numbers, but have no semantic relationship.
- Inefficient: Word-level or subword-level tokenization reduces sequence length dramatically.
- No learned representations: LLMs need to learn that "king" and "queen" are related; embeddings enable this.
- Language-agnostic: Modern tokenization works across all languages, including emojis and special characters.
Historical Context
Early NLP systems used one-hot encoding (a vector with all zeros except one position). For a 50,000-word vocabulary, each word was a 50,000-dimensional sparse vector. This was:
- Memory intensive: Huge sparse vectors
- No semantic relationships: Every word was equally distant from every other
- Not scalable: Can't handle new words
Modern embeddings solve all these problems by learning dense, low-dimensional representations where semantic similarity = vector similarity.
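To make the contrast concrete, here is a tiny PyTorch sketch (the vocabulary size and dimension are illustrative, not tied to any specific model) comparing a one-hot vector with a dense embedding lookup:

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768  # illustrative sizes

# One-hot: a 50,000-dim sparse vector with a single 1
word_id = 2891
one_hot = torch.zeros(vocab_size)
one_hot[word_id] = 1.0
print(one_hot.shape)   # torch.Size([50000]) -- almost entirely zeros

# Dense embedding: a learned lookup table; row `word_id` is the word's vector
embedding = nn.Embedding(vocab_size, embed_dim)
dense = embedding(torch.tensor([word_id]))
print(dense.shape)     # torch.Size([1, 768]) -- compact, learned representation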
Tokenization: Breaking Text into Pieces
Tokenization is the process of splitting text into discrete units called tokens. A token can be a character, word, subword, or even byte. The choice dramatically affects model performance and efficiency.
1. Character-Level Tokenization
Split text into individual characters. Simple and universal.
Text: "Hello"
Tokens: ['H', 'e', 'l', 'l', 'o']
Vocabulary size: ~256 (ASCII/bytes) or 100,000+ (full Unicode)
Sequence length: Very long (one char per token)
Pros:
- Tiny vocabulary (can fit all possible characters)
- No out-of-vocabulary (OOV) issues
- Works for any language
Cons:
- Very long sequences (computationally expensive)
- Model must learn to compose characters into words (harder)
- Loses word-level semantic information
Used by: Some early models, byte-level models
2. Word-Level Tokenization
Split text by whitespace and punctuation. Intuitive and natural.
Text: "Hello world! This is NLP."
Tokens: ['Hello', 'world', '!', 'This', 'is', 'NLP', '.']
Vocabulary size: 50,000-500,000+ words
Sequence length: Moderate
Pros:
- Shorter sequences than character-level
- Semantically meaningful units
- Intuitive for humans
Cons:
- Huge vocabulary (every word needs an ID)
- Poor handling of rare/misspelled words (becomes UNK token)
- Can't handle morphology ("run", "running", "runs" are separate)
- Different vocab for each language
Used by: Early Word2Vec, GloVe embeddings
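Both of these strategies are easy to try yourself. A small sketch (the regex is just one reasonable way to split on words and punctuation) shows how they differ:

import re

text = "Hello world! This is NLP."

# Character-level: one token per character (tiny vocab, long sequences)
char_tokens = list(text)
print(char_tokens[:8])   # ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o']

# Word-level: split into words and punctuation (large vocab, shorter sequences)
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)       # ['Hello', 'world', '!', 'This', 'is', 'NLP', '.']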
3. Subword Tokenization (Modern Standard)
Split text into subword units: the best of both worlds. This is what GPT, BERT, and modern LLMs use.
Text: "unbelievable ChatGPT"
Tokens: ['un', 'believ', 'able', 'Chat', 'G', 'PT']
Vocabulary size: 30,000-100,000 subwords
Sequence length: Balanced
Pros:
- Moderate vocabulary size (32k-50k typical)
- Handles rare words by breaking into subwords
- Captures morphology (prefix/suffix patterns)
- No UNK tokens (can represent any word)
- Works across languages
Cons:
- Less intuitive for humans
- Requires training a tokenizer on corpus
Used by: GPT (BPE), BERT (WordPiece), T5 (SentencePiece)
Subword Algorithms Deep Dive
Byte Pair Encoding (BPE) - Used by GPT
BPE starts with characters and iteratively merges the most frequent pairs. Here's how it works step-by-step:
BPE Training Example
Corpus: "low low low lower lowest"
Step 1: Initialize with characters
Vocabulary: ['l', 'o', 'w', 'e', 'r', 's', 't']
Words: ['l o w', 'l o w', 'l o w', 'l o w e r', 'l o w e s t']
Step 2: Count all adjacent pairs
('l', 'o'): 5 times
('o', 'w'): 5 times
('w', 'e'): 2 times
('e', 'r'): 1 time
('e', 's'): 1 time
('s', 't'): 1 time
Step 3: Merge most frequent pair: ('l', 'o') → 'lo'
Vocabulary: ['l', 'o', 'w', 'e', 'r', 's', 't', 'lo']
Words: ['lo w', 'lo w', 'lo w', 'lo w e r', 'lo w e s t']
Step 4: Repeat: merge ('lo', 'w') → 'low'
Vocabulary: ['l', 'o', 'w', 'e', 'r', 's', 't', 'lo', 'low']
Words: ['low', 'low', 'low', 'low e r', 'low e s t']
Step 5: Continue until vocabulary reaches target size (e.g., 50,000)
During inference, BPE applies these merge rules in the same order to segment new text:
Input: "lower"
Step 1: ['l', 'o', 'w', 'e', 'r']
Apply merge 'lo': ['lo', 'w', 'e', 'r']
Apply merge 'low': ['low', 'e', 'r']
Final: ['low', 'e', 'r']
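If you want to see the merge loop in code, here is a minimal, self-contained BPE training sketch over the toy corpus above. It is a simplified illustration (real tokenizers add byte-level handling, end-of-word markers, and far larger vocabularies):

from collections import Counter

# Toy corpus from the walkthrough: "low low low lower lowest", words as tuples of symbols
corpus = Counter({("l", "o", "w"): 3,
                  ("l", "o", "w", "e", "r"): 1,
                  ("l", "o", "w", "e", "s", "t"): 1})

def count_pairs(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = Counter()
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] += freq
    return merged

merges = []
for _ in range(4):                        # target number of merges (tiny for the demo)
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]     # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)         # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...]
print(list(corpus))   # how the corpus words are segmented after the learned merges

The learned merge list is exactly what gets replayed, in order, when new text is segmented at inference time.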
WordPiece - Used by BERT
Similar to BPE, but merges based on likelihood rather than frequency. Adds "##" prefix to indicate continuation tokens:
"playing" β ['play', '##ing']
"unhappiness" β ['un', '##hap', '##pi', '##ness']
SentencePiece - Used by T5, Llama
Works directly on raw text (bytes) without pre-tokenization. Language-agnostic and handles whitespace as tokens:
"Hello world" β ['βHello', 'βworld'] # β represents space
Modern LLM Tokenizers:
- GPT-2/GPT-3: BPE, 50,257 tokens
- GPT-4: BPE (cl100k_base, ~100,000 tokens)
- BERT: WordPiece, 30,522 tokens
- Llama 2: SentencePiece, 32,000 tokens
- Mistral: SentencePiece, 32,000 tokens
Special Tokens
LLMs use special tokens to mark boundaries and roles:
- [PAD]: Padding token (for batching sequences of different lengths)
- [UNK]: Unknown token (rarely used with subword tokenization)
- [CLS]: Classification token (BERT uses for sentence-level tasks)
- [SEP]: Separator token (marks boundaries between segments)
- <|endoftext|>: End of text (GPT-2)
- <s>, </s>: Start/end of sequence (many models)
Practical Tokenization with HuggingFace
from transformers import AutoTokenizer
# Load different tokenizers
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Tokenize text
text = "Hello, world! How are you?"
print("=== GPT-2 (BPE) ===")
tokens_gpt = gpt2_tokenizer.tokenize(text)
print(f"Tokens: {tokens_gpt}")
print(f"Token IDs: {gpt2_tokenizer.encode(text)}")
print(f"Count: {len(tokens_gpt)}\n")
print("=== BERT (WordPiece) ===")
tokens_bert = bert_tokenizer.tokenize(text)
print(f"Tokens: {tokens_bert}")
print(f"Token IDs: {bert_tokenizer.encode(text)}")
print(f"Count: {len(tokens_bert)}\n")
# Decode back
decoded = gpt2_tokenizer.decode([15496, 11, 995, 0])
print(f"Decoded: {decoded}")
Output
=== GPT-2 (BPE) ===
Tokens: ['Hello', ',', 'Ġworld', '!', 'ĠHow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 995, 0, 1374, 389, 345, 30]
Count: 8
=== BERT (WordPiece) ===
Tokens: ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
Token IDs: [101, 7592, 1010, 2088, 999, 2129, 2024, 2017, 1029, 102]
Count: 8 (encode() also adds [CLS] and [SEP], so there are 10 token IDs)
Decoded: Hello, world!
Advanced Tokenization Features
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default; reuse <|endoftext|>
# Batch tokenization
texts = ["Hello world", "Tokenization is important"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(encoded)
# Returns: {'input_ids': tensor([[...], [...]]), 'attention_mask': tensor([[...], [...]])}
# Truncation and padding
long_text = "This is a very long text..." * 100
encoded = tokenizer(
    long_text,
    max_length=512,        # Limit to 512 tokens
    truncation=True,       # Cut off excess
    padding='max_length',  # Pad to 512
    return_tensors="pt"
)
# Get token count (useful for API limits)
token_count = len(tokenizer.encode(texts[0]))
print(f"Token count: {token_count}")
# Encode/decode special tokens
text_with_special = tokenizer.decode([50256]) # GPT-2 end-of-text token
print(f"Special token: {text_with_special}")
Token Count and Cost Estimation
Understanding token counts is crucial for API usage and costs:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Rule of thumb: 1 token ≈ 0.75 words (English)
text = "The quick brown fox jumps over the lazy dog"
tokens = tokenizer.encode(text)
print(f"Words: {len(text.split())}")
print(f"Tokens: {len(tokens)}")
print(f"Ratio: {len(text.split()) / len(tokens):.2f} words per token")
# OpenAI pricing example (as of 2023)
# GPT-3.5-turbo: $0.002 per 1K tokens
total_tokens = len(tokens)
cost = (total_tokens / 1000) * 0.002
print(f"Estimated cost: ${cost:.6f}")
Token Limit Constraints:
- GPT-3.5: 4,096 tokens (input + output)
- GPT-4: 8,192 or 32,768 tokens
- Claude 2: 100,000 tokens
- Llama 2: 4,096 tokens
Exceeding limits = truncation or API errors!
Embeddings: From IDs to Meaning
After tokenization, we have token IDs (integers). But neural networks need to understand semantic relationships. A token ID is just an arbitrary number; it doesn't capture that "king" and "queen" are related, or that "cat" and "dog" are both animals.
Embeddings solve this by converting token IDs into dense vectors (lists of numbers) where semantically similar tokens have similar vectors.
Token ID
A unique integer for each token
Example: 2891 for "king"
Embedding Vector
Dense vector capturing meaning
Example: [0.2, -0.5, 0.1, ..., -0.1] (768 dims)
Embedding Table
Lookup matrix: vocab_size × embedding_dim
Example: 50,000 × 768 = 38.4M parameters
Learning Process
Embeddings are learned during pre-training via backpropagation, not predefined
How Embeddings Work
An embedding layer is essentially a lookup table (matrix) with shape: [vocabulary_size, embedding_dimension]
Embedding Table Structure
Vocabulary size: 50,000 tokens
Embedding dimension: 768 (GPT-2), 12,288 (GPT-3 175B), 4,096 (Llama 2 7B)
Embedding Matrix Shape: [50,000 × 768]
Total parameters: 38,400,000
Token ID 2891 ("king") → row 2891 → [0.21, -0.53, 0.12, 0.31, ..., -0.08]
Token ID 2737 ("queen") → row 2737 → [0.19, -0.51, 0.15, 0.29, ..., -0.06]
Token ID 1234 ("apple") → row 1234 → [0.05, 0.23, -0.34, 0.01, ..., 0.42]
Notice "king" and "queen" have very similar vectors (small differences). "apple" is quite different. This similarity structure is learned automatically during training!
Why Embeddings Capture Meaning
During pre-training, the model learns to predict the next token. To do this well, it must learn that:
- "The king sat on the throne" and "The queen sat on the throne" are both valid
- So "king" and "queen" should have similar embeddings
- But "The apple sat on the throne" is unusual, so "apple" embeddings should differ
Through billions of training examples, embeddings naturally organize themselves to reflect semantic similarity. This is called distributional semantics: "words that appear in similar contexts have similar meanings."
Famous Word Analogy Property
One of the most fascinating properties of embeddings is vector arithmetic that captures relationships:
embedding("king") - embedding("man") + embedding("woman") β embedding("queen")
embedding("Paris") - embedding("France") + embedding("Italy") β embedding("Rome")
embedding("walking") - embedding("walk") + embedding("run") β embedding("running")
embedding("big") - embedding("bigger") + embedding("small") β embedding("smaller")
This works because embeddings capture relationships like:
- Gender: male ↔ female
- Geography: capital ↔ country
- Grammar: base form ↔ inflection
- Size/degree: comparative relationships
Key Insight: These relationships emerge naturally from training. They're not programmed; they're learned by observing how words co-occur in billions of sentences!
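One way to try the analogy test yourself is with gensim's pre-trained static GloVe vectors. This is a sketch, not part of this tutorial's stack: the first run downloads a sizeable model, and the exact similarity scores depend on which embedding model you load.

# Analogy test with pre-trained static embeddings (gensim)
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

# king - man + woman ~= ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically [('queen', ~0.77)]

# paris - france + italy ~= ?  (GloVe tokens are lowercase)
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# Typically [('rome', ...)]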
Visualizing Embeddings
Embeddings are high-dimensional (768D for GPT-2). We can use dimensionality reduction (t-SNE, PCA) to visualize them in 2D:
Conceptual Visualization (2D projection)
* queen
* king
* woman * prince
* girl * princess
* cat
* dog * Python
* lion * JavaScript
* tiger * Java
* apple
* banana
* orange
Similar words cluster together. Related concepts (royalty, animals, programming, fruits) form distinct regions.
Evolution of Word Embeddings
Static Embeddings (2013-2018)
Early approaches learned fixed embeddings for each word:
- Word2Vec (2013): CBOW and Skip-gram models. Each word = one vector.
- GloVe (2014): Global vectors based on co-occurrence statistics.
- FastText (2016): Uses character n-grams, handles rare words better.
Limitation: One embedding per word, regardless of context. "bank" (river) vs "bank" (money) have the same embedding.
Contextualized Embeddings (2018-Present)
Modern LLMs generate different embeddings based on context:
- ELMo (2018): Bidirectional LSTM, context-aware embeddings.
- BERT (2018): Transformer-based, deeply contextualized.
- GPT-3 (2020): Contextual embeddings in massive decoder-only model.
Example of Contextualization:
Sentence 1: "I went to the river bank."
"bank" embedding: [0.2, 0.5, -0.1, ...] (closer to "river", "water")
Sentence 2: "I deposited money at the bank."
"bank" embedding: [0.8, -0.2, 0.3, ...] (closer to "money", "financial")
Same word, different contexts β different embeddings!
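You can observe this effect directly by extracting the embedding of "bank" from BERT in different sentences and comparing them. A sketch follows; the exact similarity values will vary, but the same-sense pair should score higher than the cross-sense pair.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_embedding(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

river = bank_embedding("I went to the river bank.")
money = bank_embedding("I deposited money at the bank.")
same  = bank_embedding("I sat on the river bank.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))   # lower similarity: different senses
print(cos(river, same, dim=0))    # higher similarity: same sense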
Embedding Dimensions
Different models use different embedding dimensions:
| Model | Embedding Dimension |
|---|---|
| GPT-2 (small) / BERT-base | 768 |
| BERT-large | 1,024 |
| Llama 2 7B | 4,096 |
| Llama 2 70B | 8,192 |
| GPT-3 (175B) | 12,288 |
Higher dimensions = more capacity to represent nuanced meanings, but also more parameters to train.
Code: Extracting and Using Embeddings
import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "I love machine learning"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")
print(f"Token IDs: {inputs['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    # Last hidden state = final (contextualized) embeddings
    embeddings = outputs.last_hidden_state

print(f"\nEmbedding shape: {embeddings.shape}")
# Output: torch.Size([1, 6, 768])
# [batch_size, sequence_length, embedding_dim]

# Access individual token embeddings
for i, token in enumerate(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])):
    token_embedding = embeddings[0, i, :]
    print(f"Token '{token}': {token_embedding[:5]}...")  # First 5 dims
Output
Token IDs: tensor([[ 101, 1045, 2293, 3698, 4083,  102]])
Tokens: ['[CLS]', 'i', 'love', 'machine', 'learning', '[SEP]']

Embedding shape: torch.Size([1, 6, 768])

Token '[CLS]': tensor([-0.1234, 0.5678, -0.9012, 0.3456, -0.7890])...
Token 'i': tensor([0.2345, -0.6789, 0.0123, -0.4567, 0.8901])...
Token 'love': tensor([0.3456, 0.7890, -0.1234, 0.5678, -0.9012])...
...
Sentence Embeddings
For tasks like semantic search or clustering, we need a single vector for an entire sentence. Common approaches:
1. Mean Pooling
import torch

# Average all token embeddings
sentence_embedding = embeddings.mean(dim=1)  # Shape: [1, 768]
print(f"Sentence embedding shape: {sentence_embedding.shape}")
2. CLS Token (BERT-style)
# Use the [CLS] token embedding (first token)
sentence_embedding = embeddings[:, 0, :]  # Shape: [1, 768]
3. Max Pooling
# Take maximum value across sequence for each dimension
sentence_embedding = embeddings.max(dim=1)[0]  # Shape: [1, 768]
4. Specialized Sentence Embedders
from sentence_transformers import SentenceTransformer

# Models specifically trained for sentence similarity
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "I love machine learning",
    "Deep learning is fascinating",
    "I enjoy cooking pasta"
]

embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 384)

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(embeddings)
print(f"\nSimilarity matrix:\n{similarity}")
# Sentences 1 and 2 will have high similarity (both about ML)
# Sentence 3 will have low similarity to 1 and 2
Output
Embeddings shape: (3, 384)

Similarity matrix:
[[1.0  0.72 0.15]  ← "I love machine learning"
 [0.72 1.0  0.18]  ← "Deep learning is fascinating"
 [0.15 0.18 1.0 ]] ← "I enjoy cooking pasta"

ML sentences are 72% similar
ML vs cooking is only 15% similar
Applications of Embeddings
- Semantic Search: Convert queries and documents to embeddings; find nearest neighbors for relevant results
- Recommendation: Embed items and users; recommend items with embeddings similar to user preferences
- Clustering: Group similar documents/texts by clustering their embeddings (k-means, DBSCAN)
- Classification: Use embeddings as features for downstream classifiers (sentiment, topic, intent)
- Translation: Map embeddings across languages using aligned embedding spaces
- RAG Systems: Retrieve relevant documents by embedding similarity before LLM generation
Embedding Quality Matters: Different models produce embeddings of different quality. For semantic search, use models trained specifically for similarity (e.g., sentence-transformers), not general LLM embeddings.
Positional Embeddings: Adding Order Information
Token embeddings alone have a critical limitation: they don't capture position. The sentences "John loves Mary" and "Mary loves John" would have identical token embeddings (just in different order), but they mean completely different things!
Transformers process all tokens in parallel (unlike RNNs which process sequentially), so they need an explicit way to encode position information. This is where positional embeddings come in.
The Problem
Without positional information, these are indistinguishable to the model:
"The cat sat on the mat" = "sat on the The mat cat" (same tokens, different order)
How Positional Embeddings Work
The final input embedding is the sum of token embedding and positional embedding:
Input Embedding = Token Embedding + Positional Embedding
For "John loves Mary":
Token "John" at position 0:
Token embedding: [0.5, -0.2, 0.8, ...]
Positional embedding: [0.0, 0.1, 0.0, ...]
Final embedding: [0.5, -0.1, 0.8, ...]
Token "loves" at position 1:
Token embedding: [0.3, 0.7, -0.1, ...]
Positional embedding: [0.1, 0.05, 0.02, ...]
Final embedding: [0.4, 0.75, -0.08, ...]
Token "Mary" at position 2:
Token embedding: [-0.1, 0.4, 0.6, ...]
Positional embedding: [0.15, 0.03, 0.04, ...]
Final embedding: [0.05, 0.43, 0.64, ...]
Now the same token at different positions has different final embeddings, preserving word order!
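In code, this is just an element-wise sum of two lookup tables. A minimal sketch with illustrative sizes and token IDs, using learned (BERT/GPT-style) positional embeddings:

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50_000, 512, 768   # illustrative sizes

token_emb = nn.Embedding(vocab_size, d_model)      # one row per token ID
pos_emb   = nn.Embedding(max_len, d_model)         # one row per position

token_ids = torch.tensor([[2891, 3456, 2737]])                 # e.g. "John loves Mary" (IDs illustrative)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)       # [[0, 1, 2]]

input_embeddings = token_emb(token_ids) + pos_emb(positions)   # element-wise sum
print(input_embeddings.shape)                                  # torch.Size([1, 3, 768])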
Types of Positional Encodings
1. Sinusoidal Positional Encoding (Original Transformer)
The original "Attention Is All You Need" paper used fixed sinusoidal functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos = position in sequence (0, 1, 2, ...)
- i = dimension index (0 to d_model/2)
- d_model = embedding dimension (512, 768, etc.)
Advantages:
- Deterministic (no learning required)
- Works for any sequence length (can extrapolate)
- Smooth gradual change with position
- Model can learn to attend to relative positions
Used by: Original Transformer, many encoder-decoder models
2. Learned Positional Embeddings (BERT, GPT)
Modern LLMs learn positional embeddings during training, just like token embeddings:
Positional Embedding Table:
[
Position 0: [0.01, -0.05, 0.12, ...], # Learned vector for position 0
Position 1: [0.03, -0.02, 0.15, ...], # Learned vector for position 1
Position 2: [0.05, 0.01, 0.18, ...], # Learned vector for position 2
...
Position 511: [0.45, 0.32, -0.21, ...] # Max position for BERT
]
Advantages:
- Can be optimized for specific tasks during training
- Often performs slightly better than sinusoidal
- Simpler to implement
Limitations:
- Fixed maximum sequence length (can't extrapolate beyond training length)
- Requires learning additional parameters
Used by: BERT, GPT-2, GPT-3, RoBERTa
3. Relative Positional Encoding (T5, Transformer-XL)
Instead of absolute positions, encode relative distances between tokens:
Instead of: "token at position 5"
Encode: "token is 3 positions before the current token"
Benefits:
- Better generalization to longer sequences
- Captures relationships more naturally ("two words apart" matters more than "position 5 vs 7")
- Can handle unbounded sequence lengths
Used by: T5, Transformer-XL, DeBERTa
4. Rotary Position Embedding (RoPE) - Modern LLMs
Recent models use RoPE, which rotates embedding vectors based on position:
- Encodes absolute position information
- Also captures relative positions naturally
- Better extrapolation to longer sequences
- More efficient than learned embeddings
Used by: Llama, Llama 2, GPT-NeoX, PaLM
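For intuition, here is a minimal sketch of the rotation RoPE applies, using the interleaved even/odd dimension pairing of the original formulation. Real implementations apply this to the per-head query and key vectors inside attention, not to the input embeddings directly:

import torch

def rotary_embed(x, base=10000.0):
    """Apply a rotary position embedding to x of shape [seq_len, d_model] (d_model even).

    Minimal sketch: each adjacent dimension pair is rotated by an angle that
    grows with position, at a pair-specific frequency.
    """
    seq_len, d_model = x.shape
    half = d_model // 2
    # One rotation frequency per dimension pair
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * inv_freq  # [seq_len, half]
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[:, 0::2], x[:, 1::2]           # split into even/odd dimension pairs
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin    # 2-D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(6, 64)          # 6 positions, one 64-dim attention head
q_rot = rotary_embed(q)
print(q_rot.shape)              # torch.Size([6, 64])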
5. ALiBi (Attention with Linear Biases)
Adds position-dependent bias to attention scores (no position embeddings at all):
- Very simple: penalize attention based on distance
- Excellent extrapolation to longer sequences
- No additional parameters
Used by: BLOOM, MPT, some Llama variants
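A sketch of the ALiBi bias matrix (causal form) makes the idea concrete. The slope formula here is the geometric sequence the ALiBi paper uses for power-of-two head counts:

import torch

def alibi_bias(seq_len, num_heads=8):
    """Build the ALiBi bias added to causal attention scores (sketch).

    Each head has a fixed slope m; the bias is -m * (query_pos - key_pos),
    so attention to distant past tokens is linearly penalized. Nothing is learned.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions.unsqueeze(1) - positions.unsqueeze(0)).clamp(min=0)  # i - j for past tokens
    return -slopes.view(num_heads, 1, 1) * distance      # [num_heads, seq_len, seq_len]

bias = alibi_bias(seq_len=5)
print(bias[0])   # head 0: zeros on the diagonal, increasingly negative toward older tokens
# In attention: scores = (q @ k.transpose(-2, -1)) / sqrt(d_head) + bias, then causal mask + softmax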
Visualizing Positional Encodings
Sinusoidal Positional Encoding Pattern
Position Encoding Heatmap (first 8 dimensions, positions 0-16):
Dim Pos: 0 1 2 3 4 5 6 7 8 ...
0: [ 0.0 0.8 1.0 0.8 0.0 -0.8 -1.0 -0.8 0.0 ...]
1: [ 1.0 0.5 0.0 -0.5 -1.0 -0.5 0.0 0.5 1.0 ...]
2: [ 0.0 0.3 0.5 0.7 0.9 1.0 0.9 0.7 0.5 ...]
3: [ 1.0 0.9 0.8 0.6 0.3 0.0 -0.3 -0.6 -0.8 ...]
...
Notice: Each dimension has a different frequency
→ Model can learn to attend to positions using these patterns
Code: Adding Positional Embeddings
import torch
import math
def sinusoidal_positional_encoding(seq_len, d_model):
    """Generate sinusoidal positional encodings."""
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
    )
    pos_encoding = torch.zeros(seq_len, d_model)
    pos_encoding[:, 0::2] = torch.sin(position * div_term)
    pos_encoding[:, 1::2] = torch.cos(position * div_term)
    return pos_encoding
# Generate for sequence length 10, embedding dim 512
pos_enc = sinusoidal_positional_encoding(seq_len=10, d_model=512)
print(f"Positional encoding shape: {pos_enc.shape}") # [10, 512]
# Visualize first position
print(f"Position 0 encoding (first 10 dims): {pos_enc[0, :10]}\")
Practical Example with Transformers
from transformers import BertModel, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
text = "John loves Mary"
inputs = tokenizer(text, return_tensors="pt")
# BERT automatically adds positional embeddings
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Token embeddings (before positional encoding is added)
token_embeddings = model.embeddings.word_embeddings(inputs['input_ids'])
print(f"Token embeddings shape: {token_embeddings.shape}") # [1, 5, 768]
# Positional embeddings
position_ids = torch.arange(inputs['input_ids'].size(1)).unsqueeze(0)
position_embeddings = model.embeddings.position_embeddings(position_ids)
print(f"Position embeddings shape: {position_embeddings.shape}\") # [1, 5, 768]
# Combined embeddings (token + position + segment, after layer norm and dropout)
final_embeddings = outputs.hidden_states[0]
print(f"Combined embedding shape: {final_embeddings.shape}")  # [1, 5, 768]
Key Takeaway: Positional embeddings are crucial for Transformers. Without them, the model has no sense of word order, making it impossible to understand language properly!
Context Window Limits
Positional embeddings impose sequence length limits:
- BERT: 512 tokens (learned positional embeddings)
- GPT-2: 1,024 tokens (learned)
- GPT-3: 2,048 tokens (learned)
- GPT-3.5/4: 4k-32k tokens (advanced techniques)
- Claude 2: 100k tokens (extended context methods)
- Llama 2: 4,096 tokens (RoPE, can be extended)
Going Beyond Training Length: Models struggle with sequences longer than training length when using learned positional embeddings. Modern techniques like RoPE and ALiBi enable better extrapolation to longer contexts.
Practical Tips & Best Practices
Tokenization Best Practices
- Use the right tokenizer: Always use the tokenizer that matches your model (GPT-2 tokenizer for GPT-2, BERT tokenizer for BERT)
- Watch token counts: APIs charge per token. Use len(tokenizer.encode(text)) to estimate costs
- Handle truncation gracefully: Long documents should be chunked intelligently, not just cut off mid-sentence
- Test on your data: Check how your specific text tokenizes (especially domain-specific terms, code, non-English text)
- Be aware of special tokens: Models add [CLS], [SEP], etc. Factor these into token counts
Embedding Best Practices
- Use pre-trained embeddings: Don't train from scratch unless you have massive datasets
- Fine-tune when needed: For domain-specific tasks, fine-tune embeddings on your data
- Match embedding model to task: Use sentence-transformers for semantic search, not general LLM embeddings
- Normalize embeddings: For cosine similarity, normalize vectors to unit length (see the sketch after this list)
- Cache embeddings: Computing embeddings is expensive. Cache and reuse when possible
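A quick sketch of the normalization tip above; random vectors stand in for real cached embeddings:

import numpy as np

# Normalize embeddings to unit length so a dot product equals cosine similarity
embeddings = np.random.rand(5, 384).astype("float32")          # e.g. 5 cached sentence embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / norms

query = unit_embeddings[0]
scores = unit_embeddings @ query     # dot products = cosine similarities (all vectors unit length)
print(scores)                        # scores[0] == 1.0 (self-similarity)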
Common Pitfalls to Avoid
- Tokenizer mismatch: Using wrong tokenizer for a model causes gibberish output
- Ignoring special tokens: Forgetting to account for [CLS], [SEP] in token counts
- Exceeding context length: Going beyond model's max length causes errors or truncation
- Not handling edge cases: Emojis, code, URLs can tokenize unexpectedly
- Using wrong pooling: CLS token pooling doesn't work well for non-BERT models
Real-World Example: Building a Semantic Search System
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# 1. Load specialized embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Your document corpus
documents = [
\"Machine learning is a subset of AI\",
\"Deep learning uses neural networks\",
\"Python is a programming language\",
\"Natural language processing handles text\",
\"Computer vision processes images\"
]
# 3. Embed all documents (do this once, cache results)
doc_embeddings = model.encode(documents)
print(f\"Document embeddings shape: {doc_embeddings.shape}\") # (5, 384)
# 4. User query
query = \"Tell me about artificial intelligence\"
query_embedding = model.encode([query])
# 5. Find most similar documents
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
# 6. Rank and retrieve
ranked_indices = np.argsort(similarities)[::-1] # Descending order
print(\"\\nSearch Results:\")
for i, idx in enumerate(ranked_indices[:3], 1):
    print(f"{i}. {documents[idx]} (similarity: {similarities[idx]:.3f})")
Output
Document embeddings shape: (5, 384)
Search Results:
1. Machine learning is a subset of AI (similarity: 0.652)
2. Deep learning uses neural networks (similarity: 0.531)
3. Natural language processing handles text (similarity: 0.398)
The system correctly retrieves AI-related documents first!
Chapter Summary
Key Concepts Mastered:
- Tokenization Fundamentals: Converting text into discrete units (tokens) that models can process
- Tokenization Strategies: Character-level (small vocab, long sequences), Word-level (intuitive but large vocab), Subword (optimal balance)
- BPE Algorithm: Iteratively merges the most frequent character pairs to build a subword vocabulary
- Modern Tokenizers: GPT uses BPE, BERT uses WordPiece, Llama uses SentencePiece
- Token IDs: Integer representation of tokens from the vocabulary (e.g., "king" → 2891)
- Embeddings: Dense vectors that capture semantic meaning, where similar words have similar vectors
- Embedding Learning: Automatically learned during pre-training, not manually programmed
- Vector Arithmetic: king - man + woman ≈ queen (relationships emerge naturally)
- Contextualized Embeddings: Modern LLMs generate different embeddings for the same word in different contexts
- Positional Embeddings: Add position information so models understand word order
- Positional Strategies: Sinusoidal (fixed), Learned (BERT/GPT), Relative (T5), RoPE (Llama), ALiBi (BLOOM)
Technical Specifications Learned:
| Component | Purpose | Example Dimensions |
|---|---|---|
| Vocabulary Size | Number of unique tokens | 30k-100k tokens |
| Token ID | Integer index in vocabulary | 0 to vocab_size-1 |
| Embedding Dimension | Vector size for each token | 768 (BERT/GPT-2), 4096 (Llama) |
| Embedding Table | Lookup matrix for embeddings | [50k × 768] = 38.4M params |
| Position Embedding | Encode token position in sequence | Same dim as token embeddings |
| Context Length | Max sequence length | 512-100k tokens (model-dependent) |
Code Skills Acquired:
- Load and use HuggingFace tokenizers for different models
- Tokenize text and decode token IDs back to text
- Count tokens for API cost estimation
- Handle truncation and padding for batch processing
- Extract token embeddings from pre-trained models
- Generate sentence embeddings (mean pooling, CLS token)
- Compute semantic similarity using cosine similarity
- Build semantic search systems with embeddings
- Implement sinusoidal positional encoding
Practical Applications:
- Semantic Search: Embed documents, find similar ones via cosine similarity
- Text Classification: Use embeddings as features for sentiment, topic, and intent classification
- Clustering: Group similar documents by clustering embeddings
- RAG Systems: Retrieve relevant context using embedding similarity before generation
What's Next?
Now that you understand how text becomes numbers (tokenization) and how those numbers capture meaning (embeddings), you're ready to learn how to communicate effectively with LLMs.
In the next tutorial, Prompt Engineering, you'll master:
- Crafting effective prompts for different tasks
- Zero-shot, one-shot, and few-shot learning techniques
- Chain-of-thought prompting for complex reasoning
- Prompt templates and best practices
- Common pitfalls and how to avoid them
Excellent Progress! You've completed Module 2 and now have a deep understanding of the input pipeline for LLMs. Tokenization and embeddings are the foundation; every LLM interaction starts here. This knowledge will be invaluable as we explore more advanced topics!
Quick Self-Check Questions
- Why do modern LLMs use subword tokenization instead of word-level?
- How does BPE decide which character pairs to merge?
- What does it mean that embeddings are "learned"?
- Why do we need positional embeddings in Transformers?
- What's the difference between static and contextualized embeddings?
Answers: (1) Balances vocab size and sequence length, handles rare words; (2) Merges most frequent pairs iteratively; (3) Updated during training via backpropagation, not predefined; (4) Transformers process in parallel, no inherent position info; (5) Static = same vector always, contextualized = different based on surrounding words
Test Your Knowledge
Q1: What is the main advantage of Byte-Pair Encoding (BPE) tokenization?
Q2: How does BPE build its vocabulary?
Q3: How are learned embeddings different from predefined embeddings?
Q4: Why do Transformers need positional embeddings?
Q5: What is the difference between static and contextualized embeddings?