A Python-Centric Walkthrough of the Technical Mechanisms
by dev
Self-attention is at the heart of transformers. Here’s how you might implement single-head self-attention in Python using NumPy:
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)
    Q = X @ W_q  # Queries: (seq_len, d_k)
    K = X @ W_k  # Keys:    (seq_len, d_k)
    V = X @ W_v  # Values:  (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # Scaled dot-product: (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)  # Shift by row max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # Softmax over keys
    output = weights @ V                          # Weighted average of values: (seq_len, d_k)
    return output
# Example usage:
seq_len, d_model, d_k = 5, 16, 8
X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)
attn_output = self_attention(X, W_q, W_k, W_v)
This code computes the attention output for a sequence of five embeddings: each output row is a weighted average of the value vectors, with the weights given by the softmax of the scaled query-key dot products.
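To make the weighted-average interpretation concrete, here is a small sanity check that reuses the example setup above and recomputes the weights outside the function with the same math:

# Recompute the attention weights from the example inputs (same steps as in self_attention)
Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)   # same numerical-stability shift as above
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(attn_output.shape)     # (5, 8): one d_k-dimensional output per input position
print(weights.sum(axis=-1))  # each row of attention weights sums to 1.0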
Self-attention alone is order-agnostic, so transformers add positional information to the token embeddings. Here’s a function for sinusoidal positional encoding:
def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is already the even dimension index (2i)
            angle = pos / (10000 ** (i / d_model))
            PE[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                PE[pos, i + 1] = np.cos(angle)
    return PE
# Example usage:
PE = positional_encoding(5, 16)
This function returns a (seq_len, d_model) matrix that you add element-wise to your embeddings before feeding them to the transformer.
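As a minimal sketch of that step, reusing X, the projection matrices, and the dimensions from the self-attention example above:

# Add positional information to the token embeddings, then run self-attention as before
X_with_pos = X + positional_encoding(seq_len, d_model)  # both are (5, 16)
attn_output_pos = self_attention(X_with_pos, W_q, W_k, W_v)
print(attn_output_pos.shape)  # still (5, 8)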