A Python-Centric Walkthrough of the Technical Mechanisms
by dev
Self-attention is at the heart of transformers. Here’s how you might implement single-head self-attention in Python using NumPy:
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)
    Q = X @ W_q  # Queries: (seq_len, d_k)
    K = X @ W_k  # Keys:    (seq_len, d_k)
    V = X @ W_v  # Values:  (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # Scaled dot-product: (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)  # Shift by row max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # Softmax over keys
    output = weights @ V                          # Weighted average of values: (seq_len, d_k)
    return output
# Example usage:
seq_len, d_model, d_k = 5, 16, 8
X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)
attn_output = self_attention(X, W_q, W_k, W_v)
This code computes the attention output for a sequence of five embeddings: each output row is a weighted average of the value vectors, with the weights given by the softmax of the scaled query-key dot products.
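To make the weighted-average interpretation concrete, here is a small sanity check that reuses the example setup above and recomputes the weights outside the function with the same math:

# Recompute the attention weights from the example inputs (same steps as in self_attention)
Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)   # same numerical-stability shift as above
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(attn_output.shape)     # (5, 8): one d_k-dimensional output per input position
print(weights.sum(axis=-1))  # each row of attention weights sums to 1.0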
Self-attention alone is order-agnostic, so transformers add positional information to the token embeddings. Here’s a function for sinusoidal positional encoding:
def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is already the even dimension index (2i)
            angle = pos / (10000 ** (i / d_model))
            PE[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                PE[pos, i + 1] = np.cos(angle)
    return PE
# Example usage:
PE = positional_encoding(5, 16)
This function returns a (seq_len, d_model) matrix that you add element-wise to your embeddings before feeding them to the transformer.
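As a minimal sketch of that step, reusing X, the projection matrices, and the dimensions from the self-attention example above:

# Add positional information to the token embeddings, then run self-attention as before
X_with_pos = X + positional_encoding(seq_len, d_model)  # both are (5, 16)
attn_output_pos = self_attention(X_with_pos, W_q, W_k, W_v)
print(attn_output_pos.shape)  # still (5, 8)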