What if you could just… invent new handwriting?

Not copy it. Not trace it. Genuinely generate a brand new handwritten digit that has never existed before — but looks completely real.
That’s what a Variational Autoencoder does. And in this post, we’re going to build one from scratch in PyTorch, trained on MNIST, that can generate any digit on demand.
1. What Even Is an Autoencoder?

Before we get to the variational part, let’s understand the base concept.
An autoencoder is a neural network with two jobs:
```
Input image (784 pixels)
        ↓
   [ Encoder ] → latent vector z (small, compressed)
        ↓
   [ Decoder ] → reconstructed image (784 pixels)
```
You train it to reconstruct its own input. The encoder is forced to compress the image into a small bottleneck — and the decoder learns to rebuild it from that compressed form.
The problem? The latent space is patchy. Points between encoded examples decode into garbage. You can’t sample freely. You can’t generate.
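To make the bottleneck concrete, here is a minimal plain autoencoder in PyTorch. The layer sizes mirror the ones we use later, but this sketch is only for illustration and is not the model we train in this post:

```python
import torch
import torch.nn as nn

# A plain autoencoder: squeeze 784 pixels through a small code, then rebuild.
class Autoencoder(nn.Module):
    def __init__(self, latent_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 400), nn.ReLU(),
                                     nn.Linear(400, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                     nn.Linear(400, 784), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)      # bottleneck: 784 -> 20
        return self.decoder(z)   # rebuild:    20 -> 784

x = torch.rand(8, 784)           # a fake batch of flattened images
recon = Autoencoder()(x)
print(recon.shape)               # torch.Size([8, 784])
```

Note that z here is a single deterministic point per image, which is exactly what causes the patchy latent space.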
2. The Variational Trick

Instead of encoding an image to a single point z, a VAE encodes it to a distribution: a mean μ and a variance σ².
```
Input image
     ↓
  Encoder
     ↓
μ and logvar ← two separate outputs
     ↓
Sample z = μ + ε × σ,  where ε ~ N(0, 1)
     ↓
  Decoder → reconstructed image
```
This is the reparameterisation trick — it makes sampling differentiable so gradients can flow back through it.
Because training forces the distributions to overlap and stay close to a standard Gaussian, the latent space becomes smooth and continuous. Any random point you sample will decode into something meaningful.
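You can verify the "gradients flow through it" claim in a few lines. The randomness is isolated in ε, so z is an ordinary differentiable function of μ and logvar (the numbers below are arbitrary):

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
logvar = torch.tensor([-1.0], requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)      # randomness lives here, outside the graph
z = mu + eps * std               # z is differentiable w.r.t. mu and logvar

z.sum().backward()
print(mu.grad)                   # tensor([1.]), since dz/dmu = 1
print(logvar.grad is not None)   # True: gradient reaches logvar too
```

If we had sampled z directly from N(μ, σ²) instead, there would be no path for the gradient to reach the encoder at all.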
3. Making It Conditional

A standard VAE generates random digits — you have no control over which one comes out.
A Conditional VAE (CVAE) fixes this. We append a one-hot class label to both the encoder input and the decoder input:
```python
import torch.nn.functional as F

def one_hot(labels, num_classes=10):
    return F.one_hot(labels, num_classes).float()
```
Now the model knows which digit it’s encoding and which one it should generate. Tell it digit=7, it generates a 7.
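A quick sanity check of the labels we will be appending:

```python
import torch
import torch.nn.functional as F

def one_hot(labels, num_classes=10):
    return F.one_hot(labels, num_classes).float()

print(one_hot(torch.tensor([7, 0, 3])))
# tensor([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
#         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#         [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]])
```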
4. The Dataset — MNIST

60,000 grayscale images of handwritten digits, 28×28 pixels, 10 classes.
```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_dataset = datasets.MNIST(root='./data', train=True, download=True,
                               transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
```
Each pixel value is between 0 and 1 (after ToTensor). This is important — the decoder uses Sigmoid as its final activation, which also outputs values in [0,1].
5. The CVAE Architecture

Three parts. Clean and simple.
Encoder:
```
Input: 784 (image) + 10 (label) = 794 dims
  → Linear(794, 400) → ReLU
  → fc_mu:     Linear(400, 20)  ← mean
  → fc_logvar: Linear(400, 20)  ← log variance
```
Reparameterise:
```python
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std
```
Decoder:
```
Input: 20 (z) + 10 (label) = 30 dims
  → Linear(30, 400) → ReLU
  → Linear(400, 784) → Sigmoid
  → reshape to (1, 28, 28)
```
The label is concatenated at both ends — encoder and decoder — so the model conditions its entire process on which digit it’s working with.
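Assembled into one module, the architecture above might look like this. This is a sketch matching the sizes listed; the `encode`/`decode`/`reparameterize` method names are my own convention, chosen so `decode` lines up with the generation code later in the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, latent_dim=20, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(784 + num_classes, 400)          # encoder: image + label
        self.fc_mu = nn.Linear(400, latent_dim)
        self.fc_logvar = nn.Linear(400, latent_dim)
        self.fc2 = nn.Linear(latent_dim + num_classes, 400)   # decoder: z + label
        self.fc3 = nn.Linear(400, 784)

    def encode(self, x, y):
        h = F.relu(self.fc1(torch.cat([x.view(-1, 784), y], dim=1)))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z, y):
        h = F.relu(self.fc2(torch.cat([z, y], dim=1)))
        return torch.sigmoid(self.fc3(h)).view(-1, 1, 28, 28)

    def forward(self, x, y):
        mu, logvar = self.encode(x, y)
        z = self.reparameterize(mu, logvar)
        return self.decode(z, y), mu, logvar
```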
6. The Loss Function — Two Terms

The VAE loss is the sum of two things:
```python
# Reconstruction: per-pixel binary cross-entropy, summed over the batch.
BCE = F.binary_cross_entropy(recon_x, x, reduction='sum')
# Regulariser: closed-form KL divergence between N(mu, sigma^2) and N(0, 1).
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = BCE + KLD
```
BCE (Reconstruction Loss) — how well the decoder rebuilt the original image. Lower = better reconstruction.
KL Divergence — how close the learned distribution is to N(0,1). This is the regulariser. Without it, the encoder could just memorise exact points and the latent space would collapse back into a regular autoencoder.
The two terms are in tension — BCE wants to memorise, KLD wants to generalise. The balance between them is what creates a smooth, generative latent space.
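Wrapped up as the `loss_function` the training loop calls below (a sketch; the only requirement is that `recon_x` and `x` flatten to matching shapes with values in [0, 1]):

```python
import torch
import torch.nn.functional as F

def loss_function(recon_x, x, mu, logvar):
    # Reconstruction term: how well the decoder rebuilt the pixels.
    bce = F.binary_cross_entropy(recon_x.view(-1, 784), x.view(-1, 784),
                                 reduction='sum')
    # Regulariser: KL divergence to the standard Gaussian, in closed form.
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```

A useful check on the KL term: it is exactly zero when μ = 0 and logvar = 0, i.e. when the encoder's distribution already equals N(0, 1).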
7. Training

100 epochs. Adam. lr=1e-3. Batch size 128.
```python
for epoch in range(epochs):
    for imgs, labels in train_loader:
        imgs = imgs.to(device)
        labels_onehot = one_hot(labels).to(device)
        optimizer.zero_grad()
        recon, mu, logvar = model(imgs, labels_onehot)
        loss = loss_function(recon, imgs, mu, logvar)
        loss.backward()
        optimizer.step()
```
Watch the loss drop:
```
Epoch  1 → Loss: 162.79
Epoch 11 → Loss: 102.25
Epoch 21 → Loss:  99.71
Epoch 51 → Loss:  97.11
Epoch 91 → Loss:  95.72
```
8. Generate on Demand

Sample a random z from N(0,1), condition on a digit, decode:
```python
@torch.no_grad()  # inference only: no gradients needed
def show_generated_digit(model, digit=4):
    z = torch.randn(1, latent_dim).to(device)
    labels = one_hot(torch.tensor([digit]), num_classes).to(device)
    generated = model.decode(z, labels).cpu()
    plt.imshow(generated[0].squeeze(), cmap='gray')
    plt.show()
```
Pass digit=4 → get a 4. digit=7 → get a 7. Every call returns a different sample because z is random each time.
The model has learned the concept of each digit — not just the training images.
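The same idea, looped over every class, gives you a one-row gallery of all ten digits. This sketch assumes the model's `decode(z, labels)` from above and a latent dimension of 20; it calls `torch.nn.functional.one_hot` directly so it stays self-contained:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

latent_dim, num_classes = 20, 10

@torch.no_grad()
def generate_all_digits(model):
    # One fresh z per class, each conditioned on its own label.
    z = torch.randn(num_classes, latent_dim)
    labels = F.one_hot(torch.arange(num_classes), num_classes).float()
    images = model.decode(z, labels).cpu()   # expected shape: (10, 1, 28, 28)

    fig, axes = plt.subplots(1, num_classes, figsize=(num_classes, 1.5))
    for digit, ax in enumerate(axes):
        ax.imshow(images[digit].squeeze(), cmap='gray')
        ax.set_title(str(digit))
        ax.axis('off')
    plt.show()
```

Because every z is a fresh sample, rerunning the function gives a new handwriting style for each digit every time.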
The Big Picture

| Step | What we did |
|---|---|
| Dataset | MNIST — 60k handwritten digit images |
| Model | Conditional VAE — encoder + reparameterise + decoder |
| Conditioning | One-hot labels appended at encoder and decoder inputs |
| Loss | BCE (reconstruction) + KL Divergence (regularisation) |
| Train | 100 epochs · Adam · loss: 162 → 95 |
| Generate | Sample z ~ N(0,1), condition on digit, decode |
VAEs sit at the intersection of deep learning and probabilistic modelling. The latent space isn’t just a compressed representation — it’s a learned probability distribution over the space of possible images.
Everything after this — diffusion models, GANs, multimodal generators — tackles the same foundational problem: learning a distribution over data that you can sample from.
Built with PyTorch · MNIST dataset · Conditional VAE from scratch