Fine-Tuning GPT-2 with PEFT-LoRA

Fine-tune GPT-2 on the English Quotes dataset using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters.

1. Install Dependencies

Install the required libraries: Hugging Face Datasets, Transformers, PEFT (for LoRA), Accelerate, and BitsAndBytes (only needed if you later add quantization; it is not used in this notebook).

In [ ]:
!pip install -q datasets transformers peft accelerate bitsandbytes

2. Basic Imports

Import core libraries: HuggingFace Datasets and Transformers for model/data, PEFT for LoRA, and PyTorch.

In [ ]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
import torch

3. Load Dataset & Tokenizer

Load the English Quotes dataset, which ships with only a train split, and carve it 90/10 into train and validation sets with a fixed seed. Load the GPT-2 tokenizer and assign eos_token as pad_token so batching works (GPT-2 has no native pad token).
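The 90/10 split can be sketched in plain Python (a toy re-implementation for illustration, not the datasets-library internals):

```python
import random

# Toy 90/10 split mirroring train_test_split(test_size=0.1, seed=42).
def split(items, test_size=0.1, seed=42):
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)           # seeded shuffle => reproducible
    test_idx = set(idx[:int(len(items) * test_size)])
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test

train, test = split(list(range(100)))
print(len(train), len(test))  # 90 10
```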

In [ ]:
dataset = load_dataset("Abirate/english_quotes")
dataset_split = dataset["train"].train_test_split(test_size=0.1, seed=42)

train_data = dataset_split["train"]
val_data = dataset_split["test"]

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

4. Tokenize

Tokenize each quote with padding and truncation to 64 tokens. Use input_ids as labels, but mask padded positions with -100 so they are ignored by the loss (otherwise the model would also be trained to predict padding, since the pad token doubles as eos here). For causal LM, the model internally shifts the labels by one position to compute the next-token prediction loss.

In [ ]:
def tokenize(batch):
    tokenized = tokenizer(batch["quote"], padding="max_length", truncation=True, max_length=64)
    # Labels = input_ids, with padded positions set to -100 (the loss ignore index).
    tokenized["labels"] = [
        [t if m == 1 else -100 for t, m in zip(ids, attn)]
        for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]
    return tokenized

train_data = train_data.map(tokenize, batched=True)
val_data = val_data.map(tokenize, batched=True)
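The internal label shift can be sketched in plain Python (a toy illustration, not the actual Transformers implementation): logits at position t predict the token at position t+1, and any target equal to -100 is skipped by the loss.

```python
def shifted_pairs(input_ids, labels, ignore_index=-100):
    # For causal LM, position t's logits predict token t+1, so the loss
    # pairs input_ids[:-1] with labels[1:], skipping ignore_index targets.
    pairs = []
    for context_tok, target in zip(input_ids[:-1], labels[1:]):
        if target != ignore_index:
            pairs.append((context_tok, target))
    return pairs

# Toy sequence: two real tokens followed by padding (labels masked to -100).
ids    = [464, 3200, 50256, 50256]
labels = [464, 3200, -100, -100]
print(shifted_pairs(ids, labels))  # [(464, 3200)]
```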

5. Load Model

Load GPT-2 in FP16 precision to halve the memory footprint of the weights. device_map="auto" lets Accelerate place the model's layers across the available devices (GPU first, spilling to CPU if needed).
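A back-of-envelope check of why FP16 helps, using GPT-2 small's approximate parameter count (~124M):

```python
# GPT-2 small: ~124M parameters; fp16 stores 2 bytes per weight, fp32 stores 4.
n_params = 124_000_000
print(f"fp16 weights: ~{n_params * 2 / 1e6:.0f} MB")  # ~248 MB
print(f"fp32 weights: ~{n_params * 4 / 1e6:.0f} MB")  # ~496 MB
```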

In [ ]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

6. LoRA Config + Training Arguments + Training

LoRA config: Inject low-rank adapter matrices into GPT-2's attention layers (c_attn, the fused query/key/value projection). r=8 is the rank of the adapter matrices and lora_alpha=16 the scaling factor (the adapter update is scaled by lora_alpha / r). get_peft_model freezes all original weights and leaves only the LoRA layers trainable. Note that the base model was loaded in FP16: recent PEFT versions automatically keep the adapter weights in FP32 so mixed-precision training works, but on older versions you may need to cast the trainable parameters to FP32 yourself.
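The LoRA update itself is simple linear algebra; a minimal NumPy sketch (toy shapes chosen for illustration) shows how a frozen weight W gets a trainable low-rank delta B @ A, and how few parameters that delta costs:

```python
import numpy as np

# Toy LoRA update: frozen W (d_out x d_in) plus trainable low-rank delta
# B @ A, scaled by alpha / r. Shapes are illustrative, not GPT-2's real ones.
d_out, d_in, r, alpha = 768, 768, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init => delta starts at 0

W_effective = W + (alpha / r) * (B @ A)      # what the adapted layer computes

full_params = W.size
lora_params = A.size + B.size
print(round(lora_params / full_params, 4))   # 0.0208 -> ~2% of full params
```

Because B is zero-initialized, the adapted model is exactly the base model at step 0, and training only moves the tiny A and B matrices.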

Training arguments: 5 epochs, batch size 4 with gradient accumulation of 2 (effective batch = 8), FP16 training, learning rate 2e-4.
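The effective-batch arithmetic from gradient accumulation can be sketched with toy numbers (not Trainer internals): gradients from consecutive micro-batches are summed before each optimizer step.

```python
# Gradient accumulation sketch: micro-batches of 4, accumulating every 2,
# behave like one batch of 8 because backward() adds into .grad.
def accumulate(grads_per_microbatch, accumulation_steps):
    total = 0.0
    for step, g in enumerate(grads_per_microbatch, start=1):
        total += g                       # backward() accumulates gradients
        if step % accumulation_steps == 0:
            yield total                  # optimizer.step(); zero_grad()
            total = 0.0

# Four micro-batch gradients, stepping every 2 -> two optimizer steps.
print(list(accumulate([1.0, 2.0, 3.0, 4.0], 2)))  # [3.0, 7.0]
```

In practice the Trainer also scales the loss by the accumulation steps so the accumulated gradient averages rather than sums; the sketch only shows the accumulation pattern.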

Trainer: Wires everything together and runs the full training loop.

In [ ]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./lora-llm",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    eval_strategy="steps",
    eval_steps=20,
    logging_steps=10,
    save_steps=50,
    learning_rate=2e-4,
    num_train_epochs=5,
    fp16=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer
)

trainer.train()

7. Save the Model

Save only the LoRA adapter weights (not the full GPT-2 base model) — this produces a tiny checkpoint of just the trained delta weights.
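A rough estimate of how tiny that checkpoint is, assuming GPT-2 small's dimensions (12 transformer blocks, c_attn mapping 768 to 2304) and r=8:

```python
# Back-of-envelope LoRA checkpoint size (assumed GPT-2 small dimensions).
layers, d_in, d_out, r = 12, 768, 2304, 8
adapter_params = layers * r * (d_in + d_out)   # per layer: A is r x d_in, B is d_out x r
print(adapter_params)                          # 294912 trainable params
print(round(adapter_params * 4 / 1e6, 2))      # ~1.18 MB as fp32, vs hundreds of MB for full GPT-2
```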

In [ ]:
model.save_pretrained("lora-gpt2")
tokenizer.save_pretrained("lora-gpt2")

8. Inference From Saved Model

Reload the base GPT-2 model and attach the saved LoRA adapter using PeftModel.from_pretrained. Wrap in a text-generation pipeline for simple inference.

In [ ]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained("lora-gpt2")
tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

model = PeftModel.from_pretrained(base_model, "lora-gpt2")

text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto"
)

Run inference on sample prompts using temperature sampling (do_sample=True, temperature=0.7) for creative, varied outputs.
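What temperature does can be sketched in plain Python (a toy illustration, not the Transformers sampling code): logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution toward the most likely token and values above 1 flatten it.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7):
    # Scale logits by 1/temperature, softmax, then draw one token index.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# A strongly peaked distribution almost always returns the argmax index;
# raising the temperature increases the chance of other tokens.
print(sample_with_temperature([8.0, 1.0, 0.5], temperature=0.7))
```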

In [ ]:
prompt = "The secret to happiness is"
outputs = text_gen(prompt, max_new_tokens=70, num_return_sequences=1, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])
In [ ]:
prompt = "once upon a time"
outputs = text_gen(prompt, max_new_tokens=70, num_return_sequences=1, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])