Category: Builder's Corner
Keyword: finetune llm lora
Fine-tuning large language models (LLMs) on your own data unlocks custom capabilities, domain-specific expertise, and improved performance for your applications. However, full fine-tuning is resource-intensive. Enter LoRA (Low-Rank Adaptation): a parameter-efficient method that makes LLM fine-tuning accessible even on consumer GPUs.
In this tutorial, you'll learn how to fine-tune an LLM using LoRA, leveraging the Hugging Face ecosystem. We'll cover setup, data preparation, configuration, training, and evaluation — all with reproducible code and commands.
Prerequisites

- Hardware:
  - Recommended: NVIDIA GPU with ≥8GB VRAM (e.g., RTX 3060 or higher)
  - Minimum: Modern CPU (fine-tuning will be slow)
- Operating System: Linux (Ubuntu 20.04+), macOS, or Windows (WSL2 recommended)
- Python: 3.9 or 3.10
- Knowledge:
  - Basic Python scripting
  - Familiarity with Hugging Face Transformers and Datasets
  - Understanding of LLMs and fine-tuning concepts
- Software Tools:
  - PyTorch 2.0+
  - `transformers` (v4.30+), `peft` (v0.4+), `datasets`, `accelerate`
  - CUDA toolkit (if using GPU)
- Accounts:
  - Hugging Face account (to access models and datasets)
1. Environment Setup

Create and activate a Python virtual environment:

```bash
python3 -m venv lora-finetune-env
source lora-finetune-env/bin/activate
```

Upgrade pip and install the required packages:

```bash
pip install --upgrade pip
pip install torch transformers datasets peft accelerate
```

Verify CUDA (for GPU acceleration):

```bash
python -c "import torch; print(torch.cuda.is_available())"
```

This should print `True` if CUDA is working.
2. Prepare Your Data

Format your data as JSONL or CSV. For text generation tasks, each example should have an `input` (prompt/context) and an `output` (desired response):

```json
{"input": "What is LoRA?", "output": "LoRA stands for Low-Rank Adaptation, a parameter-efficient fine-tuning method for LLMs."}
{"input": "Explain fine-tuning.", "output": "Fine-tuning adapts a pre-trained model to a specific task or dataset."}
```

Place your data file in your project directory (e.g., `my_data.jsonl`).

Load your dataset using Hugging Face `datasets`:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="my_data.jsonl")
print(dataset)
```
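Before loading, it can save a training run to sanity-check the file itself. The sketch below (a hypothetical helper, not part of any library) verifies that every line of a JSONL file parses as JSON and carries the `input`/`output` fields used in this tutorial:

```python
import json

REQUIRED_FIELDS = {"input", "output"}

def validate_jsonl(path):
    """Return a list of (line_number, problem) for malformed records."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                problems.append((lineno, f"missing fields: {sorted(missing)}"))
    return problems

# Example: validate_jsonl("my_data.jsonl") returns [] when every line is well-formed
```

An empty result means `datasets.load_dataset` should ingest the file without surprises; any reported line is worth fixing first, since a single malformed record can fail the whole load.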
3. Choose a Base Model

Pick a model checkpoint from the Hugging Face Hub.

- Popular options: `tiiuae/falcon-7b`, `meta-llama/Llama-2-7b-hf`, `mistralai/Mistral-7B-v0.1`, etc.
- Browse the Hugging Face Models page for more.

Download and load the model and tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # or your preferred model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)
```

Tip: Add `use_auth_token=True` if the model is gated.
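When picking a checkpoint, it helps to estimate whether its weights even fit in your VRAM. A common rule of thumb (an approximation, not from any library) is parameters × bytes per parameter; the helper below is a hypothetical sketch of that arithmetic:

```python
def estimate_weight_vram_gib(n_params_billion, bytes_per_param=2):
    """Rough weight-memory estimate in GiB.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32.
    Real usage is higher: activations, optimizer state, and the
    KV cache all add on top of the weights.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# A 7B model in fp16 needs roughly 13 GiB for the weights alone
print(round(estimate_weight_vram_gib(7), 1))
```

This is why a 7B model in fp16 is already a stretch for the 8GB card from the prerequisites; quantized loading or a smaller checkpoint narrows the gap.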
4. Apply LoRA With PEFT

Configure LoRA parameters:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # Rank of the update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Layer names may vary by model
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```

Note: For Llama/Mistral, the target modules are usually `q_proj` and `v_proj`. For other models, check their architecture for the correct target modules.

Wrap the base model with LoRA:

```python
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

This should show only a small number of trainable parameters (the LoRA adapters).
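If you are unsure which `target_modules` names your architecture uses, one way to find candidates is to scan the model's submodules for linear layers and collect their trailing names. The sketch below assumes only that the object exposes `named_modules()` (as Hugging Face models do, since they are PyTorch modules); the list of class names to match is an assumption covering common cases:

```python
def linear_module_suffixes(model):
    """Collect the trailing names of all Linear-like submodules.

    These suffixes are the kind of values LoraConfig.target_modules
    expects (e.g., "q_proj", "v_proj").
    """
    suffixes = set()
    for name, module in model.named_modules():
        # Match plain and quantized linear layers by class name (assumed list)
        if type(module).__name__ in ("Linear", "Linear8bitLt", "Linear4bit"):
            suffixes.add(name.rsplit(".", 1)[-1])
    return sorted(suffixes)

# Usage with the loaded model:
#   print(linear_module_suffixes(model))
```

For a Llama/Mistral-style model this would surface names like `q_proj`, `k_proj`, `v_proj`, `o_proj`, and the MLP projections, from which you can choose your targets.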
5. Preprocess and Tokenize Your Data

Define a prompt formatting function:

```python
def format_prompt(example):
    return f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"
```

Apply the function and tokenize:

```python
# Llama/Mistral tokenizers ship without a pad token; reuse EOS for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    prompt = format_prompt(example)
    tokens = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    # For causal LM training, the labels are the input IDs themselves
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset["train"].map(tokenize_function)
```
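To see exactly what string the model will train on, it is worth printing one formatted example. This snippet reuses the same template as above on a sample record (the sample text here is illustrative):

```python
def format_prompt(example):
    # Same template as in the tutorial
    return f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"

sample = {"input": "What is LoRA?", "output": "A parameter-efficient fine-tuning method."}
print(format_prompt(sample))
# ### Question:
# What is LoRA?
#
# ### Answer:
# A parameter-efficient fine-tuning method.
```

At inference time you will prompt the model with everything up to and including `### Answer:\n`, so keeping this template identical between training and inference matters.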
6. Configure Training Arguments

Set up training hyperparameters:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=50,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    output_dir="./lora-finetuned-llm",
    save_strategy="epoch",
    evaluation_strategy="no",
    report_to="none",
)
```
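With the settings above, the batch size the optimizer effectively sees per update is the per-device batch size times the accumulation steps (times the number of GPUs, if you scale out). A quick check of that arithmetic:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # adjust for your setup

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 16 on a single GPU
```

This is why lowering `per_device_train_batch_size` to escape out-of-memory errors and raising `gradient_accumulation_steps` by the same factor keeps training dynamics comparable.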
7. Launch Fine-Tuning

Initialize the Trainer and start training:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
trainer.train()
```

Training progress and loss will be printed to the terminal.

Screenshot description: The terminal displays training progress, including epoch number, step, and loss values.

After training, save the LoRA adapters:

```python
model.save_pretrained("./lora-finetuned-llm")
tokenizer.save_pretrained("./lora-finetuned-llm")
```
8. Run Inference With Your Fine-Tuned LLM

Load the LoRA-adapted model and generate text:

```python
import torch
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype="auto"
)
lora_model = PeftModel.from_pretrained(base_model, "./lora-finetuned-llm")
lora_model.eval()

prompt = "What is LoRA?"
inputs = tokenizer(prompt, return_tensors="pt").to(lora_model.device)
with torch.no_grad():  # no gradients needed for generation
    outputs = lora_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Screenshot description: Terminal output shows the fine-tuned model's response to the prompt.
Common Issues & Troubleshooting

- CUDA out of memory:
  - Reduce `per_device_train_batch_size` or `max_length`.
  - Use `gradient_accumulation_steps` to maintain the effective batch size.
  - Try `fp16=True` for mixed precision.
- Model or tokenizer mismatch:
  - Ensure you use the same model and tokenizer checkpoints.
- Incorrect target modules for LoRA:
  - Check your model architecture and set `target_modules` accordingly.
- Dataset mapping errors:
  - Check your JSONL/CSV format and ensure the field names match your code.
- Slow training:
  - Ensure the GPU is being used (`torch.cuda.is_available()` is `True`).
  - Close other GPU-intensive applications.
Next Steps

- Experiment with different LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`).
- Try larger or more specialized base models for better results (if hardware allows).
- Evaluate your fine-tuned model on held-out data or real-world tasks.
- Convert your LoRA adapters to ONNX or other formats for production deployment.
- Explore advanced PEFT techniques (e.g., QLoRA, AdaLoRA) for further efficiency.
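When experimenting with `r`, it helps to know how adapter size scales: each targeted `d_out x d_in` linear layer gains two low-rank matrices, A of shape `r x d_in` and B of shape `d_out x r`, so `r * (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic; the layer shapes are approximate Mistral-7B attention dimensions (an assumption for illustration):

```python
def lora_param_count(shapes, r):
    """Trainable parameters LoRA adds: A is (r x d_in), B is (d_out x r)."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Assumed Mistral-7B-like attention shapes: 32 layers, each with
# q_proj 4096 -> 4096 and v_proj 4096 -> 1024 (as (d_out, d_in) pairs)
shapes = [(4096, 4096), (1024, 4096)] * 32

for r in (8, 16, 32):
    print(r, lora_param_count(shapes, r))
```

Parameter count grows linearly in `r`, so doubling the rank doubles the adapter, still only a few million parameters against billions in the base model, which is the number `print_trainable_parameters()` reports.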