Category: Builder's Corner
Keyword: ai model compression techniques
Deploying AI models on edge devices—such as smartphones, IoT sensors, or embedded systems—requires careful optimization. Large models can be computationally expensive, consume excessive memory, and drain battery life. Model compression techniques help you reduce model size and latency, making AI practical for edge deployment.
Prerequisites
- Python (version 3.8 or above)
- PyTorch (version 1.12+) or TensorFlow (version 2.8+)
- ONNX (for interoperability, version 1.12+)
- torchvision (if using PyTorch sample models)
- Basic understanding of neural networks and Python programming
- Familiarity with command-line tools
- Optional: Netron for model visualization
1. Prepare Your Model
- Select a Pretrained Model

  For this tutorial, we'll use a pretrained ResNet18 model from PyTorch as our base.

  ```shell
  pip install torch torchvision
  ```

  ```python
  import torch
  import torchvision.models as models

  model = models.resnet18(pretrained=True)
  model.eval()
  ```

- Export Model for Compression

  Save the model for further processing:

  ```python
  torch.save(model.state_dict(), "resnet18.pth")
  ```
2. Quantization
Quantization reduces model size and speeds up inference by representing weights and activations with lower precision (e.g., 8-bit integers instead of 32-bit floats).
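To build intuition for what quantization does numerically, here is a minimal, framework-free sketch of affine uint8 quantization. The `quantize`/`dequantize` helpers are illustrative only, not part of any library:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Map float values to unsigned integers: q = round(x / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantize(x)
x_hat = dequantize(q, scale, zp)
print("max abs error:", np.abs(x - x_hat).max())  # small relative to the value range
```

The storage drops from 4 bytes to 1 byte per value, at the cost of a rounding error bounded by roughly half the scale; real frameworks add per-channel scales and calibration on top of this idea.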
- Post-Training Dynamic Quantization (PyTorch)

  Install required tools:

  ```shell
  pip install torch torchvision
  ```

  The snippet below applies dynamic quantization, which stores weights as int8 and quantizes activations on the fly (for ResNet18 this only affects the Linear layer; static quantization with calibration is needed to quantize the convolutions):

  ```python
  import torch
  import torchvision.models as models
  from torch.quantization import quantize_dynamic

  model = models.resnet18(pretrained=True)
  model.eval()

  quantized_model = quantize_dynamic(
      model, {torch.nn.Linear}, dtype=torch.qint8
  )
  torch.save(quantized_model.state_dict(), "resnet18_quantized.pth")
  ```

  Screenshot description: "A side-by-side file explorer showing resnet18.pth and resnet18_quantized.pth, with the quantized file significantly smaller."

- Quantization-Aware Training (QAT)

  For best accuracy, retrain the model with quantization simulation:

  ```python
  import torch.quantization

  model.train()  # QAT preparation expects the model in training mode
  model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
  torch.quantization.prepare_qat(model, inplace=True)

  # ... fine-tune the model on your dataset here ...

  torch.quantization.convert(model.eval(), inplace=True)
  ```

  Note: QAT requires retraining with your dataset.
- TensorFlow Lite Quantization

  If using TensorFlow:

  ```shell
  pip install tensorflow
  ```

  ```python
  import tensorflow as tf

  model = tf.keras.applications.MobileNetV2(weights="imagenet")
  converter = tf.lite.TFLiteConverter.from_keras_model(model)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]
  quantized_tflite_model = converter.convert()

  with open("mobilenetv2_quantized.tflite", "wb") as f:
      f.write(quantized_tflite_model)
  ```
3. Pruning
Pruning removes redundant or less significant weights, reducing model size and potentially speeding up inference.
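The core idea can be shown without any framework: rank weights by magnitude and zero out the smallest fraction. This `magnitude_prune` helper is an illustrative sketch of unstructured L1 pruning, not a library function:

```python
import numpy as np

def magnitude_prune(weights, amount=0.3):
    """Zero the fraction `amount` of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(amount * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
pruned = magnitude_prune(w, amount=0.5)
print("sparsity:", (pruned == 0).mean())
```

Unstructured pruning like this produces sparse tensors that need sparsity-aware runtimes to realize speedups; the structured pruning below instead removes whole channels, which shrinks the dense computation directly.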
- Install Pruning Toolkit

  ```shell
  pip install torch torchvision torch-pruning
  ```

- Apply Structured Pruning (PyTorch Example)

  ```python
  import torch
  import torchvision.models as models
  import torch_pruning as tp

  model = models.resnet18(pretrained=True)
  example_inputs = torch.randn(1, 3, 224, 224)

  strategy = tp.strategy.L1Strategy()
  DG = tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)

  for m in model.modules():
      if isinstance(m, torch.nn.Conv2d):
          # Select 30% of output channels with the smallest L1 norm
          pruning_idxs = strategy(m.weight, amount=0.3)
          plan = DG.get_pruning_plan(m, tp.prune_conv_out_channel, pruning_idxs)
          plan.exec()

  torch.save(model.state_dict(), "resnet18_pruned.pth")
  ```

  Note: the torch-pruning API has changed across releases; check the calls above against the documentation for the version you install.

  Screenshot description: "A diagram of a ResNet block, with some convolutional filters highlighted as removed."
- TensorFlow Model Pruning

  ```shell
  pip install tensorflow-model-optimization
  ```

  ```python
  import tensorflow as tf
  import tensorflow_model_optimization as tfmot

  model = tf.keras.applications.MobileNetV2(weights="imagenet")

  prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
  pruning_params = {
      "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
          initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000
      )
  }

  pruned_model = prune_low_magnitude(model, **pruning_params)
  pruned_model.compile(optimizer='adam', loss='categorical_crossentropy')
  ```
4. Knowledge Distillation
Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model, enabling high accuracy with fewer parameters.
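The key mechanism is the temperature: dividing logits by a temperature above 1 before the softmax softens the teacher's distribution, exposing how it ranks the wrong classes. A small framework-free demonstration with made-up logit values:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([5.0, 2.0, 1.0])  # hypothetical teacher logits for 3 classes
sharp = softmax(logits, temperature=1.0)
soft = softmax(logits, temperature=4.0)
print("T=1:", sharp.round(3))  # nearly one-hot
print("T=4:", soft.round(3))   # softened; relative class similarities visible
```

The student is trained to match the softened distribution, which carries more information per example than the hard label alone.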
- Set Up Teacher and Student Models

  The teacher should be the larger, pretrained network; the student's output dimension must match the teacher's so their logits can be compared:

  ```python
  import torch
  import torch.nn as nn
  import torchvision.models as models

  teacher = models.resnet50(pretrained=True)   # large, pretrained teacher
  student = models.resnet18(num_classes=1000)  # smaller student with matching output size
  ```

- Define Distillation Loss

  ```python
  def distillation_loss(student_logits, teacher_logits, labels,
                        temperature=2.0, alpha=0.5):
      kd_loss = nn.KLDivLoss(reduction='batchmean')(
          nn.functional.log_softmax(student_logits / temperature, dim=1),
          nn.functional.softmax(teacher_logits / temperature, dim=1)
      ) * (temperature ** 2)
      ce_loss = nn.functional.cross_entropy(student_logits, labels)
      return alpha * kd_loss + (1 - alpha) * ce_loss
  ```

- Train Student Model

  ```python
  teacher.eval()   # keep the teacher frozen
  student.train()
  optimizer = torch.optim.Adam(student.parameters())

  for data, labels in dataloader:
      optimizer.zero_grad()
      student_logits = student(data)
      with torch.no_grad():
          teacher_logits = teacher(data)
      loss = distillation_loss(student_logits, teacher_logits, labels)
      loss.backward()
      optimizer.step()
  ```
5. Model Export and Deployment
- Export to ONNX

  ```shell
  pip install onnx
  ```

  ```python
  dummy_input = torch.randn(1, 3, 224, 224)
  torch.onnx.export(model, dummy_input, "resnet18_compressed.onnx", opset_version=12)
  ```

- Test on Edge Device

  Copy your compressed model to the device and run inference using ONNX Runtime or the TensorFlow Lite Interpreter.

  ```shell
  pip install onnxruntime
  ```

  ```python
  import onnxruntime
  import numpy as np

  session = onnxruntime.InferenceSession("resnet18_compressed.onnx")
  input_name = session.get_inputs()[0].name
  output = session.run(None, {input_name: np.random.randn(1, 3, 224, 224).astype(np.float32)})
  print(output)
  ```

  Screenshot description: "Terminal output showing inference results and timing on a Raspberry Pi."
Common Issues & Troubleshooting
- Accuracy Drop: Compression may reduce accuracy. Use quantization-aware training, fine-tuning, or distillation to recover performance.
- Unsupported Operations: Some layers may not be supported by quantization/pruning libraries or edge runtimes. Check your model architecture and use supported layers.
- Export Errors: ONNX export may fail for custom layers. Implement custom ONNX operators or adjust your model.
- Device Compatibility: Ensure the edge device supports the chosen runtime (e.g., TFLite, ONNX Runtime) and quantized models.
- Inference Speed: Not all hardware accelerates quantized or pruned models equally. Test performance on your target device.
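A quick way to check that last point is a simple wall-clock benchmark. The `benchmark` helper below is an illustrative sketch, and the workload shown is a stand-in; replace it with your model's actual forward pass (e.g. a call into ONNX Runtime or the TFLite interpreter) on the target device:

```python
import time

def benchmark(run_inference, n_warmup=5, n_iters=20):
    """Return average seconds per call, discarding warm-up runs."""
    for _ in range(n_warmup):
        run_inference()  # warm caches / lazy initialization
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference()
    return (time.perf_counter() - start) / n_iters

# Stand-in workload; swap for e.g. lambda: session.run(None, inputs)
avg_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"average latency: {avg_s * 1000:.3f} ms")
```

Run the same benchmark on the uncompressed and compressed models to quantify the speedup on your actual hardware rather than trusting desktop numbers.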
Next Steps
- Experiment with more aggressive pruning/quantization settings and measure the trade-off between size, speed, and accuracy.
- Explore model architecture search for even smaller, edge-optimized models (e.g., MobileNet, EfficientNet-Lite).
- Automate your compression workflow with scripts or CI pipelines for reproducibility.
- Keep up-to-date with the latest research and hardware support for edge AI deployment.
By applying these AI model compression techniques, you can unlock efficient, real-world AI applications on resource-constrained edge devices, delivering smarter user experiences everywhere.
