CUDA-Accelerated Image Processing & Deep Learning Pipeline
Custom CUDA image processing kernels integrated with PyTorch C++ Extensions, powering an end-to-end CNN inference pipeline for image classification.
Project Overview
This project bridges my background in hardware design with modern deep learning by building a computer vision pipeline from the ground up — starting at the CUDA kernel level and working up to a trained CNN classifier.
The core idea is to write custom image processing kernels in CUDA (grayscale conversion, Gaussian blur, Sobel edge detection, and histogram equalization), bind them into Python using PyTorch’s C++ Extension API, and then use those kernels as the preprocessing stage for a trained image classification model.
What I’m Building
Custom CUDA Kernels
Each kernel is written in CUDA C++ and optimized for the GPU memory hierarchy. The Gaussian blur kernel uses shared memory tiling to reduce global memory accesses — loading a tile of the image into fast on-chip shared memory so threads can reuse data without going back to global memory on every read. The kernels are compiled and exposed to Python as native torch.Tensor operations via a setup.py build.
Deep Learning Pipeline
On top of the preprocessing kernels, I’m training a convolutional neural network for image classification. The preprocessing and inference run end-to-end: a raw image goes in, my CUDA kernels prepare it, and the model returns a prediction with confidence. All experiments are documented in a Jupyter notebook with training curves, sample predictions, and performance comparisons.
Performance Analysis
A key deliverable is a quantitative comparison between three configurations: CPU-only processing with OpenCV, standard PyTorch GPU ops, and my custom CUDA kernels. I’ll profile using PyTorch Profiler and NVIDIA Nsight Systems to identify bottlenecks and explain the performance differences.
Technical Stack
- CUDA C++ for custom kernel implementation
- PyTorch C++ Extension API (
torch.utils.cpp_extension) for Python integration - Python / Jupyter for training, experiments, and visualization
- NVIDIA Nsight Systems for profiling
Current Status
In progress. Kernel foundations are being implemented and tested.