Senior Computer Vision Engineer

Production computer vision engineering skill for object detection, image segmentation, and visual AI system deployment.

Quick Start

# Generate training configuration for YOLO or Faster R-CNN
python scripts/vision_model_trainer.py models/ --task detection --arch yolov8

# Analyze model for optimization opportunities (quantization, pruning)
python scripts/inference_optimizer.py model.pt --target onnx --benchmark

# Build dataset pipeline with augmentations
python scripts/dataset_pipeline_builder.py images/ --format coco --augment

Core Expertise

This skill provides guidance on:

  • Object Detection: YOLO family (v5-v11), Faster R-CNN, DETR, RT-DETR
  • Instance Segmentation: Mask R-CNN, YOLACT, SOLOv2
  • Semantic Segmentation: DeepLabV3+, SegFormer, SAM (Segment Anything)
  • Image Classification: ResNet, EfficientNet, Vision Transformers (ViT, DeiT)
  • Video Analysis: Object tracking (ByteTrack, SORT), action recognition
  • 3D Vision: Depth estimation, point cloud processing, NeRF
  • Production Deployment: ONNX, TensorRT, OpenVINO, CoreML

Tech Stack

| Category | Technologies |
| --- | --- |
| Frameworks | PyTorch, torchvision, timm |
| Detection | Ultralytics (YOLO), Detectron2, MMDetection |
| Segmentation | segment-anything, mmsegmentation |
| Optimization | ONNX, TensorRT, OpenVINO, torch.compile |
| Image Processing | OpenCV, Pillow, albumentations |
| Annotation | CVAT, Label Studio, Roboflow |
| Experiment Tracking | MLflow, Weights & Biases |
| Serving | Triton Inference Server, TorchServe |

Workflow 1: Object Detection Pipeline

Use this workflow when building an object detection system from scratch.

Step 1: Define Detection Requirements

Analyze the detection task requirements:

Detection Requirements Analysis:
- Target objects: [list specific classes to detect]
- Real-time requirement: [yes/no, target FPS]
- Accuracy priority: [speed vs accuracy trade-off]
- Deployment target: [cloud GPU, edge device, mobile]
- Dataset size: [number of images, annotations per class]

Step 2: Select Detection Architecture

Choose architecture based on requirements:

| Requirement | Recommended Architecture | Why |
| --- | --- | --- |
| Real-time (>30 FPS) | YOLOv8/v11, RT-DETR | Single-stage, optimized for speed |
| High accuracy | Faster R-CNN, DINO | Two-stage, better localization |
| Small objects | YOLO + SAHI, Faster R-CNN + FPN | Multi-scale detection |
| Edge deployment | YOLOv8n, MobileNetV3-SSD | Lightweight architectures |
| Transformer-based | DETR, DINO, RT-DETR | End-to-end, no NMS required |

Step 3: Prepare Dataset

Convert annotations to required format:

# COCO format (recommended)
python scripts/dataset_pipeline_builder.py data/images/ \
    --annotations data/labels/ \
    --format coco \
    --split 0.8 0.1 0.1 \
    --output data/coco/

# Verify dataset
python -c "from pycocotools.coco import COCO; coco = COCO('data/coco/train.json'); print(f'Images: {len(coco.imgs)}, Categories: {len(coco.cats)}')"

Step 4: Configure Training

Generate training configuration:

# For Ultralytics YOLO
python scripts/vision_model_trainer.py data/coco/ \
    --task detection \
    --arch yolov8m \
    --epochs 100 \
    --batch 16 \
    --imgsz 640 \
    --output configs/

# For Detectron2
python scripts/vision_model_trainer.py data/coco/ \
    --task detection \
    --arch faster_rcnn_R_50_FPN \
    --framework detectron2 \
    --output configs/

Step 5: Train and Validate

# Ultralytics training
yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640

# Detectron2 training
python train_net.py --config-file configs/faster_rcnn.yaml --num-gpus 1

# Validate on test set
yolo detect val model=runs/detect/train/weights/best.pt data=data.yaml
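
The same loop can be driven from Python via the Ultralytics API, which is convenient inside training scripts; a minimal sketch (paths and hyperparameters are placeholders):

# train_yolo.py -- minimal Ultralytics Python API sketch
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                                 # start from pretrained weights
model.train(data="data.yaml", epochs=100, imgsz=640, batch=16)
metrics = model.val(data="data.yaml")                      # evaluate on the val split
print(metrics.box.map50, metrics.box.map)                  # mAP@50 and mAP@50:95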

Step 6: Evaluate Results

Key metrics to analyze:

| Metric | Target | Description |
| --- | --- | --- |
| mAP@50 | >0.7 | Mean Average Precision at IoU 0.5 |
| mAP@50:95 | >0.5 | COCO primary metric |
| Precision | >0.8 | Low false positives |
| Recall | >0.8 | Low missed detections |
| Inference time | <33ms | For 30 FPS real-time |
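
These metrics come straight out of pycocotools; a minimal sketch for scoring a COCO-format detections file against ground truth (file names are assumptions):

# coco_eval.py -- score detections with pycocotools; paths are assumptions
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("data/coco/val.json")                  # ground-truth annotations
dt = gt.loadRes("predictions.json")              # detections in COCO result format
ev = COCOeval(gt, dt, iouType="bbox")
ev.evaluate(); ev.accumulate(); ev.summarize()   # prints the full AP/AR table
print("mAP@50:95 =", ev.stats[0], "| mAP@50 =", ev.stats[1])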

Workflow 2: Model Optimization and Deployment

Use this workflow when preparing a trained model for production deployment.

Step 1: Benchmark Baseline Performance

# Measure current model performance
python scripts/inference_optimizer.py model.pt \
    --benchmark \
    --input-size 640 640 \
    --batch-sizes 1 4 8 16 \
    --warmup 10 \
    --iterations 100

Expected output:

Baseline Performance (PyTorch FP32):
- Batch 1: 45.2ms (22.1 FPS)
- Batch 4: 89.4ms (44.7 FPS)
- Batch 8: 165.3ms (48.4 FPS)
- Memory: 2.1 GB
- Parameters: 25.9M
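
The warmup-then-measure pattern behind these numbers is easy to reproduce if the script is unavailable; a sketch of a GPU-aware timing harness for a PyTorch model (function and variable names are illustrative):

# bench.py -- latency benchmark with warmup; assumes a PyTorch model
import time
import torch

@torch.inference_mode()
def benchmark(model, batch, warmup=10, iters=100):
    for _ in range(warmup):                      # let kernels compile / caches warm up
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()                 # don't start timing with queued kernels
    t0 = time.perf_counter()
    for _ in range(iters):
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()                 # wait for the last batch to finish
    ms = (time.perf_counter() - t0) / iters * 1000
    print(f"batch {batch.shape[0]}: {ms:.1f} ms ({batch.shape[0] * 1000 / ms:.1f} FPS)")

# usage: benchmark(model.cuda().eval(), torch.randn(1, 3, 640, 640).cuda())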

Step 2: Select Optimization Strategy

| Deployment Target | Optimization Path |
| --- | --- |
| NVIDIA GPU (cloud) | PyTorch → ONNX → TensorRT FP16 |
| NVIDIA GPU (edge) | PyTorch → TensorRT INT8 |
| Intel CPU | PyTorch → ONNX → OpenVINO |
| Apple Silicon | PyTorch → CoreML |
| Generic CPU | PyTorch → ONNX Runtime |
| Mobile | PyTorch → TFLite or ONNX Mobile |

Step 3: Export to ONNX

# Export with dynamic batch size
python scripts/inference_optimizer.py model.pt \
    --export onnx \
    --input-size 640 640 \
    --dynamic-batch \
    --simplify \
    --output model.onnx

# Verify ONNX model
python -c "import onnx; model = onnx.load('model.onnx'); onnx.checker.check_model(model); print('ONNX model valid')"

Step 4: Apply Quantization (Optional)

For INT8 quantization with calibration:

# Generate calibration dataset
python scripts/inference_optimizer.py model.onnx \
    --quantize int8 \
    --calibration-data data/calibration/ \
    --calibration-samples 500 \
    --output model_int8.onnx

Quantization impact analysis:

| Precision | Size | Speed | Accuracy Drop |
| --- | --- | --- | --- |
| FP32 | 100% | 1x | 0% |
| FP16 | 50% | 1.5-2x | <0.5% |
| INT8 | 25% | 2-4x | 1-3% |
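
If the optimizer script is unavailable, ONNX Runtime's static quantization covers the same INT8 path; a sketch assuming calibration tensors pre-processed to the model's input shape and saved as .npy files (the input name "images" is an assumption):

# quantize_int8.py -- static INT8 quantization with ONNX Runtime
import glob
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class CalibReader(CalibrationDataReader):
    """Feeds pre-processed NCHW float32 tensors to the quantizer."""
    def __init__(self, files, input_name):
        self.files = iter(files)
        self.input_name = input_name
    def get_next(self):
        f = next(self.files, None)
        return None if f is None else {self.input_name: np.load(f)}

files = sorted(glob.glob("data/calibration/*.npy"))[:500]
quantize_static("model.onnx", "model_int8.onnx",
                CalibReader(files, "images"),    # input name is model-specific
                weight_type=QuantType.QInt8)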

Step 5: Convert to Target Runtime

# TensorRT (NVIDIA GPU)
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16

# OpenVINO (Intel)
mo --input_model model.onnx --output_dir openvino/

# CoreML (Apple) -- coremltools 6+ dropped ONNX input, so trace the PyTorch model instead
python -c "import torch, coremltools as ct; m = torch.load('model.pt', weights_only=False).eval(); ts = torch.jit.trace(m, torch.randn(1,3,640,640)); ct.convert(ts, inputs=[ct.TensorType(shape=(1,3,640,640))]).save('model.mlpackage')"

Step 6: Benchmark Optimized Model

python scripts/inference_optimizer.py model.engine \
    --benchmark \
    --runtime tensorrt \
    --compare model.pt

Expected speedup:

Optimization Results:
- Original (PyTorch FP32): 45.2ms
- Optimized (TensorRT FP16): 12.8ms
- Speedup: 3.5x
- Accuracy change: -0.3% mAP

Workflow 3: Custom Dataset Preparation

Use this workflow when preparing a computer vision dataset for training.

Step 1: Audit Raw Data

# Analyze image dataset
python scripts/dataset_pipeline_builder.py data/raw/ \
    --analyze \
    --output analysis/

Analysis report includes:

Dataset Analysis:
- Total images: 5,234
- Image sizes: 640x480 to 4096x3072 (variable)
- Formats: JPEG (4,891), PNG (343)
- Corrupted: 12 files
- Duplicates: 45 pairs

Annotation Analysis:
- Format detected: Pascal VOC XML
- Total annotations: 28,456
- Classes: 5 (car, person, bicycle, dog, cat)
- Distribution: car (12,340), person (8,234), bicycle (3,456), dog (2,890), cat (1,536)
- Empty images: 234

Step 2: Clean and Validate

# Remove corrupted and duplicate images
python scripts/dataset_pipeline_builder.py data/raw/ \
    --clean \
    --remove-corrupted \
    --remove-duplicates \
    --output data/cleaned/
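
Under the hood this pass reduces to two checks; a sketch of how corrupted files and exact byte-level duplicates can be flagged (hash-based, so near-duplicates are not caught):

# clean_check.py -- flag corrupted images and exact duplicates
import hashlib
from pathlib import Path
from PIL import Image

seen, corrupted, duplicates = {}, [], []
for path in sorted(Path("data/raw").rglob("*.jpg")):
    try:
        with Image.open(path) as im:
            im.verify()                          # integrity check without a full decode
    except Exception:
        corrupted.append(path)
        continue
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    if digest in seen:
        duplicates.append((path, seen[digest]))  # exact byte-for-byte copy
    else:
        seen[digest] = path
print(f"{len(corrupted)} corrupted, {len(duplicates)} duplicate pairs")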

Step 3: Convert Annotation Format

# Convert VOC to COCO format
python scripts/dataset_pipeline_builder.py data/cleaned/ \
    --annotations data/annotations/ \
    --input-format voc \
    --output-format coco \
    --output data/coco/

Supported format conversions:

| From | To |
| --- | --- |
| Pascal VOC XML | COCO JSON |
| YOLO TXT | COCO JSON |
| COCO JSON | YOLO TXT |
| LabelMe JSON | COCO JSON |
| CVAT XML | COCO JSON |
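
The VOC-to-COCO direction is representative of what these conversions involve; a minimal sketch for bounding boxes only, using the class list from the audit above (paths are assumptions):

# voc_to_coco.py -- minimal Pascal VOC XML -> COCO JSON conversion (bboxes only)
import json
import xml.etree.ElementTree as ET
from pathlib import Path

CLASSES = ["car", "person", "bicycle", "dog", "cat"]
coco = {"images": [], "annotations": [],
        "categories": [{"id": i + 1, "name": c} for i, c in enumerate(CLASSES)]}
ann_id = 1
for img_id, xml_file in enumerate(sorted(Path("data/annotations").glob("*.xml")), start=1):
    root = ET.parse(xml_file).getroot()
    size = root.find("size")
    coco["images"].append({"id": img_id, "file_name": root.findtext("filename"),
                           "width": int(size.findtext("width")),
                           "height": int(size.findtext("height"))})
    for obj in root.findall("object"):
        b = obj.find("bndbox")
        x1, y1 = float(b.findtext("xmin")), float(b.findtext("ymin"))
        x2, y2 = float(b.findtext("xmax")), float(b.findtext("ymax"))
        coco["annotations"].append({"id": ann_id, "image_id": img_id,
                                    "category_id": CLASSES.index(obj.findtext("name")) + 1,
                                    "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO is xywh
                                    "area": (x2 - x1) * (y2 - y1), "iscrowd": 0})
        ann_id += 1
Path("data/coco/train.json").write_text(json.dumps(coco))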

Step 4: Apply Augmentations

# Generate augmentation config
python scripts/dataset_pipeline_builder.py data/coco/ \
    --augment \
    --aug-config configs/augmentation.yaml \
    --output data/augmented/

Recommended augmentations for detection; a matching albumentations pipeline is sketched below:

# configs/augmentation.yaml
augmentations:
  geometric:
    - horizontal_flip: { p: 0.5 }
    - vertical_flip: { p: 0.1 }  # Only if orientation invariant
    - rotate: { limit: 15, p: 0.3 }
    - scale: { scale_limit: 0.2, p: 0.5 }

  color:
    - brightness_contrast: { brightness_limit: 0.2, contrast_limit: 0.2, p: 0.5 }
    - hue_saturation: { hue_shift_limit: 20, sat_shift_limit: 30, p: 0.3 }
    - blur: { blur_limit: 3, p: 0.1 }

  advanced:
    - mosaic: { p: 0.5 }  # YOLO-style mosaic
    - mixup: { p: 0.1 }   # Image mixing
    - cutout: { num_holes: 8, max_h_size: 32, max_w_size: 32, p: 0.3 }
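
The geometric and color blocks of this config map roughly onto albumentations (already in the tech stack); mosaic and mixup are usually applied by the training framework itself rather than per-image. A sketch:

# augment.py -- detection-safe pipeline mirroring the YAML above
import albumentations as A
import numpy as np

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.1),                   # only if orientation invariant
        A.Rotate(limit=15, p=0.3),
        A.RandomScale(scale_limit=0.2, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, p=0.3),
        A.Blur(blur_limit=3, p=0.1),
    ],
    # keep boxes in sync with the pixels; drop boxes that become mostly invisible
    bbox_params=A.BboxParams(format="coco", label_fields=["labels"], min_visibility=0.3),
)
image = np.zeros((640, 640, 3), dtype=np.uint8)  # stand-in HWC image
out = transform(image=image, bboxes=[[100, 100, 50, 80]], labels=[1])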

Step 5: Create Train/Val/Test Splits

python scripts/dataset_pipeline_builder.py data/augmented/ \
    --split 0.8 0.1 0.1 \
    --stratify \
    --seed 42 \
    --output data/final/

Split strategy guidelines:

| Dataset Size | Train | Val | Test |
| --- | --- | --- | --- |
| <1,000 images | 70% | 15% | 15% |
| 1,000-10,000 | 80% | 10% | 10% |
| >10,000 | 90% | 5% | 5% |
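
Stratification means preserving the class mix across splits. For detection data this is only approximate (images carry multiple labels); a common workaround is to stratify on one representative label per image, sketched here with scikit-learn:

# split.py -- stratified 80/10/10 split on a per-image label
from sklearn.model_selection import train_test_split

image_ids = list(range(1000))                    # stand-in data
labels = [i % 5 for i in image_ids]              # one representative class per image
train_ids, rest_ids, _, rest_y = train_test_split(
    image_ids, labels, test_size=0.2, stratify=labels, random_state=42)
val_ids, test_ids = train_test_split(
    rest_ids, test_size=0.5, stratify=rest_y, random_state=42)
print(len(train_ids), len(val_ids), len(test_ids))   # 800 100 100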

Step 6: Generate Dataset Configuration

# For Ultralytics YOLO
python scripts/dataset_pipeline_builder.py data/final/ \
    --generate-config yolo \
    --output data.yaml

# For Detectron2
python scripts/dataset_pipeline_builder.py data/final/ \
    --generate-config detectron2 \
    --output detectron2_config.py

Architecture Selection Guide

Object Detection Architectures

| Architecture | Speed | COCO mAP50-95 | Best For |
| --- | --- | --- | --- |
| YOLOv8n | 1.2ms | 37.3 | Edge, mobile, real-time |
| YOLOv8s | 2.1ms | 44.9 | Balanced speed/accuracy |
| YOLOv8m | 4.2ms | 50.2 | General purpose |
| YOLOv8l | 6.8ms | 52.9 | High accuracy |
| YOLOv8x | 10.1ms | 53.9 | Maximum accuracy |
| RT-DETR-L | 5.3ms | 53.0 | Transformer, no NMS |
| Faster R-CNN R50 | 46ms | 40.2 | Two-stage, high quality |
| DINO-4scale | 85ms | 49.0 | SOTA transformer |

Segmentation Architectures

| Architecture | Type | Speed | Best For |
| --- | --- | --- | --- |
| YOLOv8-seg | Instance | 4.5ms | Real-time instance seg |
| Mask R-CNN | Instance | 67ms | High-quality masks |
| SAM | Promptable | 50ms | Zero-shot segmentation |
| DeepLabV3+ | Semantic | 25ms | Scene parsing |
| SegFormer | Semantic | 15ms | Efficient semantic seg |

CNN vs Vision Transformer Trade-offs

| Aspect | CNN (YOLO, R-CNN) | ViT (DETR, DINO) |
| --- | --- | --- |
| Training data needed | 1K-10K images | 10K-100K+ images |
| Training time | Fast | Slow (needs more epochs) |
| Inference speed | Faster | Slower |
| Small objects | Good with FPN | Needs multi-scale |
| Global context | Limited | Excellent |
| Positional encoding | Implicit | Explicit |

Reference Documentation

1. Computer Vision Architectures

See references/computer_vision_architectures.md for:

  • CNN backbone architectures (ResNet, EfficientNet, ConvNeXt)
  • Vision Transformer variants (ViT, DeiT, Swin)
  • Detection heads (anchor-based vs anchor-free)
  • Feature Pyramid Networks (FPN, BiFPN, PANet)
  • Neck architectures for multi-scale detection

2. Object Detection Optimization

See references/object_detection_optimization.md for:

  • Non-Maximum Suppression variants (NMS, Soft-NMS, DIoU-NMS)
  • Anchor optimization and anchor-free alternatives
  • Loss function design (focal loss, GIoU, CIoU, DIoU)
  • Training strategies (warmup, cosine annealing, EMA)
  • Data augmentation for detection (mosaic, mixup, copy-paste)

3. Production Vision Systems

See references/production_vision_systems.md for:

  • ONNX export and optimization
  • TensorRT deployment pipeline
  • Batch inference optimization
  • Edge device deployment (Jetson, Intel NCS)
  • Model serving with Triton
  • Video processing pipelines

Common Commands

Ultralytics YOLO

# Training
yolo detect train data=coco.yaml model=yolov8m.pt epochs=100 imgsz=640

# Validation
yolo detect val model=best.pt data=coco.yaml

# Inference
yolo detect predict model=best.pt source=images/ save=True

# Export
yolo export model=best.pt format=onnx simplify=True dynamic=True

Detectron2

# Training
python train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml \
    --num-gpus 1 OUTPUT_DIR ./output

# Evaluation
python train_net.py --config-file configs/faster_rcnn.yaml --eval-only \
    MODEL.WEIGHTS output/model_final.pth

# Inference
python demo.py --config-file configs/faster_rcnn.yaml \
    --input images/*.jpg --output results/ \
    --opts MODEL.WEIGHTS output/model_final.pth

MMDetection

# Training
python tools/train.py configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py

# Testing
python tools/test.py configs/faster_rcnn.py checkpoints/latest.pth --eval bbox

# Inference
python demo/image_demo.py demo.jpg configs/faster_rcnn.py checkpoints/latest.pth

Model Optimization

# ONNX export (assumes model.pt stores a full nn.Module, not a bare state_dict) and simplify
python -c "import torch; model = torch.load('model.pt', weights_only=False).eval(); torch.onnx.export(model, torch.randn(1,3,640,640), 'model.onnx', opset_version=17)"
python -m onnxsim model.onnx model_sim.onnx

# TensorRT conversion
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --workspace=4096

# Benchmark
trtexec --loadEngine=model.engine --batch=1 --iterations=1000 --avgRuns=100

Performance Targets

| Metric | Real-time | High Accuracy | Edge |
| --- | --- | --- | --- |
| FPS | >30 | >10 | >15 |
| mAP@50 | >0.6 | >0.8 | >0.5 |
| Latency P99 | <50ms | <150ms | <100ms |
| GPU Memory | <4GB | <8GB | <2GB |
| Model Size | <50MB | <200MB | <20MB |

Resources

  • Architecture Guide: references/computer_vision_architectures.md
  • Optimization Guide: references/object_detection_optimization.md
  • Deployment Guide: references/production_vision_systems.md
  • Scripts: scripts/ directory for automation tools