YOLOv8 Asset Detection: Training a Computer Vision Model on AI-Generated Data
The client had 3 photos. A computer vision model needs hundreds. This is how we closed that gap with generative AI, and why the resulting POC was enough to move the deal forward.
The Problem
Before a company signs off on a custom AI solution, they need to see it work. Not theoretically, not with placeholder data. They want to see their specific asset detected, in realistic conditions, on a live demo.
The challenge: the company had not committed to the project yet. They were cautious about sharing proprietary assets in volume before a deal was finalized. What they provided was three photos.
Three photos are not enough to train an object detection model. A minimum viable training set for this class of task typically requires 50 to 100 labeled examples across varied conditions: different lighting, angles, distances, and backgrounds. Three raw photos cover almost none of that variation.
This is a common deadlock in early-stage AI projects: the client wants a demo before committing resources, but building a real demo requires resources they haven't committed yet. The way out is generative AI.
What Is Computer Vision?
Computer vision is a field of artificial intelligence that enables machines to interpret and understand visual information from images or video. Object detection is a specific task within computer vision: given an image, identify where specific objects appear and draw a bounding box around each one.
A trained detection model processes an image and outputs: "There is an asset of class X at coordinates (x1, y1, x2, y2), with 91% confidence."
This kind of model can run on a live camera feed, a video file, or static images, making it useful for inventory tracking, quality control, security monitoring, and asset management.
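To make that output concrete, a detection can be modeled as a simple record plus a confidence filter. This is an illustrative sketch, not any specific library's API; the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: class label, box corners, and confidence."""
    label: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    confidence: float

def keep_confident(detections: list[Detection], threshold: float = 0.5) -> list[Detection]:
    """Drop detections below a confidence threshold, a standard post-processing step."""
    return [d for d in detections if d.confidence >= threshold]

detections = [
    Detection("asset", (120.0, 80.0, 340.0, 260.0), 0.91),
    Detection("asset", (10.0, 15.0, 60.0, 70.0), 0.32),
]
print(keep_confident(detections))  # only the 0.91 detection survives
```

Downstream consumers (an inventory tracker, a security alert) typically operate on exactly this filtered list rather than on raw model output.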
Reference: What is Computer Vision? (IBM)
Generating a Dataset with Gemini AI
Instead of requesting more photos from the client, I used Gemini AI to generate synthetic training images from the original three.
The approach:
- Submit one original photo as a visual reference to Gemini
- Prompt the model to generate realistic variations under different conditions: bright lighting, low lighting, different angles, partial obstruction, motion blur, varied backgrounds
- Use collage-style prompting to generate 4 condition variations per generation batch
- Repeat across the daily free generation limit over multiple sessions
From three source photos, this produced approximately 100 synthetic images. Each generated image was realistic enough to serve as training data, covering lighting variation, angle variation, distance variation, and degraded conditions that the original three photos could never represent.
This approach is specific to POC phase. The goal is not production-grade precision. The goal is enough generalization to demonstrate that the detection concept works.
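The prompting loop described above can be organized along these lines. The condition list mirrors the article, but the prompt wording is hypothetical and the actual Gemini call is omitted (it was done through the web interface, not an API):

```python
CONDITIONS = [
    "bright lighting", "low lighting", "a different camera angle",
    "partial obstruction", "motion blur", "a varied background",
]

def variation_prompts(asset_name: str, per_batch: int = 4) -> list[str]:
    """Build one collage-style prompt per batch of condition variations."""
    prompts = []
    for i in range(0, len(CONDITIONS), per_batch):
        batch = CONDITIONS[i:i + per_batch]
        prompts.append(
            f"Generate a {len(batch)}-panel collage of realistic photos of the "
            f"reference {asset_name}, one panel each under: " + "; ".join(batch)
        )
    return prompts

for p in variation_prompts("asset"):
    print(p)
```

Batching four conditions per collage is what stretches a daily generation limit: each generation yields four usable training crops instead of one.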
Labeling with Roboflow
After generating the dataset, every image needed to be labeled: bounding boxes drawn around each target asset and assigned a class name.
Roboflow is a platform built for the computer vision labeling and dataset management workflow. It provides:
- A web-based annotation interface for drawing bounding boxes
- Dataset versioning and management
- Export to standard model formats (YOLO, COCO, Pascal VOC, and others)
- Built-in augmentations (flip, rotate, brightness shift, crop) to expand dataset size without manual effort
- Team collaboration and dataset sharing
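For intuition, two of the augmentations listed above (horizontal flip and brightness shift) amount to simple pixel transforms. Roboflow applies these server-side; this pure-Python sketch on a grayscale image stored as nested lists is only illustrative:

```python
Image = list[list[int]]  # grayscale pixels, values 0-255

def hflip(img: Image) -> Image:
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def brightness(img: Image, delta: int) -> Image:
    """Shift every pixel by delta, clamped to the valid 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]

img = [[10, 200], [30, 40]]
print(hflip(img))            # [[200, 10], [40, 30]]
print(brightness(img, 60))   # [[70, 255], [90, 100]]
```

Note that flips also require transforming the bounding-box labels, which is exactly the bookkeeping a platform like Roboflow handles for you.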
I labeled all ~100 images in Roboflow and exported the dataset in YOLOv8 OBB (Oriented Bounding Boxes) format. Standard bounding boxes are always axis-aligned rectangles. OBB boxes can rotate to fit the actual object orientation, which matters when detecting assets that may appear at arbitrary angles in real-world conditions.
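In the exported YOLO OBB format, each label line lists a class index followed by four corner points in normalized 0-1 coordinates. A small parser sketch, assuming that layout:

```python
def parse_obb_line(line: str) -> tuple[int, list[tuple[float, float]]]:
    """Parse one YOLO OBB label line: 'class x1 y1 x2 y2 x3 y3 x4 y4' (normalized)."""
    fields = line.split()
    cls = int(fields[0])
    vals = [float(v) for v in fields[1:]]
    corners = list(zip(vals[0::2], vals[1::2]))  # pair values up into (x, y) points
    return cls, corners

cls, corners = parse_obb_line("0 0.1 0.2 0.6 0.2 0.6 0.5 0.1 0.5")
print(cls, corners)  # 0 [(0.1, 0.2), (0.6, 0.2), (0.6, 0.5), (0.1, 0.5)]
```

The four explicit corners are what let the box rotate; an axis-aligned format would store only two opposite corners.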
Why YOLOv8n
YOLO (You Only Look Once) is a family of real-time object detection models. YOLOv8, developed by Ultralytics, offers several size variants with different accuracy and speed trade-offs:
| Variant | Parameters | mAP50-95 | Inference Speed | Best For |
|---|---|---|---|---|
| YOLOv8n (nano) | 3.2M | 37.3 | Fastest | Edge devices, POC, Termux |
| YOLOv8s (small) | 11.2M | 44.9 | Fast | Balanced accuracy and speed |
| YOLOv8m (medium) | 25.9M | 50.2 | Moderate | General production use |
| YOLOv8l (large) | 43.7M | 52.9 | Slow | Higher accuracy requirements |
| YOLOv8x (extra) | 68.2M | 53.9 | Slowest | Maximum accuracy |
I chose YOLOv8n for this POC for two specific reasons:
Training speed within free compute constraints. Google Colab's free GPU tier has session time limits and compute caps. A nano model trains fast enough to complete full runs and iterate on hyperparameters within those limits.
POC purpose. The goal was not production deployment. It was demonstrating that detection of this specific asset class was feasible. YOLOv8n is accurate enough to produce a convincing demo with a small, clean dataset.
Reference: Ultralytics YOLOv8 Documentation
Training on Google Colab
Google Colab provides free GPU-backed Jupyter notebook sessions. Sessions are limited in length and compute allocation, but that is sufficient for a POC training run on a ~100-image dataset.
```python
from ultralytics import YOLO

# Load YOLOv8n with OBB (Oriented Bounding Box) support
model = YOLO("yolov8n-obb.pt")

# Train on the labeled Roboflow dataset
results = model.train(
    data="dataset/data.yaml",
    epochs=100,
    imgsz=640,
    device="cuda",  # Colab GPU
)
```
The output of a successful training run is best.pt: the model checkpoint with the best validation performance across all epochs. This file is the deliverable for the POC demo.
Did It Work?
Yes. The best.pt model successfully detected the target asset in validation images that were not in the training set. Detections varied in bounding box precision depending on angle and lighting, but the concept was clearly demonstrated: the model could locate and identify the asset across conditions it had never seen.
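Bounding-box precision of this kind is conventionally quantified with intersection-over-union (IoU) between a predicted box and the ground-truth box. For axis-aligned boxes the standard computation is a few lines (this is the general metric, not project-specific code):

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes; 0.0 when disjoint."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... (overlap is a third of the union)
```

The mAP50-95 column in the table above averages precision over IoU thresholds from 0.50 to 0.95, which is why loose boxes hurt that metric even when the object is correctly found.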
Is this production-ready? No. The dataset is too small, the synthetic training data too limited in diversity, and the model variant too lightweight for reliable deployment in a real environment with uncontrolled conditions.
But that was never the objective. The POC goal was met: a working demo that could be placed in front of the client to advance the business conversation. A running model detecting their actual asset is a more persuasive argument than any slide deck.
Key Takeaways
Generative AI can substitute for scarce training data in POC contexts. When real data is unavailable or the client is unwilling to provide it before a deal is signed, synthetic data from models like Gemini can bridge the gap enough to validate the detection concept.
Roboflow significantly reduces labeling friction. The annotation interface, format export options, and built-in augmentations compressed what could have been days of manual work into a few hours.
Free compute is viable for POC. Google Colab's free GPU tier handles small datasets on lightweight models without issue. It is not suitable for production training, but it is entirely sufficient for demonstrating that something works.
The POC is the argument. A running model that detects the client's actual asset, even imperfectly, carries more persuasive force than a theoretical proposal. Getting to that point fast is the real engineering challenge.