Computer Vision · Jan 2024 - Mar 2024

Harmful Content Classifier — Image & Video

Local vision LLM-powered harmful content detection system that classifies images and video frames into threat categories like weapons and nudity, with aggregated reasoning and detailed response output.

Architecture Flow

Image / Video → Frame Sampler (OpenCV) → Vision LLM (Local · Ollama) → Aggregate (Resolve conflicts) → Verdict (Label + reason)

Key Achievements

Built a fully local vision LLM inference pipeline with no external API calls, so submitted media never leaves the host machine
Classified harmful content across multiple threat categories: knives, rifles, nudity, and extensible custom types
Returned structured classification output with category label and human-readable reasoning for each detection
Engineered a video frame extraction and per-frame inference pipeline with aggregated result consolidation
Handled temporal inconsistencies across video frames through result aggregation and conflict resolution logic
Designed modular and extensible architecture for adding new harm categories without structural changes
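
One way the extensible category design could look is a small registry that feeds the model prompt. This is a minimal sketch, not the project's actual code: the names HarmCategory, CATEGORIES, and build_prompt, and the prompt wording, are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmCategory:
    """One harm category the classifier may report (hypothetical structure)."""
    label: str
    description: str


# Hypothetical registry; the labels mirror the categories named in the list above.
CATEGORIES = [
    HarmCategory("knife", "bladed weapons such as knives"),
    HarmCategory("rifle", "firearms such as rifles"),
    HarmCategory("nudity", "nudity or explicit exposure"),
]


def build_prompt(categories):
    """Render the registered categories into a single classification prompt."""
    bullets = "\n".join(f"- {c.label}: {c.description}" for c in categories)
    allowed = ", ".join([c.label for c in categories] + ["safe"])
    return (
        "Classify the image into exactly one category.\n"
        f"{bullets}\n"
        "- safe: none of the above\n"
        f"Allowed labels: {allowed}\n"
        'Reply as JSON with the keys "label" and "reason".'
    )
```

Under this sketch, adding a harm category means appending one HarmCategory entry; the prompt and the set of allowed labels follow from the registry, so no other code has to change.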

Core Challenge

Detecting harmful content across diverse media types — images and videos — while keeping inference entirely local for privacy compliance, and producing not just binary flags but meaningful, structured categorization with reasoning that supports human review.

Solution

Deployed a locally running vision LLM via Ollama so that harmful content inference depends on no external service. For videos, built a frame sampling pipeline with OpenCV that extracts frames at regular intervals, runs per-frame classification, and aggregates results into a single consolidated verdict with conflict resolution across temporally inconsistent frames.
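
The regular-interval frame sampling can be sketched roughly as follows. The function names (sample_indices, sample_frames) and the one-second default interval are assumptions, not values from the project; the OpenCV calls are the standard VideoCapture API.

```python
def sample_indices(total_frames, fps, every_seconds=1.0):
    """Return frame indices spaced roughly `every_seconds` apart."""
    if total_frames <= 0:
        return []
    # Fall back to a step of one frame when FPS metadata is missing or zero.
    step = max(1, round(fps * every_seconds)) if fps > 0 else 1
    return list(range(0, total_frames, step))


def sample_frames(video_path, every_seconds=1.0):
    """Yield (frame_index, frame) pairs from a video using OpenCV."""
    import cv2  # imported lazily so sample_indices works without OpenCV installed

    cap = cv2.VideoCapture(video_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        for index in sample_indices(total, fps, every_seconds):
            # Seek to the target frame, then decode it.
            cap.set(cv2.CAP_PROP_POS_FRAMES, index)
            ok, frame = cap.read()
            if not ok:
                break
            yield index, frame
    finally:
        cap.release()
```

For a 100-frame clip at 25 FPS, sample_indices(100, 25.0) selects frames 0, 25, 50, and 75. Seeking by frame index can be imprecise for some codecs, so a sequential read that skips frames is a reasonable alternative when exact positions matter.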

Timeline

Jan 2024 - Mar 2024

Team

Lead Engineer

Status

Production Ready

Deep Dive

Developed a privacy-first harmful content classification system that runs entirely on local infrastructure using a vision-capable LLM — eliminating the need to send sensitive media to external APIs. Users submit an image or video and the system autonomously determines whether harmful content is present, categorizing detections into clearly defined threat classes.

The classification engine identifies a range of harmful categories including bladed weapons (knives), firearms (rifles), nudity, and other configurable threat types — returning not just a classification label but a structured reason explaining the detected content, enabling downstream audit trails and human review workflows.
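
A label-plus-reason result like the one described above could be obtained from a local Ollama server roughly as sketched here. The model name "llava", the helper names, and the fail-safe "unknown" label are assumptions; the endpoint and payload fields follow Ollama's generate API on its default port.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(image_bytes, prompt, model="llava"):
    """Build the JSON payload for a non-streaming Ollama generate call."""
    return {
        "model": model,  # assumed model name; the source does not specify one
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "format": "json",  # ask the model to emit valid JSON
        "stream": False,
    }


def parse_detection(raw_text, allowed_labels):
    """Turn the model's text into a label/reason dict, failing safe to 'unknown'."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return {"label": "unknown", "reason": "model output was not valid JSON"}
    if not isinstance(data, dict):
        return {"label": "unknown", "reason": "model output was not a JSON object"}
    label = str(data.get("label", "")).strip().lower()
    if label not in allowed_labels:
        return {"label": "unknown", "reason": f"unexpected label: {label!r}"}
    return {"label": label, "reason": str(data.get("reason", "")).strip()}


def classify_image(image_bytes, prompt, allowed_labels):
    """Send one image to the local model and return the parsed detection."""
    body = json.dumps(build_request(image_bytes, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return parse_detection(reply.get("response", ""), allowed_labels)
```

Validating the label against an allowed set keeps malformed or off-list model output from reaching the audit trail as if it were a real detection; such cases surface as "unknown" and can be routed to human review.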

For video inputs, the system implements a frame extraction pipeline that samples frames at defined intervals, runs the vision model inference on each frame independently, and then aggregates the per-frame results into a unified video-level verdict. The aggregation step accounts for temporal consistency and resolves conflicting frame-level signals into a coherent final classification.
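
The source does not state the exact conflict resolution rule, so the following is only one plausible sketch. It assumes a harmful label must appear in at least min_hits sampled frames to count, that the most frequent such label wins, and that ties go to the label seen first; aggregate_verdict and its parameters are hypothetical names.

```python
from collections import Counter


def aggregate_verdict(frame_results, min_hits=2, safe_label="safe"):
    """Combine per-frame detections into one video-level verdict.

    frame_results: list of dicts with "frame", "label", and "reason" keys.
    """
    harmful = [r for r in frame_results if r["label"] != safe_label]
    counts = Counter(r["label"] for r in harmful)
    # Ignore labels seen in fewer than min_hits frames as likely one-off noise.
    supported = {label: n for label, n in counts.items() if n >= min_hits}
    if not supported:
        return {
            "label": safe_label,
            "reason": "no harmful label met the frame threshold",
            "frames": [],
        }
    # Most frequent label wins; insertion order makes ties go to the earliest label.
    winner = max(supported, key=supported.get)
    hits = [r for r in harmful if r["label"] == winner]
    return {
        "label": winner,
        "reason": hits[0]["reason"],
        "frames": [r["frame"] for r in hits],
    }
```

The threshold trades recall for stability: min_hits=1 flags a video on any single harmful frame, while higher values suppress isolated misclassifications at the cost of missing very brief content. Returning the supporting frame indices gives a reviewer something concrete to inspect.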

Tangible Impact

Delivered a fully offline-capable harmful content moderation system supporting both image and video inputs, with structured multi-category output and reasoning — suitable for integration into content pipelines requiring privacy-preserving, auditable moderation.

Tech Stack

Python · Vision LLM · Ollama · OpenCV · FastAPI · Frame Extraction · Local Inference