# SWE-Vision


An agentic VLM (Vision Language Model) framework that gives a language model access to a stateful Jupyter notebook running inside a Docker container. The agent can iteratively write and execute Python code to process images, run computations, and produce visualizations — all within a sandboxed environment.
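The loop described above can be sketched in a few lines. Note that `call_model` and `run_in_kernel` are hypothetical stand-ins for the real LLM call and Docker kernel, not SWE-Vision's actual API; only the tool names `execute_code` and `finish` come from this README.

```python
# Sketch of the agentic loop: call the model repeatedly; each "execute_code"
# tool call runs in the stateful kernel and its output is fed back, until the
# model emits a "finish" call carrying the final answer.

def run_agent(query, call_model, run_in_kernel, max_iterations=10):
    messages = [{"role": "user", "content": query}]
    for _ in range(max_iterations):
        tool_call = call_model(messages)  # one LLM step
        if tool_call["name"] == "finish":
            return tool_call["arguments"]["answer"]
        # execute_code: run in the kernel and append the result for the model
        result = run_in_kernel(tool_call["arguments"]["code"])
        messages.append({"role": "tool", "content": result})
    return None  # iteration budget exhausted


# Toy stand-ins that finish after one code execution.
def fake_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"name": "finish", "arguments": {"answer": "done"}}
    return {"name": "execute_code", "arguments": {"code": "print(1 + 1)"}}

def fake_kernel(code):
    return "2"

print(run_agent("What is 1 + 1?", fake_model, fake_kernel))  # prints "done"
```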

## Project Structure

```
SWE-Vision/
├── swe_vision/                 # Core library
│   ├── __init__.py             # Package exports
│   ├── config.py               # Constants, logging, tool definitions, system prompt
│   ├── kernel.py               # JupyterNotebookKernel — Docker-based Jupyter runtime
│   ├── image_utils.py          # Image encoding, MIME detection, OpenAI content parts
│   ├── file_manager.py         # NotebookFileManager — host ↔ container file sharing
│   ├── trajectory.py           # TrajectoryRecorder — saves full agent traces to disk
│   ├── agent.py                # VLMToolCallAgent — agentic loop with tool calling
│   ├── cli.py                  # CLI entry point
│   └── eval_utils.py           # LLM judge prompt, answer extraction utilities
│
├── apps/                       # Standalone applications
│   ├── web_app.py              # ChatGPT-style web UI (Flask + SSE streaming)
│   └── trajectory_viewer.py    # Trajectory visualization dashboard (Flask)
│
├── env/                        # Docker environment (Dockerfile for the kernel)
├── requirements.txt
└── README.md
```

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Set environment variables

```bash
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"  # custom API endpoint
export OPENAI_MODEL="openai/gpt-5.2"                   # default model
```

### 3. Prepare the Docker environment

The agent runs code inside a Docker container. Make sure Docker is installed and running, place a Dockerfile in the `env/` directory, then build the kernel image:

```bash
docker build -t swe-vision -f ./env/Dockerfile ./env
```
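The README does not include the Dockerfile itself; a hypothetical minimal `env/Dockerfile` might look like the following. The base image and the package choices (`ipykernel` for the in-container kernel, plus common image/plotting libraries) are assumptions, not confirmed requirements of SWE-Vision.

```dockerfile
# Hypothetical minimal kernel image: Python with a Jupyter kernel and a few
# libraries the agent's generated code is likely to use for image work.
FROM python:3.11-slim
RUN pip install --no-cache-dir ipykernel jupyter_client pillow numpy matplotlib
WORKDIR /workspace
```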

### 4. Run the agent (CLI)

We provide a script to run the agent with a single command.

```bash
bash run.sh
```

You can also run the agent manually.

```bash
# Single query with an image
python -m swe_vision.cli --image photo.png "What objects are in this image?"

# Multiple images
python -m swe_vision.cli -i img1.png -i img2.png "What is the difference between these two images?"
```

### 5. Run the Web UI

A ChatGPT-style interface with real-time streaming of the agent's reasoning, code execution, and results:

```bash
python apps/web_app.py --port 8080
# Open http://localhost:8080
```

Web App Screenshot

### 6. View trajectories

Every agent run saves a trajectory (JSON + images) to `./trajectories/`. Browse them with the viewer:

```bash
python apps/trajectory_viewer.py --port 5050
# Open http://localhost:5050
```

Trajectory Viewer Screenshot

## Architecture

```
     User Query (+ images)
              │
              ▼
┌──────────────────────┐
│  LLM (e.g. GPT-5.2)  │◄───────────────────────┐
│                      │                        │
│  Tool Calls:         │                        │
│  ┌────────────────┐  │      ┌──────────────┐  │
│  │ execute_code   │──┼─────►│Jupyter Kernel│  │
│  └────────────────┘  │      │   (Docker)   │  │
│  ┌────────────────┐  │      └──────┬───────┘  │
│  │ finish         │──┼──► Answer   │ (Output) │
│  └────────────────┘  │             │          │
└──────────────────────┘     text + images ─────┘
```

Key components:

| Module | Responsibility |
|--------|----------------|
| `config.py` | All constants, tool schemas, system prompt |
| `kernel.py` | Builds Docker image, starts container, manages Jupyter kernel via ZMQ |
| `agent.py` | Orchestrates the agentic loop: LLM calls → tool dispatch → result collection |
| `trajectory.py` | Records every step with timestamps, code, images; saves to JSON |
| `image_utils.py` | Base64 encoding, compression, OpenAI content part builders |
| `file_manager.py` | Copies files into the Docker mount so the kernel can access them |
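Since `config.py` is said to hold the tool schemas, a plausible OpenAI function-calling definition for `execute_code` is sketched below. The field values and description text are hypothetical; only the tool name itself comes from this README, and SWE-Vision's actual schema may differ.

```python
# Hypothetical shape of the execute_code tool schema, in the standard
# OpenAI chat-completions "tools" format.
EXECUTE_CODE_TOOL = {
    "type": "function",
    "function": {
        "name": "execute_code",
        "description": (
            "Run Python code in the stateful Jupyter kernel "
            "and return its text and image outputs."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Python source to execute",
                },
            },
            "required": ["code"],
        },
    },
}
```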

## CLI Options

```
usage: python -m swe_vision.cli [-h] [--image IMAGE] [--interactive]
                                [--model MODEL] [--api-key API_KEY]
                                [--base-url BASE_URL]
                                [--max-iterations MAX_ITERATIONS]
                                [--save-trajectory SAVE_TRAJECTORY]
                                [--verbose] [--quiet]
                                [--reasoning | --no-reasoning]
                                [query]
```
| Flag | Description |
|------|-------------|
| `--image`, `-i` | Image file path (repeatable) |
| `--interactive` | Multi-turn interactive mode |
| `--model`, `-m` | Model name (default: `gpt-4o` or `$OPENAI_MODEL`) |
| `--reasoning` / `--no-reasoning` | Enable/disable extended reasoning |
| `--save-trajectory` | Custom trajectory output directory |
| `--quiet`, `-q` | Minimal console output |

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `OPENAI_API_KEY` | API key for the LLM provider | (required) |
| `OPENAI_BASE_URL` | Custom API base URL | OpenAI default |
| `OPENAI_MODEL` | Default model name | `gpt-4o` |
| `VLM_DOCKER_IMAGE` | Docker image name for the kernel | `swe-vision:latest` |
| `VLM_DOCKERFILE_DIR` | Path to the Dockerfile directory | `./env/` |
| `VLM_HOST_WORK_DIR` | Host-side working directory for file sharing | `~/tmp/vlm_docker_workdir` |
| `VLM_WEB_SESSION_DIR` | Session storage for the web app | `/tmp` |
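As an illustration (not SWE-Vision's actual code), defaults like these are typically resolved with `os.environ.get`, with the fallback values mirroring the table above:

```python
import os

# Fall back to the documented defaults when the variable is unset.
MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o")
DOCKER_IMAGE = os.environ.get("VLM_DOCKER_IMAGE", "swe-vision:latest")
DOCKERFILE_DIR = os.environ.get("VLM_DOCKERFILE_DIR", "./env/")

print(MODEL, DOCKER_IMAGE, DOCKERFILE_DIR)
```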

## Programmatic Usage

```python
import asyncio

from swe_vision import VLMToolCallAgent


async def main():
    agent = VLMToolCallAgent(
        model="openai/gpt-5.2",
        api_key="sk-...",
        reasoning=True,
    )
    try:
        answer = await agent.run(
            "Analyze this chart and summarize the trends",
            image_paths=["chart.png"],
        )
        print(answer)
    finally:
        await agent.cleanup()


asyncio.run(main())
```

## License

MIT
