This repository constitutes an AI Blueprint and Instruction Manual. It does not contain C++ Chromium source code. Instead, it is a structured prompt task queue intended to orchestrate an AI code-generation agent (such as the Gemini CLI) to build a custom browser application by modifying a local Chromium source tree.
The ultimate goal of this project is to modify the open-source Chromium browser to build SmartChrome—an autonomous browser where:
- A Vision-Language Model (VLM) acts as the "Brain": it receives the browser's Accessibility Tree and viewport screenshots and executes actions (clicking, scrolling, typing) that mimic a real human.
- A Large Language Model (LLM) acts as the "Mentor/Supervisor": it observes the VLM's actions and the resulting browser state to ensure the original user intent (e.g., "Find insights and generate an investment report") is achieved.
- Continuous Learning (Human-in-the-Loop): When a human intercepts and corrects a bad action taken by the VLM while using SmartChrome, the browsing activity and the correction are recorded. This telemetry serves as fine-tuning training data to continually improve the VLM's behavior while the browser is idle.
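The repository does not prescribe a schema for these correction records, but the idea can be sketched as follows. All field names here are illustrative assumptions, not the project's actual telemetry format:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class CorrectionRecord:
    """One human-in-the-loop correction event (hypothetical schema)."""
    objective: str          # original user intent for the mission
    a11y_tree: dict         # accessibility-tree snapshot at decision time
    screenshot_path: str    # viewport screenshot captured alongside it
    vlm_action: dict        # the action the VLM proposed
    human_action: dict      # the action the human performed instead
    timestamp: float = field(default_factory=time.time)

    def to_training_json(self) -> str:
        """Serialize into one JSONL line suitable for a fine-tuning set."""
        return json.dumps(asdict(self))

record = CorrectionRecord(
    objective="Find insights and generate an investment report",
    a11y_tree={"role": "rootWebArea", "children": []},
    screenshot_path="debug_latest_screenshot.jpg",
    vlm_action={"type": "click", "node_id": 42},
    human_action={"type": "click", "node_id": 57},
)
print(record.to_training_json())
```

Pairing the state the VLM saw with both the proposed and the corrected action is what makes each record usable as a supervised training example.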
- `tasks/`: The core of the blueprint. Contains the active queue of XML prompt tasks detailing exact architectural changes, file paths, and build commands.
- `tasks/init_tasks/`: Bootstrap tasks that initially scaffold the repository layout and architecture definitions.
- `backend/`: A Python-based Mock VLM Server used to test the telemetry pipeline and Mojo IPC interface before the real, heavy local VLM is fully integrated.
- `scripts/`: Automation scripts and utilities for analyzing the AI Agent's progress.
- `docs/`: Architecture diagrams, state documentation, and general project notes.
The task queue in tasks/ currently commands the AI Agent to build the following components inside the Chromium Source Tree:
- Task 001-003: Creating the primary Mojo IPC interface (`vlm_agent.mojom`) between the main Browser Process and the isolated VLM utility process.
- Task 004: Implementing the core Accessibility (A11y) tree extraction from the Blink Renderer.
- Task 005-006: Dispatching state from the Renderer to the Browser Process and capturing full-page RGBA viewport screenshots.
- Task 007-008: Wiring the internal mechanisms and preparing the network dispatching logic to forward payloads to the VLM Server.
- Task 009: Setting up the Mock VLM Server and providing a hotfix for dirty Blink accessibility cache crashes.
- Task 010+: Implementing native AXTree pivot extraction, adding VLM trace logging, and fixing Mojo pipe lifecycles (stale pipes and rebinds).
- Task 022: Implementing the SmartChrome Commander, a native Chromium Side Panel (WebUI) for setting mission objectives, monitoring the agent's Chain of Thought, and toggling between Shadow and Autonomous modes.
- Task 023: Implementing Native Resource Integration (GRIT) to bundle UI assets directly into the Chromium binary.
- Task 024: Implementing Agent Autonomy & Navigation, adding support for direct URL navigation and "Bootstrap" search engine queries for empty states.
To utilize this blueprint, your local environment requires:
- WSL2 Ubuntu 22.04 (or a native Linux environment).
- The Google Gemini CLI (or another capable coding agent) installed and authenticated.
- A fully cloned and set-up Chromium Source Tree (e.g., located at `~/chromium/src`).
- Python 3 installed for the mock backend telemetry server.
You do not run this code directly. You feed this repository to your AI CLI:
- Validate that your Chromium build environment is working: `cd ~/chromium/src && autoninja -C out/Default chrome`
- Navigate to the `SmartChrome/tasks/` directory.
- Feed the XML tasks strictly in sequential order (e.g., `task_001_...` then `task_002_...`) to your AI Agent. Example using an AI prompt alias: `cat task_001_frontend_mojo_ipc.xml | gemini-cli "Execute this task against the ~/chromium/src directory."`
- Allow the AI to modify the Chromium C++ source files, add Mojo interfaces, and compile. Watch the agent's stdout to verify the build completes successfully before feeding it the next task.
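Because lexicographic order of the zero-padded task filenames matches the intended execution order, a small helper can enumerate the queue before feeding it to the agent. This is a sketch; the exact filenames in `tasks/` may differ:

```python
from pathlib import Path

def ordered_tasks(task_dir: str) -> list[Path]:
    """Return task XML files sorted by their zero-padded numeric prefix.

    Relies on names like task_001_frontend_mojo_ipc.xml, where sorting
    by filename reproduces the intended execution order.
    """
    return sorted(Path(task_dir).glob("task_*.xml"))

# Example: print the queue without executing anything.
for task in ordered_tasks("tasks"):
    print(task.name)
```

Checking the queue this way before starting a long agent run is a cheap guard against skipping a task or feeding one out of order.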
While the AI agent is building the C++ components in the Chromium tree, you can launch the Mock Server in this directory to verify the telemetry outputs:
- Start the Mock VLM Server: `cd backend && python3 mock_server.py`
- Launch the custom-built SmartChrome: run the AI-modified Chrome binary from your Chromium build folder with the accessibility flag enabled, so the C++ telemetry pipeline captures and sends the UI state to the mock server without crashing: `~/chromium/src/out/Default/chrome --force-renderer-accessibility`
- Watch the `mock_server.py` terminal output. It will save `debug_latest_screenshot.jpg` and `debug_latest_a11y.json` inside the `backend/` folder whenever the browser telemetry pipeline fires.
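The real `backend/mock_server.py` ships with this repository; for orientation only, the debug-file behavior described above could be approximated by a handler like the one below. The payload field names (`screenshot_b64`, `a11y_tree`) are assumptions, not the server's actual schema:

```python
import base64
import json
from pathlib import Path

def handle_telemetry(payload: dict, out_dir: str = "backend") -> None:
    """Persist the latest telemetry snapshot, mimicking the mock server.

    Assumes the browser posts JSON containing a base64-encoded JPEG
    screenshot and the serialized accessibility tree.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Overwrite the "latest" debug artifacts on every telemetry ping.
    (out / "debug_latest_screenshot.jpg").write_bytes(
        base64.b64decode(payload["screenshot_b64"])
    )
    (out / "debug_latest_a11y.json").write_text(
        json.dumps(payload["a11y_tree"], indent=2)
    )
```

An HTTP front end (e.g., `http.server` or Flask) would simply decode the request body and call a handler like this, which is why the two debug files are overwritten each time the pipeline fires.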
Once Task 022 is implemented, you can test the user interface:
- Start the Mock VLM Server: `cd backend && python3 mock_server.py`
- Launch SmartChrome with the Side Panel enabled: `~/chromium/src/out/Default/chrome --force-renderer-accessibility`
- Open the Commander:
- Click the SmartChrome icon in the browser toolbar, or
- Open the Side Panel dropdown and select SmartChrome.
- Set a Mission:
- Type an objective (e.g., "Find the price of Bitcoin") into the Commander text area and press Set Mission.
- Verify that the `mock_server.py` log shows the objective being received via the `/vlm/objective` endpoint.
- Monitor Reasoning:
- Observe the "Chain of Thought" feed in the Commander as the VLM responds to layout changes.
- Test Intervention:
- Click the Intervention button to toggle between Autonomous and Shadow modes. Verify the backend receives the updated state.
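To smoke-test the mock server without launching the browser at all, you can post an objective straight to the `/vlm/objective` endpoint. The JSON field name and the port below are assumptions; check `mock_server.py` for the actual schema and listen address:

```python
import json
import urllib.request

def build_objective_request(objective: str,
                            base_url: str = "http://127.0.0.1:8000"):
    """Build a POST request mirroring what the Commander side panel sends."""
    body = json.dumps({"objective": objective}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/vlm/objective",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With mock_server.py running, send the request and print the reply:
# print(urllib.request.urlopen(
#     build_objective_request("Find the price of Bitcoin")).read())
```

If the server logs the objective the same way it does when the Commander submits one, the backend half of the mission pipeline is working independently of the C++ build.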