
Attachments

adham90 edited this page Feb 16, 2026 · 5 revisions


Send images, PDFs, and other files to vision-capable models using the with: option.

Basic Usage

Single File

    class VisionAgent < ApplicationAgent
      model "gpt-4o" # Vision-capable model
      param :question, required: true

      user "{question}"
    end

    # Local file
    VisionAgent.call(question: "Describe this image", with: "photo.jpg")

    # URL
    VisionAgent.call(
      question: "What architecture is shown?",
      with: "https://example.com/building.jpg"
    )

Multiple Files

    VisionAgent.call(
      question: "Compare these screenshots",
      with: ["screenshot_v1.png", "screenshot_v2.png"]
    )

Supported File Types

RubyLLM automatically detects file types:

| Category  | Extensions                                 |
|-----------|--------------------------------------------|
| Images    | .jpg, .jpeg, .png, .gif, .webp, .bmp       |
| Videos    | .mp4, .mov, .avi, .webm                    |
| Audio     | .mp3, .wav, .m4a, .ogg, .flac              |
| Documents | .pdf, .txt, .md, .csv, .json, .xml         |
| Code      | .rb, .py, .js, .ts, .html, .css, and more  |
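
To check up front which category a file would fall into, the table above can be mirrored in a small lookup. This is only a sketch: `ATTACHMENT_CATEGORIES` and `attachment_category` are hypothetical names, the mapping is copied from the table rather than from RubyLLM's own detection logic, and the code category is limited to the extensions listed.

```ruby
# Hypothetical lookup that mirrors the table above. RubyLLM's own
# detection may differ and recognizes more code extensions.
ATTACHMENT_CATEGORIES = {
  image:    %w[.jpg .jpeg .png .gif .webp .bmp],
  video:    %w[.mp4 .mov .avi .webm],
  audio:    %w[.mp3 .wav .m4a .ogg .flac],
  document: %w[.pdf .txt .md .csv .json .xml],
  code:     %w[.rb .py .js .ts .html .css]
}.freeze

# Returns the category symbol for a path or URL, or nil when the
# extension is not in the table. Any query string is ignored.
def attachment_category(source)
  path = source.to_s.split("?", 2).first.to_s
  ext = File.extname(path).downcase
  ATTACHMENT_CATEGORIES.each do |category, extensions|
    return category if extensions.include?(ext)
  end
  nil
end

attachment_category("photo.JPG")          # => :image
attachment_category("annual_report.pdf")  # => :document
attachment_category("archive.zip")        # => nil
```

A `nil` result is a reasonable point to reject the file before spending a model call on it.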

Vision-Capable Models

Not all models accept image input. The following models do:

| Provider  | Models                                          |
|-----------|-------------------------------------------------|
| OpenAI    | gpt-4o, gpt-4o-mini, gpt-4-turbo                |
| Anthropic | claude-3-5-sonnet, claude-3-opus, claude-3-haiku |
| Google    | gemini-2.0-flash, gemini-1.5-pro                |

Image Analysis Examples

Describe an Image

    class ImageDescriber < ApplicationAgent
      model "gpt-4o"
      param :detail_level, default: "medium"

      user "Describe this image in {detail_level} detail."
    end

    result = ImageDescriber.call(
      detail_level: "high",
      with: "product_photo.jpg"
    )

Extract Text (OCR)

    class OCRAgent < ApplicationAgent
      model "gpt-4o"

      user do
        <<~S
          Extract all text from this image.
          Preserve the original formatting and structure.
          Return the text exactly as it appears.
        S
      end

      def schema
        @schema ||= RubyLLM::Schema.create do
          string :extracted_text, description: "All text found in image"
          array :text_blocks, of: :object do
            string :content
            string :location, description: "top/middle/bottom"
          end
        end
      end
    end

    result = OCRAgent.call(with: "document_scan.png")
    puts result[:extracted_text]

Compare Images

    class ImageComparator < ApplicationAgent
      model "claude-3-5-sonnet"

      user do
        <<~S
          Compare these two images and identify:
          1. Similarities
          2. Differences
          3. Which appears higher quality
        S
      end

      def schema
        @schema ||= RubyLLM::Schema.create do
          array :similarities, of: :string
          array :differences, of: :string
          string :quality_winner, enum: ["first", "second", "equal"]
          string :explanation
        end
      end
    end

    result = ImageComparator.call(with: ["design_v1.png", "design_v2.png"])

Document Analysis

PDF Analysis

    class PDFAnalyzer < ApplicationAgent
      model "gpt-4o"
      param :focus_area, default: "summary"

      user do
        <<~S
          Analyze this PDF document. Focus on: {focus_area}

          Provide:
          - Main topics covered
          - Key points
          - Any important figures or data
        S
      end
    end

    result = PDFAnalyzer.call(
      focus_area: "financial data",
      with: "annual_report.pdf"
    )

Invoice Processing

    class InvoiceExtractor < ApplicationAgent
      model "gpt-4o"

      user "Extract invoice details from this document."

      def schema
        @schema ||= RubyLLM::Schema.create do
          string :invoice_number
          string :date
          string :vendor_name
          number :total_amount
          string :currency, default: "USD"
          array :line_items, of: :object do
            string :description
            integer :quantity
            number :unit_price
            number :total
          end
        end
      end
    end

    result = InvoiceExtractor.call(with: "invoice.pdf")
    # => { invoice_number: "INV-2024-001", total_amount: 1250.00, ... }
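
Extracted figures are model output, so it can be worth cross-checking them before they reach accounting code. The sketch below compares the sum of the line-item totals with `total_amount`. `invoice_total_mismatch` is a hypothetical helper, the sample line items are made up for illustration, and it assumes the result can be read like a Hash with symbol keys matching the schema above.

```ruby
# Hypothetical sanity check on extracted invoice data. Assumes a
# Hash-like result with symbol keys matching the schema.
# Returns nil when the line items add up to the total (within
# tolerance), otherwise the absolute difference.
def invoice_total_mismatch(invoice, tolerance: 0.01)
  line_sum = invoice.fetch(:line_items, []).sum { |item| item[:total].to_f }
  difference = (line_sum - invoice[:total_amount].to_f).abs
  difference > tolerance ? difference.round(2) : nil
end

# Illustrative data only, shaped like the schema above.
invoice = {
  invoice_number: "INV-2024-001",
  total_amount: 1250.00,
  line_items: [
    { description: "Design work", quantity: 10, unit_price: 100.0, total: 1000.0 },
    { description: "Hosting", quantity: 1, unit_price: 250.0, total: 250.0 }
  ]
}

invoice_total_mismatch(invoice) # => nil
```

A non-nil result does not always mean the extraction is wrong, since tax or shipping may be listed outside the line items, but it is a cheap signal for routing an invoice to manual review.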

URLs vs Local Files

Local Files

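    # Relative path (from Rails root)
    result = VisionAgent.call(question: "Describe this image", with: "storage/images/photo.jpg")

    # Absolute path
    result = VisionAgent.call(question: "Describe this image", with: "/path/to/photo.jpg")

    # Active Storage: blobs do not expose a local path, so download
    # to a tempfile with Blob#open and pass that path instead
    user.avatar.blob.open do |file|
      result = VisionAgent.call(question: "Describe this image", with: file.path)
    end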

URLs

    # Direct image URL
    result = VisionAgent.call(
      question: "Describe this image",
      with: "https://example.com/image.jpg"
    )

    # S3 signed URL
    url = document.file.url(expires_in: 1.hour)
    result = VisionAgent.call(question: "Describe this image", with: url)
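
When attachment sources come from user input, it can help to know which of the two kinds you are dealing with before calling the agent. The following is a minimal sketch using Ruby's standard `uri` library. `attachment_source_kind` is a hypothetical helper, not part of the library, and it makes no claim about how the library itself tells the two apart.

```ruby
require "uri"

# Hypothetical helper: returns :url for http(s) sources with a host,
# and :file for everything else (including strings that are not
# valid URIs, such as paths containing spaces).
def attachment_source_kind(source)
  uri = URI.parse(source.to_s)
  uri.is_a?(URI::HTTP) && uri.host ? :url : :file
rescue URI::InvalidURIError
  :file
end

attachment_source_kind("https://example.com/image.jpg") # => :url
attachment_source_kind("storage/images/photo.jpg")      # => :file
attachment_source_kind("/path/to/photo.jpg")            # => :file
```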

Debug Mode

    result = VisionAgent.call(
      question: "test",
      with: ["image1.png", "image2.png"],
      dry_run: true
    )
    # => {
    #   dry_run: true,
    #   agent: "VisionAgent",
    #   attachments: ["image1.png", "image2.png"],
    #   ...
    # }

Error Handling

    begin
      result = VisionAgent.call(
        question: "Describe this",
        with: "missing_file.jpg"
      )
    rescue Errno::ENOENT
      # File not found
      Rails.logger.error("Attachment file not found")
    rescue => e
      # Other errors (network, invalid format, etc.)
      Rails.logger.error("Attachment error: #{e.message}")
    end
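
An alternative to rescuing after the fact is to check local paths before making the call. The sketch below is one way to do that with the standard library only. `missing_attachments` is a hypothetical helper, and treating anything that starts with `http://` or `https://` as a URL is an assumption of this example.

```ruby
require "tempfile"

# Hypothetical pre-flight check: returns the entries in `sources`
# that look like local paths but are not existing files. Entries
# starting with http:// or https:// are treated as URLs and skipped.
def missing_attachments(sources)
  Array(sources).reject do |source|
    source.to_s.match?(%r{\Ahttps?://}i) || File.file?(source.to_s)
  end
end

# Demonstration with one real temporary file, one missing file, and one URL.
Tempfile.create(["photo", ".jpg"]) do |file|
  sources = [file.path, "missing_file.jpg", "https://example.com/image.jpg"]
  missing = missing_attachments(sources)
  puts "Missing attachments: #{missing.join(', ')}" unless missing.empty?
end
```

The check accepts a single source or an array, matching the two forms that `with:` takes. It does not replace the `rescue` clauses above, since a file can still disappear or a URL can fail between the check and the call.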

Best Practices

Optimize Image Size

Large images increase cost and latency:

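    require "mini_magick"

    # Resize before sending; the ">" flag only shrinks images larger than 1024x1024
    image = MiniMagick::Image.open("large_photo.jpg")
    image.resize "1024x1024>"
    image.write "optimized_photo.jpg"

    result = VisionAgent.call(question: "Describe this image", with: "optimized_photo.jpg")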
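
The `1024x1024>` geometry preserves the aspect ratio and leaves smaller images untouched. The arithmetic can be sketched in plain Ruby as follows. `fit_within` is a hypothetical helper for illustration, and ImageMagick's exact rounding may differ by a pixel.

```ruby
# Hypothetical helper illustrating "shrink only" resizing: scales
# width x height to fit inside a max x max box while preserving the
# aspect ratio, and never enlarges a smaller image.
def fit_within(width, height, max)
  return [width, height] if width <= max && height <= max

  # Use exact rational arithmetic to avoid floating-point drift.
  scale = [Rational(max, width), Rational(max, height)].min
  [(width * scale).round, (height * scale).round]
end

fit_within(4000, 3000, 1024) # => [1024, 768]
fit_within(800, 600, 1024)   # => [800, 600]
```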

Use Appropriate Detail Level

Some providers support detail levels. You can request one in the prompt:

    # OpenAI specific - in your prompt
    user "Using high detail analysis, describe every element in this image."

Batch Related Images

Group related images in a single call:

    # One call with multiple images (cheaper than multiple calls)
    result = ImageComparator.call(
      with: ["before.jpg", "after.jpg"]
    )

Handle Large Documents

For large PDFs, raise the timeout, and consider splitting the document into smaller chunks:

    class LargeDocumentAgent < ApplicationAgent
      model "gpt-4o"
      timeout 180 # Longer timeout for large docs

      user "Analyze this document page by page. Focus on key information."
    end
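
If you do split a document, the page ranges for each chunk can be computed up front. This is a minimal sketch of that bookkeeping only. `page_chunks` is a hypothetical helper, and writing each range to its own PDF file is left to whichever PDF library you already use.

```ruby
# Hypothetical helper: splits pages 1..total_pages into consecutive
# ranges of at most pages_per_chunk pages each.
def page_chunks(total_pages, pages_per_chunk)
  (1..total_pages).each_slice(pages_per_chunk).map do |pages|
    pages.first..pages.last
  end
end

page_chunks(10, 4) # => [1..4, 5..8, 9..10]
```

Each range can then be exported to a separate file and passed to the agent in its own call, which keeps individual requests small at the cost of losing context that spans chunks.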

