Attachments
adham90 edited this page Feb 16, 2026 · 5 revisions
Send images, PDFs, and other files to vision-capable models using the with: option.
```ruby
class VisionAgent < ApplicationAgent
  model "gpt-4o" # Vision-capable model
  param :question, required: true
  user "{question}"
end

# Local file
VisionAgent.call(question: "Describe this image", with: "photo.jpg")

# URL
VisionAgent.call(
  question: "What architecture is shown?",
  with: "https://example.com/building.jpg"
)
```

Pass an array to send multiple files:

```ruby
VisionAgent.call(
  question: "Compare these screenshots",
  with: ["screenshot_v1.png", "screenshot_v2.png"]
)
```

RubyLLM automatically detects file types:
| Category | Extensions |
|---|---|
| Images | .jpg, .jpeg, .png, .gif, .webp, .bmp |
| Videos | .mp4, .mov, .avi, .webm |
| Audio | .mp3, .wav, .m4a, .ogg, .flac |
| Documents | .pdf, .txt, .md, .csv, .json, .xml |
| Code | .rb, .py, .js, .ts, .html, .css, and more |
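The table above can be mirrored in application code, for example to validate an upload before passing it to `with:`. The following is a minimal sketch, not RubyLLM's internal detection logic: the `ATTACHMENT_CATEGORIES` constant and the `attachment_category` helper are hypothetical names, and the code list covers only the extensions named in the table.

```ruby
# Hypothetical lookup table mirroring the categories listed above.
# This is an illustration, not RubyLLM's own implementation.
ATTACHMENT_CATEGORIES = {
  image:    %w[.jpg .jpeg .png .gif .webp .bmp],
  video:    %w[.mp4 .mov .avi .webm],
  audio:    %w[.mp3 .wav .m4a .ogg .flac],
  document: %w[.pdf .txt .md .csv .json .xml],
  code:     %w[.rb .py .js .ts .html .css]
}.freeze

# Returns the category symbol for a local path or URL, or :unknown
# when the extension is not in the table.
def attachment_category(path)
  # Drop any query string so signed URLs still resolve by extension.
  base = path.to_s.split("?").first.to_s
  ext = File.extname(base).downcase
  match = ATTACHMENT_CATEGORIES.find { |_category, exts| exts.include?(ext) }
  match ? match.first : :unknown
end
```

A caller might reject `:unknown` files early instead of waiting for a provider error.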
Not all models support vision. Choose one of the following vision-capable models:
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo |
| Anthropic | claude-3-5-sonnet, claude-3-opus, claude-3-haiku |
| Google | gemini-2.0-flash, gemini-1.5-pro |
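Because attachments only make sense with a vision-capable model, it can help to guard against a misconfigured agent before sending files. The sketch below copies the model list from the table above into a hypothetical `VISION_MODELS` constant; the helper names are illustrative and not part of the library, and the list will need updating as providers release new models.

```ruby
# Hypothetical allow-list copied from the table above.
VISION_MODELS = {
  "OpenAI"    => %w[gpt-4o gpt-4o-mini gpt-4-turbo],
  "Anthropic" => %w[claude-3-5-sonnet claude-3-opus claude-3-haiku],
  "Google"    => %w[gemini-2.0-flash gemini-1.5-pro]
}.freeze

# True when the model id appears in the allow-list.
def vision_capable?(model_id)
  VISION_MODELS.values.flatten.include?(model_id.to_s)
end

# Returns the provider name for a listed model, or nil if it is not listed.
def vision_provider(model_id)
  entry = VISION_MODELS.find { |_provider, models| models.include?(model_id.to_s) }
  entry&.first
end
```

For example, an application could check `vision_capable?("gpt-4o")` before passing `with:` and log a warning otherwise.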
Describe an image:

```ruby
class ImageDescriber < ApplicationAgent
  model "gpt-4o"
  param :detail_level, default: "medium"
  user "Describe this image in {detail_level} detail."
end

result = ImageDescriber.call(
  detail_level: "high",
  with: "product_photo.jpg"
)
```

Extract text from an image (OCR):

```ruby
class OCRAgent < ApplicationAgent
  model "gpt-4o"

  user do
    <<~S
      Extract all text from this image.
      Preserve the original formatting and structure.
      Return the text exactly as it appears.
    S
  end

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :extracted_text, description: "All text found in image"
      array :text_blocks, of: :object do
        string :content
        string :location, description: "top/middle/bottom"
      end
    end
  end
end

result = OCRAgent.call(with: "document_scan.png")
puts result[:extracted_text]
```

Compare two images:

```ruby
class ImageComparator < ApplicationAgent
  model "claude-3-5-sonnet"

  user do
    <<~S
      Compare these two images and identify:
      1. Similarities
      2. Differences
      3. Which appears higher quality
    S
  end

  def schema
    @schema ||= RubyLLM::Schema.create do
      array :similarities, of: :string
      array :differences, of: :string
      string :quality_winner, enum: ["first", "second", "equal"]
      string :explanation
    end
  end
end

result = ImageComparator.call(with: ["design_v1.png", "design_v2.png"])
```

Analyze a PDF:

```ruby
class PDFAnalyzer < ApplicationAgent
  model "gpt-4o"
  param :focus_area, default: "summary"

  user do
    <<~S
      Analyze this PDF document.
      Focus on: {focus_area}

      Provide:
      - Main topics covered
      - Key points
      - Any important figures or data
    S
  end
end

result = PDFAnalyzer.call(
  focus_area: "financial data",
  with: "annual_report.pdf"
)
```

Extract structured invoice data:

```ruby
class InvoiceExtractor < ApplicationAgent
  model "gpt-4o"
  user "Extract invoice details from this document."

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :invoice_number
      string :date
      string :vendor_name
      number :total_amount
      string :currency, default: "USD"
      array :line_items, of: :object do
        string :description
        integer :quantity
        number :unit_price
        number :total
      end
    end
  end
end

result = InvoiceExtractor.call(with: "invoice.pdf")
# => { invoice_number: "INV-2024-001", total_amount: 1250.00, ... }
```

Local file paths:

```ruby
# Relative path (from Rails root)
result = VisionAgent.call(with: "storage/images/photo.jpg")

# Absolute path
result = VisionAgent.call(with: "/path/to/photo.jpg")

# Active Storage (Disk service only; a blob has no #path method,
# so the on-disk path is resolved through the storage service)
result = VisionAgent.call(
  with: ActiveStorage::Blob.service.path_for(user.avatar.blob.key)
)
```

URLs:

```ruby
# Direct image URL
result = VisionAgent.call(with: "https://example.com/image.jpg")

# S3 signed URL
url = document.file.url(expires_in: 1.hour)
result = VisionAgent.call(with: url)
```

Preview a call with `dry_run: true`:

```ruby
result = VisionAgent.call(
  question: "test",
  with: ["image1.png", "image2.png"],
  dry_run: true
)
# => {
#   dry_run: true,
#   agent: "VisionAgent",
#   attachments: ["image1.png", "image2.png"],
#   ...
# }
```

Handle attachment errors:

```ruby
begin
  result = VisionAgent.call(
    question: "Describe this",
    with: "missing_file.jpg"
  )
rescue Errno::ENOENT
  # File not found
  Rails.logger.error("Attachment file not found")
rescue => e
  # Other errors (network, invalid format, etc.)
  Rails.logger.error("Attachment error: #{e.message}")
end
```

Large images increase cost and latency:
```ruby
# Resize before sending ("1024x1024>" only shrinks images larger than that)
image = MiniMagick::Image.open("large_photo.jpg")
image.resize "1024x1024>"
image.write "optimized_photo.jpg"

result = VisionAgent.call(with: "optimized_photo.jpg")
```

Some providers support detail levels:

```ruby
# OpenAI specific - in your prompt
user "Using high detail analysis, describe every element in this image."
```

Group related images in a single call:

```ruby
# One call with multiple images (cheaper than multiple calls)
result = CompareAgent.call(
  with: ["before.jpg", "after.jpg"]
)
```

For large PDFs, consider chunking the document. At a minimum, allow a longer timeout:

```ruby
class LargeDocumentAgent < ApplicationAgent
  model "gpt-4o"
  timeout 180 # Longer timeout for large docs
  user "Analyze this document page by page. Focus on key information."
end
```
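The agent above only raises the timeout; it does not split the document. One way to approach chunking is to plan page ranges first and then send each range as its own file. The sketch below covers only the planning step. The `page_batches` helper is a hypothetical name, and writing each range out as a separate PDF is assumed to be handled by whatever PDF library the application already uses.

```ruby
# Hypothetical helper: split a page count into consecutive page ranges
# of at most batch_size pages each.
def page_batches(total_pages, batch_size)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?
  return [] if total_pages <= 0

  (1..total_pages).each_slice(batch_size).map { |pages| pages.first..pages.last }
end

page_batches(10, 4) # => [1..4, 5..8, 9..10]
```

Each range could then be exported to a temporary file and passed to `LargeDocumentAgent.call(with: ...)`, with the per-chunk results merged afterwards.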