
Streaming

adham90 edited this page Feb 16, 2026 · 6 revisions


Stream LLM responses in real-time as they're generated, reducing perceived latency for users.

Enabling Streaming

Per-Agent

```ruby
class StreamingAgent < ApplicationAgent
  model "gpt-4o"
  streaming true # Enable streaming for this agent

  user "{prompt}"
end
```

Global Default

```ruby
# config/initializers/ruby_llm_agents.rb
RubyLLM::Agents.configure do |config|
  config.default_streaming = true
end
```

Using Streaming with a Block

Process chunks as they arrive:

```ruby
StreamingAgent.call(user: "Write a story") do |chunk|
  print chunk.content # chunk is a RubyLLM::Chunk object
end
```

Output appears progressively:

Once... upon... a... time... 

Explicit Stream Method

To force streaming regardless of the class-level setting, use the .stream class method:

```ruby
result = MyAgent.stream(user: "Write a story") do |chunk|
  print chunk.content
end

# Access result metadata after streaming
puts "Tokens: #{result.total_tokens}"
puts "TTFT: #{result.time_to_first_token_ms}ms"
```

This method:

  • Forces streaming even if streaming false is set at class level
  • Requires a block (raises ArgumentError if none provided)
  • Returns a Result object with full metadata
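
The block-required contract can be sketched independently of the library. This is an illustration only, not the library's implementation; `stream_demo` and its hard-coded chunks are invented for the example:

```ruby
# Minimal sketch of a streaming-style method that requires a block:
# raise ArgumentError up front, then yield each chunk as it "arrives".
def stream_demo(&block)
  raise ArgumentError, "no block given" unless block

  %w[Once upon a time].each { |chunk| block.call(chunk) }
  :result # stand-in for the Result object returned after streaming
end

stream_demo { |chunk| print chunk, " " }
# Calling stream_demo without a block raises ArgumentError
```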

Streaming Result Metadata

When streaming completes, the returned Result contains streaming-specific metadata:

```ruby
result = StreamingAgent.call(user: "test") do |chunk|
  print chunk.content
end

result.streaming?             # => true
result.time_to_first_token_ms # => 245 (ms until first chunk arrived)
result.duration_ms            # => 2500 (total execution time)
```

HTTP Streaming

Server-Sent Events (SSE)

```ruby
class StreamingController < ApplicationController
  include ActionController::Live

  def stream_response
    response.headers['Content-Type'] = 'text/event-stream'
    response.headers['Cache-Control'] = 'no-cache'
    response.headers['X-Accel-Buffering'] = 'no' # Disable nginx buffering

    StreamingAgent.call(user: params[:prompt]) do |chunk|
      response.stream.write "data: #{chunk.to_json}\n\n"
    end

    response.stream.write "data: [DONE]\n\n"
  rescue ActionController::Live::ClientDisconnected
    # Client disconnected, clean up
  ensure
    response.stream.close
  end
end
```

Client-Side JavaScript

```javascript
const eventSource = new EventSource('/stream?prompt=' + encodeURIComponent(prompt));

eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    eventSource.close();
    return;
  }
  const chunk = JSON.parse(event.data);
  document.getElementById('output').textContent += chunk;
};

eventSource.onerror = () => {
  eventSource.close();
};
```

Turbo Streams Integration

Controller

```ruby
class ChatController < ApplicationController
  def create
    respond_to do |format|
      format.turbo_stream do
        StreamingAgent.call(user: params[:message]) do |chunk|
          Turbo::StreamsChannel.broadcast_append_to(
            "chat_#{params[:chat_id]}",
            target: "messages",
            partial: "messages/chunk",
            locals: { content: chunk }
          )
        end
      end
    end
  end
end
```

View

```erb
<%= turbo_stream_from "chat_#{@chat.id}" %>
<div id="messages"></div>
```

Time-to-First-Token (TTFT) Tracking

Streaming executions track latency metrics:

```ruby
# After streaming completes
execution = RubyLLM::Agents::Execution.last
execution.streaming?             # => true
execution.time_to_first_token_ms # => 245 (ms until first chunk)
execution.duration_ms            # => 2500 (total time)
```

Analytics

```ruby
# Average TTFT for streaming agents
RubyLLM::Agents::Execution.today.avg_time_to_first_token
# => 312
```

Note: time_to_first_token_ms is stored in the metadata JSON column, not as a direct SQL column. Use the avg_time_to_first_token analytics method for aggregation, or access it on individual instances via execution.time_to_first_token_ms.
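
If you need an aggregation the analytics method doesn't cover, the same computation can be done by hand over metadata hashes. The hash shape below is an assumption based on the note above, and `avg_ttft` is a hypothetical helper, not part of the library:

```ruby
# Average time-to-first-token across execution metadata hashes,
# skipping rows (e.g. non-streaming executions) that lack the key.
def avg_ttft(metadata_rows)
  values = metadata_rows.filter_map { |m| m["time_to_first_token_ms"] }
  return nil if values.empty?

  values.sum.to_f / values.size
end

rows = [
  { "time_to_first_token_ms" => 200 },
  { "time_to_first_token_ms" => 400 },
  { "cached" => true } # non-streaming execution, no TTFT recorded
]
avg_ttft(rows) # => 300.0
```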

Streaming with Structured Output

When using schemas, the full response is still validated:

```ruby
class StructuredStreamingAgent < ApplicationAgent
  model "gpt-4o"
  streaming true

  user "Write about {topic}"

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :title
      string :content
    end
  end
end

# Stream the raw text
StructuredStreamingAgent.call(topic: "AI") do |chunk|
  print chunk # Raw JSON chunks
end
# Result is parsed and validated at the end
```

Caching and Streaming

Important: Streaming responses are not cached by design, as caching would defeat the purpose of real-time streaming.

```ruby
class MyAgent < ApplicationAgent
  streaming true
  cache_for 1.hour # Cache is ignored when streaming
end
```

If you need caching with streaming-like UX, consider:

  1. Cache the full response
  2. Simulate streaming on the client side
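
The second idea can also be sketched server-side, independent of the library: take the cached full response and emit it in small chunks with an optional delay. `simulate_stream`, the chunking regex, and the delay value are all illustration choices, not library API:

```ruby
# Simulate streaming from a cached full response by yielding it in
# word-sized chunks. The lookbehind split keeps whitespace attached,
# so concatenating the chunks reproduces the original text exactly.
def simulate_stream(full_response, delay: 0.0)
  full_response.split(/(?<=\s)/).each do |chunk|
    yield chunk
    sleep(delay) if delay.positive?
  end
end

buffer = +""
simulate_stream("Once upon a time") { |chunk| buffer << chunk }
buffer # => "Once upon a time"
```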

Error Handling

```ruby
begin
  StreamingAgent.call(user: "test") do |chunk|
    print chunk
  end
rescue Timeout::Error
  puts "\n[Stream timed out]"
rescue => e
  puts "\n[Stream error: #{e.message}]"
end
```

Streaming in Background Jobs

For long-running streams, use ActionCable:

```ruby
class StreamingJob < ApplicationJob
  def perform(prompt, channel_id)
    StreamingAgent.call(user: prompt) do |chunk|
      ActionCable.server.broadcast(
        channel_id,
        { type: 'chunk', content: chunk }
      )
    end

    ActionCable.server.broadcast(
      channel_id,
      { type: 'complete' }
    )
  end
end
```

Best Practices

Use for Long Responses

Streaming is most beneficial for:

  • Long-form content generation
  • Conversational interfaces
  • Real-time transcription/translation

Handle Disconnections

```ruby
def stream_response
  StreamingAgent.call(user: params[:prompt]) do |chunk|
    break if response.stream.closed?

    response.stream.write "data: #{chunk.to_json}\n\n"
  end
ensure
  response.stream.close
end
```

Set Appropriate Timeouts

```ruby
class LongFormAgent < ApplicationAgent
  streaming true
  timeout 180 # 3 minutes for long content
end
```

Monitor TTFT

Track time-to-first-token to ensure good UX:

```ruby
# Alert if TTFT is too high
if execution.time_to_first_token_ms > 1000
  Rails.logger.warn("High TTFT: #{execution.time_to_first_token_ms}ms")
end
```
