Flex Processing is a service tier optimized for high-throughput workloads that prioritize fast inference and can tolerate occasional request failures. Flex processing is available to paid customers on all models, with 10x higher rate limits than on-demand processing at the same on-demand pricing.
When flex tier capacity is exceeded, requests fail with HTTP status 498 and error code `capacity_exceeded`. Add jittered backoff and retries to smooth spikes.

```python
import os

import requests

GROQ_API_KEY = os.environ.get("GROQ_API_KEY")


def main():
    try:
        # Request the flex service tier by setting "service_tier": "flex".
        response = requests.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {GROQ_API_KEY}",
            },
            json={
                "service_tier": "flex",
                "model": "llama-3.3-70b-versatile",
                "messages": [{"role": "user", "content": "whats 2 + 2"}],
            },
        )
        print(response.json())
    except Exception as e:
        print(f"Error: {str(e)}")


if __name__ == "__main__":
    main()
```
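One way to add the suggested jittered backoff is to wrap the request in a retry loop that sleeps a random, exponentially growing delay whenever a 498 comes back. This is a minimal sketch, not an official client: the helper names (`flex_chat`, `backoff_delay`) and the retry parameters are assumptions, and only the endpoint, header, and 498 status come from the example above.

```python
import os
import random
import time

import requests

GROQ_API_KEY = os.environ.get("GROQ_API_KEY")
FLEX_CAPACITY_EXCEEDED = 498  # status returned when flex tier capacity is exhausted


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter delay: a random sleep up to an exponentially growing cap."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def flex_chat(payload, max_retries=5):
    """POST a flex-tier chat request, retrying 498 responses with jittered backoff.

    `payload` is the same JSON body shown in the example above,
    including "service_tier": "flex".
    """
    for attempt in range(max_retries):
        response = requests.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {GROQ_API_KEY}",
            },
            json=payload,
        )
        if response.status_code != FLEX_CAPACITY_EXCEEDED:
            # Any non-498 failure (4xx/5xx) is raised rather than retried.
            response.raise_for_status()
            return response.json()
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("flex tier capacity still exceeded after retries")
```

Full jitter (a uniform draw up to the exponential cap) spreads retries from many clients across the window instead of having them all retry in lockstep, which is what smooths capacity spikes.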