📖 LLM Test Bench — Help & Guide

Everything you need to understand prompt testing, configure providers, interpret results, and get the most from your LLM experiments.

🔬 What is Prompt Testing?

Prompt testing is the practice of systematically evaluating how different large language models (LLMs) respond to the same or similar inputs. Just as software engineers write unit tests for code, AI engineers write and refine prompts to ensure models behave reliably across a range of inputs.

LLM Test Bench lets you configure multiple AI providers, run prompts against them, compare outputs side-by-side (Duel mode), and build a searchable history of every experiment — so your learning compounds over time.

  • 🧪 Test Tab — Run a single prompt against one model. See output, token usage, and save to history.
  • ⚔️ Duel Tab — Send the same prompt to two models simultaneously. Compare outputs side-by-side.
  • ⚙️ Configurations — Store API keys, URLs, and model lists for each provider. Export encrypted backups.
  • 📊 History — Every test run is saved. Add comments, reload into Test, or export as JSON.
💡 Why Prompt Testing Matters

The same question asked to different models — or even to the same model with a slightly different system prompt — can produce dramatically different results. Prompt testing helps you:

  • Find the best model for your task — A coding assistant, a creative writer, and a factual Q&A bot all have different ideal models.
  • Reduce costs — Smaller, cheaper models often perform just as well for simple tasks. Testing reveals where you can save.
  • Improve reliability — A well-crafted system prompt that's been tested reduces hallucinations and keeps the model on-task.
  • Document your findings — The History tab turns ephemeral experiments into a searchable knowledge base.
  • Avoid vendor lock-in — By testing multiple providers, you understand trade-offs and can switch if pricing or availability changes.
💬 Rule of thumb: Start with a cheaper, faster model (e.g. Claude Haiku, GPT-4o Mini) to prototype your prompt. Only move to a larger model once the task is well-defined.
🚀 Getting Started

Step 1 — Add a Provider

Go to Configurations and click + Add Provider. Type a provider name — suggestions from a built-in catalog will appear automatically. Selecting a known name (e.g. "Anthropic") auto-fills the API URL.

Step 2 — Add Models

In the Models field, type or pick from the suggestions and click Add. Star (☆) a model to mark it as the default — it will be pre-selected whenever you open the Test tab.

Step 3 — Run a Test

Go to Test, choose your provider and model, enter a System Prompt and User Message, then click ▶ Run Test. The output and token counts appear below and are saved automatically to History.

⚠️ API keys are required. Your key is sent directly to the chosen provider for each request. It is never logged or stored beyond the request. In Local Mode (no sign-in), your keys live only in your browser's localStorage.

Popular Provider URLs

| Provider | API Base URL | Key Prefix |
| --- | --- | --- |
| Anthropic | https://api.anthropic.com/v1 | sk-ant-… |
| OpenAI | https://api.openai.com/v1 | sk-proj-… |
| Google | https://generativelanguage.googleapis.com/v1beta | AIza… |
| Mistral | https://api.mistral.ai/v1 | |
| Groq | https://api.groq.com/openai/v1 | gsk_… |
| DeepSeek | https://api.deepseek.com/v1 | |
| Together AI | https://api.together.xyz/v1 | |
| Perplexity | https://api.perplexity.ai | pplx-… |
| Ollama (local) | http://localhost:11434/v1 | n/a |
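Most of the providers above (all but Anthropic and Google, which use their own request schemas) expose an OpenAI-compatible /chat/completions endpoint. As an illustrative sketch of what such a request looks like — the key and model name below are placeholders, not real values:

```python
import json

def build_chat_request(base_url, api_key, model, system_prompt, user_message,
                       temperature=0.7, max_tokens=1000):
    """Assemble the URL, headers, and JSON body for an OpenAI-compatible
    /chat/completions call (OpenAI, Groq, Mistral, DeepSeek, Together AI,
    and Ollama all accept this shape)."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    body = {"model": model,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "temperature": temperature,
            "max_tokens": max_tokens}
    return url, headers, json.dumps(body)

# Placeholder key and model name, for illustration only:
url, headers, body = build_chat_request(
    "https://api.groq.com/openai/v1", "gsk_...", "llama-3.1-8b-instant",
    "You are a helpful assistant.", "Say hello.")
# url == "https://api.groq.com/openai/v1/chat/completions"
```

The exact payload the app sends may differ in detail; this is only the general shape of the protocol.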
⚙️ Understanding Parameters

The three parameters in the Test tab control how the model generates text. They are collapsed by default — click ⚙ Parameters to expand them.

Temperature — Controls randomness

| Value | Behaviour | Good For |
| --- | --- | --- |
| 0.0 | Fully deterministic | Math, factual Q&A, structured output |
| 0.3 | Mostly consistent | Code generation, summaries |
| 0.7 | Balanced (default) | General purpose, conversation |
| 1.0 | Creative & varied | Creative writing, brainstorming |
| 1.5 – 2.0 | Highly unpredictable | Experimental / artistic only |
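What temperature actually does can be seen in miniature: the model's raw next-token scores are divided by the temperature before being turned into probabilities, so low values sharpen the distribution toward the top token. A sketch with made-up scores (not from any real model):

```python
import math

def scaled_probs(logits, temperature):
    """Softmax over logits divided by temperature. Lower temperature
    concentrates probability on the highest-scoring token."""
    t = max(temperature, 1e-6)           # guard against division by zero at t = 0
    exps = [math.exp(l / t) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # hypothetical next-token scores
low  = scaled_probs(logits, 0.2)         # near-deterministic: top token dominates
high = scaled_probs(logits, 1.5)         # flatter: more randomness in sampling
```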

Max Tokens — Maximum output length

| Value | Typical Output |
| --- | --- |
| 100 | A sentence or two |
| 500 | A short paragraph |
| 1,000 | Detailed explanation (default) |
| 2,000 | Article or documentation |
| 4,000+ | Long-form content |

Top P (Nucleus Sampling) — Vocabulary diversity

📌 Keep Top P at 1.0 (default) unless you are specifically experimenting. Lower values (e.g. 0.1) make the model very conservative; it only picks from the most probable next words. Temperature and Top P interact — avoid setting both to extreme values at the same time.
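The mechanism behind Top P is easy to sketch: rank tokens by probability and keep the smallest set whose cumulative probability reaches P; the model samples only from that set. Using made-up token probabilities:

```python
def nucleus(probs, top_p):
    """Return the indices of the smallest set of tokens whose cumulative
    probability reaches top_p (nucleus sampling keeps only these)."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.3, 0.15, 0.05]   # hypothetical next-token probabilities
nucleus(probs, 1.0)   # keeps the full vocabulary (the default)
nucleus(probs, 0.5)   # keeps only the single most likely token
```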
🎯 Use Cases & Example Prompts

Below are ready-to-use system prompts for common tasks. Copy them into the System Prompt field to get started quickly.

General Assistant

You are a helpful, harmless, and honest AI assistant. You provide accurate information and admit when you're uncertain.

Code Review

You are an expert code reviewer. Analyze code for:
- Bugs and potential issues
- Performance optimizations
- Best practices and patterns
- Security vulnerabilities
Provide specific, actionable feedback with code examples.

Creative Writing Coach

You are a creative writing coach. Help develop stories with:
- Compelling characters and motivations
- Engaging plot structure
- Vivid, sensory descriptions
- Natural, distinctive dialogue
Maintain consistent tone and style throughout.

Data Analyst

You are a data analyst. When given data or questions about data:
- Identify trends and patterns
- Suggest appropriate visualizations
- Explain statistical significance plainly
- Provide actionable insights
Use clear, non-technical language unless asked otherwise.

Sample Duel prompts — try these in the ⚔️ Duel tab to compare models:

  • Consistency test: "Explain quantum computing in simple terms." — reveals explanation style differences.
  • Creativity test: "Write a haiku about the feeling of debugging code at 2am."
  • Reasoning test: "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?"
  • Code test: "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."
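For judging duel outputs: in the reasoning test the ball costs $0.05 (not the intuitive $0.10), and a correct answer to the code test should behave like this reference sketch:

```python
def sieve(n):
    """Return all primes up to and including n (Sieve of Eratosthenes)."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p starting at p*p as composite.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i in range(n + 1) if is_prime[i]]

sieve(30)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```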

Recommended models by task

| Task | Recommended Models |
| --- | --- |
| Creative Writing | GPT-4o, Claude Sonnet, Command R+ |
| Code Generation | GPT-4o, Claude Sonnet, Codestral |
| Fast Responses | GPT-4o Mini, Claude Haiku, Llama 3.1 8B (Groq) |
| Long Context | Claude 3 (200k), GPT-4 Turbo (128k) |
| Reasoning | o1, o3-mini, DeepSeek Reasoner |
| Cost-Effective | Claude Haiku, GPT-4o Mini, Mixtral (Groq free tier) |
| Local / Private | Ollama (llama3.2, mistral, phi3) |
Tips & Best Practices
💰 Cost
Start with small, cheap models (Haiku, GPT-4o Mini, Groq free tier) to prototype. Move to larger models only when needed. Lower Max Tokens to avoid paying for unused output.
🔑 API Keys
Never share your keys. Use the export feature with a strong password to back up your configuration. Rotate keys periodically and monitor usage dashboards for anomalies.
✍️ Prompt Writing
Be specific. Shorter prompts cost less but may need more iterations. Add constraints ("in under 100 words", "JSON only") to get consistent output. Use the Optimize button to refine system prompts.
⚔️ Dueling
Use Duel to compare a premium vs. budget model for your exact task. You may find the cheaper model is "good enough" — saving cost without sacrificing quality.
📊 History
Add comments to history entries immediately while context is fresh. Use "Load in Test" to continue refining a prompt from any past run.
💾 Backups
Export your config regularly. Use a strong, unique password and store it in a password manager — not next to the export file. Sign in with Google to enable automatic cloud sync.
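Why the password matters so much: password-protected exports are typically secured by stretching the password into an encryption key with a scheme like PBKDF2, so the password is the only thing standing between an attacker and your keys. An illustrative sketch of key stretching (not necessarily this app's exact scheme):

```python
import hashlib
import os

def derive_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Stretch a password into a 32-byte key with PBKDF2-HMAC-SHA256.
    The high iteration count makes brute-forcing an exported file slow,
    which is why a strong, unique password matters."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)

salt = os.urandom(16)   # random salt, stored alongside the encrypted file
key = derive_key("correct horse battery staple", salt)
# len(key) == 32; the same password and salt always yield the same key
```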
🔧 Troubleshooting
| Error | Likely Cause | Fix |
| --- | --- | --- |
| API Error / 401 | Invalid or expired API key | Check the key in Configurations; regenerate it on the provider dashboard. |
| Rate Limit Exceeded | Too many requests in a short window | Wait 60 seconds, or switch to a different provider temporarily. |
| Invalid Model | Model name misspelled or deprecated | Check the exact model name in the provider docs; use autocomplete suggestions when adding models. |
| No output / empty response | Max Tokens too low, or the model returned nothing | Increase Max Tokens; check whether the system prompt contradicts the user message. |
| Decryption Failed | Wrong password or corrupted file | Re-enter the password carefully; copy/paste it from a password manager. |
| Provider not in Test dropdown | Provider added but page not refreshed | Switch away from Configurations and back (dropdowns update automatically); ensure the provider has at least one model added. |
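For rate-limit errors specifically, a common pattern when scripting against these APIs is to retry with exponential backoff rather than waiting manually. A minimal sketch, where RuntimeError stands in for whatever exception your HTTP client raises on a 429:

```python
import time

def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a callable on RuntimeError, doubling the wait between
    attempts: 1s, 2s, 4s, ... Re-raises after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Demo: a stand-in call that fails twice before succeeding.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("429: rate limit exceeded")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)  # succeeds on the third try
```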
💬 Ollama (local models): Ensure Ollama is running locally (ollama serve) and that CORS is enabled for browser requests. Use the URL http://localhost:11434/v1. No API key is required.
⚠️ Disclaimer

The information in this Help section — including provider URLs, model names, pricing estimates, and capability descriptions — is provided to the best of our knowledge as of the published date below. AI providers frequently update their APIs, deprecate models, change pricing, and revise terms of service without notice.

Model pricing and availability may change at any time. Always verify costs directly with your provider's dashboard before running large-scale tests. Token counts shown in this app are informational; billing is determined solely by the provider.

LLM Test Bench is an independent tool and is not affiliated with, endorsed by, or officially supported by Anthropic, OpenAI, Google, Mistral AI, Groq, or any other model provider mentioned herein. All product names and trademarks are the property of their respective owners.

Help content published: April 2025  ·  Privacy Policy  ·  Terms of Service