Everything you need to understand prompt testing, configure providers, interpret results, and get the most from your LLM experiments.
Prompt testing is the practice of systematically evaluating how different large language models (LLMs) respond to the same or similar inputs. Just as software engineers write unit tests for code, AI engineers write and refine prompts to ensure models behave reliably across a range of inputs.
LLM Test Bench lets you configure multiple AI providers, run prompts against them, compare outputs side-by-side (Duel mode), and build a searchable history of every experiment — so your learning compounds over time.
The same question asked to different models — or even to the same model with a slightly different system prompt — can produce dramatically different results. Prompt testing helps you:
- Find the best model for your task — A coding assistant, a creative writer, and a factual Q&A bot all have different ideal models.
- Reduce costs — Smaller, cheaper models often perform just as well for simple tasks. Testing reveals where you can save.
- Improve reliability — A well-crafted system prompt that's been tested reduces hallucinations and keeps the model on-task.
- Document your findings — The History tab turns ephemeral experiments into a searchable knowledge base.
- Avoid vendor lock-in — By testing multiple providers, you understand trade-offs and can switch if pricing or availability changes.
Step 1 — Add a Provider
Go to Configurations and click + Add Provider. Type a provider name — suggestions from a built-in catalog will appear automatically. Selecting a known name (e.g. "Anthropic") auto-fills the API URL.
Step 2 — Add Models
In the Models field, type or pick from the suggestions and click Add. Star (☆) a model to mark it as the default — it will be pre-selected whenever you open the Test tab.
Step 3 — Run a Test
Go to Test, choose your provider and model, enter a System Prompt and User Message, then click ▶ Run Test. The output and token counts appear below and are saved automatically to History.
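Under the hood, a test run like this typically boils down to a single JSON request against the provider's chat endpoint. The sketch below assembles such a body in the OpenAI-compatible format that most providers in the table accept; the app's actual internals may differ, and the model name and prompts here are just placeholders.

```python
# Sketch of the request body a test run might send to an
# OpenAI-compatible /chat/completions endpoint. Field names follow
# the OpenAI convention; actual app internals may differ.

def build_test_request(system_prompt: str, user_message: str,
                       model: str, temperature: float = 0.7,
                       max_tokens: int = 1000) -> dict:
    """Assemble the JSON body for a single prompt test."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

body = build_test_request(
    "You are a concise assistant.",
    "Explain quantum computing in one sentence.",
    model="gpt-4o-mini",  # placeholder model name
)
print(body["messages"][0]["role"])  # system
```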
Popular Provider URLs
| Provider | API Base URL | Key Prefix |
|---|---|---|
| Anthropic | https://api.anthropic.com/v1 | sk-ant-… |
| OpenAI | https://api.openai.com/v1 | sk-proj-… |
| Google Gemini | https://generativelanguage.googleapis.com/v1beta | AIza… |
| Mistral | https://api.mistral.ai/v1 | — |
| Groq | https://api.groq.com/openai/v1 | gsk_… |
| DeepSeek | https://api.deepseek.com/v1 | — |
| Together AI | https://api.together.xyz/v1 | — |
| Perplexity | https://api.perplexity.ai | pplx-… |
| Ollama (local) | http://localhost:11434/v1 | n/a |
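The key prefixes in the table are handy for catching a pasted-in-the-wrong-field mistake early. A minimal sanity check might look like the sketch below; note that prefixes are conventions, not guarantees — only the provider can confirm a key is actually valid.

```python
# Quick sanity check that an API key matches the provider's usual
# prefix (from the table above). A matching prefix does NOT mean
# the key is valid -- it only catches obvious mix-ups.

KEY_PREFIXES = {
    "Anthropic": "sk-ant-",
    "OpenAI": "sk-proj-",
    "Google Gemini": "AIza",
    "Groq": "gsk_",
    "Perplexity": "pplx-",
}

def looks_like_provider_key(provider: str, key: str) -> bool:
    prefix = KEY_PREFIXES.get(provider)
    if prefix is None:   # no documented prefix (Mistral, Ollama, ...)
        return True      # nothing to check against
    return key.startswith(prefix)

print(looks_like_provider_key("Anthropic", "sk-ant-abc123"))  # True
print(looks_like_provider_key("OpenAI", "sk-ant-abc123"))     # False
```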
The three parameters in the Test tab control how the model generates text. They are collapsed by default — click ⚙ Parameters to expand them.
Temperature — Controls randomness
| Value | Behaviour | Good For |
|---|---|---|
| 0.0 | Fully deterministic | Math, factual Q&A, structured output |
| 0.3 | Mostly consistent | Code generation, summaries |
| 0.7 | Balanced (default) | General purpose, conversation |
| 1.0 | Creative & varied | Creative writing, brainstorming |
| 1.5 – 2.0 | Highly unpredictable | Experimental / artistic only |
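To build intuition for the table above: temperature divides the model's raw token scores (logits) before the softmax, so low values sharpen the probability distribution and high values flatten it. The toy sketch below uses three hypothetical token scores; real models do this over tens of thousands of tokens.

```python
import math

# Toy illustration of temperature: logits are divided by T before
# the softmax, so low T sharpens the distribution (one token
# dominates) and high T flattens it (choices even out).

def softmax_with_temperature(logits, t):
    scaled = [x / t for x in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                     # hypothetical token scores
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
print(round(cold[0], 3))  # top token dominates at low temperature
print(round(hot[0], 3))   # probabilities even out at high temperature
```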
Max Tokens — Maximum output length
| Value | Typical Output |
|---|---|
| 100 | A sentence or two |
| 500 | A short paragraph |
| 1 000 | Detailed explanation (default) |
| 2 000 | Article or documentation |
| 4 000+ | Long-form content |
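When sizing Max Tokens, a common rule of thumb for English text is roughly four characters (about 0.75 words) per token. The helper below encodes that heuristic; it is only an estimate — actual counts vary by model and tokenizer, and the app's displayed counts come from the provider.

```python
# Rough Max Tokens sizing heuristic: ~4 characters per token for
# English text. An estimate only -- tokenizers differ by model.

def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

paragraph = "A short paragraph of around five hundred characters " * 10
print(estimate_tokens(paragraph))  # 130 tokens for 520 characters
```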
Top P (Nucleus Sampling) — Vocabulary diversity
Top P limits sampling to the smallest set of tokens whose cumulative probability reaches P. As a rule of thumb, tune Temperature or Top P, not both at once.
| Value | Behaviour | Good For |
|---|---|---|
| 0.1 | Only the most likely tokens | Factual, deterministic output |
| 0.5 | Moderately focused | Summaries, structured text |
| 0.9 | Broad vocabulary (common default) | General purpose |
| 1.0 | No cut-off — full vocabulary | Maximum variety |
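Nucleus sampling itself is simple to sketch: keep the smallest set of tokens whose cumulative probability reaches P, renormalise, and sample only from that set. The probabilities below are hypothetical, chosen just to show the cut-off.

```python
# Minimal sketch of nucleus (top-p) sampling: keep the smallest set
# of tokens whose cumulative probability reaches p, then renormalise.
# Token probabilities here are hypothetical.

def nucleus(probs: dict, p: float) -> dict:
    """Return the renormalised top-p nucleus of token -> probability."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        total += prob
        if total >= p:      # nucleus is large enough -- stop here
            break
    return {t: pr / total for t, pr in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(sorted(nucleus(probs, 0.9)))  # ['a', 'cat', 'the'] -- 'zebra' cut
```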
Below are ready-to-use system prompts for common tasks. Copy them into the System Prompt field to get started quickly.
General Assistant
"You are a helpful, accurate assistant. Answer concisely, and say so explicitly when you are unsure."
Code Review
"You are a senior software engineer reviewing code. Point out bugs, security issues, and style problems, and suggest concrete fixes."
Creative Writing Coach
"You are an encouraging creative writing coach. Give specific, constructive feedback on style, pacing, and imagery."
Data Analyst
"You are a careful data analyst. Explain your reasoning step by step and state any assumptions you make about the data."
Sample Duel prompts — try these in the ⚔️ Duel tab to compare models:
- Consistency test: "Explain quantum computing in simple terms." — reveals explanation style differences.
- Creativity test: "Write a haiku about the feeling of debugging code at 2am."
- Reasoning test: "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?"
- Code test: "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."
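For the code test above, it helps to grade model outputs against a known-good answer. One standard implementation of the sieve looks like this:

```python
def sieve_of_eratosthenes(n: int) -> list:
    """Return all primes up to and including n."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # Mark multiples of i starting at i*i (smaller ones
            # were already marked by smaller primes).
            for multiple in range(i * i, n + 1, i):
                is_prime[multiple] = False
    return [i for i, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(30))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```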
Recommended models by task
| Task | Recommended Models |
|---|---|
| Creative Writing | GPT-4o, Claude Sonnet, Command R+ |
| Code Generation | GPT-4o, Claude Sonnet, Codestral |
| Fast Responses | GPT-4o Mini, Claude Haiku, Llama 3.1 8B (Groq) |
| Long Context | Claude 3 (200k), GPT-4 Turbo (128k) |
| Reasoning | o1, o3-mini, DeepSeek Reasoner |
| Cost-Effective | Claude Haiku, GPT-4o Mini, Mixtral (Groq free tier) |
| Local / Private | Ollama (llama3.2, mistral, phi3) |
| Error | Likely Cause | Fix |
|---|---|---|
| API Error / 401 | Invalid or expired API key | Check key in Configurations. Regenerate on provider dashboard. |
| Rate Limit Exceeded | Too many requests in a short window | Wait 60 seconds, or switch to a different provider temporarily. |
| Invalid Model | Model name misspelled or deprecated | Check exact model name on provider docs. Use autocomplete suggestions when adding models. |
| No output / empty response | Max Tokens too low, or model returned nothing | Increase Max Tokens. Check if the system prompt contradicts the user message. |
| Decryption Failed | Wrong password or corrupted file | Re-enter password carefully. Use copy/paste from password manager. |
| Provider not in Test dropdown | Provider added but page not refreshed | Switch away from Configurations and back — dropdowns update automatically. Ensure the provider has at least one model added. |
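For the rate-limit row above, the usual programmatic remedy is exponential backoff: wait 1s, then 2s, then 4s between retries. A sketch, assuming an HTTP-style (status, body) return value — the call function and retry budget here are illustrative, not part of the app:

```python
import time

# Exponential-backoff sketch for 429 (rate limit) responses:
# retry with delays of base*1, base*2, base*4, ... seconds.
# call_fn is a stand-in returning (status_code, body).

def with_backoff(call_fn, max_retries: int = 4, base: float = 1.0):
    for attempt in range(max_retries):
        status, body = call_fn()
        if status != 429:                 # anything but a rate limit
            return status, body
        time.sleep(base * 2 ** attempt)   # 1s, 2s, 4s, ... (scaled)
    return status, body                   # give up after the budget

# Fake endpoint that rate-limits twice, then succeeds.
calls = {"n": 0}
def fake_call():
    calls["n"] += 1
    return (429, "slow down") if calls["n"] < 3 else (200, "ok")

print(with_backoff(fake_call, base=0.01))  # (200, 'ok') after 2 retries
```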
For local models, make sure the Ollama server is running (ollama serve) and CORS is enabled for browser requests.
Use URL http://localhost:11434/v1. No API key is required.
The information in this Help section — including provider URLs, model names, pricing estimates, and capability descriptions — is provided to the best of our knowledge as of the published date below. AI providers frequently update their APIs, deprecate models, change pricing, and revise terms of service without notice.
Model pricing and availability may change at any time. Always verify costs directly with your provider's dashboard before running large-scale tests. Token counts shown in this app are informational; billing is determined solely by the provider.
LLM Test Bench is an independent tool and is not affiliated with, endorsed by, or officially supported by Anthropic, OpenAI, Google, Mistral AI, Groq, or any other model provider mentioned herein. All product names and trademarks are the property of their respective owners.
Help content published: April 2025 · Privacy Policy · Terms of Service