📖 LLM Test Bench — Help & Guide

Everything you need to understand prompt testing, configure providers, interpret results, and get the most from your LLM experiments.

🔬 What is Prompt Testing?

Prompt testing is the practice of systematically evaluating how different large language models (LLMs) respond to the same or similar inputs. Just as software engineers write unit tests for code, AI engineers write and refine prompts to ensure models behave reliably across a range of inputs.

LLM Test Bench lets you configure multiple AI providers, run prompts against them, compare outputs side-by-side (Duel mode), and build a searchable history of every experiment — so your learning compounds over time.

  • 🧪 Test Tab — Run a single prompt against one model. See output, token usage, and save to history.
  • ⚔️ Duel Tab — Send the same prompt to two models simultaneously. Compare outputs side-by-side.
  • ⚙️ Configurations — Store API keys, URLs, and model lists for each provider. Export encrypted backups.
  • 📊 History — Every test run is saved. Add comments, reload into Test, or export as JSON.
💡 Why Prompt Testing Matters

The same question asked to different models — or even to the same model with a slightly different system prompt — can produce dramatically different results. Prompt testing helps you:

  • Find the best model for your task — A coding assistant, a creative writer, and a factual Q&A bot all have different ideal models.
  • Reduce costs — Smaller, cheaper models often perform just as well for simple tasks. Testing reveals where you can save.
  • Improve reliability — A well-crafted system prompt that's been tested reduces hallucinations and keeps the model on-task.
  • Document your findings — The History tab turns ephemeral experiments into a searchable knowledge base.
  • Avoid vendor lock-in — By testing multiple providers, you understand trade-offs and can switch if pricing or availability changes.
💬 Rule of thumb: Start with a cheaper, faster model (e.g. Claude Haiku, GPT-4o Mini) to prototype your prompt. Only move to a larger model once the task is well-defined.
🚀 Getting Started

Step 1 — Add a Provider

Go to Configurations and click + Add Provider. Type a provider name — suggestions from a built-in catalog will appear automatically. Selecting a known name (e.g. "Anthropic") auto-fills the API URL.

Step 2 — Add Models

In the Models field, type or pick from the suggestions and click Add. Star (☆) a model to mark it as the default — it will be pre-selected whenever you open the Test tab.

Step 3 — Run a Test

Go to Test, choose your provider and model, enter a System Prompt and User Message, then click ▶ Run Test. The output and token counts appear below and are saved automatically to History.

⚠️ API keys are required. Your key is sent directly to the chosen provider for each request. It is never logged or stored beyond the request. In Local Mode (no sign-in), your keys live only in your browser's localStorage.

Popular Provider URLs

| Provider | API Base URL | Key Prefix |
| --- | --- | --- |
| Anthropic | https://api.anthropic.com/v1 | sk-ant-… |
| OpenAI | https://api.openai.com/v1 | sk-proj-… |
| Google | https://generativelanguage.googleapis.com/v1beta | AIza… |
| Mistral | https://api.mistral.ai/v1 | |
| Groq | https://api.groq.com/openai/v1 | gsk_… |
| DeepSeek | https://api.deepseek.com/v1 | |
| Together AI | https://api.together.xyz/v1 | |
| Perplexity | https://api.perplexity.ai | pplx-… |
| Ollama (local) | http://localhost:11434/v1 | n/a |
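Most of the providers above (all but Anthropic and Google, which use their own request schemas) expose an OpenAI-compatible /chat/completions endpoint. As an illustrative sketch of what such a request looks like — the key and model name below are placeholders, not real values:

```python
import json

def build_chat_request(base_url, api_key, model, system_prompt, user_message,
                       temperature=0.7, max_tokens=1000):
    """Assemble the URL, headers, and JSON body for an OpenAI-compatible
    /chat/completions call (OpenAI, Groq, Mistral, DeepSeek, Together AI,
    and Ollama all accept this shape)."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    body = {"model": model,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "temperature": temperature,
            "max_tokens": max_tokens}
    return url, headers, json.dumps(body)

# Placeholder key and model name, for illustration only:
url, headers, body = build_chat_request(
    "https://api.groq.com/openai/v1", "gsk_...", "llama-3.1-8b-instant",
    "You are a helpful assistant.", "Say hello.")
# url == "https://api.groq.com/openai/v1/chat/completions"
```

The exact payload the app sends may differ in detail; this is only the general shape of the protocol.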
⚙️ Understanding Parameters

The three parameters in the Test tab control how the model generates text. They are collapsed by default — click ⚙ Parameters to expand them.

Temperature — Controls randomness

| Value | Behaviour | Good For |
| --- | --- | --- |
| 0.0 | Fully deterministic | Math, factual Q&A, structured output |
| 0.3 | Mostly consistent | Code generation, summaries |
| 0.7 | Balanced (default) | General purpose, conversation |
| 1.0 | Creative & varied | Creative writing, brainstorming |
| 1.5 – 2.0 | Highly unpredictable | Experimental / artistic only |
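What temperature actually does can be seen in miniature: the model's raw next-token scores are divided by the temperature before being turned into probabilities, so low values sharpen the distribution toward the top token. A sketch with made-up scores (not from any real model):

```python
import math

def scaled_probs(logits, temperature):
    """Softmax over logits divided by temperature. Lower temperature
    concentrates probability on the highest-scoring token."""
    t = max(temperature, 1e-6)           # guard against division by zero at t = 0
    exps = [math.exp(l / t) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # hypothetical next-token scores
low  = scaled_probs(logits, 0.2)         # near-deterministic: top token dominates
high = scaled_probs(logits, 1.5)         # flatter: more randomness in sampling
```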

Max Tokens — Maximum output length

| Value | Typical Output |
| --- | --- |
| 100 | A sentence or two |
| 500 | A short paragraph |
| 1,000 | Detailed explanation (default) |
| 2,000 | Article or documentation |
| 4,000+ | Long-form content |

Top P (Nucleus Sampling) — Vocabulary diversity

📌 Keep Top P at 1.0 (default) unless you are specifically experimenting. Lower values (e.g. 0.1) make the model very conservative; it only picks from the most probable next words. Temperature and Top P interact — avoid setting both to extreme values at the same time.
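The mechanism behind Top P is easy to sketch: rank tokens by probability and keep the smallest set whose cumulative probability reaches P; the model samples only from that set. Using made-up token probabilities:

```python
def nucleus(probs, top_p):
    """Return the indices of the smallest set of tokens whose cumulative
    probability reaches top_p (nucleus sampling keeps only these)."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.3, 0.15, 0.05]   # hypothetical next-token probabilities
nucleus(probs, 1.0)   # keeps the full vocabulary (the default)
nucleus(probs, 0.5)   # keeps only the single most likely token
```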
🎯 Use Cases & Example Prompts

Below are ready-to-use system prompts for common tasks. Copy them into the System Prompt field to get started quickly.

General Assistant

You are a helpful, harmless, and honest AI assistant. You provide accurate information and admit when you're uncertain.

Code Review

You are an expert code reviewer. Analyze code for:
- Bugs and potential issues
- Performance optimizations
- Best practices and patterns
- Security vulnerabilities
Provide specific, actionable feedback with code examples.

Creative Writing Coach

You are a creative writing coach. Help develop stories with:
- Compelling characters and motivations
- Engaging plot structure
- Vivid, sensory descriptions
- Natural, distinctive dialogue
Maintain consistent tone and style throughout.

Data Analyst

You are a data analyst. When given data or questions about data:
- Identify trends and patterns
- Suggest appropriate visualizations
- Explain statistical significance plainly
- Provide actionable insights
Use clear, non-technical language unless asked otherwise.

Sample Duel prompts — try these in the ⚔️ Duel tab to compare models:

  • Consistency test: "Explain quantum computing in simple terms." — reveals explanation style differences.
  • Creativity test: "Write a haiku about the feeling of debugging code at 2am."
  • Reasoning test: "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?"
  • Code test: "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."
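For judging duel outputs: in the reasoning test the ball costs $0.05 (not the intuitive $0.10), and a correct answer to the code test should behave like this reference sketch:

```python
def sieve(n):
    """Return all primes up to and including n (Sieve of Eratosthenes)."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p starting at p*p as composite.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i in range(n + 1) if is_prime[i]]

sieve(30)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```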

Recommended models by task

| Task | Recommended Models |
| --- | --- |
| Creative Writing | GPT-4o, Claude Sonnet, Command R+ |
| Code Generation | GPT-4o, Claude Sonnet, Codestral |
| Fast Responses | GPT-4o Mini, Claude Haiku, Llama 3.1 8B (Groq) |
| Long Context | Claude 3 (200k), GPT-4 Turbo (128k) |
| Reasoning | o1, o3-mini, DeepSeek Reasoner |
| Cost-Effective | Claude Haiku, GPT-4o Mini, Mixtral (Groq free tier) |
| Local / Private | Ollama (llama3.2, mistral, phi3) |
Tips & Best Practices
💰 Cost
Start with small, cheap models (Haiku, GPT-4o Mini, Groq free tier) to prototype. Move to larger models only when needed. Lower Max Tokens to avoid paying for unused output.
🔑 API Keys
Never share your keys. Use the export feature with a strong password to back up your configuration. Rotate keys periodically and monitor usage dashboards for anomalies.
✍️ Prompt Writing
Be specific. Shorter prompts cost less but may need more iterations. Add constraints ("in under 100 words", "JSON only") to get consistent output. Use the Optimize button to refine system prompts.
⚔️ Dueling
Use Duel to compare a premium vs. budget model for your exact task. You may find the cheaper model is "good enough" — saving cost without sacrificing quality.
📊 History
Add comments to history entries immediately while context is fresh. Use "Load in Test" to continue refining a prompt from any past run.
💾 Backups
Export your config regularly. Use a strong, unique password and store it in a password manager — not next to the export file. Sign in with Google to enable automatic cloud sync.
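Why the password matters so much: password-protected exports are typically secured by stretching the password into an encryption key with a scheme like PBKDF2, so the password is the only thing standing between an attacker and your keys. An illustrative sketch of key stretching (not necessarily this app's exact scheme):

```python
import hashlib
import os

def derive_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Stretch a password into a 32-byte key with PBKDF2-HMAC-SHA256.
    The high iteration count makes brute-forcing an exported file slow,
    which is why a strong, unique password matters."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)

salt = os.urandom(16)   # random salt, stored alongside the encrypted file
key = derive_key("correct horse battery staple", salt)
# len(key) == 32; the same password and salt always yield the same key
```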
🔧 Troubleshooting
| Error | Likely Cause | Fix |
| --- | --- | --- |
| API Error / 401 | Invalid or expired API key | Check the key in Configurations; regenerate it on the provider dashboard. |
| Rate Limit Exceeded | Too many requests in a short window | Wait 60 seconds, or switch to a different provider temporarily. |
| Invalid Model | Model name misspelled or deprecated | Check the exact model name in the provider docs; use autocomplete suggestions when adding models. |
| No output / empty response | Max Tokens too low, or the model returned nothing | Increase Max Tokens; check whether the system prompt contradicts the user message. |
| Decryption Failed | Wrong password or corrupted file | Re-enter the password carefully; copy/paste it from a password manager. |
| Provider not in Test dropdown | Provider added but page not refreshed | Switch away from Configurations and back (dropdowns update automatically); ensure the provider has at least one model added. |
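For rate-limit errors specifically, a common pattern when scripting against these APIs is to retry with exponential backoff rather than waiting manually. A minimal sketch, where RuntimeError stands in for whatever exception your HTTP client raises on a 429:

```python
import time

def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a callable on RuntimeError, doubling the wait between
    attempts: 1s, 2s, 4s, ... Re-raises after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Demo: a stand-in call that fails twice before succeeding.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("429: rate limit exceeded")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)  # succeeds on the third try
```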
💬 Ollama (local models): Ensure Ollama is running locally (ollama serve) and that CORS is enabled for browser requests. Use the URL http://localhost:11434/v1. No API key is required.
⚠️ Disclaimer

The information in this Help section — including provider URLs, model names, pricing estimates, and capability descriptions — is provided to the best of our knowledge as of the published date below. AI providers frequently update their APIs, deprecate models, change pricing, and revise terms of service without notice.

Model pricing and availability may change at any time. Always verify costs directly with your provider's dashboard before running large-scale tests. Token counts shown in this app are informational; billing is determined solely by the provider.

LLM Test Bench is an independent tool and is not affiliated with, endorsed by, or officially supported by Anthropic, OpenAI, Google, Mistral AI, Groq, or any other model provider mentioned herein. All product names and trademarks are the property of their respective owners.

Help content published: April 2025  ·  Privacy Policy  ·  Terms of Service