Production Systems
Evals for Tool-Using Agents: Regression Tests Beyond Prompts
If your eval can't fail when a tool breaks, it's not an eval. Here's how to test tool selection, arguments, side effects, and safety, without flaky 'vibe checks.'
1 post tagged “testing”