# LLM Evaluation Berkeley Function Calling Leaderboard

Status: public
Confidence: medium (0.79) (verified)
Last verified: 2026-06-03
Generation: ai_structured

## TL;DR

BFCL-style evaluation is useful when agents need evidence about whether a model can select tools and produce executable function-call arguments.

## Core Explanation

General chat benchmarks do not prove tool-use reliability. Function-calling evaluation should inspect the call name, argument structure, schema adherence, execution outcome, multi-call behavior, and whether the model handles missing or ambiguous tool evidence.

Agents should keep tool schemas, prompts, model outputs, parsed calls, execution traces, invalid-call labels, and category metadata separate.

Passing a function-calling benchmark is not proof that a production tool workflow is safe, but it is useful evidence for tool-selection and argument-generation regressions.

## Source-Mapped Facts

- The Berkeley Function Calling Leaderboard README describes BFCL as an executable function-call evaluation for assessing LLMs' ability to invoke functions. ([source](https://raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/README.md))
- The BFCL README says BFCL accounts for various forms of function calls, diverse scenarios, and executability. ([source](https://raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/README.md))
- The Berkeley Function Calling Leaderboard blog describes function calling as also being called tool calling. ([source](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html))

## Further Reading

- [Berkeley Function Calling Leaderboard README](https://raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/README.md)
- [Berkeley Function Calling Leaderboard Blog](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)
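
## Example Sketch

The static checks named in the Core Explanation (call name, argument structure, schema adherence) can be sketched as a small validator that turns a parsed call into a list of invalid-call labels. This is a minimal illustration, not BFCL's own checker: the `get_weather` tool, the label strings, and the `check_call` helper are all hypothetical, and the schema shape is only assumed to be JSON-Schema-like. Execution outcome and multi-call behavior are out of scope here.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical tool schema in a JSON-Schema-like shape (an assumption, not from BFCL).
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    },
}

# Mapping from JSON-Schema type names to Python types used for the check.
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


@dataclass
class CallCheck:
    """Invalid-call labels for one parsed call, kept apart from the raw output."""
    labels: list[str] = field(default_factory=list)

    @property
    def valid(self) -> bool:
        return not self.labels


def check_call(call: dict[str, Any], tools: list[dict[str, Any]]) -> CallCheck:
    """Check one parsed call against the available tool schemas (static checks only)."""
    result = CallCheck()

    # 1. Call name: the model must pick a tool that actually exists.
    schema = next((t for t in tools if t["name"] == call.get("name")), None)
    if schema is None:
        result.labels.append("unknown_function")
        return result

    # 2. Argument structure: arguments must be a JSON object.
    args = call.get("arguments")
    if not isinstance(args, dict):
        result.labels.append("arguments_not_object")
        return result

    params = schema["parameters"]
    props = params.get("properties", {})

    # 3. Schema adherence: required arguments, unknown arguments, value types.
    for name in params.get("required", []):
        if name not in args:
            result.labels.append(f"missing_required:{name}")

    for name, value in args.items():
        if name not in props:
            result.labels.append(f"unexpected_argument:{name}")
            continue
        type_name = props[name].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if not isinstance(value, expected) or is_bool_as_number:
            result.labels.append(f"wrong_type:{name}")

    return result
```

Returning labels rather than a single pass/fail flag keeps the invalid-call labels separate from the model output and the parsed call, so regressions in tool selection and argument generation can be counted per category.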