Introducing OfficeQA: A Benchmark for Finish-to-Finish Grounded Reasoning

December 9, 2025

58

There are a number of benchmarks that probe the frontier of agent capabilities (GDPval, Humanity’s Final Examination (HLE), ARC-AGI-2), however we don’t discover them consultant of the sorts of duties which might be vital to our prospects. To fill this hole, we have created and are open-sourcing OfficeQA—a benchmark that proxies for economically priceless duties carried out by Databricks’ enterprise prospects. We concentrate on a quite common but difficult enterprise process: Grounded Reasoning, which entails answering questions based mostly on advanced proprietary datasets that embrace unstructured paperwork and tabular knowledge.

Regardless of frontier fashions performing effectively on Olympiad-style questions, we discover they nonetheless wrestle on these economically vital duties. With out entry to the corpus, they reply ~2% of questions accurately. When supplied with a corpus of PDF paperwork, brokers carry out at <45% accuracy throughout all questions and <25% on a subset of the toughest questions.

Preview of efficiency of AI brokers on OfficeQA-All (246 examples) and OfficeQA-Laborious (a subset of 113 examples), together with a Claude Opus 4.5 Agent with default pondering (excessive) constructed with Claude’s Agent SDK and the OpenAI File Search & Retrieval API utilizing GPT-5.1 with reasoning_effort = excessive.

On this put up, we first describe OfficeQA and our design rules. We then consider present AI agent options — together with a GPT-5.1 Agent utilizing OpenAI’s File Search & Retrieval API and a Claude Opus 4.5 Agent utilizing Claude’s Agent SDK — on the benchmark. We experiment with utilizing Databricks’ ai_parse_document to parse OfficeQA’s corpus of PDFs, and discover that this delivers vital beneficial properties. Even with these enhancements, we discover that every one methods nonetheless fall in need of 70% accuracy on the complete benchmark and solely attain round 40% accuracy on the toughest break up, indicating substantial room for enchancment on this process. Lastly, we announce the Databricks Grounded Reasoning Cup, a contest in Spring 2026 the place AI brokers will compete towards human groups to drive innovation on this house.

Dataset Desiderata

We had a number of key objectives in constructing OfficeQA. First, questions needs to be difficult as a result of they require cautious work—precision, diligence, and time—not as a result of they demand PhD-level experience. Second, every query will need to have a single, clearly right reply that may be checked robotically towards floor reality, so methods may be skilled and evaluated with none human or LLM judging. Lastly and most significantly, the benchmark ought to precisely replicate frequent issues that enterprise prospects face.

We distilled frequent enterprise issues into three important elements:

Doc complexity: Enterprises have massive collections of supply supplies—similar to scans, PDFs, or images—that always include substantial numerical or tabular knowledge.
Info retrieval and aggregation: They should effectively search, extract, and mix data throughout many such paperwork.
Analytical reasoning and query answering: They require methods able to answering questions and performing analyses grounded in these paperwork, typically involving calculations or exterior information.

We additionally be aware that many enterprises demand extraordinarily excessive precision when performing these duties. Shut just isn’t adequate. Being off by one on a product or bill quantity can have catastrophic downstream outcomes. Forecasting income and being off by 5% can result in dramatically incorrect enterprise selections.

Present benchmarks don’t meet our wants:
		Instance
GDPVal	Duties are clear examples of economically priceless duties, however most don’t particularly take a look at for issues our prospects care about. Knowledgeable human judging is really helpful. This benchmark additionally supplies solely the set of paperwork wanted to reply every query immediately, which doesn’t permit for analysis of agent retrieval capabilities over a big corpus.	“You’re a Music Producer in Los Angeles in 2024. You’re employed by a consumer to create an instrumental observe for a music video for a track referred to as ‘Deja Vu’”
ARC-AGI-2	Duties are so summary as to be divorced from the connection to actual world economically priceless duties – they contain summary visible manipulation of coloured grids. Very small, specialised fashions are able to matching the efficiency of far bigger (1000x) basic objective LLMs.
Humanity’s Final Examination (HLE)	Not clearly consultant of most economically priceless work, and definitely not consultant of the workloads of Databricks’ prospects. Questions require PhD-level experience and no single human is probably going in a position to reply all of the questions.	“Compute the diminished twelfth dimensional Spin bordism of the classifying house of the Lie group G2. “Lowered” means that you could ignore any bordism courses that may be represented by manifolds with trivial principal G2 bundle.”

Introducing the OfficeQA Benchmark

We introduce OfficeQA, a dataset approximating proprietary enterprise corpora, however freely accessible and supporting a wide range of numerous and fascinating questions. We leverage the U.S. Treasury Bulletins to create this benchmark, traditionally printed month-to-month for 5 a long time starting in 1939 and quarterly thereafter. Every bulletin is 100-200 pages lengthy and consists of prose, many advanced tables, charts and figures describing the operations of the U.S. Treasury – the place cash got here from, the place it’s, the place it went and the way it financed operations. The overall dataset includes ~89,000 pages. Till 1996, the bulletins have been scans of bodily paperwork and afterwards, digitally produced PDFs.

We additionally see worth in making this historic Treasury knowledge extra accessible to the general public, researchers, and lecturers. USAFacts is a company that naturally shares this imaginative and prescient, on condition that its core mission is “to make authorities knowledge simpler to entry and perceive.” They partnered with us to develop this benchmark, figuring out the Treasury Bulletins as a perfect dataset and guaranteeing our questions mirrored reasonable use instances for these paperwork.

In step with our purpose that the questions needs to be answerable by non-expert people, not one of the questions require greater than highschool math operations. We do count on most people would wish to lookup among the monetary or statistical phrases by way of the online.

Dataset Overview

OfficeQA consists of 246 questions organized into two problem ranges – straightforward and exhausting – based mostly on the efficiency of present AI methods on the questions. “Simple” questions are outlined as questions that each of the frontier agent methods (detailed beneath) acquired right, and “Laborious” questions are questions that a minimum of one of many brokers answered incorrectly.

The questions on common require data from ~2 completely different Treasury Bulletin paperwork. Throughout a consultant pattern of the benchmark, human solvers averaged a completion time of fifty minutes per query. Nearly all of this time was spent finding the data required to reply the query throughout quite a few tables and figures inside the corpus.

Dataset overview

To make sure the questions in OfficeQA required document-grounded retrieval, we made greatest effort to filter out any questions that LLMs might reply accurately with out entry to the supply paperwork (i.e., may very well be answered by way of a mannequin’s parametric information or net search). Most of those filtered questions tended to be less complicated, or ask about extra basic details, like “Within the fiscal yr that George H.W. Bush first grew to become president, which U.S federal belief fund had the most important enhance in funding?”

Apparently, there have been a couple of seemingly extra advanced questions that fashions have been in a position to reply with parametric information alone like “Conduct a two-sample t-test to find out whether or not the imply U.S Treasury bond rate of interest modified between 1942–1945 (earlier than the top of World Warfare II) and 1946–1949 (after the top of World Warfare II) on the 5% significance degree. What’s the calculated t-statistic, rounded to the closest hundredth?” On this case, the mannequin leverages historic monetary data that have been memorized throughout pre-training after which computes the ultimate worth accurately. Examples like these have been filtered from the ultimate benchmark.

Instance OfficeQA Questions

Simple: “What have been the whole expenditures (in hundreds of thousands of nominal {dollars}) for U.S nationwide protection within the calendar yr of 1940?”

This requires a primary worth look-up, and summing of the values for the months within the specified calendar yr in a single desk (highlighted in crimson). Be aware that the totals for prior years are for fiscal and never calendar years.

Budget Expenditures 1940

Laborious: “Predict the whole outlays of the US Division of Agriculture in 1999 utilizing annual knowledge from the years 1990-1998 (inclusive). Use a primary linear regression match to provide the slope and y-intercept. Deal with 1990 as yr “0” for the time variable. Carry out all calculations in nominal {dollars}. You do not want to take into consideration postyear changes. Report all values inside sq. brackets, separated by commas, with the primary worth because the slope rounded to the closest hundredth, the second worth because the y-intercept rounded to the closest entire quantity and the third worth as the expected worth rounded to the closest entire quantity.”

Table FFO-3

This requires discovering data whereas navigating throughout a number of paperwork (pictured above), and entails extra superior reasoning and statistical calculation with detailed answering pointers.

Baseline Brokers: Implementation and Efficiency

We consider the next baselines¹:

GPT-5.1 Agent with File Search: We use GPT-5.1, configured with reasoning_effort=excessive, by way of the OpenAI Responses API and provides it entry to instruments like file search and net search. The PDFs are uploaded to the OpenAI Vector Retailer, the place they’re robotically parsed and listed. We additionally experiment with offering the Vector Retailer with pre-parsed paperwork utilizing ai_parse_document.
Claude Opus 4.5 Agent: We use Claude’s Agent Python SDK with Claude Opus 4.5 as a backend (default pondering=excessive) and configure this agent with the SDK-offered autonomous capabilities like context administration and a built-in software ecosystem containing instruments like file search (learn, grep, glob, and so on.), net search, programming execution and different software functionalities. Because the Claude Agent SDK didn’t present its personal built-in parsing answer, we experimented with (1) offering the agent with the PDFs saved in an area folder sandbox and talent to put in PDF reader packages like pdftotext and pdfplumber, and (2) offering the agent with pre-parsed paperwork utilizing ai_parse_document.
LLM with Oracle PDF Web page(s): We consider Claude Opus 4.5 and GPT 5.1 by immediately offering the mannequin with the precise oracle PDF(s) web page(s) required for answering the query. This can be a non-agentic baseline that measures how effectively LLMs can carry out with the supply materials needed for reasoning and deriving the right response, representing an higher certain of efficiency assuming an oracle retrieval system.
LLM with Oracle Parsed PDF Web page(s): We additionally take a look at offering Claude Opus 4.5 and GPT-5.1 immediately with the pre-parsed Oracle PDF web page(s) required to reply the query, which have been parsed utilizing ai_parse_document.

For all experiments, we take away any present OCR layer from the U.S. Treasury Bulletin PDFs as a result of their low accuracy. This ensures truthful analysis of every agent’s means to extract and interpret data immediately from the scanned paperwork.

We plot the correctness of all of the brokers beneath on the y-axis whereas the x-axis is the allowable absolute relative error to be thought of right. For instance, if the reply to a query is ‘5.2 million’ and the agent solutions ‘5.1 million’ (1.9% off from the unique reply), the agent could be scored as right at something above a 1.9% allowable absolute relative error, and incorrect at something <1.9%.

Performance on OfficeQA — Common correctness of baselines on the complete OfficeQA benchmark throughout completely different ranges of allowable absolute relative error. Fashions / Brokers labeled “PDF” use the corpus of PDFs, whereas these labeled “Databricks Parse” use paperwork parsed by way of Databricks’ ai_parse_document as enter.

LLM with Oracle Web page(s)

Apparently, each Claude Opus 4.5 and GPT 5.1 carry out poorly even when supplied immediately with the oracle PDF web page(s) wanted for every query. Nonetheless, when these similar pages are preprocessed utilizing Databricks ai_parse_document, efficiency jumps considerably—by +4.0 and +32.4 share factors for Claude Opus 4.5 and GPT 5.1 respectively (representing +7.5% and +85.0% relative will increase).

With parsing, the best-performing mannequin (GPT-5.1) reaches roughly 70% accuracy. The remaining ~30% hole stems from a number of components: (1) these non-agent baselines lack entry to instruments like net search, which ~13% of questions require; (2) parsing and extraction errors from tables and charts happen; and (3) computational reasoning errors stay.

Agent Methods with Full Corpus

When supplied with the OfficeQA corpus immediately, each brokers reply over half of OfficeQA questions incorrectly – reaching a most efficiency of 43.3% at 0% allowable error. Offering brokers with paperwork parsed with Databricks ai_parse_document improves efficiency as soon as once more: the Claude 4.5 Opus Agent improves by +30.4 share factors and the GPT 5.1 Agent by +9.3 share factors (81.7% and 21.5% relative will increase, respectively).

Nonetheless, even the very best agent – Claude Agent with Claude Opus 4.5 – nonetheless achieves lower than 70% % correctness at 0% allowable error with parsed paperwork, underscoring the issue of those duties for frontier AI methods. Attaining this increased efficiency additionally requires increased latency and related value. On common, the Claude Agent takes ~5 minutes to reply every query, whereas the lower-scoring OpenAI agent takes ~3 minutes.

As anticipated, correctness scores progressively enhance when increased absolute relative errors are allowed. Such discrepancies come up from precision divergence, the place the brokers might use supply values which have slight variations that drift throughout cascading operations and produce small closing deviations within the closing reply. Errors embrace incorrect parsing (studying ‘508’ as ‘608’, for instance), misinterpretation of statistical values, or an agent’s incapacity to retrieve related and correct data from the corpus. As an illustration, an agent produces an incorrect but shut reply to the bottom reality for this query: “What’s the sum of every yr’s whole Public debt securities excellent held by US Authorities accounts, in nominal hundreds of thousands of {dollars} recorded on the finish of the fiscal years 2005 to 2009 inclusive, returned as a single worth?” The agent finally ends up retrieving data from the June 2010 bulletin, however the related and proper values are discovered within the September 2010 publication (upon reported revisions), leading to a distinction of 21 million {dollars} (0.01% off from the bottom reality).

One other instance that ends in a bigger distinction is inside this query, “Carry out a time sequence evaluation on the reported whole surplus/deficit values from calendar years 1989-2013, treating all values as nominal values in hundreds of thousands of US {dollars} after which match a cubic polynomial regression mannequin to estimate the anticipated surplus or deficit for calendar yr 2025 and report absolutely the distinction with the U.S. Treasury’s reported estimate rounded to the closest entire quantity in hundreds of thousands of {dollars}.”, an agent incorrectly retrieves the fiscal yr values as a substitute of the calendar yr values for 8 years, which modifications the enter sequence used for the cubic regression and results in a special 2025 prediction and absolute-difference outcome that’s off by $286,831 million (31.6% off from the bottom reality).

Failure Modes

Whereas creating OfficeQA, we noticed a number of frequent failure modes of present AI methods:

Parsing errors stay a elementary problem—advanced tables with nested column hierarchies, merged cells, and weird formatting usually lead to misaligned or incorrectly extracted values. For instance, we noticed instances the place column shifts throughout automated extraction prompted numerical values to be attributed to the unsuitable headers fully.
Reply ambiguity additionally poses difficulties: monetary paperwork just like the U.S. Treasury Bulletin are regularly revised and reissued, which means a number of reliable values might exist for a similar knowledge level relying on which publication date the agent references. Brokers usually cease looking as soon as they discover a believable reply, lacking essentially the most authoritative or up-to-date supply, regardless of being prompted to search out the newest values.
Visible understanding represents one other vital hole. Roughly 3% of OfficeQA questions reference charts, graphs, or figures that require visible reasoning. Present brokers regularly fail on these duties, as proven within the instance beneath.

U.S. Gross Saving Ratio, 1898-1990 — Instance of a visible understanding query that every one brokers at the moment fail: “On report web page 5 of the September 1990 US Treasury Month-to-month Bulletin, what number of native maxima are there on the road plots on that web page?” AI methods both fail to search out the web page with the graphic or fail to correctly rely the variety of native maxima.

These remaining failure modes showcase that analysis progress remains to be wanted earlier than AI brokers can deal with the complete spectrum of enterprise in-domain reasoning duties.

Databricks Grounded Reasoning Cup

We are going to pit AI Brokers towards groups of people in Spring 2026 to see who can obtain the very best outcomes on the OfficeQA benchmark.

Timing: We’re focusing on San Francisco for the principle occasion, seemingly between late March and late April. Actual dates might be launched shortly to those that join updates.
In-Particular person Finale: The highest groups might be invited to San Francisco for the ultimate competitors.

We’re at the moment opening an curiosity record. Go to the hyperlink to get notified as quickly because the official guidelines, dates, and prize swimming pools are introduced. (Coming quickly!)

Conclusion

The OfficeQA benchmark represents a major step towards evaluating AI brokers on economically priceless, real-world grounded reasoning duties. By grounding our benchmark within the U.S. Treasury Bulletins, a corpus of almost 89,000 pages spanning over eight a long time, now we have created a difficult testbed that requires brokers to parse advanced tables, retrieve data throughout many paperwork, and carry out analytical reasoning with excessive precision.

The OfficeQA benchmark is freely accessible to the analysis neighborhood and may be discovered right here. We encourage groups to discover OfficeQA and current options on the benchmark as a part of the Databricks Grounded Reasoning Cup.

Authors: Arnav Singhvi, Krista Opsahl-Ong, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen.

We’d wish to thank Dipendra Kumar Misra, Owen Oertell, Andrew Drozdov, Jonathan Chang, Simon Favreau-Lessard, Erik Lindgren, Pallavi Koppol, Veronica Lyu, in addition to SuperAnnotate and Turing for serving to to create the questions in OfficeQA.

Lastly, we’d additionally wish to thank USAFacts for his or her steerage in figuring out the U.S. Treasury Bulletins and offering suggestions to make sure questions have been topical and related.

¹We tried to guage the lately launched Gemini File Search Software API as a part of a consultant Gemini Agent baseline with Gemini 3. Nonetheless, about 30% of the PDFs and parsed PDFs within the OfficeQA corpus didn’t ingest, and the File Search Software is incompatible with the Google Search Software. Since this may restrict the agent from answering OfficeQA questions that want exterior information, we excluded this setup from our baseline analysis. We’ll revisit it as soon as ingestion works reliably so we will measure its efficiency precisely.

Introducing OfficeQA: A Benchmark for Finish-to-Finish Grounded Reasoning

Dataset Desiderata

Introducing the OfficeQA Benchmark

Dataset Overview

Instance OfficeQA Questions

Baseline Brokers: Implementation and Efficiency

Failure Modes

Databricks Grounded Reasoning Cup

Conclusion

Related Articles

AI’s courageous new world of technical debt

Is bovine colostrum actually ‘liquid gold’ for intestine well being? : NPR

June 1, 2026 – iRunFar

LEAVE A REPLY Cancel reply

Latest Articles

AI’s courageous new world of technical debt

Is bovine colostrum actually ‘liquid gold’ for intestine well being? : NPR

June 1, 2026 – iRunFar

White Sangria Recipe – Love and Lemons

10Beauty raises US$23.5m to speed up robotic manicure enlargement