In July 2025, the US FDA publicly released an initial batch of 200+ Complete Response Letters (CRLs), decision letters explaining why drug and biologic applications were not approved on the first pass, marking a major transparency shift. For the first time, sponsors, clinicians, and data teams can analyze the industry through the agency's own language about deficiencies across clinical, CMC, safety, labeling, and bioequivalence, via centralized, downloadable openFDA PDFs.
As the FDA continues to release new CRLs, the ability to quickly generate insight from this and other unstructured data, and fold it into internal intelligence and knowledge, becomes a major competitive advantage. Organizations that can effectively harness unstructured data, in the form of PDFs, documents, images, and beyond, can de-risk their own submissions, identify common pitfalls, and ultimately accelerate their path to market. The challenge is that this data, like much other regulatory data, is locked in PDFs, which are notoriously difficult to process at scale.
This is precisely the type of challenge Databricks was built to solve. This blog demonstrates how to use Databricks' latest AI tooling to accelerate the extraction of key information trapped in PDFs, turning these critical letters into a source of actionable intelligence.
What it takes to be successful with AI
Given the technical depth required, engineers often lead development in a silo, creating a wide gap between the AI build and the business requirements. By the time a subject matter expert (SME) sees the result, it is often not what they needed. The feedback loop is too slow, and the project loses momentum.
During early testing phases, it's essential to establish a baseline. In many cases, other approaches waste months without ground truths, relying instead on subjective observation and "vibes". This lack of empirical evidence stalls progress. Conversely, Databricks tooling provides evaluation features out of the box and lets customers emphasize quality immediately, using an iterative framework to gain mathematical confidence in the extraction. AI success requires a new approach built on rapid, collaborative iteration.
Databricks provides a unified platform where business SMEs and AI engineers can work together in real time to build, test, and deploy production-quality agents. This framework is built on three key principles:
- Tight Business-Technical Alignment: SMEs and tech leads collaborate in the same UI for immediate feedback, replacing slow email loops.
- Ground Truth Evaluation: Business-defined "ground truth" labels are built directly into the workflow for formal scoring.
- A Full Platform Approach: This isn't a sandbox or point solution; it's fully integrated with automated pipelines, LLM-as-a-Judge evaluation, production-reliable GPU throughput, and end-to-end Unity Catalog governance.
This unified platform approach is what turns a prototype into a trusted, production-ready AI system. Let's walk through the four steps to build it.
From PDF to Production: A 4-Step Guide
Building a production-quality AI system on unstructured data requires more than just a good model; it requires a seamless, iterative, and collaborative workflow. The Information Extraction Agent Brick, combined with Databricks' built-in AI capabilities, makes it easy to parse documents, extract key information, and operationalize the entire process. This approach empowers teams to move faster and deliver higher-quality results. Below, we break down the four key steps.
Step 1: Parsing Unstructured PDFs into Text with ai_parse_document()
The first hurdle is getting clean text out of the PDFs. CRLs can have complex layouts with headers, footers, tables, and charts, spread across multiple pages and multiple columns. A simple text extraction will often fail, producing inaccurate and unusable output.
Unlike fragile point solutions that struggle with layout, ai_parse_document() leverages state-of-the-art multimodal AI to understand document structure, accurately extracting text in reading order, preserving irregular table hierarchies, and generating captions for figures.
Additionally, Databricks delivers an advantage in document intelligence by reliably scaling to handle enterprise-level volumes of complex PDFs at 3-5x lower cost than leading competitors. Teams don't need to worry about file size limits, and the OCR and VLM under the hood ensure accurate parsing of historically "problem PDFs" containing dense, irregular figures and other challenging structures.
What once required numerous data scientists to configure and maintain bespoke parsing stacks across multiple vendors can now be accomplished with a single, SQL-native function, allowing teams to process millions of documents in parallel without the failure modes that plague less scalable parsers.
To get started, first point a UC Volume at the cloud storage containing your PDFs. In our example, we'll point the SQL function at the CRL PDFs managed by a Volume:
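A minimal sketch of that call, assuming a hypothetical Volume path (`/Volumes/main/fda/crl_pdfs`) and output table name; substitute your own:

```sql
-- Parse every PDF in the Volume in one pass.
-- The Volume path and table name below are illustrative placeholders.
CREATE OR REPLACE TABLE crl_parsed AS
SELECT
  path,
  ai_parse_document(content) AS parsed   -- multimodal parse of the raw bytes
FROM READ_FILES(
  '/Volumes/main/fda/crl_pdfs',
  format => 'binaryFile'                 -- read PDFs as binary content
);
```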
This single command processes all of your PDFs and creates a structured table with the parsed content and the combined text, making it ready for the next step.
Notice, we didn’t must configure any infrastructure, networking or exterior LLM or GPU calls – Databricks hosts the GPUs and mannequin backend, enabling dependable, scalable throughput with out further configuration. Not like platforms that cost licensing charges, Databricks makes use of a compute-based pricing mannequin – which means you solely pay for the assets you utilize. This permits for highly effective value optimizations by parallelization and function-level customization in your manufacturing pipelines.
Step 2: Iterative Information Extraction with Agent Bricks
Once you have the text, the next goal is to extract specific, structured fields. For example: What was the deficiency? What was the NDA ID? What was the rejection citation? This is where AI engineers and business SMEs need to collaborate closely. The SME knows what to look for and can work with the engineer to quickly prompt the model on how to find it.
Agent Bricks: Information Extraction provides a real-time, collaborative UI for this exact workflow.
As shown below, the interface allows a technical lead and a business SME to work together:
- The Business SME provides the specific fields that need to be extracted (e.g., deficiency_summary_paragraphs, NDA_ID, FDA_Rejection_Citing).
- The Information Extraction Agent translates these requirements into effective prompts; these editable guidelines appear in the right-hand panel.
- Both the Tech Lead and Business SME can immediately see the JSON output in the center panel and validate whether the model is correctly extracting the information from the document on the left. From here, either of the two can reformulate a prompt to ensure accurate extractions.
This prompt feedback loop is the key to success. If a field is not extracted correctly, the team can tweak the prompt, add a new field, or refine the instructions and see the result in seconds. This iterative process, where multiple experts collaborate in a single interface, is what separates successful AI projects from ones that fail in silos.
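For a single letter, the structured output in the center panel might look something like the following, using the field names from the example above; the values are invented purely for illustration:

```json
{
  "NDA_ID": "NDA-214512",
  "FDA_Rejection_Citing": "21 CFR 314.125(b)(13)",
  "deficiency_summary_paragraphs": [
    "The inspection of the drug substance manufacturing facility identified unresolved deficiencies.",
    "The submitted stability data do not support the proposed 24-month expiration dating period."
  ]
}
```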
Step 3: Evaluate and Validate the Agent
In Step 2, we built an agent that, from a "vibe check", looked correct during iterative development. But how do we ensure high accuracy and scalability when surfacing new data? A change in the prompt that fixes one document might break ten others. This is where formal evaluation, a critical and built-in part of the Agent Bricks workflow, comes in.
This step is your quality gate, and it provides two powerful methods for validation:
Method A: Evaluate with Ground Truth Labels (The Gold Standard)
AI, like any data science project, fails in a vacuum without proper domain knowledge. An investment from SMEs to provide a "golden set" (a.k.a. ground truth: labeled datasets of manually extracted, human-validated, correct and relevant information) goes miles toward ensuring the solution generalizes across new data and formats. That's because labeled key:value pairs quickly help the agent tune high-quality prompts, which leads to business-relevant and accurate extractions. Let's dive into how Agent Bricks uses these labels to formally score your agent.
Within the Agent Bricks UI, provide the ground truth test set and, in the background, Agent Bricks runs the agent across the test documents. The UI then shows a side-by-side comparison of your agent's extracted output versus the "correct" labeled answer.
The UI provides a clean accuracy score for each extraction field, which lets you instantly spot regressions when you change a prompt. With Agent Bricks, you gain business-level confidence that the agent is performing at, or above, human-level accuracy.
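In practice, a golden set can be as simple as a table with one row per labeled document; a hypothetical sketch (the table name and schema are illustrative, not an Agent Bricks requirement):

```sql
-- Hypothetical golden-set table: one SME-validated row per document.
CREATE OR REPLACE TABLE crl_ground_truth (
  doc_path STRING,                              -- joins back to the parsed documents
  NDA_ID STRING,                                -- expected value, hand-validated by an SME
  FDA_Rejection_Citing STRING,                  -- expected regulatory citation
  deficiency_summary_paragraphs ARRAY<STRING>   -- expected deficiency text
);
```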
Method B: No Labels? Use LLM-as-a-Judge
However what when you’re ranging from scratch and don’t have any floor reality labels? It is a frequent “chilly begin” drawback.
The Agent Bricks evaluation suite provides a powerful solution: LLM-as-a-Judge. Databricks offers a collection of evaluation frameworks, and Agent Bricks leverages evaluation models to act as an impartial evaluator. The "Judge" model is provided with the original document text and a set of field prompts for each document. The role of the "Judge" is to generate an "expected" response and then evaluate it against the output extracted by the agent.
LLM-as-a-Judge lets you get a scalable, high-quality evaluation score and, note, can also be used in production to ensure agents remain reliable and generalizable under production variability and scale. More on this in a future blog.
Step 4: Integrating the Agent with ai_query() in your ETL pipeline
At this point, you have built your agent in Step 2 and validated its accuracy in Step 3, and you now have the confidence to integrate the extraction into your workflow. With a single click, you can deploy your agent as a serverless model endpoint; immediately, your extraction logic is available as a simple, scalable function.
To do so, use the ai_query() function in SQL to apply this logic to new documents as they arrive. The ai_query() function lets you invoke any model serving endpoint directly and seamlessly in your end-to-end ETL data pipeline.
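A sketch of that step, assuming (hypothetically) that the Step 2 agent was deployed to an endpoint named `crl-info-extraction` and that the combined parsed text from Step 1 lives in a `text` column of a `crl_parsed` table:

```sql
-- Apply the deployed extraction agent to newly parsed documents.
CREATE OR REPLACE TABLE crl_extracted AS
SELECT
  path,
  ai_query(
    'crl-info-extraction',   -- hypothetical serving endpoint name from Step 2
    text                     -- combined parsed text produced in Step 1
  ) AS extraction            -- structured fields returned by the agent
FROM crl_parsed;
```

Because ai_query() is just another SQL expression, the same statement can run incrementally inside a scheduled job as new PDFs land.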
With this, Databricks Lakeflow Jobs ensures you have a fully automated, production-grade ETL pipeline. Your Databricks Job takes raw PDFs landing in your cloud storage, parses them, extracts structured insights using your high-quality agent, and lands them in a table ready for analysis, reporting, or retrieval by a downstream agent application.
Databricks is the next-generation AI platform, one that breaks down the walls between deeply technical teams and the domain experts who hold the context needed to build meaningful AI. Success with AI isn't just models or infrastructure; it's the tight, iterative collaboration between engineers and SMEs, where each refines the other's thinking. Databricks gives teams a single environment to co-develop, experiment quickly, govern responsibly, and put the science back in data science.
Agent Bricks is the embodiment of this vision. With ai_parse_document() to parse unstructured content, Agent Bricks: Information Extraction's collaborative design interface to accelerate high-quality extractions, and ai_query() to apply the solution in production-grade pipelines, teams can move from millions of messy PDFs to validated insights faster than ever.
In our next blog, we'll show how to take these extracted insights and build a production-grade chat agent capable of answering natural-language questions like: "What are the most common manufacturing readiness issues for oncology drugs?"
