Why BAML?

Let’s say you want to extract structured data from resumes. It starts simple enough…

But first, let’s see where we’re going with this story:

BAML: What it is and how it helps - see the full developer experience

It starts simple

You begin with a basic LLM call to extract a name and skills:

import openai

def extract_resume(text):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Extract name and skills from: {text}"}]
    )
    return response.choices[0].message.content

This works… sometimes. But you need structured data, not free text.

You need structure

So you try JSON mode and add Pydantic for validation:

from pydantic import BaseModel
import json

class Resume(BaseModel):
    name: str
    skills: list[str]

def extract_resume(text):
    prompt = f"""Extract resume data as JSON:
{text}

Return JSON with fields: name (string), skills (array of strings)"""

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    data = json.loads(response.choices[0].message.content)
    return Resume(**data)

Better! But now you need more fields. You add education, experience, and location:

class Education(BaseModel):
    school: str
    degree: str
    year: int

class Resume(BaseModel):
    name: str
    skills: list[str]
    education: list[Education]
    location: str
    years_experience: int

The prompt gets longer and more complex. But wait - how do you test this without burning tokens?

Testing becomes expensive

Every test costs money and takes time:

# This burns tokens every time you run tests!
def test_resume_extraction():
    test_resume = "John Doe, Python expert, MIT 2020..."
    result = extract_resume(test_resume)  # API call = $$$
    assert result.name == "John Doe"

You try mocking, but then you’re not testing your actual extraction logic. Your prompt could be completely broken and tests would still pass.
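For instance, a mocked test might look something like this sketch (the exact patch target depends on your OpenAI SDK version and how you call it; here it matches the module-level openai.chat.completions.create call used above). The canned response never looks at your prompt, so even a broken prompt template sails through:

from unittest.mock import MagicMock, patch

def test_resume_extraction_with_mock():
    # Canned API response: the mock ignores the prompt entirely,
    # so a completely broken prompt template still "passes".
    fake_response = MagicMock()
    fake_response.choices[0].message.content = '{"name": "John Doe", "skills": ["Python"]}'

    with patch("openai.chat.completions.create", return_value=fake_response):
        result = extract_resume("John Doe, Python expert, MIT 2020...")

    assert result.name == "John Doe"  # passes regardless of prompt quality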

Error handling nightmare

Real resumes break your extraction. The LLM returns malformed JSON:

Resume extraction error in traditional approach
{
  "name": "John Doe",
  "skills": ["Python", "JavaScript"
  // Missing closing bracket!

You add retry logic, JSON fixing, error handling:

import re
import time
from pydantic import ValidationError

def extract_resume(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = openai.chat.completions.create(...)
            content = response.choices[0].message.content

            # Try to fix common JSON issues
            content = fix_json(content)

            data = json.loads(content)
            return Resume(**data)
        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

def fix_json(content):
    # Remove text before/after JSON
    json_match = re.search(r'\{.*\}', content, re.DOTALL)
    if json_match:
        content = json_match.group(0)

    # Fix common issues
    content = content.replace(',}', '}')
    content = content.replace(',]', ']')
    # ... more fixes

    return content

Your simple extraction function is now 50+ lines of infrastructure code.

Multi-model chaos

Your company wants to use Claude for some tasks (better reasoning) and GPT-4o mini for others (cost savings):

def extract_resume(text, provider="openai", model="gpt-4o"):
    if provider == "openai":
        import openai
        client = openai.OpenAI()
        response = client.chat.completions.create(model=model, ...)
    elif provider == "anthropic":
        import anthropic
        client = anthropic.Anthropic()
        # Different API! Need to rewrite everything
        response = client.messages.create(model=model, ...)
        # ... handle different response formats

Each provider has different APIs, different response formats, different capabilities. Your code becomes a mess of if/else statements.
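To make that concrete, here is a short sketch of how differently the two SDKs return the same answer (response shapes follow each provider's documented Python SDK; the prompt variable is just a placeholder):

import openai
import anthropic

prompt = "Extract name and skills from: ..."

# OpenAI: the text lives at choices[0].message.content
oai_client = openai.OpenAI()
oai_response = oai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
oai_text = oai_response.choices[0].message.content

# Anthropic: the text lives at content[0].text, and max_tokens is required
ant_client = anthropic.Anthropic()
ant_response = ant_client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
ant_text = ant_response.content[0].text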

The prompt mystery

Your extraction fails on certain resumes. You need to debug, but what was actually sent to the LLM?

# What prompt was generated? How many tokens did it use?
# Why did this specific resume fail?
# How do I optimize for cost?

# You can't easily see:
# - The exact prompt that was sent
# - How the schema was formatted
# - Token usage breakdown
# - Why specific fields were missed

You start adding logging, token counting, prompt inspection tools…
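Something like this sketch, for example (the usage fields follow the OpenAI response object; the logging scheme itself is just illustrative):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("resume_extraction")

def extract_resume_logged(text):
    prompt = f"Extract name and skills from: {text}"
    logger.info("Prompt sent to LLM:\n%s", prompt)

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )

    # Token usage comes back on the response object
    usage = response.usage
    logger.info(
        "Tokens - prompt: %d, completion: %d, total: %d",
        usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
    )
    return response.choices[0].message.content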

Classification gets complex

Now you need to classify seniority levels:

from enum import Enum

class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"

class Resume(BaseModel):
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel

But the LLM doesn’t know what these levels mean! You update the prompt:

prompt = f"""Extract resume data as JSON:

Seniority levels:
- junior: 0-2 years experience
- mid: 2-5 years experience
- senior: 5-10 years experience
- staff: 10+ years experience

{text}

Return JSON with fields: name, skills, education, seniority..."""

Your prompt is getting huge and your business logic is scattered between code and strings.

Production deployment headaches

In production, you need:

  • Retry policies for rate limits
  • Fallback models when primary is down
  • Cost tracking and optimization
  • Error monitoring and alerting
  • A/B testing different prompts

Your simple extraction function becomes a complex service:

from openai import RateLimitError

class ResumeExtractor:
    def __init__(self):
        self.primary_client = openai.OpenAI()
        self.fallback_client = anthropic.Anthropic()
        self.token_tracker = TokenTracker()    # your in-house helpers
        self.error_monitor = ErrorMonitor()

    async def extract_with_fallback(self, text):
        try:
            return await self._extract_openai(text)
        except RateLimitError:
            return await self._extract_anthropic(text)
        except Exception as e:
            self.error_monitor.log(e)
            raise

    async def _extract_openai(self, text):
        # 50+ lines of OpenAI-specific logic
        pass

    async def _extract_anthropic(self, text):
        # 50+ lines of Anthropic-specific logic
        pass

Enter BAML

What if you could go back to something simple, but keep all the power?

class Education {
  school string
  degree string
  year int
}

enum SeniorityLevel {
  JUNIOR @description("0-2 years of experience")
  MID @description("2-5 years of experience")
  SENIOR @description("5-10 years of experience")
  STAFF @description("10+ years of experience, technical leadership")
}

class Resume {
  name string
  skills string[]
  education Education[]
  seniority SeniorityLevel
}

function ExtractResume(resume_text: string) -> Resume {
  client GPT4
  prompt #"
    Extract information from this resume.

    For seniority level, consider:
    {{ ctx.output_format.seniority }}

    Resume:
    ---
    {{ resume_text }}
    ---

    {{ ctx.output_format }}
  "#
}

Look what you get immediately:

BAML playground showing successful resume extraction with clear prompts and structured output

1. Instant Testing

Test in VSCode playground without API calls or token costs:

VSCode playground showing resume extraction with prompt preview
  • See the exact prompt that will be sent to the LLM
  • Test with real data instantly - no API calls needed
  • Save test cases for regression testing
  • Visual prompt preview shows token usage and formatting
VSCode test cases interface

Build up a library of test cases that run instantly

2. Multi-Model Made Simple

client<llm> GPT4 {
  provider openai
  options { model "gpt-4o" }
}

client<llm> Claude {
  provider anthropic
  options { model "claude-3-opus-20240229" }
}

client<llm> GPT4Mini {
  provider openai
  options { model "gpt-4o-mini" }
}

// Same function, any model - just change the client
function ExtractResume(resume_text: string) -> Resume {
  client GPT4 // Switch to Claude or GPT4Mini with one line
  prompt #"..."#
}

3. Schema-Aligned Parsing (SAP)

BAML’s breakthrough innovation follows Postel’s Law: “Be conservative in what you do, be liberal in what you accept from others.”

Instead of rejecting imperfect outputs, SAP actively transforms them to match your schema using custom edit distance algorithms.
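As a rough illustration (not BAML's actual parser), here is the kind of "almost JSON" that strict json.loads rejects outright but that schema-aware parsing can still recover into a Resume:

import json

# Typical LLM output: a markdown fence, a trailing comma, and a stray comment.
llm_output = """```json
{
  "name": "John Doe",
  "skills": ["Python", "JavaScript",],  // pulled from work history
}
```"""

try:
    json.loads(llm_output)
except json.JSONDecodeError as e:
    print("strict parsing fails:", e)

# A schema-aligned parser instead asks: what is the closest value that fits
# Resume { name string, skills string[] }? Here that is
# {"name": "John Doe", "skills": ["Python", "JavaScript"]} - no retry needed.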

SAP vs Other Approaches:

Model            Function Calling    Python AST Parser    SAP
gpt-3.5-turbo    87.5%               75.8%                92%
gpt-4o           87.4%               82.1%                93%
claude-3-haiku   57.3%               82.6%                91.7%

Key insight: SAP with gpt-3.5-turbo beats GPT-4o with structured outputs, saving you money while improving accuracy.

4. Production Features Built-In

retry_policy Exponential {
  max_retries 3
  strategy {
    type exponential_backoff
  }
}

client<llm> RobustGPT4 {
  provider openai
  retry_policy Exponential
  options { model "gpt-4o" }
}

client<llm> SmartFallback {
  provider fallback
  options {
    strategy [
      GPT4
      Claude
      GPT4Mini
    ]
  }
}

5. Token Optimization

  • See exact token usage for every call
  • BAML’s schema format uses 80% fewer tokens than JSON Schema (see the rough comparison after this list)
  • Optimize prompts with instant feedback
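To get an intuition for why a compact schema saves tokens, compare a JSON Schema for a small Resume type with a type-hint-style rendering (the compact form below is illustrative only, not BAML's exact output format):

import json

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "skills": {"type": "array", "items": {"type": "string"}},
        "years_experience": {"type": "integer"},
    },
    "required": ["name", "skills", "years_experience"],
})

compact_schema = """{
  name: string,
  skills: string[],
  years_experience: int
}"""

# Character count as a crude proxy for tokens; use a tokenizer for real numbers.
print(f"JSON Schema: {len(json_schema)} characters")
print(f"Compact schema: {len(compact_schema)} characters")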

6. Type Safety Everywhere

Generated BAML client with type safety
from baml_client import baml as b

# Fully typed, works in Python, TypeScript, Java, Go
resume = await b.ExtractResume(resume_text)
print(resume.seniority)  # Type: SeniorityLevel

BAML generates fully typed clients for all languages automatically

See how changes instantly update the prompt:

BAML prompt view updating in real-time as types change

Change your types → Prompt automatically updates → See the difference immediately

7. Advanced Streaming with UI Integration

BAML’s semantic streaming lets you build real UIs with loading bars and type-safe implementations:

class BlogPost {
  title string @stream.done @stream.not_null
  content string @stream.with_state
}

What this enables:

  • Loading bars - Show progress as structured data streams in
  • Semantic guarantees - Title only appears when complete, content streams token by token
  • Type-safe streaming - Full TypeScript/Python types for partial data
  • UI state management - Know exactly what’s loading vs complete

See semantic streaming in action - structured data streaming with loading states
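For a sense of what the generated client gives you, here is a hedged Python sketch of consuming a streamed call. The b.stream interface, import path, and partial-type shape are assumptions based on BAML's generated async client; check your generated code for the exact names:

from baml_client.async_client import b  # assumed import path for the async client

async def show_streaming(resume_text: str):
    stream = b.stream.ExtractResume(resume_text)

    # Each partial is a typed, possibly-incomplete Resume; fields marked
    # @stream.done only appear once they are final.
    async for partial in stream:
        print("so far:", partial)

    final = await stream.get_final_response()
    print("done:", final.seniority)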

The Bottom Line

You started with: a simple LLM call
You ended up with: hundreds of lines of infrastructure code

With BAML, you get:

  • The simplicity of your first attempt
  • All the production features you built manually
  • Better reliability than you could build yourself
  • 10x faster development iteration
  • Full control and transparency

BAML is what LLM development should have been from the start. Ready to see the difference? Get started with BAML.