Comparing OpenAI SDK

The OpenAI SDK now supports structured outputs natively, making it easier than ever to get typed responses from GPT models.

Let’s explore how this works in practice and where you might hit limitations.

Why working with LLMs requires more than just the OpenAI SDK

OpenAI’s structured outputs look fantastic at first:

from pydantic import BaseModel
from openai import OpenAI

class Resume(BaseModel):
    name: str
    skills: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "John Doe, Python, Rust"}
    ],
    response_format=Resume,
)
resume = completion.choices[0].message.parsed

Simple and type-safe! Let’s add education to make it more realistic:

+class Education(BaseModel):
+    school: str
+    degree: str
+    year: int

class Resume(BaseModel):
    name: str
    skills: list[str]
+   education: list[Education]

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": """John Doe
Python, Rust
University of California, Berkeley, B.S. in Computer Science, 2020"""}
    ],
    response_format=Resume,
)

Still works! But let’s dig deeper…

The prompt mystery

Your extraction works 90% of the time, but fails on certain resumes. You need to debug:

# What prompt is actually being sent?
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": resume_text}],
    response_format=Resume,
)

# You can't see:
# - How the schema is formatted
# - What instructions the model receives
# - Why certain fields are misunderstood

You start experimenting with system messages:

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract resume information accurately."},
        {"role": "user", "content": resume_text}
    ],
    response_format=Resume,
)

# But what if you need more specific instructions?
# How do you tell it to handle edge cases?

Classification without context

Now you need to classify resumes by seniority:

from enum import Enum

class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"

class Resume(BaseModel):
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel

But the model doesn’t know what these levels mean! You try adding a docstring:

class Resume(BaseModel):
    """Resume with seniority classification.

    Seniority levels:
    - junior: 0-2 years experience
    - mid: 2-5 years experience
    - senior: 5-10 years experience
    - staff: 10+ years experience
    """
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel

But you can’t tell whether or how that docstring reaches the model, or how it is rendered in the schema. So you resort to prompt engineering:

messages = [
    {"role": "system", "content": """Extract resume information.

Classify seniority as:
- junior: 0-2 years experience
- mid: 2-5 years experience
- senior: 5-10 years experience
- staff: 10+ years experience"""},
    {"role": "user", "content": resume_text}
]

Now your business logic is split between types and prompts…

The vendor lock-in problem

Your team wants to experiment with Claude for better reasoning:

# With the OpenAI SDK, you're stuck with OpenAI
from openai import OpenAI
client = OpenAI()

# Want to try Claude? Start over with a different SDK
from anthropic import Anthropic
anthropic_client = Anthropic()

# Completely different API
message = anthropic_client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,  # required by the Anthropic SDK
    messages=[{"role": "user", "content": resume_text}],
    # No structured outputs support!
)

# Now you need custom parsing
import json
resume_data = json.loads(message.content[0].text)  # content is a list of blocks
resume = Resume(**resume_data)  # Hope it matches!

Testing and token tracking

You want to test your extraction and track costs:

# How do you test without burning tokens?
def test_resume_extraction():
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": test_resume}],
        response_format=Resume,
    )
    # This costs money every time!

# Mock the OpenAI client?
from unittest.mock import Mock
mock_client = Mock()
mock_client.beta.chat.completions.parse.return_value = ...
# You're not really testing the extraction logic

# Track token usage?
completion = client.beta.chat.completions.parse(...)
print(completion.usage.total_tokens)  # At least this exists!

# But how many tokens does the schema formatting use?
# Could you optimize it?
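
The closest you can get is a rough estimate of the schema's size. A minimal sketch using tiktoken on the raw Pydantic JSON schema (an approximation only, since the exact schema rendering the SDK sends is not visible, which is precisely the problem):

# Rough estimate only: count tokens in the raw JSON schema.
# (Assumes a recent tiktoken that knows the gpt-4o encoding.)
import json
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")
schema_json = json.dumps(Resume.model_json_schema())
print(len(encoding.encode(schema_json)), "schema tokens (approximate)")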

Production complexity creep

As your app scales, you need:

  • Retry logic for rate limits
  • Fallback to GPT-3.5 when GPT-4 is down
  • A/B testing different prompts
  • Structured logging for debugging

Your code evolves:

import time

from openai import OpenAI, RateLimitError

class ResumeExtractor:
    def __init__(self):
        self.client = OpenAI()
        self.fallback_client = OpenAI()  # Different API key?

    def extract_with_retries(self, text: str, max_retries: int = 3):
        for attempt in range(max_retries):
            try:
                return self._extract(text, model="gpt-4o")
            except RateLimitError:
                if attempt == max_retries - 1:
                    # Try fallback model
                    return self._extract(text, model="gpt-3.5-turbo")
                time.sleep(2 ** attempt)

    def _extract(self, text: str, model: str):
        messages = self._build_messages(text)

        completion = self.client.beta.chat.completions.parse(
            model=model,
            messages=messages,
            response_format=Resume,
        )

        self._log_usage(completion, model)
        return completion.choices[0].message.parsed

    # ... more infrastructure code

The simple API is now buried in error handling and logging.

Enter BAML

BAML was built for real-world LLM applications. Here’s the same resume extraction:

class Education {
  school string
  degree string
  year int
}

enum SeniorityLevel {
  JUNIOR @description("0-2 years of experience")
  MID @description("2-5 years of experience")
  SENIOR @description("5-10 years of experience")
  STAFF @description("10+ years of experience, technical leadership")
}

class Resume {
  name string
  skills string[]
  education Education[]
  seniority SeniorityLevel
}

function ExtractResume(resume_text: string) -> Resume {
  client GPT4
  prompt #"
    Extract structured information from this resume.

    When determining seniority, use these guidelines:
    {{ ctx.output_format.seniority }}

    Resume:
    ---
    {{ resume_text }}
    ---

    Output format:
    {{ ctx.output_format }}
  "#
}

See the difference?

  1. The prompt is explicit - No guessing what’s sent
  2. Enums have descriptions - Built into the type system
  3. One place for everything - Types and prompts together
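
And because the generated Python client is typed, downstream code stays ordinary, type-checked Python. A small sketch (it assumes the generated Resume and SeniorityLevel types mirror the BAML definitions above):

from baml_client import baml as b

async def is_senior(resume_text: str) -> bool:
    resume = await b.ExtractResume(resume_text)  # returns a typed Resume
    # seniority comes back as a real enum member, not a raw string
    return resume.seniority.name in ("SENIOR", "STAFF")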

Multi-model freedom

// Define all your models
client<llm> GPT4 {
  provider openai
  options {
    model "gpt-4o"
    temperature 0.1
  }
}

client<llm> GPT35 {
  provider openai
  options {
    model "gpt-3.5-turbo"
    temperature 0.1
  }
}

client<llm> Claude {
  provider anthropic
  options {
    model "claude-3-opus-20240229"
  }
}

client<llm> Llama {
  provider ollama
  options {
    model "llama3"
  }
}

// Use ANY model with the SAME function
function ExtractResume(resume_text: string) -> Resume {
  client GPT4 // Just change this line!
  prompt #"..."#
}

In Python:

from baml_client import baml as b

# Default model
resume = await b.ExtractResume(resume_text)

# Use different models for different scenarios
cheap_extraction = await b.ExtractResume(simple_text, {"client": "GPT35"})
quality_extraction = await b.ExtractResume(complex_text, {"client": "Claude"})
private_extraction = await b.ExtractResume(sensitive_text, {"client": "Llama"})

# Same interface, same types, different models!

Testing without burning money

With BAML’s VSCode extension:

[Screenshot: BAML VSCode playground with instant testing]
  1. Write your test cases - Visual interface for test data
  2. See the exact prompt - No hidden abstractions
  3. Test instantly without API calls
  4. Iterate until perfect - Instant feedback loop
  5. Save test cases for CI/CD
[Screenshot: Opening the BAML playground from VSCode]

No mocking, no token costs, real testing.
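
Test cases themselves are written next to the function as BAML code, so they can be committed and run in CI. A minimal sketch (test-block syntax here follows recent BAML releases and may differ in older versions):

test BasicResume {
  functions [ExtractResume]
  args {
    resume_text #"
      John Doe
      Python, Rust
      University of California, Berkeley, B.S. in Computer Science, 2020
    "#
  }
}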

Built for production

// Retry configuration
client<llm> GPT4WithRetries {
  provider openai
  options {
    model "gpt-4o"
    temperature 0.1
  }
  retry_policy {
    max_retries 3
    strategy exponential_backoff
  }
}

// Fallback chains
client<llm> SmartRouter {
  provider fallback
  options {
    clients ["GPT4", "Claude", "GPT35"]
  }
}

All the production concerns handled declaratively.
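
From application code, nothing changes; using the same client override shown earlier, you point the call at the fallback client (client names are the ones defined above):

# Retries and fallback are handled by the client configuration, not here
resume = await b.ExtractResume(resume_text, {"client": "SmartRouter"})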

The bottom line

OpenAI’s structured outputs are great if you:

  • Only use OpenAI models
  • Don’t need prompt customization
  • Have simple extraction needs

But production LLM applications need more:

BAML’s advantages over the OpenAI SDK:

  • Model flexibility - Works with GPT, Claude, Gemini, Llama, and any future model
  • Prompt transparency - See and optimize exactly what’s sent to the LLM
  • Real testing - Test in VSCode without burning tokens or API calls
  • Production features - Built-in retries, fallbacks, and smart routing
  • Cost optimization - Understand token usage and optimize prompts
  • Schema-Aligned Parsing - Get structured outputs from any model, not just OpenAI
  • Streaming + Structure - Stream structured data with loading bars

Why this matters:

  • Future-proof - Never get locked into one model provider
  • Faster development - Instant testing and iteration in your editor
  • Better reliability - Built-in error handling and fallback strategies
  • Team productivity - Prompts are versioned, testable code
  • Cost control - Optimize token usage across different models

With BAML, you get all the benefits of OpenAI’s structured outputs plus the flexibility and control needed for production applications.

Limitations of BAML

BAML has some limitations:

  1. It’s a new language (though easy to learn)
  2. Best experience needs VSCode
  3. Focused on structured extraction

If you’re building a simple OpenAI-only prototype, the OpenAI SDK is fine. If you’re building production LLM features that need to scale, try BAML.