Comparing OpenAI SDK

The OpenAI SDK now supports structured outputs natively, making it easier than ever to get typed responses from GPT models.

Let’s explore how this works in practice and where you might hit limitations.

Why working with LLMs requires more than just the OpenAI SDK

OpenAI’s structured outputs look fantastic at first:

from pydantic import BaseModel
from openai import OpenAI

class Resume(BaseModel):
    name: str
    skills: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "John Doe, Python, Rust"}
    ],
    response_format=Resume,
)
resume = completion.choices[0].message.parsed

Simple and type-safe! Let’s add education to make it more realistic:

+class Education(BaseModel):
+    school: str
+    degree: str
+    year: int

class Resume(BaseModel):
    name: str
    skills: list[str]
+   education: list[Education]

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": """John Doe
Python, Rust
University of California, Berkeley, B.S. in Computer Science, 2020"""}
    ],
    response_format=Resume,
)

Still works! But let’s dig deeper…

The prompt mystery

Your extraction works 90% of the time, but fails on certain resumes. You need to debug:

# What prompt is actually being sent?
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": resume_text}],
    response_format=Resume,
)

# You can't see:
# - How the schema is formatted
# - What instructions the model receives
# - Why certain fields are misunderstood

You start experimenting with system messages:

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract resume information accurately."},
        {"role": "user", "content": resume_text}
    ],
    response_format=Resume,
)

# But what if you need more specific instructions?
# How do you tell it to handle edge cases?

Classification without context

Now you need to classify resumes by seniority:

from enum import Enum

class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"

class Resume(BaseModel):
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel

But the model doesn’t know what these levels mean! You try adding a docstring:

class Resume(BaseModel):
    """Resume with seniority classification.

    Seniority levels:
    - junior: 0-2 years experience
    - mid: 2-5 years experience
    - senior: 5-10 years experience
    - staff: 10+ years experience
    """
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel

But you can’t tell whether or how that docstring reaches the model, or how it is rendered in the schema. So you resort to prompt engineering:

messages = [
    {"role": "system", "content": """Extract resume information.

Classify seniority as:
- junior: 0-2 years experience
- mid: 2-5 years experience
- senior: 5-10 years experience
- staff: 10+ years experience"""},
    {"role": "user", "content": resume_text}
]

Now your business logic is split between types and prompts…

The vendor lock-in problem

Your team wants to experiment with Claude for better reasoning:

# With the OpenAI SDK, you're stuck with OpenAI
from openai import OpenAI
client = OpenAI()

# Want to try Claude? Start over with a different SDK
from anthropic import Anthropic
anthropic_client = Anthropic()

# Completely different API
message = anthropic_client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,  # required by the Anthropic SDK
    messages=[{"role": "user", "content": resume_text}],
    # No structured outputs support!
)

# Now you need custom parsing
import json
resume_data = json.loads(message.content[0].text)  # content is a list of blocks
resume = Resume(**resume_data)  # Hope it matches!

Testing and token tracking

You want to test your extraction and track costs:

# How do you test without burning tokens?
def test_resume_extraction():
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": test_resume}],
        response_format=Resume,
    )
    # This costs money every time!

# Mock the OpenAI client?
from unittest.mock import Mock
mock_client = Mock()
mock_client.beta.chat.completions.parse.return_value = ...
# You're not really testing the extraction logic

# Track token usage?
completion = client.beta.chat.completions.parse(...)
print(completion.usage.total_tokens)  # At least this exists!

# But how many tokens does the schema formatting use?
# Could you optimize it?
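
The closest you can get is a rough estimate of the schema's size. A minimal sketch using tiktoken on the raw Pydantic JSON schema (an approximation only, since the exact schema rendering the SDK sends is not visible, which is precisely the problem):

# Rough estimate only: count tokens in the raw JSON schema.
# (Assumes a recent tiktoken that knows the gpt-4o encoding.)
import json
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")
schema_json = json.dumps(Resume.model_json_schema())
print(len(encoding.encode(schema_json)), "schema tokens (approximate)")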

Production complexity creep

As your app scales, you need:

  • Retry logic for rate limits
  • Fallback to GPT-3.5 when GPT-4 is down
  • A/B testing different prompts
  • Structured logging for debugging

Your code evolves:

import time

from openai import OpenAI, RateLimitError

class ResumeExtractor:
    def __init__(self):
        self.client = OpenAI()
        self.fallback_client = OpenAI()  # Different API key?

    def extract_with_retries(self, text: str, max_retries: int = 3):
        for attempt in range(max_retries):
            try:
                return self._extract(text, model="gpt-4o")
            except RateLimitError:
                if attempt == max_retries - 1:
                    # Try fallback model
                    return self._extract(text, model="gpt-3.5-turbo")
                time.sleep(2 ** attempt)

    def _extract(self, text: str, model: str):
        messages = self._build_messages(text)

        completion = self.client.beta.chat.completions.parse(
            model=model,
            messages=messages,
            response_format=Resume,
        )

        self._log_usage(completion, model)
        return completion.choices[0].message.parsed

    # ... more infrastructure code

The simple API is now buried in error handling and logging.

Enter BAML

BAML was built for real-world LLM applications. Here’s the same resume extraction:

class Education {
  school string
  degree string
  year int
}

enum SeniorityLevel {
  JUNIOR @description("0-2 years of experience")
  MID @description("2-5 years of experience")
  SENIOR @description("5-10 years of experience")
  STAFF @description("10+ years of experience, technical leadership")
}

class Resume {
  name string
  skills string[]
  education Education[]
  seniority SeniorityLevel
}

function ExtractResume(resume_text: string) -> Resume {
  client GPT4
  prompt #"
    Extract structured information from this resume.

    When determining seniority, use these guidelines:
    {{ ctx.output_format.seniority }}

    Resume:
    ---
    {{ resume_text }}
    ---

    Output format:
    {{ ctx.output_format }}
  "#
}

See the difference?

  1. The prompt is explicit - No guessing what’s sent
  2. Enums have descriptions - Built into the type system
  3. One place for everything - Types and prompts together
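
And because the generated Python client is typed, downstream code stays ordinary, type-checked Python. A small sketch (it assumes the generated Resume and SeniorityLevel types mirror the BAML definitions above):

from baml_client import baml as b

async def is_senior(resume_text: str) -> bool:
    resume = await b.ExtractResume(resume_text)  # returns a typed Resume
    # seniority comes back as a real enum member, not a raw string
    return resume.seniority.name in ("SENIOR", "STAFF")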

Multi-model freedom

// Define all your models
client<llm> GPT4 {
  provider openai
  options {
    model "gpt-4o"
    temperature 0.1
  }
}

client<llm> GPT35 {
  provider openai
  options {
    model "gpt-3.5-turbo"
    temperature 0.1
  }
}

client<llm> Claude {
  provider anthropic
  options {
    model "claude-3-opus-20240229"
  }
}

client<llm> Llama {
  provider ollama
  options {
    model "llama3"
  }
}

// Use ANY model with the SAME function
function ExtractResume(resume_text: string) -> Resume {
  client GPT4 // Just change this line!
  prompt #"..."#
}

In Python:

from baml_client import baml as b

# Default model
resume = await b.ExtractResume(resume_text)

# Use different models for different scenarios
cheap_extraction = await b.ExtractResume(simple_text, {"client": "GPT35"})
quality_extraction = await b.ExtractResume(complex_text, {"client": "Claude"})
private_extraction = await b.ExtractResume(sensitive_text, {"client": "Llama"})

# Same interface, same types, different models!

Testing without burning money

With BAML’s VSCode extension:

[Screenshot: BAML VSCode playground with instant testing]
  1. Write your test cases - Visual interface for test data
  2. See the exact prompt - No hidden abstractions
  3. Test instantly without API calls
  4. Iterate until perfect - Instant feedback loop
  5. Save test cases for CI/CD
[Screenshot: Opening the BAML playground from VSCode]

No mocking, no token costs, real testing.
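
Test cases themselves are written next to the function as BAML code, so they can be committed and run in CI. A minimal sketch (test-block syntax here follows recent BAML releases and may differ in older versions):

test BasicResume {
  functions [ExtractResume]
  args {
    resume_text #"
      John Doe
      Python, Rust
      University of California, Berkeley, B.S. in Computer Science, 2020
    "#
  }
}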

Built for production

// Retry configuration
client<llm> GPT4WithRetries {
  provider openai
  options {
    model "gpt-4o"
    temperature 0.1
  }
  retry_policy {
    max_retries 3
    strategy exponential_backoff
  }
}

// Fallback chains
client<llm> SmartRouter {
  provider fallback
  options {
    clients ["GPT4", "Claude", "GPT35"]
  }
}

All the production concerns handled declaratively.
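
From application code, nothing changes; using the same client override shown earlier, you point the call at the fallback client (client names are the ones defined above):

# Retries and fallback are handled by the client configuration, not here
resume = await b.ExtractResume(resume_text, {"client": "SmartRouter"})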

The bottom line

OpenAI’s structured outputs are great if you:

  • Only use OpenAI models
  • Don’t need prompt customization
  • Have simple extraction needs

But production LLM applications need more:

BAML’s advantages over the OpenAI SDK:

  • Model flexibility - Works with GPT, Claude, Gemini, Llama, and any future model
  • Prompt transparency - See and optimize exactly what’s sent to the LLM
  • Real testing - Test in VSCode without burning tokens or API calls
  • Production features - Built-in retries, fallbacks, and smart routing
  • Cost optimization - Understand token usage and optimize prompts
  • Schema-Aligned Parsing - Get structured outputs from any model, not just OpenAI
  • Streaming + Structure - Stream structured data with loading bars

Why this matters:

  • Future-proof - Never get locked into one model provider
  • Faster development - Instant testing and iteration in your editor
  • Better reliability - Built-in error handling and fallback strategies
  • Team productivity - Prompts are versioned, testable code
  • Cost control - Optimize token usage across different models

With BAML, you get all the benefits of OpenAI’s structured outputs plus the flexibility and control needed for production applications.

Limitations of BAML

BAML has some limitations:

  1. It’s a new language (though easy to learn)
  2. Best experience needs VSCode
  3. Focused on structured extraction

If you’re building a simple OpenAI-only prototype, the OpenAI SDK is fine. If you’re building production LLM features that need to scale, try BAML.