Comparing Langchain

Langchain is one of the most popular frameworks for building LLM applications. It provides abstractions for chains, agents, memory, and more.

Let’s dive into how Langchain handles structured extraction and where it falls short.

Why working with LLMs requires more than just Langchain

Langchain makes structured extraction look simple at first:

from typing import List

from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class Resume(BaseModel):
    name: str
    skills: List[str]

llm = ChatOpenAI(model="gpt-4o")
structured_llm = llm.with_structured_output(Resume)
result = structured_llm.invoke("John Doe, Python, Rust")

That’s pretty neat! But now let’s add an Education model to make it more realistic:

+class Education(BaseModel):
+    school: str
+    degree: str
+    year: int

 class Resume(BaseModel):
     name: str
     skills: List[str]
+    education: List[Education]

 structured_llm = llm.with_structured_output(Resume)
 result = structured_llm.invoke("""John Doe
 Python, Rust
 University of California, Berkeley, B.S. in Computer Science, 2020""")

Still works… but what’s actually happening under the hood? What prompt is being sent? How many tokens are we using?

Let’s dig deeper. Say you want to see what’s actually being sent to the model:

# How do you debug this?
structured_llm = llm.with_structured_output(Resume)

# You need to enable verbose mode or dig into callbacks
from langchain.globals import set_debug
set_debug(True)

# Now you get TONS of debug output...

But even with debug mode, you still can’t easily:

  • Modify the extraction prompt
  • See the exact token count
  • Understand why extraction failed for certain inputs
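In practice, the "dig into callbacks" route means writing a custom callback handler. Here's a minimal sketch, assuming you just want to print the chat messages right before they go to the provider (the PromptSpy class is ours, not part of Langchain). Note that the schema with_structured_output attaches travels as a separate tool definition, so it still doesn't show up here:

from langchain_core.callbacks import BaseCallbackHandler

class PromptSpy(BaseCallbackHandler):
    def on_chat_model_start(self, serialized, messages, **kwargs):
        # Print every chat message Langchain is about to send to the provider
        for batch in messages:
            for msg in batch:
                print(f"[{type(msg).__name__}] {msg.content}")

result = structured_llm.invoke(
    "John Doe, Python, Rust",
    config={"callbacks": [PromptSpy()]},
)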

When things go wrong

Here’s where it gets tricky. Your PM asks: “Can we classify these resumes by seniority level?”

from enum import Enum

class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"

class Resume(BaseModel):
    name: str
    skills: List[str]
    education: List[Education]
    seniority: SeniorityLevel

But now you realize you need to give the LLM context about what each level means:

# Wait... how do I tell the LLM that "junior" means 0-2 years experience?
# How do I customize the prompt?

# You end up doing this:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

CLASSIFICATION_PROMPT = """
Given the resume below, classify the seniority level:
- junior: 0-2 years experience
- mid: 2-5 years experience
- senior: 5-10 years experience
- staff: 10+ years experience

Resume: {resume_text}
"""

# Now you need separate chains...
classification_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(CLASSIFICATION_PROMPT))
extraction_chain = llm.with_structured_output(Resume)

# And combine them somehow...
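
One way to combine them is to glue the two calls together by hand. A rough sketch (the process_resume helper is ours, and it assumes the classification chain replies with exactly one of the enum values, which it often won't):

def process_resume(resume_text: str) -> Resume:
    # Run the classification chain; LLMChain returns a dict with a "text" key
    label = classification_chain.invoke({"resume_text": resume_text})["text"]

    # Run the structured extraction separately
    resume = extraction_chain.invoke(resume_text)

    # Overwrite whatever seniority the extractor guessed, and hope the
    # classifier answered with exactly "junior", "mid", "senior", or "staff"
    resume.seniority = SeniorityLevel(label.strip().lower())
    return resume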

Your clean code is starting to look messy. But wait, there’s more!

Multi-model madness

Your company wants to use Claude for some tasks (better reasoning) and GPT-4o-mini for others (cost savings). With Langchain:

from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

# Different providers, different imports
claude = ChatAnthropic(model="claude-3-opus-20240229")
gpt4 = ChatOpenAI(model="gpt-4o")
gpt4_mini = ChatOpenAI(model="gpt-4o-mini")

# But wait... does Claude support structured outputs the same way?
claude_structured = claude.with_structured_output(Resume)  # May not work!

# You need provider-specific handling
if provider == "anthropic":
    # Use function calling? XML? JSON mode?
    # Different providers have different capabilities
    pass

Testing nightmare

Now you want to test your extraction logic without burning through API credits:

# How do you test this?
structured_llm = llm.with_structured_output(Resume)

# Mock the entire LLM?
from unittest.mock import Mock
mock_llm = Mock()
mock_llm.with_structured_output.return_value.invoke.return_value = Resume(...)

# But you're not really testing your extraction logic...
# Just that your mocks work

With BAML, testing is visual and instant:

[Screenshot: VSCode test case buttons. Test your prompts instantly without API calls or mocking.]

The token mystery

Your CFO asks: “Why is our OpenAI bill so high?” You investigate:

# How many tokens does this use?
structured_llm = llm.with_structured_output(Resume)
result = structured_llm.invoke(long_resume_text)

# You need callbacks or token counting utilities
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = structured_llm.invoke(long_resume_text)
    print(f"Tokens: {cb.total_tokens}")  # Finally!

But you still don’t know WHY it’s using so many tokens. Is it the schema format? The prompt template? The retry logic?
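
You can at least get a feel for the schema overhead by dumping the JSON schema that with_structured_output derives from your Pydantic model. A rough back-of-the-envelope sketch, assuming Pydantic v2 (the 4-characters-per-token figure is only a heuristic):

import json

# The nested JSON schema that gets attached as a tool/function definition
schema_json = json.dumps(Resume.model_json_schema(), indent=2)
print(schema_json)

# Very rough estimate: ~4 characters per token
print(f"~{len(schema_json) // 4} tokens just for the schema")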

Enter BAML

BAML was built specifically for these LLM challenges. Here’s the same resume extraction:

class Education {
  school string
  degree string
  year int
}

class Resume {
  name string
  skills string[]
  education Education[]
  seniority SeniorityLevel
}

enum SeniorityLevel {
  JUNIOR @description("0-2 years of experience")
  MID @description("2-5 years of experience")
  SENIOR @description("5-10 years of experience")
  STAFF @description("10+ years of experience, technical leadership")
}

function ExtractResume(resume_text: string) -> Resume {
  client GPT4
  prompt #"
    Extract information from this resume.

    For seniority level, consider:
    {{ ctx.output_format.seniority }}

    Resume:
    ---
    {{ resume_text }}
    ---

    {{ ctx.output_format }}
  "#
}

Now look what you get:

  1. See exactly what’s sent to the LLM - The prompt is right there!
  2. Test without API calls - Use the VSCode playground
  3. Switch models instantly - Just change client GPT4 to client Claude
  4. Token count visibility - BAML shows exact token usage
  5. Modify prompts easily - It’s just a template string

Multi-model support done right

// Define all your clients in one place
client<llm> GPT4 {
  provider openai
  options {
    model "gpt-4o"
    temperature 0.1
  }
}

client<llm> GPT4Mini {
  provider openai
  options {
    model "gpt-4o-mini"
    temperature 0.1
  }
}

client<llm> Claude {
  provider anthropic
  options {
    model "claude-3-opus-20240229"
    max_tokens 4096
  }
}

// Same function works with ANY model
function ExtractResume(resume_text: string) -> Resume {
  client GPT4  // Just change this line
  prompt #"..."#
}

Use it in Python:

from baml_client import baml as b

# Use default model
resume = await b.ExtractResume(resume_text)

# Override at runtime based on your needs
resume_complex = await b.ExtractResume(complex_text, {"client": "Claude"})
resume_simple = await b.ExtractResume(simple_text, {"client": "GPT4Mini"})

The bottom line

Langchain is great for building complex LLM applications with chains, agents, and memory. But for structured extraction, you’re fighting against abstractions that hide important details.

BAML gives you what Langchain can’t:

  • Full prompt transparency - See and control exactly what’s sent to the LLM
  • Native testing - Test in VSCode without API calls or burning tokens
  • Multi-model by design - Switch providers with one line, works with any model
  • Token visibility - Know exactly what you’re paying for and optimize costs
  • Type safety - Generated clients with autocomplete that always match your schema
  • Schema-Aligned Parsing - Get structured outputs from any model, even without function calling
  • Streaming + Structure - Stream structured data with loading bars and type-safe parsing

Why this matters for production:

  • Faster iteration - See changes instantly without running Python code
  • Better debugging - Know exactly why extraction failed
  • Cost optimization - Understand and reduce token usage
  • Model flexibility - Never get locked into one provider
  • Team collaboration - Prompts are code, not hidden strings

We built BAML because we were tired of wrestling with framework abstractions when all we wanted was reliable structured extraction with full developer control.

Limitations of BAML

BAML does have some limitations we are continuously working on:

  1. It is a new language. However, it is fully open source and getting started takes less than 10 minutes
  2. Developing requires VSCode. You could use vim but we don’t recommend it
  3. It’s focused on structured extraction - not a full LLM framework like Langchain

If you need complex chains and agents, use Langchain. If you want the best structured extraction experience with full control, try BAML.