Comparing Pydantic

Pydantic is a popular library for data validation in Python used by most — if not all — LLM frameworks, like instructor.

BAML also uses Pydantic. The BAML Rust compiler can generate Pydantic models from your .baml files. But that’s not all the compiler does — it also takes care of fixing common LLM parsing issues, supports more data types, handles retries, and reduces the amount of boilerplate code you have to write.

Let’s dive into how Pydantic is used and its limitations.

Why working with LLMs requires more than just Pydantic

At first glance, Pydantic makes it easy to get structured output from an LLM:

from typing import List, Union

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # assumes an OpenAI client configured via OPENAI_API_KEY

class Resume(BaseModel):
    name: str
    skills: List[str]

def create_prompt(input_text: str) -> str:
    PROMPT_TEMPLATE = f"""Parse the following resume and return a structured representation of the data in the schema below.
Resume:
---
{input_text}
---

Schema:
{Resume.model_json_schema()['properties']}

Output JSON:
"""
    return PROMPT_TEMPLATE

def extract_resume(input_text: str) -> Union[Resume, None]:
    prompt = create_prompt(input_text)
    chat_completion = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "system", "content": prompt}]
    )
    try:
        output = chat_completion.choices[0].message.content
        if output:
            return Resume.model_validate_json(output)
        return None
    except Exception as e:
        raise e

That’s pretty good, but now we want to add an Education model to the Resume model. We add the following code:

...
+class Education(BaseModel):
+    school: str
+    degree: str
+    year: int

class Resume(BaseModel):
    name: str
    skills: List[str]
+    education: List[Education]

def create_prompt(input_text: str) -> str:
    additional_models = ""
+    if "$defs" in Resume.model_json_schema():
+        additional_models += f"\nUse these other schema definitions as well:\n{Resume.model_json_schema()['$defs']}"
    PROMPT_TEMPLATE = f"""Parse the following resume and return a structured representation of the data in the schema below.
Resume:
---
{input_text}
---

Schema:
{Resume.model_json_schema()['properties']}

{additional_models}

Output JSON:
""".strip()
    return PROMPT_TEMPLATE
...

A little ugly, but still readable… But managing all these prompt strings can make your codebase disorganized very quickly.

Then you realize the LLM sometimes outputs some text before giving you the JSON, like this:

+The output is:
{
    "name": "John Doe",
    ... // truncated for brevity
}

So you add a regex to address that, extracting everything between the curly braces:

+import re

def extract_resume(input_text: str) -> Union[Resume, None]:
    prompt = create_prompt(input_text)
    print(prompt)
    chat_completion = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "system", "content": prompt}]
    )
    try:
        output = chat_completion.choices[0].message.content
        print(output)
        if output:
+            # Extract the JSON block using a regex
+            json_match = re.search(r"\{.*\}", output, re.DOTALL)
+            if json_match:
+                json_output = json_match.group(0)
+                return Resume.model_validate_json(json_output)
        return None
    except Exception as e:
        raise e

Next you realize you actually want an array of Resumes, but you can’t just validate against List[Resume] because it isn’t itself a Pydantic model, so you have to add another wrapper:

+class ResumeArray(BaseModel):
+    resumes: List[Resume]

Now you need to change the rest of your code to handle the different models. That’s good in the long term, but it’s more boilerplate you have to write, test, and maintain.

Next, you notice the LLM sometimes outputs a single resume {...}, and sometimes an array [{...}]… You must now change your parser to handle both cases:

+import json

+def extract_resume(input_text: str) -> Union[List[Resume], None]:
+    prompt = create_prompt(input_text)  # Also requires changes
    chat_completion = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "system", "content": prompt}]
    )
    try:
        output = chat_completion.choices[0].message.content
        if output:
            # Extract the JSON block (object or array) using a regex
            json_match = re.search(r"\[.*\]|\{.*\}", output, re.DOTALL)
            if json_match:
                json_output = json_match.group(0)
+                parsed = json.loads(json_output)
+                if isinstance(parsed, list):
+                    # An array of resume objects
+                    return [Resume.model_validate(item) for item in parsed]
+                else:
+                    # A single object: either the ResumeArray wrapper or one resume
+                    if "resumes" in parsed:
+                        return ResumeArray.model_validate(parsed).resumes
+                    return [Resume.model_validate(parsed)]
        return None
    except Exception as e:
        raise e

You could retry the call against the LLM to fix the issue, but that will cost you precious seconds and tokens, so handling this corner case manually in your parser is usually the more practical option.


A small tangent — JSON schemas vs type definitions

Sidenote: At this point your prompt looks like this:

JSON Schema:
{'name': {'title': 'Name', 'type': 'string'}, 'skills': {'items': {'type': 'string'}, 'title': 'Skills', 'type': 'array'}, 'education': {'anyOf': [{'$ref': '#/$defs/Education'}, {'type': 'null'}]}}
Use these other JSON schema definitions as well:
{'Education': {'properties': {'degree': {'title': 'Degree', 'type': 'string'}, 'major': {'title': 'Major', 'type': 'string'}, 'school': {'title': 'School', 'type': 'string'}, 'year': {'title': 'Year', 'type': 'integer'}}, 'required': ['degree', 'major', 'school', 'year'], 'title': 'Education', 'type': 'object'}}

and sometimes even GPT-4 outputs incorrect results like the one below. It is technically valid JSON, but not the shape you asked for, and OpenAI’s “JSON mode” will not prevent it:

{
"name":
{
"title": "Name",
"type": "string",
"value": "John Doe"
},
"skills":
{
"items":
{
"type": "string",
"values":
[
"Python",
"JavaScript",
"React"
]
... // truncated for brevity

(this is an actual result from GPT-4 before some more prompt engineering)

when all you really want is a prompt that looks like the one below, with far fewer tokens (and less likelihood of confusion):

Parse the following resume and return a structured representation of the data in the schema below.
Resume:
---
John Doe
Python, Rust
University of California, Berkeley, B.S. in Computer Science, 2020
---

+JSON Schema:
+{
+  "name": string,
+  "skills": string[],
+  "education": {
+    "school": string,
+    "degree": string,
+    "year": integer
+  }[]
+}

Output JSON:

Ahh, much better. That’s 80% fewer tokens with a simpler prompt, for the same results. (See also Microsoft’s TypeChat, which uses a similar schema format based on TypeScript types.)
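If you want to keep Pydantic models as the source of truth, you can hand-roll a helper that renders them in this compact form. Here’s a minimal sketch, assuming only str/int/float/bool fields, lists, and nested models (optional fields, unions, and descriptions would all need extra cases); compact_schema is just an illustrative helper, not a library API:

from typing import Type, get_args, get_origin

from pydantic import BaseModel

def compact_schema(model: Type[BaseModel], indent: int = 0) -> str:
    # Render a Pydantic model as a compact, TypeScript-like type definition
    pad = "  " * indent
    lines = ["{"]
    for name, field in model.model_fields.items():
        lines.append(f'{pad}  "{name}": {_render(field.annotation, indent + 1)},')
    lines[-1] = lines[-1].rstrip(",")  # no trailing comma after the last field
    lines.append(pad + "}")
    return "\n".join(lines)

def _render(annotation, indent: int) -> str:
    if get_origin(annotation) is list:
        (inner,) = get_args(annotation)
        return _render(inner, indent) + "[]"
    if isinstance(annotation, type) and issubclass(annotation, BaseModel):
        return compact_schema(annotation, indent)
    return {str: "string", int: "integer", float: "number", bool: "boolean"}.get(annotation, "string")

Dropping compact_schema(Resume) into the prompt instead of model_json_schema() gets you the smaller prompt above, but it is one more piece of prompt-formatting code you now have to test and maintain yourself.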


But we digress, let’s get back to the point. You can see how this can get out of hand quickly, and how Pydantic wasn’t really made with LLMs in mind. We haven’t gotten around to adding resilience like retries, or falling back to a different model in the event of an outage. There’s still a lot of wrapper code to write.
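To make that concrete, here is a minimal retry-and-fallback sketch of the kind of wrapper you would still have to own. It assumes a hypothetical extract_resume that accepts a model argument (the version above does not), and the model names and backoff are illustrative only:

import time
from typing import List, Optional

def extract_resume_with_fallback(input_text: str, max_attempts: int = 3) -> Optional[List[Resume]]:
    last_error: Optional[Exception] = None
    # Try the primary model first, then fall back to a cheaper one on failure
    for model in ("gpt-4", "gpt-3.5-turbo"):
        for attempt in range(max_attempts):
            try:
                return extract_resume(input_text, model=model)  # hypothetical signature
            except Exception as e:
                last_error = e
                time.sleep(2 ** attempt)  # crude exponential backoff
    raise last_error

None of this is hard, but it is all code you write, test, and maintain around every single LLM function.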

Pydantic and Enums

There are other core limitations. Say you want to do a classification task using Pydantic. An Enum is a great fit for modelling this.

Assume this is our prompt:

Classify the company described in this text into the best
of the following categories:
Text:
---
{some_text}
---
Categories:
- Technology: Companies involved in the development and production of technology products or services
- Healthcare: Includes companies in pharmaceuticals, biotechnology, medical devices.
- Real estate: Includes real estate investment trusts (REITs) and companies involved in real estate development.
The best category is:

Since each category has a description, we need a custom enum that carries both the label and the description so we can build the prompt:

from enum import Enum

class FinancialCategory(Enum):
    technology = (
        "Technology",
        "Companies involved in the development and production of technology products or services.",
    )
    ...
    real_estate = (
        "Real Estate",
        "Includes real estate investment trusts (REITs) and companies involved in real estate development.",
    )

    def __init__(self, category, description):
        self._category = category
        self._description = description

    @property
    def category(self):
        return self._category

    @property
    def description(self):
        return self._description

We add a class method to load the right enum from the LLM output string:

    @classmethod
    def from_string(cls, category: str) -> "FinancialCategory":
        for c in cls:
            if c.category == category:
                return c
        raise ValueError(f"Invalid category: {category}")

Update the prompt to use the enum descriptions:

def format_categories_and_descriptions() -> str:
    # Build the "- Category: description" lines for the prompt
    return "\n".join(
        f"- {c.category}: {c.description}" for c in FinancialCategory
    )

def create_prompt(text: str) -> str:
    PROMPT_TEMPLATE = f"""Classify the company described in this text into the best
of the following categories:

Text:
---
{text}
---

Categories:
{format_categories_and_descriptions()}

The best category is:
"""
    return PROMPT_TEMPLATE

And then we use it in our AI function:

def classify_company(text: str) -> Union[FinancialCategory, None]:
    prompt = create_prompt(text)
    chat_completion = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "system", "content": prompt}]
    )
    try:
        output = chat_completion.choices[0].message.content
        if output:
            # Use our helper function!
            return FinancialCategory.from_string(output)
        return None
    except Exception as e:
        raise e

Things get hairy when you want to change your types.

  • What if you want the LLM to return an object instead? You have to change your enum, your prompt, AND your parser.
  • What if you want to handle cases where the LLM outputs “Real Estate” or “real estate”?
  • What if you want to save the enum information in a database? str(category) will save FinancialCategory.healthcare into your DB, but your parser only recognizes “Healthcare”, so you’ll need more boilerplate if you ever want to programmatically analyze your data (see the sketch after this list).
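Covering just the last two bullets already means more glue code around the enum. Here is a minimal, purely illustrative sketch (parse_category and category_db_value are hypothetical helpers, not part of the code above):

def parse_category(raw: str) -> FinancialCategory:
    # Tolerate whitespace, stray quotes, and casing differences in the LLM output
    normalized = raw.strip().strip('"').lower()
    for c in FinancialCategory:
        if c.category.lower() == normalized:
            return c
    raise ValueError(f"Invalid category: {raw}")

def category_db_value(c: FinancialCategory) -> str:
    # Store "Healthcare" rather than "FinancialCategory.healthcare" so the data
    # stays queryable without knowing the Python enum
    return c.category

And you would still have to thread these helpers through classify_company and anywhere else the enum is parsed or persisted.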

Alternatives

Libraries like instructor do take care of a lot of this boilerplate for you, but you’re still:

  1. Using prompts that you cannot control. E.g. an upstream commit may change your results underneath you.
  2. Using more tokens than you need to declare schemas (higher costs and latencies).
  3. Left without built-in testing capabilities. Developers have to copy-paste JSON blobs everywhere, potentially between their IDEs and other websites. Existing LLM playgrounds were not made with structured data in mind.
  4. Lacking observability. No automatic tracing of requests.

Enter BAML

The Boundary toolkit lets you iterate much more seamlessly than with Pydantic alone.

Here’s all the BAML code you need to solve the Extract Resume problem from earlier (VSCode prompt preview is shown on the right):

Here we use a “GPT4” client, but you can use any model. See client docs

The BAML compiler generates a Python client that you import to call the function:

from baml_client import baml as b

async def main():
    resume = await b.ExtractResume(resume_text="""John Doe
Python, Rust
University of California, Berkeley, B.S. in Computer Science, 2020""")

    assert resume.name == "John Doe"

That’s it! No need to write any more code. Since the compiler knows your function signature, it generates a custom deserializer for your unique use case that just works.

Converting the Resume into an array of resumes requires a single line change in BAML (vs having to create array wrapper classes and parsing logic).

In this image we change the types and BAML automatically updates the prompt, parser, and the Python types you get back.
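On the Python side the call site barely changes. A hedged sketch of what the generated call might look like once the return type is an array (the exact generated signature depends on your BAML version and function definition):

from baml_client import baml as b

async def main():
    # ExtractResume now returns a list of Resume objects
    resumes = await b.ExtractResume(resume_text="...")
    assert resumes[0].name == "John Doe"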

Adding retries or resilience requires just a couple of modifications. And best of all, you can test things instantly, without leaving your VSCode.

Conclusion

We built BAML because a Python library alone was just not powerful enough to do everything we envisioned, as we have just explored.

Check out the Hello World tutorial to get started.

Our mission is to make the best DX for AI engineers working with LLMs. Contact us at founders@boundaryml.com or join us on Discord to stay in touch with the community and influence the roadmap.