Streaming

BAML lets you stream in structured JSON output from LLMs as it comes in.

If you stream raw JSON output from an LLM, you receive a series of incomplete strings like:

{"items": [{"name": "Appl
{"items": [{"name": "Apple", "quantity": 2, "price": 1.
{"items": [{"name": "Apple", "quantity": 2, "price": 1.50}], "total_cost":
{"items": [{"name": "Apple", "quantity": 2, "price": 1.50}], "total_cost": 3.00} # Completed

BAML gives you fine-grained control of how it fixes this partial JSON and transforms it into a series of semantically valid partial objects.
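To see why the intermediate chunks need repair at all, here is a minimal sketch that uses only Python's standard json module (not BAML itself). It feeds the partial strings shown above to a strict parser. The helper name try_parse is hypothetical and exists only for this illustration.

```python
import json

# The chunks from the example above, as they might arrive over the wire.
chunks = [
    '{"items": [{"name": "Appl',
    '{"items": [{"name": "Apple", "quantity": 2, "price": 1.',
    '{"items": [{"name": "Apple", "quantity": 2, "price": 1.50}], "total_cost":',
    '{"items": [{"name": "Apple", "quantity": 2, "price": 1.50}], "total_cost": 3.00}',
]


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON yet."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


results = [try_parse(chunk) for chunk in chunks]
for chunk, result in zip(chunks, results):
    status = "ok" if result is not None else "invalid JSON"
    print(f"{status}: {chunk}")
```

Only the last chunk parses with a strict parser, so a consumer that wants to show progress needs some form of partial-JSON repair.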

You can check out more examples (including streaming in FastAPI and NextJS) in the BAML Examples repo.

Let’s stream the output of the function ExtractReceiptInfo(email: string) -> ReceiptInfo for our example:

class ReceiptItem {
  name string
  description string?
  quantity int
  price float
}

class ReceiptInfo {
  items ReceiptItem[]
  total_cost float?
}

function ExtractReceiptInfo(email: string) -> ReceiptInfo {
  client GPT4o
  prompt #"
    Given the receipt below:

    {{ email }}

    {{ ctx.output_format }}
  "#
}

The BAML code generator creates a set of types in a module called partial_types inside the baml_client library. These types are modified from your original types to support streaming.

By default, BAML will convert all Class fields into nullable fields, and fill those fields with non-null values as much as possible given the tokens received so far.
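As a rough illustration of what "all fields nullable" means in practice, the sketch below hand-writes partial counterparts of ReceiptItem and ReceiptInfo with standard-library dataclasses. This is an approximation under stated assumptions, not the code BAML generates: the names PartialReceiptItem and PartialReceiptInfo are hypothetical, and treating the list field as an empty list rather than null is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PartialReceiptItem:
    # Every scalar field is Optional so a snapshot can exist before it arrives.
    name: Optional[str] = None
    description: Optional[str] = None
    quantity: Optional[int] = None
    price: Optional[float] = None


@dataclass
class PartialReceiptInfo:
    # Assumption: the list starts empty and grows as items stream in.
    items: List[PartialReceiptItem] = field(default_factory=list)
    total_cost: Optional[float] = None


# Snapshot after the LLM has emitted only: {"items": [{"name": "Appl
early = PartialReceiptInfo(items=[PartialReceiptItem(name="Appl")])

# Snapshot after the first item is complete but total_cost has not arrived.
later = PartialReceiptInfo(
    items=[PartialReceiptItem(name="Apple", quantity=2, price=1.50)]
)

print(early)
print(later)
```

Each snapshot is a valid object on its own; fields that have not been received yet are simply None.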

BAML will generate b.stream.ExtractReceiptInfo() for you, which you can use like so:

main.py
import asyncio
from baml_client import b, partial_types, types

# Using a stream:
def example1(receipt: str):
    stream = b.stream.ExtractReceiptInfo(receipt)

    # partial is a Partial type with all Optional fields
    for partial in stream:
        print(f"partial: parsed {len(partial.items)} items (object: {partial})")

    # final is the full, original, validated ReceiptInfo type
    final = stream.get_final_response()
    print(f"final: {len(final.items)} items (object: {final})")

# Using only get_final_response() of a stream
#
# In this case, you should just use b.ExtractReceiptInfo(receipt) instead,
# which is slightly faster and more efficient.
def example2(receipt: str):
    final = b.stream.ExtractReceiptInfo(receipt).get_final_response()
    print(f"final: {len(final.items)} items (object: {final})")

# Using the async client:
async def example3(receipt: str):
    # Note the import of the async client
    from baml_client.async_client import b
    stream = b.stream.ExtractReceiptInfo(receipt)
    async for partial in stream:
        print(f"partial: parsed {len(partial.items)} items (object: {partial})")

    final = await stream.get_final_response()
    print(f"final: {len(final.items)} items (object: {final})")

receipt = """
04/14/2024 1:05 pm

Ticket: 220000082489
Register: Shop Counter
Employee: Connor
Customer: Sam
Item # Price
Guide leash (1 Pair) uni UNI
1 $34.95
The Index Town Walls
1 $35.00
Boot Punch
3 $60.00
Subtotal $129.95
Tax ($129.95 @ 9%) $11.70
Total Tax $11.70
Total $141.65
"""

if __name__ == '__main__':
    # example1 and example2 use the sync client, so call them directly;
    # only example3 is a coroutine and needs asyncio.run.
    example1(receipt)
    example2(receipt)
    asyncio.run(example3(receipt))

Number fields are streamed only once the LLM has completed them. For example, if the final number is 129.95, you’ll see only null or 129.95, never partial numbers like 1, 12, or 129.9.
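The rule for numbers can be sketched in a few lines of plain Python. This is an illustration of the idea only, not BAML's parser: the function streamed_number and its parameters are hypothetical, and it assumes the raw text for a field starts at the first digit of the number.

```python
from typing import Optional

# Characters that can legally follow a number inside a JSON document.
DELIMITERS = ",}] \t\r\n"


def streamed_number(raw: str, stream_finished: bool = False) -> Optional[float]:
    """Return the number only when it can no longer grow, else None.

    `raw` is the text received so far for one numeric field, starting at
    the first digit and possibly including the delimiter that follows it.
    """
    if not raw:
        return None
    if raw[-1] in DELIMITERS:
        # A delimiter arrived, so no more digits can be appended.
        return float(raw.rstrip(DELIMITERS))
    if stream_finished:
        # The stream ended, so the digits seen so far are final.
        return float(raw)
    return None  # more digits may still arrive


for received in ["1", "12", "129.9", "129.95", "129.95,"]:
    print(repr(received), "->", streamed_number(received))
```

Note that "129.95" alone still yields None: until a delimiter or the end of the stream is seen, another digit could follow.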

Semantic Streaming

BAML provides powerful attributes to control how your data streams, ensuring that partial values always maintain semantic validity. Here are the three key streaming attributes:

@stream.done

This attribute ensures a type or field is only streamed when it’s completely finished. It’s useful when you need atomic, fully-formed values.

For example:

class ReceiptItem {
  name string
  quantity int
  price float

  // The entire ReceiptItem will only stream when complete
  @@stream.done
}

// Receipts is a list of ReceiptItems,
// each internal item will only stream when complete
type Receipts = ReceiptItem[]

class Person {
  // Name will only appear when fully complete,
  // until then it will be null
  name string @stream.done
  // Numbers (floats and ints) will only appear
  // when fully complete by default
  age int
  // Bio will stream token by token
  bio string
}

@stream.not_null

This attribute ensures a containing object is only streamed when this field has a value. It’s particularly useful for discriminator fields or required metadata.

For example:

class Message {
  // Message won't stream until type is known
  type "error" | "success" | "info" @stream.not_null
  // Timestamp will only appear when fully complete
  // until then it will be null
  timestamp string @stream.done
  // Content can stream token by token
  content string
}
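Because type is guaranteed to be present whenever a partial Message is delivered, consumer code can branch on it without a null check. The sketch below shows that pattern with a hand-written dataclass standing in for the generated partial type. PartialMessage, PREFIX, and render are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class PartialMessage:
    # Stand-in for the generated partial type (an assumption, not BAML output).
    type: Literal["error", "success", "info"]  # @stream.not_null: never None
    timestamp: Optional[str] = None            # @stream.done: None until complete
    content: Optional[str] = None              # streams token by token


PREFIX = {"error": "[!]", "success": "[ok]", "info": "[i]"}


def render(msg: PartialMessage) -> str:
    # No None check on msg.type is needed; timestamp and content may be missing.
    when = msg.timestamp or "..."
    return f"{PREFIX[msg.type]} {when} {msg.content or ''}".rstrip()


print(render(PartialMessage(type="error", content="Disk is alm")))
print(render(PartialMessage(type="error", timestamp="12:01",
                            content="Disk is almost full")))
```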

@stream.with_state

This attribute adds metadata to track if a field has finished streaming. It’s perfect for showing loading states in UIs.

For example:

class BlogPost {
  // The blog post will only stream when title is known
  title string @stream.done @stream.not_null
  // The content will stream token by token, and include completion state
  content string @stream.with_state
}

This will generate the following code in the partial_types module:

from typing import Generic, Literal, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class StreamState(BaseModel, Generic[T]):
    value: T
    state: Literal["Pending", "Incomplete", "Complete"]

class BlogPost(BaseModel):
    title: str
    content: StreamState[str | None]

Type Transformation Summary

Here’s how these attributes affect your types in generated code:

| BAML Type | Generated Type (during streaming) | Description |
|---|---|---|
| T | Partial[T]? | Default: Nullable and partial |
| T @stream.done | T? | Nullable but always complete when present |
| T @stream.not_null | Partial[T] | Always present but may be partial |
| T @stream.done @stream.not_null | T | Always present and always complete |
| T @stream.with_state | StreamState[Partial[T]?] | Includes streaming state metadata |

The return type of a function is not affected by streaming attributes!
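The rows of the table above compose in a regular way, which the following sketch encodes as a small function over type names written as strings. The function streaming_type is hypothetical and purely illustrative. Combinations not listed in the table (for example @stream.with_state together with @stream.done) are an extrapolation, not documented behavior.

```python
def streaming_type(t: str, done: bool = False, not_null: bool = False,
                   with_state: bool = False) -> str:
    """Return the streaming-time type name for BAML type `t`."""
    # @stream.done removes the Partial wrapper.
    inner = t if done else f"Partial[{t}]"
    # Without @stream.not_null the value may be absent, marked with "?".
    if not not_null:
        inner += "?"
    # @stream.with_state wraps the result in StreamState.
    return f"StreamState[{inner}]" if with_state else inner


print(streaming_type("T"))                            # Partial[T]?
print(streaming_type("T", done=True))                 # T?
print(streaming_type("T", not_null=True))             # Partial[T]
print(streaming_type("T", done=True, not_null=True))  # T
print(streaming_type("T", with_state=True))           # StreamState[Partial[T]?]
```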

Putting it all together

Let’s put all of these concepts together to design an application that streams a conversation containing stock recommendations, using semantic streaming to ensure that the streamed data obeys our domain’s invariants.

enum Stock {
  APPL
  MSFT
  GOOG
  BAML
}

// Make recommendations atomic - we do not want a recommendation to be
// modified by streaming additional messages.
class Recommendation {
  stock Stock
  amount float
  action "buy" | "sell"
  @@stream.done
}

class AssistantMessage {
  message_type "greeting" | "conversation" | "farewell" @stream.not_null
  message string @stream.with_state @stream.not_null
}

// UserMessage is not defined in this snippet; it is assumed to exist elsewhere.
function Respond(
  history: (UserMessage | AssistantMessage | Recommendation)[]
) -> AssistantMessage | Recommendation {
  client DeepseekR1
  prompt #"
    Make the message in the conversation, using a conversational
    message or a stock recommendation, based on this conversation history:
    {{ history }}.

    {{ ctx.output_format }}
  "#
}

The above BAML code will generate the following Python definitions in the partial_types module. The use of streaming attributes has several effects on the generated code:

  • Recommendation does not have any partial fields because it was marked @stream.done.
  • The AssistantMessage.message string is wrapped in StreamState, allowing runtime checking of its completion status. This status could be used to render a spinner as the message streams in.
  • The AssistantMessage.message_type field is never null, because it was marked @stream.not_null.

from enum import Enum
from typing import Generic, Literal, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class StreamState(BaseModel, Generic[T]):
    value: T
    state: Literal["Pending", "Incomplete", "Complete"]

class Stock(str, Enum):
    APPL = "APPL"
    MSFT = "MSFT"
    GOOG = "GOOG"
    BAML = "BAML"

class Recommendation(BaseModel):
    stock: Stock
    amount: float
    action: Literal["buy", "sell"]

class AssistantMessage(BaseModel):
    message_type: Literal["greeting", "conversation", "farewell"]
    message: StreamState[str]
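
The spinner idea mentioned above can be sketched as follows. To stay self-contained, the example uses a hand-written dataclass that mirrors the StreamState shape shown above instead of importing the generated type. The function render_message and its output format are hypothetical.

```python
from dataclasses import dataclass
from typing import Generic, Literal, TypeVar

T = TypeVar("T")


@dataclass
class StreamState(Generic[T]):
    # Hand-written stand-in mirroring the generated shape (an assumption).
    value: T
    state: Literal["Pending", "Incomplete", "Complete"]


def render_message(message: StreamState[str]) -> str:
    """Show a placeholder or a trailing marker until streaming has finished."""
    if message.state == "Pending":
        return "(waiting)"
    if message.state == "Incomplete":
        return f"{message.value} ..."
    return message.value


print(render_message(StreamState(value="", state="Pending")))
print(render_message(StreamState(value="Hello, how can", state="Incomplete")))
print(render_message(StreamState(value="Hello, how can I help?", state="Complete")))
```

In a real UI the trailing marker would typically be replaced by a spinner component that is removed once state reaches "Complete".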