Multi-Modal (Images / Audio)

Multi-modal input

You can use audio or image input types in BAML prompts. Just create an input argument of that type and render it in the prompt.

Check the “raw curl” checkbox in the playground to see how BAML translates multi-modal input into the LLM Request body.
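As a rough illustration of what you'll see there, an image URL typically lands in the request body as an image_url content part. This sketch uses OpenAI's chat completions shape as an assumption; the exact JSON BAML emits may differ, so use the "raw curl" view to confirm:

```python
import json

# Sketch of an OpenAI-style chat completions body with an image part.
# Field names follow OpenAI's format, not BAML's verified output.
body = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image:"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
                    },
                },
            ],
        }
    ],
}
print(json.dumps(body, indent=2))
```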

1// "image" is a reserved keyword so we name the arg "img"
2function DescribeMedia(img: image) -> string {
3 client openai/gpt-4o
4 // Most LLM providers require images or audio to be sent as "user" messages.
5 prompt #"
6 {{_.role("user")}}
7 Describe this image: {{ img }}
8 "#
9}
10
11// See the "testing functions" Guide for more on testing Multimodal functions
12test Test {
13 args {
14 img {
15 url "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
16 }
17 }
18}

See how to test images in the playground.

Calling Multimodal BAML Functions

Images

Here's how to call a BAML function that takes an image input argument (see image types).

The from_url and from_base64 methods construct an Image object from a URL or from base64-encoded data, respectively.

```python
from baml_py import Image
from baml_client import b

async def test_image_input():
    # from a URL
    res = await b.TestImageInput(
        img=Image.from_url(
            "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
        )
    )

    # from base64-encoded data
    image_b64 = "iVBORw0K...."
    res = await b.TestImageInput(
        img=Image.from_base64("image/png", image_b64)
    )
```
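If the image lives on disk rather than at a URL, you can base64-encode the raw bytes yourself before passing them to from_base64. A minimal standard-library sketch (the byte string below is a stand-in for real image data read from a file):

```python
import base64

# Stand-in for bytes from a real file, e.g. open("shrek.png", "rb").read()
raw_bytes = b"\x89PNG fake image data"

# from_base64 expects a base64 string plus the media type
image_b64 = base64.b64encode(raw_bytes).decode("ascii")

# You would then call: b.TestImageInput(img=Image.from_base64("image/png", image_b64))
print(image_b64)
```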

Audio

You can call functions that take audio input arguments the same way (see audio types). Audio.from_url and Audio.from_base64 mirror their Image counterparts.

```python
from baml_py import Audio
from baml_client import b

async def run():
    # from a URL
    res = await b.TestAudioInput(
        audio=Audio.from_url(
            "https://actions.google.com/sounds/v1/emergency/beeper_emergency_call.ogg"
        )
    )

    # from base64-encoded data
    b64 = "iVBORw0K...."
    res = await b.TestAudioInput(
        audio=Audio.from_base64("audio/ogg", b64)
    )
```
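The first argument to from_base64 is the media type, which you can derive from a file name instead of hard-coding it. A small sketch using Python's standard mimetypes module (the file name here is illustrative):

```python
import mimetypes

# Guess the media type from an (illustrative) audio file name
media_type, _ = mimetypes.guess_type("emergency_call.mp3")

# You would then call: Audio.from_base64(media_type, b64)
print(media_type)
```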