Multi-Modal (Images / Audio)

Multi-modal input

You can use audio, image, pdf, or video input types in BAML prompts. Just create an input argument of that type and render it in the prompt.

Switch from “Prompt Review” to “Raw cURL” in the playground to see how BAML translates multi-modal input into the LLM request body.

// "image" is a reserved keyword so we name the arg "img"
function DescribeMedia(img: image) -> string {
  client "openai-responses/gpt-5" // GPT-5 has excellent multimodal support
  // Most LLM providers require images or audio to be sent as "user" messages.
  prompt #"
    {{_.role("user")}}
    Describe this image: {{ img }}
  "#
}

// See the "testing functions" Guide for more on testing multimodal functions
test Test {
  functions [DescribeMedia]
  args {
    img {
      url "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
    }
  }
}

See how to test images in the playground.

Try it! Press ‘Run Test’ below!

Calling Multimodal BAML Functions

Images

Calling a BAML function with an image input argument type (see image types).

The Image.from_url and Image.from_base64 methods construct an Image object from the corresponding source.

from baml_py import Image
from baml_client import b

async def test_image_input():
    # from URL
    res = await b.TestImageInput(
        img=Image.from_url(
            "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
        )
    )

    # Base64 image
    image_b64 = "iVBORw0K...."
    res = await b.TestImageInput(
        img=Image.from_base64("image/png", image_b64)
    )
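Image.from_base64 expects an explicit MIME type alongside the data. If you are reading local files, the standard library can usually derive the MIME type from the filename; the helper below is a hypothetical sketch, not part of baml_py:

```python
import base64
import mimetypes

def image_file_to_base64(path: str) -> tuple[str, str]:
    """Return (mime_type, base64_data) for a local image file.

    Hypothetical helper: derives the MIME type from the file
    extension and Base64-encodes the file contents.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    with open(path, "rb") as f:
        return mime, base64.b64encode(f.read()).decode("ascii")

# mime, data = image_file_to_base64("shrek.png")
# img = Image.from_base64(mime, data)
```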

Audio

Calling functions that have audio input types (see audio types).

from baml_py import Audio
from baml_client import b

async def run():
    # from URL
    res = await b.TestAudioInput(
        audio=Audio.from_url(
            "https://actions.google.com/sounds/v1/emergency/beeper_emergency_call.ogg"
        )
    )

    # Base64
    b64 = "iVBORw0K...."
    res = await b.TestAudioInput(
        audio=Audio.from_base64("audio/ogg", b64)
    )

Pdf

Calling functions that have pdf input types (see pdf types).

⚠️ Warning Pdf inputs must be provided as Base64 data (e.g. Pdf.from_base64); URL-based Pdf inputs are not currently supported. Additionally, Pdf inputs are only supported by models that explicitly allow document (Pdf) modalities, such as Gemini 2.x Flash/Pro or VertexAI Gemini. Make sure the client you select advertises Pdf support; otherwise your request will fail.

from baml_py import Pdf
from baml_client import b

async def run():
    # Base64 data
    b64 = "JVBERi0K...."
    res = await b.TestPdfInput(
        pdf=Pdf.from_base64("application/pdf", b64)
    )

Video

Calling functions that have video input types (see video types).

⚠️ Warning Video inputs require a model that supports video understanding (for example, Gemini 2.x Flash/Pro). If your chosen model does not list video support, your function call will return an error. When you supply a Video as a URL, the URL is forwarded unchanged to the model; if the model cannot fetch remote content, you must instead pass the bytes via Video.from_base64.

from baml_py import Video
from baml_client import b

async def run():
    # from URL
    res = await b.TestVideoInput(
        video=Video.from_url(
            "https://example.com/sample.mp4"
        )
    )

    # Base64
    b64 = "AAAAGGZ0eXBpc29t...."
    res = await b.TestVideoInput(
        video=Video.from_base64("video/mp4", b64)
    )
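If your target model cannot fetch remote content, one workaround is to download the bytes yourself and hand them to Video.from_base64. A sketch using only the standard library (the helper name is hypothetical):

```python
import base64
import urllib.request

def url_to_base64(url: str) -> str:
    """Download a media URL and return its contents as a Base64 string."""
    with urllib.request.urlopen(url) as resp:
        return base64.b64encode(resp.read()).decode("ascii")

# video = Video.from_base64("video/mp4", url_to_base64("https://example.com/sample.mp4"))
```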

Controlling URL Resolution

By default, BAML automatically handles URL-to-base64 conversion based on what each provider supports. However, you can customize this behavior using the media_url_handler configuration:

Example: Optimizing for Performance

If you’re using Anthropic and want to avoid the latency of URL fetching:

1client<llm> FastClaude {
2 provider anthropic
3 options {
4 model "claude-3-5-sonnet-20241022"
5 api_key env.ANTHROPIC_API_KEY
6 media_url_handler {
7 image "send_url" // Anthropic can fetch URLs directly
8 pdf "send_base64" // Required by Anthropic API (As of October 2025)
9 }
10 }
11}

Example: Working with Google Cloud Storage

When using Google AI with images stored in GCS:

1client<llm> GeminiWithGCS {
2 provider google-ai
3 options {
4 model "gemini-1.5-pro"
5 api_key env.GOOGLE_API_KEY
6 media_url_handler {
7 image "send_base64_unless_google_url" // Preserve gs:// URLs, convert others
8 }
9 }
10}

Example: Ensuring Compatibility

For maximum compatibility across providers:

1client<llm> CompatibleClient {
2 provider openai
3 options {
4 model "gpt-4o"
5 api_key env.OPENAI_API_KEY
6 media_url_handler {
7 image "send_base64" // Ensure images are embedded
8 audio "send_base64" // OpenAI requires base64 for audio
9 pdf "send_base64" // Embed PDFs for reliability
10 }
11 }
12}

URL Handler Options

  1. send_url - Lets providers fetch URLs themselves, reducing request payload size
  2. send_base64 - Embeds the content in the request, avoiding external dependencies
  3. send_url_add_mime_type - Sends the URL along with its MIME type, which some providers require (if the MIME type is not provided, the content is downloaded to determine it)
  4. send_base64_unless_google_url - Preserves Google Cloud Storage (gs://) URLs for Google providers and converts all other URLs to Base64
See the provider documentation for provider-specific defaults and requirements.