Multi-Modal (Images / Audio)
Multi-modal input
You can use audio, image, pdf, or video input types in BAML prompts. Just create an input argument of that type and render it in the prompt.
Switch from “Prompt Preview” to “Raw cURL” in the playground to see how BAML translates multi-modal input into the LLM request body.
See how to test images in the playground.
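For example, given a hypothetical BAML function DescribeImage(img: image) -> string whose prompt renders {{ img }}, the generated Python client can be called with an Image object. This is a minimal sketch assuming a sync baml_client has been generated; the function name is illustrative.

```python
# Minimal sketch. Assumes a BAML function like
#   function DescribeImage(img: image) -> string
# has been defined and the Python client generated into baml_client.
from baml_py import Image
from baml_client import b  # generated sync client

# Pass the image by URL; BAML renders it into the LLM request body.
img = Image.from_url("https://example.com/photo.png")
print(b.DescribeImage(img))
```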
Calling Multimodal BAML Functions
Images
Calling a BAML function with an image input argument type (see image types).
The from_url and from_base64 methods create an Image object from a URL or from base64-encoded data, respectively.
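A sketch of both constructors, assuming the same hypothetical DescribeImage function and a generated sync Python client; from_base64 takes a media type alongside the encoded data.

```python
import base64

from baml_py import Image
from baml_client import b  # generated sync client (assumed setup)

# Construct an Image from a publicly reachable URL.
print(b.DescribeImage(Image.from_url("https://example.com/photo.png")))

# Or from base64: supply the media type along with the encoded bytes.
with open("photo.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
print(b.DescribeImage(Image.from_base64("image/png", encoded)))
```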
Audio
Calling functions that have audio types. See audio types.
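A sketch along the same lines, assuming a hypothetical function TranscribeAudio(recording: audio) -> string and a generated sync Python client.

```python
import base64

from baml_py import Audio
from baml_client import b  # generated sync client (assumed setup)

# Construct an Audio value from a publicly reachable URL.
print(b.TranscribeAudio(Audio.from_url("https://example.com/clip.mp3")))

# Or from base64-encoded bytes plus a media type.
with open("clip.mp3", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
print(b.TranscribeAudio(Audio.from_base64("audio/mp3", encoded)))
```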
PDF
Calling functions that have pdf types. See pdf types.
⚠️ Warning Pdf inputs must be provided as Base64 data (e.g. Pdf.from_base64). URL-based Pdf inputs are not currently supported. Additionally, Pdf inputs are only supported by models that explicitly allow document (Pdf) modalities, such as Gemini 2.x Flash/Pro or Vertex AI Gemini. Make sure the client you select advertises Pdf support; otherwise your request will fail.
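A sketch assuming a hypothetical function ExtractInvoice(doc: pdf) -> string bound to a Pdf-capable client (e.g. Gemini 2.x Flash), and assuming Pdf is exposed from baml_py like Image and Audio; check the pdf type reference for the exact from_base64 signature.

```python
import base64

from baml_py import Pdf  # assumes Pdf is exported alongside Image/Audio
from baml_client import b  # generated sync client (assumed setup)

# Pdf inputs must be base64 data; URL-based Pdfs are not supported.
with open("invoice.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# If from_base64 mirrors Image/Audio it takes a media type first;
# otherwise pass only the encoded data (see the pdf type reference).
doc = Pdf.from_base64("application/pdf", encoded)

# ExtractInvoice is hypothetical; its client must advertise Pdf support.
print(b.ExtractInvoice(doc))
```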
Video
Calling functions that have video types. See video types.
⚠️ Warning Video inputs require a model that supports video understanding (for example Gemini 2.x Flash/Pro). If your chosen model does not list video support, your function call will return an error. When you supply a Video as a URL, the URL is forwarded unchanged to the model; if the model cannot fetch remote content, you must instead pass the bytes via Video.from_base64.
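A sketch assuming a hypothetical function DescribeVideo(vid: video) -> string bound to a video-capable client (e.g. Gemini 2.x Flash), and assuming Video is exposed from baml_py like Image and Audio.

```python
import base64

from baml_py import Video  # assumes Video is exported alongside Image/Audio
from baml_client import b  # generated sync client (assumed setup)

# From a URL: the URL is forwarded unchanged to the model, so the model
# must be able to fetch remote content.
print(b.DescribeVideo(Video.from_url("https://example.com/demo.mp4")))

# If the model cannot fetch remote content, send the bytes as base64.
# (If from_base64 mirrors Image/Audio, it takes a media type first.)
with open("demo.mp4", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
print(b.DescribeVideo(Video.from_base64("video/mp4", encoded)))
```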