Add multiple modalities in a single message by domenic · Pull Request #89 · webmachinelearning/prompt-api
I will propose an alternative PR with that.
First, let's try to have some discussion on alternatives. Here are some choices in the design space:
Always require the full format
The base format for a prompt is: Array<{ role, content: Array<{ type, data }> }>. Always require people to use it. So e.g. a simple text prompt becomes:
const response = await session.prompt([ { role: "user", content: [{ type: "text", data: "the text" }]} ]);
(NOTE: I am distinguishing here between data, the innermost thing, and content, the outermost thing. Right now our API uses content for both, with no data. And, if there are shorthands, we probably should pick a single name and stick with it, because it's confusing to make web developers switch between content vs. data depending on how much shorthand they use. But for the purposes of this discussion it is good to distinguish.)
Add a single string shorthand
Probably 80% of the use cases are just text prompts. We could support "the text" as a shorthand for
[{ role: "user", content: [{ type: "text", data: "the text" }]}]
and then support zero other shorthands. This might be a reasonable balance between ease of use and simplicity / precision for the nontrivial cases.
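To make the expansion concrete, here is a minimal sketch of what this option would do to its input. The helper name expandStringShorthand is hypothetical and not part of the Prompt API; it only illustrates the rule that a bare string becomes one user message with one text part, and that everything else must already be in the full format.

```javascript
// Hypothetical helper, not part of the Prompt API: expands the single-string
// shorthand into the full format and passes anything else through unchanged.
function expandStringShorthand(input) {
  if (typeof input === "string") {
    return [{ role: "user", content: [{ type: "text", data: input }] }];
  }
  // Non-string input is assumed to already be in the full format.
  return input;
}

const expandedText = expandStringShorthand("the text");
console.log(JSON.stringify(expandedText));
```

Since there is exactly one shorthand, there is exactly one expansion rule, which keeps the nontrivial cases unambiguous.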
OpenAI-esque shorthands
The latest OpenAI responses API supports the equivalents of the following shorthands:
- The single string shorthand
- Array<{ role, content: string }>, where content: aString expands to content: [{ type: "text", data: aString }].
This probably grabs another chunk of the simple use cases, without being too ambiguous. By disallowing multimodal prompts in the shorthand, it avoids the confusing case which this PR discusses.
This shorthand format also seems to be supported by the Python transformers library.
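As a rough sketch of how the two shorthands could normalize to the full format: the function name normalizePrompt and its error messages are my own assumptions for illustration, not proposed API or spec text. The point is that role always stays explicit, and a string content only ever expands to a single text part.

```javascript
// Hypothetical normalizer, for illustration only. It accepts:
//   1. a bare string (the single string shorthand), or
//   2. an array of { role, content }, where content is either a string
//      or an already-complete array of { type, data } parts.
function normalizePrompt(input) {
  if (typeof input === "string") {
    return [{ role: "user", content: [{ type: "text", data: input }] }];
  }
  if (!Array.isArray(input)) {
    throw new TypeError("prompt must be a string or an array of messages");
  }
  return input.map((message) => {
    if (message === null || typeof message !== "object" || typeof message.role !== "string") {
      // No default role: every message must say who is speaking.
      throw new TypeError("each message needs an explicit role");
    }
    const { role, content } = message;
    if (typeof content === "string") {
      return { role, content: [{ type: "text", data: content }] };
    }
    if (!Array.isArray(content)) {
      throw new TypeError("content must be a string or an array of parts");
    }
    return { role, content };
  });
}

const normalized = normalizePrompt([
  { role: "system", content: "Be concise." },
  { role: "user", content: "Hello" }
]);
console.log(JSON.stringify(normalized, null, 2));
```

Under these rules a multimodal message can only be written in the full format, so there is no way to accidentally split it across messages.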
Why I stopped believing in defaults
It's tempting to go further.
- Can't we just make the default role be "user", if none is provided?
- Shouldn't we allow { content: "a string" }, without requiring type: "text"? Surely it's obvious that a string is text, right?
- For cases where we accept an array, shouldn't we just accept a single item and convert that to an array for you?
But I think this puts us back into confusing territory, because now you can almost reproduce the problematic example:
const response = await session.prompt([ { content: "Here is an image: " }, { type: "image", content: imageBytes }, { content: ". Please describe it." } ]);
This sends three separate user-role messages, which is not what the developer intended.
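To see why, here is a sketch of one plausible way an implementation could apply all three defaults at once. The helper expandWithDefaults and the placeholder imageBytes value are hypothetical; a real implementation might resolve the ambiguity differently, which is part of the problem.

```javascript
// Hypothetical lenient expansion: default role "user", default type "text",
// and each top-level item wrapped into its own single-part message.
function expandWithDefaults(items) {
  return items.map((item) => {
    const role = item.role ?? "user";
    const type = item.type ?? "text";
    return { role, content: [{ type, data: item.content }] };
  });
}

const imageBytes = new Uint8Array([0]); // placeholder standing in for real image data

const expanded = expandWithDefaults([
  { content: "Here is an image: " },
  { type: "image", content: imageBytes },
  { content: ". Please describe it." }
]);
console.log(expanded.length); // 3

// What the developer most likely meant: one user message with three parts.
const intended = [
  {
    role: "user",
    content: [
      { type: "text", data: "Here is an image: " },
      { type: "image", data: imageBytes },
      { type: "text", data: ". Please describe it." }
    ]
  }
];
console.log(intended.length); // 1
```

Both inputs look almost identical at the call site, but they produce different conversations.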
So I'm currently leaning toward stopping at OpenAI-esque shorthands. Let me create a second PR to see what people think.