[Feature]: Support HF-style chat template for multi-modal data in offline chat

🚀 The feature, motivation and pitch

Currently, we expect image_url, audio_url, etc. to be inside the messages that are passed to the chat template. We would like to extend this to also support image, audio, etc. inputs, just like in HuggingFace Transformers:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

To avoid having to pass multi-modal inputs separately, we propose the following extension:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

This lets us pass multi-modal data such as PIL images to LLM.chat directly without having to encode them into base64 URLs.
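
To make the relationship between the two formats concrete, here is a rough sketch of how the extended messages could be normalized back into HF-style messages plus separately collected multi-modal data. This is only an illustration, not how vLLM implements or would implement it: the helper name split_multimodal_content, the MODALITIES tuple, and the shape of the returned mm_data dict are all assumptions made up for this example, and a plain object() stands in for a PIL image so the snippet has no dependencies.

```python
# Hypothetical sketch: not part of vLLM or Transformers.
# Assumed modality types; the issue itself only names "image" and "audio".
MODALITIES = ("image", "audio")


def split_multimodal_content(messages):
    """Split extended messages into HF-style messages and collected data.

    A part such as {"type": "image", "image": <data>} becomes the bare
    placeholder {"type": "image"}, and <data> is appended to mm_data["image"].
    """
    hf_messages = []
    mm_data = {}
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain string content has no multi-modal parts to extract.
            hf_messages.append({"role": message["role"], "content": content})
            continue
        new_parts = []
        for part in content:
            part_type = part.get("type")
            if part_type in MODALITIES and part_type in part:
                # Collect the inline data and leave a placeholder behind.
                mm_data.setdefault(part_type, []).append(part[part_type])
                new_parts.append({"type": part_type})
            else:
                # Text parts and bare placeholders pass through unchanged.
                new_parts.append(dict(part))
        hf_messages.append({"role": message["role"], "content": new_parts})
    return hf_messages, mm_data


image = object()  # stand-in for a PIL image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]
hf_messages, mm_data = split_multimodal_content(messages)
```

With this sketch, hf_messages has the same shape as the HuggingFace Transformers example above, while mm_data keeps the original objects in the order they appeared in the conversation.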

Alternatives

No response

Additional context

cc @ywang96 @Isotr0py @hmellor
