Inference at the edge · ggml-org/llama.cpp · Discussion #205

Inference at the edge

Based on the positive responses to whisper.cpp, and more recently, llama.cpp, it looks like there is a strong and growing interest for doing efficient transformer model inference on-device (i.e. at the edge).

The past few days, I received a large number of requests and e-mails with various ideas for startups, projects, collaboration. This makes me confident that there is something of value in these little projects. It would be foolish to let this existing momentum go to waste.

Recently, I've also been seeing some very good ideas and code contributions by many developers:

@ameenba suggested a way to speed up Whisper Encoder evaluation at the cost of accuracy (#137). Ultimately, this made it possible to demonstrate semi-efficient short voice command recognition on a Raspberry Pi 4 (Twitter)
@wangchou et al. demonstrated how to evaluate the Whisper Encoder efficiently using the Apple Neural Engine (#548)
A person on Twitter showed me how to improve llama.cpp efficiency by 10% with a simple SIMD change (Twitter)
@Const-me kindly provided efficient AVX2 quantization routines (#27)
and many other examples

The AI field currently presents a wide range of cool things to do. Not all of them are really useful (probably most are not), but still - fun and cool. And I think a lot of people like to work on fun and cool projects (for now, we can leave the "useful" projects to the big corps :)). These range from chat bots that can listen and talk in your browser, to editing code with your voice, or even running 7B models on a mobile device. The ideas are endless and I personally have many of them. Bringing those ideas from the cloud to the device, into the hands of the users, is exciting!

Naturally, I am thinking about ways to build on top of all this. So here are a few thoughts that I have so far:

This project will remain open-source
I would like to explore the application of this approach to other models and build more examples to demonstrate it
I would be really happy to see developers join in and help advance further the idea of "inference at the edge"
The strongest points of the current codebase are its simplicity and efficiency. Performance is essential
It's too early to build a full-fledged edge inference framework. The code has to remain simple and compact in order to allow for quick and easy modifications. This helps to explore ideas at a much higher rate. Bloating the software with the ideas of today will make it useless tomorrow
The AI models are improving at a very high rate and it is important to stay on top of it. The transformer architecture at its core is very simple. There is no need to "slap" complex things on top of it
Hacking small tools and examples is a great way to drive innovation. We should not get lost in software engineering problems. Especially at the beginning, the goal is to prototype and not waste time polishing products
And most of all, it's important to have fun in the process!

I hope that you share the hacking spirit that I have, and I would love to hear your ideas and comments about how you see the future of "inference at the edge".

Edit: "on the edge" -> "at the edge"

Kudos on all the incredible work you and the community did!
It's incredibly important to have an open-source, self-hosted version of some of this progress.

Thanks for your amazing work with both whisper.cpp, and llama.cpp. I've been hugely inspired by your contributions (especially with Whisper while working on Buzz), and I share your excitement about all the amazing possibilities of on-device inference.

Personally, I've been thinking a lot about multi-modal on-device inference, like an assistive device that can both capture image data and respond to voice and text commands offline, and I plan to hack something together soon. In any case, thanks again for all your work. I really do hope to continue contributing to your projects (and keep improving my C++ :)) and learning more from you in the future.

— Chidi

I can only second @chidiwilliams here: I am incredibly grateful for the work that @ggerganov has done with both Whisper.cpp and LLaMa.cpp. I believe it has really put the power of these models into the hands of users, and that the future is on-device and at the edge. We have been very inspired by the work and wanted to make it even more user-friendly with WhisperScript, so that even non-programmers can benefit from the advancements in this tech. Thanks for the work and looking forward to learning even more from you!

We now also have a Windows version available, if anyone is looking for a simple, clean UI to run Whisper on their Windows desktop: Windows WhisperScript UI

Hey ggerganov, you're not being nice. I sent you a bunch of emails to be friends. I want to be your friend so I can learn things, and you didn't add me as a friend, so how am I going to learn to be a super AI programmer like you if you don't want to be my friend? You don't add me on WhatsApp or Facebook or anywhere. You didn't answer my emails saying that I made a TUI interface for llama-server in Python, and I want to be your friend and talk with you and learn things, and you don't help me!!! We have to make another invention for sending data over audio, but with laser beams or something more modern like that. Did you see how they have now done research on a much faster Wi-Fi over laser? But we have to make one using Morse. It can be done with Raspberry Pis, I know how it's done: a Raspberry Pi can generate FM-modulated audio because it has GPIO, and we can transmit data at high speed and over long distances using just a simple transistor and a small antenna. It would be transmitting like commercial FM radio, but in an empty band that FM doesn't use, while using FM modulation to transmit the data. That way we improve your program, and the Raspberry Pis can be communication nodes in case of a worldwide disaster, even if people don't have internet, you see?? Make me your friend and we'll learn things together. Send me emails at mail.snaj@gmail.com and we're friends, I'll buy you some beers... SEE YOU LATER!!

"This project will remain open-source" ♥️
You hear that, OpenAI?

The Corp™'s firewall of cash blocked it but if we work back from the edge we should be able to get through and get some of that sweet sweet compute. If attention is all it needs it's definitely got it now ( but I jest ⚖️ )

As someone involved in a number of "inference at the edge" projects which are feature-focused, the feature all of them truly needed was performance! Now with llama.cpp and its forks these projects can reach audiences they never could before.

These are good thoughts.

Thank you Georgi, and thank you everyone else who has contributed and is contributing no matter how big or small.

this core / edge thing appears to have a lot of potential and should at least be made very transparent by all. I.e. what exactly runs at the edge in the current commercial products? so that we know how to plan our resources and not get "frustrated" when, e.g., Microsoft Edge crashes on some of them, or when the claims don't hold because the edge is not at the expected level.

Before more countries embark on building AI capabilities (UK?), maybe one should get that clear. Will there be a national core or a national edge?

I guess it depends on the end goal but in my mind it would be some sort of DMZ at first and then everyone is invited in and accessible through P2P Mesh style. Very much like the internet runs but more open to get the purest prima materia. Money, copyright and individual rights/freedoms are the biggest obstacles I guess. Apparently the transformers are driven by attention which is probably good. I like to pretend they would ultimately use creativity to overcome formality but 𝘸𝘩𝘰 knows.

This is the true spirit of hacking technologies from the bottom up! 🥷💻

This project will remain open-source

The code has to remain simple and compact in order to allow for quick and easy modifications

Keep doing the good work! 🚀🤘

LLaMA.cpp works shockingly well. You've proved that we don't need 16 bits, only 16 levels! Thanks so much for what you're doing, it's because of people like you that open source / community / collaborative AI will catch up with the strongest commercial AI, or at least contribute substantially and remain valuable. I've noticed that open source software is generally of much higher quality (for security, dev and research purposes at least).
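To make the "16 levels" point concrete, here is a rough sketch of block-wise 4-bit quantization in plain Python. It is only an illustration under assumptions of my own: the block size of 32, the symmetric range [-8, 7] and the function names are picked for the example and are not claimed to match the exact ggml storage format.

```python
# Sketch: every block of weights shares one float scale, and each weight is
# stored as one of 16 integer levels (which fits in 4 bits).
# Assumption: block size and rounding scheme are chosen for illustration only.
import math

BLOCK_SIZE = 32  # hypothetical block length for this sketch


def quantize_block(weights):
    """Map floats to integers in [-8, 7] (16 levels) plus one shared scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    levels = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, levels


def dequantize_block(scale, levels):
    """Recover approximate floats from the scale and the 4-bit levels."""
    return [scale * q for q in levels]


def quantize(weights, block_size=BLOCK_SIZE):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    out = []
    for scale, levels in blocks:
        out.extend(dequantize_block(scale, levels))
    return out


if __name__ == "__main__":
    weights = [math.sin(0.37 * i) for i in range(64)]
    blocks = quantize(weights)
    restored = dequantize(blocks)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"blocks: {len(blocks)}, worst absolute error: {worst:.4f}")
```

The rounding error per weight is at most half a scale step, which is why a per-block scale matters: a block of small weights gets a small step.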

I really appreciate the incredible efforts you've put into this project. It's wonderful to know that you intend to keep it open-source and are considering the addition of more models. One particularly intriguing model is FLAN-UL2 20B. Though its MMLU performance is inferior to LLaMA 65B, it has already been instruction fine-tuned and comes with an Apache 2.0 license. This could potentially enable a lot of interesting real-world use cases, such as question answering across a collection of documents. It is, however, an encoder-decoder architecture and might be more work to get up and running.

Flan-T5-XXL is also a great model, but I'm not convinced that it's better. Here is another HF space that offers direct side-by-side comparison.

One important difference is that FLAN-UL2 20B has a 4x bigger context length with 2048 tokens. That makes it more suitable for some use-cases such as in-context learning and retrieval-augmented generation.

so the int4 here requires only 6GB ram? It seems to me that the tokenizer's vocabulary has mixed zh and en subwords.

Yes chatGLM was finetuned over 1 trillion tokens of Chinese and English dialogue corpus as well as extensive RLHF in both languages. The base model, GLM-130B, was already impressive and largely overlooked prior to instruct finetuning.

And if we look at the HF code here, it uses a SentencePiece tokenizer, with

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "THUDM/chatglm-6b": 2048,
}

So if I have understood correctly, it should be easy to port to GGML, shouldn't it?

Looking forward to your results, a lot of people are waiting for the ggml version of chatGLM

It's an interesting model but licensed solely for non-commercial research purposes. Personally, I think it would be really nice to have a capable and truly open source model like FLAN-UL2 20B available in llama.cpp.

CTranslate2 could be good inspiration, it's a C++ inference engine for Transformer language models with a focus on machine translation. It performs very well on CPUs and uses hardware acceleration effectively.

Thanks for the great open-source work! I want to play around with llama.cpp this weekend.

I agree - CTranslate2 seems to be very efficient and the authors have done a great job. Their faster-whisper implementation actually significantly outperforms whisper.cpp.

Thank you @Const-me - much appreciated as always!
I will soon be looking into finalizing the 4-bit ggml branch and will definitely look into integrating the proposed changes

I like your phrase "Bloating the software with the ideas of today will make it useless tomorrow."

The pinecone embeddings plugin was just a proof of concept. The author was trying to show that plugins can work with local models.

Pretty much nobody running llama has a need for Pinecone, yes. But the proof that we can do plugins just like ChatGPT can is cool, and we should make more which are actually useful.

@MarkSchmidty Wait, do you mean that the plugins themselves could work completely outside the ChatGPT service and have the cloud environment and API hosted locally? If that's true, then that is indeed awesome. I thought it was just an interface to the OpenAI API? Now I'm confused.

I'm not against cloud services per se, and there are many which I like and use. What I am against is the tricks and attempts to lock users into their ecosystems, and as so many 'cloud service' providers aggressively move towards forced SaaS models, hearing the word 'cloud' already makes me cringe slightly. I do understand why it's happening: it makes very good business sense, but that doesn't make it any less scummy. If this really is something that works towards truly "Open" AI, and not an underhanded attempt to sell their services, then fantastic! You'll have to excuse my cynicism when it comes to the 'cloud' topic, as not all of them are bad actors.

The llama-retrieval-plugin does not contact or use OpenAI's servers in any way. It connects your locally running llama.cpp with one of a selection of vector database providers, including a local redis vector database option. Vector databases don't really have use cases for personal use. The reason this plugin was used for the proof of concept is because it's the only plugin OpenAI open sourced (so far).
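For anyone wondering what a vector database lookup boils down to, here is a toy in-memory version. It is a sketch only: `TinyVectorStore`, its methods and the hand-written 3-dimensional "embeddings" are all made up for the example, and it does not reflect the API of the llama-retrieval-plugin or of any real vector database.

```python
# Sketch: store (text, embedding) pairs and return the texts whose embeddings
# are closest to a query embedding by cosine similarity.
# Assumption: embeddings come from elsewhere (e.g. a local model); the ones
# below are hand-written toy vectors.
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical brute-force store; real databases use approximate indexes."""

    def __init__(self):
        self._items = []  # list of (text, embedding)

    def add(self, text, embedding):
        self._items.append((text, list(embedding)))

    def search(self, query_embedding, top_k=3):
        """Return up to top_k (score, text) pairs, best match first."""
        scored = [
            (cosine_similarity(query_embedding, emb), text)
            for text, emb in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = TinyVectorStore()
    store.add("whisper.cpp transcribes audio", [1.0, 0.0, 0.0])
    store.add("llama.cpp generates text", [0.0, 1.0, 0.0])
    store.add("ggml is a tensor library", [0.0, 0.6, 0.8])
    for score, text in store.search([0.1, 0.9, 0.1], top_k=2):
        print(f"{score:.2f}  {text}")
```

The retrieved texts would then be pasted into the prompt of the local model, which is the retrieval part of the plugin idea.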

Now that we know how these plugins work broadly and how to adapt them for local models, we should be able to reverse engineer more useful plugins like web browser, code interpreter, Zapier, etc. to keep up with ChatGPT's latest improvements while not giving anything to OpenAI.

really?

If this software is much faster at inference than other methods because of quantization, we can expect people who train models to train them directly as quantized or easily quantizable ones, in order to be able to move them to the edge (or to cheaper data-centre inference hardware).
Therefore putting too many special-case optimisations in the main body of code here is a bad idea, because we want model designers to have a clear common target for inference.

Btw, thank you very much for this code which runs beautifully on my 10 year old Macbook Pro. Edmund.

To be clear, most of the speed is coming from moving fewer bits between CPU and RAM due to smaller models. RAM bandwidth is the bottleneck for efficient CPU inference. Quantization, reducing bits per parameter, is just one way to reduce the memory footprint of a model. Other methods, such as pruning, remove entire parameters with little effect on quality. These can even be combined, and we've yet to explore that. It's quite likely we can get nearly 50% more model size and 50% higher speeds without any additional optimizations to the inference code.
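A back-of-the-envelope version of the bandwidth argument, as a sketch. It assumes every weight is read from RAM once per generated token and ignores everything else (per-block scales, the KV cache, compute time), so it gives an upper bound rather than a prediction. The 50 GB/s bandwidth figure is a made-up round number, not a measurement.

```python
# Sketch: if generation is limited by RAM bandwidth, the token rate is at most
# bandwidth divided by the bytes that must be streamed per token.
# Assumption: the whole model is read once per token; nothing else counts.


def model_bytes(n_params, bits_per_weight):
    """Approximate size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on tokens per second under the bandwidth-only model."""
    return bandwidth_bytes_per_s / model_bytes(n_params, bits_per_weight)


if __name__ == "__main__":
    n_params = 7e9     # a 7B-parameter model
    bandwidth = 50e9   # hypothetical 50 GB/s of RAM bandwidth
    for bits in (16, 8, 4):
        size_gb = model_bytes(n_params, bits) / 1e9
        rate = max_tokens_per_second(n_params, bits, bandwidth)
        print(f"{bits:>2} bits: {size_gb:5.1f} GB, at most {rate:5.1f} tokens/s")
```

Under this model, going from 16 to 4 bits per weight shrinks the weights by 4x and raises the bound on token rate by the same factor, whatever the actual bandwidth is.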

@MarkSchmidty You are right at the theoretical level that the CPU<->RAM connection and the sequential nature of CPU computation are the ultimate bottlenecks. I'm not claiming to be a mathematician nor to completely understand the intricacies of the calculations themselves, but I do understand optimization. And in that regard, while you are not wrong in saying that most of the speed comes from having to deal with less data, it's also a bit misleading, as the importance of fast and optimized code cannot be overstated if you want to get anywhere near the absolute performance limit of the hardware. Anyone who says otherwise, I dare them to make an equally performant CPU implementation in something like .NET or Python. Precisely because so much data has to be moved and so many calculations performed, all the inefficiencies in the code add up, and add up fast. And short of hand-optimizing in pure assembly, C is pretty much the best you can do to stay as close to the bare metal as possible, using the cycles and bandwidth efficiently instead of wasting them. @ggerganov has simply done an amazing job in making a lean and efficient codebase. It's true that there might not be much more performance to gain by optimizing the inference code since it is already well optimized, but then again I could be wrong. Modern processors rely heavily on branch prediction and caching, and there can still be performance to gain by reducing cache misses and branch mispredictions. Unfortunately that whole area is pretty much black magic / art, and without an expert background in processor design it is very hard to optimize at that lowest silicon level.

the importance of fast and optimized code cannot be overstated.

I didn't mean to imply otherwise. Georgi deserves all the credit. It's definitely possible there are memory handling optimizations which could improve performance on the code side. But it would likely take a processor design expert to identify them, and there are very few of those in the world. Currently there's some obvious low hanging fruit on the model size end of things (sparsity, pruning, 3bit, etc.).

did anyone test this yet? seems like an easy thing to try.

I tried codellama 7b quantized and llama 3.2 3.7b (w/o quantization) models on iPhone 16 pro, got tremendous results. Wonder what is the problem in your case.

This is a great effort and I hope it gets somewhere. Right now, running a model for inference on a GPU is cost-prohibitive for most ideas, projects, and bootstrapped startups compared to just using the ChatGPT API. Once you are locked into that ecosystem, the per-token cost, which seems low, can increase exponentially. Plus, LLaMA licensing is also ambiguous.

Thanks for the incredible work!

Hi,

I made a Flutter mobile app with your project, and I am really interested in this idea!

The repo : Sherpa Github
Playstore link : Sherpa

I will soon update to the latest version of llama, but if there is an optimisation for low end devices, I am extremely interested.

we will look back on this in 20 years, when telling the tales of cloud vs edge...

I believe more in giving power to everyone, rather than to a dictatorial cloud.

Although I say this like a rant, the facts prove that it is easier to get up from the clouds than to get down from the clouds. With the improvement of computing power, what we lack is not large servers, but private and controllable edge software.

I really appreciate your support for memory-mapped files. Having the ability to "cheat" and run models I shouldn't be able to due to your efficient programming is simply a miracle.
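For readers who have not used memory-mapped files, here is a small sketch of the general idea using Python's standard `mmap` module. The file layout (a flat array of little-endian float32 values) and the helper names are invented for the example; this is not how llama.cpp lays out or loads its model files.

```python
# Sketch: map a file into memory and read a small slice of it. The OS loads
# pages on demand, so touching a few values does not require reading the
# whole file into RAM first.
# Assumption: the "weights" file is just consecutive little-endian float32s.
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32


def write_dummy_weights(path, n_floats):
    """Write n_floats consecutive float32 values 0.0, 1.0, 2.0, ..."""
    with open(path, "wb") as f:
        for i in range(n_floats):
            f.write(struct.pack("<f", float(i)))


def read_weights(path, start, count):
    """Read `count` floats beginning at index `start` through a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[start * FLOAT_SIZE:(start + count) * FLOAT_SIZE]
    return list(struct.unpack(f"<{count}f", raw))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_dummy_weights(path, 100_000)
    print(read_weights(path, 99_990, 3))
    os.remove(path)
```

Because the mapping is read-only and backed by the file, the OS can also drop those pages under memory pressure and reload them later, which is what makes running a model larger than free RAM possible at all, if slowly.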

what do you think it is that happened that people might consider that they “shouldn’t be able” to run large models on small hardware?
i’m old, in the 90s we did everything in under 64k, huge worlds, raytracers. there are still today yearly competitions around who can optimize code the best to run on small and slow systems.

Having the ability to "cheat" and run models I shouldn't be able to due to your efficient programming is simply a miracle.

what do you think it is that happened that people might consider that they “shouldn’t be able” to run large models on small hardware?

i’m old, in the 90s we did everything in under 64k …

It seems the problem is two-pronged:

From the individual end, the “Conspicuous Consumption” syndrome causes people to consider their upper-middle-end builds as a baseline, rather than considering the baseline as a baseline. When these people go on to be software developers, we get today's loss of accessibility.
From the corporate end, it's ironically a lot more comfortable if software requires a lot of resources. If you "have" to drop $10,000 on hardware to run a piece of software, then suddenly paying $700/mo for a corporate support plan becomes more reasonable by comparison. Or you could just pay Amazon $700/mo for their specially licensed GPU cluster, which then looks like an amazing deal! Creating the synthetic need for excessive hardware raises opportunity costs, which is profitable.

You guys need a donation page. This project keeps me from allowing these global poor overseas who fix OpenAI, and Gemini, and allowing a different system to be developed. I would be more than happy to throw a few dollars in the tip jar!

Love it, or rather, love all things edge. And I really think it's the future for cost-sensitive apps. Have you looked at cascading between local and cloud models? You get cost reduction by running simple queries locally and only hitting cloud APIs for complex reasoning. The latency improvement is huge too.
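A minimal sketch of what such a cascade could look like. Everything here is hypothetical: the two "models" are stub functions, and the word-count rule for deciding what counts as a simple query is a placeholder for whatever heuristic or classifier one would actually use.

```python
# Sketch: route each prompt to a local model when it looks simple and to a
# cloud model otherwise.
# Assumption: local_model, cloud_model and is_simple are all stand-ins.


def make_cascade(local_model, cloud_model, is_simple):
    """Build a function that answers a prompt with the cheaper model first."""

    def answer(prompt):
        if is_simple(prompt):
            return "local", local_model(prompt)
        return "cloud", cloud_model(prompt)

    return answer


def short_prompt(prompt, max_words=8):
    """Placeholder heuristic: treat short prompts as simple."""
    return len(prompt.split()) <= max_words


if __name__ == "__main__":
    # Stub models that only label their output, standing in for real backends.
    answer = make_cascade(
        local_model=lambda p: f"[local answer to: {p}]",
        cloud_model=lambda p: f"[cloud answer to: {p}]",
        is_simple=short_prompt,
    )
    for prompt in (
        "What time zone is Sofia in?",
        "Compare three strategies for sharding a key-value store and "
        "explain the failure modes of each under network partitions.",
    ):
        route, text = answer(prompt)
        print(route, "->", text)
```

A variation on the same idea is to always try the local model first and escalate only when its answer fails some confidence check, at the cost of extra latency on the escalated queries.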

To add to the hacking spirit, I am trying to port the GGUF model to a RISC-V platform and am even aiming to tape out the chips. This way we enlarge the scope to cover not just open-source models but also open-source inference endpoints. I haven't succeeded yet, but I would love to run through my initial experiments with you, @ggerganov, if you have time, or with anyone here in the community who would like to brainstorm further.
