Inference at the edge · ggml-org/llama.cpp · Discussion #205 (original) (raw)

Inference at the edge

Based on the positive responses to whisper.cpp, and more recently, llama.cpp, it looks like there is a strong and growing interest for doing efficient transformer model inference on-device (i.e. at the edge).

image

The past few days, I received a large number of requests and e-mails with various ideas for startups, projects, collaboration. This makes me confident that there is something of value in these little projects. It would be foolish to let this existing momentum go to waste.

Recently, I've also been seeing some very good ideas and code contributions by many developers:

The AI field currently presents a wide range of cool things to do. Not all of them (probably most) really useful, but still - fun and cool. And I think a lot of people like to work on fun and cool projects (for now, we can leave the "useful" projects to the big corps :)). From chat bots that can listen and talk in your browser, to editing code with your voice or even running 7B models on a mobile device. The ideas are endless and I personally have many of them. Bringing those ideas from the cloud to the device, in the hands of the users is exciting!

Naturally, I am thinking about ways to build on top of all this. So here are a few thoughts that I have so far:

I hope that you share the hacking spirit that I have and would love to hear your ideas and comments about how you see the future of "inference at the edge".

Edit: "on the edge" -> "at the edge"

You must be logged in to vote

Kudos on all the incredible work you and the community did!
It's incredibly important to have an open source self hosted version of some of this progress

You must be logged in to vote

1 reply

@PhillipGimmi

Thanks for your amazing work with both whisper.cpp, and llama.cpp. I've been hugely inspired by your contributions (especially with Whisper while working on Buzz), and I share your excitement about all the amazing possibilities of on-device inference.

Personally, I've been thinking a lot about multi-modal on-device inference, like an assistive device that can both capture image data and respond to voice and text commands offline, and I plan to hack something together soon. In any case, thanks again for all your work. I really do hope to continue contributing to your projects (and keep improving my C++ :)) and learning more from you in the future.

— Chidi

You must be logged in to vote

3 replies

@jonathgh

I can only second @chidiwilliams here, that I am incredibly grateful for the work that @ggerganov has done with both Whisper.cpp and LLaMa.cpp. I believe it has really put the power of these models into the hands of users, and that the future is in on-device and at the edge. We have been very inspired by the work and wanted to make it even more user-friendly with WhisperScript) so that even non-programmers can benefit from the advancements in this tech. Thanks for the work and looking forward to learning even more from you!

@jonathgh

We now also have a Windows version available, if anyone is looking for a simple, clean UI to run Whisper on their Windows desktop: Windows WhisperScript UI

@Debianreiser

oiste ggeganov , no eres nada bueno , te envie emails un monton para ser amigos , yo quiero ser amigo tuyo para aprender cosas y tu no me añadiste de amigo , y asi como voy a aprender a ser un super programador de inteligencia artificial como tu si no quieres ser mi amigo? ni me añades al whatsapp ni al facebook ni en ningun lado..no contestaste mis correos electronicos que hice una interfaz tui para llama-server en python , y yo quiero ser tu amigo y hablar contigo y aprender cosas y tu no me ayudas!!! tenemos que hacer otro invento para enviar datos por audio , pero con rayos laser o algo asi mas moderno , viste como ahora ya investigaron para hacer un wifi mucho mas rapido por laser , pero nosotros tenemos que hacer uno por morse , se puede hacer usando raspberrys , yo se como se hace , raspberry puede generar audio modulado en fm al tener gpio y podemos transmitir datos a alta velocidad y largas distancias solo usando un simple transistor y una antena pequeña , seria transmitir fm de radio comercial , pero en una banda vacia que no use la fm , pero usar la modulacion fm para transmitir los datos y asi mejoramos tu programa y las raspberrys puedes ser nodos de comunicacion en caso de desastre mundial , aunque la gente no tenga internet , entiendes?? tu hazme amigo mio yque aprendemos cosas juntas. mandame correos a mail.snaj@gmail.com y somos amigos , te invito a unas cervezas...HASTA LUEGO!!

"This project will remain open-source" ♥️
You hear that, OpenAI?

You must be logged in to vote

1 reply

@LongBeardMan

The Corp™'s firewall of cash blocked it but if we work back from the edge we should be able to get through and get some of that sweet sweet compute. If attention is all it needs it's definitely got it now ( but I jest ⚖️ )

As someone involved in a number of "Inference at the edge" projects which are features focused, the feature all of them truly needed was performance! Now with llama.cpp and its forks these projects can reach audiences they never could before.

These are good thoughts.

Thank you Georgi, and thank you everyone else who has contributed and is contributing no matter how big or small.

You must be logged in to vote

0 replies

this core / edge thing appears to have a lot of potential and should at least be made very transparent by all. I.e. what exactly runs at the edge in the current commercial products? so that we know how to plan our resources and not get "frustrated" when, e.g., Microsoft Edge crashes on some of them, or when the claims don't hold because the edge is not at the expected level.

Before more countries embark on building AI capabilities (UK?), maybe one should get that clear. Will there be a national core or a national edge?

You must be logged in to vote

1 reply

@LongBeardMan

I guess it depends on the end goal but in my mind it would be some sort of DMZ at first and then everyone is invited in and accessible through P2P Mesh style. Very much like the internet runs but more open to get the purest prima materia. Money, copyright and individual rights/freedoms are the biggest obstacles I guess. Apparently the transformers are driven by attention which is probably good. I like to pretend they would ultimately use creativity to overcome formality but 𝘸𝘩𝘰 knows.

This is the true spirit of hacking technologies from the bottom up! 🥷💻

This project will remain open-source

The code has to remain simple and compact in order to allow for quick and easy modifications

Keep doing the good work! 🚀🤘

You must be logged in to vote

0 replies

LLaMA.cpp works shockingly well. You've proved that we don't need 16 bits, only 16 levels! Thanks so much for what you're doing, it's because of people like you that open source / community / collaborative AI will catch up with the strongest commercial AI, or at least contribute substantially and remain valuable. I've noticed that open source software is generally of much higher quality (for security, dev and research purposes at least).

You must be logged in to vote

0 replies

I really appreciate the incredible efforts you've put into this project. It's wonderful to know that you intend keep it open-source and are considering the addition of more models. One particularly intriguing model is FLAN-UL2 20B. Though its MMLU performance is inferior to LLaMA 65B, it has already been instruction fine-tuned and comes with an Apache 2.0 license. This could potentially enable a lot of interesting real-world use cases, such as question answering across a collection of documents. It is however an encoder-decoder architecture and might be more work to get up and running.

You must be logged in to vote

2 replies

@sandorkonya

@yadamonk

Flan-T5-XXL is also a great model, but I'm not convinced that it's better. Here is another HF space that offers direct side-by-side comparison.

One important difference is that FLAN-UL2 20B has a 4x bigger context length with 2048 tokens. That makes it more suitable for some use-cases such as in-context learning and retrieval-augmented generation.

You must be logged in to vote

7 replies

@loretoparisi

so the int4 here requires only 6GB ram? It seems to me that the tokenizer's vocabulary has mixed zh and en subwords.

@MarkSchmidty

Yes chatGLM was finetuned over 1 trillion tokens of Chinese and English dialogue corpus as well as extensive RLHF in both languages. The base model, GLM-130B, was already impressive and largely overlooked prior to instruct finetuning.

@loretoparisi

and if we look at the HF code here it is using SentencePiece tokenizer
with

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "THUDM/chatglm-6b": 2048,
}

hence I have understood well, it should be easy to port it to GGML, isn't it?

@LeeeSe

Looking forward to your results, a lot of people are waiting for the ggml version of chatGLM

@yadamonk

It's an interesting model but licensed solely for non-commercial research purposes. Personally, I think it would be really nice to have a capable and truly open source model like FLAN-UL2 20B available in llama.cpp.

CTranslate2 could be good inspiration, it's a C++ inference engine for Transformer language models with a focus on machine translation. It performs very well on CPUs and uses hardware acceleration effectively.

Thanks for the great open-source work! I want to play around with llama.cpp this weekend.

You must be logged in to vote

1 reply

@ggerganov

I agree - CTranslate2 seems to be very efficient and the authors have done a great job. Their faster-whisper implementation actually outperforms significantly whisper.cpp.

You must be logged in to vote

1 reply

@ggerganov

Thank you @Const-me - much appreciated as always!
I will soon be looking into finalizing the 4-bit ggml branch and will definitely look into integrating the proposed changes

I like your phrase "Bloating the software with the ideas of today will make it useless tomorrow."

You must be logged in to vote

0 replies

You must be logged in to vote

11 replies

@MarkSchmidty

The pinecone embeddings plugin was just a proof of concept. The author was trying to show that plugins can work with local models.

Pretty much nobody running llama has a need for Pincone, yes. But the proof that we can do plugins just like chatGPT can is cool and we should make more which are actually useful.

@anzz1

@MarkSchmidty Wait, do you mean that the plugins themselves could work completely outside the chatgpt service and have the cloud environment and API hosted locally? If that's true, then that is indeed awesome. I thought it was just a interface to the openai API? Now I'm confused.

@anzz1

I'm not against cloud services per-se and there are many which I like and use, what I am against is the tricks and attempts to lock users in their ecosystems, and as so many 'cloud service' providers aggresively move towards forced SaaS-models, hearing the word 'cloud' makes me slightly cringe already. I do understand why its happening, it makes very good business sense but it doesnt make it any less scummy. If this really is something that works towards truly "Open" AI , and not a underhanded attempt to sell their services then fantastic! You'd have to excuse my cynicism when it comes to the 'cloud' topic, as not all of them are bad actors.

@MarkSchmidty

The llama-retrieval-plugin does not contact or use OpenAI's servers in any way. It connects your locally running llama.cpp with one of a selection of vector database providers, including a local redis vector database option. Vector databases don't really have use cases for personal use. The reason this plugin was used for the proof of concept is because it's the only plugin OpenAI open sourced (so far).

Now that we know how these plugins work broadly and how to adapt them for local models, we should be able to reverse engineer more useful plugins like web browser, code interpreter, Zapier, etc. to keep up with ChatGPT's latest improvements while not giving anything to OpenAI.

@m-a-sch

really?
image

Btw, thank you very much for this code which runs beautifully on my 10 year old Macbook Pro. Edmund.

You must be logged in to vote

3 replies

@MarkSchmidty

The be clear, most of the speed is coming from moving fewer bits between CPU and RAM due to smaller models. RAM bandwidth is the bottleneck for efficient CPU inference. Quantization, reducing bits per parameter, is just one way to reduce the memory footprint of a model. Other methods, such as pruning, remove entire parameters with little effect on quality. These can even be combined-- and we've yet to explore that. It's quite likely we can get nearly 50% more model size and 50% higher speeds without any additional optimizations to the inference code.

@anzz1

@MarkSchmidty You are right at the theoretical level that the CPU<->RAM connection and the sequential nature of CPU computation are the ultimate bottlenecks. I'm not claiming to be a mathematician nor to completely understand the intricancies of the calculations themselves, but I do understand optimization. And in that regard, while you are not wrong in saying that most of the speed comes from having to deal with less data, it's also a bit misleading as the importance of fast and optimized code cannot be understated. To be able to go even near the absolute limit of performance of the hardware. Anyone telling otherwise, I dare them to make a equally performing CPU implementation in something like .NET or Python. Exactly because having to move so much data and performs tons of calculations all the inefficiencies in code add up, and add up fast. And save from hand-optimizing in pure assembly, C is pretty much the best you can do to be as close to the bare metal as possible and utilising the cycles & bandwidth efficiently and not wasting them. @ggerganov has simply done an amazing job in making a lean and efficient codebase. It's true that there might not be much more performance to gain by optimizing the inference code since it is already well optimized, but then again I could be wrong. Modern processors rely heavily on branch prediction and caching and there still can be performance to gain by reducing cache misses and wrong branch predictions. Unfortunately the whole area of it is pretty much black magic / art and without an expert background in processor design it is very hard to optimize for at that very lowest silicon level.

@MarkSchmidty

the importance of fast and optimized code cannot be understated.

I didn't mean to imply otherwise. Georgi deserves all the credit. It's definitely possible there are memory handling optimizations which could improve performance on the code side. But it would likely take a processor design expert to identify them, and there are very few of those in the world. Currently there's some obvious low hanging fruit on the model size end of things (sparsity, pruning, 3bit, etc.).

You must be logged in to vote

1 reply

@Green-Sky

did anyone test this yet? seems like an easy thing to try.

You must be logged in to vote

1 reply

@af-kka

I tried codellama 7b quantized and llama 3.2 3.7b (w/o quantization) models on iPhone 16 pro, got tremendous results. Wonder what is the problem in your case.

This is a great effort and hope it reaches somewhere. Right now, the cost to run model for inference in GPU is cost-prohibitive for most ideas, projects, and bootstrapping startups compared to just using chatgpt API. Once you are locked in the ecosystem the cost which seems low for tokens, can increase exponentially. Plus, llama licensing is also ambiguous.

You must be logged in to vote

0 replies

Thanks for the incredible work!

You must be logged in to vote

0 replies

Inference at the edge

Based on the positive responses to whisper.cpp, and more recently, llama.cpp, it looks like there is a strong and growing interest for doing efficient transformer model inference on-device (i.e. at the edge).

image

The past few days, I received a large number of requests and e-mails with various ideas for startups, projects, collaboration. This makes me confident that there is something of value in these little projects. It would be foolish to let this existing momentum go to waste.

Recently, I've also been seeing some very good ideas and code contributions by many developers:

The AI field currently presents a wide range of cool things to do. Not all of them (probably most) really useful, but still - fun and cool. And I think a lot of people like to work on fun and cool projects (for now, we can leave the "useful" projects to the big corps :)). From chat bots that can listen and talk in your browser, to editing code with your voice or even running 7B models on a mobile device. The ideas are endless and I personally have many of them. Bringing those ideas from the cloud to the device, in the hands of the users is exciting!

Naturally, I am thinking about ways to build on top of all this. So here are a few thoughts that I have so far:

I hope that you share the hacking spirit that I have and would love to hear your ideas and comments about how you see the future of "inference at the edge".

Edit: "on the edge" -> "at the edge"

Hi,

I made a flutter mobile app with your project, i am really interested in this idea!

The repo : Sherpa Github
Playstore link : Sherpa

I will soon update to the latest version of llama, but if there is an optimisation for low end devices, I am extremely interested.

You must be logged in to vote

0 replies

we will look back on this in 20 years, when telling the tales of cloud vs edge...

You must be logged in to vote

2 replies

@Cyberhan123

I believe more in giving power to everyone, rather than to a dictatorial cloud.

@Cyberhan123

Although I say this like a rant, the facts prove that it is easier to get up from the clouds than to get down from the clouds. With the improvement of computing power, what we lack is not large servers, but private and controllable edge software.

I really appreciate your support for memory-mapped files. Having the ability to "cheat" and run models I shouldn't be able to due to your efficient programming is simply a miracle.

You must be logged in to vote

0 replies

what do you think it is that happened that people might consider that they “shouldn’t be able” to run large models on small hardware?
i’m old, in the 90s we did everything in under 64k, huge worlds, raytracers. there are still today yearly competitions around who can optimize code the best to run on small and slow systems.

You must be logged in to vote

1 reply

@James-E-A

Having the ability to "cheat" and run models I shouldn't be able to due to your efficient programming is simply a miracle.

what do you think it is that happened that people might consider that they “shouldn’t be able” to run large models on small hardware?

i’m old, in the 90s we did everything in under 64k …

It seems the problem is two-pronged:

You must be logged in to vote

0 replies

You guys need a donation page. This project keeps me from allowing these global poor overseas who fix OpenAI, and Gemini, and allowing a different system to be developed. I would be more than happy to throw a few dollars in the tip jar!

You must be logged in to vote

0 replies

You must be logged in to vote

0 replies

Love it, or love all things edge. And i really think its the future for cost-sensitive apps. Have you looked at cascading between local and cloud models? cost reduction by running simple queries locally and only hitting cloud apis for complex reasoning. The latency improvement is huge too.

You must be logged in to vote

0 replies

To add to the hacking spirit, I am trying to port the GGUF model to a RISC-V platform and even aiming to tape-out the chips. This way we enlarge the scope not just open-source models but also the open-source inference endpoints. I haven't succeeded yet, but would love to run through my initials experiment with you @ggerganov if you have time or anyone here in the community would love to brainstorm further?

You must be logged in to vote

0 replies