Add option to move param to device before quantization by gau-nernst · Pull Request #699 · pytorch/ao
Conversation
Fixes #655
Test
```python
import gc
import time

import torch
from torchao.models.llama.model import Transformer, ModelArgs
from torchao.quantization.quant_api import quantize_, int8_weight_only


def get_model():
    with torch.device("meta"):
        model = Transformer(ModelArgs())
    model.to_empty(device="cpu").bfloat16()
    return model


def clear_cuda():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    assert torch.cuda.max_memory_reserved() == 0


model = get_model()
time0 = time.perf_counter()
quantize_(model, int8_weight_only())
model.to("cuda")
torch.cuda.synchronize()
duration = time.perf_counter() - time0
print("Quantize on CPU:")
print(f" - Time taken: {duration:.2f} s")
print(f" - Peak memory: {torch.cuda.max_memory_reserved() / 1e9:.2f} GiB")
del model
clear_cuda()

model = get_model()
time0 = time.perf_counter()
model.to("cuda")
quantize_(model, int8_weight_only())
torch.cuda.synchronize()
duration = time.perf_counter() - time0
print("Quantize on CUDA:")
print(f" - Time taken: {duration:.2f} s")
print(f" - Peak memory: {torch.cuda.max_memory_reserved() / 1e9:.2f} GiB")
del model
clear_cuda()

model = get_model()
time0 = time.perf_counter()
quantize_(model, int8_weight_only(), device="cuda")
torch.cuda.synchronize()
duration = time.perf_counter() - time0
print("Move to CUDA and quantize each param individually:")
print(f" - Time taken: {duration:.2f} s")
print(f" - Peak memory: {torch.cuda.max_memory_reserved() / 1e9:.2f} GiB")
del model
clear_cuda()
```
```
Quantize on CPU:
 - Time taken: 10.48 s
 - Peak memory: 6.99 GiB
Quantize on CUDA:
 - Time taken: 1.96 s
 - Peak memory: 14.50 GiB
Move to CUDA and quantize each param individually:
 - Time taken: 1.94 s
 - Peak memory: 8.29 GiB
```
@jerryzh168 Do you have any suggestions on what tests I should add for this?
This PR may also help with the slow NF4 quantization issue observed in torchtune (#642), though I'm not sure what the NF4 quantization API is. Does it use `quantize_()`?
✅ No Failures as of commit 711d001 with merge base b523f9f.
nf4 is not yet using the `quantize_` API; it is module-swapping linear with a LoraLinear. Maybe we could incorporate this in `quantize_` as well if there is a weight-only quantization use case for nf4. cc @drisspg
The change makes sense. One thing I'm wondering: would it help further reduce CUDA memory usage if you also move the module back to CPU (the original device) after quantization and release the used CUDA memory? Not sure if it would be slow, though.
For tests, I think you'll probably need to construct a model with linear + non-linear modules, compare the max CUDA memory for "Quantize on CUDA" and "Move to CUDA and quantize each param individually", and make sure the second one consumes less memory. You can add it in https://github.com/pytorch/ao/blob/main/test/quantization/test_quant_api.py, I think.
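A rough sketch of such a test, assuming a hypothetical `ToyModel` fixture and the `device` argument added by this PR (model sizes would need tuning so the caching allocator's rounding doesn't hide the difference):

```python
import gc
import unittest

import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize_, int8_weight_only


class ToyModel(nn.Module):
    """Hypothetical fixture: a mix of linear and non-linear modules."""

    def __init__(self, dim: int = 4096):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim, bias=False)
        self.act = nn.ReLU()
        self.linear2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.linear2(self.act(self.linear1(x)))


class TestQuantizeDeviceArg(unittest.TestCase):
    @unittest.skipIf(not torch.cuda.is_available(), "CUDA not available")
    def test_device_arg_uses_less_memory(self):
        # Baseline: move the whole bf16 model to CUDA, then quantize.
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        model = ToyModel().bfloat16().to("cuda")
        quantize_(model, int8_weight_only())
        torch.cuda.synchronize()
        peak_whole_model = torch.cuda.max_memory_reserved()
        del model
        gc.collect()

        # New path from this PR: quantize_ moves one param at a time to CUDA.
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        model = ToyModel().bfloat16()
        quantize_(model, int8_weight_only(), device="cuda")
        torch.cuda.synchronize()
        peak_per_param = torch.cuda.max_memory_reserved()
        del model
        gc.collect()

        # Quantizing param-by-param never materializes the full bf16 model
        # on the GPU, so its peak memory should be lower.
        self.assertLess(peak_per_param, peak_whole_model)


if __name__ == "__main__":
    unittest.main()
```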
> nf4 is not yet using the `quantize_` API; it is module-swapping linear with a LoraLinear.
I see. So users of NF4 need to implement this logic themselves, i.e. torchtune will have to implement this instead of torchao.
> One thing I'm wondering: would it help further reduce CUDA memory usage if you also move the module back to CPU (the original device) after quantization and release the used CUDA memory? Not sure if it would be slow, though.
I can try this, but in the end I think we should stick to one approach to avoid confusing users.
> I see. So users of NF4 need to implement this logic themselves, i.e. torchtune will have to implement this instead of torchao.
Yeah, according to Driss, this LoraLinear code is specific to the framework that uses nf4 (e.g. torchtune, huggingface, vllm, etc.), so we don't want to offer an API for people.
> For tests, I think you'll probably need to construct a model with linear + non-linear modules, compare the max CUDA memory for "Quantize on CUDA" and "Move to CUDA and quantize each param individually", and make sure the second one consumes less memory. You can add it in main/test/quantization/test_quant_api.py, I think.
Agreed, this is the right test.
> The change makes sense. One thing I'm wondering: would it help further reduce CUDA memory usage if you also move the module back to CPU (the original device) after quantization and release the used CUDA memory? Not sure if it would be slow, though.
This scenario seems to optimize for people who have a GPU but want the model to run on the CPU, which seems rarer than the alternative of someone who wants to fit a model that does not fit on their GPU in full precision.
Otherwise, yeah, once CI is green we can merge this.
> This scenario seems to optimize for people who have a GPU but want the model to run on the CPU, which seems rarer than the alternative of someone who wants to fit a model that does not fit on their GPU in full precision.
The difference in peak memory between this PR and quantizing on CPU (8.29 GiB vs 6.99 GiB) is probably due to the largest `nn.Linear` layer (the LM head). I think what @jerryzh168 means is that we can further eliminate this difference by moving each quantized weight back to CPU one by one, so the peak VRAM after quantization will be just the largest `nn.Linear` layer. When we then move the whole CPU-quantized model back to CUDA, the peak memory should be the same as quantizing on CPU (6.99 GiB), if memory fragmentation doesn't haunt us. But yeah, it's a bit more convoluted.
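For illustration only (this is not the code in this PR), the per-param round trip could look roughly like the sketch below; `fake_int8` is a stand-in for the real int8 weight-only quantization and the forward pass is left out:

```python
import torch
import torch.nn as nn


def fake_int8(w: torch.Tensor):
    # Stand-in for real int8 weight-only quantization: per-row absmax scaling.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / 127.0
    return (w / scale).round().clamp(-128, 127).to(torch.int8), scale


@torch.no_grad()
def quantize_roundtrip(model: nn.Module, device: str = "cuda") -> nn.Module:
    # "Approach 2": move one weight at a time to the GPU, quantize there
    # (fast), and immediately move the result back to the CPU. At any moment
    # the GPU holds at most one full-precision weight, so the peak after
    # quantization is bounded by the largest nn.Linear (the LM head).
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data.to(device)
            qw, scale = fake_int8(w)
            # A real implementation would swap in a quantized tensor subclass
            # so forward() still works; here we only track the buffers.
            del module.weight
            module.register_buffer("int8_weight", qw.cpu())
            module.register_buffer("scale", scale.cpu())
            del w
    # Finally move the whole (now much smaller) model to the GPU in one go.
    return model.to(device)
```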
Using Llama2-7B
Baseline CPU: Quantize on CPU, then move the whole model to CUDA
- Time taken: 10.48 s
- Peak memory: 6.99 GiB
Baseline CUDA: Move the whole model to CUDA, then quantize on CUDA
- Time taken: 1.96 s
- Peak memory: 14.50 GiB
Approach 1: For each param, move to CUDA, quantize on CUDA
- Time taken: 1.94 s
- Peak memory: 8.29 GiB
Approach 2: For each param, move to CUDA, quantize on CUDA, move back to CPU. Then move the whole model to CUDA
- Time taken: 5.58 s
- Peak memory: 7.03 GiB
There are tradeoffs between Approaches 1 and 2. Let me know which one you prefer. I think we should only keep one approach to avoid confusion.
Another thing we could consider is measuring the size of each weight to decide what to do with it, e.g. if the weight exceeds some size xx (which hopefully only the LM head does), we always quantize it on CPU to avoid the memory spike. Again, this adds seemingly unnecessary complexity.
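A sketch of that heuristic, with a purely made-up cutoff and a hypothetical `pick_quant_device` helper:

```python
import torch

# Hypothetical cutoff; anything larger (e.g. the LM head) is quantized on CPU.
CUDA_QUANT_THRESHOLD_BYTES = 512 * 1024 * 1024


def pick_quant_device(weight: torch.Tensor, target: str = "cuda") -> str:
    # Quantize small weights on the GPU for speed, but keep very large ones
    # on the CPU to avoid the peak-memory spike from their bf16 copy.
    nbytes = weight.numel() * weight.element_size()
    return "cpu" if nbytes > CUDA_QUANT_THRESHOLD_BYTES else target
```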
I like Approach 1 better because it's a better tradeoff between the options we already have.
Thanks for benchmarking the different approaches @gau-nernst. Yeah, Approach 1 sounds good to me as well; it is simpler and easier to explain in the API.