Quantizing a model when GPU memory is too small to load it at full precision

With Hugging Face transformers, passing a BitsAndBytesConfig with load_in_8bit=True loads the weights as int8, roughly halving the fp16 memory footprint:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 8-bit via bitsandbytes.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

path = "your-model-path-or-hub-id"  # placeholder: local directory or Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    device_map="cuda:0",
    trust_remote_code=True,
    quantization_config=quantization_config,
    # max_memory expects a {device: limit} mapping, not a bare integer
    max_memory={0: torch.cuda.get_device_properties(0).total_memory},
).eval()
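
Once loaded, the quantized model is used exactly like a full-precision one. A minimal inference sketch; the prompt and generation settings below are illustrative, not from the original post:

inputs = tokenizer("Briefly explain model quantization.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)  # greedy decoding by default
print(tokenizer.decode(outputs[0], skip_special_tokens=True))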
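
If 8-bit weights still do not fit, BitsAndBytesConfig also supports 4-bit loading. A sketch with commonly used NF4 settings (these parameter choices are typical defaults, assumed here rather than taken from the original):

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.float16,   # dtype used for matmuls during inference
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
# Pass this config to from_pretrained exactly as in the 8-bit example above.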
