Taking Ubuntu as the example.
Hugging Face homepage: https://huggingface.co/
Register an account with an email address.
Using Hugging Face is similar to using GitHub, but GitHub limits the size of individual files (files over 50 MB trigger a warning, and files over 100 MB are rejected), while a single model checkpoint is often hundreds of MB, so Git LFS is required:
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
(base) workspace:~$ huggingface-cli login
_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
Token:
Login successful
Your token has been saved to /home/t-enshengshi/.huggingface/token
To make creating repos easier later on, generate a token with write permission rather than read-only permission.
huggingface-cli repo create model_name
git clone https://huggingface.co/username/model_name
Uploading then works just like on GitHub:
git add .
git commit -m "first commit"
git push
The files to upload consist of two parts: the model and the tokenizer.
model.save_pretrained("./saved_pre_model")
(Note that save_pretrained expects a directory, not a file name.) This produces pytorch_model.bin and config.json.
tokenizer.save_pretrained("./saved_pre_model")
This produces added_tokens.json, merges.txt, special_tokens_map.json, tokenizer_config.json, and vocab.json.
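Of these files, vocab.json holds the token-to-id mapping and merges.txt the BPE merge rules. A minimal sketch of their structure, using a hand-written toy vocabulary (the tokens, ids, and merge rules here are invented for illustration; the real files are written by save_pretrained):

```python
import json
import os
import tempfile

# Toy stand-ins for two of the files save_pretrained() writes;
# the contents are invented for illustration.
vocab = {"<s>": 0, "</s>": 2, "def": 3, "return": 4}   # vocab.json: token -> id
merges = ["#version: 0.2", "r e", "re turn"]           # merges.txt: BPE merge rules, one per line

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "vocab.json"), "w") as f:
    json.dump(vocab, f)
with open(os.path.join(tmp, "merges.txt"), "w") as f:
    f.write("\n".join(merges))

# Reading vocab.json back gives the mapping a BPE tokenizer is built from.
with open(os.path.join(tmp, "vocab.json")) as f:
    loaded = json.load(f)
print(loaded["return"])  # -> 4
```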
I used RoBERTa, so the example below is for RoBERTa.
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("Ensheng/coco")
model = RobertaModel.from_pretrained("Ensheng/coco")
Output:
Downloading: 100% 916k/916k [00:00<00:00, 1.60MB/s]
Downloading: 100% 434k/434k [00:00<00:00, 411kB/s]
Downloading: 100% 941/941 [00:00<00:00, 25.7kB/s]
Downloading: 100% 1.63k/1.63k [00:00<00:00, 45.5kB/s]
Downloading: 100% 1.10k/1.10k [00:00<00:00, 29.4kB/s]
Downloading: 100% 738/738 [00:00<00:00, 16.7kB/s]
Downloading: 100% 481M/481M [00:08<00:00, 58.3MB/s]
Test the tokenizer and model:
nl_tokens=tokenizer.tokenize("return maximum value")
code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]
tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
print(context_embeddings)
Output:
tensor([[[-0.6205, 0.2075, -0.6909, ..., 0.4914, 1.5620, 0.5642],
[-0.6205, 0.2075, -0.6909, ..., 0.4914, 1.5620, 0.5642],
[-0.6205, 0.2075, -0.6909, ..., 0.4914, 1.5620, 0.5642],
...,
[-0.3708, 0.5695, -1.5493, ..., -0.0023, 1.2854, 0.3780],
[ 1.3056, -0.1004, -0.6191, ..., -0.4956, 1.5792, 1.5347],
[ 0.1874, 1.3228, -0.9529, ..., -1.0119, 1.7750, 1.3678]]],
grad_fn=<NativeLayerNormBackward>)
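The printed tensor has shape (1, sequence_length, 768): the snippet above builds a [CLS] nl [SEP] code [SEP] token sequence, converts it to ids, and the `[None, :]` indexing adds a batch dimension before the ids go into the model. A toy sketch of those two steps without the model, using an invented mini-vocabulary (the real ids come from the tokenizer's vocab.json):

```python
import numpy as np

# Hypothetical mini-vocabulary; made up for illustration only.
vocab = {"<s>": 0, "</s>": 2, "return": 100, "maximum": 101, "value": 102}

nl_tokens = ["return", "maximum", "value"]  # pretend output of tokenizer.tokenize(...)
code_tokens = ["return"]

# Same layout as above: [CLS] + nl + [SEP] + code + [SEP]
tokens = ["<s>"] + nl_tokens + ["</s>"] + code_tokens + ["</s>"]
token_ids = np.array([vocab[t] for t in tokens])
print(token_ids.shape)       # (7,) - ids in [CLS] nl [SEP] code [SEP] order

# [None, :] adds a batch axis, equivalent to unsqueeze(0) in PyTorch.
batched = token_ids[None, :]
print(batched.shape)         # (1, 7)
```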