Training BEiT-3 on Your Own Dataset

Repository: https://github.com/microsoft/unilm
Step 1: Download the datasets
Dataset 1: the COCO "2014 Train images" and "2014 Val images"
Dataset 2: the Karpathy-split caption annotations (https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip)
Arrange the files as follows:

```
/path/to/your_data/
  train2014/
    COCO_train2014_000000000009.jpg
    ...
  val2014/
    COCO_val2014_000000000042.jpg
    ...
  dataset_coco.json
```

Note: dataset_coco.json is the COCO caption annotation file. To train on your own data, you need to convert your dataset into the same caption format.
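To make the note above concrete, here is a minimal sketch of a Karpathy-style dataset_coco.json. The field names match what the index-building code further down actually reads ("split", "filepath", "filename", "sentences[].raw", "cocoid"); the image name and captions are placeholders for your own data.

```python
import json

# Minimal Karpathy-style annotation file. Field names follow the index-building
# code; the image and captions below are placeholders for your own data.
annotations = {
    "images": [
        {
            "split": "train",         # "train", "restval", "val", or "test"
            "filepath": "train2014",  # sub-directory under /path/to/your_data
            "filename": "COCO_train2014_000000000009.jpg",
            "cocoid": 9,              # a unique integer id for this image
            "sentences": [
                {"raw": "a woman wearing a net on her head cutting a cake"},
                {"raw": "a woman cutting a large white sheet cake"},
            ],
        },
    ]
}

with open("dataset_coco.json", "w", encoding="utf-8") as writer:
    json.dump(annotations, writer, ensure_ascii=False, indent=2)
```

Each entry in "sentences" becomes one image–text training pair during indexing, so more captions per image simply yield more pairs.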
Download the tokenizer model used for data processing: (https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm)
Process the data:


```python
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

CaptioningDataset.make_coco_captioning_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
)
```

Processing flow: read dataset_coco.json, tokenize each caption, and write the resulting image–text entries to a JSONL index file.
The implementation:

```python
import json
import os

def _make_captioning_coco_karpathy_dataset_index(
        data_path,
        tokenizer,
        split=("train", "restval"),
        split_name="train",
):
    coco_karpathy_split_json_file = os.path.join(data_path, "dataset_coco.json")
    items = []
    image_counter = set()
    print("read %s" % coco_karpathy_split_json_file)
    with open(coco_karpathy_split_json_file, mode="r", encoding="utf-8") as reader:
        data = json.loads(reader.read())
        for item in data["images"]:
            if item["split"] in split:
                image_path = os.path.join(item["filepath"], item["filename"])
                if item["split"] in ["train", "restval"]:
                    # Each image has several captions (e.g. "a woman wearing a
                    # net on her head cutting a cake"); every caption is
                    # appended to items as its own image-text pair.
                    for sent in item["sentences"]:
                        tokens = tokenizer.tokenize(sent["raw"])
                        token_ids = tokenizer.convert_tokens_to_ids(tokens)
                        items.append({
                            "image_path": image_path,
                            "text_segment": token_ids,
                            "image_id": item["cocoid"],
                        })
                else:
                    # Val/test images are indexed without a caption.
                    items.append({
                        "image_path": image_path,
                        "text_segment": None,
                        "image_id": item["cocoid"],
                    })
                image_counter.add(image_path)
    print("Find %d images and %d image-text pairs for karpathy dataset %s split !" % \
        (len(image_counter), len(items), split_name))
    index_file = os.path.join(data_path, "coco_captioning.%s.jsonl" % split_name)
    _write_data_into_jsonl(items, index_file)
```
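The helper `_write_data_into_jsonl` is defined elsewhere in the repository; a minimal stand-in consistent with its name and usage (one JSON object per line) could look like this. The two sample entries mirror the dictionaries the indexing function produces.

```python
import json

def _write_data_into_jsonl(items, jsonl_file):
    # Minimal stand-in for the repo's helper: write one JSON object per line.
    with open(jsonl_file, mode="w", encoding="utf-8") as writer:
        for item in items:
            writer.write(json.dumps(item) + "\n")
    print("write %s with %d items !" % (jsonl_file, len(items)))

# Example entries shaped like the output of the indexing function above
# (token ids here are made up for illustration).
_write_data_into_jsonl(
    [
        {"image_path": "train2014/COCO_train2014_000000000009.jpg",
         "text_segment": [3, 15, 22], "image_id": 9},
        {"image_path": "val2014/COCO_val2014_000000000042.jpg",
         "text_segment": None, "image_id": 42},
    ],
    "coco_captioning.train.jsonl",
)
```

The resulting coco_captioning.train.jsonl file is what the training dataloader later consumes, one image–text pair per line.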
