Stable Diffusion is a large generative model in computer vision that handles image-generation tasks such as text-to-image (txt2img) and image-to-image (img2img). Its open-source release, together with the wave of follow-up work built on top of it, has brought unprecedented quality and creativity to AI painting.
In July of this year, Stability AI officially released Stable Diffusion XL (SDXL) 1.0, currently the best open-source model in image generation and another major iteration in the evolution of text-to-image models. SDXL 1.0 can produce high-quality images in almost any artistic style and is the strongest open-source model for photorealism. Color vividness and accuracy are well tuned; contrast, lighting, and shadows all improve on the previous generation; and everything is generated at a native 1024x1024 resolution. Beyond that, SDXL 1.0 is much better at concepts that are notoriously hard to generate, such as hands, text, and spatial arrangement.
Existing training tutorials for text-to-image (txt2img) models mostly cover LoRA, DreamBooth, Textual Inversion, and similar methods, and most rely on visual UI tools such as SD WebUI or one-click AI-painting launchers. Detailed tutorials on full fine-tuning are almost nonexistent, so this post records the materials I consulted while fine-tuning the SDXL Base model, along with notes on the training parameters.
The training data comes from public datasets released by research institutes, companies, and the Kaggle platform, as follows:
The POSTER-TEXT dataset targets text-image generation for e-commerce poster images. It contains 114,009 records and is provided by Alibaba Group, including the original posters and versions with the text erased.
Paper: TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design, ACM MM 2023
Source: https://tianchi.aliyun.com/dataset/160034
The AutoPoster-Dataset targets automated generation of e-commerce poster images. It contains 76,000 records and is provided by Alibaba Group.
Some images carry duplicate annotations. The paper reports 69,249 training images and 7,711 test images, but after de-duplication the training set actually contains 68,866 unique advertising posters and the test set 7,671 unique images.
Paper: AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation, ACM MM 2023
Source: https://tianchi.aliyun.com/dataset/159829
Paper: Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs, IJCAI 2022
Github: https://github.com/minzhouGithub/CGL-GAN
Source: https://tianchi.aliyun.com/dataset/142692
As the first public dataset to feature complex layouts, it poses greater difficulty for modeling relationships within a layout and represents an extended task that demands complex layouts. It contains 9,974 training images and 905 test images.
Paper: A New Dataset and Benchmark for Content-aware Visual-Textual Presentation Layout, CVPR 2023
Github: https://github.com/PKU-ICST-MIPL/PosterLayout-CVPR2023
Source: http://59.108.48.34/tiki/PosterLayout/
E-commerce poster images as well, but the data has not been released; the work comes from USTC and Alibaba.
Paper: TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design, ACM MM 2023
Source: None
Public data on Kaggle; you need to write your own download script using the image URLs in the provided csv or json files.
Source:
Taking the first data source as an example:
import csv
import os
import warnings

import requests

warnings.filterwarnings('ignore')  # silence the InsecureRequestWarning caused by verify=False

csv_file = r"C:\Users\xxx\Downloads\tvs.csv"
url_prefix = 'https://www.themoviedb.org/t/p/w600_and_h900_bestv2'
save_root_path = r"D:\dataset\download_data\tv_series"


def parse_csv(path):
    cnt = 0
    s = requests.Session()
    s.verify = False  # disable SSL verification for the whole session
    with open(path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            raw_img_url = row['poster_path']  # relative image url from the csv
            if raw_img_url == '':
                continue
            img_url = url_prefix + raw_img_url
            try:
                img_file = s.get(img_url, verify=False)
            except Exception as e:
                # the request itself failed (timeout, connection error, ...)
                print(repr(e))
                continue
            if img_file.status_code == 200:
                img_name = raw_img_url.split('/')[-1]
                save_path = os.path.join(save_root_path, img_name)
                with open(save_path, 'wb') as img:
                    img.write(img_file.content)
                cnt += 1
                print(cnt, 'saved!')
            else:
                print("Unexpected status code: {}".format(img_file.status_code))
    print("Done!")


if __name__ == '__main__':
    if not os.path.exists(save_root_path):
        os.makedirs(save_root_path)
    parse_csv(csv_file)
Data filtering criteria: images below 512 px or above 1024 px are removed; images with file size (KB) / resolution (pixel count) < 0.0005 are removed; images with dpi below 96 are removed. The 0.0005 threshold is a subjective standard derived from the file sizes (KB) and resolutions of SD-generated images, intended to guarantee image quality. The values of this metric for eight SD-generated images are:
File size / image resolution of SD-generated images: 0.00129, 0.0012, 0.0011, 0.00136, 0.0014, 0.0015, 0.0013, 0.00149
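As a reference, a minimal sketch of this filtering rule (the helper name keep_image and the dpi fallback are my own assumptions; the thresholds are the ones stated above):

import os
from PIL import Image

MIN_SIDE, MAX_SIDE = 512, 1024  # resolution bounds from the rule above
RATIO_THRESHOLD = 0.0005        # file size (KB) / pixel count
MIN_DPI = 96

def keep_image(path):
    size_kb = os.path.getsize(path) / 1024
    with Image.open(path) as img:
        w, h = img.size
        dpi = img.info.get('dpi', (72, 72))[0]  # assume 72 dpi when missing
    if min(w, h) < MIN_SIDE or max(w, h) > MAX_SIDE:
        return False  # outside the 512-1024 range
    if size_kb / (w * h) < RATIO_THRESHOLD:
        return False  # over-compressed, likely low quality
    if dpi < MIN_DPI:
        return False
    return True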
Image captioning: BLIP and the Waifu tagger are used for automatic captioning; the Zhihu link given earlier explains this in detail, so I won't repeat it here.
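For illustration, a minimal BLIP auto-captioning sketch with Hugging Face transformers (the model id and generation settings are my assumptions, not necessarily the setup used in that tutorial):

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption(path):
    # generate a short English caption for one training image
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True)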
SDXL’s VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely --pretrained_vae_model_name_or_path that lets you specify the location of a better VAE (such as this one).
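A hedged Python sketch of that fix (the fp16-fix VAE id below is a common community choice and my assumption, not necessarily the checkpoint linked in the quote):

import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# swap in a numerically more stable VAE (assumed checkpoint id)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix",
                                    torch_dtype=torch.float16)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")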
_load_target_model()
Decides whether to load the model from a single safetensors file; the StableDiffusionXLPipeline loading code was modified accordingly.
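For reference, a minimal sketch of loading SDXL from a single safetensors checkpoint via diffusers' from_single_file API (the checkpoint path is a placeholder; this is an illustration, not the modified code itself):

import torch
from diffusers import StableDiffusionXLPipeline

# placeholder path to a single-file checkpoint
ckpt = "./models/sd_xl_base_1.0.safetensors"
pipe = StableDiffusionXLPipeline.from_single_file(ckpt, torch_dtype=torch.float16)
pipe.to("cuda")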
optimizer_args
A new parameter was added here, with a default value of 1.0. Reference: https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
At present, evaluation in the AIGC field is still fairly subjective overall, but here the aesthetic score (Aesthetics) and the CLIP score are used to measure generated-image quality and text-image alignment, respectively. The evaluation code is based on GhostReview, developed by the author of GhostMix; I took only part of it and made some optimizations, so please read it together with the original author's code. The code is as follows:
import os
import sys

import numpy as np
import pandas as pd
import scipy.stats
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
import clip
from PIL import Image


class MLP(pl.LightningModule):
    def __init__(self, input_size, xcol='emb', ycol='avg_rating'):
        super().__init__()
        self.input_size = input_size
        self.xcol = xcol
        self.ycol = ycol
        # no activations: matches the linear "sac+logos+ava1-l14-linearMSE" checkpoint
        self.layers = nn.Sequential(
            nn.Linear(self.input_size, 1024),
            nn.Dropout(0.2),
            nn.Linear(1024, 128),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.Dropout(0.1),
            nn.Linear(64, 16),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)

    def training_step(self, batch, batch_idx):
        x = batch[self.xcol]
        y = batch[self.ycol].reshape(-1, 1)
        x_hat = self.layers(x)
        return F.mse_loss(x_hat, y)

    def validation_step(self, batch, batch_idx):
        x = batch[self.xcol]
        y = batch[self.ycol].reshape(-1, 1)
        x_hat = self.layers(x)
        return F.mse_loss(x_hat, y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)


# LAION aesthetics score for one image
def PredictionLAION(image, laion_model, clip_model, clip_process, device='cpu'):
    image = clip_process(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = clip_model.encode_image(image)
    im_emb_arr = normalized(image_features.cpu().detach().numpy())
    # .float() keeps the tensor on `device`; the original .type(torch.FloatTensor)
    # silently moved it back to the CPU
    prediction = laion_model(torch.from_numpy(im_emb_arr).to(device).float())
    return float(prediction)


# ClipScore for one image
def get_clip_score(image, text, clip_model, preprocess, device='cpu'):
    # Preprocess the image and tokenize the text, moving both to the device
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_input = clip.tokenize([text], truncate=True).to(device)
    # Generate embeddings for the image and text
    with torch.no_grad():
        image_features = clip_model.encode_image(image_input)
        text_features = clip_model.encode_text(text_input)
    # Normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the two embeddings is the CLIP score
    return torch.matmul(image_features, text_features.T).item()


if __name__ == '__main__':
    # Image folders: one subfolder per model
    ImgRoot = './Image/ImageRating'
    DataFramePath = './dataresult/MyImageRating'  # per-model results for all prompts
    ModelSummaryFile = './ImageRatingSummary/MyModelSummary_Total.csv'
    PromptsFolder = os.listdir(ImgRoot)
    os.makedirs(DataFramePath, exist_ok=True)
    os.makedirs('./ImageRatingSummary', exist_ok=True)

    # Prompts corresponding to the images
    PromptDataFrame = pd.read_csv('./PromptsForReviews/mytest.csv')
    PromptsList = list(PromptDataFrame['Prompts'])

    # Load the evaluation models
    device = "cuda" if torch.cuda.is_available() else "cpu"
    MLP_Model = MLP(768)  # CLIP embedding dim is 768 for CLIP ViT-L/14
    # load the LAION aesthetics predictor weights
    state_dict = torch.load("./models/sac+logos+ava1-l14-linearMSE.pth",
                            map_location=torch.device(device))
    MLP_Model.load_state_dict(state_dict)
    MLP_Model.to(device)
    MLP_Model.eval()
    # Load the pre-trained CLIP model
    CLIP_Model, CLIP_Preprocess = clip.load('ViT-L/14', device=device,
                                            download_root='./models/clip')  # RN50x64
    CLIP_Model.eval()

    # Skip models that have already been evaluated
    try:
        DataSummaryDone = pd.read_csv(ModelSummaryFile)
        PromptsNotDone = [m for m in PromptsFolder if m not in list(DataSummaryDone['Model'])]
    except FileNotFoundError:
        DataSummaryDone = pd.DataFrame()
        PromptsNotDone = list(PromptsFolder)

    if not PromptsNotDone:
        sys.exit("There are no models to analyze.")

    for i, name in enumerate(PromptsNotDone):
        FolderPath = os.path.join(ImgRoot, str(name))
        ImageInFolder = os.listdir(FolderPath)
        DataCollect = pd.DataFrame()
        for j, img in enumerate(ImageInFolder):
            # file names are expected to embed the prompt index, e.g. "xxx-3-xxx.png"
            prompt_index = int(img.split('-')[1])
            txt = PromptsList[prompt_index]
            Img = Image.open(os.path.join(FolderPath, img))
            # ClipScore
            ImgClipScore = get_clip_score(Img, txt, CLIP_Model, CLIP_Preprocess, device)
            # LAION aesthetics score
            ImageLAIONScore = PredictionLAION(Img, MLP_Model, CLIP_Model,
                                              CLIP_Preprocess, device)
            temp = pd.DataFrame([float(ImgClipScore), ImageLAIONScore])
            DataCollect = pd.concat([DataCollect, temp], axis=1)
            print("Model:{}/{}, image:{}/{}".format(i + 1, len(PromptsNotDone),
                                                    j + 1, len(ImageInFolder)))
        DataCollect = DataCollect.T
        DataCollect['ImageIndex'] = [k + 1 for k in range(len(ImageInFolder))]
        DataCollect.columns = ['ClipScore', 'LAIONScore', 'ImageIndex']
        # Save the raw per-image scores
        DataCollect.to_csv(os.path.join(DataFramePath, str(name) + '.csv'), index=False)
        print("One Results File Saved!")
    print('Image rating complete!')

    # Aggregate per-model statistics
    ModelSummary = pd.DataFrame()
    for name in PromptsNotDone:
        DataCollect = pd.read_csv(os.path.join(DataFramePath, str(name) + '.csv'))
        temp = pd.DataFrame(DataCollect['LAIONScore'].describe()).T
        # skewness of the LAION score distribution
        temp['skew'] = scipy.stats.skew(DataCollect['LAIONScore'], axis=0, bias=True,
                                        nan_policy="propagate")
        # kurtosis of the LAION score distribution
        temp['kurtosis'] = scipy.stats.kurtosis(DataCollect['LAIONScore'], axis=0,
                                                fisher=True, bias=True,
                                                nan_policy="propagate")
        temp.columns = [c + '_LAIONScore' for c in list(temp.columns)]
        temp['Clipscore_mean'] = np.mean(DataCollect['ClipScore'])
        temp['Clipscore_std'] = np.std(DataCollect['ClipScore'])
        temp['Model'] = str(name)
        ModelSummary = pd.concat([ModelSummary, temp], axis=0)

    # Save the summary with columns reordered for readability
    new_order = ['Model', 'count_LAIONScore', 'mean_LAIONScore', 'std_LAIONScore',
                 'min_LAIONScore', '25%_LAIONScore', '50%_LAIONScore', '75%_LAIONScore',
                 'max_LAIONScore', 'skew_LAIONScore', 'kurtosis_LAIONScore',
                 'Clipscore_mean', 'Clipscore_std']
    ModelSummary = ModelSummary.reindex(columns=new_order)
    DataSummaryDone = pd.concat([DataSummaryDone, ModelSummary], axis=0)
    DataSummaryDone.to_csv(ModelSummaryFile)
    pd.set_option('display.max_rows', None)     # no row limit when printing
    pd.set_option('display.max_columns', None)  # no column limit when printing
    pd.set_option('display.width', 1000)        # console width of 1000 characters
    print(DataSummaryDone)
The figure below compares the SDXL-Poster model trained in this post with mainstream text-to-image models. Note that the results from the Anything model downward were computed by me from 180 images generated with each model, so their standard deviations run large; the rows above come from the GhostReview author's runs of 960 images per model. Since the sample sizes differ, interpret the comparison with caution.
Comparing the SDXL-Poster trained in this post against SDXL-Base and CyberRealistic:
A feline peering out from a striped transparent travel bag with a bicycle in the background. Outdoor setting, sunset ambiance. Product advertisement of pet bag, No humans, focus on cat and bag, vibrant colors, recreational theme
Four amber glass bottles with droppers placed side by side, arranged on a white background, skincare product promotion, no individuals present, still life setup
Simple text, an image in 5 seconds. Paper: https://arxiv.org/abs/2309.15807
Detailed walkthrough on Zhihu: https://zhuanlan.zhihu.com/p/659476603
It introduces EMU's training recipe, quality-tuning, a form of supervised fine-tuning built on three key ingredients.
ideogram: a generative model that renders text within generated images, released on August 23, 2023; free to use; official site: https://ideogram.ai/
"Text rendering is still unreliable; they believe the model struggles to map word tokens to letters in the image."