最近AI绘图非常火,只需要输入文本就能得到令人惊艳的图。
举个例子,输入“photo of a gorgeous young woman in the style of stefan kostic and david la chapelle, coy, shy, alluring, evocative, stunning, award winning, realistic, sharp focus, 8 k high definition, 3 5 mm film photography, photo realistic, insanely detailed, intricate, elegant, art by stanley lau and artgerm” 得到:
输入“temple in ruines, forest, stairs, columns, cinematic, detailed, atmospheric, epic, concept art, Matte painting, background, mist, photo-realistic, concept art, volumetric light, cinematic epic + rule of thirds octane render, 8k, corona render, movie concept art, octane render, cinematic, trending on artstation, movie concept art, cinematic composition , ultra-detailed, realistic , hyper-realistic , volumetric lighting, 8k –ar 2:3 –test –uplight” 得到:
以上效果出自最近开源的效果非常好的模型——stable diffusion。那可能会有很多人和我一样,想得到自己的定制化的模型,专门用来生成人脸、动漫或者其他。
github上有个小哥还真就做了这件事了,他专门finetune了一个神奇宝贝版stable diffusion,以下是他模型的效果:输入“robotic cat with wings” 得到:
是不是很有趣,今天这篇文章就介绍一下如何快速finetune stable diffusion。
小哥写的详细介绍可以移步:https://github.com/LambdaLabsML/examples/tree/main/stable-diffusion-finetuning
1、准备数据
深度学习的训练,首先就是要解决数据问题。由于stable diffusion的训练数据是 文本-图像 匹配的pairs,因此我们要按照它的要求准备数据。
准备好你的所有图片,当然对于大部分人来说,要得到图片容易,但是手里的图片数据都是没有文本标注的,但是我们可以用BLIP算法来自动生成标注。
BLIP项目地址:https://github.com/salesforce/BLIP
效果见下图:
BLIP自动给妙蛙种子生成了一段描述,当然算法的效果很难达到完美,但是足够用了。如果觉得不够好,那完全也可以自己标注。
将得到的text,与图片名使用json格式存起来:
{"0001.jpg": "This is a young woman with a broad forehead.","0002.jpg": "The young lady has a melon seed face and her chin is relatively narrow.","0003.jpg": "This is a melon seed face woman who has a broad chin.There is a young lady with a broad forehead."}
2、下载代码模型
这里我们使用小哥魔改的stable diffusion代码,更加方便finetune。
finetune代码地址:https://github.com/justinpinkney/stable-diffusion
按照这个代码readme里的要求装好环境。同时下载好stable diffusion预训练好的模型 sd-v1-4-full-ema.ckpt ,放到目录里。
模型下载地址:CompVis/stable-diffusion-v-1-4-original · Hugging Face
3、配置与运行
stable diffusion使用yaml文件来配置训练,由于小哥给的yaml需要配置特定的数据格式,太麻烦了,我这边直接给出一个更简单方便的。只需要修改放图片的文件夹路径,以及第一步生成的配对数据的json文件路径。具体改哪儿直接看下面:
model:base_learning_rate: 1.0e-04target: ldm.models.diffusion.ddpm.LatentDiffusionparams:linear_start: 0.00085linear_end: 0.0120num_timesteps_cond: 1log_every_t: 200timesteps: 1000first_stage_key: "image"cond_stage_key: "txt"image_size: 64channels: 4cond_stage_trainable: false # Note: different from the one we trained beforeconditioning_key: crossattnscale_factor: 0.18215scheduler_config: # 10000 warmup stepstarget: ldm.lr_scheduler.LambdaLinearSchedulerparams:warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratchcycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner casesf_start: [ 1.e-6 ]f_max: [ 1. ]f_min: [ 1. ]unet_config:target: ldm.modules.diffusionmodules.openaimodel.UNetModelparams:image_size: 32 # unusedin_channels: 4out_channels: 4model_channels: 320attention_resolutions: [ 4, 2, 1 ]num_res_blocks: 2channel_mult: [ 1, 2, 4, 4 ]num_heads: 8use_spatial_transformer: Truetransformer_depth: 1context_dim: 768use_checkpoint: Truelegacy: Falsefirst_stage_config:target: ldm.models.autoencoder.AutoencoderKLckpt_path: "models/first_stage_models/kl-f8/model.ckpt"params:embed_dim: 4monitor: val/rec_lossddconfig:double_z: truez_channels: 4resolution: 256in_channels: 3out_ch: 3ch: 128ch_mult:- 1- 2- 4- 4num_res_blocks: 2attn_resolutions: []dropout: 0.0lossconfig:target: torch.nn.Identitycond_stage_config:target: ldm.modules.encoders.modules.FrozenCLIPEmbedderdata:target: main.DataModuleFromConfigparams:batch_size: 1num_workers: 4num_val_workers: 0 # Avoid a weird val dataloader issuetrain:target: ldm.data.simple.FolderDataparams:root_dir: '你存图片的文件夹路径/'caption_file: '图片对应的标注文件.json'image_transforms:- target: torchvision.transforms.Resizeparams:size: 512interpolation: 3- target: torchvision.transforms.RandomCropparams:size: 512- target: torchvision.transforms.RandomHorizontalFlipvalidation:target: ldm.data.simple.TextOnlyparams:captions:- "测试时候用的prompt"- "A frontal selfie of handsome caucasian guy with blond hair and blue eyes, with face in the center"output_size: 512n_gpus: 2 # small hack to sure we see all our sampleslightning:find_unused_parameters: Falsemodelcheckpoint:params:every_n_train_steps: 30000save_top_k: -1monitor: nullcallbacks:image_logger:target: main.ImageLoggerparams:batch_frequency: 30000max_images: 1increase_log_steps: Falselog_first_step: Truelog_all_val: Truelog_images_kwargs:use_ema_scope: Trueinpaint: Falseplot_progressive_rows: Falseplot_diffusion_rows: FalseN: 4unconditional_guidance_scale: 3.0unconditional_guidance_label: [""]trainer:benchmark: Truenum_sanity_val_steps: 0accumulate_grad_batches: 1
最后一步,运行命令:
python main.py --base yaml文件路径.yaml --gpus 0,1 --scale_lr False --num_nodes 1 --check_val_every_n_epoch 2 --finetune_from 上面下载的模型路径.ckpt
大功告成,等待模型训练就行了。需要注意的是,我这边启用了两个GPU,并且stable diffusion是比较吃显存的,我在V100上进行训练batchsize也只能设为1。