Train¶

Make captions and tags¶

Use BLIP for sentence captions or DeepDanbooru for tags.

Textual Inversion¶

Prepare the data and a trigger word (wedmz, for example). The data can be images only or images with captions. A caption should describe everything in the image except the features of the character being learned.

For example, take the caption 'A girl with long hair stands before a table.' The long-haired girl is the feature we want to learn, so the caption becomes 'A girl stands before a table.'

Make the dataset config:

# dataset.toml
[general]
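enable_bucket = true                        # whether to use Aspect Ratio Bucketing; not needed when all images share the same aspect ratio

[[datasets]]
resolution = 512                            # resolution of the output images
batch_size = 4

    [[datasets.subsets]]
    image_dir = 'data/processed'    # folder containing the training images
    num_repeats = 1                 # number of times the training images are repeated
    # class_tokens = 'shs 1girl'    # identifier and class; required when training without captions
    caption_extension = '.txt'      # use caption files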
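With caption_extension = '.txt', each training image is expected to have a caption file with the same base name next to it. The following is a minimal sketch for spotting images that lack one, assuming a flat image folder. The helper name find_missing_captions and the list of image suffixes are illustrative assumptions, not part of sd-scripts.

```python
from pathlib import Path

# Suffixes treated as training images (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def find_missing_captions(image_dir, caption_extension=".txt"):
    """Return sorted names of images in image_dir without a matching caption file."""
    missing = []
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip caption files and anything else that is not an image
        if not path.with_suffix(caption_extension).exists():
            missing.append(path.name)
    return missing
```

Calling find_missing_captions('data/processed') before training should return an empty list when every image has its caption.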

Set up the test prompts. Each prompt line can carry the following options:

--n Negative prompt, up to the next option.
--w Width of the generated image.
--h Height of the generated image.
--d Seed for the generated image.
--l CFG scale of the generated image.
--s Number of steps during generation.

# prompts.txt
# prompt 1
a photo of wedmz --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 512 --h 512 --d 1 --l 7.5 --s 25

# prompt 2
a photo of wedmz:0.5 --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 512 --h 512 --d 2 --l 7.5 --s 25
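To make the option layout of a prompt line concrete, here is a rough sketch that splits one line into its prompt text and options. It is not the parser sd-scripts uses. The function name parse_prompt_line and the descriptive key names are assumptions made for illustration, and values are left as strings.

```python
import re

# Option letters from the notes above, mapped to illustrative key names.
OPTION_NAMES = {
    "n": "negative_prompt",
    "w": "width",
    "h": "height",
    "d": "seed",
    "l": "cfg_scale",
    "s": "steps",
}


def parse_prompt_line(line):
    """Split 'prompt --n ... --w 512 ...' into the prompt text and its options."""
    # Splitting on ' --x ' keeps each option letter thanks to the capture group.
    parts = re.split(r"\s+--([a-z])\s+", line.strip())
    result = {"prompt": parts[0]}
    for letter, value in zip(parts[1::2], parts[2::2]):
        result[OPTION_NAMES.get(letter, letter)] = value.strip()
    return result
```

For prompt 1 above, this yields the prompt 'a photo of wedmz', the full negative prompt under negative_prompt, and the remaining options as strings such as '512' and '7.5'.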

Start training:

accelerate launch --num_cpu_threads_per_process 4 sd-scripts-mirror/train_textual_inversion.py \
    --pretrained_model_name_or_path=stable-diffusion-webui/models/Stable-diffusion/Chilloutmix-Ni-pruned-fp32-fix.safetensors \
    --dataset_config=dataset.toml \
    --output_dir=output \
    --output_name=wedmz \
    --save_model_as=pt \
    --prior_loss_weight=1.0 \
    --max_train_steps=3000 \
    --learning_rate=1e-4 \
    --optimizer_type="AdamW8bit" \
    --mixed_precision="fp16" \
    --token_string=wedmz \
    --init_word=1girl \
    --num_vectors_per_token=3 \
    --sample_every_n_steps=1000 \
    --sample_prompts="prompts.txt" \
    --save_every_n_epochs 10 \
    --save_state \
    --cache_latents \
    --use_style_template

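Choose either --use_object_template or --use_style_template, not both.
Set num_vectors_per_token according to the number of images: generally 2 for fewer than 10 images, plus 1 vector for every additional 10 images.
lr 0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005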
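The num_vectors_per_token rule of thumb can be written as a one-line function. This is only one reading of the note (2 vectors below 10 images, 3 for 10 to 19, and so on), and the name suggested_num_vectors is hypothetical.

```python
def suggested_num_vectors(num_images):
    """Rule of thumb: 2 vectors below 10 images, plus 1 for every further 10 images."""
    if num_images < 1:
        raise ValueError("num_images must be positive")
    return 2 + num_images // 10
```

Under this reading, the --num_vectors_per_token=3 used in the command above corresponds to a dataset of 10 to 19 images.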

Tricks¶

Split and shuffle captions. For example, the caption "A, b, c" becomes "b, A, c".

Batch processing seems to work well: [1, 77x3] -> [1, 3, 77]
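The caption split-and-shuffle trick can be sketched as follows, assuming comma-separated tags. The keep_tokens parameter, which pins the first few tags such as a trigger word, is this sketch's own addition, and the function is not taken from any training script.

```python
import random


def shuffle_caption(caption, keep_tokens=0, rng=random):
    """Split a comma-separated caption into tags and shuffle them.

    The first keep_tokens tags stay in place (for example, a trigger word).
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # in-place shuffle of the movable tags only
    return ", ".join(fixed + rest)
```

Passing rng=random.Random(0) makes the order reproducible. By default the module-level generator is used, so each call gives a fresh permutation.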

DreamBooth¶

LoRA¶

Prepare Data¶

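Image requirements

The default image size is 512×512. With 12 GB or more of VRAM, training on 768×768 images is recommended; a model trained at the larger size can somewhat reduce artifacts such as overlapping limbs when generating wide images.
Use at least 15 images, with no fewer than 100 training steps per image.
Portrait photos should cover multiple angles, especially face close-ups (as high resolution as possible), with varied expressions, lighting, and poses. The face should not be heavily occluded.
Keep the composition as simple as possible to avoid interference from other complex elements.
A set can combine face close-ups and outfit shots in a fixed proportion (the recommended ratio is 3:1).
Reduce duplicate or highly similar images to avoid overfitting.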

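Image processing

Crop the photos to 768×768. Sites such as birme and imgtools can batch-crop online; when necessary, cut out the subject of complex images with photopea.
Prepare the image captions. Use the Preprocess images module of the Stable Diffusion web UI to caption the images, and make sure the image size is set to the size you cropped to. Caption with BLIP (sentences) or deepbooru (short phrases). With fewer than 15 images, enabling the Create flipped copies option is recommended.
Edit the generated caption files and add the tag for the key character. For outfit images, you can give the outfit a custom tag to distinguish it, and a recurring hairstyle can likewise get a custom hairstyle tag. When using the LoRA model later, add the outfit or hairstyle tag to generate images that match. The Captioning tools under Utilities in kohya_ss can batch-add the character and outfit tags to the processed caption files.

Note: tagging has a major effect on prompting. The caption should describe in detail everything that is unrelated to the structure you expect to generate.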
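Batch-adding a character tag to the caption files can also be done with a few lines of Python. The sketch below is a stand-in for the kohya_ss utility, not its implementation. The helper name prepend_tag is hypothetical, and captions are assumed to be comma-separated tags in UTF-8 text files.

```python
from pathlib import Path


def prepend_tag(caption_dir, tag, caption_extension=".txt"):
    """Prepend tag to every caption file in caption_dir unless it is already present.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in sorted(Path(caption_dir).glob(f"*{caption_extension}")):
        text = path.read_text(encoding="utf-8")
        tags = [t.strip() for t in text.split(",") if t.strip()]
        if tag in tags:
            continue  # already tagged, leave the file untouched
        path.write_text(", ".join([tag] + tags), encoding="utf-8")
        changed += 1
    return changed
```

For example, prepend_tag('data/zenli', 'zenli') would put an illustrative character tag zenli at the front of each caption, which also suits training with shuffled captions if the first tag is kept fixed.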

Train configure¶

dataset.toml

[general]
shuffle_caption = true
enable_bucket = true                        # whether to use Aspect Ratio Bucketing

[[datasets]]
resolution = 512                            # training resolution
batch_size = 4                              # batch size

  [[datasets.subsets]]
  image_dir = 'data/zenli'                     # folder containing the training images
  num_repeats = 100                          # number of times the training images are repeated
  caption_extension = '.txt'

# prompts.txt
# prompt 1
1girl, black_hair, denim, fur_trim, jacket, lips, long_hair, open_mouth, realistic, smile, solo, teeth, upper_body, white_background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 512 --h 512 --d 1 --l 7.5 --s 25

# prompt 2
1girl, 3d, asian, black_hair, brown_eyes, earrings, lace_cheongsam, floral_print, jewelry, lips, looking_at_viewer, makeup, open_mouth, realistic, red_lips, solo, teeth, upper_body --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 512 --h 512 --d 2 --l 7.5 --s 25

accelerate launch --num_cpu_threads_per_process=4 "sd-scripts-mirror/train_network.py" \
    --pretrained_model_name_or_path=stable-diffusion-webui/models/Stable-diffusion/Chilloutmix-Ni-pruned-fp32-fix.safetensors \
    --dataset_config=dataset.toml \
    --output_dir=output \
    --output_name=zenli32 \
    --save_model_as=safetensors \
    --prior_loss_weight=1.0 \
    --sample_every_n_steps=1000 \
    --sample_prompts="prompts.txt" \
    --max_train_steps=4000 \
    --learning_rate=1e-4 \
    --optimizer_type="AdamW8bit" \
    --mixed_precision="fp16" \
    --cache_latents \
    --save_every_n_epochs=1 \
    --shuffle_caption \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=32

Evaluate¶

(8k, RAW photo, best quality, masterpiece:1.2), (realistic, photo-realistic:1.37), ultra-detailed, 1 girl,(upper body:1.2), (blue background:1.2), solo,red lip, medium breasts,beautiful detailed eyes,(collared shirt:1.1), bowtie,white skirt,(long hair:1.2)
Negative prompt: EasyNegative, paintings, sketches, (worst quality:2), (low quality:2), (normal quality:2), lowres, normal quality, ((monochrome)), ((grayscale)), skin spots, acnes, skin blemishes, age spot, glans,extra fingers,fewer fingers,(watermark:1.2),(white letters:1/1)
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 12345, Size: 512x512, Model hash: fc2511737a, Model: Chilloutmix-Ni-pruned-fp32-fix, Denoising strength: 0.7, Hires upscale: 1.5, Hires steps: 15, Hires upscaler: Latent (nearest), AddNet Enabled: True, AddNet Module 1: LoRA, AddNet Model 1: zenli32(2e59ff44fd95), AddNet Weight A 1: 0.1, AddNet Weight B 1: 0.1, Script: X/Y/Z plot, X Type: AddNet Weight 1, X Values: "0.1,0.3,0.5,0.7,0.9", Y Type: AddNet Model 1, Y Values: "zenli32(2e59ff44fd95),zenli32s(8a5f12d4ef30)"

Tricks¶

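Shuffle the order: the position and order of the images in the training set affect the result. If the training set contains several kinds of outfits and you do not want any of them to be too prominent in the model, use a renaming tool to randomize the image order before preprocessing the training set. At the very least, no four consecutive images should share a feature you do not want to keep. The preprocessed training set is then already shuffled, and training on it learns the body and face without the model easily learning a particular outfit.
Remove the background: training on images is not that smart, so if the LoRA is a character model, the training images should have no background at all (a pure white background also works); cut the subject out. If there are conspicuous features you do not want, editing them out in Photoshop at this step is the quickest approach.
Mirror flip: if images generated with the trained LoRA have faces that are not quite symmetric, or the subject always faces one side, generate horizontally flipped images when building the training set. A model trained this way has no left/right bias.
Repeats: the number of times each training image is repeated. Six is the commonly recommended value, to prevent overfitting. If mirror flipping is used, though, each image is effectively trained twice, so I feel the repeat count can be lowered to three.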
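The order-shuffling step above mentions a renaming tool. A small Python sketch can play that role, under the assumption that it runs before captioning: it renames every file independently, so it would not keep image and caption pairs together. The function name randomize_order and the zero-padded naming scheme are illustrative choices.

```python
import random
from pathlib import Path


def randomize_order(image_dir, seed=None):
    """Rename files in image_dir to shuffled zero-padded numbers, keeping extensions.

    Returns a dict mapping each old file name to its new file name.
    """
    rng = random.Random(seed)
    files = sorted(p for p in Path(image_dir).iterdir() if p.is_file())
    rng.shuffle(files)
    # Phase 1: move everything to temporary names so no target name is still in use.
    temps = [p.rename(p.with_name(f"__tmp_{i}{p.suffix}")) for i, p in enumerate(files)]
    # Phase 2: assign the final numbered names in shuffled order.
    mapping = {}
    for i, (old, tmp) in enumerate(zip(files, temps)):
        new = tmp.rename(tmp.with_name(f"{i:04d}{tmp.suffix}"))
        mapping[old.name] = new.name
    return mapping
```

Keeping the returned mapping makes it possible to trace a shuffled file back to its source. Checking that no run of four neighbours shares an unwanted feature still has to be done by eye.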

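Tricks 2¶

1. Total training steps: for a 50-image dataset, deep training of around 15,000 steps is recommended. For larger datasets, the D-Adaptation optimizer can be used to test for the best total step count.

2. Epochs: 10-15 epochs are recommended, with each image trained 20-30 times per epoch.

3. Training resolution: 768x1024 is recommended; adjust it to your GPU memory.

4. Base model: Chilloutmix-Ni-pruned-fp32-fix, an SD 1.5 model, is recommended.

5. Text Encoder learning rate: mainly affects robustness, generalization, and degree of fit. Too low a value makes it hard to change features.

6. Unet learning rate: mainly affects how closely the output resembles the subject, as well as the loss and degree of fit. Increase it when underfitting and decrease it when overfitting.

7. Relationship between the text encoder and Unet learning rates: there is no required 1/5 to 1/10 ratio, and with very large datasets the Unet rate can even be lower than the text encoder rate.

8. Network Rank (dimension): strengthens the learned detail. 128-192 is recommended; gains above 128 are relatively small.

9. Network Alpha: 96 or above is recommended. It weakens the learned detail, has a regularizing effect, and can be increased together with dim.

10. Let the AI train the AI: for the first run, use D-Adaptation with all learning rates set to 1.

11. Manual training: the AdamW optimizer is recommended. Adjusting the learning rate lets you balance likeness against ease of use.

12. Loss control: lower is not always better. The lower the loss, the more closely the model fits, but the harder it becomes to change features, and it may even affect poses and expressions.

13. Lion optimizer: not recommended for deep training. Fitting too quickly gives strong likeness but poor versatility in the generated images.

14. Local deep training: you can monitor the run with remote-control software and change the learning rate remotely if it proves unsuitable during training.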

Kohya_ss Tutorial¶

cf: https://pix.ink/article/8681wp3js3w32

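learning rate: typically 5e-6 to 1e-5 for a full checkpoint model, 1e-4 to 1e-3 for LoRA, 1e-5 to 5e-5 for ControlNet, and 1e-5 for PFG. When training the text encoder, setting its learning rate to about half the UNet's tends to work better.
batch size: compared with a small batch, a large batch size ignores more detail and focuses more on the whole, so increasing the batch size calls for a correspondingly higher learning rate. Reportedly, the learning rate should scale with the square root of the batch-size ratio; for example, with 2× the batch size, multiply the learning rate by √2.
steps/epochs: total steps is the length of training. One pass over the whole dataset is one epoch. For example, with 4 images, each repeated 10 times, and a batch size of 4, one epoch is 40 steps and the actual step count is 40/4 = 10. Total steps = number of images × repeats × epochs. Actual steps = total steps / batch size.

Generally, a total of 6000 steps or more is needed for reasonably good results.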
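The step arithmetic and the square-root learning-rate rule above can be checked with a short sketch. The function names are hypothetical, and rounding the last partial batch up to a full step is an assumption of this sketch.

```python
import math


def total_steps(num_images, num_repeats, epochs):
    """Total steps as defined above: images x repeats x epochs (before batching)."""
    return num_images * num_repeats * epochs


def actual_steps(num_images, num_repeats, epochs, batch_size):
    """Optimizer steps: total steps divided by batch size, rounded up (assumption)."""
    return math.ceil(total_steps(num_images, num_repeats, epochs) / batch_size)


def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate by the square root of the batch-size ratio."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)
```

For the example in the text, total_steps(4, 10, 1) gives 40 and actual_steps(4, 10, 1, 4) gives 10. Doubling the batch size from 4 to 8 at a base rate of 1e-4 gives scaled_learning_rate(1e-4, 4, 8), about 1.41e-4.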

DreamBooth¶

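https://dfldata.cc/forum.php?mod=viewthread&tid=13810
To explain how regularization prevents overfitting, the original author uses frogs and rabbits: a model that learns to draw a frog ends up able to draw only the frog in the training picture. Adding all kinds of frogs to the regularization images lets the model apply what it learned to draw the various similar frogs in the regularization set. A common use is portrait training: a LoRA trained only on headshots is poor at full-body images, so adding full-body pictures to the regularization set improves the LoRA's versatility.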

https://nga.178.com/read.php?tid=35591424&rand=408

https://towardsdatascience.com/how-to-fine-tune-stable-diffusion-using-dreambooth-dfa6694524ae
You need to collect high-quality datasets to get consistent and good results. The training images should match the expected output and be resized to 512 x 512 in resolution.

Please note that artifacts such as motion blur or low resolution will affect the generated images. This is applicable to any unwanted text, watermarks or icons in your training datasets. Make sure to pay attention to the datasets that you used for training.

Depending on your use cases, you can use the following guidelines:

Object: use images of your object with a normal background. A transparent background may leave a fringe or border around the object. All training images should focus on just the object, with variations in:

camera angle
pose
props (clothing, haircut, etc.)
background (taken at different locations)

The number of training images should be around 5 to 20. You may need to crop the images to focus on just the object.

Instance images: custom images that represent the specific concept for DreamBooth training. You should collect high-quality images based on your use case.
Class images: regularization images for the prior-preservation loss, used to prevent overfitting. You should generate these images directly from the base pre-trained model, either on your own beforehand or on the fly when running the training script.

Data Source¶

red book¶

Extracting watermark-free images from Xiaohongshu (red book)

"imageList": [{
    "fileId": "6f7694e8-0f33-8066-78ef-1dbf9f1254f5",
    "height": 1920,
    "width": 1920,
    "url": "https:\u002F\u002Fsns-img-hw.xhscdn.com\u002F6f7694e8-0f33-8066-78ef-1dbf9f1254f5",
    "traceId": "0302bg016dpc07yobuv0112b6x90et5hmb"
}],

# image
https://sns-img-bd.xhscdn.com/<traceId>
https://sns-img-qc.xhscdn.com/<traceId>
# video
http://sns-video-bd.xhscdn.com/<traceId>
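The mapping from the imageList JSON to the URL patterns above can be sketched as follows. The function name image_urls is hypothetical, the input is assumed to be the array part of the snippet (the value of "imageList"), and whether these hosts still serve the files is not verified here.

```python
import json

# Image hosts taken from the URL patterns listed above.
IMAGE_HOSTS = ["sns-img-bd.xhscdn.com", "sns-img-qc.xhscdn.com"]


def image_urls(image_list_json):
    """Build candidate image URLs from an imageList JSON array using each traceId."""
    urls = []
    for item in json.loads(image_list_json):
        for host in IMAGE_HOSTS:
            urls.append(f"https://{host}/{item['traceId']}")
    return urls
```

Each entry yields one candidate URL per host, so a page with a single image produces two URLs to try.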

Celebrity photo shoots¶

Celebrity photo shoots: https://www.jj20.com/mx/dalu/nu/zengli/
xiaoxiao from telegram: MI9/Andriod/data/org.telegram.messager.web\files\Telegram\Telegram Images

24tupian.org¶

This script must be run on a https://big.diercun.com/*.jpg page.

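// Trigger a browser download of `src`, saved under the file name `fn`.
function download(src, fn) {
    const a = document.createElement("a");
    a.href = src;
    a.download = fn;
    a.click();
}

// Non-blocking delay, so the page stays responsive between downloads.
function sleep(delay) {
    return new Promise((resolve) => setTimeout(resolve, delay));
}

// Download pages 1 to 33, waiting 3 seconds between requests.
(async () => {
    for (let p = 1; p <= 33; p++) {
        const url = `https://big.diercun.com/hd2/2022/1113/45/24mnorg_1 ${p}.jpg`;
        const fn = `${p}.jpg`;
        download(url, fn);
        console.log(url, fn);
        await sleep(3000);
    }
})();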

Reference¶

novelai-aspect-ratio-bucketing