Contents

I. Environment Setup

II. Running a Test, Problems Encountered

1. Set up the various paths per the official README

2. Build the training dataset

3. Run a train test

III. Code Reading Notes

Network structure

1. Logging points

2. Debug parameter notes

3. Entry points

1) Building the dataloader

2) Creating the model

3) Loss, actor, optimizer, etc.

4) Start of the train loop

5) Feeding data into the model

6) Forward pass

7) Compute loss

8) Resuming from a checkpoint

4. Model forward process

1) Into the backbone

First, patch_embed

5. Loading the pretrained backbone

6. Label design

7. Creating the dataloader

8. Data loading process


I. Environment Setup

Code: GitHub – botaoye/OSTrack: [ECCV 2022] Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework

Following option 1 in the official README, run in the repo root:

conda create -n ostrack python=3.8
conda activate ostrack
bash install.sh

II. Running a Test, Problems Encountered

1. Set up the various paths per the official README

Run:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir ./output

Problem encountered:

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Fix:

apt-get install libgl1

After running the script above, a local.py file is generated under /root/data/zjx/Code-subject/OSTrack-main/lib/train/admin, and another local.py under lib/test/evaluation/. They hold the default settings for the various paths.

2. Build the training dataset

Under the project root, lay out the datasets in the format given in the official README. The training command is:

python tracking/train.py --script ostrack --config vitb_256_mae_ce_32x4_ep300 --save_dir ./output --mode single --use_wandb 1

Then set up the pretrained weights: create a pretrained_models folder and place the weight file there.

The path in train.py also needs to be adjusted:

"python /root/data/zjx/Code-subject/TSTrackOur/lib/train/run_training.py --script %s --config %s --save_dir %s --use_lmdb %d " \
"--script_prv %s --config_prv %s --distill %d --script_teacher %s --config_teacher %s --use_wandb %d" \
% (args.script, args.config, args.save_dir, args.use_lmdb, args.script_prv, args.config_prv, args.distill,
   args.script_teacher, args.config_teacher, args.use_wandb)

3. Run a train test

Running from the terminal follows the official README. From a local IDE, the file to run is lib/train/run_training.py, with the arguments set to:

Namespace(config='vitb_256_mae_ce_32x4_ep300', config_prv='baseline', config_teacher=None, distill=0, ip='127.0.0.1', mode='single', nproc_per_node=None, port=20000, rank=None, save_dir='./output', script='ostrack', script_prv=None, script_teacher=None, use_lmdb=0, use_wandb=0, world_size=None)

--script ostrack
--config vitb_256_mae_ce_32x4_ep300 
--save_dir ./output
--use_lmdb 0
--script_prv None
--config_prv baseline
--distill 0
--script_teacher None
--config_teacher None
--use_wandb 0

Only the GOT10k dataset was used for this test run, so the other datasets need to be commented out in the config file corresponding to

--script ostrack --config vitb_256_mae_ce_32x4_ep300

i.e. change it in that file to:

TRAIN:
  DATASETS_NAME:
    # - LASOT
    - GOT10K_vottrain
    # - COCO17
    # - TRACKINGNET

When launching from the terminal, single-GPU training needs the --mode single argument. Skip wandb for now, since enabling it requires creating an account.

python tracking/train.py --script ostrack --config vitb_256_mae_ce_32x4_ep300 --save_dir ./output --mode single 

Problems encountered:

1)

Traceback (most recent call last):
  File "/root/data/zjx/Code-subject/OSTrack-main/lib/train/../../lib/train/trainers/base_trainer.py", line 85, in train
    self.train_epoch()
  File "/root/data/zjx/Code-subject/OSTrack-main/lib/train/../../lib/train/trainers/ltr_trainer.py", line 133, in train_epoch
    self.cycle_dataset(loader)
  File "/root/data/zjx/Code-subject/OSTrack-main/lib/train/../../lib/train/trainers/ltr_trainer.py", line 74, in cycle_dataset
    for i, data in enumerate(loader, 1):
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/ostrack/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/data/zjx/Code-subject/OSTrack-main/lib/train/../../lib/train/data/sampler.py", line 98, in __getitem__
    return self.getitem()
  File "/root/data/zjx/Code-subject/OSTrack-main/lib/train/../../lib/train/data/sampler.py", line 108, in getitem
    dataset = random.choices(self.datasets, self.p_datasets)[0]
  File "/root/anaconda3/envs/ostrack/lib/python3.8/random.py", line 404, in choices
    raise ValueError('The number of weights does not match the population')
ValueError: The number of weights does not match the population

Fixes:

For the first problem, set num_worker to 0 in the yaml config file:

NUM_WORKER: 0

The second problem is shown in the debug screenshot.

The source of the problem is lib\train\data\sampler.py — 109:

dataset = random.choices(self.datasets, self.p_datasets)[0]

Replace it with the following (since only the single GOT10k dataset is used for this test run):

dataset = self.datasets[0]
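The mismatch is easy to reproduce in isolation: after commenting out the other datasets, DATASETS_NAME has a single entry while DATASETS_RATIO in the yaml still lists four weights (the [1, 1, 1, 1] visible in the config printout below), so random.choices receives more weights than population members. Trimming DATASETS_RATIO to [1] should fix it just as well:

```python
import random

datasets = ["GOT10K_vottrain"]   # only one dataset left after commenting the rest out
p_datasets = [1, 1, 1, 1]        # DATASETS_RATIO still holds four weights

try:
    random.choices(datasets, p_datasets)
except ValueError as e:
    print(e)  # The number of weights does not match the population

# with a matching weight list, sampling works fine
print(random.choices(datasets, [1])[0])  # GOT10K_vottrain
```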

Running again, the next error:

FileNotFoundError: [Errno 2] No such file or directory: '/root/data/zjx/Code-subject/OSTrack-main/tracking/data/got10k/train/GOT-10k_Train_008341/groundtruth.txt'

Fix: adjust the GOT10k dataset layout by moving the sequence folders out of every split folder under train/ directly into train/.
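The flattening can also be scripted. A sketch, assuming the split folders sit directly under train/ and have "split" in their names (adjust to however your download unzipped):

```python
import shutil
from pathlib import Path

def flatten_got10k(train_dir):
    """Move every sequence folder out of the split subfolders into train/ itself."""
    root = Path(train_dir)
    for split in sorted(root.iterdir()):
        if split.is_dir() and "split" in split.name.lower():
            for seq in list(split.iterdir()):
                if seq.is_dir():
                    shutil.move(str(seq), str(root / seq.name))
            split.rmdir()  # remove the now-empty split folder

# flatten_got10k("./data/got10k/train")
```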

Running again, the next error:

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 10.76 GiB total capacity; 9.68 GiB already allocated; 13.56 MiB free; 9.74 GiB reserved in total by PyTorch)

Fix: reduce the batch size in the yaml file.

III. Code Reading Notes

Network structure

OSTrack(
  (backbone): VisionTransformerCE(
    (patch_embed): PatchEmbed(
      (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      (norm): Identity()
    )
    (pos_drop): Dropout(p=0.0, inplace=False)
    (blocks): Sequential(
      (0): CEBlock(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): Identity()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      # blocks (1)-(11) are identical to block (0), except (drop_path): DropPath
      # with drop_prob rising linearly: 0.009, 0.018, 0.027, 0.036, 0.045, 0.055,
      # 0.064, 0.073, 0.082, 0.091, 0.100
    )
    (norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
  )
  (box_head): CenterPredictor(
    # three parallel conv towers (ctr / offset / size) over the 768-d feature map;
    # each tower is conv1..conv4: Conv2d(3x3, padding 1) + BatchNorm2d + ReLU
    # with channels 768 -> 256 -> 128 -> 64 -> 32, followed by a 1x1 conv5:
    (conv5_ctr):    Conv2d(32, 1, kernel_size=(1, 1), stride=(1, 1))  # center score map
    (conv5_offset): Conv2d(32, 2, kernel_size=(1, 1), stride=(1, 1))  # offset map
    (conv5_size):   Conv2d(32, 2, kernel_size=(1, 1), stride=(1, 1))  # size map
  )
)

1. Logging points

1)

script_name: ostrack.py
config_name: vitb_256_mae_ce_32x4_ep300.yaml

run_training.py — 42

2)

New configuration is shown below.
MODEL configuration: {'PRETRAIN_FILE': 'mae_pretrain_vit_base.pth', 'EXTRA_MERGER': False, 'RETURN_INTER': False, 'RETURN_STAGES': [], 'BACKBONE': {'TYPE': 'vit_base_patch16_224_ce', 'STRIDE': 16, 'MID_PE': False, 'SEP_SEG': False, 'CAT_MODE': 'direct', 'MERGE_LAYER': 0, 'ADD_CLS_TOKEN': False, 'CLS_TOKEN_USE_MODE': 'ignore', 'CE_LOC': [3, 6, 9], 'CE_KEEP_RATIO': [0.7, 0.7, 0.7], 'CE_TEMPLATE_RANGE': 'CTR_POINT'}, 'HEAD': {'TYPE': 'CENTER', 'NUM_CHANNELS': 256}}
TRAIN configuration: {'LR': 0.0004, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 300, 'LR_DROP_EPOCH': 240, 'BATCH_SIZE': 4, 'NUM_WORKER': 0, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'FREEZE_LAYERS': [0], 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 20, 'GRAD_CLIP_NORM': 0.1, 'AMP': False, 'CE_START_EPOCH': 20, 'CE_WARM_EPOCH': 80, 'DROP_PATH_RATE': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}
DATA configuration: {'SAMPLER_MODE': 'causal', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': 200, 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 256, 'FACTOR': 4.0, 'CENTER_JITTER': 3, 'SCALE_JITTER': 0.25, 'NUMBER': 1}, 'TEMPLATE': {'NUMBER': 1, 'SIZE': 128, 'FACTOR': 2.0, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}

train_script.py — 32 33

3)

No matching checkpoint file found

base_trainer.py — 174

4)

[train: 1, 50 / 15000] FPS: 5.9 (5.0),DataTime: 0.508 (0.002),ForwardTime: 0.171,TotalTime: 0.681,Loss/total: 50.35498,Loss/giou: 1.22484,Loss/l1: 0.28600,Loss/location: 46.47531,IoU: 0.07033

ltr_trainer.py — 112

2. Debug parameter notes

1) settings

2) config

{'MODEL': {'PRETRAIN_FILE': 'mae_pretrain_vit_base.pth', 'EXTRA_MERGER': False, 'RETURN_INTER': False, 'RETURN_STAGES': [], 'BACKBONE': {'TYPE': 'vit_base_patch16_224', 'STRIDE': 16, 'MID_PE': False, 'SEP_SEG': False, 'CAT_MODE': 'direct', 'MERGE_LAYER': 0, 'ADD_CLS_TOKEN': False, 'CLS_TOKEN_USE_MODE': 'ignore', 'CE_LOC': [], 'CE_KEEP_RATIO': [], 'CE_TEMPLATE_RANGE': 'ALL'}, 'HEAD': {'TYPE': 'CENTER', 'NUM_CHANNELS': 256}}, 'TRAIN': {'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 500, 'LR_DROP_EPOCH': 400, 'BATCH_SIZE': 16, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'FREEZE_LAYERS': [0], 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 20, 'GRAD_CLIP_NORM': 0.1, 'AMP': False, 'CE_START_EPOCH': 20, 'CE_WARM_EPOCH': 80, 'DROP_PATH_RATE': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'causal', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': 200, 'TRAIN': {'DATASETS_NAME': ['LASOT', 'GOT10K_vottrain'], 'DATASETS_RATIO': [1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5, 'NUMBER': 1}, 'TEMPLATE': {'NUMBER': 1, 'SIZE': 128, 'FACTOR': 2.0, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 500}}

3) actor

4) self.loaders

5) the final out

6) gt_dict

7) pred_dict

8) model_kwargs

9) data

10) index

11) checkpoint

12) dir_list

13) seq_ids

14) self.sequence_list

15) meta_info

['[METAINFO]\n', 'url: https://youtu.be/ZyPZRpP9dDg\n', 'begin: 00:00:32\n', 'end: 00:00:41\n', 'anno_fps: 10Hz\n', 'object_class: ichneumon\n', 'motion_class: walking\n', 'major_class: viverrine\n', 'root_class: animal\n', 'motion_adverb: slowly\n', 'resolution: (1920, 1080)']

16) object_meta

17) – 27) (values captured only in debug screenshots, not reproduced here; the capture between items 19 and 20 was labeled "after sorting")

3. Entry points

1) Building the dataloader

train_script.py — 48

loader_train, loader_val = build_dataloaders(cfg, settings)

2) Creating the model

train_script.py — 55

net = build_ostrack(cfg)

This includes loading the pretrained weights as well as building the full model.

3) Loss, actor, and optimizer

train_script.py — 71 onwards

The actor here is what executes the training step.

4) Start of the train loop

train_script.py — 88

5) Feeding data into the model

Above this is

actors/ostrack.py — 69

and the real start of feeding data into the model is

ostrack\ostrack.py — 40

6) Forward pass

actors\ostrack.py — 31

The forward propagation step.

7) Compute loss

actors\ostrack.py — 34

8) Resuming from a checkpoint

base_trainer.py — 169

4. Model forward process

1) Into the backbone

x, aux_dict = self.backbone(z=template, x=search,
                            ce_template_mask=ce_template_mask,
                            ce_keep_rate=ce_keep_rate,
                            return_last_attn=return_last_attn, )
# jumps to vit_ce.py --- 191; x: Tensor:(4,320,768)

First, patch_embed is applied:

x = self.patch_embed(x)
z = self.patch_embed(z)

Its forward is:

def forward(self, x):
    # allow different input size
    # B, C, H, W = x.shape
    # _assert(H == self.img_size[0], f"Input image height ({H}) doesn't match model ({self.img_size[0]}).")
    # _assert(W == self.img_size[1], f"Input image width ({W}) doesn't match model ({self.img_size[1]}).")
    x = self.proj(x)  # Tensor:(4,768,16,16)
    if self.flatten:
        x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC, Tensor:(4,256,768)
    x = self.norm(x)  # Tensor:(4,256,768)
    return x

That is, a 16×16 convolution followed by flattening.
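The shape bookkeeping can be checked standalone (a sketch using a fresh Conv2d rather than the pretrained weights): a 256×256 search crop becomes 16×16 = 256 tokens of dimension 768, the 128×128 template becomes 64 tokens, and concatenating them gives the 64 + 256 = 320 tokens seen above.

```python
import torch
import torch.nn as nn

# PatchEmbed's core: a 16x16 conv with stride 16, one output token per patch
proj = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(4, 3, 256, 256)             # search crop (B, C, H, W)
z = torch.randn(4, 3, 128, 128)             # template crop

x_tok = proj(x).flatten(2).transpose(1, 2)  # (4, 768, 16, 16) -> (4, 256, 768)
z_tok = proj(z).flatten(2).transpose(1, 2)  # (4, 768, 8, 8)   -> (4, 64, 768)

tokens = torch.cat([z_tok, x_tok], dim=1)
print(tokens.shape)  # torch.Size([4, 320, 768])
```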

The QKV projection in the attention (attn.py — 37):

qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
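That one line packs the linear projection and the head split together; the shapes can be traced with the batch seen above (4 samples, 320 tokens; 12 heads of dimension 64 is an assumption taken from the standard ViT-B configuration):

```python
import torch
import torch.nn as nn

B, N, C, num_heads = 4, 320, 768, 12
qkv_proj = nn.Linear(C, 3 * C)          # the self.qkv layer: 768 -> 2304

x = torch.randn(B, N, C)
qkv = qkv_proj(x).reshape(B, N, 3, num_heads, C // num_heads).permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0)                 # each: (B, num_heads, N, head_dim)
print(q.shape)  # torch.Size([4, 12, 320, 64])

# the attention map covers all 320 tokens, template and search jointly
attn = (q @ k.transpose(-2, -1)) * (C // num_heads) ** -0.5  # (4, 12, 320, 320)
```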

The output bbox is computed by:

def cal_bbox(self, score_map_ctr, size_map, offset_map, return_score=False):
    # both Tensor:(4,1): per-batch max score and its flat index
    max_score, idx = torch.max(score_map_ctr.flatten(1), dim=1, keepdim=True)
    idx_y = idx // self.feat_sz  # Tensor:(4,1)
    idx_x = idx % self.feat_sz   # Tensor:(4,1)

    idx = idx.unsqueeze(1).expand(idx.shape[0], 2, 1)            # Tensor:(4,2,1)
    size = size_map.flatten(2).gather(dim=2, index=idx)          # Tensor:(4,2,1)
    offset = offset_map.flatten(2).gather(dim=2, index=idx).squeeze(-1)  # Tensor:(4,2)

    # bbox = torch.cat([idx_x - size[:, 0] / 2, idx_y - size[:, 1] / 2,
    #                   idx_x + size[:, 0] / 2, idx_y + size[:, 1] / 2], dim=1) / self.feat_sz
    # cx, cy, w, h
    bbox = torch.cat([(idx_x.to(torch.float) + offset[:, :1]) / self.feat_sz,
                      (idx_y.to(torch.float) + offset[:, 1:]) / self.feat_sz,
                      size.squeeze(-1)], dim=1)  # Tensor:(4,4)

    if return_score:
        return bbox, max_score
    return bbox

5. Loading the pretrained backbone

Where the backbone model is defined:

vit_ce.py — 197

Backbone structure: the VisionTransformerCE portion of the full OSTrack printout shown under "Network structure" above (patch_embed, 12 CEBlocks, final LayerNorm).

Contents of the loaded cfg file:

{'MODEL': {'PRETRAIN_FILE': 'mae_pretrain_vit_base.pth', 'EXTRA_MERGER': False, 'RETURN_INTER': False, 'RETURN_STAGES': [], 'BACKBONE': {'TYPE': 'vit_base_patch16_224_ce', 'STRIDE': 16, 'MID_PE': False, 'SEP_SEG': False, 'CAT_MODE': 'direct', 'MERGE_LAYER': 0, 'ADD_CLS_TOKEN': False, 'CLS_TOKEN_USE_MODE': 'ignore', 'CE_LOC': [3, 6, 9], 'CE_KEEP_RATIO': [0.7, 0.7, 0.7], 'CE_TEMPLATE_RANGE': 'CTR_POINT'}, 'HEAD': {'TYPE': 'CENTER', 'NUM_CHANNELS': 256}},
 'TRAIN': {'LR': 0.0004, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 300, 'LR_DROP_EPOCH': 240, 'BATCH_SIZE': 4, 'NUM_WORKER': 0, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'FREEZE_LAYERS': [0], 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 20, 'GRAD_CLIP_NORM': 0.1, 'AMP': False, 'CE_START_EPOCH': 20, 'CE_WARM_EPOCH': 80, 'DROP_PATH_RATE': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}},
 'DATA': {'SAMPLER_MODE': 'causal', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': 200, 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 256, 'FACTOR': 4.0, 'CENTER_JITTER': 3, 'SCALE_JITTER': 0.25, 'NUMBER': 1}, 'TEMPLATE': {'NUMBER': 1, 'SIZE': 128, 'FACTOR': 2.0, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}},
 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 4.0, 'SEARCH_SIZE': 256, 'EPOCH': 300}}

6. Label design

gt_gaussian_maps (written "gt_guassuan_pans" in my original notes)

It is built from the gt_bbox; this is the classification label.
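A minimal sketch of what such a label looks like, assuming a 16×16 score map matching the head's output (the actual kernel/radius handling in the repo's heatmap utilities will differ):

```python
import numpy as np

def gaussian_label(cx, cy, feat_sz=16, sigma=1.0):
    """2D Gaussian classification label centered on the gt bbox center."""
    ys, xs = np.mgrid[0:feat_sz, 0:feat_sz]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

label = gaussian_label(cx=7, cy=9)
print(label.shape, label[9, 7])  # (16, 16) 1.0 — the peak sits at the gt center
```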

7. Creating the dataloader

train_script.py — 48

Data augmentation is used here.

Data loading ties into the DataLoader/Dataset mechanism: a custom dataset must subclass Dataset and override the __len__ and __getitem__ methods.

The implementation here mainly lives in data\sampler.py.
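The contract the sampler fulfils can be shown with a toy dataset (illustrative only — TrackingSampler's real __getitem__ samples a random video and frame pair and ignores the index):

```python
from torch.utils.data import Dataset, DataLoader

class ToySampler(Dataset):
    """Minimal Dataset mimicking TrackingSampler's interface."""
    def __init__(self, samples_per_epoch):
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        # a fixed virtual epoch length, not the number of videos on disk
        return self.samples_per_epoch

    def __getitem__(self, index):
        # the real sampler draws a random sequence and frame pair here
        return {"template": index, "search": index}

loader = DataLoader(ToySampler(8), batch_size=4)
print(len(loader))  # 2 batches per "epoch"
```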

For GOT10k, each video sequence folder contains the files shown below.

Among them, absence.label is the occlusion label; example contents:

with open(occlusion_file, 'r', newline='') as f:
    occlusion = torch.ByteTensor([int(v[0]) for v in csv.reader(f)])
# => tensor([0, 0, 0, ..., 0, 0], dtype=torch.uint8)  (all zeros for this 110-frame sequence)

cover.label example:

with open(cover_file, 'r', newline='') as f:
    cover = torch.ByteTensor([int(v[0]) for v in csv.reader(f)])  # Tensor:(110,)
# => tensor([8, 8, ..., 8], dtype=torch.uint8)  (mostly 8s, dipping to 7, 5, 1, 2, 4, 5 around frames 14-19)

When sampling from a video sequence, the template frame must come before the search frame:

# Sample test and train frames in a causal manner, i.e. search_frame_ids > template_frame_ids
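A simplified sketch of that causal sampling (hypothetical helper, not the repo's code; the real sampler also retries on occluded frames and honors MAX_SAMPLE_INTERVAL, shown here as max_gap):

```python
import random

def sample_causal(num_frames, max_gap=200):
    """Pick a template frame before a search frame, within max_gap frames."""
    template_id = random.randrange(0, num_frames - 1)
    search_id = random.randrange(template_id + 1,
                                 min(num_frames, template_id + 1 + max_gap))
    return template_id, search_id

t, s = sample_causal(110)
print(t < s)  # True: search_frame_ids > template_frame_ids always holds
```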

Building one training unit, data:

sampler.py — 157

data = TensorDict({'template_images': template_frames,
                   'template_anno': template_anno['bbox'],
                   'template_masks': template_masks,
                   'search_images': search_frames,
                   'search_anno': search_anno['bbox'],
                   'search_masks': search_masks,
                   'dataset': dataset.get_name(),
                   'test_class': meta_obj_test.get('object_class_name')})

The crop region is derived from the bbox, and the input is resized to 128×128 (processing_utils.py — 68).

The input normalization:

transforms.py — 255

The order of operations on the input data:

first scale the (0~255) range to (0, 1),

then apply the augmentations,

and finally normalize:

def transform_image(self, image):
    return tvisf.normalize(image, self.mean, self.std, self.inplace)
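Putting the three steps together as a sketch, with the MEAN/STD from the config (the real pipeline routes this through the Transform classes in transforms.py):

```python
import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

img = torch.randint(0, 256, (3, 128, 128), dtype=torch.uint8)
x = img.float() / 255.0        # step 1: (0~255) -> (0, 1)
x = torch.flip(x, dims=[2])    # step 2: an augmentation, e.g. horizontal flip
x = (x - mean) / std           # step 3: normalize, as tvisf.normalize does
print(x.shape)  # torch.Size([3, 128, 128])
```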

The effect of the random numbers (transforms.py — 102):

rand_params = self.roll()

The rolled rand_params differ between processing the template and processing the search region, i.e. the random numbers are not shared between template and search region.

8. Data loading process

Data loading is implemented in the sampler, which overrides the Dataset methods, so the DataLoader pulls input data from here:

class TrackingSampler(torch.utils.data.Dataset):

The processing module then handles the raw input data: cropping/resizing, data augmentation, normalization, and so on.

Note: whether lmdb is used is controlled by the use_lmdb argument.

9. Final prediction

def forward(self, x, gt_score_map=None):
    """ Forward pass with input x. """
    # Tensor:(4,1,16,16), Tensor:(4,2,16,16), Tensor:(4,2,16,16)
    score_map_ctr, size_map, offset_map = self.get_score_map(x)

    # assert gt_score_map is None
    if gt_score_map is None:  # True
        bbox = self.cal_bbox(score_map_ctr, size_map, offset_map)  # Tensor:(4,4)
    else:
        bbox = self.cal_bbox(gt_score_map.unsqueeze(1), size_map, offset_map)

    return score_map_ctr, bbox, size_map, offset_map

All of these outputs are used; combining them yields the bbox (head.py — 131).

10. Saving the trained model

base_trainer.py — 198

# only save the last 10 checkpoints
save_every_epoch = getattr(self.settings, "save_every_epoch", False)
save_epochs = [79, 159, 239]
if epoch > (max_epochs - 1) or save_every_epoch or epoch % 40 == 0 or epoch in save_epochs or epoch > (max_epochs - 5):
    # if epoch > (max_epochs - 10) or save_every_epoch or epoch % 100 == 0:
    if self._checkpoint_dir:
        if self.settings.local_rank in [-1, 0]:
            self.save_checkpoint()
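With max_epochs = 300 the condition above boils down to a fixed set of saved epochs, which can be enumerated directly (ignoring save_every_epoch):

```python
max_epochs = 300
save_epochs = [79, 159, 239]

saved = [e for e in range(1, max_epochs + 1)
         if e > (max_epochs - 1) or e % 40 == 0
         or e in save_epochs or e > (max_epochs - 5)]
print(saved)
# [40, 79, 80, 120, 159, 160, 200, 239, 240, 280, 296, 297, 298, 299, 300]
```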