Contents
Understanding TimeSformer
Using a TimeSformer pretrained model to extract video features (Linux code walkthrough)
1. Download the official code
2. Create the environment
3. Prepare the dataset for pretraining
4. Run pretraining
1) Choose a model configuration
2) Run the program
Extract video features with the pretrained model and save them as .npy files
Understanding TimeSformer
On studying the paper
Using a TimeSformer pretrained model to extract video features (Linux code walkthrough)
1. Download the official code:

git clone https://github.com/facebookresearch/TimeSformer
cd TimeSformer   # enter the repository folder
2. Create the environment:

# create the conda environment
conda create -n TimeSformer python=3.7 -y
# activate it
conda activate TimeSformer
# install PyTorch and related packages
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
# install the remaining packages following the official steps
pip install 'git+https://github.com/facebookresearch/fvcore'
pip install simplejson
pip install einops
pip install timm
conda install av -c conda-forge
pip install psutil
pip install scikit-learn
pip install opencv-python
pip install tensorboard
pip install matplotlib
pip install scipy
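Optionally, a quick Python check (not part of the official steps, just a suggestion) confirms that the key packages import and that CUDA is visible:

# environment sanity check (suggested, not from the official instructions)
import torch, torchvision, timm, einops, av
print(torch.__version__, torchvision.__version__)  # expect 1.8.0 / 0.9.0
print(torch.cuda.is_available())                   # should print True on a CUDA-capable machine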
3. Prepare the dataset for pretraining:

(Here I downloaded the untrimmed videos of THUMOS14.)

# download the THUMOS14 validation-set archive
wget -c https://storage.googleapis.com/thumos14_files/TH14_validation_set_mp4.zip
# unzip the dataset
unzip TH14_validation_set_mp4.zip
1) Generate train.csv

The official Kinetics video loader is constructed from a csv file with the following format (quoted from the loader's docstring); a minimal generation sketch is given below:

path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N
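The repo does not ship a script for this step, so here is a minimal sketch that writes such a csv from the unzipped THUMOS14 folder. The paths and the dummy label are assumptions to adapt to your own setup (THUMOS14 untrimmed videos do not carry a single Kinetics-style class label):

# make_train_csv.py -- minimal sketch; paths and the dummy label are assumptions
import os

video_dir = '/Video_feature_extraction/TH14_validation_set_mp4'  # assumed unzip location
csv_path = 'train.csv'
dummy_label = 0  # placeholder label; replace with real labels if you have them

with open(csv_path, 'w') as f:
    for name in sorted(os.listdir(video_dir)):
        if name.endswith('.mp4'):
            f.write('{} {}\n'.format(os.path.join(video_dir, name), dummy_label))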
4. Run pretraining

1) Choose a model configuration:

Here I chose TimeSformer_divST_16x16_448.yaml. Edit line 9 (the dataset path) and line 42 (the number of GPUs, according to how many you actually have).
2) Run the program:

First copy run_net.py from the TimeSformer/tools/ folder into the TimeSformer/ folder, then run:
python run_net.py --cfg configs/Kinetics/TimeSformer_divST_16x16_448.yaml
If the initial pretrained weights fail to download automatically, copy the URL into a browser, download the file there, and then upload it to the corresponding folder on the server.
Extract video features with the pretrained model and save them as .npy files
1) First, create Video_frame_lift.py inside the TimeSformer folder.

The model takes images as input, so each video must first be decoded into frames that are saved to disk. Depending on the model configuration, 8, 16, or 32 frames are ultimately fed to the model; the default strategy picks frames uniformly from equal segments, and you can change it (a small sketch of this sampling strategy is shown below).
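For reference, uniform segment sampling splits the frame list into num_segments equal segments and takes the middle frame of each. The helper below is illustrative only; the same logic appears inside dataloader.py later:

# illustrative sketch of uniform segment sampling (not part of the repo)
def uniform_sample_indices(total_frames, num_segments=8):
    seg_len = total_frames // num_segments
    # 1-based indices, matching frame files named 00001.jpg, 00002.jpg, ...
    return [int(seg_len / 2.0 + seg_len * k) + 1 for k in range(num_segments)]

print(uniform_sample_indices(100, 8))  # -> [7, 19, 31, 43, 55, 67, 79, 91]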
First prepare a file that lists the videos, which makes batch processing easier. Here I generate a txt file where each line is video name + '\t' + video path; the generation code is as follows:
# Video_frame_lift.py
import os

path = '/Video_feature_extraction/TH14_validation_set_mp4'   # directory to walk
txt_path = '/Video_feature_extraction/video_validation.txt'  # path of the generated txt file

with open(txt_path, 'w') as f:
    for root, dirs, names in os.walk(path):
        for name in names:
            ext = os.path.splitext(name)[1]  # file extension
            if ext == '.mp4':
                video_path = os.path.join(root, name)  # full path of the mp4 file
                video_name = name.split('.')[0]
                f.write(video_name + '\t' + video_path + '\n')
2) Next, extract frames from the videos with ffmpeg; create ffmpeg.py:
# ffmpeg.py
import os
import sys
import subprocess

OUT_DATA_DIR = "/Video_feature_extraction/validation_pics"   # output folder for the extracted frames
txt_path = "/Video_feature_extraction/video_validation.txt"
filelist = []
i = 1
with open(txt_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')
        video_name = line.split('\t')[0].split('.')[0]
        dst_path = os.path.join(OUT_DATA_DIR, video_name)
        video_path = line.split('\t')[1]
        if not os.path.exists(dst_path):
            os.makedirs(dst_path)
        print(i)
        i += 1
        cmd = 'ffmpeg -i {} -r 1 -q:v 2 -f image2 {}/%05d.jpg'.format(video_path, dst_path)
        print(cmd)
        subprocess.call(cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Note: if the generated folders contain no extracted frames, it may be an ffmpeg version problem.
3) Create a models folder inside the TimeSformer folder, then create transforms.py
(i.e. TimeSformer/models/transforms.py)
# transforms.py
import torchvision
import random
from PIL import Image, ImageOps
import numpy as np
import numbers
import math
import torch


class GroupRandomCrop(object):
    def __init__(self, size):
        if isinstance(size, numbers.Number):
            self.size = (int(size), int(size))
        else:
            self.size = size

    def __call__(self, img_group):
        w, h = img_group[0].size
        th, tw = self.size

        out_images = list()

        x1 = random.randint(0, w - tw)
        y1 = random.randint(0, h - th)

        for img in img_group:
            assert(img.size[0] == w and img.size[1] == h)
            if w == tw and h == th:
                out_images.append(img)
            else:
                out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))

        return out_images


class GroupCenterCrop(object):
    def __init__(self, size):
        self.worker = torchvision.transforms.CenterCrop(size)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


class GroupRandomHorizontalFlip(object):
    """Randomly horizontally flips the given PIL.Image with a probability of 0.5"""
    def __init__(self, is_flow=False):
        self.is_flow = is_flow

    def __call__(self, img_group, is_flow=False):
        v = random.random()
        if v < 0.5:
            ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
            if self.is_flow:
                for i in range(0, len(ret), 2):
                    ret[i] = ImageOps.invert(ret[i])  # invert flow pixel values when flipping
            return ret
        else:
            return img_group


class GroupNormalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        rep_mean = self.mean * (tensor.size()[0] // len(self.mean))
        rep_std = self.std * (tensor.size()[0] // len(self.std))

        for t, m, s in zip(tensor, rep_mean, rep_std):
            t.sub_(m).div_(s)

        return tensor


class GroupScale(object):
    """ Rescales the input PIL.Image to the given 'size'.
    'size' will be the size of the smaller edge.
    For example, if height > width, then image will be
    rescaled to (size * height / width, size)
    size: size of the smaller edge
    interpolation: Default: PIL.Image.BILINEAR
    """
    def __init__(self, size, interpolation=Image.BILINEAR):
        self.worker = torchvision.transforms.Resize(size, interpolation)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


class GroupOverSample(object):
    def __init__(self, crop_size, scale_size=None):
        self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size)

        if scale_size is not None:
            self.scale_worker = GroupScale(scale_size)
        else:
            self.scale_worker = None

    def __call__(self, img_group):
        if self.scale_worker is not None:
            img_group = self.scale_worker(img_group)

        image_w, image_h = img_group[0].size
        crop_w, crop_h = self.crop_size

        offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h)
        oversample_group = list()
        for o_w, o_h in offsets:
            normal_group = list()
            flip_group = list()
            for i, img in enumerate(img_group):
                crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h))
                normal_group.append(crop)
                flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT)

                if img.mode == 'L' and i % 2 == 0:
                    flip_group.append(ImageOps.invert(flip_crop))
                else:
                    flip_group.append(flip_crop)

            oversample_group.extend(normal_group)
            oversample_group.extend(flip_group)
        return oversample_group


class GroupMultiScaleCrop(object):
    # Crops every image in the group at a randomly chosen scale and position,
    # then resizes the crops to input_size (e.g. 224 x 224).
    def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True):
        self.scales = scales if scales is not None else [1, .875, .75, .66]
        self.max_distort = max_distort
        self.fix_crop = fix_crop
        self.more_fix_crop = more_fix_crop
        self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size]
        self.interpolation = Image.BILINEAR  # bilinear interpolation

    def __call__(self, img_group):
        # you never call this directly; the Dataset class passes img_group in
        im_size = img_group[0].size  # size of the first image

        crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size)
        crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h))
                          for img in img_group]
        ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation)
                         for img in crop_img_group]
        return ret_img_group  # not stacked yet; the group is returned as a list of images

    def _sample_crop_size(self, im_size):
        image_w, image_h = im_size[0], im_size[1]  # width, height

        # candidate crop sizes: the shorter edge scaled by [1, .875, .75, .66]
        base_size = min(image_w, image_h)
        crop_sizes = [int(base_size * x) for x in self.scales]
        # snap any size within 3 pixels of the target input size to the input size itself
        crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes]
        crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes]

        pairs = []
        for i, h in enumerate(crop_h):
            for j, w in enumerate(crop_w):
                if abs(i - j) <= self.max_distort:
                    pairs.append((w, h))

        crop_pair = random.choice(pairs)  # randomly pick one crop size
        if not self.fix_crop:
            w_offset = random.randint(0, image_w - crop_pair[0])
            h_offset = random.randint(0, image_h - crop_pair[1])
        else:
            w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1])

        return crop_pair[0], crop_pair[1], w_offset, h_offset

    def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h):
        offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h)
        return random.choice(offsets)

    @staticmethod
    def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h):
        w_step = (image_w - crop_w) // 4
        h_step = (image_h - crop_h) // 4

        ret = list()
        ret.append((0, 0))                    # upper left
        ret.append((4 * w_step, 0))           # upper right
        ret.append((0, 4 * h_step))           # lower left
        ret.append((4 * w_step, 4 * h_step))  # lower right
        ret.append((2 * w_step, 2 * h_step))  # center

        if more_fix_crop:
            ret.append((0, 2 * h_step))           # center left
            ret.append((4 * w_step, 2 * h_step))  # center right
            ret.append((2 * w_step, 4 * h_step))  # lower center
            ret.append((2 * w_step, 0 * h_step))  # upper center

            ret.append((1 * w_step, 1 * h_step))  # upper left quarter
            ret.append((3 * w_step, 1 * h_step))  # upper right quarter
            ret.append((1 * w_step, 3 * h_step))  # lower left quarter
            ret.append((3 * w_step, 3 * h_step))  # lower right quarter

        return ret


class GroupRandomSizedCrop(object):
    """Random crop the given PIL.Image to a random size of (0.08 to 1.0) of the original size
    and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio.
    This is popularly used to train the Inception networks.
    size: size of the smaller edge
    interpolation: Default: PIL.Image.BILINEAR
    """
    def __init__(self, size, interpolation=Image.BILINEAR):
        self.size = size
        self.interpolation = interpolation

    def __call__(self, img_group):
        for attempt in range(10):
            area = img_group[0].size[0] * img_group[0].size[1]
            target_area = random.uniform(0.08, 1.0) * area
            aspect_ratio = random.uniform(3. / 4, 4. / 3)

            w = int(round(math.sqrt(target_area * aspect_ratio)))
            h = int(round(math.sqrt(target_area / aspect_ratio)))

            if random.random() < 0.5:
                w, h = h, w

            if w <= img_group[0].size[0] and h <= img_group[0].size[1]:
                x1 = random.randint(0, img_group[0].size[0] - w)
                y1 = random.randint(0, img_group[0].size[1] - h)
                found = True
                break
        else:
            found = False
            x1 = 0
            y1 = 0

        if found:
            out_group = list()
            for img in img_group:
                img = img.crop((x1, y1, x1 + w, y1 + h))
                assert(img.size == (w, h))
                out_group.append(img.resize((self.size, self.size), self.interpolation))
            return out_group
        else:
            # Fallback
            scale = GroupScale(self.size, interpolation=self.interpolation)
            crop = GroupRandomCrop(self.size)
            return crop(scale(img_group))


class Stack(object):
    # Stacks all images of the group along the channel axis into one (H x W x 3N) array.
    def __init__(self, roll=False):
        self.roll = roll

    def __call__(self, img_group):
        if img_group[0].mode == 'L':
            return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2)
        elif img_group[0].mode == 'RGB':
            if self.roll:
                # [:, :, ::-1] reverses the channel order (RGB -> BGR) before stacking
                return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2)
            else:
                return np.concatenate(img_group, axis=2)


class ToTorchFormatTensor(object):
    # Accepts either a numpy array or a PIL Image.
    """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255]
    to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """
    def __init__(self, div=True):
        self.div = div

    def __call__(self, pic):
        # ToTorchFormatTensor processes a single (stacked) image, so Stack must be applied first.
        if isinstance(pic, np.ndarray):
            # handle numpy array
            img = torch.from_numpy(pic).permute(2, 0, 1).contiguous()
        else:
            # handle PIL Image: a byte tensor with values in [0, 255]
            img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
            img = img.view(pic.size[1], pic.size[0], len(pic.mode))
            # put it from HWC to CHW format
            # yikes, this transpose takes 80% of the loading time/CPU
            img = img.transpose(0, 1).transpose(0, 2).contiguous()
        # divide by 255 only when div is True
        return img.float().div(255) if self.div else img.float()


class IdentityTransform(object):
    def __call__(self, data):
        return data


if __name__ == "__main__":
    trans = torchvision.transforms.Compose([
        GroupScale(256),
        GroupRandomCrop(224),
        Stack(),
        ToTorchFormatTensor(),
        GroupNormalize(
            mean=[.485, .456, .406],
            std=[.229, .224, .225]
        )]
    )

    im = Image.open('')  # test image path (left empty in the original post)

    color_group = [im] * 3
    rst = trans(color_group)

    gray_group = [im.convert('L')] * 9
    gray_rst = trans(gray_group)

    trans2 = torchvision.transforms.Compose([
        GroupRandomSizedCrop(256),
        Stack(),
        ToTorchFormatTensor(),
        GroupNormalize(
            mean=[.485, .456, .406],
            std=[.229, .224, .225])
    ])
    print(trans2(color_group))
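As a quick sanity check, the validation-style transform chain used later in dataloader.py can be applied to a few dummy frames. This is only a sketch; the dummy images and sizes are placeholders:

# sketch: run 8 dummy frames through the validation transform chain
import torchvision
from PIL import Image
from models.transforms import GroupScale, GroupCenterCrop, Stack, ToTorchFormatTensor, GroupNormalize

frames = [Image.new('RGB', (320, 240)) for _ in range(8)]  # stand-ins for the extracted jpgs
transform_val = torchvision.transforms.Compose([
    GroupScale(256),
    GroupCenterCrop(224),
    Stack(roll=False),
    ToTorchFormatTensor(div=True),
    GroupNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
x = transform_val(frames)
print(x.shape)  # torch.Size([24, 224, 224]): 8 frames x 3 channels stacked along dim 0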
4) Next, create dataloader.py inside the TimeSformer folder.
# dataloader.py
import json
import torchvision
import random
import os
import numpy as np
import torch
import torch.nn.functional as F
import cv2
from torch.utils.data import Dataset
from torch.autograd import Variable

from models.transforms import *


class VideoClassificationDataset(Dataset):
    def __init__(self, opt, mode):
        # python 3
        # super().__init__()
        super(VideoClassificationDataset, self).__init__()
        self.mode = mode  # to load train/val/test data
        self.feats_dir = opt['feats_dir']
        if self.mode == 'val':
            self.n = 5000  # number of videos to extract; set this to match your video list

        if self.mode != 'inference':
            print(f'load feats from {self.feats_dir}')
            with open(self.feats_dir) as f:
                feat_class_list = f.readlines()
            self.feat_class_list = feat_class_list

        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]

        model_transform_params = {
            "side_size": 256,
            "crop_size": 224,
            "num_segments": 8,
            "sampling_rate": 5
        }

        # Get transform parameters based on model
        transform_params = model_transform_params

        transform_train = torchvision.transforms.Compose([
            GroupMultiScaleCrop(transform_params["crop_size"], [1, .875, .75, .66]),
            GroupRandomHorizontalFlip(is_flow=False),
            Stack(roll=False),
            ToTorchFormatTensor(div=True),
            GroupNormalize(mean, std),
        ])
        transform_val = torchvision.transforms.Compose([
            GroupScale(int(transform_params["side_size"])),
            GroupCenterCrop(transform_params["crop_size"]),
            Stack(roll=False),
            ToTorchFormatTensor(div=True),
            GroupNormalize(mean, std),
        ])

        self.transform_params = transform_params
        self.transform_train = transform_train
        self.transform_val = transform_val

        print("Finished initializing dataloader.")

    def __getitem__(self, ix):
        """This function returns a tuple that is further passed to collate_fn
        """
        ix = ix % self.n
        fc_feat = self._load_video(ix)
        data = {
            'fc_feats': Variable(fc_feat),
            'video_id': ix,
        }
        return data

    def __len__(self):
        return self.n

    def _load_video(self, idx):
        prefix = '{:05d}.jpg'
        feat_path_list = []
        for i in range(len(self.feat_class_list)):
            video_name = self.feat_class_list[i].rstrip('\n').split('\t')[0] + '-'
            feat_path = self.feat_class_list[i].rstrip('\n').split('\t')[1]
            feat_path_list.append(feat_path)

        video_data = {}
        if self.mode == 'val':
            images = []
            frame_list = os.listdir(feat_path_list[idx])
            average_duration = len(frame_list) // self.transform_params["num_segments"]
            # offsets are the sampled frame indices (middle frame of each segment)
            offsets = np.array([int(average_duration / 2.0 + average_duration * x)
                                for x in range(self.transform_params["num_segments"])])
            offsets = offsets + 1
            for seg_ind in offsets:
                p = int(seg_ind)
                seg_imgs = Image.open(os.path.join(feat_path_list[idx], prefix.format(p))).convert('RGB')
                images.append(seg_imgs)
            video_data = self.transform_val(images)
            video_data = video_data.view((-1, self.transform_params["num_segments"]) + video_data.size()[1:])

        return video_data
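A minimal sanity check of the dataloader, assuming video_validation.txt exists and the frames have already been extracted as described above:

# sketch: inspect the shape of one loaded sample
from dataloader import VideoClassificationDataset

ds = VideoClassificationDataset({'feats_dir': './video_validation.txt'}, 'val')
sample = ds[0]
print(sample['video_id'], sample['fc_feats'].shape)  # expected: torch.Size([3, 8, 224, 224])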
5) Extract the video features and save them as .npy files
First choose the model you want to use. Here I use the pretrained models released by the authors (a VPN may be needed to download them) and download them locally.
I downloaded these two:
TimeSformer_divST_8x32_224_K400.pyth
TimeSformer_divST_16x16_448_K600.pyth
For consistency during feature extraction, the model should be loaded in eval() mode, so that the features extracted from the same video are identical on every run. Create extract.py inside the TimeSformer folder:
# extract.py
import argparse
import os
import torch
import numpy as np
from torch.utils.data import DataLoader
import random

from dataloader import VideoClassificationDataset
from timesformer.models.vit import TimeSformer

device = torch.device("cuda:6")  # pick the GPU you want to use

if __name__ == '__main__':
    opt = argparse.ArgumentParser()
    opt.add_argument('test_list_dir', help="Path to the video list txt file (e.g. video_validation.txt).")
    opt = vars(opt.parse_args())
    test_opts = {'feats_dir': opt['test_list_dir']}

    # ================= build the model ======================
    model = TimeSformer(img_size=224, num_classes=20, num_frames=8,
                        attention_type='divided_space_time',
                        pretrained_model='checkpoints/TimeSformer_divST_16x16_448_K600.pyth')
    model = model.eval().to(device)
    print(model)

    # ================= data loading ==========================
    print("Use", torch.cuda.device_count(), 'gpus')
    test_loader = {}
    test_dataset = VideoClassificationDataset(test_opts, 'val')
    test_loader = DataLoader(test_dataset, batch_size=1, num_workers=6, shuffle=False)

    # ================= feature extraction =====================
    os.makedirs('video_feature', exist_ok=True)  # np.save does not create the output folder
    i = 0
    file1 = open("./video_validation.txt")
    file1_list = file1.readlines()
    for data in test_loader:
        model_input = data['fc_feats'].to(device)
        name_feature = file1_list[i].rstrip().split('\t')[0].split('.')[0]
        i = i + 1
        out = model(model_input)
        out = out.squeeze(0)
        out = out.cpu().detach().numpy()
        print("out.shape:", out.shape)
        np.save('video_feature/' + name_feature + '.npy', out)
        print(i)
Then run from the terminal:
python extract.py ./video_validation.txt
This runs the extraction and writes one .npy file per video into the video_feature/ folder.
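The saved arrays can then be loaded back with NumPy for downstream use. The file name below is only an example of the naming produced by extract.py, i.e. the video name taken from video_validation.txt:

# sketch: load one extracted feature file back
import numpy as np

feat = np.load('video_feature/video_validation_0000001.npy')  # example file name
print(feat.shape, feat.dtype)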
References:
使用TimeSformer预训练模型提取视频特征 (yxy520ya, CSDN blog)
transforms工具类 (大侠刷题啦, CSDN blog)