
MVVA-net: a video aesthetic quality assessment network with cognitive fusion of multi-type feature–based strong generalization. [Dataset]

Contributors

Min Li
Data Collector

Zheng Wang
Data Collector

Meijun Sun
Data Collector

Abstract

Most of the existing video aesthetic quality assessment datasets (see Table 1 of the related paper) are not public; some are not large enough, which causes the trained deep models to perform poorly; and some use the professionalism of the video shooting, or the ratings of video-website users, as the criterion for evaluating video aesthetic quality. To solve these problems, the authors built a large-scale short video aesthetics (SVA) dataset with a scientific annotation method. SVA includes 6900 edited videos from YouTube and AVAQ6000, each lasting 10 to 30 s. The labeling process involves 15 viewers of different genders and ages. Before labeling, each viewer watches a set of indicative videos of high and low aesthetic quality. When labeling, the viewer watches the video and assigns it an aesthetic quality score of 1 to 10 points, of which 1 to 5 points denote low aesthetic quality and 6 to 10 points denote high aesthetic quality. After labeling, the final decimal aesthetic score of each video is the average score after the highest and lowest scores are removed. If the decimal aesthetic score of a video is greater than σ, the video is considered to be of high aesthetic quality; otherwise, the video is considered to be of low aesthetic quality. The paper sets σ to 5. In SVA, 3735 videos are labeled as high aesthetic quality and 3165 videos as low aesthetic quality.
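
For concreteness, a minimal Python sketch of the scoring arithmetic described above; the function name `aesthetic_label` is hypothetical, and the sketch assumes that exactly one highest and one lowest rating are discarded before averaging:

```python
SIGMA = 5  # decision threshold from the abstract

def aesthetic_label(ratings, sigma=SIGMA):
    """Aggregate 15 viewer ratings (1-10) into a decimal score and a binary label.

    One highest and one lowest score are removed (an assumption about the
    trimming rule), the rest are averaged, and scores above sigma count
    as high aesthetic quality.
    """
    if len(ratings) != 15:
        raise ValueError("SVA uses 15 viewers per video")
    trimmed = sorted(ratings)[1:-1]       # drop the highest and lowest scores
    score = sum(trimmed) / len(trimmed)   # decimal aesthetic score
    return score, "high" if score > sigma else "low"

# Example: aggregating one video's ratings from the 15 viewers
score, label = aesthetic_label([7, 8, 6, 9, 5, 7, 8, 6, 7, 10, 4, 7, 6, 8, 7])
print(f"{score:.2f} -> {label}")
```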

Citation

LI, M., WANG, Z., REN, J. and SUN, M. 2022. MVVA-net: a video aesthetic quality assessment network with cognitive fusion of multi-type feature–based strong generalization. [Dataset]. Hosted on GitHub [online]. Available from: https://github.com/Lm0324/MVVA-Net

Acceptance Date Sep 29, 2021
Online Publication Date Mar 12, 2022
Publication Date Jul 31, 2022
Deposit Date Jun 16, 2022
Publicly Available Date Mar 28, 2024
Keywords Videos; Social media platforms; Aesthetic quality; Short video aesthetics (SVA); Multi-type feature fusion network (MVVA-Net)
Public URL https://rgu-repository.worktribe.com/output/1628665
Publisher URL https://github.com/Lm0324/MVVA-Net
Related Public URLs https://rgu-repository.worktribe.com/output/1628644
Type of Data SVA files
Collection Date Mar 12, 2022
Collection Method In this study a dataset of 6900 short videos was constructed. To comprehensively consider intra-frame and inter-frame aesthetics, and to improve the generalization ability of the model, the authors propose a video aesthetic quality assessment method that fuses multiple types of features under a non-fixed model-input strategy. The method designs two branches to extract intra-frame aesthetic features and inter-frame aesthetic features of the video, respectively; the two branches take different types of video frames, namely key frames and sequential frames, as input. Sequential frames are extracted from the video at a fixed interval and capture the changing relationships between frames, so they are used as the input of the inter-frame aesthetic branch. Key frames are obtained by the frame-difference method and represent the distinct pictures within the video, so they are used as the input of the intra-frame aesthetic branch (see the sampling sketch below). The multi-type features extracted by the two branches are fused adaptively to evaluate the aesthetic quality of the video, and both branches support videos of different durations, with different numbers of frames, as input. The experimental results show that the model performs well on different datasets and demonstrates strong generalization ability.
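
As an illustration of the two sampling schemes named above, here is a minimal sketch using OpenCV; the sampling interval, the frame-difference threshold, and the function names are assumptions for illustration, not the authors' implementation:

```python
import cv2
import numpy as np

def sequential_frames(path, interval=30):
    """Sample frames at a fixed interval (input to the inter-frame branch)."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:          # keep every `interval`-th frame
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def key_frames(path, diff_threshold=30.0):
    """Select key frames by the frame-difference method (input to the intra-frame branch)."""
    cap = cv2.VideoCapture(path)
    keys, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Keep the frame when its mean absolute difference from the previous
        # frame is large enough, i.e. the picture content has changed.
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            keys.append(frame)
        prev_gray = gray
    cap.release()
    return keys
```

Because both functions return lists of arbitrary length, the sketch is consistent with the branches' stated support for videos of different durations and frame counts.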