TY - GEN
T1 - An Efficient and Lightweight Structure for Spatial-Temporal Feature Extraction in Video Super Resolution
AU - He, Xiaonan
AU - Xia, Yukun
AU - Qiao, Yuansong
AU - Lee, Brian
AU - Ye, Yuhang
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2024
Y1 - 2024
N2 - Video Super Resolution (VSR) models based on deep convolutional neural networks (CNNs) take multiple Low-Resolution (LR) frames as input and can effectively recover High-Resolution (HR) frames while preserving the video's temporal information. However, to realize these advantages, VSR must exploit both spatial and temporal information to improve the perceived quality of the output video, which requires expensive operations such as cross-frame convolution. Balancing output video quality against computational cost is therefore an important open problem. To address it, we propose an efficient and lightweight multi-scale 3D video super-resolution scheme that arranges 3D convolution feature extraction blocks in a U-Net structure to achieve multi-scale feature extraction in both the spatial and temporal dimensions. Quantitative and qualitative evaluation results on public video datasets show that, compared to other simple cascaded spatial-temporal feature extraction structures, a U-Net structure achieves comparable texture detail and temporal consistency with a significant reduction in computational cost and latency.
AB - Video Super Resolution (VSR) models based on deep convolutional neural networks (CNNs) take multiple Low-Resolution (LR) frames as input and can effectively recover High-Resolution (HR) frames while preserving the video's temporal information. However, to realize these advantages, VSR must exploit both spatial and temporal information to improve the perceived quality of the output video, which requires expensive operations such as cross-frame convolution. Balancing output video quality against computational cost is therefore an important open problem. To address it, we propose an efficient and lightweight multi-scale 3D video super-resolution scheme that arranges 3D convolution feature extraction blocks in a U-Net structure to achieve multi-scale feature extraction in both the spatial and temporal dimensions. Quantitative and qualitative evaluation results on public video datasets show that, compared to other simple cascaded spatial-temporal feature extraction structures, a U-Net structure achieves comparable texture detail and temporal consistency with a significant reduction in computational cost and latency.
KW - 3D convolution
KW - Efficiency
KW - U-Net
KW - Video Super Resolution
UR - http://www.scopus.com/inward/record.url?scp=85184278560&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-50069-5_30
DO - 10.1007/978-3-031-50069-5_30
M3 - Conference contribution
AN - SCOPUS:85184278560
SN - 9783031500688
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 362
EP - 374
BT - Advances in Computer Graphics - 40th Computer Graphics International Conference, CGI 2023, Proceedings
A2 - Sheng, Bin
A2 - Bi, Lei
A2 - Kim, Jinman
A2 - Magnenat-Thalmann, Nadia
A2 - Thalmann, Daniel
PB - Springer Science and Business Media Deutschland GmbH
T2 - 40th Computer Graphics International Conference, CGI 2023
Y2 - 28 August 2023 through 1 September 2023
ER -