Title: Audio Matters in Video Super-Resolution by Implicit Semantic Guidance
Authors: Chen, Yanxiang
Zhao, Pengcheng
Qi, Meibin
Zhao, Yang
Jia, Wei
Wang, Ronggang
Affiliation: Hefei Univ Technol, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 230601, Peoples R China
Hefei Univ Technol, Sch Comp Sci & Informat Engn, Intelligent Interconnected Syst Anhui Prov Lab, Hefei 230601, Peoples R China
Peng Cheng Natl Lab, Shenzhen 518000, Peoples R China
Peking Univ Shenzhen Grad Sch, Sch Elect & Comp Engn, Shenzhen 518055, Peoples R China
Keywords: NETWORK
ENHANCEMENT
Issue Date: 2022
Publisher: IEEE Transactions on Multimedia
Abstract: Video super-resolution (VSR) aims to use multiple consecutive low-resolution frames to recover the corresponding high-resolution frames. However, existing VSR methods treat videos only as image sequences, ignoring another essential source of temporal information: audio. In fact, there is a semantic link between audio and vision, and extensive studies have shown that audio can provide supervisory information for visual networks. Meanwhile, adding semantic priors has proven effective in super-resolution (SR) tasks, but a pretrained segmentation network is required to obtain semantic segmentation maps. By contrast, audio is contained in the video itself and can be used directly. Therefore, in this study, we propose a novel and pluggable multiscale audiovisual fusion (MS-AVF) module that enhances VSR performance by exploiting relevant audio information, which can be regarded as implicit semantic guidance, in contrast to explicit segmentation priors. Specifically, we first fuse audiovisual features on the semantic feature maps of the target frames at different granularities, and then, through a top-down multiscale fusion approach, feed high-level semantics back to the underlying global visual features layer by layer, thereby providing effective implicit audio semantic guidance for VSR. Experimental results show that audio can further improve VSR quality. Moreover, by visualizing the learned attention mask, we show that the proposed end-to-end model automatically learns latent audiovisual semantic links, particularly improving the accuracy and effectiveness of SR for sound sources and their surrounding regions.
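The abstract describes the MS-AVF module only at a high level (audio features fused with multiscale visual features, propagated top-down, with a learned attention mask). As a rough illustration of that idea, the following is a minimal PyTorch sketch; the module names, channel sizes, sigmoid-mask formulation, and bilinear top-down propagation are assumptions for illustration, not the authors' actual MS-AVF implementation.

```python
# Hypothetical sketch of top-down multiscale audio-visual fusion (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Inject a global audio embedding into a visual feature map via a learned attention mask."""

    def __init__(self, vis_channels: int, audio_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, vis_channels)
        self.mask_conv = nn.Conv2d(vis_channels * 2, 1, kernel_size=1)

    def forward(self, vis_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W); audio_feat: (B, audio_dim)
        b, c, h, w = vis_feat.shape
        a = self.audio_proj(audio_feat).view(b, c, 1, 1).expand(-1, -1, h, w)
        # Spatial attention mask conditioned on both modalities.
        mask = torch.sigmoid(self.mask_conv(torch.cat([vis_feat, a], dim=1)))
        # Residual audio-guided modulation of the visual features.
        return vis_feat * mask + vis_feat


class MultiScaleAVFusion(nn.Module):
    """Fuse audio at each scale, then pass high-level (coarse) semantics down to finer scales."""

    def __init__(self, vis_channels: int, audio_dim: int, num_scales: int = 3):
        super().__init__()
        self.fusions = nn.ModuleList(
            AudioVisualFusion(vis_channels, audio_dim) for _ in range(num_scales)
        )

    def forward(self, vis_pyramid: list[torch.Tensor], audio_feat: torch.Tensor) -> list[torch.Tensor]:
        # vis_pyramid: feature maps ordered coarse -> fine, all with vis_channels channels.
        fused_maps, prev = [], None
        for fusion, feat in zip(self.fusions, vis_pyramid):
            fused = fusion(feat, audio_feat)
            if prev is not None:
                # Top-down path: upsample the coarser fused map and add it to the finer one.
                prev = F.interpolate(prev, size=fused.shape[-2:],
                                     mode="bilinear", align_corners=False)
                fused = fused + prev
            fused_maps.append(fused)
            prev = fused
        return fused_maps


if __name__ == "__main__":
    # Toy usage with assumed shapes: 3-level pyramid, 64 visual channels, 128-d audio embedding.
    module = MultiScaleAVFusion(vis_channels=64, audio_dim=128, num_scales=3)
    pyramid = [torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32), torch.randn(1, 64, 64, 64)]
    audio = torch.randn(1, 128)
    outs = module(pyramid, audio)
    print([o.shape for o in outs])
```

In this sketch the mask plays the role of the "learned attention mask" mentioned in the abstract, and the coarse-to-fine summation stands in for the paper's top-down feedback of high-level semantics; the real module may differ substantially in both respects.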
URI: http://hdl.handle.net/20.500.11897/650387
ISSN: 1520-9210
DOI: 10.1109/TMM.2022.3152941
Indexed: SCI(E)
Appears in Collections: Shenzhen Graduate School (unclaimed)

Files in This Work
There are no files associated with this item.

License: See PKU IR operational policies.