
1. School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
2. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
WU Xiaoyu (1979-), female, born in Panjin, Liaoning; Ph.D., associate professor. She received her M.S. degree from Jilin University in 2004 and her Ph.D. from the Institute of Automation, Chinese Academy of Sciences, in 2009. Her research interests include computer vision and video analysis and understanding. E-mail: wuxiaoyu@cuc.edu.cn
GU Chaonan (1995-), female, born in Baoding, Hebei; M.S. candidate. She received her B.S. degree from Communication University of China in 2014. Her research focuses on algorithms for video content understanding. E-mail: gcn@cuc.edu.cn
Received: 2019-11-29; Revised: 2020-01-08; Accepted: 2020-01-08; Published in print: 2020-05-25
Xiao-yu WU, Chao-nan GU, Sheng-jin WANG. Special video classification based on multitask learning and multimodal feature fusion[J]. Optics and precision engineering, 2020, 28(5): 1177-1186. DOI: 10.3788/OPE.20202805.1177.
Intelligent classification of special videos (in this paper, violent videos) helps enable intelligent surveillance of online content. Existing algorithms that fuse multimodal features for special video classification do not measure audio-visual semantic correspondence. To address this, a special video recognition method based on audio-visual multimodal feature fusion and multitask learning was proposed. First, audio semantic features and spatio-temporal visual semantic features capturing both appearance and motion were extracted. Then, a shared latent subspace that fuses the audio and visual features while preserving their semantic information was learned. Finally, a multitask learning framework that jointly performs audio-visual semantic correspondence measurement and special video classification was proposed, and a corresponding loss function combining the correspondence loss with the cross-entropy classification loss was designed, yielding an end-to-end system for intelligent special video recognition. Experimental results show that the proposed algorithm achieves an average precision of 97.97% on the Violent Flow dataset and 39.76% on the MediaEval VSD 2015 dataset, outperforming existing methods. These results demonstrate the effectiveness of the algorithm and its value for improving the intelligence of special video surveillance.
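The multitask objective described above, a classification cross-entropy loss combined with an audio-visual correspondence loss, can be sketched as follows. This is a minimal illustration: the function names, the contrastive (margin-based) form of the correspondence term, and the weighting parameter `lam` are assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stabilized)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def correspondence_loss(audio_emb, visual_emb, match, margin=1.0):
    """Contrastive-style correspondence term: pull matched audio/visual
    embeddings together (match=1), push mismatched pairs apart (match=0)."""
    dist = np.linalg.norm(audio_emb - visual_emb)
    return dist ** 2 if match else max(0.0, margin - dist) ** 2

def multitask_loss(logits, label, audio_emb, visual_emb, match, lam=0.5):
    """Joint objective: classification loss plus weighted correspondence loss."""
    return (cross_entropy(logits, label)
            + lam * correspondence_loss(audio_emb, visual_emb, match))
```

Minimizing the joint loss encourages the shared subspace to keep semantically matched audio and visual features close while remaining discriminative for the classification task.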