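Multimodal machine learning (MultiModal Machine Learning, MMML) is a machine learning approach aimed at complex tasks such as multimodal sentiment analysis and cross-lingual image search. These tasks require considering data from several modalities at once and extracting useful information from them.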
2.Tensor fusion network for multimodal sentiment analysis
3.On the Benefits of Early Fusion in Multimodal Representation Learning
4.Extending long short-term memory for multi-view structured learning
5.Devise: A deep visual-semantic embedding model
6.Learning transferable visual models from natural language supervision
7.Order-embeddings of images and language
8.Learning Concept Taxonomies from Multi-modal Data
9.Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
10.Learning factorized multimodal representations
11.Multimodal clustering networks for self-supervised learning from unlabeled videos
12.Deep multimodal subspace clustering networks
2.Unsupervised multimodal representation learning across medical images and reports
3.Clip-event: Connecting text and images with event structures
4.Learning by aligning videos in time
5.Multimodal adversarial network for cross-modal retrieval
6.Videobert: A joint model for video and language representation learning
7.Visualbert: A simple and performant baseline for vision and language
8.Decoupling the role of data, attention, and losses in multimodal transformers
9.Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
10.MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences
2.Dynamic memory networks for visual and textual question answering
3.A Survey of Reinforcement Learning Informed by Natural Language
4.Mfas: Multimodal fusion architecture search
5.Multi-view intact space learning
6.Neuro-Symbolic Visual Reasoning: Disentangling Visual from Reasoning
7.Probabilistic neural symbolic models for interpretable visual question answering
8.Learning by abstraction: The neural state machine
9.Socratic models: Composing zero-shot multimodal reasoning with language
10.Vqa-lol: Visual question answering under the lens of logic
11.Multimodal logical inference system for visual-textual entailment
12.Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing
13.Counterfactual vqa: A cause-effect look at language bias
14.Exploring visual relationship for image captioning
15.KAT: A Knowledge Augmented Transformer for Vision-and-Language
16.Building a large-scale multimodal knowledge base system for answering visual queries
17.Visualcomet: Reasoning about the dynamic context of a still image
18.From Recognition to Cognition: Visual Commonsense Reasoning
2.Extractive Text-Image Summarization Using Multi-Modal RNN
3.Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
4.Multimodal abstractive summarization for how2 videos
5.Deep fragment embeddings for bidirectional image sentence mapping
6.Phrase-based image captioning
7.Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach
8.You said that?: Synthesising talking faces from audio
9.Zero-shot text-to-image generation
10.Stochastic video generation with a learned prior
11.Parallel wavenet: Fast high-fidelity speech synthesis
12.Arbitrary talking face generation via attentional audio-visual coherence learning
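Summary: This paper proposes a component called the Multimodal Adaptation Gate (MAG) that can be attached to BERT and XLNet so that they accept multimodal nonverbal data during fine-tuning. MAG works by generating a shift to the internal representations of BERT and XLNet, and this shift is conditioned on the visual and acoustic modalities. Experiments show that fine-tuning MAG-BERT and MAG-XLNet significantly improves sentiment analysis performance, surpassing previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time.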
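To make the idea of a "shift conditioned on the visual and acoustic modalities" more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The gating form, the norm-based scaling factor `beta`, the helper names (`matvec`, `norm`, `init_params`, `mag_shift`), and the random weights are all assumptions made for illustration; bias terms, layer normalization, and dropout are left out.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def init_params(d, dv, da, rng):
    """Hypothetical random weights; a real model would learn them in fine-tuning."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_gv": mat(d, d + dv),  # visual gate, reads the concatenation [z; a_v]
        "W_ga": mat(d, d + da),  # acoustic gate, reads the concatenation [z; a_a]
        "W_v": mat(d, dv),       # projects visual features into the language space
        "W_a": mat(d, da),       # projects acoustic features into the language space
    }


def mag_shift(z, a_v, a_a, params, beta=1.0, eps=1e-6):
    """Shift one language vector z using visual (a_v) and acoustic (a_a) features.

    Assumed form: ReLU gates computed from the language vector together with
    each nonverbal modality, a gated sum of the projected nonverbal features,
    and a scaling step that keeps the shift no larger than beta * ||z||.
    """
    # List addition concatenates, so z + a_v is the vector [z; a_v].
    g_v = [max(0.0, v) for v in matvec(params["W_gv"], z + a_v)]
    g_a = [max(0.0, v) for v in matvec(params["W_ga"], z + a_a)]

    h_v = matvec(params["W_v"], a_v)
    h_a = matvec(params["W_a"], a_a)

    # Displacement vector: gated combination of the two nonverbal projections.
    h = [gv * hv + ga * ha for gv, hv, ga, ha in zip(g_v, h_v, g_a, h_a)]

    # Limit the shift so the nonverbal signal cannot swamp the language vector.
    alpha = min(norm(z) / (norm(h) + eps) * beta, 1.0)
    return [zi + alpha * hi for zi, hi in zip(z, h)]


rng = random.Random(0)
params = init_params(d=4, dv=3, da=2, rng=rng)
z = [0.5, -1.0, 0.25, 2.0]  # toy language representation for one token
shifted = mag_shift(z, [1.0, -2.0, 0.5], [3.0, 1.0], params, beta=0.5)
print(shifted)
```

In this sketch the output has the same dimension as the input language vector, which is what would let such a gate sit between existing layers of a pretrained model without changing their shapes. When the nonverbal features are all zero, the shift vanishes and the language vector passes through unchanged.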
2.Multimodal few-shot learning with frozen language models
3.HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
4.FLAVA: A Foundational Language And Vision Alignment Model
5.Pretrained transformers as universal computation engines
6.Scaling up visual and visual language representation learning with noisy text supervision
7.Foundations of multimodal co-learning
8.Found in translation: Learning robust joint representations by cyclic translations between modalities
9.Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
10.Combining labeled and unlabeled data with co-training
11.Cross-modal data programming enables rapid medical machine learning
12.An information theoretic framework for multi-view learning
13.Comprehensive Semi-Supervised Multi-Modal Learning
2.Multimodal explanations: Justifying decisions and pointing to the evidence
3.Women also snowboard: Overcoming bias in captioning models
4.FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment
5.Smil: Multimodal learning with severely missing modality
6.VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
7.Behind the scene: Revealing the secrets of pre-trained vision-and-language models
8.Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
9.Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
10.MultiViz: Towards Visualizing and Understanding Multimodal Models
11.M2Lens: Visualizing and explaining multimodal models for sentiment analysis
12.HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
13.One model to learn them all
14.What Makes Training Multi-Modal Classification Networks Hard?
15.Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
16.MultiBench: Multiscale Benchmarks for Multimodal Representation Learning