Applying Graph Convolutional Networks to VQA

知行编程网 · 2022-05-15 12:00


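From | Zhihu, author: 肥橘猫与肥柴犬

Source | https://zhuanlan.zhihu.com/p/63207928

This article is shared for academic exchange only; if it infringes any rights, please contact us for removal.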

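VQA researchers seem to be paying more and more attention to modeling the relations between objects in an image (visual object relations).
If we treat the objects in an image as nodes, the relations between them can be represented by edges connecting those nodes; in other words, visual object relations can be expressed as a graph.
This idea fits naturally with the Graph Convolutional Network (GCN), which has risen to prominence in recent years, so it is no surprise that people began applying GCNs to VQA around 2018.
The paper Relation-aware Graph Attention Network for Visual Question Answering, recently posted on arXiv by Microsoft, is the latest work in this direction, and this post mainly uses it as an example to look at how GCNs are applied to VQA.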

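1. Graph Attention Network

As the title suggests, the paper uses a Graph Attention Network (GAT), which is essentially "GCN + Attention".
The GCN concept is not hard to grasp: "convolutional" means the same thing as in a CNN, and can be understood as letting each node gather and fuse the local information carried by the edges around it.

[Figure: schematic of a node aggregating information from its neighbors]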

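Adding attention means reweighting the information carried by these edges.
The paper expresses this mathematically as:

$$v_i^{\star} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \cdot W v_j\Big)$$

where $\mathcal{N}_i$ is the set of neighbors of node $i$, $v_j$ denotes each neighbor node, $\alpha_{ij}$ is the attention weight, and $v_i^{\star}$ is the node feature after the convolution. $W$ is a learned projection matrix and $\sigma$ is a nonlinear activation.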
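To make the attention-weighted aggregation concrete, here is a minimal pure-Python sketch of one such layer. It is only an illustration under stated assumptions: the function names (`gat_layer`, `softmax`, and so on) are mine, the attention score is a plain dot product between projected features (the paper uses its own scoring functions, which differ per relation encoder), and ReLU stands in for the nonlinearity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gat_layer(features, neighbors, W):
    """One attention-weighted aggregation step.

    features:  list of node feature vectors
    neighbors: dict mapping node index i -> list of neighbor indices
    W:         projection matrix (hypothetical; learned in a real model)
    """
    projected = [matvec(W, v) for v in features]
    out = []
    for i in range(len(features)):
        nbrs = neighbors.get(i, [])
        agg = [0.0] * len(projected[i])
        if nbrs:
            # Assumption: dot-product scores, not the paper's scoring function.
            alpha = softmax([dot(projected[i], projected[j]) for j in nbrs])
            for a, j in zip(alpha, nbrs):
                for d in range(len(agg)):
                    agg[d] += a * projected[j][d]
        out.append([max(0.0, x) for x in agg])  # ReLU as the nonlinearity
    return out

# Toy example: three nodes, every node attends to the other two.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = gat_layer(features, neighbors, identity)
```

Because the attention weights of each node sum to 1, every output vector is a convex combination of the projected neighbor features (before the ReLU).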

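2. Different types of relations

The paper makes the sharp observation that visual relations between objects fall into three broad categories:

semantic relation: the semantic relation between objects, typically expressed as an action, e.g. (kid, eating, sandwich).
spatial relation: the spatial relation between objects, describing the relative position of two objects, e.g. (kid, intersect, sandwich).
implicit relation: the two relations above are called explicit relations because they can be named explicitly. Some relations, however, cannot be spelled out and yet help the model answer correctly, so the paper calls them implicit relations.

The paper's main motivation is to model these three kinds of relations with separate graphs and then combine them.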

3. Relation-aware Graph Attention Network

[Figure: ReGAT model architecture]

Because the relation encoder is mainly implemented with GAT, the authors named the model Relation-aware Graph Attention Network, or ReGAT for short.

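The work trains a separate relation encoder for each type of relation, then combines the three encoders at inference time into an ensemble model. The final predicted answer probability is:

$$\Pr(a = a_i) = \alpha \Pr_{sem}(a = a_i) + \beta \Pr_{spa}(a = a_i) + (1 - \alpha - \beta) \Pr_{imp}(a = a_i)$$

where $\Pr_{sem}$/$\Pr_{spa}$/$\Pr_{imp}$ are the predicted probability distributions based on the semantic/spatial/implicit relation, and $\alpha$ and $\beta$ are two trade-off hyperparameters.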
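The weighted combination of the three encoders' predictions can be sketched in a few lines. The function name `ensemble_probs` and all numbers below are made up for illustration; the only thing taken from the text is that two trade-off weights mix three probability distributions over the same answer set.

```python
def ensemble_probs(p_sem, p_spa, p_imp, alpha, beta):
    """Mix three answer distributions with trade-off weights alpha and beta.

    The implicit-relation model receives the remaining weight 1 - alpha - beta,
    so the result is still a probability distribution.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    if not (len(p_sem) == len(p_spa) == len(p_imp)):
        raise ValueError("all distributions must cover the same answer set")
    gamma = 1.0 - alpha - beta
    return [alpha * s + beta * p + gamma * i
            for s, p, i in zip(p_sem, p_spa, p_imp)]

# Hypothetical distributions over three candidate answers.
p_sem = [0.6, 0.3, 0.1]
p_spa = [0.5, 0.4, 0.1]
p_imp = [0.2, 0.7, 0.1]
mixed = ensemble_probs(p_sem, p_spa, p_imp, alpha=0.3, beta=0.3)
best_answer = max(range(len(mixed)), key=mixed.__getitem__)
```

In this toy case the semantic and spatial models both prefer answer 0, but the implicit model's strong vote for answer 1 wins after mixing, which shows how the choice of weights can change the final prediction.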

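Each relation encoder builds its graph differently:

For the implicit relation, the graph is fully connected: the relation between every pair of objects is taken into account, giving $K \times (K-1)$ edges ($K$ is the number of detected objects).
For the semantic relation and the spatial relation, the corresponding graphs are sparse. The paper pre-trains two classifiers to recognize the semantic and spatial relations between objects, and a relation is added to the graph as an edge only when the classifier detects it between two objects.

[Figure: example semantic and spatial relation graphs]

The relation encoders also differ in how they compute attention; see the original paper for details.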
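The two graph constructions described above can be sketched as follows. This is only an illustration: the function names are mine, and the `(subject, relation, object)` triples stand in for the output of a hypothetical relation classifier, with `None` meaning that no relation was detected.

```python
from itertools import permutations

def fully_connected_edges(k):
    """All ordered pairs (i, j) with i != j among k objects: k * (k - 1) edges."""
    return list(permutations(range(k), 2))

def sparse_edges(triples):
    """Keep an edge only where a relation was detected.

    triples: iterable of (subject_index, relation_label, object_index);
             relation_label is None when no relation was recognized.
    Returns a dict mapping (subject_index, object_index) -> relation_label.
    """
    return {(i, j): rel for i, rel, j in triples if rel is not None}

# Implicit-relation graph over 4 detected objects.
implicit = fully_connected_edges(4)

# Hypothetical classifier output for a semantic-relation graph
# (object 0 = kid, object 1 = sandwich, object 2 = table).
detected = [(0, "eating", 1), (1, None, 2), (1, "on", 2)]
semantic = sparse_edges(detected)
```

With 4 objects the implicit graph has 4 × 3 = 12 directed edges, while the sparse graph here keeps only the two labeled ones.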

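4. Experimental results

On VQA2.0, which has become the standard benchmark for VQA, the paper's ReGAT model performs very well, surpassing all test results reported in the literature at the time.

The paper also argues that its relation encoders can be plugged into other existing VQA models, and the related experiments show that this plug-in does help (Sem/Spa/Imp indicate that the semantic/spatial/implicit relation encoder was added).

The authors also ran an ablation study on how each relation encoder is affected by whether attention is used and whether the question feature is fused into the graph nodes (Q-adaptive).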
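As a rough picture of what "fusing the question feature into the graph nodes" can look like, the sketch below simply appends the question embedding to every node feature before the graph layer sees it. Concatenation is my assumption for illustration, as are the name `question_adaptive` and the toy vectors; check the paper for the exact fusion it uses.

```python
def question_adaptive(node_features, question_embedding):
    """Append the question embedding to each node feature vector.

    Assumption: fusion by concatenation. Every node then carries the same
    question context, so attention scores can depend on the question.
    """
    q = list(question_embedding)
    return [list(v) + q for v in node_features]

# Two toy object features and a one-dimensional toy question embedding.
nodes = [[1.0, 2.0], [3.0, 4.0]]
question = [0.5]
fused = question_adaptive(nodes, question)
```

The ablation in the text then amounts to comparing a model that feeds `fused` into the relation encoder against one that feeds the raw `nodes`.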

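However, I think the experimental section has a gap: it does not examine how different combinations of relation encoders affect the model. For example, the paper ends up with a "Sem+Spa+Imp" ensemble, but is that necessarily better than a simple "Imp+Imp+Imp" ensemble?
Without such an experiment, the authors cannot convincingly show that the three relation encoders are actually complementary, which leaves the impression that the strong results may mainly come from stacking three encoders that do similar things.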

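5. Other similar work

Besides this paper, I also came across several other papers this week that use GCNs for VQA.

Learning Conditioned Graph Structures for Interpretable Visual Question Answering (NIPS 2018) performs the convolution mainly using the relative spatial positions of objects (i.e. the spatial relation), but does not use a classifier to explicitly categorize that relation.

Multi-modal Learning with Prior Visual Relation Reasoning (arXiv) does use a classifier to categorize the relations between objects, but in the GCN it uses the classifier's intermediate output as the relation embedding.

[Figure: relation embedding taken from the classifier's intermediate output]

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering (NIPS 2018) targets the FVQA dataset, which is based on an external knowledge base: every answer must be grounded in an external fact, represented as a relation triple. Here the image and the question are each represented by a single vector, and the graph is built from the retrieved fact relation pairs.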


This article comes from: 深度学习这件小事
