Weakly supervised scene graph parsing, which learns structured image representations without annotated correspondences between graph nodes and visual objects, has attracted growing attention in recent computer vision research. Existing methods mainly focus on designing task-specific loss functions, model architectures, or optimization algorithms. We argue that correspondences between visual objects and graph nodes are crucial for weakly supervised scene graph parsing and are worth learning explicitly. We therefore propose GroParser, a framework that improves weakly supervised scene graph parsing models by grounding visual objects. The proposed weakly supervised grounding method learns a similarity metric between visual objects and scene graph nodes by combining information from both object features and relational features. Specifically, we apply multi-instance learning to capture object category information and exploit a two-stream graph neural network to model the relational similarity metric. Extensive experiments on the scene graph parsing task verify that the grounding found by our model reinforces the performance of existing weakly supervised scene graph parsing methods, including the current state of the art. Further experiments on the Visual Genome (VG) and Visual Relation Detection (VRD) datasets verify that our model improves over existing approaches on the scene graph grounding task.
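To make the grounding idea concrete, here is a minimal sketch of the two ingredients the abstract names: scoring each (visual object, graph node) pair under a similarity metric, and a multi-instance-learning style pooling over the bag of candidate objects. All function names, shapes, and the choice of cosine similarity are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ground_nodes(object_feats, node_feats):
    """Assumed sketch: score every (node, object) pair by cosine
    similarity and assign each scene-graph node to its best-matching
    detected object."""
    obj = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    nod = node_feats / np.linalg.norm(node_feats, axis=1, keepdims=True)
    sim = nod @ obj.T                       # (num_nodes, num_objects)
    return sim.argmax(axis=1), sim

def mil_image_score(sim):
    """Multi-instance-learning style pooling: the image-level score for a
    node is the max similarity over all candidate objects (the 'bag'),
    so supervision needs only image-level labels, not correspondences."""
    return sim.max(axis=1)

# Toy data: 3 graph nodes constructed as lightly perturbed copies of
# detected objects 2, 0, and 4, so the correct grounding is recoverable.
rng = np.random.default_rng(0)
objects = rng.normal(size=(5, 8))           # 5 detected objects, 8-dim features
nodes = objects[[2, 0, 4]] + 0.01 * rng.normal(size=(3, 8))
assignment, sim = ground_nodes(objects, nodes)
print(assignment)                            # → [2 0 4]
```

In the full model the abstract describes, this per-object similarity would additionally be informed by relational features from a two-stream graph neural network rather than object features alone.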