CNN+BERT for Video Recognition

Video-based action recognition has traditionally relied on 3D CNNs to extract temporal information, followed by a Temporal Global Average Pooling (TGAP) layer that summarizes it. In this work, we replace the TGAP layer with BERT's attention mechanism, which has achieved state-of-the-art results on many sequence-based tasks. BERT's bidirectional attention yields a better representation of temporal information than TGAP.
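
To make the contrast concrete, here is a minimal NumPy sketch of the two ways of summarizing a sequence of per-time-step features. It is an illustration under stated assumptions, not the model used in this work: the function names (`tgap`, `attention_pool`), the shapes, and the random weights standing in for learned parameters are all hypothetical. Reading the summary off a prepended classification token follows BERT's general convention and is an assumption here. The sketch uses a single attention head and leaves out positional encodings, layer normalization, and feed-forward layers.

```python
import numpy as np


def tgap(features):
    """Temporal global average pooling: mean over the time axis.

    features has shape (T, D): T time steps, D channels per step.
    """
    return features.mean(axis=0)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_pool(features, cls_token, w_q, w_k, w_v):
    """Single-head self-attention over [CLS] + T temporal features.

    No causal mask is applied, so every token attends to every other
    token in both temporal directions (bidirectional attention).
    Returns the output at the CLS position and its attention weights.
    """
    tokens = np.vstack([cls_token[None, :], features])  # (T + 1, D)
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T + 1, T + 1)
    weights = softmax(scores, axis=-1)
    out = weights @ v
    return out[0], weights[0]


# Hypothetical sizes: 8 time steps of 16-dimensional 3D-CNN features.
rng = np.random.default_rng(0)
T, D = 8, 16
features = rng.standard_normal((T, D))

# Random stand-ins for parameters that would be learned in practice.
cls_token = rng.standard_normal(D)
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

avg_vec = tgap(features)
cls_vec, cls_weights = attention_pool(features, cls_token, w_q, w_k, w_v)
```

Both `avg_vec` and `cls_vec` are D-dimensional summaries that a classifier could consume. The difference is that TGAP gives every time step the same weight of 1/T, while the attention weights in `cls_weights` depend on the content of the features, so some time steps can contribute more than others.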

Industry: Technology