CNN+BERT for Video Recognition

Industry : Technology

Traditionally, for video-based Action Recognition task, 3D CNN are used to extract temporal information and Temporal Global Average (TGAP) layer to summarize this information. In this work, we replace the TGAP layer with the attention mechanism of BERT as it has been state-of-the-art for many sequen...