Zhiming Hu, Ning Ye, Iqbal Mohomed
We study the problem of natural language-based video retrieval, the task of finding relevant videos given natural language search queries. Most recent state-of-the-art (SOTA) approaches would embed the video and query separately and map the video and query embeddings into a joint latent space to calculate a similarity score between them. To learn a video representation, existing solutions generally use all the frames or sample a subset of frames from the video using uniform sampling. The former solution could be computationally prohibitive while the latter may inject noise from uninformative frames into the final video representation. To this end, we propose mmSampler, a learning-based sampler, to adaptively select salient frames to represent the videos for multimodal video retrieval. mmSampler can greatly reduce the computational overhead for video representation without affecting the retrieval performance. We learn a lightweight policy network to decide whether to further process or discard a frame. By adopting the Gumbel-Softmax trick, we train the sampler jointly with the video retrieval model end-to-end in an efficient manner. Experimental results on benchmark datasets such as ActivityNet, DiDeMo and MSRVTT demonstrate that mmSampler achieves improved retrieval performance while saving as much as 43% GFLOPs per video.