阅读 TALL: Temporal Activity Localization via Language Query

misalign to be penalized in the loss

regression for temporal stamps of event that corresponds to a sentence description.

I think this task can still be categorized into event detection/retrieval.

There's an embedding learning from videos by using 3D CNN and pooling.

There's an embedding learning from text by using word2vec or the ***graph technique.

The fusion is done FC layers with loss designed as mentioned above - try to align text with the event.

你可能感兴趣的:(CVPR2018)