Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization in "zero-shot" settings and has been applied to many downstream tasks. We explore adapting CLIP to obtain a more efficient and generalized action recognition method. We argue that the key lies in explicitly modeling the motion cues flowing through video frames. To that end, we design a two-stream motion modeling block that captures motion and spatial information simultaneously. The obtained motion cues are then used to drive a dynamic prompts learner that generates motion-aware prompts, which carry rich semantic information about human actions. In addition, we propose a multimodal communication block to enable collaborative learning and further improve performance. We conduct extensive experiments on the HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin in "few-shot" and "zero-shot" training. We also achieve competitive performance in "closed-set" training with extremely few trainable parameters and little additional computational cost.
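To make the idea concrete, below is a minimal PyTorch sketch of a dynamic prompt learner driven by motion cues; the frame-difference motion proxy, module names, and dimensions are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MotionPromptLearner(nn.Module):
    """Toy dynamic prompt learner driven by motion cues (illustrative sketch only)."""
    def __init__(self, feat_dim=512, n_prompts=4):
        super().__init__()
        self.n_prompts, self.feat_dim = n_prompts, feat_dim
        # maps a pooled motion cue to a set of prompt vectors
        self.to_prompts = nn.Linear(feat_dim, n_prompts * feat_dim)

    def forward(self, frame_feats):                         # (B, T, D) per-frame features
        motion = frame_feats[:, 1:] - frame_feats[:, :-1]   # crude motion proxy
        motion = motion.mean(dim=1)                         # (B, D) pooled motion cue
        prompts = self.to_prompts(motion)
        return prompts.view(-1, self.n_prompts, self.feat_dim)

# usage: the generated prompts would be prepended to the text-token embeddings of class names
learner = MotionPromptLearner()
motion_prompts = learner(torch.randn(2, 8, 512))
print(motion_prompts.shape)   # torch.Size([2, 4, 512])
```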
Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning
Self-supervised learning has proved effective for skeleton-based human action understanding, which is an important yet challenging topic. Previous works mainly rely on the contrastive learning or masked motion modeling paradigm to model skeleton relations. However, these methods cannot effectively handle sequence-level and joint-level representation learning at the same time, so the learned representations fail to generalize to different downstream tasks. Moreover, combining the two paradigms in a naive manner leaves the synergy between them untapped and can cause interference during training. To address these problems, we propose Prompted Contrast with Masked Motion Modeling (PCM) for versatile 3D action representation learning. Our method integrates contrastive learning and masked prediction in a mutually beneficial manner, which substantially boosts generalization across various downstream tasks. Specifically, masked prediction provides novel training views for contrastive learning, which in turn guides the masked prediction training with high-level semantic information. Moreover, we propose a dual-prompted multi-task pretraining strategy that further improves the learned representations by reducing the interference caused by learning the two different pretext tasks. Extensive experiments on five downstream tasks over three large-scale datasets demonstrate the superior generalization capacity of PCM compared to state-of-the-art works. Our project is publicly available at: https://jhang2020.github.io/Projects/PCM3/PCM3.html.
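As a rough sketch of how the two pretext tasks might be combined for skeleton sequences, the snippet below pairs an InfoNCE contrastive loss with a masked-frame reconstruction loss; the encoder, masking scheme, and loss weighting are placeholder assumptions, not PCM's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    # z1, z2: (B, D) embeddings of two augmented views of the same sequence
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

def masked_reconstruction(decoder, encoded, target, mask):
    # encoded: (B, T, D) features, target: (B, T, J*3) joints, mask: (B, T) bool
    pred = decoder(encoded)
    return F.mse_loss(pred[mask], target[mask])   # reconstruct only the masked frames

# toy joint objective: L = L_contrastive + lambda * L_masked (the weighting is an assumption)
B, T, J, D = 4, 16, 25, 128
enc, dec = nn.GRU(J * 3, D, batch_first=True), nn.Linear(D, J * 3)
x1, x2 = torch.randn(B, T, J * 3), torch.randn(B, T, J * 3)
h1, _ = enc(x1)
h2, _ = enc(x2)
mask = torch.rand(B, T) < 0.5
loss = info_nce(h1.mean(1), h2.mean(1)) + 1.0 * masked_reconstruction(dec, h1, x1, mask)
loss.backward()
```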
Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. Firstly, they cannot directly generate coherent motions and require additional operations such as interpolation to process the generated actions. Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at \href{https://github.com/yangzhao1230/PCMDM}{https://github.com/yangzhao1230/PCMDM}.
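A generic inpainting-style sampling loop illustrates the past-conditioned idea: at every denoising step the frames belonging to the fixed past are overwritten with an appropriately noised copy of that past, so only the future frames are actually sampled. The denoiser interface, noise schedule, and DDPM update used here are generic placeholders, not the authors' exact model.

```python
import torch

def past_inpainting_sample(denoise_fn, past, total_len, betas, feat_dim=64):
    """Sample a motion of length total_len whose first past.shape[0] frames match `past`.
    denoise_fn(x, t) is assumed to predict the injected noise (epsilon), DDPM-style."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    mask = torch.zeros(total_len, 1)
    mask[: past.shape[0]] = 1.0                 # 1 = known past frames
    known = torch.zeros(total_len, feat_dim)
    known[: past.shape[0]] = past
    x = torch.randn(total_len, feat_dim)
    for t in reversed(range(len(betas))):
        # re-noise the known past to the current noise level and paste it in
        noised_past = alpha_bars[t].sqrt() * known + (1 - alpha_bars[t]).sqrt() * torch.randn_like(known)
        x = mask * noised_past + (1 - mask) * x
        eps = denoise_fn(x, t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return mask * known + (1 - mask) * x        # keep the exact past in the output

# toy run with a dummy denoiser that predicts zero noise
betas = torch.linspace(1e-4, 0.02, 50)
motion = past_inpainting_sample(lambda x, t: torch.zeros_like(x),
                                past=torch.randn(20, 64), total_len=60, betas=betas)
print(motion.shape)   # torch.Size([60, 64])
```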
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the procedure needed to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge about possible next actions, and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure via chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks and, as our qualitative analysis shows, can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
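As a rough illustration of the two-stage recipe, the snippet below assembles an LLM prompt from recognized (verb, noun) pairs, optionally with an inferred goal; the wording and output format are invented for illustration and are not AntGPT's actual templates.

```python
def build_lta_prompt(observed_actions, n_future=8, goal=None):
    """observed_actions: list of (verb, noun) pairs recognized from the observed video."""
    history = ", ".join(f"{v} {n}" for v, n in observed_actions)
    goal_line = f"The actor's goal is: {goal}.\n" if goal else ""
    return (
        "You are watching someone perform a procedure.\n"
        f"{goal_line}"
        f"Actions observed so far: {history}.\n"
        f"Predict the next {n_future} actions as a comma-separated list of 'verb noun' pairs."
    )

prompt = build_lta_prompt([("crack", "egg"), ("whisk", "egg")], goal="make egg fried rice")
print(prompt)
# the resulting string would then be sent to an LLM for conditioned generation,
# or extended with chain-of-thought instructions for goal inference and planning
```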
Language-based Action Concept Spaces Improve Video Self-Supervised Learning
Recent contrastive language-image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language-tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with training objectives operating in an action concept space. This space is constructed from feature vectors of various action concepts extracted from a language encoder using relevant textual prompts. We introduce two training objectives, concept distillation and concept alignment, that retain the generality of the original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear-probing performance on three action recognition benchmarks.
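The sketch below shows one way a concept-distillation objective over such a space could look: the student's and teacher's similarity distributions over precomputed concept embeddings are matched with a KL term. This is an illustrative stand-in under assumed shapes, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

# concept_feats: (K, D) text embeddings of K action concepts, assumed to be precomputed
# with a language encoder and prompts such as "a video of a person {action}".
def concept_distillation_loss(student_feat, teacher_feat, concept_feats, tau=0.1):
    """KL between teacher and student similarity distributions over the concept space."""
    c = F.normalize(concept_feats, dim=-1)
    s = F.normalize(student_feat, dim=-1) @ c.t() / tau
    t = F.normalize(teacher_feat, dim=-1) @ c.t() / tau
    return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1), reduction="batchmean")

loss = concept_distillation_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(400, 512))
print(loss.item())
```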
ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting
Recent 2D-to-3D human pose estimation (HPE) methods utilize temporal consistency across sequences to alleviate the depth ambiguity problem, but they ignore the action-related prior knowledge hidden in the pose sequence. In this paper, we propose a plug-and-play module named Action Prompt Module (APM) that effectively mines different kinds of action clues for 3D HPE. The highlight is that the mining scheme of APM can be widely adapted to different frameworks and bring consistent benefits. Specifically, we first present a novel Action-related Text Prompt module (ATP) that directly embeds action labels and transfers the rich language information in the labels to the pose sequence. We further introduce an Action-specific Pose Prompt module (APP) to mine the position-aware pose pattern of each action, and exploit the correlation between the mined patterns and the input pose sequence for further pose refinement. Experiments show that APM can improve the performance of most video-based 2D-to-3D HPE frameworks by a large margin.
DisasterResponseGPT: Large Language Models for Accelerated Plan of Action Development in Disaster Response Scenarios
The development of plans of action in disaster response scenarios is a time-consuming process. Large Language Models (LLMs) offer a powerful solution to expedite this process through in-context learning. This study presents DisasterResponseGPT, an algorithm that leverages LLMs to generate valid plans of action quickly by incorporating disaster response and planning guidelines in the initial prompt. In DisasterResponseGPT, users input the scenario description and receive a plan of action as output. The proposed method generates multiple plans within seconds, which can be further refined following the user's feedback. Preliminary results indicate that the plans of action developed by DisasterResponseGPT are comparable to human-generated ones while offering greater ease of modification in real-time. This approach has the potential to revolutionize disaster response operations by enabling rapid updates and adjustments during the plan's execution.
STEPS: A Benchmark for Order Reasoning in Sequential Tasks
Various human activities can be abstracted into sequences of actions in natural text, e.g. cooking, repairing, and manufacturing. Such action sequences depend heavily on execution order, and disorder in an action sequence leads to failure of further task execution by robots or AI agents. Therefore, to verify the order-reasoning capability of current neural models in sequential tasks, we propose a challenging benchmark named STEPS. STEPS involves two subtask settings, focusing on determining the rationality of a given next step in a recipe and selecting the reasonable step from a multi-choice question, respectively. We describe the data construction and task formulations, and benchmark most of the significant Large Language Models (LLMs). The experimental results demonstrate that 1) commonsense reasoning about action order in sequential tasks is challenging for LLMs to resolve via zero-shot prompting or few-shot in-context learning; and 2) prompting methods still significantly lag behind tuning-based methods on STEPS.
Prompt Learning for Action Recognition
We present a new general learning approach for action recognition, Prompt Learning for Action Recognition (PLAR), which leverages the strengths of prompt learning to guide the learning process. Our approach is designed to predict the action label by helping the model focus on the descriptions or instructions associated with actions in the input videos. Our formulation uses various prompts, including optical flow, large vision models, and learnable prompts, to improve recognition performance. Moreover, we propose a learnable prompt method that learns to dynamically generate prompts from a pool of prompt experts under different inputs. By sharing the same objective, our proposed PLAR can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt-experts pool) and input-specific (data-dependent) prompt knowledge. We evaluate our approach on datasets consisting of both ground-camera videos and aerial videos, and on scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset Okutama, and a 0.8-2.6% improvement on the ground-camera single-agent dataset Something Something V2. We plan to release our code on the WWW.
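A minimal sketch of the prompt-experts idea, assuming a pool of learnable prompt tokens mixed by an input-dependent gate (names and sizes are invented for illustration):

```python
import torch
import torch.nn as nn

class PromptExpertPool(nn.Module):
    """Toy pool of learnable prompt experts mixed by an input-conditioned gate."""
    def __init__(self, n_experts=8, n_tokens=4, dim=512):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, n_tokens, dim) * 0.02)
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, video_feat):                 # video_feat: (B, dim)
        w = self.gate(video_feat).softmax(dim=-1)  # (B, n_experts) mixing weights
        # weighted sum over the shared expert pool -> input-specific prompt tokens
        return torch.einsum("be,etd->btd", w, self.experts)

pool = PromptExpertPool()
prompts = pool(torch.randn(2, 512))
print(prompts.shape)   # torch.Size([2, 4, 512])
```

The shared expert pool plays the input-invariant role, while the gating output makes the final prompts data-dependent.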
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection
The goal of spatial-temporal action detection is to determine the time and place where each person's action occurs in a video and classify the corresponding action category. Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data, making it very difficult to achieve zero-shot learning. In this paper, we propose to utilize a pre-trained visual-language model to extract the representative image and text features, and model the relationship between these features through different interaction modules to obtain the interaction feature. In addition, we use this feature to prompt each label to obtain more appropriate text features. Finally, we calculate the similarity between the interaction feature and the text feature for each label to determine the action category. Our experiments on J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction module and prompting make the visual-language features better aligned, thus achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be released upon acceptance.
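The final classification step can be illustrated with a few lines of cosine-similarity scoring between the interaction feature and the prompted per-label text features; dimensions and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_action_scores(interaction_feat, label_text_feats, temperature=0.01):
    """interaction_feat: (B, D) features of detected persons after the interaction modules;
    label_text_feats: (C, D) text features of the prompted action labels."""
    v = F.normalize(interaction_feat, dim=-1)
    t = F.normalize(label_text_feats, dim=-1)
    return (v @ t.t()) / temperature          # (B, C) similarity logits

logits = zero_shot_action_scores(torch.randn(5, 512), torch.randn(24, 512))
print(logits.argmax(dim=-1))                  # predicted action class per detected person
```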
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting (2023 CVPR)
Adopting contrastive image-text pretrained models like CLIP for video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side covers three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our codes and models will be publicly released.
Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features (2023 CVPR)
This study investigates unsupervised anomaly action recognition, which identifies video-level abnormal-human-behavior events in an unsupervised manner without abnormal samples, and simultaneously addresses three limitations in the conventional skeleton-based approaches: target domain-dependent DNN training, robustness against skeleton errors, and a lack of normal samples. We present a unified, user prompt-guided zero-shot learning framework using a target domain-independent skeleton feature extractor, which is pretrained on a large-scale action recognition dataset. Particularly, during the training phase using normal samples, the method models the distribution of skeleton features of the normal actions while freezing the weights of the DNNs, and it estimates the anomaly score using this distribution in the inference phase. Additionally, to increase robustness against skeleton errors, we introduce a DNN architecture inspired by a point cloud deep learning paradigm, which sparsely propagates the features between joints. Furthermore, to prevent unobserved normal actions from being misidentified as abnormal actions, we incorporate a similarity score between the user prompt embeddings and skeleton features aligned in the common space into the anomaly score, which indirectly supplements normal actions. On two publicly available datasets, we conduct experiments to test the effectiveness of the proposed method with respect to the abovementioned limitations.
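One way to read the scoring recipe is sketched below: a Gaussian is fitted to skeleton features of normal actions, and at inference the distance to that distribution is reduced by the similarity to user-prompt embeddings. The Gaussian model, the Mahalanobis distance, and the linear combination are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fit_normal_distribution(train_feats):
    """Fit a Gaussian to skeleton features of normal actions (frozen extractor assumed)."""
    mu = train_feats.mean(dim=0)
    cov = torch.cov(train_feats.t()) + 1e-4 * torch.eye(train_feats.size(1))
    return mu, torch.linalg.inv(cov)

def anomaly_score(feat, mu, cov_inv, prompt_feats, lam=1.0):
    """Mahalanobis distance to the normal-action distribution, reduced by the similarity
    to user-prompt embeddings aligned in the common space (weighting is an assumption)."""
    d = feat - mu
    maha = torch.einsum("bi,ij,bj->b", d, cov_inv, d)
    sim = (F.normalize(feat, dim=-1) @ F.normalize(prompt_feats, dim=-1).t()).max(dim=-1).values
    return maha - lam * sim

mu, cov_inv = fit_normal_distribution(torch.randn(200, 64))
scores = anomaly_score(torch.randn(10, 64), mu, cov_inv, prompt_feats=torch.randn(5, 64))
print(scores.shape)   # torch.Size([10]) -- higher means more anomalous
```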
Multi-modal Prompting for Low-Shot Temporal Action Localization
In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposals, followed by open-vocabulary classification. We make the following contributions. First, to compensate for the lack of temporal motion modeling in image-text foundation models, we improve category-agnostic action proposals by explicitly aligning embeddings of optical flow, RGB and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., ones that avoid lexical ambiguity. Specifically, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models) or with visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation
We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer and fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. We introduce a generic approach compatible with stochastic (e.g. VAE-based) and deterministic (e.g. MotionCLIP) text-to-motion models. In addition, the approach enables multiple text descriptions to be utilized. Our experiments show (i) noticeable qualitative and quantitative improvement in the quality of synthesized motions, (ii) benefits of utilizing multiple LLM-generated descriptions, (iii) suitability of the prompt function, and (iv) zero-shot generation capabilities of the proposed approach. Project page: this https URL
Multi-Modal Few-Shot Temporal Action Detection
Conventional temporal action detection (TAD) methods rely on supervised learning from many labeled training videos, rendering them unscalable to new classes. Recent approaches to solving this problem include few-shot (FS) and zero-shot (ZS) TAD. The former can adapt a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter synthesizes a classifier from some semantic description of a new class (e.g., generating the classifier using a pretrained vision-language (ViL) model). In this work, we further introduce a hybrid problem setup, multi-modality few-shot (MMFS) TAD, that integrates the respective advantages of FS-TAD and ZS-TAD by accounting for both few-shot support videos (i.e., the visual modality) and new class names (i.e., the textual modality) in a single formulation. To tackle this MMFS-TAD problem, we introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. Our key idea is to construct multi-modal prompts by mapping few-shot support videos into the textual token space of a pretrained ViL model (e.g., CLIP) using a meta-learned adapter-equipped visual semantics tokenizer; this facilitates joint use of the two input modalities for learning richer representations. To address the challenge of large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art FS-TAD, ZS-TAD and alternative methods under a variety of MMFS-TAD settings, often by a large margin.
Knowledge Prompting for Few-shot Action Recognition
Few-shot action recognition in videos is challenging due to its lack of supervision and the difficulty of generalizing to unseen actions. To address this task, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt a powerful pre-trained vision-language model for few-shot classification. We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base. The text proposals are collected by filling in handcrafted sentence templates with an external action-related corpus or by extracting action-related phrases from the captions of Web instruction videos. We then feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals for each frame, and these scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.001 of that of existing methods.
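The pipeline can be sketched in three steps, with invented templates and random placeholder matching scores standing in for the frozen vision-language model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Step 1: build text proposals by filling sentence templates (templates invented here).
verbs, objects = ["open", "cut", "pour"], ["door", "apple", "water"]
proposals = [f"a person {v} the {o}" for v in verbs for o in objects]       # P proposals

# Step 2: per-frame matching scores of each proposal, assumed to come from a frozen
# vision-language model; random placeholders of shape (B, T, P) are used here.
B, T, P = 2, 8, len(proposals)
frame_proposal_scores = torch.rand(B, T, P)

# Step 3: a lightweight temporal model over the score sequences for few-shot classification.
class TemporalHead(nn.Module):
    def __init__(self, p, n_classes, hidden=64):
        super().__init__()
        self.temporal = nn.Conv1d(p, hidden, kernel_size=3, padding=1)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, scores):                               # scores: (B, T, P)
        h = F.relu(self.temporal(scores.transpose(1, 2)))    # (B, hidden, T)
        return self.cls(h.mean(dim=-1))                      # (B, n_classes)

logits = TemporalHead(P, n_classes=5)(frame_proposal_scores)
print(logits.shape)   # torch.Size([2, 5])
```

Because only the small temporal head is trained, the overhead stays tiny relative to fine-tuning the vision-language backbone.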
Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos
Video action segmentation and recognition tasks have been widely applied in many fields. Most previous studies employ large-scale, computationally expensive visual models to understand videos comprehensively. However, few studies directly employ graph models to reason about videos. The graph model provides the benefits of fewer parameters, low computational cost, a large receptive field, and flexible neighborhood message aggregation. In this paper, we present a graph-based method named Semantic2Graph, which turns the video action segmentation and recognition problem into node classification on graphs. To preserve fine-grained relations in videos, we construct the graph structure of videos at the frame level and design three types of edges: temporal, semantic, and self-loop. We combine visual, structural, and semantic features as node attributes. Semantic edges are used to model long-term spatio-temporal relations, while the semantic features are embeddings of the label text derived from a textual prompt. A Graph Neural Network (GNN) model is used to learn multi-modal feature fusion. Experimental results show that Semantic2Graph achieves improvements on GTEA and 50Salads compared to the state-of-the-art results. Multiple ablation experiments further confirm the effectiveness of semantic features in improving model performance, and semantic edges enable Semantic2Graph to capture long-term dependencies at a low cost.
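A minimal sketch of the frame-level graph construction with the three edge types follows; how the semantic pairs are actually selected is the paper's contribution, so they are passed in as an assumed input here.

```python
import torch

def build_frame_graph(num_frames, semantic_pairs):
    """Return edge-index tensors (2, E) for temporal, semantic, and self-loop edges."""
    # temporal edges: consecutive frames, both directions
    src = list(range(num_frames - 1)) + list(range(1, num_frames))
    dst = list(range(1, num_frames)) + list(range(num_frames - 1))
    temporal = torch.tensor([src, dst])
    # self-loop edges: each frame connected to itself
    self_loops = torch.arange(num_frames).repeat(2, 1)
    # semantic edges: long-range links between semantically related frames
    semantic = (torch.tensor(list(zip(*semantic_pairs)))
                if semantic_pairs else torch.empty(2, 0, dtype=torch.long))
    return {"temporal": temporal, "semantic": semantic, "self_loop": self_loops}

edges = build_frame_graph(6, semantic_pairs=[(0, 4), (1, 5)])
print({k: v.shape for k, v in edges.items()})
# node attributes (visual + structural + label-text embeddings) would then be fed,
# together with these edges, to a GNN that classifies each frame node
```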
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation. Homepage: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html
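The probabilistic-mapping property comes from the standard diffusion training recipe; a generic noise-prediction step for a text-conditioned motion model is sketched below, with a toy denoiser standing in for MotionDiffuse's actual network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Placeholder denoiser; a real model would embed t and cross-attend to the text."""
    def __init__(self, dim=75):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, x_t, t, text_emb):
        return self.net(x_t)

def diffusion_training_step(denoiser, x0, text_emb, alpha_bars):
    """One generic noise-prediction step for text-conditioned motion diffusion
    (a stand-in, not MotionDiffuse's exact formulation). x0: (B, T, D) clean motion."""
    t = torch.randint(0, len(alpha_bars), (x0.size(0),))
    a = alpha_bars[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise       # forward (noising) process
    return F.mse_loss(denoiser(x_t, t, text_emb), noise)

alpha_bars = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 100), dim=0)
loss = diffusion_training_step(ToyDenoiser(), torch.randn(4, 60, 75),
                               torch.randn(4, 512), alpha_bars)
loss.backward()
```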
Zero-Shot Temporal Action Detection via Vision-Language Prompting(2022 ECCV)
Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors.
Prompting Visual-Language Models for Efficient Video Understanding(2022 ECCV)
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for “zero-shot” generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model for video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed as “continuous prompt vectors”, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On ten public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite optimising significantly fewer parameters. Due to space limitation, we refer the readers to the arXiv version at https://arxiv.org/abs/2112.04478.
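The core trick can be illustrated with a CoOp-style module: a handful of learnable vectors are wrapped around the frozen class-name token embeddings before they enter the frozen text encoder, so only those vectors (plus a light temporal module) are optimized. Shapes and names below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Learnable prefix/suffix vectors wrapped around frozen class-name embeddings."""
    def __init__(self, n_prefix=8, n_suffix=8, dim=512):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_prefix, dim) * 0.02)
        self.suffix = nn.Parameter(torch.randn(n_suffix, dim) * 0.02)

    def forward(self, class_token_embs):           # (C, L, dim) frozen name embeddings
        C = class_token_embs.size(0)
        pre = self.prefix.unsqueeze(0).expand(C, -1, -1)
        suf = self.suffix.unsqueeze(0).expand(C, -1, -1)
        # the concatenated sequence is then passed through the frozen text encoder
        return torch.cat([pre, class_token_embs, suf], dim=1)

prompted = ContinuousPrompt()(torch.randn(101, 4, 512))
print(prompted.shape)   # torch.Size([101, 20, 512]) -> 8 prefix + 4 name + 8 suffix tokens
```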
ActionCLIP: A New Paradigm for Video Action Recognition
The canonical approach to video action recognition dictates a neural model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferable ability on new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task to act more like pre-training problems via prompt engineering. Finally, it end-to-end fine-tunes on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at this https URL
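The video-text matching view can be illustrated with a CLIP-style symmetric contrastive loss over a batch of paired video and label-text features; this simplified stand-in ignores the soft labels a real implementation may use when several videos in a batch share a label.

```python
import torch
import torch.nn.functional as F

def video_text_matching_loss(video_feats, text_feats, tau=0.07):
    """Symmetric contrastive loss over paired video and label-text embeddings."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = video_text_matching_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
# at inference, the class whose prompted text feature is most similar to the video
# feature is taken as the prediction, which enables zero-shot recognition
```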