500 events were selected from WIKIHOW (95,321 videos and 4,490 concepts in total).
These concepts and events were then used to build a tree-structured CONCEPT LIBRARY.
We treat each segment as an instance and model it in a multiple instance learning (MIL) framework, where each video is a “bag”. The instance-event similarity (the importance of a segment to an event, obtained by matching its detected concepts against the evidential description of that event) is quantized into different levels of relatedness. Intuitively, the most (ir)relevant instances should have the highest (dis)similarities. Therefore, we propose to learn the instance labels by jointly optimizing the instance classifier and its relatedness level.
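A minimal sketch of the relatedness-level idea, under assumptions (the concept-matching score and the quantile-based quantization below are illustrative, not the paper's exact formulation):

```python
import numpy as np

def instance_event_similarity(detected, evidential):
    """detected: hypothetical {concept: confidence} dict for one segment;
    evidential: set of concepts in the event's evidential description."""
    return sum(conf for concept, conf in detected.items() if concept in evidential)

def quantize_relatedness(similarities, n_levels=4):
    """Bucket instance-event similarities into n_levels relatedness levels (0 = least related)."""
    edges = np.quantile(similarities, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(similarities, edges)
```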
The paper argues that attributes at the image level are insufficient and proposes attributes at the video level (called video attributes): the semantic labels of other external videos collected by researchers. Note that these external videos are different from complex event videos. The external videos contain simple contents of people, objects, scenes and actions, which are basic elements of complex events.
As the external videos are used by treating their semantic labels as video attributes, we call these videos attribute videos.
We propose to use video attributes as additional information to assist complex event detection. Specifically, our framework learns the attribute classifier and the event detector simultaneously. The observation of a particular event affects the attribute classifier, and in return, attributes characterize the event. This mutual influence is captured by a correlation vector, which helps incorporate extra informative cues into the event detector.
The goal is to learn a robust target classifier by using the loosely labeled single-view data from the heterogeneous source domains (web videos, e.g. from YouTube, and web images, e.g. from Google/Bing image search) and the unlabeled multi-view data from the target domain.
Observing that some source domains are more relevant to the target domain, in Section 3, we propose a new method called Multi-domain Adaptation with Heterogeneous Sources (MDA-HS) to effectively cope with heterogeneous sources. Specifically, we seek the optimal weights for different source domains with different types of features and also infer the labels of unlabeled target domain data based on all types of features. For each source domain, we propose to learn an adapted classifier based on the pre-learnt source classifier with data distribution mismatch, for which we minimize the distance between the two classifiers in terms of their weight vectors. We introduce a new regularizer by summing the weighted distances from all the source domains and combine all the weighted adapted classifiers as a new target classifier. We also propose a new ρ-SVM based objective function by using the new regularizer and target classifier for domain adaptation. We develop an iterative optimization method by using the cutting plane method and solving a group-based multiple kernel learning (MKL) problem.
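A rough sketch of the regularizer and combined target classifier described above, with assumed notation (w̃_s: pre-learnt classifier of source domain s; w_s: its adapted classifier; β_s ≥ 0: learnt domain weight; φ_s: feature map for that feature type):

$$\Omega(\{\mathbf{w}_s\}) = \sum_{s=1}^{S} \beta_s \lVert \mathbf{w}_s - \tilde{\mathbf{w}}_s \rVert^2, \qquad f(\mathbf{x}) = \sum_{s=1}^{S} \beta_s\, \mathbf{w}_s^{\top} \phi_s(\mathbf{x}),$$

both of which are then plugged into the ρ-SVM style objective and optimized with the cutting plane method and group-based MKL.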
Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos.
Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain.
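A minimal sketch of the frequency-domain comparison (illustrative only, assuming two descriptor sequences of equal length; the paper's encoding is more involved): circular cross-correlation over all temporal shifts can be evaluated at once via the FFT, which is what the circulant structure buys.

```python
import numpy as np

def circulant_similarity(A, B):
    """A, B: (T, d) frame-descriptor sequences of the same length T (zero-pad otherwise).
    Returns the best matching score over all circular temporal shifts, and that shift."""
    FA = np.fft.rfft(A, axis=0)
    FB = np.fft.rfft(B, axis=0)
    # Circular cross-correlation along time for each descriptor dimension, summed over dims.
    corr = np.fft.irfft(np.conj(FA) * FB, n=A.shape[0], axis=0).sum(axis=1)
    return corr.max(), int(corr.argmax())
```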
Given a query video, our framework provides not only a high level event label (e.g. a wedding ceremony), but also video segments which are important positive evidence and their textual descriptions (e.g. people hugging).
The video is segmented into short clips and pre-trained primitive action classifiers are applied to each clip. The presence of an event is determined by: (1) a global video representation, generated by pooling visual features over the entire video, and (2) the presence of several pieces of positive evidence that are consistent over time.
Our model consists of a global template, a set of local evidence templates and a temporal transition constraint over the evidence set. Given a video, we find the sequence of video segments which achieves the best overall score in matching the evidence templates while meeting the temporal constraints. An event label is assigned based on the global feature of the video, as well as features from the selected pieces of evidence.
Roughly: the video is first segmented into clips, then N pre-trained action classifiers score each clip, so each clip gets N scores (one per action); the order and features of these actions are then modeled. A rough sketch of the segment-selection step is given below.
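A generic dynamic-programming sketch of the segment-selection step (assumes a precomputed template-vs-segment score matrix; not the paper's exact inference):

```python
import numpy as np

def select_evidence(scores):
    """scores[e, s]: how well evidence template e matches video segment s.
    Pick one segment per template so that the chosen segments appear in temporal
    order (strictly increasing segment index) and the total match score is maximal."""
    E, S = scores.shape
    dp = np.full((E, S), -np.inf)      # dp[e, s]: best total score with template e placed at segment s
    back = np.zeros((E, S), dtype=int)
    dp[0] = scores[0]
    for e in range(1, E):
        best_prev, best_idx = -np.inf, 0
        for s in range(S):
            if s > 0 and dp[e - 1, s - 1] > best_prev:   # best predecessor among segments < s
                best_prev, best_idx = dp[e - 1, s - 1], s - 1
            dp[e, s] = best_prev + scores[e, s]
            back[e, s] = best_idx
    s = int(np.argmax(dp[-1]))
    total, path = dp[-1, s], [s]
    for e in range(E - 1, 0, -1):
        s = back[e, s]
        path.append(s)
    return total, path[::-1]           # total score and the chosen segment index per template
```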
Due to the complex nature of an event, it is comparatively hard to find positive exemplars which exactly match the definition of the event. However, it is easier to find videos that match the definition partially, which are referred to as related exemplars in this paper.
For example:
1. “Man performs an oil change on a motorcycle” is marked as a related exemplar of the event “Changing a vehicle tire”.
2. A video described as “A dog lies in the grass” is considered a related exemplar of the event “Grooming an animal”.
Related videos can be treated neither as positives nor as negatives. This paper proposes to adaptively learn the relevance level of each related video and leverage the related videos of high relevance to infer a robust detector. Instead of binary labels (+1, −1), ordinal labels are used to differentiate the relevance levels of related videos.
Specifically, if we use R (R ≥ 3) ordinal labels in total to denote the R relevance levels, we assign 1 as the negative label and R as the positive label. The labels between 1 and R correspond to related videos, and a larger ordinal label indicates a higher relevance level: labels close to 1 mean low relevance to the event, and labels close to R mean high relevance.
To progress beyond the state of the art, we propose a cross-feature reasoning approach to generate a set of candidate labels for all related videos and then adaptively select an optimal ordinal label for each of them. After assigning one candidate label to each video, we enumerate possible label combinations over all the related videos. We then learn an optimal weight for each label combination. In conjunction with a kernel matrix, each label combination can be used to train a model for event detection, so multiple label combinations yield multiple models. We then formulate the label weighting problem in a multiple kernel learning fashion to obtain a unified event detector, where a maximum margin criterion is applied to learn (R − 1) discriminative boundaries between each pair of consecutive ordinal labels. To make the results more robust, we recursively update the label combinations: once we obtain the unified event detector, we use it to predict the labels of the related videos and update the label combinations, which are then used for another round of learning. The procedure is repeated until convergence, and the final unified detector is used for event detection.
In our approach, a video is decomposed into a sequence of overlapping fixed-length temporal clips, on which low-level feature detectors are applied. Each clip is then represented as a histogram (bag-of-visual-words), which is used as a clip-level feature and tested against a set of pre-trained action concept detectors. Real-valued confidence scores, pertaining to the presence of each concept, are recorded for each clip, converting the video into a vector time series.
video --> fixed-length clips
clip --> feature --(action concept detectors)--> a vector time series (one confidence score per concept detector)
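A minimal sketch of this pipeline with hypothetical detector callables:

```python
import numpy as np

def video_to_time_series(clip_histograms, concept_detectors):
    """clip_histograms: T bag-of-visual-words histograms (one per fixed-length clip);
    concept_detectors: K callables, each mapping a histogram to a confidence score.
    Returns a (T, K) vector time series of concept confidences."""
    return np.array([[det(h) for det in concept_detectors] for h in clip_histograms])
```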
Our goal is to break a visual sequence into segments of varied lengths and label them with events of interest or a null event.
The sequence model is built upon visual words (sub-events), not on annotated events, thus it does not require ground truth.
We first represent a video as a sequence of visual words learnt from our data in an unsupervised way with k-means clustering. We then apply the Sequence Memoizer (SM) [21] to explore temporal dependencies among the visual words in the sequence.
The SM-based sequence model can predict the occurrence of the next visual word in a sequence conditioned on all of its previously observed context. We finally integrate the sequence model and event classification into a framework that performs segmentation and classification of events jointly in a video.
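A sketch of the unsupervised visual-word step only (the Sequence Memoizer itself is not shown); `clip_feats_per_video` is a hypothetical list of per-video clip feature arrays:

```python
import numpy as np
from sklearn.cluster import KMeans

def videos_to_word_sequences(clip_feats_per_video, n_words=100):
    """Quantize clip features into a shared vocabulary of visual words (k-means),
    then map each video to its sequence of word indices."""
    km = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(clip_feats_per_video))
    return [km.predict(feats) for feats in clip_feats_per_video]
```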
Multi-instance Learning: a video consists of instances.
Our key assumption is that the positive videos usually have a large portion of positive instances, while the negative videos have few positive instances.
Mainly built on VGG16.
Frame features are extracted and then encoded into a video-level feature.
feature: IDT+FV
framework (?): extract features with multiple time skips
CNN model: similar to “ImageNet Classification with Deep Convolutional Neural Networks” (2012)
details: pretrained on ImageNet14, fine-tuned on MED
With a slight modification to the CNN model's structure, it can be trained on videos.
surveillance videos.
hierarchical model
to be continued
We propose a novel kernel for modeling cardinality relations, counting instance labels in a bag – for example the number of people in a scene who are performing an action.
Each video is a bag comprised of individual frames. The goal is to label a video according to whether a high-level event of interest is occurring in the video or not.
In this work, we formulate a new weakly supervised domain generalization approach for visual recognition by using loosely labeled web images/videos as training data.
1. coping with noise in the labels of training web images/videos in the source domain;
We formulate a multi-instance learning (MIL) problem by selecting a subset of training samples from each training bag and simultaneously learning the optimal classifiers based on the selected samples.
2. enhancing generalization capability of learnt classifiers to any unseen target domain.
We assume the training web images/videos may come from multiple hidden domains with different data distributions. We aim to learn one classifier for each class and each latent domain. As each classifier is learnt from training samples with a distinctive data distribution, each integrated classifier, obtained by combining the multiple classifiers of a class, is expected to be robust to variations in data distribution and thus generalizes well to test data from any unseen target domain.
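A hedged sketch of the integration step for one class: the per-latent-domain classifiers are combined into a single score. A plain weighted average is shown here; the paper's actual fusion rule may differ.

```python
import numpy as np

def integrated_score(x, domain_classifiers, weights=None):
    """domain_classifiers: one decision function per latent domain for a given class.
    Returns a single integrated score for sample x."""
    scores = np.array([clf(x) for clf in domain_classifiers])
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))   # uniform fusion by default
    return float(np.dot(weights, scores))
```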
[TRECVID 0Ex: only a textual description, no training videos.]
Our system is built on the observation that an event is a composition of multiple mid-level concepts. These concepts are shared among events and can be collected from other sources (not necessarily related to the event search task). We then train a skip-gram language model to automatically identify the most relevant concepts to a particular event of interest.
Not all concept classifiers are equally reliable, especially when they are trained on other source domains. A relevant concept can be of limited use, or even misleading, if its classifier is highly unreliable. Therefore, when combining concept scores, we propose to take their relevance, predictive power, and reliability all into account.
Since concepts are shared among many different classes (events) and each concept classifier can be trained independently on datasets from other sources, semantic event search can be achieved by combining the relevant concept classification scores.
learning a relevance score between the event description and the pre-trained concept (attribute) classifiers.
output: a relevance vector w ∈ [0,1]^m, where wk measures the a priori relevance between the k-th concept and the event of interest.
We further prune and refine these weights (relevance scores) for the following reasons:
1) Some concepts, although relevant to the event of interest, may not be very discriminative (low predictive power).
2) Some concepts may not be very reliable, possibly because they are trained on different domains.
how to:
We use the (unlabeled) MED 2014 Research dataset to crudely refine the concepts as follows. We first compute a similarity score between the concept names and the text description of each video in the research dataset, which acts as a concept label, i.e., the likelihood that each video contains a particular concept. We then run the concept classifiers on each video in the research dataset and use the aforementioned concept labels to compute average precisions. Concepts with low precision or low predictive power (such as the concept people) are then dropped.
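A rough sketch of this pruning step under stated assumptions (thresholded text similarities serve as noisy concept labels; the threshold values are hypothetical):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def prune_concepts(text_sim, clf_scores, label_thresh=0.5, ap_thresh=0.1):
    """text_sim[k, v]: similarity between concept k's name and video v's text description;
    clf_scores[k, v]: concept k's classifier score on video v.
    Keep concepts whose average precision against the pseudo-labels is high enough."""
    keep = []
    for k in range(clf_scores.shape[0]):
        pseudo_labels = (text_sim[k] >= label_thresh).astype(int)
        if pseudo_labels.sum() == 0:          # no pseudo-positives: cannot evaluate this concept
            continue
        ap = average_precision_score(pseudo_labels, clf_scores[k])
        if ap >= ap_thresh:
            keep.append(k)
    return keep
```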
Suppose for event e we have selected m concepts (different events may use different concepts), each with a weight wi ∈ [0,1], i = 1, ..., m. Then, for any test video v, the i-th concept classifier generates a confidence score si(v) ∈ [−1,1]. Since different concept classifiers produce differently scaled confidence scores, we need a principled way to combine them, preferably also taking their relevance w into account.
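As a simple baseline for the combination (not the method the paper ultimately proposes), a relevance-weighted sum of the per-concept confidences can rank test videos:

```python
import numpy as np

def event_score(w, s):
    """w: (m,) concept relevance weights in [0, 1]; s: (m,) concept confidences in [-1, 1].
    Videos are then ranked by this score for the 0Ex event search task."""
    return float(np.dot(w, s))
```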
Previous research has pursued both unsupervised and supervised settings. Unsupervised models were typically used for event retrieval, where the goal is to retrieve all the videos in the database that are in some sense similar to the query video provided by a user. On the other hand, supervised learning has been used for event recognition or detection in similar ways as in action recognition and general video classification. In this latter case, a classifier is learned from annotated training videos to detect and recognize the event categories of the test videos, e.g., the multimedia event detection task of TRECVID. In practical applications, it is often important to qualify the event category prediction by providing an explanation for it; in particular, the system needs to localize the key pieces of evidence that led to the recognition decision. This is sometimes referred to as event recounting.
In ER3, (i) we introduce a feature alignment step which can significantly suppress the redundant information and generate a more comprehensive and compact video representation called video imprint. In addition, the video imprint also preserves the local spatial layout among video frames. (ii) Based on the video imprint, we further employ a reasoning network, a modified version of the neural memory network, which can simultaneously recognize the event category and locate the key pieces of evidence for the event category.
A bank of base classifiers follows, each of which is trained to produce a likelihood score based on a subset of the features. Their outputs for a particular event are then fused by our method, and the resulting fused likelihood is used to rank the clips in the archive relative to the operator's interest.
Various features are extracted and many classifiers are trained. At test time each classifier produces a score, and these scores are fused.
A multi-class classifier is trained (usually it would be binary, one-versus-all)?
SMMED is a maximum margin classifier learned using partial segments of training events. Unlike existing approaches, SMMED can sequentially select the most likely subset of classes while automatically enforcing a larger margin for the unlikely classes. As a result, SMMED can reliably discard many classes using only partially observed events.
Fig. 1. Given a test event (sequence of a subject playing the violin in the top of the figure), SMMED sequentially evaluates partial events at {10%, 20%, ..., 100%}. When SMMED is confident that the event is not from a given class, it automatically discards this class from further consideration. The blue bars illustrate that class #2 (IceDancing), #4 (BlowDryHair), #5 (Blending) and #3 (Shaving) are sequentially discarded. Finally, the test event is identified as class #1 (playing the violin): the remaining class (the longest blue bar), after 80% of the event has been evaluated.
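An illustrative sketch of the sequential elimination behaviour shown in the figure; the fixed margin here is a stand-in for the larger margins that SMMED actually learns for unlikely classes:

```python
import numpy as np

def sequential_elimination(partial_scores, margin=1.0):
    """partial_scores: (n_steps, C) class scores computed on growing fractions of the
    event (e.g. 10%, 20%, ..., 100%). A class is discarded once its score trails the
    best surviving class by more than `margin`."""
    alive = np.ones(partial_scores.shape[1], dtype=bool)
    last = partial_scores[0]
    for scores in partial_scores:
        last = scores
        best = scores[alive].max()
        alive[alive] = scores[alive] >= best - margin   # drop classes far behind the leader
        if alive.sum() == 1:                            # only one candidate left: stop early
            break
    survivors = np.flatnonzero(alive)
    return int(survivors[np.argmax(last[survivors])])   # predicted class index
```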
Figure 1. Illustration of our approach. (Top) A video from the Wedding Ceremony event is separated into a sequence of clips, each of which corresponds to an activity concept such as kissing or dancing. (Bottom) Each dimension in our representation corresponds to an activity concept transition. A positive value indicates the transition is more likely to happen than predicted by the parameters of a Hidden Markov Model.
Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently.
The central idea is activity concept transitions (?).
1. to encode activity concept transitions with Fisher kernel techniques.
2. We use a Hidden Markov Model (HMM) as the underlying generative model. In this model, a video event is a sequence of activity concepts (the concepts presumably play the role of hidden states). A new concept is generated with a certain probability based on the previous concept. An observation is a low-level feature vector from a sub-clip, generated based on the concept.
K concept detectors are trained.
Each video is split into T segments; every segment is scored by the K detectors, yielding a T×K matrix.
1. We use HMM to model a video event with activity concept transitions over time.
2. The idea of the Fisher kernel was first proposed in [7]; the goal is to obtain the sufficient statistics of a generative model and use them as kernel functions in discriminative classifiers such as SVMs. (???) Is the FV used to obtain the video representation, with an SVM then trained on top of it?
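For reference, the standard Fisher kernel construction (general definition, not specific to this paper):

$$g_X = \nabla_{\theta} \log p(X \mid \theta), \qquad K(X, X') = g_X^{\top} F^{-1} g_{X'}, \qquad F = \mathbb{E}_X\!\left[ g_X g_X^{\top} \right],$$

so with θ the HMM transition (and emission) parameters, each dimension of g_X reflects how often a concept transition occurs in the video relative to what the model expects, and the resulting fixed-length vector can then be fed to an SVM.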
TO BE CONTINUED