Visual signals from the face-to-face data show that several features can serve as indicators of depression, anxiety, and PTSD (Scherer et al., 2013b; Scherer et al., 2014). Specifically, these forms of psychological distress are predicted by a more downward gaze angle, less intense smiles, shorter average smile durations, longer self-touches, and longer average fidgeting with both hands (e.g., rubbing, stroking) and legs (e.g., tapping, shaking). Moreover, the predictive ability of these indicators is moderated by gender (Stratou et al., 2013). A crossover interaction was observed between gender and distress level on emotional displays such as frowning, contempt, and disgust. For example, men who scored positively for depression tended to display more frowning than men who did not, whereas women who scored positively for depression tended to display less frowning than those who did not. Other features, such as variability of facial expressions, show a main effect of gender (women tend to be more expressive than men), while still other observations, such as head-rotation variation, were entirely gender independent.
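To make the crossover pattern concrete, the sketch below shows one way a gender-by-distress interaction on a per-participant behavioral feature could be tested, using an ordinary least squares model with an interaction term in statsmodels. The data frame columns (frown_rate, distressed, gender) and the file name are hypothetical, not part of the original analysis.

```python
# Minimal sketch of testing a gender x distress crossover interaction on a
# behavioral feature (e.g., frowning frequency). Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per participant with
#   frown_rate : frequency of frowning displays (per minute)
#   distressed : 1 if the participant screened positive for depression, else 0
#   gender     : "male" or "female"
df = pd.read_csv("session_features.csv")

# The interaction term captures the crossover: the sign of the distress
# effect on frowning differs between men and women.
model = smf.ols("frown_rate ~ C(gender) * distressed", data=df).fit()
print(model.summary())
```

A significant interaction coefficient with opposite simple slopes (more frowning for distressed men, less for distressed women) would correspond to the crossover interaction described above.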
Several non-verbal behaviors were annotated (Waxer, 1974; Hall et al., 1995): gaze directionality (up, down, left, right, towards the interviewer), listening smiles (smiles while not speaking), self-adaptors (self-touches on the hands, body, and head), fidgeting behaviors, and foot-tapping or shaking behaviors. Each behavior was annotated in a separate tier in ELAN. Four student annotators participated in the annotation; each tier was assigned to a pair of annotators, who first went through a training phase until their inter-rater agreement (Krippendorff's alpha) exceeded 0.7. Following training, each video was annotated by a single annotator; to monitor reliability, every 10–15 videos each pair was assigned the same video and inter-rater agreement was re-checked. Annotators were informed that their reliability was being measured but did not know which videos were used for cross-checking (Wildman et al., 1975; Harris and Lahey, 1982).
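As an illustration of the reliability check, the following minimal sketch computes Krippendorff's alpha for one tier double-coded by two annotators, using the third-party krippendorff package; the segment-level gaze codes shown are hypothetical.

```python
# Minimal sketch of the reliability check described above: two annotators label
# the same video on one tier, and agreement is measured with Krippendorff's alpha.
import numpy as np
import krippendorff

# Rows = annotators, columns = annotation segments; np.nan marks segments an
# annotator left unlabeled. Gaze direction is coded as integers here
# (0=down, 1=up, 2=left, 3=right, 4=towards interviewer); values are made up.
annotator_a = [0, 0, 4, 2, 1, 0, np.nan, 3]
annotator_b = [0, 0, 4, 2, 1, 1, 3,      3]
reliability_data = np.array([annotator_a, annotator_b], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")

# In the protocol above, training on a tier continued until this value
# exceeded 0.7, and periodic double-coded videos were used to re-check it.
```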
In addition, automatic annotation of non-verbal features was carried out using MultiSense, a multimodal sensor fusion framework with a multithreading architecture that enables different face- and body-tracking technologies to run in parallel and in real time. Output from MultiSense was used to estimate head orientation, eye-gaze direction, smile level, and smile duration. We also automatically analyzed voice characteristics, including speakers' prosody (e.g., fundamental frequency and voice intensity) and voice quality on a breathy-to-tense dimension (Scherer et al., 2013a).
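The sketch below is not the MultiSense pipeline itself; it only illustrates how the prosodic descriptors mentioned above (fundamental frequency and voice intensity) could be extracted from a session recording, using librosa as a stand-in. The audio file name is an assumption.

```python
# Minimal sketch of extracting prosodic descriptors (F0 and intensity)
# from a session recording. Not the MultiSense framework; file name is hypothetical.
import numpy as np
import librosa

y, sr = librosa.load("session_audio.wav", sr=None)

# Fundamental frequency (F0) via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Frame-level voice intensity approximated by RMS energy.
rms = librosa.feature.rms(y=y)[0]

print(f"Mean F0 over voiced frames: {np.nanmean(f0):.1f} Hz")
print(f"Mean RMS energy: {rms.mean():.4f}")
```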