MMA-82 Benchmark · Submitted to IEEE TMM 2026

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

Yanbin Hao* Pengyu Liu* Xing Wei Xun Yang Dan Guo Meng Wang
Hefei University of Technology · University of Science and Technology of China * Equal contribution
82 micro-action categories
79,574 annotated instances
75.87h total duration
454 subjects
4 source domains
2 benchmark tasks

Example video from the MMA-82 dataset, where micro-actions are short, subtle, and whole-body-level.

Abstract

Realistic Micro-Action Understanding

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols.

To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains: laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos. In total, it provides 79,574 annotated instances from 454 subjects.

Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization.

Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition.

These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://github.com/LpyNow/MMA-82 .

Overview

MMA-82 Extends Micro-Action Analysis Beyond the Lab

Overview comparing MMA-82 with existing micro-gesture and micro-action datasets
Representative micro-action samples from four MMA-82 sources

MMA-82 improves scale, category richness, scene diversity, task coverage, and complexity compared with prior micro-gesture/action datasets.

Dataset

Four Domains, One Fine-Grained Taxonomy

01

Laboratory Interview Videos

These videos were collected through a specialized face-to-face psychological interview protocol based on participants’ SCL-90 test results.

02

Psychiatric Patient Interview Videos

These videos were collected from publicly available psychiatric patient interview videos on YouTube.

03

Street Interview Videos

These videos were mainly collected from unscripted street interviews on YouTube, which are highly unstructured and recorded in open, uncontrolled environments.

04

Emotion-rich Television Videos

These videos are derived from the CAER dataset, one of the most widely used benchmarks for emotion recognition.

Benchmark Scale

MMA-82 combines recognition and detection annotations across four real-world domains.

79,574 annotated instances across all of MMA-82
39,816 MMA-82-Rec clips · 28.94h · 454 subjects
39,758 MMA-82-Det instances · 46.93h · 434 subjects
82 action-level categories in 7 body-level groups
MMA-82 taxonomy with seven body-level groups and 82 action-level categories

The label space is organized into seven body-level groups: Body, Head, Upper Limb, Lower Limb, Body-Hand, Head-Hand, and Leg-Hand.

MMA-82 recognition dataset statistics
MMA-82-Rec statistics. The recognition split contains 39,816 trimmed clips from the four MMA-82 sources and shows a pronounced long-tailed category distribution.
MMA-82 detection dataset statistics
MMA-82-Det statistics. The detection split contains 39,758 action instances in 11,180 untrimmed videos, with most instances concentrated in Head and Upper Limb actions.
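
The headline figures on this page can be cross-checked against the per-split numbers. The short sketch below only uses values quoted above; the variable names are illustrative and not part of any MMA-82 tooling.

```python
# Per-split figures as quoted on this page.
rec = {"instances": 39_816, "hours": 28.94}
det = {"instances": 39_758, "hours": 46.93, "videos": 11_180}

# Totals across the recognition and detection splits.
total_instances = rec["instances"] + det["instances"]
total_hours = round(rec["hours"] + det["hours"], 2)

# Derived averages: seconds per trimmed clip, instances per untrimmed video.
avg_clip_seconds = round(rec["hours"] * 3600 / rec["instances"], 2)
instances_per_video = round(det["instances"] / det["videos"], 2)

print(total_instances, total_hours, avg_clip_seconds, instances_per_video)
# prints: 79574 75.87 2.62 3.56
```

The results agree with the 79,574 instances, 75.87h total duration, 2.62s average clip length, and 3.56 instances per video reported elsewhere on the page.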

Tasks & Baselines

Recognition and Detection Under Realistic Shifts

Micro-Action Recognition

Given a trimmed video clip, models predict the target micro-action at body and action levels. MMA-82-Rec supports in-domain evaluation and cross-domain zero-shot / few-shot protocols.

39,816 clips
28.94h duration
2.62s avg. clip length
  • GC-TSM reaches 60.43 Top-1 Acc on the MMA-82-Rec (All) test set under the in-domain protocol.
  • PoseC3D drops to 14.13 Top-1 Acc in zero-shot transfer to emotion videos.
  • Few-shot adaptation helps, but gains saturate quickly as K increases.

Multi-label Micro-Action Detection

Given an untrimmed video, models localize and classify every micro-action instance, including co-occurring or rapidly successive subtle movements.

11,180 videos
39,758 instances
3.56 instances/video
  • AdaTAD with VideoMAE-L obtains the best reported overall AVG mAP of 23.41.
  • Wrong-label and confusion errors dominate the false-positive profile.
  • Temporal localization becomes harder with dense and overlapping actions.
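
Detection results on this page are reported at thresholds 0.2, 0.5, and 0.7. Assuming these are temporal IoU (tIoU) thresholds, as is common in temporal action detection, the sketch below shows how a predicted segment could be matched to ground truth. The function names, the greedy matching rule, and the labels `nod` and `touch_face` are illustrative assumptions, not the benchmark's official evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(preds, gts, threshold):
    """Greedy matching in descending score order.

    preds: list of (start, end, label, score)
    gts:   list of (start, end, label)
    Returns one True/False flag per prediction (True = true positive).
    Each ground-truth segment can be matched at most once, and only by a
    prediction that carries the same label.
    """
    matched = set()
    flags = []
    for start, end, label, _score in sorted(preds, key=lambda p: -p[3]):
        best_iou, best_idx = 0.0, None
        for idx, (g_start, g_end, g_label) in enumerate(gts):
            if g_label != label or idx in matched:
                continue
            iou = temporal_iou((start, end), (g_start, g_end))
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags


# Two overlapping ground-truth segments with hypothetical labels.
gts = [(1.0, 3.0, "nod"), (2.0, 4.0, "touch_face")]
preds = [(1.2, 3.1, "nod", 0.9), (2.5, 5.5, "touch_face", 0.8)]

for thr in (0.2, 0.5, 0.7):
    print(thr, match_predictions(preds, gts, thr))
```

In this toy case the second prediction has a tIoU of about 0.43, so it counts as a true positive at 0.2 but not at 0.5 or 0.7, which illustrates why scores fall as the threshold tightens.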
Representative MMA-82 recognition examples with pose annotations

Micro-action Recognition examples span all seven body-level groups and multiple source domains.

Representative MMA-82 detection examples with temporal annotations

Micro-action Detection examples include multiple overlapping micro-action segments in untrimmed videos.

Full Baseline Results

All values below are generated from the LaTeX source tables in tabs/*.tex.

Micro-Action Recognition: In-Domain

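Action Level

| Sub-Dataset | Method | Top-1 Acc (Val) | Top-1 Acc (Test) | Top-5 Acc (Val) | Top-5 Acc (Test) | MCA (Val) | MCA (Test) | Macro F1 (Val) | Macro F1 (Test) | Micro F1 (Val) | Micro F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MMA-82-Rec (All) | Skeleton | 54.43 | 56.62 | 80.44 | 80.45 | 36.26 | 36.77 | 37.35 | 39.39 | 54.43 | 56.62 |
| MMA-82-Rec (All) | RGB | 57.59 | 60.43 | 83.01 | 86.14 | 37.39 | 38.98 | 35.64 | 39.56 | 57.59 | 60.43 |
| Laboratory Interviews | Skeleton | 57.81 | 64.84 | 84.28 | 87.81 | 39.82 | 44.88 | 41.77 | 47.03 | 57.81 | 64.84 |
| Laboratory Interviews | RGB | 62.01 | 68.15 | 87.31 | 92.83 | 43.00 | 48.01 | 39.69 | 46.48 | 62.01 | 68.15 |
| Psychiatric Interviews | Skeleton | 50.64 | 41.63 | 75.46 | 68.79 | 27.33 | 15.74 | 24.93 | 17.65 | 50.64 | 41.63 |
| Psychiatric Interviews | RGB | 52.00 | 46.74 | 75.89 | 72.13 | 24.77 | 14.10 | 17.33 | 13.64 | 52.00 | 46.74 |
| Street Interviews | Skeleton | 44.82 | 40.77 | 68.18 | 70.71 | 19.55 | 19.49 | 21.20 | 20.70 | 44.82 | 40.77 |
| Street Interviews | RGB | 42.80 | 39.87 | 69.70 | 73.55 | 13.79 | 12.75 | 12.09 | 12.56 | 42.80 | 39.87 |
| Emotion Videos | Skeleton | 29.55 | 7.29 | 60.32 | 17.93 | 19.25 | 3.72 | 19.01 | 3.11 | 29.55 | 7.29 |
| Emotion Videos | RGB | 35.63 | 25.53 | 67.61 | 52.89 | 21.83 | 12.20 | 22.19 | 11.05 | 35.63 | 25.53 |

Body Level

| Sub-Dataset | Method | Top-1 Acc (Val) | Top-1 Acc (Test) | Top-5 Acc (Val) | Top-5 Acc (Test) | MCA (Val) | MCA (Test) | Macro F1 (Val) | Macro F1 (Test) | Micro F1 (Val) | Micro F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MMA-82-Rec (All) | Skeleton | 79.04 | 81.93 | 99.18 | 99.11 | 69.62 | 71.71 | 71.12 | 74.62 | 79.04 | 81.93 |
| MMA-82-Rec (All) | RGB | 79.18 | 83.17 | 98.79 | 98.95 | 67.24 | 69.48 | 67.77 | 71.98 | 79.18 | 83.17 |
| Laboratory Interviews | Skeleton | 81.74 | 86.73 | 99.59 | 99.80 | 75.19 | 79.78 | 76.16 | 81.87 | 81.74 | 86.73 |
| Laboratory Interviews | RGB | 83.61 | 86.75 | 99.13 | 99.45 | 73.59 | 75.96 | 73.85 | 77.66 | 83.61 | 86.75 |
| Psychiatric Interviews | Skeleton | 76.97 | 77.73 | 98.71 | 99.01 | 57.29 | 47.46 | 52.54 | 51.33 | 76.97 | 77.73 |
| Psychiatric Interviews | RGB | 74.25 | 81.99 | 98.00 | 99.08 | 52.64 | 44.26 | 45.34 | 45.36 | 74.25 | 81.99 |
| Street Interviews | Skeleton | 69.95 | 71.35 | 97.60 | 98.84 | 45.44 | 43.39 | 49.09 | 49.52 | 69.95 | 71.35 |
| Street Interviews | RGB | 63.26 | 68.90 | 97.85 | 97.29 | 37.53 | 37.02 | 36.25 | 38.23 | 63.26 | 68.90 |
| Emotion Videos | Skeleton | 58.30 | 35.87 | 97.57 | 88.45 | 38.50 | 19.26 | 39.74 | 17.89 | 58.30 | 35.87 |
| Emotion Videos | RGB | 57.09 | 55.93 | 98.38 | 93.01 | 31.04 | 33.20 | 29.69 | 31.01 | 57.09 | 55.93 |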
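
The recognition tables report Top-1 Acc, Top-5 Acc, MCA, Macro F1, and Micro F1. The following is a minimal sketch of how such metrics are commonly computed for single-label classification. It assumes MCA stands for mean class accuracy (mean per-class recall) and that macro averages run over classes present in the ground truth; neither detail is confirmed on this page, and `recognition_metrics` is a hypothetical helper, not the authors' evaluation script.

```python
import numpy as np


def recognition_metrics(scores, labels, k=5):
    """scores: (N, C) class scores; labels: (N,) integer ground-truth classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    preds = scores.argmax(axis=1)

    top1 = float((preds == labels).mean())
    # A sample is a top-k hit if its label is among the k highest scores.
    topk_idx = np.argsort(-scores, axis=1)[:, :k]
    topk = float((topk_idx == labels[:, None]).any(axis=1).mean())

    recalls, f1s = [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in range(scores.shape[1]):
        tp = int(((preds == c) & (labels == c)).sum())
        fp = int(((preds == c) & (labels != c)).sum())
        fn = int(((preds != c) & (labels == c)).sum())
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
        if tp + fn > 0:  # assumption: skip classes absent from the ground truth
            recalls.append(tp / (tp + fn))
            f1s.append(2 * tp / (2 * tp + fp + fn))

    return {
        "top1": top1,
        "topk": topk,
        "mca": float(np.mean(recalls)),
        "macro_f1": float(np.mean(f1s)),
        "micro_f1": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
    }


# Toy example: 4 clips, 3 classes; the third clip (class 2) is misclassified.
scores = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.5, 0.3], [0.1, 0.7, 0.2]]
labels = [0, 1, 2, 1]
metrics = recognition_metrics(scores, labels, k=2)
print(metrics)
```

With exactly one predicted label per clip, micro F1 is mathematically equal to Top-1 accuracy, which is consistent with the Micro F1 and Top-1 Acc columns matching in most rows. The gap between Top-1 Acc and MCA or Macro F1 reflects the long-tailed category distribution, since rare classes weigh as much as frequent ones in the macro averages.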

Micro-Action Recognition: Cross-Domain

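| Source | Task | Action Top-1 Acc | Action Top-5 Acc | Action MCA | Action Macro F1 | Action Micro F1 | Body Top-1 Acc | Body Top-5 Acc | Body MCA | Body Macro F1 | Body Micro F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Psychiatric Interviews | Zero-Shot | 27.30 | 50.28 | 14.25 | 7.96 | 27.30 | 68.58 | 96.74 | 49.40 | 42.43 | 68.58 |
| Psychiatric Interviews | 1-Shot | 30.27 | 58.62 | 22.82 | 13.31 | 30.28 | 72.48 | 93.22 | 55.50 | 44.47 | 72.48 |
| Psychiatric Interviews | 5-Shot | 30.12 | 58.60 | 22.72 | 13.29 | 30.12 | 72.60 | 93.28 | 55.43 | 44.41 | 72.60 |
| Psychiatric Interviews | 10-Shot | 30.13 | 58.59 | 22.71 | 13.24 | 30.12 | 72.64 | 93.30 | 55.42 | 44.48 | 72.64 |
| Street Interviews | Zero-Shot | 20.65 | 44.90 | 10.65 | 6.91 | 20.65 | 53.16 | 92.65 | 38.68 | 35.12 | 53.16 |
| Street Interviews | 1-Shot | 21.38 | 45.64 | 15.60 | 12.40 | 21.38 | 54.95 | 84.46 | 38.92 | 37.27 | 54.95 |
| Street Interviews | 5-Shot | 21.58 | 46.10 | 16.17 | 13.28 | 21.58 | 64.91 | 95.61 | 40.33 | 17.00 | 64.91 |
| Street Interviews | 10-Shot | 22.52 | 45.92 | 17.76 | 14.23 | 22.52 | 65.64 | 97.22 | 39.37 | 38.91 | 65.64 |
| Emotion Videos | Zero-Shot | 14.13 | 32.98 | 6.63 | 4.18 | 14.13 | 43.92 | 90.43 | 25.18 | 23.77 | 43.92 |
| Emotion Videos | 1-Shot | 17.26 | 39.77 | 14.00 | 11.62 | 17.26 | 41.43 | 79.84 | 27.11 | 25.13 | 41.43 |
| Emotion Videos | 5-Shot | 17.48 | 41.75 | 15.20 | 12.10 | 17.49 | 42.35 | 90.80 | 26.93 | 24.56 | 42.35 |
| Emotion Videos | 10-Shot | 17.70 | 42.53 | 15.43 | 12.26 | 17.70 | 42.03 | 92.30 | 26.95 | 24.39 | 42.03 |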
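
The K-shot rows above imply that a small labeled support set is drawn from the target domain before adaptation. The page does not describe how that set is built, so the sketch below shows one common approach under stated assumptions: sample up to K clips per category with a fixed seed. The function `sample_k_shot`, the clip IDs, and the label names are all hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(samples, k, seed=0):
    """Draw up to k clips per label from (clip_id, label) pairs.

    With a long-tailed label distribution, some categories may have fewer
    than k clips in the target domain; those categories keep all they have.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip_id, label in samples:
        by_label[label].append(clip_id)

    support = {}
    for label in sorted(by_label):
        clips = by_label[label]
        support[label] = rng.sample(clips, min(k, len(clips)))
    return support


# Hypothetical target-domain annotations: one frequent and one rare category.
target = [("c1", "nod"), ("c2", "nod"), ("c3", "nod"), ("c4", "shrug")]
support = sample_k_shot(target, k=2)
print({label: len(clips) for label, clips in support.items()})
# prints: {'nod': 2, 'shrug': 1}
```

Fixing the seed keeps the support set identical across methods, so differences between the 1-shot, 5-shot, and 10-shot rows can be attributed to K and not to sampling noise.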

Micro-Action Detection

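| Backbone | Action-Level @0.2 | Action-Level @0.5 | Action-Level @0.7 | Action-Level Avg | Body-Level @0.2 | Body-Level @0.5 | Body-Level @0.7 | Body-Level Avg | AVG |
|---|---|---|---|---|---|---|---|---|---|
| VideoMAE-S | 20.88 | 12.72 | 5.56 | 12.09 | 48.18 | 28.78 | 13.91 | 25.44 | 18.77 |
| VideoMAE-B | 22.62 | 14.67 | 6.32 | 13.59 | 50.95 | 30.46 | 12.23 | 29.13 | 21.36 |
| VideoMAE-L | 22.74 | 15.60 | 7.68 | 14.98 | 55.48 | 33.01 | 14.06 | 31.83 | 23.41 |
| VideoMAE-H | 26.53 | 17.56 | 7.55 | 16.08 | 54.71 | 33.64 | 14.17 | 30.05 | 23.07 |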
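
The page does not define the final AVG column. The numbers are consistent with AVG being the mean of the action-level Avg and the body-level Avg, up to rounding; this is an inference from the reported values and not a statement from the paper. The check below uses only figures from the detection table.

```python
# (action-level Avg, body-level Avg, reported AVG) per backbone.
rows = {
    "VideoMAE-S": (12.09, 25.44, 18.77),
    "VideoMAE-B": (13.59, 29.13, 21.36),
    "VideoMAE-L": (14.98, 31.83, 23.41),
    "VideoMAE-H": (16.08, 30.05, 23.07),
}


def is_mean_of_levels(action_avg, body_avg, reported, tol=0.0051):
    """True if `reported` matches the two-level mean within rounding error."""
    return abs((action_avg + body_avg) / 2 - reported) <= tol


checks = {name: is_mean_of_levels(*values) for name, values in rows.items()}
print(checks)
```

Under this reading, VideoMAE-L leads on overall AVG because of its body-level score, even though VideoMAE-H has the higher action-level Avg.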

Emotion Recognition

| Task | Method | Top-1 Acc | F1 |
|---|---|---|---|
| Micro-Expression Only | DeepFace | 22.86 | 17.54 |
| Micro-Action Only | TSM | 32.38 | 31.86 |
| Both | DeepFace + TSM | 32.86 | 32.36 |

Top-5 Micro-Actions and Emotion

| No. | Experiment | Acc | Delta | F1 |
|---|---|---|---|---|
| 1 | Base Results | 0.271 | 0 | 0.277 |
| 2 | No Top-5 MAs only | 0.186 | -0.086 | 0.147 |
| 3 | Top-5 MAs only | 0.379 | +0.107 | 0.350 |

Emotion Analysis

Micro-Actions Provide Affective Cues

Sankey visualization of top micro-actions associated with emotions

Decision-tree analysis reveals emotion-specific micro-action patterns, while related emotions share overlapping body cues.

What the paper finds

  • Sad and melancholy both correlate with bowing head and turning head.
  • Sad shows more explicit negative bodily actions, while melancholy is more inward and subtle.
  • Micro-actions alone outperform facial micro-expression cues on the emotion subset in the reported setup.
  • Combining micro-actions with micro-expressions improves over the facial baseline.
Micro-expression only: 22.86 Top-1 · 17.54 F1
Micro-action only: 32.38 Top-1 · 31.86 F1
Both: 32.86 Top-1 · 32.36 F1
Emotion-rich television clips annotated with micro-actions

Emotion-rich television examples illustrate how annotated micro-actions connect subtle movements with affective states.

Paper

Read the Full Paper

BibTeX

@misc{hao2026newmultidomainbenchmarkmicroaction,
  title={A New Multi-Domain Benchmark for Micro-Action Recognition and Detection},
  author={Hao, Yanbin and Liu, Pengyu and Wei, Xing and Yang, Xun and Guo, Dan and Wang, Meng},
  year={2026},
  eprint={2606.14096},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.14096}
}