TimeSformer

PyTorch

Overview

The TimeSformer model was proposed in TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Facebook Research. This work is a milestone in the action-recognition field, being the first video transformer. It inspired many transformer-based video understanding and classification papers.

The abstract from the paper is the following:

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named “TimeSformer,” adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: this https URL.

This model was contributed by fcakyon. The original code can be found here.

Usage tips

There are many pretrained variants. Select your pretrained model based on the dataset it was trained on. Moreover, the number of input frames per clip differs between variants, so take this parameter into account when selecting your pretrained model.
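For example, you can load just the configuration of a checkpoint and inspect num_frames and image_size before deciding how to preprocess your clips. The checkpoint names below are illustrative examples of variants published on the Hub; swap in whichever one matches your dataset:

```python
from transformers import TimesformerConfig

# Example checkpoints (availability on the Hub is assumed); each pairs a dataset
# with a clip length and resolution, so check the config before preprocessing.
checkpoints = [
    "facebook/timesformer-base-finetuned-k400",
    "facebook/timesformer-base-finetuned-k600",
    "facebook/timesformer-hr-finetuned-k400",
]

for name in checkpoints:
    config = TimesformerConfig.from_pretrained(name)
    print(name, "- frames per clip:", config.num_frames, "- image size:", config.image_size)
```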

Resources
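For a quick try-out without writing any preprocessing code, the generic video-classification pipeline can wrap a TimeSformer checkpoint. A minimal sketch, assuming the facebook/timesformer-base-finetuned-k400 checkpoint and a video decoding backend such as decord are available:

```python
from huggingface_hub import hf_hub_download
from transformers import pipeline

# download a short demo clip (the same one used in the examples below)
video_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)

# the pipeline handles frame sampling and preprocessing internally
video_classifier = pipeline("video-classification", model="facebook/timesformer-base-finetuned-k400")
print(video_classifier(video_path, top_k=3))
```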

TimesformerConfig

class transformers.TimesformerConfig


( image_size = 224 patch_size = 16 num_channels = 3 num_frames = 8 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-06 qkv_bias = True attention_type = 'divided_space_time' drop_path_rate = 0 **kwargs )

Parameters

This is the configuration class to store the configuration of a TimesformerModel. It is used to instantiate a TimeSformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TimeSformer facebook/timesformer-base-finetuned-k600 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

```python
from transformers import TimesformerConfig, TimesformerModel

# Initializing a TimeSformer timesformer-base style configuration
configuration = TimesformerConfig()

# Initializing a model from the configuration
model = TimesformerModel(configuration)

# Accessing the model configuration
configuration = model.config
```
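The defaults can also be overridden to match a different input setup. A sketch of a hypothetical custom configuration is shown below; in this implementation, attention_type accepts "divided_space_time" (the default), "space_only" and "joint_space_time":

```python
from transformers import TimesformerConfig, TimesformerModel

# A hypothetical 16-frame setup with joint space-time attention; this builds a
# randomly initialized model, not a pretrained checkpoint.
custom_config = TimesformerConfig(
    num_frames=16,
    image_size=224,
    patch_size=16,
    attention_type="joint_space_time",
)
model = TimesformerModel(custom_config)
```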

TimesformerModel

class transformers.TimesformerModel


( config )

Parameters

config (TimesformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Timesformer Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: FloatTensor output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

Returns

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TimesformerConfig) and inputs.

The TimesformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
import av
import numpy as np

from transformers import AutoImageProcessor, TimesformerModel
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (int): Total number of frames to sample.
        frame_sample_rate (int): Sample every n-th frame.
        seg_len (int): Maximum allowed index of sample's last frame.
    Returns:
        indices (List[int]): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# download an example video and open it with PyAV
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
container = av.open(file_path)

# sample 8 frames from the video
indices = sample_frame_indices(clip_len=8, frame_sample_rate=4, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

# prepare the video for the model and run a forward pass
inputs = image_processor(list(video), return_tensors="pt")

outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)
# [1, 1569, 768]
```
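The sequence length of 1569 can be sanity-checked from the configuration: with the default 224×224 resolution and 16×16 patches, each frame yields (224 / 16)² = 196 patch tokens, 8 frames give 8 × 196 = 1568 tokens, and one [CLS] token is prepended. A small sketch, assuming the checkpoint keeps these defaults:

```python
from transformers import TimesformerConfig

# recompute the expected sequence length from the checkpoint configuration
config = TimesformerConfig.from_pretrained("facebook/timesformer-base-finetuned-k400")
patches_per_frame = (config.image_size // config.patch_size) ** 2  # (224 // 16) ** 2 = 196
expected_seq_len = config.num_frames * patches_per_frame + 1       # + 1 for the [CLS] token
print(expected_seq_len)  # 1569
```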

TimesformerForVideoClassification

class transformers.TimesformerForVideoClassification


( config )

Parameters

config (TimesformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

TimeSformer Model transformer with a video classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for Kinetics-400.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Returns

A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TimesformerConfig) and inputs.

The TimesformerForVideoClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
import av
import torch
import numpy as np

from transformers import AutoImageProcessor, TimesformerForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (int): Total number of frames to sample.
        frame_sample_rate (int): Sample every n-th frame.
        seg_len (int): Maximum allowed index of sample's last frame.
    Returns:
        indices (List[int]): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# download an example video and open it with PyAV
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
container = av.open(file_path)

# sample 8 frames from the video
indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# the model predicts one of the 400 Kinetics-400 classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
# eating spaghetti
```
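To look beyond the single best class, the logits can be ranked with torch.topk. A short follow-up sketch that continues the example above (it reuses the logits, model and torch already defined there):

```python
# rank the five most likely Kinetics-400 classes for the same clip
top5 = torch.topk(logits.softmax(-1), k=5)
for score, class_idx in zip(top5.values[0], top5.indices[0]):
    print(f"{model.config.id2label[class_idx.item()]}: {score.item():.3f}")
```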
