Vicky Kalogeiton - Academia.edu
Papers by Vicky Kalogeiton
arXiv (Cornell University), Nov 6, 2023
arXiv (Cornell University), Jan 7, 2024
arXiv (Cornell University), Feb 11, 2023
Cybernetics and Systems, Mar 11, 2019
arXiv (Cornell University), Sep 19, 2022
arXiv (Cornell University), Apr 26, 2016
Neuromethods, Feb 24, 2012
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in the case of text strings, parts of images for visual transformers), termed attention. The cost is quadratic in the number of tokens. For image classification, the most common transformer architecture uses only the transformer encoder to transform the various input tokens. However, there are also numerous other applications in which the decoder part of the traditional transformer architecture is also used. Here, we first introduce the attention mechanism (Subheading 1) and then the basic transformer block, including the vision transformer (Subheading 2). Next, we discuss some improvements of visual transformers to account for small datasets or less computation (Subheading 3). Finally, we introduce visual transformers applied to tasks other than image classification, such as detection, segmentation, generation, and training without labels (Subheading 4), and to other domains, such as video or multimodality using text or audio data (Subheading 5).
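As a rough illustration of the attention mechanism described in this abstract, the minimal sketch below computes scaled dot-product self-attention over a toy set of tokens; the tensor sizes and the use of NumPy are assumptions for illustration, not the chapter's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Mix the value vectors using pairwise token-token attention weights.

    Q, K, V: arrays of shape (num_tokens, dim). The score matrix is
    (num_tokens x num_tokens), which is where the quadratic cost comes from.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise token-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted sum of the values

# Toy example: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8)
```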
HAL (Le Centre pour la Communication Scientifique Directe), Oct 11, 2021
Automatically analysing stylistic features in movies is a challenging task, as it requires an in-depth knowledge of cinematography. In the literature, only a handful of methods explore stylistic feature extraction, and they typically focus on limited low-level image and shot features (colour histograms, average shot lengths or shot types, amount of camera motion). These, however, only capture a subset of the stylistic features which help to characterise a movie (e.g. black and white vs. coloured, or film editing). To this end, in this work, we systematically explore seven high-level features for movie style analysis: character segmentation, pose estimation, depth maps, focus maps, frame layering, camera motion type and camera pose. Our findings show that low-level features remain insufficient for movie style analysis, while high-level features seem promising.
arXiv (Cornell University), Mar 31, 2023
Adapting a segmentation model from a labeled source domain to a target domain, where a single unlabeled datum is available, is one of the most challenging problems in domain adaptation and is otherwise known as one-shot unsupervised domain adaptation (OSUDA). Most of the prior works have addressed the problem by relying on style transfer techniques, where the source images are stylized to have the appearance of the target domain. Departing from the common notion of transferring only the target "texture" information, we leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a synthetic target dataset with photo-realistic images that not only faithfully depict the style of the target domain, but are also characterized by novel scenes in diverse contexts. The text interface in our method, Data AugmenTation with diffUsion Models (DATUM), endows us with the possibility of guiding the generation of images towards desired semantic concepts while respecting the original spatial context of a single training image, which is not possible in existing OSUDA methods. Extensive experiments on standard benchmarks show that our DATUM surpasses the state-of-the-art OSUDA methods by up to +7.1%.
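Purely as a hedged illustration of the general ingredient this abstract relies on (generating synthetic, photo-realistic training images with an off-the-shelf text-to-image diffusion model), one could call the Hugging Face diffusers library roughly as below; the model id, the prompts and the output paths are assumptions, and this is not the DATUM pipeline itself.

```python
# Illustrative sketch only: synthesizing extra images with a text-to-image
# diffusion model via the `diffusers` library. NOT the DATUM implementation;
# model id, prompts, and file names are assumed for the example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical prompts describing a target domain (e.g. driving scenes).
prompts = [
    "a photo of a city street at night, wet road, traffic lights",
    "a photo of a foggy highway with cars and pedestrians",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synthetic_target_{i}.png")
```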
arXiv (Cornell University), Mar 21, 2023
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in the case of text strings, parts of images for visual Transformers), termed attention. The cost is quadratic in the number of tokens. For image classification, the most common Transformer architecture uses only the Transformer encoder to transform the various input tokens. However, there are also numerous other applications in which the decoder part of the traditional Transformer architecture is also used. Here, we first introduce the attention mechanism (Section 1) and then the basic Transformer block, including the Vision Transformer (Section 2). Next, we discuss some improvements of visual Transformers to account for small datasets or less computation (Section 3). Finally, we introduce visual Transformers applied to tasks other than image classification, such as detection, segmentation, generation and training without labels (Section 4), and to other domains, such as video or multimodality using text or audio data (Section 5).
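To make the encoder-only design for image classification more concrete, here is a minimal Vision-Transformer-style sketch in PyTorch (patch embedding, a class token, and a standard Transformer encoder); the patch size, depth and dimensions are illustrative assumptions rather than any published configuration.

```python
# Minimal Vision-Transformer-style classifier sketch in PyTorch.
# Hyperparameters (patch size, depth, dims, classes) are illustrative assumptions.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch_size=8, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Split the image into patches and project each one to a token embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                            # classify from the class token

model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```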
British Machine Vision Conference, 2020
In this work, we introduce the Constrained first nearest neighbour Clustering (C1C) method for video face clustering. Using the premise that the first nearest neighbour (1NN) of an instance is sufficient to discover large chains and groupings, C1C builds upon the hierarchical clustering method FINCH by imposing must-link and cannot-link constraints acquired in a self-supervised manner. We show that adding these constraints leads to performance improvements with low computational cost. C1C is easily scalable and does not require any training. Additionally, we introduce a new Friends dataset for evaluating the performance of face clustering algorithms. Given that most video datasets for face clustering are saturated or emphasize only the main characters, the Friends dataset is larger, contains identities for several main and secondary characters, and tackles more challenging cases, as it also labels the 'back of the head'. We evaluate C1C on the Big Bang Theory, Buffy, and Sherlock datasets for video face clustering, and show that it achieves the new state of the art whilst setting the baseline on Friends.
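As a hedged sketch of the underlying idea only (linking each point to its first nearest neighbour and merging the resulting chains into clusters, with cannot-link constraints blocking some merges), the toy code below uses assumed data and a simplified merge rule; it is not the C1C or FINCH implementation.

```python
# Toy illustration of 1NN-chain clustering with simple cannot-link constraints.
# Simplified sketch for intuition only, not the C1C/FINCH code.
import numpy as np

def one_nn_clusters(X, cannot_link=()):
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                      # index of each point's first nearest neighbour

    clusters = {i: {i} for i in range(n)}      # cluster id -> member indices
    label = list(range(n))                     # point index -> cluster id

    def violates(a, b):
        merged = clusters[a] | clusters[b]
        return any(i in merged and j in merged for i, j in cannot_link)

    for i in range(n):
        a, b = label[i], label[nn[i]]
        if a != b and not violates(a, b):      # merge i's cluster with its 1NN's cluster
            clusters[a] |= clusters[b]
            for p in clusters[b]:
                label[p] = a
            del clusters[b]
    return label

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(one_nn_clusters(X, cannot_link=[(0, 2)]))  # two groups: [0, 0, 2, 2]
```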
Studies in computational intelligence, 2015
Evacuation is the imminent movement of people away from sources of danger. Evacuation in highly structured environments, e.g. buildings, requires advance planning and large-scale control. Finding the shortest path towards an exit is key to a prompt and successful evacuation. The slime mould Physarum polycephalum is proven to be an efficient path solver: the living slime mould calculates optimal paths towards sources of attractants while maximizing its distance from repellents. The search strategy implemented by the slime mould is straightforward yet efficient. The slime mould develops many active travelling zones, or pseudopodia, which propagate along different, alternative routes; the pseudopodia close to the target loci become dominant, while the pseudopodia propagating along less optimal routes die off. We adopt the slime mould's strategy in a Cellular-Automaton (CA) model of crowd evacuation. CA are a massively parallel computation tool capable of mimicking Physarum's behaviour. The model accounts for the Physarum foraging process, the diffusion of food, the organism's growth, the creation of tubes for each organism, the selection of the optimum path for each human, and the movement of all humans at each time step towards the nearest exit. To test the efficiency and robustness of the proposed CA model, several simulation scenarios were examined, showing that the model sufficiently reproduces the Physarum behaviour that inspired it.
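To give a rough feel for cellular-automaton evacuation mechanics, the sketch below moves agents on a small grid one cell per time step towards the nearest exit, following a precomputed breadth-first distance field; the grid, the distance field and the movement rule are illustrative assumptions and stand in for, rather than reproduce, the paper's Physarum-based model.

```python
# Toy cellular-automaton evacuation step: agents 'A' move one cell per time
# step towards the nearest exit 'E', following a breadth-first distance field.
# Assumption-based sketch, not the Physarum-inspired model from the paper.
from collections import deque

FREE, WALL, EXIT = ".", "#", "E"
grid = [list(row) for row in ["########",
                              "#..A...E",
                              "#.##.###",
                              "#A.....#",
                              "########"]]
H, W = len(grid), len(grid[0])

# Breadth-first search from every exit gives each free cell its distance to the
# nearest exit (a simple stand-in for the model's attractant field).
dist = [[None] * W for _ in range(H)]
queue = deque((r, c, 0) for r in range(H) for c in range(W) if grid[r][c] == EXIT)
while queue:
    r, c, d = queue.popleft()
    if not (0 <= r < H and 0 <= c < W) or grid[r][c] == WALL or dist[r][c] is not None:
        continue
    dist[r][c] = d
    queue.extend((r + dr, c + dc, d + 1) for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)])

def step(grid):
    """Move every agent to the free neighbouring cell closest to an exit."""
    agents = [(r, c) for r in range(H) for c in range(W) if grid[r][c] == "A"]
    for r, c in agents:
        options = [(r + dr, c + dc) for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]]
        options = [(nr, nc) for nr, nc in options
                   if 0 <= nr < H and 0 <= nc < W
                   and grid[nr][nc] in (FREE, EXIT) and dist[nr][nc] is not None]
        if options:
            nr, nc = min(options, key=lambda p: dist[p[0]][p[1]])
            grid[r][c] = FREE
            if grid[nr][nc] != EXIT:          # agents reaching the exit leave the grid
                grid[nr][nc] = "A"

for _ in range(10):
    step(grid)
print("\n".join("".join(row) for row in grid))  # both agents have evacuated
```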
arXiv (Cornell University), May 4, 2017
Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. In the same way that state-of-the-art object detectors rely on anchor boxes, our ACT-detector is based on anchor cuboids. We build upon the SSD framework [18]. Convolutional features are extracted for each frame, while scores and regressions are based on the temporal stacking of these features, thus exploiting information from a sequence. Our experimental results show that leveraging sequences of frames significantly improves detection performance over using individual frames. The gain of our tubelet detector can be explained by both more accurate scores and more precise localization. Our ACT-detector outperforms the state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB [12] and UCF-101 [30] datasets, in particular at high overlap thresholds.
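As a hedged, simplified illustration of the core idea (extracting per-frame convolutional features and stacking them over time so that, for each spatial location and anchor, a score and per-frame box regressions are predicted jointly for the whole sequence), the PyTorch sketch below uses made-up layer sizes and a single feature scale; it is not the ACT-detector implementation.

```python
# Simplified sketch of tubelet scoring: per-frame conv features are stacked
# along the channel axis, and a shared head predicts, per location and anchor,
# a class score plus K per-frame box regressions (one tubelet per anchor).
# The backbone, K, anchor count and class count are illustrative assumptions.
import torch
import torch.nn as nn

class TinyTubeletHead(nn.Module):
    def __init__(self, K=6, num_anchors=4, num_classes=24, feat_dim=64):
        super().__init__()
        self.K = K
        self.backbone = nn.Sequential(            # shared per-frame feature extractor
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        stacked = K * feat_dim                    # temporal stacking of frame features
        self.cls_head = nn.Conv2d(stacked, num_anchors * (num_classes + 1), 3, padding=1)
        self.reg_head = nn.Conv2d(stacked, num_anchors * 4 * K, 3, padding=1)

    def forward(self, frames):                    # frames: (B, K, 3, H, W)
        B, K, C, H, W = frames.shape
        feats = self.backbone(frames.flatten(0, 1))           # (B*K, D, h, w)
        feats = feats.view(B, K * feats.shape[1], *feats.shape[2:])
        return self.cls_head(feats), self.reg_head(feats)     # scores, per-frame boxes

model = TinyTubeletHead()
scores, boxes = model(torch.randn(2, 6, 3, 64, 64))
print(scores.shape, boxes.shape)  # (2, 100, 16, 16) and (2, 96, 16, 16)
```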