Rajat Koner | Ludwig-Maximilians-Universität München

Papers by Rajat Koner

Is it all a cluster game? -- Exploring Out-of-Distribution Detection based on Clustering in the Embedding Space

It is essential for safety-critical applications of deep neural networks to determine when new inputs are significantly different from the training distribution. In this paper, we explore this out-of-distribution (OOD) detection problem for image classification using clusters of semantically similar embeddings of the training data, and we exploit the differences in the distance relationships to these clusters between in-distribution and out-of-distribution data. We study the structure and separation of clusters in the embedding space and find that supervised contrastive learning leads to well-separated clusters while its self-supervised counterpart fails to do so. In our extensive analysis of different training methods, clustering strategies, distance metrics, and thresholding approaches, we observe that there is no clear winner. The optimal approach depends on the model architecture and the selected in-distribution and out-of-distribution datasets. While we could reproduce the outstanding results for contrastive training on CIFAR-10 as in-distribution data, we find that standard cross-entropy paired with cosine similarity outperforms all contrastive training methods when training on CIFAR-100 instead. Cross-entropy provides competitive results compared to expensive contrastive training methods.
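
A minimal sketch of the cluster-based scoring idea described in this abstract, assuming embeddings have already been extracted by a trained encoder; the shapes, random data, and threshold are illustrative stand-ins, not the paper's exact procedure:

    import numpy as np

    def class_centroids(embeddings, labels):
        """One cluster per class: the mean embedding of its training samples."""
        return np.stack([embeddings[labels == c].mean(axis=0)
                         for c in np.unique(labels)])

    def ood_score(x, centroids):
        """Higher = farther from every cluster = more likely out-of-distribution.
        Uses cosine distance to the nearest class centroid."""
        x = x / np.linalg.norm(x)
        c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        return 1.0 - np.max(c @ x)

    # Illustrative usage with random data standing in for real embeddings.
    train_emb = np.random.randn(1000, 128)
    train_lbl = np.random.randint(0, 10, size=1000)
    centroids = class_centroids(train_emb, train_lbl)
    is_ood = ood_score(np.random.randn(128), centroids) > 0.5  # threshold tuned on validation data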

Box Supervised Video Segmentation Proposal Network

ArXiv, 2022

Video Object Segmentation (VOS) has been targeted by various fully-supervised and self-supervised approaches. While fully-supervised methods demonstrate excellent results, self-supervised ones, which do not use pixel-level ground truth, attract much attention. However, self-supervised approaches still show a significant performance gap. Box-level annotations provide a balanced compromise between labeling effort and result quality for image segmentation but have not been exploited for the video domain. In this work, we propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties. Our method incorporates object motion in the following way: first, motion is computed using a bidirectional temporal difference and a novel bounding box-guided motion compensation. Second, we introduce a novel motion-aware affinity loss that encourages the network to predict positive pixel pairs if they share similar motion and color. The proposed method o...
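
A rough sketch of the motion-aware affinity idea (pixel pairs sharing similar motion and color are pushed toward the same foreground prediction); the pair sampling, thresholds, and exact loss form are assumptions for illustration, not the paper's implementation:

    import torch

    def affinity_loss(pred, color, motion, idx_a, idx_b,
                      color_thr=0.1, motion_thr=0.1):
        """pred:   (N,) foreground probabilities for N sampled pixels
        color:  (N, 3) per-pixel color features
        motion: (N, 2) per-pixel motion vectors (e.g. bidirectional temporal difference)
        idx_a, idx_b: indices defining sampled pixel pairs."""
        similar = ((color[idx_a] - color[idx_b]).norm(dim=1) < color_thr) & \
                  ((motion[idx_a] - motion[idx_b]).norm(dim=1) < motion_thr)
        # For similar pairs, penalize disagreement between the two predictions.
        disagreement = (pred[idx_a] - pred[idx_b]).abs()
        return (disagreement * similar.float()).mean()

    # Toy usage: 6 sampled pixels, 3 sampled pairs.
    pred, color, motion = torch.rand(6), torch.rand(6, 3), torch.rand(6, 2)
    loss = affinity_loss(pred, color, motion, torch.tensor([0, 1, 2]), torch.tensor([3, 4, 5]))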

Relationformer: A Unified Framework for Image-to-Graph Generation

ArXiv, 2022

A comprehensive representation of an image requires understanding objects and their mutual relationships, especially in image-to-graph generation, e.g., road network extraction, blood-vessel network extraction, or scene graph generation. Traditionally, image-to-graph generation is addressed with a two-stage approach consisting of object detection followed by separate relation prediction, which prevents simultaneous object-relation interaction. This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations. We leverage direct set-based object prediction and incorporate the interaction among the objects to learn an object-relation representation jointly. In addition to the existing [obj]-tokens, we propose a novel learnable token, the [rln]-token. Together with the [obj]-tokens, the [rln]-token exploits local and global semantic reasoning in an image through a series of mutual associations. In combination with the pai...
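
A simplified sketch of how a learnable [rln]-token can be processed alongside [obj]-tokens in a transformer decoder and combined pairwise for relation prediction; the dimensions, layer counts, and relation head below are illustrative assumptions, not the full Relationformer architecture:

    import torch
    import torch.nn as nn

    class TinyRelationHead(nn.Module):
        def __init__(self, d=256, num_obj=100, num_rel=51):
            super().__init__()
            self.obj_queries = nn.Parameter(torch.randn(num_obj, d))   # [obj]-tokens
            self.rln_token = nn.Parameter(torch.randn(1, d))            # learnable [rln]-token
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=3)
            self.rel_cls = nn.Linear(3 * d, num_rel)  # (subject, object, [rln]) -> relation

        def forward(self, img_feats):                  # img_feats: (B, HW, d) encoder features
            queries = torch.cat([self.obj_queries, self.rln_token], dim=0)
            queries = queries.unsqueeze(0).expand(img_feats.size(0), -1, -1)
            out = self.decoder(queries, img_feats)     # joint object-relation reasoning
            obj, rln = out[:, :-1], out[:, -1]         # split [obj]-tokens and [rln]-token
            # Score one illustrative (subject=0, object=1) pair per image.
            pair = torch.cat([obj[:, 0], obj[:, 1], rln], dim=-1)
            return self.rel_cls(pair)

    logits = TinyRelationHead()(torch.randn(2, 196, 256))   # (2, 51) relation logits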

Relation Transformer Network

The extraction of a scene graph with objects as nodes and mutual relationships as edges is the basis for a deep understanding of image content. Despite recent advances, such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the mutual interaction among the visual objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model the node-to-node interaction with the self-attention of the transformer encoder and the edge-to-node interaction with the cross-attention of the transformer decoder. Further, we introduce a novel positional embedding suited to handling edges in the decoder. Finally, our relation prediction module classifies the directed relation from the learned node and edge embeddings. We name this a...
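
A compact sketch of the encoder-decoder pattern described here: self-attention over node (object) features in the encoder and cross-attention from edge queries to nodes in the decoder, followed by a relation classifier; the feature sources, sizes, and classifier are simplifying assumptions:

    import torch
    import torch.nn as nn

    d, num_rel = 256, 51
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
    rel_head = nn.Linear(3 * d, num_rel)       # subject node + object node + edge embedding

    nodes = torch.randn(1, 12, d)              # 12 detected objects (stand-in features)
    edges = torch.randn(1, 12 * 11, d)         # one query per ordered object pair

    node_emb = encoder(nodes)                  # node-to-node interaction (self-attention)
    edge_emb = decoder(edges, node_emb)        # edge-to-node interaction (cross-attention)

    # Classify the directed relation of one illustrative pair (subject 0, object 1, edge 0).
    pair = torch.cat([node_emb[:, 0], node_emb[:, 1], edge_emb[:, 0]], dim=-1)
    logits = rel_head(pair)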

Improving Visual Relation Detection using Depth Maps

State-of-the-art visual relation detection methods mostly rely on object information extracted from RGB images, such as predicted class probabilities, 2D bounding boxes, and feature maps. Depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations, such as standing behind, but also non-spatial relations, such as holding. In this work, we study the effect of using different object information with a focus on depth maps. To enable this study, we release a new synthetic dataset of depth maps, VG-Depth, as an extension to Visual Genome (VG). We also note that, given the highly imbalanced distribution of relations in VG, typical evaluation metrics for visual relation detection cannot reveal improvements for under-represented relations. To address this problem, we propose an additional metric, which we call Macro Recall@K, and demonstrate its usefulness on VG. Finally, our experiments confirm that by effective ...
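
The Macro Recall@K idea can be illustrated in a few lines: instead of pooling all relation instances (which lets frequent predicates dominate), recall is computed per predicate class and then averaged without weighting. The data layout below is an assumption for illustration:

    import numpy as np

    def macro_recall_at_k(hits, labels):
        """hits:   True if the ground-truth triplet appears in the model's top-K predictions
        labels: predicate class of each ground-truth triplet
        Macro Recall@K = unweighted mean of the per-predicate recalls."""
        return float(np.mean([hits[labels == c].mean() for c in np.unique(labels)]))

    # Toy example: a frequent predicate with perfect recall, a rare one that is always missed.
    hits   = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)
    labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    print(hits.mean())                       # conventional (micro) recall: 0.8
    print(macro_recall_at_k(hits, labels))   # macro recall: 0.5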

OODformer: Out-Of-Distribution Detection Transformer

ArXiv, 2021

A serious problem in image classification is that a trained model might perform well for input data that originates from the same distribution as the data available for model training, but perform much worse for out-of-distribution (OOD) samples. In real-world safety-critical applications, in particular, it is important to be aware if a new data point is OOD. To date, OOD detection is typically addressed using either confidence scores, autoencoder-based reconstruction, or contrastive learning. However, the global image context has not yet been explored to discriminate the non-local objectness between in-distribution and OOD samples. This paper proposes a first-of-its-kind OOD detection architecture named OODformer that leverages the contextualization capabilities of the transformer. Incorporating the transformer as the principal feature extractor allows us to exploit the object concepts and their discriminative attributes along with their co-occurrence via visual attention. Using the c...
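
A minimal sketch of the kind of transformer feature extractor the abstract refers to: patch embedding, a prepended class token, self-attention layers, and the class-token embedding returned for downstream OOD scoring. The layer counts and sizes are illustrative, not OODformer's configuration:

    import torch
    import torch.nn as nn

    class TinyViT(nn.Module):
        def __init__(self, img=32, patch=4, d=192, depth=4, heads=3):
            super().__init__()
            n = (img // patch) ** 2
            self.patch_embed = nn.Conv2d(3, d, kernel_size=patch, stride=patch)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, d))
            self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, d))
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True), depth)

        def forward(self, x):                                      # x: (B, 3, 32, 32)
            p = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, n, d) patch tokens
            cls = self.cls_token.expand(x.size(0), -1, -1)
            tokens = torch.cat([cls, p], dim=1) + self.pos_embed
            return self.encoder(tokens)[:, 0]                      # class-token embedding

    emb = TinyViT()(torch.randn(8, 3, 32, 32))                     # (8, 192) embeddings for OOD scoring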

Relation Transformer Network

ArXiv, 2020

The identification of objects in an image, together with their mutual relationships, can lead to a deep understanding of image content. Despite recent advances in deep learning, the detection and labeling of visual object relationships in particular remains a challenging task. In this work, we present the Relation Transformer Network, a customized transformer-based architecture that models complex object-to-object and edge-to-object interactions while taking global context into account. Our hierarchical multi-head attention-based approach efficiently models and predicts dependencies between objects and their contextual relationships. In comparison to other state-of-the-art approaches, we achieve an absolute mean improvement of 3.72% in performance on the Visual Genome dataset.

Improving Visual Relation Detection using Depth Maps

2020 25th International Conference on Pattern Recognition (ICPR), 2021

Visual relation detection methods rely on object information extracted from RGB images, such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations, such as standing behind, but also non-spatial relations, such as holding. In this work, we study the effect of using different object features with a focus on depth maps. To enable this study, we release a new synthetic dataset of depth maps, VG-Depth, as an extension to Visual Genome (VG). We also note that, given the highly imbalanced distribution of relations in VG, typical evaluation metrics for visual relation detection cannot reveal improvements for under-represented relations. To address this problem, we propose an additional metric, which we call Macro Recall@K, and demonstrate its usefulness on VG. Finally, our experiments confirm that by effective utilization ...
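
As a toy illustration of fusing depth information with standard RGB object features for relation prediction (not the architecture used in the paper), one can encode per-object depth-map crops and concatenate them with the RGB features before classification; all names and sizes here are assumptions:

    import torch
    import torch.nn as nn

    class DepthFusionRelClassifier(nn.Module):
        def __init__(self, rgb_dim=512, depth_dim=64, num_rel=51):
            super().__init__()
            # Small encoder for a 1-channel depth crop of each object.
            self.depth_enc = nn.Sequential(
                nn.Conv2d(1, depth_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.rel_cls = nn.Linear(2 * (rgb_dim + depth_dim), num_rel)

        def forward(self, rgb_subj, rgb_obj, depth_subj, depth_obj):
            s = torch.cat([rgb_subj, self.depth_enc(depth_subj)], dim=-1)
            o = torch.cat([rgb_obj, self.depth_enc(depth_obj)], dim=-1)
            return self.rel_cls(torch.cat([s, o], dim=-1))

    # Toy usage: 4 subject/object pairs with 32x32 depth crops.
    m = DepthFusionRelClassifier()
    logits = m(torch.randn(4, 512), torch.randn(4, 512),
               torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32))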

Multi-Pedestrian Tracking with GM-PHD Filters in an embedded Heterogeneous Parallel Processing Platform with Sensor Fusion

Real-time pedestrian detection and tracking methods are among the state-of-the-art techniques in present-day driver assistance systems. However, pedestrian detection and tracking methods that exploit the parallel processing capabilities of heterogeneous high-performance computing devices such as FPGAs or GPUs together with sensor fusion (camera and Lidar), a technology that could potentially replace ECUs in a coming generation of cars, are a rare subject of interest. In this research, a pedestrian detection and tracking algorithm is developed and implemented that is especially designed to incorporate one or many, and even heterogeneous, hardware accelerators in a first phase. In a second phase it will incorporate Lidar and use data fusion to gain better accuracy and precision in real time. Pedestrian detection is done using the Histogram of Oriented Gradients (HOG) for human detection. A parallel implementation of HOG people detection gave very good real-time performance and robustness. For...
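
The HOG people-detection step mentioned above can be reproduced on a CPU in a few lines with OpenCV's built-in detector; this uses the standard OpenCV API rather than the accelerated FPGA/GPU implementation the work describes, and the video path is a placeholder:

    import cv2

    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    cap = cv2.VideoCapture("pedestrians.mp4")   # placeholder input video
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Sliding-window HOG features + pretrained linear SVM person detector.
        boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
        for (x, y, w, h) in boxes:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) == 27:                # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()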

Random Finite Set Based Bayesian Filtering with OpenCL in a Heterogeneous Platform

While most filtering approaches based on random finite sets have focused on improving performance, in this paper we argue that computation time is very important for enabling real-time applications such as pedestrian detection. Towards this goal, this paper investigates the use of OpenCL to accelerate the computation of random finite set-based Bayesian filtering in a heterogeneous system. In detail, we developed an efficient and fully functional pedestrian-tracking system implementation that can run under real-time constraints while offering decent tracking accuracy. An extensive evaluation analysis was carried out to ensure the fulfillment of sufficient accuracy requirements. This was followed by extensive profiling analysis to spot potential execution-performance bottlenecks, which were then targeted to arrive at an OpenCL-accelerated application. Video-throughput improvements from roughly 15 fps to 100 fps (6×) were observed on average while pro...
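
A minimal pyopencl sketch of the kind of data-parallel offloading involved, here a per-component weight update as it appears in random-finite-set filters; it illustrates OpenCL host/kernel structure only and is not the tracking system from the paper:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Kernel: multiply each component weight by its measurement likelihood.
    program = cl.Program(ctx, """
    __kernel void update_weights(__global const float *w,
                                 __global const float *likelihood,
                                 __global float *out) {
        int i = get_global_id(0);
        out[i] = w[i] * likelihood[i];
    }
    """).build()

    weights = np.random.rand(4096).astype(np.float32)
    likelihood = np.random.rand(4096).astype(np.float32)

    mf = cl.mem_flags
    w_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=weights)
    l_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=likelihood)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, weights.nbytes)

    program.update_weights(queue, weights.shape, None, w_buf, l_buf, out_buf)
    result = np.empty_like(weights)
    cl.enqueue_copy(queue, result, out_buf)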

Scene Graph Reasoning for Visual Question Answering

ArXiv, 2020

Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task that requires techniques from both computer vision and natural language processing. We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. A reinforcement learning agent then learns to autonomously navigate over the extracted scene graph to generate paths, which are then the basis for deriving answers. We conduct a first experimental study on the challenging GQA dataset with manually curated scene graphs, where our method almost reaches the level of human perfo...
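
To make the "navigate over the scene graph" step concrete, here is a toy example of answering a one-hop relational question by following a labeled edge of a small hand-built scene graph with networkx; the graph contents and the fixed lookup are illustrative, whereas the actual method trains a reinforcement learning agent to choose such steps:

    import networkx as nx

    # Tiny hand-built scene graph: nodes are objects, edges carry relation labels.
    G = nx.MultiDiGraph()
    G.add_node("person", attributes=["standing"])
    G.add_node("cup", attributes=["white"])
    G.add_edge("person", "cup", relation="holding")
    G.add_edge("cup", "table", relation="above")

    def follow_relation(graph, subject, relation):
        """One-hop lookup: follow the outgoing edge with the queried relation label."""
        for _, target, data in graph.out_edges(subject, data=True):
            if data["relation"] == relation:
                return target
        return None

    # "What is the person holding?" -> "cup"
    print(follow_relation(G, "person", "holding"))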

Scenes and Surroundings: Scene Graph Generation using Relation Transformer

The identification of objects in an image, together with their mutual relationships as a scene graph, can lead to a deep understanding of image content. Despite recent advances in deep learning, the detection and labeling of visual object relationships remain a challenging task. In this work, a novel local-context-aware relation transformer architecture is proposed that also exploits complex global object-to-object and object-to-edge interactions. Our hierarchical multi-head attention-based approach efficiently captures dependencies between objects and predicts contextual relationships. In comparison to state-of-the-art approaches, we achieve an overall mean improvement of 4.85% and set a new benchmark across all scene graph generation tasks on the Visual Genome dataset.

Graphhopper: Multi-hop Scene Graph Reasoning for Visual Question Answering

The Semantic Web – ISWC 2021

Visual Question Answering (VQA) is concerned with answering free-form questions about an image. Since it requires a deep semantic and linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task that requires multi-modal reasoning from both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method is based on performing context-driven, sequential reasoning over the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. Subsequently, a reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasonin...
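
As a toy illustration of the multi-hop navigation idea (a fixed enumeration, not the trained reinforcement learning agent), the following lists bounded-length relation paths starting from a question entity in a small scene graph; the graph, names, and hop bound are assumptions:

    import networkx as nx

    G = nx.MultiDiGraph()
    G.add_edge("person", "cup", relation="holding")
    G.add_edge("cup", "table", relation="above")
    G.add_edge("table", "floor", relation="on")

    def relation_paths(graph, start, max_hops=2):
        """Enumerate (relation, node) paths of up to max_hops starting at `start`."""
        frontier = [[(None, start)]]
        all_paths = list(frontier)
        for _ in range(max_hops):
            nxt = []
            for path in frontier:
                node = path[-1][1]
                for _, target, data in graph.out_edges(node, data=True):
                    nxt.append(path + [(data["relation"], target)])
            all_paths += nxt
            frontier = nxt
        return all_paths

    for p in relation_paths(G, "person"):
        print(" -> ".join(f"{r}:{n}" if r else n for r, n in p))
    # e.g. person -> holding:cup -> above:table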
