Context in object detection: a systematic literature review
1 Introduction
Object detection is a fundamental computer vision task that identifies and locates objects within images or videos. It serves as a foundation for other computer vision tasks, such as object tracking, scene understanding, and image captioning. Object detectors fall into two distinct categories: traditional object detectors and deep learning object detectors, which have emerged since 2012. Traditional methods are limited in robustness and speed when dealing with large datasets. The introduction of Convolutional Neural Networks (CNNs) by AlexNet (Krizhevsky et al. 2012) in 2012 sparked a profound revolution in the field of object detection. The timeline of some of the most important object detectors is depicted in Fig. 1, illustrating the historical development of these methods over time.
Fig. 1
Milestones of object detection: SIFT (Lowe 1999), Cascades (Viola and Jones 2001), BoW (Sivic and Zisserman 2003), HOG (Dalal and Triggs 2005), SURF (Bay et al. 2006), DPM (Felzenszwalb et al. 2008), AlexNet (Krizhevsky et al. 2012), OverFeat (Sermanet et al. 2013), RCNN (Girshick et al. 2014), SPPNet (Kaiming et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), YOLOv1 (Redmon et al. 2016), SSD (Liu et al. 2016), YOLOv2 (Redmon and Farhadi 2017), YOLOv3 (Redmon and Farhadi 2018), Mask R-CNN (He et al. 2017), FPN (Lin, Dollár, et al. 2017), RetinaNet (Lin, Goyal, et al. 2017), SqueezeDet (Wu et al. 2017), Transformer (Vaswani et al. 2017), RefineDet (Zhang et al. 2018), CornerNet (Law and Deng 2018), CenterNet (Duan et al. 2019), FCOS (Tian et al. 2019), EfficientDet (Tan et al. 2020), DETR (Carion et al. 2020), ViT (Dosovitskiy et al. 2020), YOLOv4 (Bochkovskiy et al. 2020), YOLOv5 (Jocher 2020), YOLOR (Wang et al. 2021), G-RCNN (Wang and Hu 2021), Swin Transformer (Liu et al. 2021), Deformable DETR (Zhu et al. 2020), YOLOv6 (Li et al. 2022), YOLOv7 (Wang et al. 2023), TSST (Lee 2022), Sparse DETR (Roh et al. 2021), YOLOv8 (Jocher et al. 2023), YOLO NAS (Aharon et al. 2021), YOLOv9 (Wang and Liao 2024), YOLOv10 (Ao Wang 2024), YOLOv11 (Jocher and Qiu 2024)
Despite significant progress in object detection, finding all objects in visual scenes remains a challenging topic for object detectors due to a multitude of factors. Some of these factors are as follows:
- Inter-class similarity and intra-class variations: Inter-class similarity is high when objects from two different classes look extremely alike; intra-class variation is high when the appearance of instances of a single class, such as school buildings, varies drastically across images (Venkataramanan et al. 2021). Both can impair a network’s ability to comprehend scenes accurately.
- Adverse environmental or imaging conditions: Variations in images caused by occlusion, blur, weather and lighting conditions, deformation, and changes in object orientation, as well as small object sizes, are additional obstacles to detecting objects.
- Objects out of context: Real-world objects tend to appear in spatial arrangements that facilitate their identification and localization. Detection becomes more difficult when objects are presented outside their usual context.
Several of these challenges are depicted in Fig. 2.
Fig. 2
Object detection challenges, a, b inter-class similarity, intra-class variation, c–e imaging condition, f object out of context
To address the aforementioned challenges, using context is one of the effective approaches that significantly enhances the accuracy and robustness of object detectors. Object detection is more accurate when contextual information is taken into account (Shrivastava and Gupta 2016; Gong et al. 2019). Context refers to any information that can be used for accurate semantic understanding of a scene and recognition of its elements (Zolghadr and Furht 2016). Contextual information encompasses a wide range of cues, including environmental information, lighting conditions, the positions and orientations of objects, relationships between objects, time, location, and other visual or non-visual information that can provide additional evidence for object detectors. Non-context-based approaches perform object detection without considering contextual information in scenes. They mostly rely on low-level visual features such as color, texture, and shape to detect and locate objects. However, such appearance features often do not provide sufficient information to identify objects when environmental conditions change. In complex and challenging scenes, context-based approaches can produce more accurate results under adverse conditions. Context enables object detectors to comprehend their visual inputs by supplying information about their environments. A wide range of domains, including autonomous vehicles (Ibañez-Guzman et al. 2012), surveillance systems (Nazir et al. 2022), robotics (Dimitropoulos and Hatzilygeroudis 2022), and medical image processing (Girum et al. 2021), benefit from context through better performance in detecting objects.
Based on Fig. 29, given the increasing number of publications exploring context in object detection, it is important to determine how to make use of context to enhance performance and address the mentioned challenges. We argue that a comprehensive analysis of context in various types of object detection is desirable and can bring significant insights to the research community. To this end, we have conducted a comprehensive and systematic investigation of recent context-based object detection methods, including a detailed analysis and comparison. Compared to the existing surveys (Galleguillos and Belongie 2010; Marques et al. 2011; Liu et al. 2020; Gong et al. 2019), our literature review is systematic and covers the recent developments in different categories of context-based object detection, including general object detection (GOD), video object detection (VOD), small object detection (SOD), zero-shot object detection (ZSD), one-shot object detection (OSOD), few-shot object detection (FSOD), and camouflaged object detection (COD). Galleguillos and Belongie (2010) focused on the role of context in object categorization. Marques et al. (2011) discussed the importance and different types of context. In one section of another survey, Liu et al. (2020) mentioned several papers that employed local and global context to detect objects (without specifying the type of context utilized). Recently, Wang and Zhu (2023) published a survey about applications of context in computer vision. Unlike these previous surveys, which narrowly focus on certain aspects of context, such as local or global considerations, or limit their examination to general object detection, our analysis encompasses context from various perspectives across seven categories of object detection.
Moreover, one of the goals of this study is to explore the use of context for recognizing objects in unfavorable scenarios, such as small object detection, zero-shot object detection, one-shot object detection, and few-shot object detection, where the role of context has not yet been explored and no literature review has been published on these topics.
After thoroughly reviewing the available literature and identifying the gaps therein, we address the following research questions in this literature review.
- RQ1. Which context types have been predominantly used in different categories of object detection?
- RQ2. What approaches are applicable for integrating context in object detection?
- RQ3. Why are certain backbone networks and architectures most commonly used in recent context-based object detectors?
- RQ4. What are the best performing context-based methods on the most widely used datasets, including COCO and PASCAL VOC? What about for one-stage and two-stage object detectors?
- RQ5. To what extent can context improve object detection in scenarios where the number of training samples is very limited, such as in few-shot object detection, or when objects are indistinguishable from the background, as in camouflaged object detection?
To answer these research questions and provide a comprehensive view of context, we reviewed more than 240 papers and conducted a thorough literature review. The main contributions of this work can be summarized as follows:
- A comprehensive review of context from different perspectives, including context in human vision and computer vision, pairwise and higher-order relations, context levels (prior knowledge, global, and local), contextual interactions in global and local considerations, and different types of context, such as spatial, scale, temporal, spectral, and thermal.
- An analysis of the recent context-based object detection approaches in seven categories, including general object detection, video object detection, small object detection, zero-shot object detection, one-shot object detection, few-shot object detection, and camouflaged object detection. All approaches are investigated based on context type, context level, backbone and architecture, mechanism and module, dataset, and evaluation metrics such as mean average precision (mAP).
- Highlighting the problems in object detection that can be addressed through the utilization of context.
- Identifying research gaps for future studies.
The structure of this literature review is organized as follows. In Sect. 2, the general concept and various definitions of context are investigated. The importance of context in computer vision and human vision is described in Sect. 2.1. Section 2.2 covers different levels of context, as well as contextual interactions. Higher-order and pairwise relations are subsequently examined in Sect. 2.3. Section 2.4 gives a comprehensive analysis of different types of contextual information. Section 3 covers the research method, including bibliographical databases (Sect. 3.1), selection of studies (Sect. 3.2), inclusion and exclusion criteria (Sect. 3.3), data extraction and validity control (Sect. 3.4), classification of the papers (Sect. 3.5), context in categories of object detection (Sect. 3.6), and data extraction and synthesis (Sect. 3.7). Section 4 reviews the employed datasets (Sect. 4.1) and the papers in seven categories, including general object detection (Sect. 4.2), small object detection (Sect. 4.3), video object detection (Sect. 4.4), zero-shot, one-shot, and few-shot object detection (Sect. 4.5), and camouflaged object detection (Sect. 4.6), and discusses key points as well as the best models. Finally, Sect. 5 provides a conclusion by addressing the research questions and identifying research gaps.
2 Context
Different definitions of context have been proposed in different papers. Brown et al. (1997) describe context as the locations and identities of objects in different scenes, the time of day, season, temperature, etc. Galleguillos and Belongie (2010) define context in computer vision as available information beyond the local visual features of an image or video, which can help in resolving ambiguities and enhancing the accuracy of recognition and detection tasks. Felzenszwalb and Huttenlocher (2005) define context as a set of objects, scenes, and events that are relevant to a specific task and can be used to guide visual attention, reasoning, and decision-making. In general, context or contextual information can be any visual or non-visual information, including appearance information, such as color and texture; location information, such as a kitchen or a library; time-related details, such as the precise hour of the day or the month; semantic information, including the relationships between objects or between different regions of a visual scene; and any other information that aids in better understanding the environment (Wang and Zhu 2023). Based on Fig. 6, context in computer vision refers to the information inside and surrounding an object or region of interest, providing essential cues for understanding its identity, location, and function. To gain a deeper comprehension of context, we have thoroughly investigated and categorized it from different aspects, as shown in Fig. 3. This categorization is not merely our opinion but is derived from an extensive analysis of existing research.
Fig. 3
Context from different aspects
2.1 The role of context in human vision and computer vision
The notion of context in visual situations has been explored for many years, but it is unclear who originally suggested it. However, some of the earliest research on visual perception can be traced back to Gestalt psychology in the early 20th century, which theorized that the human brain groups environmental information into meaningful structures and patterns based upon context, following principles such as proximity, similarity, closure, continuation, and figure-ground (Wertheimer 2017). According to Gestalt psychology, humans, animals, and other organisms perceive complete patterns or combinations, not just isolated objects, regions, or components. This striking ability helps humans comprehend visual scenes quickly and intuitively, and it characterizes human vision. The human visual system is extraordinarily adept at detecting, categorizing, and naming objects embedded in natural scenes (Munneke et al. 2013). Complex visual information, such as objects, layouts, scenes, and actions, is seamlessly incorporated by our brains into coherent perceptions of the world. Despite constant changes in environments, obstacles, blur, and variations in weather and light, humans are still able to understand their surroundings by using contextual information. Through context-based understanding, humans can recognize objects faster and more accurately, as well as make better decisions, even when inputs are incomplete or ambiguous. For instance, if humans see a partially obscured object in a scene, surrounding context, such as other objects, the background, and the relationships between objects, can help their brains infer what the object might be. As Fig. 4 shows, identifying a keyboard in isolation is challenging, especially when the quality of the image is low. In an office with a monitor and a mouse on the desk, it is probable that the item positioned in front of the monitor is a keyboard.
Fig. 4
Differences between an object in isolation and the same object in context
There has been a noticeable increase in evidence suggesting that object recognition does not happen in isolation (Oliva and Torralba 2007; Marques et al. 2011). It is influenced by the presence of other objects as well as by the overall context of the scene. Real-world objects co-occur with other objects in particular environments, which allows visual systems to extract rich contextual clues (Perko and Leonardis 2010). Computer vision tries to imitate human vision by using contextual information for a better understanding of environments. The idea of utilizing context to help computers recognize patterns, objects, and scenes goes back to the 1970s and 1980s. Marr and Nishihara (1978) investigated the use of contextual information to segment images, and Ballard and Brown (Dana and Christopher 1982) used context in object recognition. Later, Oliva and Torralba (2007) suggested that visual systems can exploit contextual associations between objects and environments to guide attention and the eyes to regions of interest in natural scenes. Overall, these studies suggest that incorporating contextual information can lead to more effective computer vision systems. For example, an object can play various roles in different scenes: a tree can exist in a jungle, serve as a symbol in a book, or be used as an artificial decoration in a room. Therefore, it is imperative for a machine to identify the relationships between objects and their co-occurrence, and to consider additional information to differentiate objects and accurately comprehend the entire scene. Furthermore, context can decrease the search space in computer vision tasks (Jain and Sinha 2010). When a machine tries to detect objects in an image or video, it has to look through many options, its “search space”, to figure out what it is seeing.
For each object, a machine needs to search over all possible locations and scales, and this process can consume considerable time and computing power. Given knowledge of the appearance and potential location of an object, the machine can reduce the search space, ignore candidates that are unlikely to be what it is looking for, and concentrate on the regions where the object is expected to be located. This makes the process faster and more accurate. Additionally, where local features, such as edges or corners, are not sufficient for detecting objects due to conditions such as occlusion or rotation, contextual information can improve the efficiency and accuracy of systems. As shown in Fig. 5, it is impossible to detect the wheel of a car based solely on local features. However, by considering other parts of the car and the overall scene, it becomes possible to identify the wheel. In general, by using contextual information, computer vision systems can better identify objects and reduce the chances of false positives or false negatives.
Fig. 5
Utilizing context to better comprehend an object’s components
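The search-space reduction described above can be illustrated with a small sketch. The candidate boxes, the context-derived `prior_region` (say, a road band where cars are expected), and the overlap threshold are all hypothetical; this is a toy illustration of the idea, not any particular detector's implementation.

```python
def prune_candidates(candidates, prior_region, min_overlap=0.1):
    """Keep only candidate boxes (x1, y1, x2, y2) that sufficiently overlap
    a region suggested by context, shrinking the search space before any
    expensive classification is run."""
    px1, py1, px2, py2 = prior_region
    kept = []
    for (x1, y1, x2, y2) in candidates:
        # intersection of the candidate with the prior region
        iw = max(0, min(x2, px2) - max(x1, px1))
        ih = max(0, min(y2, py2) - max(y1, py1))
        area = (x2 - x1) * (y2 - y1)
        if area > 0 and (iw * ih) / area >= min_overlap:
            kept.append((x1, y1, x2, y2))
    return kept

# Hypothetical scene: context says cars appear in a horizontal road band.
road_band = (0, 70, 640, 120)
candidates = [(0, 0, 10, 10),      # sky region: pruned
              (50, 80, 60, 95),    # inside the road band: kept
              (200, 5, 220, 30)]   # building region: pruned
print(prune_candidates(candidates, road_band))  # → [(50, 80, 60, 95)]
```

Only the window consistent with the contextual prior survives, so the classifier runs on one region instead of three.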
2.2 Context levels
Context can be classified into three main levels: prior knowledge, local context, and global context (Galleguillos and Belongie 2010).
Prior knowledge pertains to information that can be obtained before observing an environment. It provides insights into environmental variables, such as time, weather, location, and action, that can be utilized to forecast the occurrence of specific events or the visibility of particular objects. For example, if we know that the location of an image is a city street, the likelihood of encountering a lion is very low.
Local context refers to information about the object itself and the information available in the neighborhood of a pixel or an area. For example, in Fig. 6, for detecting the monitor, various elements can serve as local context, including nearby objects like the keyboard and cup, the overall appearance of the monitor, its different parts, and the surrounding pixels. The local context within bounding boxes around objects can aid in distinguishing objects that have significant visual and structural similarities. Gidaris and Komodakis (2015) and Zagoruyko et al. (2016) are examples of using local context in object detection.
Fig. 6
Global and local context
When objects are very small, relying on local context alone may not be efficient, as detecting an object by considering only the object itself and its surrounding pixels is challenging. In this situation, global context can provide more valuable information. Global context refers to information that is available over a wider area, such as the entire image or a larger region within it. For example, to detect a car within an image, global context can leverage information from the entire image, such as the presence of other cars, buildings, or roads, which provides additional cues for identifying the object. Moreover, global context can supply prior knowledge about the typical spatial arrangement of objects within a scene, which helps guide the interpretation of visual information and reduces the ambiguity and uncertainty of the data. The Gist descriptor (Oliva and Torralba 2001) is a global image representation that captures the essential global spatial information about the entire scene rather than local information. Figure 7 shows the importance of global context in detecting objects. It should be noted that using global context alone to detect objects may cause confusion, because objects can be detected differently in various environments (e.g., a dog in a park or a bedroom). Since each method has its own advantages and disadvantages, combining local context and global context, as demonstrated in (Li et al. 2016), is a practical way to enhance object detection.
Fig. 7
Local context is sometimes insufficient for object recognition, which then requires interpretation of the full image. On the left side, the vehicle, the person, and the road are nearly undetectable individually; however, when viewed collectively, they create a logical visual narrative (Derek 2006). On the right side are examples of the importance of global context in object detection (objects, shapes, colors, and textures are similar). Capturing object characteristics may pose a challenge for feature extractors when merely considering local context
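One simple way to combine the two context levels, sketched below under the assumption of a CNN feature map of shape (channels, height, width), is to concatenate an ROI-pooled local descriptor with a globally pooled descriptor of the whole image. Real detectors, including (Li et al. 2016), use learned fusion modules; this is only a minimal illustration of the idea.

```python
import numpy as np

def global_descriptor(feature_map):
    """Average-pool the entire (C, H, W) feature map into a (C,) vector."""
    return feature_map.mean(axis=(1, 2))

def local_descriptor(feature_map, box):
    """Average-pool only the ROI (x1, y1, x2, y2), in feature-map coordinates."""
    x1, y1, x2, y2 = box
    return feature_map[:, y1:y2, x1:x2].mean(axis=(1, 2))

def context_descriptor(feature_map, box):
    """Concatenate local (object) and global (scene) context into one vector."""
    return np.concatenate([local_descriptor(feature_map, box),
                           global_descriptor(feature_map)])

fmap = np.random.rand(256, 32, 32)            # hypothetical backbone output
vec = context_descriptor(fmap, (4, 4, 12, 12))
assert vec.shape == (512,)                    # 256 local + 256 global channels
```

A classifier operating on `vec` sees both the object's appearance and a summary of the scene around it, so ambiguous local evidence can be disambiguated by the global half of the vector.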
2.2.1 Contextual interactions in local context
Three types of contextual interactions exist in the local context, as shown in Fig. 8: pixel, region, and object.
Fig. 8
Local contextual interactions. a Pixel interactions capture information such as grass pixels around the cow’s boundary. b Region interactions are represented by relations between the face and the upper region of the body. c Object relationships capture interactions between the person and the horse
- (1)
Pixel Interactions: Interactions at the pixel level are based on the assumption that nearby pixels have identical labels. Pixel-level analysis involves analyzing individual pixels in an image or video. A further application of the data obtained from pixel-level interactions is the delineation of object boundaries, which enables automated object segmentation and enhanced object localization. This is often done using techniques such as edge detection, corner detection, and color histograms. The advantage of pixel-level analysis is that it can provide very fine-grained information about an image, such as the location of edges or the distribution of colors. However, obtaining interactions at the pixel level is computationally complex and time-consuming, as the network is required to analyze many combinations of small windows from the image in order to arrive at a consistent result. - (2)
Region Interactions: Region interactions can be classified into two categories: interactions between different parts of one object, and interactions between image patches or segments. Interactions between the parts of one object can be used to obtain a complete picture of that object, often using techniques such as segmentation. For interactions between image patches, image partitioning methods are usually used to divide an image into several patches or segments (Carbonetto et al. 2004). The advantage of region- or part-level analysis is that it can provide more detailed information about an image than pixel-level analysis, while still being computationally feasible. - (3)
Object Interactions: Object interactions are the most intuitive type of contextual interaction for humans and have been widely analyzed in the cognitive sciences (Bar and Ullman 1996; Biederman et al. 1982). Object-level analysis involves identifying and analyzing individual objects within an image or video. In object interactions, since the number of regions that need to be processed by the network equals the number of objects, extracting information is computationally less complex and time-consuming than in pixel and region interactions.
2.2.2 Contextual interactions in global context
Incorporating global context requires top-down processing for detecting objects. In the global context, contextual interactions are analyzed between different objects and the environment. For example, in Fig. 9, the network searches for images with similar scene configurations, as objects typically maintain fixed spatial relationships to each other within a global context. In effect, the background information is used to guide the detection process. Object-scene interactions assess the plausibility of an object appearing in a certain environment to reduce errors.
Fig. 9
Contextual interactions in global context (Russell et al. 2007)
2.3 Pairwise and higher-order relations
Pairwise relations are an essential aspect of spatial context analysis, as they provide valuable information for understanding the spatial relationships between objects. Pairwise relations describe the spatial relationship between two objects or regions of interest (ROIs) within an image. These relationships can be measured using geometric properties such as distance, angle, orientation, overlap, or adjacency between pairs of objects in an image. For example, the distance between two objects can indicate their proximity and suggest that they belong to the same group or category. The orientation between two objects can provide information about their relative position and suggest that they belong to a specific arrangement or layout. Pairwise orders such as “above”, “below”, “inside”, “around”, “left”, “right”, “next to”, “behind”, or “overlapping” can help identify objects. For example, the building is “on the left side” of the car. The model presented by Gkioxari et al. (2018) underscores the significance of pairwise relations in a human-centric context by using interactions between humans and objects to enhance object detection. Their model leverages a person’s appearance and pose as cues to predict both the location and type of object they are likely interacting with. For instance, if a person is performing an action like “cutting”, the model infers that a nearby knife or cutting tool is likely involved, improving detection accuracy by analyzing the spatial relationship between the person and the object. This approach highlights how human-object interactions serve as contextual anchors for more precise detection in complex scenes. Higher-order relations refer to relationships among several objects or ROIs within an image, as shown in Fig. 10. Despite being more complex than pairwise relations, they can provide a more comprehensive understanding of spatial relationships.
Pairwise relations and higher-order relations can be modeled using different approaches such as graph-based approaches (Georgousis et al. 2021), tensor-based approaches (Panagakis et al. 2021), and Markov random fields (MRFs) (Blake et al. 2011).
Fig. 10
Pairwise and high-order relations
2.4 Context types
Depending on the application, different types of context are employed in computer vision. Biederman et al. (1982) suggested five categories of relations between objects and their surroundings: probability (objects tend to be found in some scenes but not others), position (given an object is probable in a scene, it is often found in some positions but not others), interposition (objects interrupt their background), support (objects tend to rest on surfaces), and familiar size (objects have a limited set of size relations with other objects). Galleguillos and Belongie (Belongie 2007) divided contextual information into three main categories: probability (semantic), position (spatial), and size (scale).
Fig. 11
The ontology of context types
Furthermore, the types and significance of context have been discussed in the literature (Oliva and Torralba 2007; Marques et al. 2011), and recently Wang et al. categorized context into spatial, temporal, and other types (Wang and Zhu 2023). Despite these efforts, there is still no comprehensive treatment of context types in computer vision. In this section, we provide a comprehensive view of the different types of context that can be used in object detection, illustrated in Fig. 11. The five types of contextual information most used in computer vision are semantic (probability), spatial (position), temporal (time), scale (size), and spectral. Context types that have received less attention in the literature are grouped under the ’Other’ classification. Each will be elaborated upon in the subsequent sections.
2.4.1 Semantic context (probability)
As shown in Fig. 12, objects are more likely to be seen in some scenes than in others based on semantic context. In fact, the existence of one object increases the probability of another object appearing in the same scene. For example, a computer vision system is able to identify a car in an image, but by comprehending the semantic context, it may also infer that the car is most likely on a road, thereby recognizing the scene as a street. As another example, a sofa in the middle of a jungle makes little sense, but it is highly likely to appear in a living room. A multitude of studies, including (Rabinovich et al. 2007; Katti et al. 2019; Leroy et al. 2020; Ladicky et al. 2010), and (Mensink et al. 2014), have examined and implemented semantic relationships.
Fig. 12
Inconsistent and consistent semantic relationships between objects (Katti et al. 2019)
2.4.2 Spatial context (position)
Spatial context refers to the physical relationships and arrangements of objects, such as their relative position, orientation, and distance within a specific scene. As a result of the spatial context of a scene, an object is more likely to be found in certain positions than others. In fact, it tries to find reasonable physical properties and relationships between objects and how they interact with each other to understand the structure of the scene. For example, as shown in Fig. 13, a car cannot exist in the sky because it is supposed to be on a road, or if the sky appears above the sand and water, the likelihood of a beach image is strengthened (Singhal et al. 2003).
Fig. 13
Spatial context. Left: car on the road (correct spatial context). Right: car is flying in the sky (incorrect spatial context)
As depicted in Fig. 14, spatial context can be divided into three main categories: co-occurrence, 2D relations, and 3D relations.
Fig. 14
The classification of spatial context
The term ’co-occurrence’ is used in both semantic context and spatial context. In the spatial context, co-occurrence means focusing on the physical distribution of objects rather than their semantic meaning. Relationships between objects in a visual scene can be introduced simply by co-occurrence (Galleguillos et al. 2008; Rabinovich et al. 2007; Zheng et al. 2011). Spatial context implicitly encodes the physical co-occurrence of objects in an environment (Belongie 2007). In Fig. 4, distinguishing the keyboard without the monitor is difficult because the image is so blurred, but since these two objects are usually seen together, the presence of one object increases the probability of the other. Co-occurrence analysis identifies patterns or relationships between objects that often occur together in one image. These patterns can be quantified using statistical methods such as co-occurrence matrices, which count the number of times that objects occur together in an image. For example, knives, plates, and a refrigerator are frequently seen in the kitchen, whereas a bed, closet, and mirror are seen in the bedroom. Co-occurrence matrices can be used as pairwise features in machine learning models like CRFs (Conditional Random Fields) or HMMs (Hidden Markov Models) to capture relationships between adjacent observations. For example, Rabinovich et al. (2007) introduced interaction potentials for CRFs to capture contextual information between different objects. The use of co-occurrence patterns for object detection has been explored by several researchers over the years (Galleguillos and Belongie 2010; Galleguillos et al. 2008; Mensink et al. 2014; Shrivastava and Gupta 2016; Rabinovich et al. 2007).
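The co-occurrence counting described above can be sketched in a few lines; the kitchen/bedroom label sets below are toy data for illustration only.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(image_labels):
    """Count how often each unordered pair of object classes appears
    together in the same image.

    `image_labels` is a list of label sets, one per image.
    """
    counts = Counter()
    for labels in image_labels:
        # Sort so each pair is counted under one canonical key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts

# Toy dataset: two kitchen scenes and one bedroom scene.
images = [
    {"knife", "plate", "refrigerator"},
    {"knife", "plate"},
    {"bed", "closet", "mirror"},
]
cooc = cooccurrence_matrix(images)
# ("knife", "plate") co-occur in two images, so a detector can raise
# the prior for a plate once a knife is confidently detected.
```

Counts like these, normalized into frequencies, are exactly the kind of statistic used as pairwise features in CRF or HMM potentials.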
The 2D spatial relations are direction relations, distance relations, and topological relations (Marques et al. 2011). In direction relations, objects are oriented in relation to one another. Cardinal directions (N, E, S, W), intercardinal directions (NE, NW, SE, SW), relative vertical positions like “above” and “below”, and horizontal positions like “left” and “right” are examples of direction relations. For example, a book is below a table. Distance relations are based on the Euclidean distance between two spatial features, as shown in Fig. 15. Terms like “absolute distance”, “close”, “far”, or “equidistant” are used for distance relations. For example, the car is close to the building. The topological relations describe an object’s relationship with its neighbors. In fact, topological relations describe concepts of adjacency, containment, and intersection between two spatial features (Bogorny et al. 2009). Touch, overlap, contain, inside, and encloses are examples of topological relations. For instance, the car touches the road. Some scholarly works such as (Heitz and Koller 2008; Zheng et al. 2011; Choi et al. 2011) have used this type of spatial context for object detection. 3D spatial relations are another type of spatial context (Chen et al. 2022). In 3D spatial relations, objects are analyzed in three-dimensional space, considering not only their positions but also their depth or distance from the observer. This involves understanding the spatial layout, relative positions, and orientations of objects in a 3D scene, for example, determining whether one object is in front of or behind another, or whether objects are located at different heights or depths within the scene. (Bao et al. 2011; Sun et al. 2010; Pan and Kanade 2013; Hoiem et al. 2005) are some papers that have employed 3D spatial relations in object detection.
Fig. 15
Distance, direction/order, and topological 2D spatial relations between objects
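The topological vocabulary above (touch, overlap, contain, inside, disjoint) can be sketched as a simple classifier over axis-aligned boxes; the category names and the strict-inequality convention are illustrative choices, not a standard from the surveyed literature.

```python
def topological_relation(a, b):
    """Classify the topological relation between two (x1, y1, x2, y2) boxes."""
    # Strict containment: b entirely within a means a "contains" b.
    if a[0] < b[0] and a[1] < b[1] and b[2] < a[2] and b[3] < a[3]:
        return "contains"
    if b[0] < a[0] and b[1] < a[1] and a[2] < b[2] and a[3] < b[3]:
        return "inside"
    # Signed extents of the intersection along each axis.
    ix = min(a[2], b[2]) - max(a[0], b[0])
    iy = min(a[3], b[3]) - max(a[1], b[1])
    if ix > 0 and iy > 0:
        return "overlap"
    if (ix == 0 and iy >= 0) or (iy == 0 and ix >= 0):
        return "touch"  # shared edge or corner, no interior overlap
    return "disjoint"

# A car whose wheels rest on the road: boxes share an edge but no interior.
car = (40, 10, 80, 50)
road = (0, 50, 200, 100)
# topological_relation(car, road) -> "touch"
```

The output of such a classifier over all object pairs gives exactly the kind of relation set combined in the TAS model discussed next.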
It is feasible to create combinations of all these spatial arrangements. For instance, in the things and stuff (TAS) model (Heitz and Koller 2008), Heitz et al. combined all categories of relations (including eight directional relations, two different distances, and a topological relation) in order to generate a large number of candidate relationships. Some methods combine spatial context and semantic context to enhance the accuracy of a network. Figure 16 depicts the integration of semantic context and spatial context inside a network.
Fig. 16
Integration of semantic and spatial context. Initially, the input is segmented, resulting in the labeling of each segment. Furthermore, semantic context is employed to rectify some labels by relying on object co-occurrence. Ultimately, spatial context is used to offer more clarification by considering the relative locations of objects (Rabinovich et al. 2007)
2.4.3 Scale context (size)
Scale context, originally defined by Biederman et al. (1982) as “familiar size,” refers to the relationship between objects based on their relative sizes within a scene. In computer vision, scale context helps detectors manage the variability of object sizes, enabling more accurate detection at different scales. By understanding the expected sizes of objects within a scene, a detector can reduce the need for exhaustive multi-scale searches. Objects within a scene often follow a limited set of size relations. For instance, a chair typically appears smaller than a person and not larger. Semantic context plays a role in both spatial and scale contexts. For example, in a scene depicting an office, the presence of a desk, keyboard, and monitor is dictated by semantic context (office setting). Semantic context implies spatial context, as the keyboard and monitor are expected to be placed close to each other, and it also implies scale context, as the desk is expected to be larger than both. Both spatial and scale contexts benefit from semantic context, as the semantic coherence of a scene determines the likelihood of certain objects and their relationships appearing together. These interconnected relationships enhance the overall understanding and detection of objects within a scene. However, scale relationships can also vary depending on the relative depth of objects within the scene. For example, as shown in Fig. 17, the apparent scale relationship between a tree and an elephant is influenced by their similar depth in the image. If the tree were further back, it would naturally appear smaller, highlighting that depth affects perceived scale. Therefore, scale context in object detection should consider not only object sizes but also their positions in depth, as this affects how they are perceived and how bounding boxes are generated.
This relationship between scale context and depth could be incorporated by using depth-aware methods, which adjust object size predictions based on their distance from the camera. Models like Feature Pyramid Networks (Lin et al. 2017) and multi-scale convolutional networks (Zhao et al. 2017) provide examples of techniques that capture scale context; however, further advancements could involve depth-sensitive context to enhance robustness in varied scenes.
Fig. 17
Scale context. Left: large chair compared to a person. Right: normal scaled elements
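The depth-aware adjustment discussed above can be sketched with a pinhole-camera model: an object of real height H at depth Z projects to roughly f·H/Z pixels. The object height, depth, and focal length below are illustrative assumptions, not measurements from the surveyed papers.

```python
def expected_pixel_height(real_height_m, depth_m, focal_px):
    """Pinhole-camera estimate of an object's on-image height in pixels.

    h_px = f * H / Z: the same object appears half as tall at twice the depth.
    """
    return focal_px * real_height_m / depth_m

def plausible_scale(box_height_px, real_height_m, depth_m, focal_px, tol=0.5):
    """Flag detections whose size conflicts with scale context at that depth.

    `tol` is an arbitrary relative tolerance chosen for illustration.
    """
    expected = expected_pixel_height(real_height_m, depth_m, focal_px)
    return abs(box_height_px - expected) <= tol * expected

# A 3 m elephant at 20 m depth with a 1000 px focal length should span
# about 150 px; a 40 px detection at that depth is suspicious.
h = expected_pixel_height(3.0, 20.0, 1000.0)   # 150.0 px
ok = plausible_scale(150, 3.0, 20.0, 1000.0)   # consistent with scale context
bad = plausible_scale(40, 3.0, 20.0, 1000.0)   # inconsistent, worth re-scoring
```

A detector with per-object depth estimates could use such a consistency check to down-weight implausibly sized candidates instead of searching all scales exhaustively.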
2.4.4 Spectral context
Spectral context refers to the relationships and patterns that exist between different colors or spectral bands in an image. This type of context is often used in remote sensing image analysis (Shaw and Burke 2003) and food safety control (Feng and Sun 2012). In spectral image analysis, an image is often composed of multiple spectral bands, each representing a different wavelength or color. In remote sensing, as well as in other applications, there are two main band types: near-infrared (NIR) bands and color or visual (VIS) bands such as red, green, and blue. NIR and color bands are often used to capture information about the reflectance properties of the scene or Earth’s surface. The main differences between NIR and the color bands are their wavelength and the information they capture. NIR is not visible to the human eye because it has a longer wavelength than visible light, but it can provide valuable information about environments, while the color bands capture information about the visible features in the scene. Spectral context refers to the fact that the colors in an image are not independent, but are instead related to one another in some way. Certain colors might be more strongly related to each other than to other colors in the image. For instance, in a satellite image of vegetation, as shown in Fig. 18, the reflectance values in the near-infrared band may be highly correlated with the reflectance values in the red band, since vegetation reflects strongly in the near-infrared and absorbs strongly in the red. This relationship between the near-infrared and red bands is an example of spectral context. Generally, spectral context pertains to the utilization of distinct “colors” of light in order to gain further insights into distant observations. By considering the spectral context of an image, a machine can gain insights into the underlying properties of the scene, such as the presence of objects, vegetation, water, or other materials.
Fig. 18
Multiple spectral wavebands spread across a vast region. The graphs show the spectrum variation in the reflectance of soil, water, and vegetation. The spectral information can be used to create a visual representation of the scene at different wavelengths (Shaw and Burke 2003)
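The NIR/red relationship described above is commonly summarized by the Normalized Difference Vegetation Index, NDVI = (NIR − Red) / (NIR + Red), a standard remote-sensing statistic; the reflectance values below are illustrative, not from a real sensor.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index for one pixel.

    Healthy vegetation reflects strongly in NIR and absorbs red, so it
    scores close to +1, while water and bare soil score near or below 0.
    `eps` guards against division by zero on dark pixels.
    """
    return (nir - red) / (nir + red + eps)

# Illustrative reflectance values per band.
vegetation = ndvi(0.50, 0.08)   # high NDVI -> likely vegetation
water = ndvi(0.02, 0.05)        # negative NDVI -> likely water
```

Per-pixel indices like this give a detector a spectral prior, for example suppressing "vehicle" hypotheses over pixels that score strongly as water.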
2.4.5 Spatial-spectral context
Spatial-spectral context refers to the idea that the spectral properties of an object in an image are not only determined by its intrinsic reflectance properties but also by its spatial relationships to other objects in the scene. In remote sensing, spatial-spectral context can be used to improve the classification of objects in an image by considering the spatial relationships between objects as well as their spectral properties. A system that considers both spectral and spatial context can discern objects inside an image, as well as different properties like vegetation, roads, and buildings. Figure 19 is an example of a two-stream spectral-spatial feature aggregation approach titled S2ADet (He et al. 2023) that leverages complementary spatial and semantic information to learn better semantic features of objects. Moreover, spatial-spectral context is also used in image classification with impressive results. (Fauvel et al. 2012) and (Li et al. 2021) are examples of using spatial-spectral context in image classification.
Fig. 19
Combining spatial and spectral context for detecting objects (He et al. 2023)
2.4.6 Temporal context (time)
Time can be defined as a measure in which events can be ordered from the past through the present into the future, and also as the measure of the durations of events and the intervals between them (Lim et al. 2019). Using information from previous time frames to enhance understanding of the present frame is defined as the concept of “temporal context”. Temporal context is mostly used for dynamic data like videos. Schilder et al. (Oliva and Torralba 2007) proposed three types of temporal expression: explicit references (e.g., December 4), indexical references (e.g., yesterday), and vague references (e.g., about two years ago). Recently, (Wang and Zhu 2023) presented two forms of temporal context: short-term temporal context (relations between frames in a short video) and long-term temporal context (relations between frames in a long video). In this review, we follow the forms proposed by (Wang and Zhu 2023). Through the investigation of disparities and resemblances across neighboring frames, short-term temporal context can be employed to identify and monitor objects, as well as activities. When recognizing an object necessitates evaluating a large number of frames over a lengthy period of time, short-term temporal context cannot provide very valuable information; thus, long-term temporal context plays an important role (Beery et al. 2020). Long-term temporal context pertains to the interconnections that exist among frames within an extended video sequence. Long-term temporal context can be used to detect and track objects over longer periods of time, as well as recognize more complex actions and events, by examining the overall structure and patterns of motion in a video clip. In video object detection, temporal context is undeniably significant. Objects may frequently change appearance or motion between frames in a video sequence, making it challenging to detect and track them precisely using spatial information alone.
By integrating temporal context into object detectors, researchers can leverage the interdependencies across frames in a video sequence to enhance the precision and robustness of object detection. Figure 20 is one example of leveraging long-term temporal context for detecting a wildebeest in the wild, which requires analyzing many frames over a long period (one month).
Fig. 20
Using previous and next frames of one video for detecting a wildebeest in a blurry frame (Beery et al. 2020)
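As a crude stand-in for the short-term temporal aggregation described above, the sketch below exponentially smooths a detection's per-frame confidence so that one blurry frame does not drop the track; the smoothing factor and scores are illustrative, not from the surveyed methods.

```python
def smooth_scores(frame_scores, alpha=0.6):
    """Exponentially smooth per-frame detection confidences for one track.

    `alpha` weights the temporal history against the current frame; a
    detection that is weak in one blurry frame is propped up by confident
    detections in neighbouring frames.
    """
    smoothed, running = [], None
    for s in frame_scores:
        running = s if running is None else alpha * running + (1 - alpha) * s
        smoothed.append(running)
    return smoothed

# The animal is confident in frames 1-3 but nearly lost in frame 4 (blur);
# the smoothed score stays above a 0.5 detection threshold.
scores = smooth_scores([0.9, 0.85, 0.9, 0.1])
```

Real video detectors aggregate features rather than scores, but the principle is the same: evidence from neighbouring frames stabilizes per-frame predictions.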
2.4.7 Spatial-temporal context
Spatio-temporal or spatial-temporal context refers to the combination of spatial and temporal information. It involves analyzing not only the spatial relationships between objects in a scene but also their temporal relationships over time. By incorporating both spatial and temporal information, spatio-temporal context can provide a richer understanding of dynamic and complex scenes. This context can be used to capture both short-term and long-term temporal dependencies, such as the motion of objects or the evolution of a scene. For example, in action recognition, spatio-temporal context can be utilized to capture the dynamic motion patterns of a particular activity, such as walking or running, or in video analysis, spatio-temporal context is used to detect and track objects over time. Berg et al. (2014) introduced a spatio-temporal prior for improving the accuracy of bird species classification. In their approach, location and time are discretized into spatio-temporal cubes, and a kernel density estimate is used to determine the distribution of each species on an individual basis. In another paper (Mac Aodha et al. 2019), a spatio-temporal model was proposed for integrating spatial context, long-term temporal context (months to years), and semantic context together. Hao Luo et al. (Luo et al. 2019), as shown in Fig. 21, used spatio-temporal context to enhance the network for detecting objects in difficult conditions such as video defocus, motion blur, occlusion, etc.
Fig. 21
Distinctions between pixel/instance level feature correspondence in the top and middle rows, and the spatio-temporal context in the bottom row. The former is susceptible to issues such as the appearance of new objects, and occlusion, whereas the latter depicts the interdependence between intra-frame and inter-frame proposals (Luo et al. 2019)
2.4.8 Thermal context
Thermal context in computer vision refers to the use of thermal imaging data to improve the performance of computer vision algorithms. Thermal images are visual displays of measured emitted, reflected, and transmitted thermal radiation by objects within an environment (Berg 2016). For example, in surveillance systems, thermal imaging can be used to detect people in low light or complete darkness, where visible light cameras would not be effective. Figure 22 shows one example of using thermal context to detect people in darkness. To incorporate thermal context into computer vision algorithms, researchers often use specialized thermal cameras to capture thermal images of the environment (Krišto et al. 2020). They also use fusion techniques to combine thermal and visible light images to gain a more complete understanding of the environment (Zhong et al. 2017).
Fig. 22
Comparison between thermal network and RGB network for finding objects (Banuls et al. 2020)
2.4.9 Photogrammetric context
Photogrammetry is the science of extracting 3D information about objects and structures from photographs. It involves taking photographs of an object or scene from different angles and then using specialized software to analyze the images and create a 3D model of the object or scene. Photogrammetric context refers to the intrinsic and extrinsic camera parameters of image capture. Intrinsic camera parameters refer to the internal properties of the camera, such as focal length, radiometric response, lens distortion, affinity, and shear (Lin et al. 2022). These parameters are used to calculate the relationship between the 3D world and the 2D image captured by the camera, which is important for accurate object detection and recognition. For example, knowing the focal length of the camera lens can help in estimating the size and distance of objects in the scene. Extrinsic camera parameters refer to the position, height, and orientation of the camera in space (Hoiem et al. 2008). For instance, in the case of aerial photography, the height of the camera above the ground can be used to estimate the size and location of objects in the scene. By combining the intrinsic and extrinsic parameters of the camera, it is possible to create a more accurate and complete model of the environment. For example, knowing the camera height and lens distortion can help in accurately detecting objects in a scene, even if they are partially occluded or have irregular shapes. Figure 23 is one example of using photogrammetric context to enhance the performance of the object detector.
Fig. 23
The top image shows the effect of camera height and horizon position on detecting objects in the environment. In the bottom image, the blue lines represent the estimated horizon. The green boxes indicate true cars, while the yellow boxes represent false positives for pedestrians. With the integration of photogrammetric context in the full model detection, the number of false detections has decreased, while the number of true detections has increased (Hoiem et al. 2008)
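The horizon/camera-height geometry behind Fig. 23 can be sketched with the standard level-camera ground-plane equations; this is only the deterministic geometry underlying approaches like Hoiem et al. (2008), not their probabilistic model, and the numeric values are illustrative assumptions.

```python
def ground_distance(camera_height_m, focal_px, pixels_below_horizon):
    """Distance to a ground-contact point from its image position.

    For a level pinhole camera at height h, a ground point imaged v pixels
    below the horizon line lies at distance Z = f * h / v.
    """
    return focal_px * camera_height_m / pixels_below_horizon

def object_height(box_top_px, box_bottom_px, horizon_px, camera_height_m):
    """Estimate the real height of an object standing on the ground plane.

    With the box bottom v_b pixels below the horizon and the box top v_t
    pixels below it, similar triangles give H = h * (v_b - v_t) / v_b.
    Image rows grow downward, so "below the horizon" means a larger row index.
    """
    v_b = box_bottom_px - horizon_px
    v_t = box_top_px - horizon_px
    return camera_height_m * (v_b - v_t) / v_b

# Camera 1.5 m above the road with an 800 px focal length: a car whose
# ground contact projects 60 px below the horizon is about 20 m away.
z = ground_distance(1.5, 800.0, 60.0)     # 20.0 m
h = object_height(300, 400, 280, 1.5)     # 100 px box -> 1.25 m tall object
```

A detector can use such estimates to reject candidates whose implied real-world height is implausible for their class, which is exactly how false pedestrian detections are pruned in Fig. 23.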
2.4.10 Geographic context
Geographic context refers to the use of location-based information to improve the performance of computer vision algorithms. Geographic context takes into account the fact that images and videos are often captured in specific geographic locations, which can provide valuable contextual information about the scene. As shown in Fig. 24, it can specify the actual location of the image, such as GPS coordinates, or it can indicate a more generic land type such as desert, ocean, urban, or agricultural areas. For example, knowing that an image was captured in a forest area could increase the probability of detecting wildlife rather than vehicles. (Ardeshir et al. 2014), (Groenen et al. 2023), (Wang et al. 2015), and (Wang et al. 2017) are examples of using geospatial context for detecting objects.
Fig. 24
By leveraging semantic context, geographic context, and keywords linked to the scene, it is possible to forecast the presence of objects in an image (Divvala et al. 2009)
2.4.11 Audio context
Audio context can provide information about environments that visual data alone cannot capture. For example, in surveillance systems, sound context can be used to detect gunshots or breaking glass, which can indicate potential security threats. In autonomous driving systems, sound context can be used to detect and recognize different types of vehicles or to detect and avoid obstacles that are not visible in the visual data. The integration of audio context into multimodal models, using techniques such as multimodal fusion (Gao et al. 2020) and attention mechanisms (Lieskovská et al. 2021) enables a comprehensive comprehension of the scene by capturing the relationships between visual and audio data. Figure 25 provides an example of leveraging audio context to detect objects alongside visual feature maps.
Fig. 25
Utilizing a combination of noise-contrastive and clustering-based self-supervised learning to create self-detections (boxes and labels) and then utilizing those as targets to train a detector (Afouras et al. 2022)
In addition to the presence of objects, audio can also indicate distances and even directions. For example, by hearing the sound of a bird, in addition to identifying the bird as an object, the approximate distance and direction of the sound can also be estimated. Figure 26 is an example of leveraging sound context with visual features to create an audio-visual event localization framework in unconstrained videos.
Fig. 26
Audio-visual event localization framework with audio-guided visual attention and multimodal fusion (Tian et al. 2018)
2.4.12 Text context
Text context can provide important information about a scene or object. Text context can be used to enhance the performance of computer vision tasks such as object recognition, scene understanding, and image retrieval (Mishra et al. 2013). Figure 27, for example, is an image-text retrieval system, and text context can be used to search for images with specific labels or descriptions. In a scene understanding system, text context can be used to identify specific objects within the scene or to infer relationships between objects based on their labels or descriptions. To incorporate text context into computer vision algorithms, natural language processing (NLP) techniques are effective approaches to analyze and extract textual information from various sources, such as image captions, object labels, or product descriptions. Machine learning algorithms can then be trained on this data to recognize patterns and identify specific objects or scenes based on their textual context.
Fig. 27
Visual context learning based on textual knowledge for image-text retrieval (Qin et al. 2022)
2.4.13 Illumination context
Illumination context refers to the study of how lighting conditions affect the appearance of objects in a scene and how this information can be used to improve computer vision algorithms. Illumination context involves information such as sun direction (Lalonde et al. 2008), sky color, shadow contrast, and covered by clouds. Illumination context is important because the way light interacts with objects in a scene can dramatically affect their appearance. By analyzing the illumination context of a scene, computer vision algorithms can adjust for variations in lighting conditions and make more accurate predictions about the appearance and behavior of objects in that scene.
2.4.14 Weather context
In outdoor applications, there is no escape from “bad” weather. Ultimately, computer vision systems need to leverage mechanisms that enable them to function in the presence of haze, fog, rain, hail, and snow (Narasimhan and Nayar 2002). The bad weather, however, turns out to have a positive side since it could serve as a powerful means for coding and conveying scene structure (Narasimhan and Nayar 2002). Weather context would describe meteorological conditions such as temperature, wind speed, or direction, and weather conditions such as rain, snow, mist, and different seasons.
3 Research method
A review protocol is developed to guide the conduct of the literature survey. Research questions (RQs) as mentioned in Sect. 1, and a set of criteria determine the topic and the objective of this literature review.
3.1 Identification of bibliographical databases
To conduct this literature review, three major and well-known bibliographical databases with good coverage of computer science were selected: IEEE Xplore, Web of Science (WoS), and Scopus. According to (Stapic et al. 2012), it is important to determine the starting and ending dates of a literature review; thus, in order to focus on the state-of-the-art methods, the period from 2018 to 2023 is selected as the date range.
3.2 Searching and selection of primary studies
A Boolean search criterion was used to query the databases: “(Title ((Context) AND (Object detection)) OR Abstract ((Context) AND (object detection)))”. Papers duplicated across databases were removed. The results are shown in Table 1.
Table 1 Papers retrieved from databases (Date Range: 17 April 2023 - 1 October 2024)
3.3 Inclusion and exclusion criteria
According to (Kitchenham 2004), inclusion and exclusion criteria should be based on research questions. The inclusion and exclusion criteria are shown in Table 2. It should be noted that categories such as salient, RGB-D, remote sensing, 3D, moving, and adversarial object detection were excluded due to their specialized nature, lower recent research volume, and narrower practical applications. This allows for a more focused and relevant survey, ensuring thorough and detailed analysis of the most active and impactful areas in object detection.
Table 2 Inclusion and exclusion criteria used in the systematic literature review
3.4 Data extraction and validity control
An overview of the data extraction strategy is shown in Fig. 28. A total of 265 papers were retrieved from the three academic databases. Papers duplicated across databases were counted under only one of the sources. Of these 265 papers, 2 were rejected for being either a survey or a review, 2 were excluded because they were not written in English, and 126 were excluded for emphasizing context in salient object detection, RGB-D object detection, remote sensing object detection, 3D object detection, moving object detection, or adversarial object detection, which are not covered in this literature review. Then, 16 papers were eliminated due to the absence of a final mAP and 2 due to the use of alternative evaluation metrics. Finally, 117 papers qualify for this systematic literature review. All 117 papers were reviewed and classified into different application areas: context in general object detection, context in small object detection, context in video object detection, context in few-shot object detection, context in one-shot object detection, context in zero-shot object detection, and context in camouflaged object detection. In all categories, the method or model, level and type of context, backbone and architecture, mechanism or module for exploiting and integrating contextual information, dataset, and mAP as the evaluation metric were investigated.
Fig. 28
Data extraction strategy
3.5 Classification of the reviewed papers over time
Context has received greater attention in recent years than in the past. According to Fig. 29, the number of publications that have employed contextual information to improve object detection is increasing. Figure 29 demonstrates that the utilization of contextual information in computer vision is an active area, drawing the attention of more researchers and practitioners. Please note that the statistics for 2023 only include papers published from January to April.
Fig. 29
Distribution based on years of the publications
3.6 Context in categories of object detection
As shown in Fig. 30, in this literature review, we analyze seven distinct categories of object detection. The rationale behind the selection of these seven categories is as follows. Firstly, both video object detection and general object detection are extremely prevalent and currently garnering significant attention. Secondly, detecting small objects continues to pose a substantial obstacle in the field of object detection. Thirdly, no comprehensive research has been conducted on object detection using context in zero-shot object detection, one-shot object detection, few-shot object detection, and camouflaged object detection. Most of the papers under review have focused on using context in general object detection. The limited number of papers in zero-shot, one-shot, few-shot, and camouflaged object detection indicates that these areas have not yet fully benefited from contextual information and still have considerable room for growth.
Fig. 30
The number of context-based papers in different categories of object detection
3.7 Data extraction and synthesis
We then extracted information relevant to the RQs from each paper: (1) detector name; (2) employed context types; (3) employed context levels; (4) backbone and architecture; (5) mechanisms or modules used to integrate contextual information; (6) evaluation metrics; and (7) the dataset used for evaluation. For evaluation, we use mean Average Precision (mAP) to compare the effectiveness of methods across studies. For methods evaluated on the PASCAL VOC (2007 and 2012) datasets, we report mAP at the 50% Intersection over Union (IoU) threshold (mAP@50) with 11-point interpolation, consistent with the VOC benchmark. For COCO, we adopt the COCO-standard mAP averaged over IoU thresholds from 50% to 95% in 5% increments (mAP@50-95) with 101-point interpolation, and include COCO-specific metrics such as AP for small (APs), medium (APm), and large (APl) objects. For other datasets, we use mAP as reported in the referenced papers to allow consistent comparison across methods. Additionally, Mean Absolute Error (MAE), F-measure, S-measure, and E-measure are included for camouflaged object detection, as these metrics are standard in that domain. For zero-shot, one-shot, and few-shot object detection, the comparison also considers each method's unique learning approach alongside mAP, as it plays a critical role in evaluating these models.
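As a concrete illustration of the VOC protocol mentioned above, the 11-point interpolated AP can be computed as follows. This is a minimal pure-Python sketch with illustrative precision-recall values; it is not taken from any reviewed paper.

```python
def voc_ap_11point(recalls, precisions):
    """11-point interpolated Average Precision (PASCAL VOC 2007 style).

    recalls, precisions: parallel lists of points on the precision-recall
    curve, as produced by sweeping a detection-score threshold.
    """
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: the maximum precision at any recall >= t.
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap

# Illustrative PR curve: precision stays 1.0 up to recall 0.5, then drops,
# and the detector never exceeds recall 0.6.
recalls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
precisions = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
ap = voc_ap_11point(recalls, precisions)  # (6 * 1.0 + 0.5) / 11
```

The COCO variant follows the same idea but averages over 101 recall points and over IoU thresholds from 0.50 to 0.95.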
4 Analysis and discussion
4.1 Datasets
In object detection research, several datasets have become benchmarks due to their diversity, complexity, and real-world relevance. Among the most prominent are MS COCO (Lin et al. 2014) and PASCAL VOC2007 (Everingham et al. n.d.) and 2012 (Everingham and Winn 2012), widely used across various object detection tasks. MS COCO contains over 200k annotated images and 80 object categories, presenting challenges such as detecting small, medium, and large objects, managing cluttered scenes, and understanding complex contextual interactions. Similarly, PASCAL VOC2007 and PASCAL VOC2012 are essential benchmarks with 20 object classes, where models must contend with occlusions, variations in object viewpoints, and the presence of small instances.
In specific domains, datasets such as Cityscapes (Cordts et al. 2016) are critical for urban scene understanding, challenging models with crowded environments, diverse lighting conditions, and occlusions. BDD100K (Yu et al. 2020) is a large-scale dataset designed for autonomous driving that introduces challenges in detecting objects across diverse conditions, including nighttime driving, rain, and heavy traffic, requiring models to handle both small and large objects in dynamic street scenes. DOTA (Xia et al. 2018) addresses object detection in aerial imagery, presenting challenges like rotated objects and significant scale variations, particularly for detecting airplanes, vehicles, and ships. In face detection, WIDER FACE (Yang et al. 2016) is a well-known dataset that presents extreme variations in occlusion, pose, and lighting conditions, making it highly challenging for models to generalize across real-world crowded scenarios. For wildlife detection, Caltech Camera Traps (Beery et al. 2018) introduces the challenge of detecting animals in natural environments, where factors like camouflage and low-contrast conditions complicate detection tasks. S-UODAC2020 (Chen et al. 2023), designed for underwater object detection, tests models’ adaptability to low visibility, distortion, and dynamic marine environments.
While these datasets are widely used for benchmarking, each has certain limitations and biases. For instance, MS COCO, with its rich variety, predominantly features objects common in western urban environments, which can bias models toward specific cultural contexts and affect generalizability in diverse settings. Similarly, PASCAL VOC datasets, while influential, lack the breadth of modern datasets, with fewer object classes and limited data diversity. The Cityscapes and BDD100K datasets focus on urban driving scenarios, which may bias models toward detecting road-related objects, potentially limiting performance in rural or off-road contexts. DOTA, in aerial imagery, presents unique challenges like scale variance but also emphasizes certain object types, such as vehicles and buildings, potentially reducing the robustness of models in other aerial domains. WIDER FACE and Caltech Camera Traps datasets, though comprehensive within their domains, are biased by their respective collection methods: WIDER FACE images come from news and media, possibly skewing model performance in casual or non-professional images, while Caltech Camera Traps may bias models toward recognizing specific animal species or environments. Lastly, underwater datasets like S-UODAC2020 are limited by the constraints of underwater imaging technology, which can reduce detection accuracy in diverse marine environments. Addressing these biases is crucial for developing robust object detection models that generalize well across varied real-world conditions. More comprehensive details about different datasets, along with their categories and attributes, can be found in Table 3, which organizes them by task, data type, number of classes, and dataset size.
Table 3 Overview of datasets utilized by context-based object detection approaches
4.2 General object detection (GOD)
General object detection entails the identification and localization of objects from various categories within images or videos. This task is essential for applications such as autonomous vehicles, surveillance, image retrieval, and augmented reality. The reviewed papers have used different approaches to integrate context into networks, which we categorize into seven groups: (1) graph-based approaches (Sect. 4.2.1), (2) hierarchical approaches (Sect. 4.2.2), (3) context data augmentation (Sect. 4.2.3), (4) multi-scale approaches (Sect. 4.2.4), (5) RPN-based approaches (Sect. 4.2.5), (6) attention-based approaches (Sect. 4.2.6), and (7) other approaches (Sect. 4.2.7). All approaches are shown in Fig. 31. Approaches highlighted in red are exclusive to two-stage models, those highlighted in blue are designed solely for one-stage models, and green-highlighted approaches are weakly supervised object detectors. The extracted information and mAPs of the papers are compared in three tables: (1) papers that utilized the COCO dataset in Table 5, (2) papers that utilized the Pascal VOC dataset in Table 6, and (3) papers that employed other datasets in Table 7.
Fig. 31
Employed approaches for integrating context into general object detection
4.2.1 Graph-based approaches
Graph-based methods enhance object detection by encoding relationships between objects as a graph structure. In this setup, nodes represent objects, and edges denote relationships, providing a structured way to model spatial and semantic context. This approach allows for better understanding of the scene, as the network can leverage the co-occurrence and spatial positioning of objects, which often provide vital clues for detection. Figure 32 illustrates this, showing how object relationships are captured as graph edges, facilitating scene understanding. For example, a person is next to a skateboard based on spatial context, and a helmet and skateboard commonly appear together based on semantic context.
Fig. 32
Visual relation detection and its scene graph representation
Figure 33 demonstrates how context can be integrated into a graph-based object detection framework. The object detector first identifies bounding boxes for detected objects. Then, the Visual Region Network builds a graph with objects as nodes and relationships as edges, capturing spatial connections like 'above' or 'next to'. The Spatial Network encodes spatial relationships, enhancing the model's understanding of object positioning, while the Semantic Network processes category relationships, identifying associations like 'person rides horse'. By combining these contextual layers, the model refines object detections, as seen in relationships like 'horse above grass'.
Fig. 33
Integrating spatial and semantic context into a graph-based model (Zhang et al. 2021)
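The core mechanism shared by these frameworks, nodes exchanging information along relationship edges, can be sketched with a single message-passing step over a toy scene graph. The object names, feature vectors, and blending weight below are invented for illustration and do not come from any specific reviewed paper.

```python
def message_passing_step(features, edges, alpha=0.5):
    """One step of context propagation on a scene graph.

    features: dict mapping object name -> feature vector (list of floats).
    edges: list of (src, dst) relationship pairs; messages flow src -> dst.
    Each node's new feature blends its own feature with the mean of its
    neighbours' features, so context from related objects refines each node.
    """
    updated = {}
    for node, feat in features.items():
        incoming = [features[s] for s, d in edges if d == node]
        if not incoming:
            updated[node] = list(feat)  # isolated node keeps its feature
            continue
        mean_msg = [sum(vals) / len(incoming) for vals in zip(*incoming)]
        updated[node] = [(1 - alpha) * f + alpha * m
                         for f, m in zip(feat, mean_msg)]
    return updated

# Toy graph for the Fig. 32 example: skateboard and helmet send context
# to the person node they are related to.
feats = {"person": [1.0, 0.0], "skateboard": [0.0, 1.0], "helmet": [0.5, 0.5]}
edges = [("skateboard", "person"), ("helmet", "person")]
new_feats = message_passing_step(feats, edges)
```

Real graph-based detectors learn the edge weights and message functions end-to-end; this sketch only shows how relational structure lets one object's representation absorb context from its neighbours.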
In this section, we explore five graph-based approaches, each leveraging context differently to enhance object detection.
- (1)
Distilling Knowledge Graph (Yang et al. 2023) seamlessly integrates spatial and semantic context into object detection using a knowledge graph and knowledge distillation (KD). By leveraging a teacher-student model, the method constructs both geometry and semantic graphs through a transformer layer, capturing object interactions at both local and global levels. The edges in the graph encode relationships, which are incorporated into the attention matrix to enable graph-level attention, enhancing contextual learning. Among the graph-based approaches reviewed here that were tested on the COCO dataset, this method achieves the highest AP, APs, APm, and APl. - (2)
Knowledge-guided Reasonable Object Detection (KROD) (Ji et al. 2022) focuses on global context and category relationships. It introduces Global Category Knowledge Mining (GKM), which integrates multi-label image-level classification results to provide global category knowledge for the detector. Additionally, the Category Relationship Knowledge Mining (CRM) module feeds the raw detection outputs into a knowledge graph built from object category co-occurrences to further refine the initial results. - (3)
The Structure Inference Network (SIN) (Liu et al. 2018) enhances object detection by integrating both scene-level and instance-level contextual information within a graphical model framework. In SIN, objects are represented as nodes in a graph, and their relationships form the edges, allowing the model to capture object interactions and scene context. Incorporated into a standard detection framework like Faster R-CNN, SIN utilizes this structured context to iteratively update each object’s state, refining predictions based on both local appearance and surrounding context, thus improving accuracy in complex scenes. - (4)
Adaptive context-aware object detection (AdaCon) (Neseem and Reda 2021) uses the spatial co-occurrence probabilities of object categories to create an adaptive network. A branch controller selects which sections of the network to execute during runtime based on the spatial context of the input frame. AdaCon is the first detector to present an adaptive algorithm for one-stage object detectors. - (5)
JLWSOD (Lai et al. 2024), a method for weakly supervised object detection, integrates two types of contextual information: instance-wise correlation and semantic-wise correlation. The framework comprises three key modules: the Instance-Wise Detection Branch, which enhances object localization by capturing correlations among spatially adjacent instances; the Semantic-Wise Prediction Branch, which addresses semantic ambiguity by modeling relationships between co-occurring object categories; and the Interactive Graph Contrastive Learning (iGCL) module, which facilitates the joint optimization of both contextual information types. This interactive learning mechanism allows for effective propagation of image-level supervisory signals to instance-level predictions.
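The co-occurrence reasoning used by approaches such as KROD can be illustrated with a minimal sketch: detection scores are raised or lowered according to how plausible each label is given the other confident detections in the image. The co-occurrence table, blending weight, and thresholds below are invented for illustration; the paper mines its knowledge graph from data.

```python
def refine_scores(detections, cooccurrence, weight=0.3, support_thresh=0.5):
    """Rescore detections using a category co-occurrence prior.

    detections: list of (label, score) pairs.
    cooccurrence: dict mapping (label_a, label_b) -> prior in [0, 1].
    Each detection's score is blended with the average co-occurrence prior
    between its label and the labels of the other confident detections.
    """
    confident = [l for l, s in detections if s >= support_thresh]
    refined = []
    for label, score in detections:
        others = [c for c in confident if c != label]
        if others:
            prior = sum(cooccurrence.get((label, o), 0.0)
                        for o in others) / len(others)
            score = (1 - weight) * score + weight * prior
        refined.append((label, score))
    return refined

# A 'surfboard' detected next to a confident 'horse' is contextually
# implausible, so its score drops relative to the equally scored 'rider'.
cooc = {("rider", "horse"): 0.9, ("horse", "rider"): 0.9,
        ("surfboard", "horse"): 0.05, ("horse", "surfboard"): 0.05}
dets = [("horse", 0.8), ("rider", 0.55), ("surfboard", 0.55)]
refined = dict(refine_scores(dets, cooc))
```

This is exactly the kind of false-positive suppression that motivates knowledge-graph modules: context breaks ties between visually ambiguous candidates.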
Each of the reviewed models demonstrates distinct approaches to incorporating context in object detection, revealing both strengths and limitations in various technical aspects. The Distilling Knowledge Graph approach is computationally efficient, benefiting from a knowledge graph that integrates spatial and semantic context, but relies heavily on a teacher-student framework, which can complicate training setups. By contrast, Knowledge-guided Reasonable Object Detection (KROD) provides robust contextual reasoning through modules like Global Knowledge Mining and Category Relationship Mining, effectively suppressing false positives without extensive pre-processing. However, its dependency on prior co-occurrence knowledge graphs limits adaptability to uncommon object relations. The Structure Inference Network (SIN) excels by employing a dual-context approach, leveraging scene-level and instance-level relationships to enhance both classification and localization accuracy. This method, however, struggles with rare or unexpected contexts due to its reliance on generalized scene assumptions and is highly dependent on well-structured graphs, which are complex to optimize. On the resource-constrained side, AdaCon stands out with its adaptive efficiency, selectively executing branches based on spatial context, which conserves energy and reduces latency, though at the cost of potentially lower precision on rare object combinations. Finally, JLWSOD offers a novel solution for weakly supervised settings by optimizing instance and semantic correlations simultaneously, enhancing detection performance across crowded scenes. Yet, it faces challenges with small objects and is computationally intensive due to its interactive contrastive learning mechanism.
Together, these models illustrate the progressive advances in integrating context, highlighting an ongoing trade-off between computational efficiency, adaptability to novel contexts, and detection accuracy across various deployment environments.
4.2.2 Hierarchical approaches
Hierarchical approaches involve organizing visual information in a layered or multi-level manner to facilitate efficient processing and analysis, allowing models to consider context at multiple granularities. In object detection, hierarchical frameworks aim to emulate human perception by first capturing broader, scene-level context and then refining this information through intermediate layers down to precise, localized details. This layered processing strategy enhances models’ ability to handle complex scenes by systematically incorporating context from global to local levels, which is particularly useful in scenarios with dense or overlapping objects. Through this process, hierarchical models achieve a balance between high-level contextual awareness and detailed object recognition, leading to improved accuracy and robustness in object detection tasks. Figure 34 illustrates this hierarchical process, showing how the focus narrows progressively from the entire scene to specific regions of interest. Each level in the hierarchy represents a step that integrates contextual cues at varying scales, enabling accurate identification of objects within a complex scene.
Fig. 34
Image hierarchy of three levels with non-overlapped quarters
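The coarse-to-fine traversal of Fig. 34 can be sketched as a recursive search that repeatedly descends into the quarter most likely to contain the target. The scoring function below is a hand-written stand-in for the learned agent that hierarchical detectors actually train.

```python
def hierarchical_search(region, score_fn, min_size=32):
    """Coarse-to-fine localization over an image hierarchy.

    region: (x, y, w, h) in pixels. At each level the region is split into
    four non-overlapped quarters and the search descends into the quarter
    that score_fn rates highest, until the region is small enough.
    Returns the list of visited regions, from the whole image down.
    """
    x, y, w, h = region
    path = [region]
    while w > min_size and h > min_size:
        hw, hh = w // 2, h // 2
        quarters = [(x, y, hw, hh), (x + hw, y, hw, hh),
                    (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
        x, y, w, h = max(quarters, key=score_fn)
        path.append((x, y, w, h))
    return path

# Toy scorer: the "object" sits at pixel (200, 40); quarters whose centre
# is closer to it score higher, mimicking a context-aware agent.
def toy_score(q):
    x, y, w, h = q
    cx, cy = x + w / 2, y + h / 2
    return -((cx - 200) ** 2 + (cy - 40) ** 2)

path = hierarchical_search((0, 0, 256, 256), toy_score)
```

Three levels suffice here to shrink a 256x256 scene to a 32x32 region containing the target, which is the efficiency argument made for hierarchical search strategies.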
Here, we explore four hierarchical approaches: Hierarchical Context Embedding (HCE), the Hierarchical Context Embedding module, the Context-Aware Hierarchical Feature Attention Network (CHFANet), and a fine hierarchical object detection method.
- (1)
Hierarchical Context Embedding (HCE) (Chen et al. 2020) benefits from hierarchical context by embedding contextual cues at both image-level and instance-level resolutions. Through the image-level categorical embedding module, it integrates whole-image context to support object-level classification, especially for objects dependent on contextual surroundings. The feature fusion and confidence fusion strategies further exploit this hierarchical embedding, combining global and instance-level contextual cues in the final classification stage. This layered context utilization improves the network’s ability to discern objects within cluttered backgrounds and identify contextually linked objects, enhancing robustness in varied detection tasks. The HCE exhibits superior performance on region-based detectors tested on the COCO dataset. - (2)
Hierarchical Context Embedding module (Qiu et al. 2020) leverages context by recalibrating noisy segmentation features based on hierarchical attention maps that span different spatial distances. Local context features focus on individual parts of objects (e.g., a person’s head or a car’s wheel), while non-local context encompasses broader scene-level information (e.g., a group of objects or background elements). By embedding this contextual information into the detection features, the model gains an enhanced ability to discern and differentiate objects in intricate settings, ensuring that both local details and global scene characteristics contribute to accurate detection outcomes. - (3)
Context-Aware Hierarchical Feature Attention Network (CHFANet) (Xu et al. 2020) benefits from context through its CFE and HFF modules. The CFE module collects context information at different scales using dilated convolutions, creating a context-aware feature map that captures both local and broad contextual cues. This is essential for accurately detecting objects across various scales in an image. The HFF module further strengthens the context utilization by combining high-level semantic information (useful for classification) with low-level spatial information (useful for localization), followed by channel-wise attention to selectively highlight significant features. Through these mechanisms, CHFANet is able to leverage both spatial and semantic context to improve detection accuracy and distinguish between objects more effectively. CHFANet outperforms the method of Cao et al. (2020) when evaluated on the PASCAL VOC 2007 dataset. - (4)
Fine hierarchical object detection method (Cao et al. 2020) leverages context primarily through the hierarchical search strategy and the ReNet layer. The top-down search mechanism mimics human perception, analyzing each region in context by progressively zooming in on areas most likely to contain the object. This strategy not only reduces redundant processing but also enables a contextual understanding of each subregion in relation to the entire image. Additionally, the ReNet layer enhances the agent’s ability to interpret spatial context over long distances, aiding in accurate decision-making. By incorporating both local and global contextual cues, the model improves its ability to locate objects that may be partially obscured or surrounded by complex backgrounds.
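Channel-wise attention of the kind applied in CHFANet's HFF module can be sketched as a squeeze-and-reweight step over per-channel feature maps. This is a squeeze-and-excitation-style simplification, not the paper's exact module: real implementations learn the gating with small fully connected layers rather than using a raw sigmoid on the channel mean.

```python
import math

def channel_attention(channels):
    """Reweight feature channels by their global importance.

    channels: list of 2-D feature maps (lists of rows). Each channel is
    squeezed to its global average, passed through a sigmoid gate, and the
    resulting weight rescales every activation in that channel, so
    strongly responding channels are emphasised over near-silent ones.
    """
    gated = []
    for ch in channels:
        values = [v for row in ch for v in row]
        avg = sum(values) / len(values)           # squeeze: global average pool
        weight = 1.0 / (1.0 + math.exp(-avg))     # gate in (0, 1)
        gated.append([[v * weight for v in row] for row in ch])
    return gated

# Two 2x2 channels: one with strong activations, one silent.
strong = [[2.0, 2.0], [2.0, 2.0]]
weak = [[0.0, 0.0], [0.0, 0.0]]
out_strong, out_weak = channel_attention([strong, weak])
```

The effect is the selective highlighting described above: informative channels keep most of their magnitude while uninformative ones are damped.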
A technical comparison of these methods reveals distinct strengths and limitations in handling contextual information across different detection scenarios. HCE excels in context-dependent detection, particularly for objects that lack unique visual cues. However, its reliance on global context embedding may lead to misclassification in complex backgrounds, and its multi-layer fusion adds computational load, which can affect efficiency in real-time applications. The module of Qiu et al. (2020) takes a more granular approach, integrating local and non-local context with multiscale embeddings, making it more effective in handling cluttered environments. Yet, its dependency on accurate segmentation and the high complexity of its attention mechanisms make it challenging for applications with low-quality input data or resource constraints. CHFANet improves multi-scale detection by combining spatial and semantic information through channel-wise attention, which enhances feature selectivity for objects at varied scales. This approach, however, introduces complexity that could slow processing in real-time contexts, and the use of dilated convolutions can introduce artifacts, potentially affecting accuracy in high-resolution images. Lastly, the fine hierarchical object detection method offers efficiency by narrowing down search regions through a top-down hierarchical strategy, which is computationally efficient for locating single objects. Yet, its sequential decision-making process can be limiting when multiple objects need detection, and its dependency on a fixed hierarchy reduces flexibility in variable-scale detection tasks. Overall, hierarchical methods show how multi-level context integration can improve object detection, though their complexity and computational demands may limit applicability in real-time or large-scale scenarios.
4.2.3 Context data augmentation
While most methods focus on directly incorporating context within model architectures, context data augmentation offers an alternative approach by enriching training data with contextually relevant objects. As illustrated in Fig. 35, context data augmentation can be integrated into the training pipeline to enhance model robustness. This process involves selecting contextually appropriate objects, adjusting their appearance, and blending them into background scenes in a way that aligns with the visual context. By augmenting images with realistic object placements that match scene characteristics, this approach improves the model’s ability to recognize and localize objects across diverse environments, ultimately enhancing detection accuracy and generalization.
Fig. 35
Context data augmentation, identifying suitable empty spaces for object insertion based on scene context (Dvornik et al. 2019)
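The placement step in Fig. 35 can be sketched as scoring candidate boxes against the scene and keeping only contextually plausible, non-overlapping ones. The plausibility table below is invented for illustration; the reviewed papers learn this score with a CNN rather than looking it up.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def place_object(category, candidates, existing_boxes, plausibility,
                 min_score=0.5, max_iou=0.1):
    """Pick the best candidate box for pasting a new object into a scene.

    candidates: list of (box, region_label) options in the background image.
    plausibility: dict (category, region_label) -> how natural the placement
    looks, in [0, 1]. Candidates overlapping existing objects are rejected,
    and placements below min_score are discarded as contextually implausible.
    """
    best, best_score = None, min_score
    for box, region in candidates:
        if any(iou(box, b) > max_iou for b in existing_boxes):
            continue  # would occlude an existing object
        score = plausibility.get((category, region), 0.0)
        if score > best_score:
            best, best_score = box, score
    return best

# A car belongs on the road, not in the sky, and not on top of an
# already-placed car.
plaus = {("car", "road"): 0.9, ("car", "sky"): 0.05}
candidates = [((0, 0, 40, 20), "sky"),
              ((10, 60, 40, 20), "road"),
              ((50, 60, 40, 20), "road")]
existing = [(15, 55, 30, 30)]  # a car already occupies this spot
chosen = place_object("car", candidates, existing, plaus)
```

Filtering out implausible and overlapping placements is what keeps the augmented images realistic enough to help rather than hurt training.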
This category reviews four papers that use context-based data augmentation to enhance object detection by strategically placing synthetic objects within scenes.
- (1)
Context Augmentation Faster R-CNN (CA-Faster R-CNN) (Leng and Liu 2022) enhances region proposals in two-stage object detectors by initially creating a coarse set of proposals and subsequently improving uncertain ones using appearance and geometry (spatial) information. Furthermore, it uses pair-wise relationships between region proposals to augment global feature information for better recognition outcomes. This method of context augmentation enhances object detection in cluttered scenes, where direct visual cues are insufficient. CA-Faster R-CNN outperforms Context Augmentation (Dvornik et al. 2018) when evaluated on the VOC 2012 dataset. - (2)
Context Augmentation (Dvornik et al. 2018) utilizes contextual information by training a CNN to predict likely placements for new objects in augmented images based on surrounding visual cues. This context-aware placement prevents unnatural object positioning, making the augmented data more realistic. The model evaluates the likelihood of an object’s presence in a particular bounding box based on the surrounding area, allowing it to integrate objects naturally within scenes. This method particularly benefits detection tasks with ambiguous contexts by ensuring object positioning aligns with common scene configurations, thereby enhancing training efficacy and model accuracy. - (3)
Another context-based data augmentation method (Zhang et al. 2021) focuses on copy-paste augmentation, which involves pasting foreground objects onto background images. This method benefits from context by leveraging a Context Region Proposal Module (CRPM) to identify regions within a background image where objects can be realistically placed. A Class-Location-Size Aware Module (CLSAM) then refines these placements by ensuring that each object's class, location, and size are contextually appropriate for the scene, aligning with common object interactions (e.g., people seated on chairs, bottles on tables). By refining placement based on context, the model creates more photorealistic training data, enhancing the model's ability to generalize to real-world conditions. - (4)
Copy-Paste data augmentation (Li et al. 2022) utilizes context by guiding object placement with both local (instance-level) and global (scene-level) transformations. The local context adapts each instance mask to match real-world scaling and color variations based on its perceived distance from the camera and surrounding conditions. Global context, derived from a multi-task model, directs where objects are placed to maintain logical coherence with traffic environments, such as placing traffic cones only on the ground along lanes and avoiding occlusions with other objects. Together, these contextual insights create realistic data augmentations that strengthen the model’s detection accuracy, especially in complex and rare object scenarios.
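The instance-level scaling described for the local context of Li et al. (2022) can be illustrated with a pinhole-style rescaling: a pasted instance shrinks in proportion to its assumed distance from the camera. The focal length and object sizes below are made-up values for illustration, not parameters from the paper.

```python
def scaled_paste_size(real_height_m, distance_m, focal_px=800.0):
    """Pixel height of a pasted instance under a pinhole camera model.

    An object of real_height_m metres placed distance_m metres from the
    camera projects to roughly focal_px * real_height_m / distance_m
    pixels, so instances pasted 'deeper' into the scene are drawn smaller,
    keeping the augmented image geometrically coherent.
    """
    return focal_px * real_height_m / distance_m

near = scaled_paste_size(0.7, 5.0)   # a traffic cone 5 m from the camera
far = scaled_paste_size(0.7, 20.0)   # the same cone 20 m away
```

Matching pasted-object scale to scene depth in this way is one of the local-context cues that makes copy-paste augmentations look plausible in driving scenes.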
The methods in context-based data augmentation reveal intriguing trade-offs that highlight potential areas for optimization and improved applicability. For example, while CA-Faster R-CNN's iterative refinement boosts object detection by enhancing region proposal quality, it relies heavily on pairwise relationships, which can limit its generalizability to more complex scenes with ambiguous spatial arrangements. In contrast, methods like that of Dvornik et al. (2018) prioritize realism through context-driven object placement, which provides significant accuracy gains in low-data scenarios. However, its reliance on blending techniques, despite their benefits, can sometimes introduce artifacts, reducing consistency in the training data. The method of Zhang et al. (2021), on the other hand, achieves photorealistic object placement by refining contextual fit through specific class, location, and size considerations, making it highly effective for controlled environments but resource-intensive. The approach of Li et al. (2022) is uniquely tailored to traffic contexts, incorporating both local and global transformations that reduce false positives by aligning objects with traffic cues; however, its dependence on domain-specific context restricts its versatility. Therefore, while each method pushes the boundaries of contextual realism in augmentation, future advancements could benefit from hybrid approaches that combine these methods' strengths, such as refining object placement with pairwise and context-aware relationships while remaining computationally efficient and adaptable across varied domains.
4.2.4 Multi-scale approaches
In object detection, multi-scale approaches address the challenge of identifying objects at different scales within complex scenes by leveraging techniques like Feature Pyramid Networks (FPN) (Lin, Dollár, et al. 2017). FPN, a widely used architecture in this domain, enhances detection by generating feature maps at multiple scales, where each layer captures a progressively larger or smaller receptive field. As shown in Fig. 36, it involves a bottom-up pathway, which extracts low-level features from input images, and a top-down pathway, which refines these features by combining spatial details and high-level semantics from deeper layers. By merging feature information across levels, FPN helps retain both fine details for small objects and contextual information for larger ones, thereby allowing models to detect objects of various sizes effectively.
Fig. 36
Feature pyramid network
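The top-down pathway of Fig. 36 can be sketched with 1-D "feature maps": each pyramid level adds its lateral feature to an upsampled version of the coarser level above it. This is a minimal sketch using nearest-neighbour upsampling and element-wise addition; a real FPN also applies 1x1 convolutions to the laterals and 3x3 convolutions to the fused outputs.

```python
def upsample_nearest(feat):
    """Double the length of a 1-D feature map by repeating each value."""
    return [v for v in feat for _ in range(2)]

def fpn_top_down(laterals):
    """Fuse pyramid levels top-down, FPN style.

    laterals: 1-D feature maps ordered fine-to-coarse, each half the length
    of the previous one. The coarsest map seeds the pathway; every finer
    level adds its lateral connection to the upsampled, semantically
    richer map from above. Returns fused maps in fine-to-coarse order.
    """
    fused = [laterals[-1]]  # coarsest level passes through unchanged
    top = laterals[-1]
    for lateral in reversed(laterals[:-1]):
        top = [l + u for l, u in zip(lateral, upsample_nearest(top))]
        fused.append(top)
    return fused[::-1]

# Three levels of lengths 4, 2, 1 (fine to coarse): the high-level value
# 3.0 propagates down and enriches every finer level.
p = fpn_top_down([[1.0, 1.0, 1.0, 1.0], [1.0, 2.0], [3.0]])
```

The test below confirms the key property: semantics from the coarsest level reach the finest map, which is why FPN gives small-object detections access to broad context.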
Sixteen of the papers reviewed are classified within this category, as outlined below:
- (1)
Dilated and Deformable Feature Pyramid Network (DDFPN) (Wu et al. 2021) enhances object detection by using context to tailor receptive fields spatially (via Dilated and Deformable Convolution (DDC)) and semantically (via Cross Feature Correlation (CFC) and Co-occurrence Inference (CI) modules), thus accommodating a range of object scales and orientations in complex scenes. It creates more adaptable receptive fields compared to the standard FPN. This multi-level approach enables DDFPN to maintain semantic consistency across object boundaries, supporting the identification of smaller or obscured objects by leveraging cues from their surrounding context. - (2)
Context and level-aware FPN (CL-FPN) (Yang et al. 2023) enriches context by capturing spatial and semantic information at different levels. The dilated convolution in Context Enhancement Module (CEM) collects multi-scale context, making the model more adaptable to objects of varying sizes and improving overall detection reliability. Attention-Guided Feature Refinement Module (AFRM) further enhances this by focusing on spatial relationships, which boosts the model’s ability to discern context within high-resolution, fine-grained details. Together, CEM and AFRM provide a dual approach to contextual enhancement, strengthening object detection through a comprehensive integration of spatial and semantic cues. - (3)
Global Context Aware (GCA) RCNN (Zhang et al. 2021) architecture tackles the problem of losing some contextual information in the process of resizing the object proposal in two-stage object detectors. It improves the spatial relations between the foreground and background by incorporating global context information. The GCA framework incorporates a context-aware mechanism that utilizes a global feature pyramid and attention techniques for the purpose of extracting and refining features, respectively. - (4)
Global context encoding (GCE) module (Peng et al. 2022) employs a dual-path approach to fuse top-down, image-level classifications with regional features, making use of high-level contextual cues to improve detection accuracy. This integration enriches the model’s understanding of the full scene, allowing it to leverage semantic relationships within the image to identify objects more accurately, especially in complex or cluttered scenes. - (5)
CEBNet (Chen et al. 2019) directly enhances low-level feature layers to gather expanded contextual cues, particularly for small objects, through the use of Expansion Receptive Field Block (ERFB). This block captures multi-scale context within a single, efficient structure, making it especially useful for dense and cluttered scenes. Additionally, Feature Attention Block (FAB) ensures feature consistency across scales, enhancing object detection accuracy in diverse contexts by re-weighting features based on channel attention, thus fine-tuning the balance between detailed localization and broad contextual awareness. - (6)
CP-SSD (Jiang et al. 2019) captures local and multi-scale contextual information by using dilated convolutions with different rates, allowing for robust context recognition across scales. This method is particularly valuable for scenes with objects that are either small, overlapping, or close to background colors. Additionally, the Semantic Activation Module complements this by learning the interdependence between channels in a self-supervised manner, thus focusing on semantically rich features and strengthening object-background differentiation. - (7)
Region-Dependent Scale-Proposal (RDSP) network (Akita and Ukita 2023) enhances object detection, especially for small objects, by estimating optimal scale factors based on contextual information derived from the scene structure. The integration of scene context via positional embedding and scene structure embedding allows for a more robust detection process, particularly in challenging conditions with significant object size variations and distance from the camera. - (8)
YOLOC’s MCTX module (Oreski 2023) adds a multi-task component to the YOLO architecture, enabling it to classify both objects and environmental context in one pass. By evaluating images across multiple scales, YOLOC can interpret spatial and global context cues, such as weather and time of day, directly influencing detection tasks and enhancing accuracy in scenes where these factors significantly alter object appearance and behavior. This explicit modeling of context allows YOLOC to handle complex traffic environments, making it especially valuable in applications like autonomous driving where situational awareness is critical. - (9)
CB-FPN (Liu and Cheng 2023) emphasizes capturing spatial and scale-based contextual information. Context Enhancement Module with CSPNet (CEM-CSP) captures diverse receptive fields to retain rich context, effectively bridging scale differences in multi-layer feature fusion. Bidirectional Efficient Feature Pyramid Network (BE-FPN) further enhances context by facilitating bidirectional fusion, ensuring efficient feature propagation across all feature layers, which helps improve object detection by maintaining spatial relationships and context cues across scales. This setup benefits particularly in handling occlusions or objects at multiple scales. - (10)
FA-FPN-MCI (Bhalla et al. 2024) leverages multiscale context by using a Twin-Branch Global Context Module (TBGCM) to fuse information across scales and capture global and local features. Style Normalization and Restitution (SNR) module supports domain generalization, aligning feature distributions for consistent performance across varied underwater conditions, while Receptive Field Blocks (RFBs) and deformable convolutions enhance the model’s adaptability to varying object scales and complex, cluttered backgrounds in underwater imagery. This approach provides a holistic representation, essential for robust object detection in challenging underwater environments such as color distortion, light attenuation, and complex backgrounds. - (11)
Multi-Scale Context-Aware Feature Pyramid Network (MCFPN) (Wang et al. 2022) improves the baseline performance of existing mainstream detectors as an alternative to FPN. This detector has three blocks: The Dilated Residual Block (DRB), Cross-scale Context Aggregation Block (CCAB), and Adaptive Context Aggregation Block (ACAB). DRB mitigates context loss by incorporating context at the topmost level and layering residual blocks with varying dilation rates to produce a more accurate representation. By fusing context information from adjacent levels in an efficient manner, CCAB enables interactive fusion to suppress noise and improve features. By calculating channel and spatial weights, ACAB bridges semantic gaps and utilizes Spatial-guided Aggregation Block (SAB) and Channel-guided Aggregation Block (CAB) to construct a balanced global context. MCFPN adaptively incorporates spatial and channel-specific context features, enhancing object detection accuracy across a range of visual tasks and object scales. - (12)
Few Relevant Neighbors (FNM) (Barnea and Ben-Shahar 2019) focuses on local spatial context between objects (higher-order relations) by tackling the challenge of learning when objects collide, particularly during context-detector conflicts. A belief propagation mechanism has been utilized to integrate spatial relations between objects. This mechanism is used to calculate context-based probabilities for objects, dynamically selecting the most informative collection of context variables for each location. Furthermore, for utilizing scale context, it uses scale-invariant representations, which reduce the requirement for varied instances at multiple sizes and simplify training. FNM not only has the greatest AP among approaches in multi-scale category but also outperforms all other general object detection methods on the COCO dataset. - (13)
MSF (Wang et al. 2018) leverages multi-scale fusion to enhance context awareness, synthesizing information across multiple resolutions and spatial regions to improve object detection accuracy. By combining context from surrounding regions and incorporating it directly into object detection layers, MSF effectively addresses scenarios with small or partially occluded objects, enhancing both precision and recall in cluttered or complex scenes. - (14)
Efficient Selective Context Network (ESCNet) (Nie et al. 2020) utilizes multi-scale context by enhancing feature pyramids and implementing selective attention to refine critical details. The ECM enriches shallow layers with multi-scale information, which is essential for accurate small object detection. The TAM’s selective attention further leverages context by filtering and amplifying relevant features, enhancing the network’s ability to focus on critical parts of the image while minimizing noise. Through these components, ESCNet successfully integrates spatial, channel, and global contexts, resulting in improved localization and classification, particularly in scenes with complex backgrounds and small, detailed objects. - (15)
Pyramid context learning (PCL) (Ding et al. 2020) employs a structured multi-level context extraction process where aggregation operator collects features at various spatial scales, and distribution operator adaptively weights these features based on their contextual importance. This ensures that both global and local context is harnessed, allowing the model to detect complex object arrangements by providing context-aware features across scales. Channel context learning further contributes by refining feature maps through capturing correlations among channels, enhancing the model’s ability to focus on critical aspects of the object, which improves accuracy in diverse and cluttered environments. - (16)
The UGC-YOLO (Yang et al. 2023) enhances underwater object detection by integrating global context information with the YOLOv3 architecture. It employs deformable convolution to adaptively capture features of various aquatic organisms and differentiates between overlapping objects or those that blend into the background, while also utilizing a pyramid pooling module to aggregate semantic information at different scales.
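Several of the modules above (e.g., CP-SSD’s dilated branches and MCFPN’s DRB) share one building block: parallel dilated convolutions whose rates set the receptive field. A minimal PyTorch sketch of that generic pattern follows; the channel count and rates are illustrative assumptions, not values from any of the reviewed papers:

```python
import torch
import torch.nn as nn

class DilatedContext(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates; each branch
    sees a different receptive field, and the branches are summed so the
    output mixes local and wider context at the same resolution."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3,
                       padding=r, dilation=r)  # padding=r keeps H x W fixed
             for r in rates]
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

ctx = DilatedContext(channels=8)
out = ctx(torch.randn(1, 8, 32, 32))  # shape preserved: (1, 8, 32, 32)
```

With rates (1, 2, 4), the three branches cover 3×3, 5×5, and 9×9 effective windows, which is how a single block mixes local and wider context without changing resolution.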
To provide a clearer evaluation, the following comparison highlights the key strengths and limitations of these multi-scale approaches, focusing on their adaptability, efficiency, and task-specific applications.
Adaptability to context variability: Dilated and Deformable FPN (DDFPN) and CL-FPN excel in capturing diverse spatial and semantic features with dilated convolutions and attention modules, benefiting complex scenes with varying object sizes. However, they come with increased memory and computational requirements, which limit their real-time usability. By comparison, CEBNet and CP-SSD offer a lightweight solution with enhanced low-level features, though they may not be as effective for very small objects or in densely packed scenes, where more complex methods like MCFPN and Global Context Aware (GCA) RCNN perform better.
Computational efficiency: GCE and ESCNet integrate context effectively without heavy computational demands, ideal for applications needing fast inference. GCE’s efficiency drops at high IoU thresholds, while ESCNet still faces background confusion in cluttered scenes. CP-SSD and MSF balance speed and context retention, with CP-SSD’s complexity making it robust for object differentiation and MSF’s simpler fusion benefiting smaller objects while limiting its reach for larger objects.
Task-specific strengths: UGC-YOLO and FA-FPN-MCI, designed for underwater and domain-generalized detection, capture fine details and handle specific environmental challenges. Similarly, CEBNet and YOLOC cater to urban scenes, with CEBNet enhancing small object detection in complex backgrounds and YOLOC modeling context for traffic-specific scenarios, though large-scale improvements are minimal.
In summary, while models like DDFPN and CL-FPN stand out for their contextual richness, their efficiency may hinder real-time application. Streamlined architectures such as FNM and CP-SSD offer better real-time performance but may lack nuanced context handling needed for certain challenging scenarios. Multi-scale approaches continue to demonstrate clear advancements in integrating context across scales, underscoring a progression toward models that can balance efficiency with increasingly complex contextual needs.
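Many of the methods compared above (CEBNet’s FAB, CP-SSD’s Semantic Activation Module, PCL’s channel context learning) rely on the same primitive: re-weighting feature channels with a learned attention vector. A squeeze-and-excitation-style sketch of that primitive, with illustrative sizes rather than values from any reviewed paper:

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Squeeze-and-excitation-style channel attention: global-average-pool
    each channel into a descriptor, pass it through a small bottleneck MLP,
    and re-weight the feature map with the resulting sigmoid gates."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(x.mean(dim=(2, 3)))  # (b, c) per-channel weights in (0, 1)
        return x * gates.view(b, c, 1, 1)    # broadcast re-weighting

attn = ChannelReweight(channels=16)
feats = torch.randn(2, 16, 8, 8)
out = attn(feats)  # same shape, channels scaled by the learned gates
```

Because the gates lie in (0, 1), the block can only attenuate channels; the surrounding network learns which channels to keep near full strength.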
4.2.5 RPN-based approaches
The Region Proposal Network (RPN) (Ren et al. 2016) suggests regions likely to contain objects, allowing for more precise object detection in later stages. Figure 37 illustrates a typical RPN pipeline, where the network generates several region proposals based on feature maps extracted from CNN layers. These proposals are then refined through processes like RoI pooling and classification, ultimately identifying the regions most likely to contain objects. RPN-based context approaches incorporate contextual information into the RPN to enhance the accuracy and relevance of these proposals.
Fig. 37
Using RPN to propose regions containing objects
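As a concrete illustration of the pipeline in Fig. 37, the proposal stage can be sketched as a small head over the backbone feature map: a shared convolution feeds two sibling 1×1 convolutions that emit, at every spatial location, an objectness score and four box-regression deltas per anchor. The channel count and anchor count below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head: a shared 3x3 convolution over the backbone feature
    map, then per-anchor objectness scores and box deltas at every location."""
    def __init__(self, in_channels: int, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feat):
        t = torch.relu(self.conv(feat))
        return self.objectness(t), self.bbox_deltas(t)

head = RPNHead(in_channels=32)
scores, deltas = head(torch.randn(1, 32, 20, 20))
# scores: (1, 9, 20, 20), deltas: (1, 36, 20, 20)
```

In a full detector, the scores rank anchors, the deltas refine them into proposals, and non-maximum suppression keeps the top candidates for the second stage.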
In this section, we review four RPN-based context approaches.
- (1)
The cascade region proposal network (Zhong et al. 2020) benefits from context by modeling both local and global context. This helps in refining region proposals based on surrounding features, leading to more accurate object detection. The model’s global context branch also leverages entire image features, which supports object classification by considering broader contextual cues beyond just the object’s immediate area.
- (2)
Learning-based context refinement (Chen et al. 2018) enriches region proposals by iteratively integrating visual, spatial, and semantic context from neighboring regions. This comprehensive use of context improves both proposal localization and classification by ensuring that each object’s context is thoroughly examined, making it effective for scenes with dense object interactions. The adaptive weighting strategy prioritizes more relevant context, reducing the impact of irrelevant surroundings, thereby enhancing the model’s ability to discriminate between objects and their backgrounds. This approach has achieved superior results compared to other RPN-based models in testing on the COCO dataset.
- (3)
The SC-Faster R-CNN (Xiao et al. 2020) model leverages contextual information by embedding a feature extraction module for context after the conv5_3 layer. This module, combined with skip pooling, helps the network better understand the spatial and semantic context around objects. As a result, it can differentiate between similar-looking objects and the background, especially in cluttered or complex scenes, making it effective for detecting partially hidden objects and recognizing objects based on surrounding cues. SC-Faster R-CNN surpasses the proposal-based approach of Kaya and Alatan (2018) in detecting small objects.
- (4)
The model (Kaya and Alatan 2018) leverages context by adding a dedicated context feature extractor stage, which operates in parallel to the object feature extraction layers in Faster R-CNN. By pooling features from a surrounding “context ring” around the object and merging them with object features via the wrap-around operation, the model captures spatially consistent context information. This spatially aware combination allows the model to discern object boundaries more effectively, especially in cases where context is a key factor in recognizing the extent or class of an object.
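The "context ring" in (Kaya and Alatan 2018) is pooled from an enlarged copy of each proposal box. The sketch below computes only that enlarged box; the function name and the 1.8 enlargement factor are illustrative assumptions, not values from the paper:

```python
import torch

def context_ring_box(box: torch.Tensor, scale: float = 1.8) -> torch.Tensor:
    """Return the enlarged box (x1, y1, x2, y2) around an object box; the
    context ring is the enlarged region minus the object region itself."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2  # box centre stays fixed
    half_w, half_h = (x2 - x1) * scale / 2, (y2 - y1) * scale / 2
    return torch.tensor([cx - half_w, cy - half_h, cx + half_w, cy + half_h])

ring = context_ring_box(torch.tensor([10.0, 10.0, 30.0, 30.0]))
# a 20x20 box grows to 36x36 around the same centre: [2., 2., 38., 38.]
```

Features pooled from this enlarged region, merged with the object features via the wrap-around operation, supply the spatially consistent context the model uses for boundary decisions.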
To provide a more analytical perspective, we compare these approaches to highlight their respective advantages, limitations, and contextual integration strategies. The cascade RPN model emphasizes global context modeling to improve proposal quality without significant computational overhead, making it efficient but limited in handling small objects. Conversely, the learning-based context refinement approach excels in crowded scenes by iteratively refining spatial and semantic context, enhancing boundary precision. However, it introduces additional computational costs and relies heavily on initial proposal quality, making it less effective in sparse contexts. The SC-Faster R-CNN builds upon these advancements by incorporating skip pooling and guided anchors to improve small and occluded object detection, although it faces challenges with camouflaged or heavily deformed objects. Meanwhile, the approach of Kaya and Alatan (2018) offers enhanced boundary determination by preserving spatial relationships between object and context features, proving effective for context-rich classes but adding model complexity and showing limitations in heavily occluded settings. Overall, while each model improves contextual feature extraction, they vary in computational efficiency and suitability for specific object detection challenges, illustrating a progressive refinement of contextual integration tailored to distinct scene complexities and object types.
4.2.6 Attention-based approaches
Attention-based approaches leverage mechanisms that focus on relevant features while suppressing irrelevant background information, improving detection performance by emphasizing the most critical context. Figure 38 illustrates this effect: in (a), “Without Attention,” the model assigns equal importance to all surrounding regions, leading to a misclassification where a chair is incorrectly identified as a sofa; in (b), “With Attention,” the model applies attention weights, prioritizing relevant contextual features, which enables it to correctly identify the object as a chair. This highlights how attention mechanisms improve object detection by helping models concentrate on the most informative parts of the scene, thus enhancing accuracy.
Fig. 38
Impact of attention mechanisms on object detection accuracy (Lan et al. 2022)
Attention mechanisms have been employed in several approaches, in the form of different modules, to integrate contextual information. Fourteen papers are reviewed in this category.
- (1)
Pure regression object detection (Fan et al. 2022) utilizes context by embedding a lightweight Contextual Attention Block (CAB) that enhances the representation of regression points through global contextual information. The CAB allows the model to capture extensive context around objects, addressing the limitations of traditional bounding box-based representations by considering both the spatial extent of objects and their surrounding context. This integration of context information leads to improved localization and classification of objects, particularly in challenging environments where context plays a key role in distinguishing similar objects.
- (2)
The sparse attention block (SA) (Chen et al. 2022) selectively focuses on high-response areas, such as object edges, to capture meaningful spatial context in long-range dependencies. By aggregating information from only the most relevant positions, SA minimizes the computational and memory burden associated with dense pixel interactions. This focused attention reduces background noise and redundancy, ensuring that the included context supports accurate object detection without overwhelming resources, especially in scenarios where distinguishing objects from their surroundings requires effective long-range dependency modeling.
- (3)
The Global Context (GC) module (Lee et al. 2021) leverages self-attention to enhance object detection by blending global and local context in the feature maps. By computing relationships among all spatial elements, it allows each element to incorporate information from distant regions, aiding in the identification of ambiguous objects based on their spatial and semantic relationships. This tailored integration of self-attention supports object distinction in complex scenes.
- (4)
CSA-Net (Liang et al. 2022) outperforms all other attention-based techniques on the COCO dataset. CSA-Net leverages context through a self-attention mechanism in ResNet-SA (He et al. 2016), which emphasizes object regions while downplaying background noise. The RFFE module further enhances context utilization by integrating global and local features across multiple receptive fields, providing richer information for detecting both small and large objects. The SFFP network complements this by enabling multi-scale feature fusion, ensuring that features from different layers contribute to accurate object recognition. Together, these components allow CSA-Net to utilize spatial context effectively, particularly in scenes with varied object sizes and complex backgrounds.
- (5)
The Global Contextual Dependency Network (GCDN) (Li et al. 2022) improves two-stage object detectors by enhancing global contextual information. It combines global and local information to strengthen Region of Interest (RoI) feature representation. The Context Representation Module (CRM) provides multi-scale context, capturing relationships across scales, while the Context Dependency Modules (CDMs) use attention to refine these representations. This integration of spatial and global context improves object classification, especially for objects in cluttered backgrounds or with minimal visual features.
- (6)
The Cross-context Attention-guided Network (CCAGNet) (Miao et al. 2022) integrates three attention mechanisms to emphasize object-synergy regions and suppress non-object-synergy regions: the Cross-context Attention Mechanism (CCAM), the Semantic Fusion Attention Mechanism (SFAM), and the Receptive Field Attention Mechanism (RFAM). The CCAM assists the model in focusing on significant parts of an image by highlighting relevant regions and dismissing less important ones. The SFAM optimizes upsampling by emphasizing valuable information and minimizing noise in feature representation. The RFAM increases the model’s awareness of the context of an image, helping it to better comprehend the relations between distinct features. This network reduces false positives in cluttered or ambiguous environments.
- (7)
The Context-based Feature Fusion Network (CFFN) (Xia et al. 2020) overcomes performance restrictions caused by a lack of global context and the presence of background noise in low-level features. Two modules, the Context Extraction Module (CEM) and the Context Refinement Module (CRM), are proposed to augment the network’s capacity to acquire abstract global context information and to boost the discriminative capability of low-level feature representations through an attention mechanism.
- (8)
AMCM+YOLOv4 (Ma et al. 2022) enhances the understanding of spatial relationships and dependencies within occluded and multi-scale objects. In this framework, YOLOv4 is enhanced with an attention-guided multi-scale context information module (AMCM) to improve the perception of object context information. The approach begins by performing feature extraction via the CSPDarkNet53 network. Subsequently, it employs the AMCM to enhance feature discrimination by applying attention weighting.
- (9)
The Lightened Context Extraction Module (LCEM) (Jiaxuan et al. 2022) employs superimposed dilated convolutions inside the Feature Pyramid Network (FPN) to provide efficient feature fusion across various scales. In addition, this module relies on the attention mechanism of the Attention-guided Context Feature Pyramid Network (ACFP), specifically emphasizing the enhancement of feature-map integration via dilated convolutions. It is well-suited for real-time applications that require context-rich object detection.
- (10)
CIE-JHR’s combination of attention and convolution layers (Zhao et al. 2022) enhances context extraction, addressing challenges in identifying small objects by incorporating global–local spatial relationships. This enriched feature representation supports applications where accurate object detection amidst clutter or obscured backgrounds is critical, as in the detection of transmission line components.
- (11)
The improved YOLOv8 (Saha et al. 2024) leverages contextual cues from its dataset, applying attention mechanisms and the C2f module to adapt detection to localized license plate characteristics. These mechanisms help manage the model’s focus, enabling it to distinguish among different vehicle types and environmental variations specific to Bangladesh, such as varying license plate colors and formats in diverse lighting conditions. This integration strengthens the model’s adaptability and precision in Automatic License Plate Recognition (ALPR) systems.
- (12)
YOLOv8-CGRNet (Niu et al. 2023) is a lightweight object detection network optimized for mobile devices. By integrating YOLOv8 with Context GuidedNet (CGNet) and Res2Net, it enhances feature learning while maintaining low computational complexity. The model effectively captures local features and spatial dependencies, improving contextual understanding. Additionally, it employs a pyramid network and a dynamic focusing mechanism to handle low-quality examples. This is valuable for scenarios with complex backgrounds or object variances, where accurate context integration can significantly improve detection robustness.
- (13)
The approach presented in (Ma and Wang 2023) is a multi-scale model that focuses on fusing high-resolution spatial features with deep semantic features. To extract global context, a cross-pooling module captures long-range dependencies across the image, which enhances the model’s ability to detect objects within complex backgrounds or scenes. Local context is extracted via a cascaded deformable context module that integrates spatial information from surrounding regions, assisting in detecting objects amidst clutter and varied scales.
- (14)
The Fine-Grained Dual Level Attention Mechanism joint Spatial Context Information Fusion (FGDLAM and SCIF) (Deng et al. 2024) enhances object detection by refining traditional attention mechanisms, subdividing the feature space into multiple subspaces for precise channel weight extraction while employing a dual-weight strategy to capture relationships both within channels and across spatial regions. Additionally, it integrates a context information extraction module to leverage local and global contextual information, significantly improving object recognition and localization.
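A primitive shared by several of these modules (the GC module’s self-attention, ResNet-SA, CCAM) is spatial self-attention, where every location attends to every other. The following is a generic non-local-style sketch of that primitive, not a reimplementation of any reviewed method; channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Non-local-style self-attention over spatial positions: every location
    attends to every other, so distant context can inform local features."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (b, hw, c)
        k = self.key(x).flatten(2)                      # (b, c, hw)
        v = self.value(x).flatten(2).transpose(1, 2)    # (b, hw, c)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (b, hw, hw)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                  # residual connection

sa = SelfAttention2d(channels=8)
out = sa(torch.randn(1, 8, 10, 10))
```

The hw × hw attention map is what makes such blocks expensive, which is why methods like the sparse attention block restrict it to high-response positions.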
A more analytical comparison of these attention-based approaches highlights clear trade-offs between accuracy, computational efficiency, and context utilization. Lightweight models like Pure Regression Object Detection and the Sparse Attention Block (SA) are efficient and suitable for real-time applications, focusing on key spatial areas to reduce noise. However, their selective use of spatial context risks losing detail in cluttered scenes, where more comprehensive context may be necessary. Global Context (GC) and the Global Contextual Dependency Network (GCDN) improve small and occluded object detection by incorporating extensive global context, though at the cost of increased memory and processing demands, limiting their real-time usability. Balanced models like CSA-Net and CFFN blend local and global context, offering robust multi-scale detection. However, their fusion mechanisms increase complexity, making them less ideal for fast applications. Meanwhile, FGDLAM and SCIF enhances spatial-channel correlations through fine-grained attention, capturing nuanced context across scales, though it adds computational overhead and potential isolation issues that require added processing. In summary, while each approach enhances context-driven detection, they differ in balancing accuracy, speed, and adaptability. These trade-offs reflect a progression in the field, with each method advancing tailored solutions for specific object detection challenges.
4.2.7 Other approaches
In the “Other Approaches” category, each model enhances object detection by leveraging a unique form of context, with distinct strengths and trade-offs.
- (1)
The Context Learning Network (CLN) (Leng et al. 2018) captures the pairwise relations between objects and the global context. The network is divided into two subnetworks: a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (ConvNet). The MLP is first used to capture pairwise relations. The ConvNet then gathers and concatenates pairwise relations to learn more about the global context.
- (2)
The one-stage Diverse Receptive Field Network (DRFNet) (Xie et al. 2020) utilizes multi-branch diverse receptive field (DRF) modules and a parallel framework to collect contextual information at various scales.
- (3)
EGCI-Net (Guo et al. 2020) integrates enhanced global context information through global activation blocks into its backbone. This integration aims to decrease the reliance on local information and increase the global context to tackle the constraints identified in DSOD (Shen et al. 2017). Moreover, a pyramid feature pool module produces multi-scale global context features, guiding the detection process.
- (4)
In the Recursive Context Routing mechanism (ReCoR) (Chen et al. 2021), spatial modeling captures spatial relationships over longer distances via a recursive structure, while channel-wise modeling encodes relationships between features for a better understanding of context. The technique dynamically models contexts by integrating spatial relations with channel-wise representations of local items.
- (5)
MCENet (Wang and Ma 2022) employs rectangle pooling kernels (RPU) to extract and use image-level long-range relationships. To capture multi-scale contextual information, dilated convolutions with varying dilation rates are used in addition to image-level context. Finally, the RPU and dilated convolutions (DC) are merged into a context enhancement module (CEM), which can be used to increase detection accuracy in different models.
- (6)
The Boundary Aware Network (BAN) (Kim et al. 2018) emphasizes the importance of boundary context, which is a subset of spatial context. BAN defines three types of boundary contexts (side, vertex, in/out-boundary) and employs 10 sub-networks to represent the relationship between these contexts. The detection head of BAN is an ensemble of these sub-networks, selectively focusing on different contributions based on the detection sub-problem to improve object detection accuracy.
- (7)
CompositionalNet (Wang et al. 2020) dissects the image representation into context and object components during training, which is accomplished using context segmentation. This network efficiently manages the impact of context, boosting robustness in identifying heavily occluded objects.
- (8)
The Layout Transfer Network (LTN) (Wang et al. 2019) uses a retrieve-and-transform technique to forecast likely object positions and sizes. This technique integrates both bottom-up and top-down visual processing into Faster RCNN for combined reasoning of object detection and scene layout estimation.
- (9)
The Edge-aware Context-aggregation Network (ECNet) (Xiao et al. 2021) detects transparent and reflective objects such as glass products. The ED module extracts boundary features from images, serving as a global attention mechanism. The DFF module extracts context discontinuity from input depth maps, which eases texture feature extraction by fusing into the RGB backbone at each level. Finally, the MFE module utilizes multi-receptive-field features to expose discontinuities in texture.
- (10)
Off-The-Shelf (Bardool et al. 2019) uses an off-the-shelf Mask R-CNN to generate feature maps and detect objects. In the next step, a Fully Convolutional Network (FCN) (Shelhamer et al. 2017) is employed as a contextual learner to comprehend semantic relationships between objects using contextual feature maps.
- (11)
In the POD-F and POD-Y methods (Ma et al. 2023), a learnable Gabor convolution layer and a Spatial Attention (SA) mechanism are applied to low-level features to collect edge and contour information while improving spatial relationships. For high-level features, a Global Context Feature Extraction (GCFE) module extracts multi-scale global contextual information, and a Dual Scale Feature Aggregation (DSFA) module fuses features from different scales.
- (12)
A trainable spatial context feature extractor (SCFE) (Wang et al. 2018), inspired by recurrent neural networks (RNNs), was presented for fast object detection by augmenting convolutional neural networks. The SCFE, as opposed to conventional CNN-based approaches that prioritize local geometric and texture features, extracts spatial context information directly from the scene.
- (13)
To address the lack of contextual information surrounding R-CNN object proposals, an object detection system (Chu and Cai 2018) was introduced that integrates local appearance and contextual information. It makes use of a fully connected CRF formed over object proposals, with contextual constraints incorporated as edges. The system incorporates both local interactions between objects and global scene information, employing a logistic regression model to comprehend the scene.
- (14)
The approach in (Zeng et al. 2021) addresses two challenges: local optima and object ambiguity. The proposed method utilizes Multiple Instance Learning (MIL) with self-training to generate pseudo-labels and enhance object localization throughout training. To surpass local optima, a context awareness block is incorporated to focus on background and context. The approach also employs a spatial pyramid pooling (SPP) network layer to boost generalization and detection performance.
- (15)
The paper (Gu et al. 2022) addresses WSOD with image-level annotations, highlighting the frequent difficulty of localizing discriminative parts rather than the full object. To improve object localization accuracy, the proposed technique employs a Symmetry Context Module (SCM) and context proposal mining strategies. The SCM uses contextual information from precomputed region proposals to encourage the model to prioritize proposals that include the whole object. Furthermore, to capture context appearance in different surrounding spatial areas, two context proposal mining strategies, naive and Gaussian-based, are implemented.
- (16)
The Transformer-based Context Condensation (TCC) (Chen et al. 2023) improves multi-level feature fusion (MFF) in feature pyramids. It decomposes context information into locally concentrated and globally summarized representations, then uses a Transformer decoder to refine MFF results by exploring relationships between local features and condensed contexts. This method boosts detection accuracy while reducing computational costs.
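As one concrete mechanism from this group, MCENet’s rectangle pooling can be approximated by strip pooling: averaging over full rows and full columns and broadcasting the strips back, so each position receives long-range horizontal and vertical context. This is a simplified sketch, not the paper’s exact CEM:

```python
import torch

def strip_pool(x: torch.Tensor) -> torch.Tensor:
    """Rectangle (strip) pooling sketch: average over full rows and full
    columns, then broadcast-add the two strips so each location mixes
    long-range horizontal and vertical context."""
    row = x.mean(dim=3, keepdim=True)  # (b, c, h, 1): pooled across width
    col = x.mean(dim=2, keepdim=True)  # (b, c, 1, w): pooled across height
    return x + row + col               # broadcast over the full map

out = strip_pool(torch.randn(2, 4, 6, 6))  # shape preserved
```

Unlike square pooling windows, the two strips reach across the entire image in one step, which is why rectangle kernels suit image-level long-range relationships.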
Other approaches offer diverse strategies to integrate context for object detection, each with unique advantages and challenges. Models like CLN, EGCI-Net, and ReCoR improve detection by emphasizing global context, which enhances accuracy in complex scenes but can increase computational load and reduce adaptability in simpler applications. DRFNet and MCENet focus on multi-scale context extraction, excelling in size-variant object detection but often at the cost of processing speed due to their added structures. BAN and CompositionalNet prioritize boundary and compositional contexts, aiding accuracy for small or occluded objects, though they may miss finer contextual details in cluttered scenes. LTN and ECNet handle specific detection needs, like scene layout forecasting and transparent objects, but can struggle with unpredictable or low-contrast environments. POD-F, POD-Y, and TCC improve object clarity in complex scenes with attention and multi-scale context fusion, though these techniques add computational demands. Lastly, SCFE, CRF-based Object Detection, and Adaptive MIL enhance spatial-semantic coherence for small object detection, albeit with limitations in real-time applications. In sum, these methods advance object detection by addressing unique context challenges, balancing trade-offs between efficiency, accuracy, and adaptability across detection tasks.
4.2.8 Results on general object detection
Based on Fig. 31, approaches highlighted in red have been specifically designed to improve two-stage object detectors. The most effective two-stage approach tested on the COCO dataset is GCA RCNN (Zhang et al. 2021), which mitigates the loss of contextual information during the resizing of object proposals by integrating a context-aware mechanism that extracts and refines features using a global feature pyramid and attention techniques, respectively. By preserving spatial relationships between the foreground and background, GCA RCNN enhances detection performance across varied object scales and positions. Its context-aware mechanism refines object features with high accuracy, making it effective in COCO’s complex scenes. Furthermore, YOLOv4-AMCM demonstrated superior performance on the PASCAL VOC07 dataset by incorporating an attention-based module (AMCM), which emphasizes feature discrimination by applying context-sensitive attention weighting. AMCM allows the model to focus on the most relevant spatial relationships, making YOLOv4-AMCM robust in scenarios with complex object overlap. By improving the perception of object context, YOLOv4-AMCM surpasses other models on the VOC07 dataset by enhancing detection accuracy in cluttered and occlusion-heavy environments. In Fig. 31, approaches highlighted in blue have been designed solely for one-stage models. CCAGNet (Miao et al. 2022), which integrates contextual information using three attention mechanisms, including CCAM, SFAM, and RFAM, is the top network in the one-stage category tested on the PASCAL VOC07 dataset. It effectively focuses on relevant regions while ignoring less important areas. CCAM enhances focus on object-synergy areas, while SFAM reduces noise in feature representation by emphasizing valuable features during upsampling. RFAM further improves the model’s contextual awareness by understanding relationships between distinct features.
Together, these attention mechanisms help CCAGNet reduce false positives, making it one of the best one-stage detectors for VOC07 by leveraging attention-guided context to avoid distractions in dense scenes. Moreover, this network surpasses other one-stage methods in detecting small objects. FGDLAM and SCIF is another one-stage model that surpasses other one-stage models tested on the COCO dataset in terms of mAP, owing to a dual-level attention mechanism that divides the feature space into subspaces for fine-grained weight extraction and captures relationships within and across spatial regions. This dual-weight strategy enables the model to capture both local and global context effectively, which is crucial in COCO’s diverse and high-density scenes (Table 4).
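The context-sensitive attention weighting that CCAGNet-style modules rely on can be illustrated with a minimal squeeze-and-excitation-style sketch. This is a generic illustration, not any of the cited authors' implementations; NumPy is used, and `w1`/`w2` stand in for learned projection matrices:

```python
import numpy as np

def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Illustrative channel attention: global average pooling yields a
    per-channel context descriptor, a two-layer bottleneck MLP turns it into
    gating weights, and the feature map is rescaled channel-wise.
    feat: (C, H, W); w1: (C_mid, C); w2: (C, C_mid) -- hypothetical learned weights."""
    squeeze = feat.mean(axis=(1, 2))             # (C,) global context per channel
    hidden = np.maximum(w1 @ squeeze, 0.0)       # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gates in (0, 1)
    return feat * gate[:, None, None]            # reweight channels by context

# Toy example with zero-initialized weights: every gate is sigmoid(0) = 0.5.
feat = np.ones((4, 2, 2))
out = channel_attention(feat, np.zeros((2, 4)), np.zeros((4, 2)))
```

In a trained detector, the gates learn to amplify channels responding to contextually relevant regions and to suppress the rest, which is the effect the attention-weighting discussion above describes.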
Table 4 Best context-based GOD methods
Throughout the paper, bolded models in the tables denote the highest mAP achieved, emphasizing the best performance among all compared methods on a given dataset.
Based on Fig. 40, ResNet and VGG16 backbones, Faster RCNN and RetinaNet architectures, and spatial, scale, and semantic context types are the most commonly used in general object detection. Faster R-CNN, Cascade Mask RCNN, SSD, and YOLOv4 architectures have been used in the aforementioned top-performing networks with the highest mAP.
As shown in Tables 5 and 6, on the COCO dataset, FNM has the highest AP, FGDLAM and SCIF ranks second, and Feature Refinement ranks third. FNM excels on the COCO dataset due to its belief propagation mechanism, which manages spatial relations between objects in a dynamic and context-sensitive manner. This mechanism calculates context-based probabilities, dynamically selecting relevant context variables for each location. The use of scale-invariant representations is another key module that reduces the need for multi-scale instances, simplifying the training process while retaining accuracy across various object scales. These innovations allow FNM to surpass other methods by effectively leveraging both spatial and scale context in highly variable scenes. Feature Refinement surpasses other models on VOC07 and for medium-to-large objects (APm and APl) due to its cross-pooling and cascaded deformable context modules. The cross-pooling module is specifically designed to capture long-range dependencies, enabling the model to leverage global context efficiently. The deformable context module further refines local context by incorporating spatial details from surrounding regions, making it adept at identifying medium-to-large objects even when they are partially obscured or surrounded by clutter. These context-aware modules allow Feature Refinement to outperform others by precisely balancing global and local context in dense or visually noisy environments. GCE has the highest accuracy for recognizing small objects (APs), a challenging task due to their tendency to be overlooked in cluttered scenes. Its dual-path approach leverages top-down, image-level classifications combined with regional features, which enables the model to utilize high-level contextual cues to locate small objects accurately.
This global context encoding enriches the model’s understanding of the full scene, allowing it to capture subtle semantic relationships and recognize small objects even when surrounded by larger, more prominent objects. Among all methods evaluated on the PASCAL VOC 2007 and 2012 datasets, Cascade Region Proposal achieves the highest accuracy on PASCAL 2012 because of its multi-stage refinement process, which integrates both local and global context. The model’s global context branch extracts whole-image features, enabling it to refine region proposals based on surrounding context and leading to more accurate localization. This broader view beyond the immediate object vicinity is particularly advantageous in VOC12, where understanding the scene context as a whole supports better classification. The progressive refinement steps in Cascade Region Proposal ensure precise object boundaries and improve detection accuracy, especially in scenes where objects might otherwise be misclassified due to similar features. Conversely, for PASCAL 2007, Feature Refinement emerges as the best-performing model. The performance of other methods on different datasets is reported in Table 7. The best-performing methods for different tasks are shown in Table 4.
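A common denominator of GCE's global context encoding and Cascade Region Proposal's global context branch is fusing a whole-image context vector with per-region features before classification. The following is a minimal sketch of that fusion pattern (NumPy; shapes and names are illustrative, not taken from either paper):

```python
import numpy as np

def fuse_global_context(roi_feats: np.ndarray, image_feat: np.ndarray) -> np.ndarray:
    """Concatenate an image-level context vector onto each RoI feature.
    roi_feats: (N, D_roi) per-region features; image_feat: (C, H, W) backbone map."""
    global_vec = image_feat.mean(axis=(1, 2))                      # (C,) whole-image context
    tiled = np.broadcast_to(global_vec, (roi_feats.shape[0], global_vec.size))
    return np.concatenate([roi_feats, tiled], axis=1)              # (N, D_roi + C)

# Toy example: 3 RoIs with 5-dim features, a 7-channel image feature map.
fused = fuse_global_context(np.zeros((3, 5)), np.ones((7, 4, 4)))
```

The classifier then sees each region together with a summary of the whole scene, which is what lets scene-level cues disambiguate visually similar objects.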
Table 5 General object detection (COCO dataset)
Table 6 General object detection (PASCAL VOC 2007 and 2012)
Table 7 General object detection (other datasets)
In summary, the combination of context-aware modules across two-stage and one-stage detectors, especially through the use of attention and multi-scale approaches, consistently boosts object detection performance across different datasets. Approaches like GCA-RCNN and FGDLAM show how integrating global and local context improves accuracy across object sizes, while models like Feature Refinement excel in domain-specific tasks. This highlights the critical role context plays in advancing the field of object detection.
Table 8 comprehensively outlines the challenges addressed by general object detection approaches, encompassing issues such as scale variations, object occlusion, background complexity, and environmental factors. These challenges highlight how context-based models are capable of addressing obstacles that context-free approaches often struggle to overcome, showcasing the advantages of integrating contextual information for improved object detection performance.
Table 8 Addressed challenges in general object detection
4.3 Small object detection (SOD)
Small object detection focuses specifically on detecting and localizing objects that are small in size. Small objects occupy an area of at most \(32\times 32\) pixels (Lin et al. 2014). As illustrated in Fig. 39, the scarcity of pixels belonging to small objects within an image hinders the network’s capacity to extract significant features from them (Tong et al. 2020). Contextual information provides additional details about objects and their surroundings, aiding the network in detecting small objects more effectively. In this section, context-based small object detection approaches are reviewed (Fig. 40).
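The \(32\times 32\)-pixel threshold follows the COCO convention, which also distinguishes medium and large objects. A small helper makes the categories concrete (the \(96\times 96\) medium/large cutoff is the standard COCO value, stated here for completeness rather than taken from this section):

```python
def coco_size_category(area: float) -> str:
    """Classify an object by pixel area using the COCO convention:
    small <= 32*32, medium <= 96*96, large otherwise."""
    if area <= 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

# A 20x20-pixel bird (Fig. 39) falls squarely in the "small" category.
category = coco_size_category(20 * 20)
```

Evaluation metrics such as APs, APm, and APl (used in Tables 5 and 6) are computed by restricting the evaluation to each of these area ranges.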
Fig. 39
Small birds spanning about 20 pixels are much more difficult to detect than those spanning 120 or 150 pixels
Fig. 40
Overview of datasets, context levels, context types, and architectures employed in general object detection approaches. The size of each section indicates the contribution of that section
- (1) FA-SSD (Lim et al. 2021) is the result of incorporating two modules: the Context by Feature Fusion module (F-SSD) and the Attention Mechanism (A-SSD). F-SSD utilizes the concatenation of multi-scale features from adjacent pixels to extract context information, thereby enabling a more comprehensive depiction of small objects. A-SSD applies an attention mechanism in the early layers, allowing for focused detection by suppressing extraneous background information.
- (2) FPN with CEM and FPM (CEFP2N) (Xiao, Guo, et al. 2023) is a new feature pyramid composite neural network structure. It has two main modules: a Context Enhancement Module (CEM) and a Feature Purification Module (FPM). CEM enhances context information using multi-scale dilated convolution features, whereas FPM uses feature purification procedures to remove conflicting information in multi-scale feature fusion.
- (3) Internal-External Network (IENet) (Leng et al. 2021) utilizes both appearance and context information. This network contains three modules: a Bidirectional Feature Fusion Module (Bi-FFM), a Context Reasoning Module (CRM), and a Context Feature Augmentation Module (CFAM). Bi-FFM collects internal features of objects, CRM enhances proposal quality via context reasoning, and CFAM performs classification by learning pair-wise relations between the region proposals produced by CRM.
- (4) Improved YOLOv5 (Zhang et al. 2022) integrates Coordinate Attention (CA) and a Context Feature Enhancement Module (CFEM) into the YOLOv5 network. CA incorporates positional information into channel attention, whereas CFEM extracts rich context information from multiple receptive fields.
- (5) CGA-YOLO (Hang et al. 2022) utilizes a Swin Transformer-based context information extraction module. In the feature fusion network, it employs a Global Attention Mechanism (GAM) comprising a Channel Attention Module (CAM) and a Spatial Attention Module (SAM) to enhance feature representation, especially for small object detection.
- (6) TYOLOv5 (Corsel et al. 2023) is a spatio-temporal model that utilizes temporal context extracted from video sequences to enhance the recognition of small moving objects while maintaining the accuracy of detecting stationary objects. The main contributions consist of integrating temporal data augmentation, introducing a spatio-temporal detector that operates on a single stream, and proposing a two-stream architecture that utilizes frame differencing to capture explicit motion information.
- (7) Contextual-YOLOV3 (Luo et al. 2019) incorporates a Contextual Relationship Matrix (CRM) into YOLOv3’s classification probabilities to improve classification. Furthermore, a Context-Based Filtering Algorithm replaces the traditional non-maximum suppression algorithm for optimal window selection.
- (8) Context-Aware Block Net (CAB Net) (Cui et al. 2020) addresses the loss of spatial information and low accuracy caused by gradual downsampling by implementing a Context-Aware Block (CAB) that incorporates pyramidal dilated convolutions while maintaining spatial relationships. By preserving both detailed and high-level semantic features, CAB Net enhances detection accuracy without increasing model complexity.
- (9) In (Fang and Shi 2018), a context information fusion technique is implemented that selectively combines classification information with region proposal windows. This method improves the efficiency of region proposal classification without adding errors in bounding box regression, unlike standard approaches that rely on both classification and bounding box information.
- (10) Semantic Context-Aware Network (SCAN) (Guan et al. 2018) uses pyramid pooling to merge larger feature maps with context information. Two modules are utilized to combine them: a Location Fusion Module (LFM) for fine-grained semantic features, and a Context Fusion Module (CFM) for context-aware features.
- (11) MCS-YOLO v4 (Ji et al. 2023) introduces a novel detection scale of \(104\times 104\) to gather more detailed information about small objects. The Expanded Field of Sensation Block (EFB) collects contextual information around small objects, hence improving feature richness.
- (12) Contextual Information Fusion (Chen et al. 2021) employs feature fusion and multi-scale output predictions to integrate contextual information into the network, enhancing resilience in detecting small objects.
- (13) Vanishing-Point-Guided Context-Aware Network (VCANet) (Chen et al. 2021) employs a vanishing point prediction block and a context-aware center detection block to collect and extract semantic information. This model achieves superior accuracy in recognizing small objects on roads compared to generic object detection methods.
- (14) The Discriminative Learning and Graph-Cut (DLGC) framework (Xi et al. 2020) leverages semantic similarity among predicted object candidates. The framework entails the creation of a pairwise constraint to depict semantic similarity, discriminative learning to assess potential similarity, and a graph-cut method to group candidates according to their similarity. Once a graph model has been constructed and candidates have been divided into different groups, a voting mechanism is utilized to determine the category of the candidates within each group. The voting technique improves the accuracy of object detectors by utilizing semantic information and cohesive relationships among neighboring candidates.
- (15) CEASC (Du et al. 2023) optimizes the detection head in drone images. It utilizes a context-enhanced group normalization (CE-GN) layer that incorporates global contextual features to improve accuracy while operating on sparsely sampled regions through adaptive sparse convolution. It also implements an adaptive multi-layer masking (AMM) strategy to dynamically adjust the mask ratio for foreground coverage, ensuring a balance between computational efficiency and detection performance.
- (16) Dynamic Local and Global Context Exploration (DCE) (Zhang et al. 2023) dynamically explores local and global context features. DCE includes Dynamic Surrounding Search (DSS), Semantic Object Relation Enhancement (SORE), and Global Feature Supplement (GFS) to enhance detection performance.
- (17) Eagle-YOLO (Ma et al. 2024) integrates a Lightweight Kernel Attention (LKA) mechanism and contextual feature fusion to enhance detection accuracy in complex scenes. Building on the YOLOv5 framework, it employs a Backbone module (CSPDarknet53) for feature extraction, a Neck module that incorporates a Bidirectional Feature Pyramid Network (Bi-FPN) with LKA to focus on target regions, and a Head module that includes an additional detection layer for small targets. Additionally, the model introduces the Eagle-IoU loss function, which addresses gradient instability and convergence issues during training.
- (18) The paper (Cheng et al. 2023) introduces a multi-level feature fusion module to improve the extraction of detailed information, addressing the challenges of missed and false detections. A regional attention module is employed to focus on small object features while minimizing the influence of background noise. Additionally, the anchor boxes are refined to better accommodate small objects.
- (19) The paper (Kim et al. 2024) proposes a novel drone-view object detection framework that integrates environmental factors like weather, illumination, and visibility to enhance detection robustness against adverse conditions. Utilizing a multimodal large language model (MLLM), the authors generate a diverse weather content feature set, which describes the environmental conditions associated with drone-view images. The framework then adaptively selects the most relevant weather content features and combines them with learnable object queries to improve contextual understanding in the detection process.
- (20) The approach proposed in (Jing et al. 2024) leverages a composite backbone network, context information, and multi-scale learning to enhance detection accuracy. It includes a Composite Dilated Convolution and Attention Module (CDAM) to efficiently integrate context information while reducing noise interference. Additionally, a Feature Elimination Module (FEM) is introduced to suppress features of medium and large objects, allowing for better detection of small objects.
- (21) SG-YOLO (Deng et al. 2024) enhances underwater target detection by integrating a Feature Fusion Module (FFM) to balance semantic information and mitigate feature loss, a Global Context Decoupling (GCD) Head to filter out irrelevant background details while processing classification and regression tasks separately, and a Content-Aware Reassembly Module (CFRM) that adapts upsampling with position-specific kernels to improve localization accuracy for small targets.
- (22) Bi-AFN++CA (Zhang and Chen 2024) integrates a Bi-directional Adaptive Fusion Network that facilitates simultaneous information flow in both top-down and bottom-up pathways, combined with a Context Extraction Module to capture rich contextual spatial-channel information. This dual approach mitigates the challenges of detecting small objects in cluttered environments by refining feature representation and incorporating contextual cues.
- (23) The paper (He et al. 2024) introduces three lightweight plug-and-play modules, CIE-Pool, SE-CBAM, and Adaptive Feature Processing (AFP), to enhance the detection accuracy of small objects within YOLO-based algorithms. The CIE-Pool module enriches feature extraction by incorporating contextual information, while the SE-CBAM module enhances spatial attention for better localization of small objects. The AFP further optimizes feature representation by filtering noise from shallow layers.
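Several of the modules above, such as F-SSD's feature concatenation and CEM's multi-scale enhancement, rest on the same primitive: upsampling a coarser, semantically richer feature map and concatenating it with a shallow, high-resolution one. A minimal sketch follows (NumPy; nearest-neighbour upsampling stands in for the learned upsampling operators these papers actually use):

```python
import numpy as np

def fuse_scales(shallow: np.ndarray, deep: np.ndarray) -> np.ndarray:
    """Upsample a coarser feature map 2x and concatenate along channels.
    shallow: (C1, H, W) high-resolution map; deep: (C2, H//2, W//2) coarse map."""
    up = deep.repeat(2, axis=1).repeat(2, axis=2)   # 2x nearest-neighbour upsampling
    assert up.shape[1:] == shallow.shape[1:], "spatial sizes must match after upsampling"
    return np.concatenate([shallow, up], axis=0)    # (C1 + C2, H, W)

# Toy example: a 3-channel shallow map fused with a 2-channel deep map.
fused = fuse_scales(np.zeros((3, 4, 4)), np.ones((2, 2, 2)))
```

The fused map keeps the fine spatial detail small objects need while injecting the semantic context that deeper layers carry, which is precisely the trade-off these modules target.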
4.3.1 Results on small object detection
A comprehensive analysis of small object detectors is shown in Table 9. Moreover, context levels, context types, and CNN architectures in small object detection approaches are presented in Fig. 41. As shown in Fig. 41, the combination of local and global context in this category has garnered the most attention, resulting in improved networks for detecting small objects. Spatial context, scale context, and semantic context have also been used far more than other contextual information in small object detectors. Since a wide variety of datasets have been used in this category, a direct comparison of the reported mAPs is not meaningful. Notably, various versions of the YOLO network have been used more frequently than other architectures, and combining YOLO with contextual information has yielded strong results in detecting small objects. In small object detection, the reviewed models show diverse methods to address challenges of feature clarity, contextual information, and computational efficiency. FA-SSD, CEFP2N, and IENet focus on enhancing feature representation by incorporating multi-scale and attention-based modules. FA-SSD uses feature fusion and attention for precise context capture, benefiting small objects but with an increase in computational demand. CEFP2N’s purification process effectively reduces noise, enhancing clarity, though the multi-step approach may impact speed. IENet leverages internal and external context through reasoning modules, which improves detection but can be complex to implement. Improved YOLOv5, CGA-YOLO, and Contextual-YOLOv3 each integrate context into YOLO architectures to retain small object details. Improved YOLOv5’s coordinate attention boosts positional accuracy, though it remains sensitive to noise in complex backgrounds. CGA-YOLO’s Swin Transformer enhances context extraction, offering robust small-object detection with high computational needs.
Contextual-YOLOv3, meanwhile, prioritizes classification accuracy but may miss localization precision in dense scenes. Models like DCE, Eagle-YOLO, and VCANet excel by fusing local and global context with targeted modules, which refines object boundaries and reduces false positives. Eagle-YOLO’s lightweight attention mechanism balances accuracy with speed, but detection can be impacted under extreme conditions. VCANet, which employs a vanishing-point-based context block, excels in road scenarios but is less effective in varied settings. Lastly, MCS-YOLOv4, SG-YOLO, and the Bi-AFN++CA model focus on balancing semantic information and spatial details, reducing feature loss and enhancing localization in challenging environments. SG-YOLO’s adaptive reassembly mitigates noise effectively, while Bi-AFN++CA’s dual-pathway approach captures rich context, excelling in cluttered scenes though at the cost of increased model complexity.
Table 9 Small object detection
Fig. 41
Overview of context levels, context types, and architectures employed in small object detection approaches. The size of each section indicates the contribution of that section
4.4 Video object detection (VOD)
Video object detection involves detecting objects using video data as compared to conventional object detection using static images (Zhu et al. 2020). In video object detection, objects may move and change their appearance across frames. As a result, objects should be detected and tracked across consecutive frames. Contextual information can aid video object detection by leveraging spatial and temporal cues, enhancing object recognition and tracking consistency across frames. In this section, context-based video object detection approaches are reviewed.
- (1) Context Faster R-CNN (Beery et al. 2020) was designed for static monitoring cameras that have low and irregular sample frequency. By utilizing temporal context extracted from unlabeled frames, the model improves object detection performance. It employs an attention-based technique to index a long-term memory bank (Mlong), built per camera, to collect long-term contextual information from previous unlabeled frames. A short-term memory (Mshort) is also added to include short-term context from neighboring frames. This approach addresses issues such as partially observed objects, low quality, and background distractors.
- (2) In contrast to other methods that combine features at once, Progressive Temporal-Spatial Enhanced Transformer (PTSEFormer) (Wang et al. 2022) employs a progressive strategy to optimize feature utilization by integrating both temporal and spatial information. This method utilizes a Temporal Feature Aggregation Module (TFAM) to handle temporal context and a Spatial Transition Awareness Module (STAM) to handle spatial context.
- (3) Flow and LSTM (Zhang and Kim 2019) is a causal recurrent flow-based method for real-time online detection. In contrast to conventional approaches that need a large number of preceding and succeeding frames to detect objects, making them impractical for real-time online detection, this method reads only the current frame and one prior frame from the memory buffer at each time step. The model captures short-term temporal context via optical flow-based feature warping from the previous frame and long-term temporal context via a temporal convolutional LSTM, and combines both.
- (4) Temporal Context Enhanced Network (TCENet) (He et al. 2020) mitigates appearance degradation in video frames by proposing Temporal Context Enhanced Aggregation (TCEA), which aggregates features from adjacent frames to model temporal context. The network comprises a DeformAlign module that ensures precise pixel-level spatial alignment over time, as well as a Temporal Stride Predictor that intelligently chooses video frames for aggregation. TCENet handles severe appearance degradation, such as unusual poses and occlusions, better than traditional spatial feature-enhanced aggregation approaches.
- (5) Video Object Detection Using Object’s Motion Context and Spatio-Temporal Feature Aggregation (VOD-MT) (Kim et al. 2021) was designed for one-stage detectors. It takes advantage of both motion context and spatio-temporal features collected across multiple frames. This approach computes correlation maps between neighboring frames and encodes them for motion context using an LSTM. Subsequently, a gated attention network is utilized to collect spatial feature maps. This method is useful for detecting objects in videos with motion blur or defocusing.
- (6) Context and Structure Mining Network (CSMN) (Han et al. 2021) addresses issues in video frames such as occlusion, motion blur, and uncommon postures using a spatial-temporal Context Information Encoding module (stCIE) and a Structure-based Proposed Feature Aggregation module (SPFA). The stCIE encodes spatial-temporal contextual information into object features by evaluating each object pixel’s spatial and temporal relationship with surrounding pixels. The SPFA, in turn, divides proposals into several patches, effectively dealing with problems such as pose misalignment and occlusion.
- (7) Motion Context Network (MC-Net) (Jin et al. 2020) improves weakly supervised object detection in videos by utilizing motion context. To overcome the difficulties of accurately locating objects in the absence of box-level annotations and under complex motion patterns, MC-Net implements a Motion Context Module (MCM) that utilizes neighborhood motion correlation to derive motion context features, which are then effectively combined with appearance information. Furthermore, a Temporal Aggregation Module (TAM) addresses degraded object appearances by consolidating features over consecutive frames.
- (8) The non-local prior based spatiotemporal attention model (Lu et al. 2020) utilizes global spatio-temporal context and non-local dependencies to improve video object detection. The technique improves accuracy in challenging video sequences, such as those with intricate backgrounds, motion blur, and partial occlusion, by incorporating non-local blocks and 3D convolutions into the YOLOv2 network. The non-local blocks provide self-attention that captures long-range dependencies in both the spatial and temporal dimensions, improving features by evaluating the similarity between any two points in the image, while the 3D convolutions gather temporal information by merging frames.
- (9) The adaptive omni-attention model (Yu et al. 2022) addresses challenges such as lighting variations, smaller objects, and motion blur in video frames. The model includes inter-frame and intra-frame attention modules, in addition to a feature fusion module. It uses inter-frame contextual information to enhance identification in low-quality frames and leverages intra-frame attention to reduce false positive detections in background regions. This strategy makes optimal use of attention across the temporal, spatial, and channel domains, enhancing detection accuracy while reducing training costs.
- (10) The Temporal Aggregation with Context Focusing (TACF) framework (Han et al. 2023) enhances few-shot video object detection by integrating temporal information from adjacent frames with contextual details from support images. It comprises two main modules: the Context Focusing (CF) module, which computes similarity scores between support images and adjacent frame features to focus on relevant target object features, and the Temporal Aggregation (TA) module, which aggregates features from adjacent frames based on their similarity to the Region of Interest (ROI) features. This approach effectively addresses challenges like occlusion and motion blur.
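The temporal aggregation modules above (e.g., TCEA, TAM, and TACF's TA module) share a basic pattern: weight neighbouring-frame features by their similarity to the current frame before averaging, so that reliable frames contribute more. A minimal sketch of this pattern (NumPy; real systems operate on spatially aligned feature maps rather than single vectors, and learn the similarity function):

```python
import numpy as np

def aggregate_frames(current: np.ndarray, neighbours: np.ndarray) -> np.ndarray:
    """Aggregate per-frame feature vectors, weighted by cosine similarity
    to the current frame. current: (D,); neighbours: (T, D)."""
    def l2norm(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    sims = l2norm(neighbours) @ l2norm(current)      # (T,) cosine similarities
    weights = np.exp(sims) / np.exp(sims).sum()      # softmax over frames
    return weights @ neighbours                      # (D,) aggregated feature

# Toy example: two neighbours identical to the current frame share the weight
# equally, so the aggregate equals the current frame's feature.
cur = np.array([1.0, 0.0])
agg = aggregate_frames(cur, np.stack([cur, cur]))
```

When the current frame suffers motion blur or occlusion, the softmax shifts weight toward cleaner neighbouring frames, which is the mechanism these modules exploit to stabilize per-frame features.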
4.4.1 Results on video object detection
A comprehensive analysis of video object detection approaches is shown in Table 10. Moreover, context levels, context types, and CNN architectures in video object detection approaches are presented in Fig. 42. Temporal consistency, scale variation, occlusions, and motion blur are some challenges that have been addressed by video object detection approaches. The ImageNet VID dataset is frequently employed to evaluate VOD models, and PTSEFormer, built upon the DETR framework and leveraging temporal-spatial context, outperforms other approaches on this dataset. Moreover, VOD-MT (Kim et al. 2021) and MC-Net (Jin et al. 2020) introduce a new concept called “motion context” in their approaches, which, in addition to spatial and temporal aspects, also captures the motion characteristics of objects. This allows objects in videos with defocusing or motion blur to be identified through motion cues. In terms of context type, context level, and CNN architecture, spatio-temporal context, local+global level, ResNet, and R-FCN are the most commonly used in video object detection approaches. For video object detection, contextual information plays a critical role in handling challenges like motion blur, occlusion, and appearance variations across frames. Context Faster R-CNN leverages long- and short-term memory mechanisms for static monitoring, utilizing temporal context from prior frames, which helps address partially observed objects and background noise. PTSEFormer adopts a progressive strategy for temporal and spatial context integration, offering refined contextual aggregation across frames. In contrast, Flow and LSTM improve real-time capabilities by utilizing only the current and previous frames, combining optical flow-based short-term context with LSTM-based long-term context for efficient online detection.
TCENet addresses appearance degradation by aligning pixel-level features across time, which effectively counters occlusions and unusual poses, while VOD-MT focuses on spatio-temporal features and motion context, enhancing detection in videos with motion blur. CSMN introduces structured feature aggregation to improve detection accuracy amid occlusions and complex postures, combining spatial-temporal context with proposal patches. MC-Net and the Non-local prior model also emphasize motion context, with MC-Net specifically optimizing for weakly-supervised settings, and the Non-local prior model using non-local dependencies for global spatio-temporal attention. Meanwhile, the Adaptive omni-attention model addresses quality variations with intra- and inter-frame attention modules, enhancing accuracy in challenging video sequences, while TACF framework improves few-shot detection by integrating context from support images, effectively countering occlusion and motion blur through focused temporal aggregation. Each method’s integration of temporal and spatial context advances video object detection by enhancing feature richness, tracking consistency, and robustness to scene complexity.
Table 10 Video object detection
Fig. 42
Overview of context levels, context types, and architectures employed in video object detection approaches. The size of each section indicates the contribution of that section in the reviewed articles
4.5 Zero-shot, one-shot, and few-shot object detection
Zero-shot object detection (ZSD) refers to the task of detecting objects from categories that have not been seen during the training phase (Bansal et al. 2018). The model is trained to detect objects from known categories, and it can then generalize to detect objects from unseen categories based on auxiliary information or semantic embeddings. Only one published context-based work exists in this category: the Multi-label Context (MLC) framework (Wei and Ma 2022), which comprises four main components: the Multi-Label Head, Contextual RoI Feature Generation, the Background Dynamic Generator (BDG), and the Zero-Shot Head. First, the Multi-Label Head extracts object-level concepts from holistic, image-level context, helping the model learn relationships across the entire image. Then, Contextual RoI Features are generated by combining instance-level and global information, providing a complementary representation to conventional RoI features. The BDG dynamically updates background word vectors, reducing confusion between background and unseen objects. Finally, the Zero-Shot Head uses both the fused context and the BDG to locate and classify seen and unseen objects, utilizing knowledge from previously seen classes to support zero-shot detection. This integration of global and local context strengthens the framework’s ability to generalize to unseen classes.
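The semantic-embedding classification that zero-shot heads such as MLC's build on can be reduced to projecting a region's visual feature into a word-embedding space and scoring it against every class embedding, seen or unseen. A minimal sketch (NumPy; the projection matrix and embeddings are illustrative placeholders, not the MLC architecture):

```python
import numpy as np

def zero_shot_scores(visual_feat: np.ndarray, proj: np.ndarray,
                     class_embeds: np.ndarray) -> np.ndarray:
    """Score a region against classes via cosine similarity in embedding space.
    visual_feat: (D,); proj: (E, D) learned visual-to-semantic projection
    (hypothetical); class_embeds: (K, E) word vectors, rows may be unseen classes."""
    sem = proj @ visual_feat                                   # project to semantic space
    sem = sem / (np.linalg.norm(sem) + 1e-8)
    cls = class_embeds / (np.linalg.norm(class_embeds, axis=1, keepdims=True) + 1e-8)
    return cls @ sem                                           # (K,) cosine scores

# Toy example: identity projection, two class embeddings.
scores = zero_shot_scores(np.array([1.0, 0.0, 0.0]), np.eye(3),
                          np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

Because classification happens in the shared embedding space, unseen classes only need a word vector, not training images, which is what makes the zero-shot generalization described above possible.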
In one-shot object detection (OSOD), a model is trained to detect objects in an image after being exposed to just one example of each object during the training phase (Osokin et al. 2020). This approach aims to enable the model to generalize and accurately detect previously unseen objects from minimal training instances. The attention-based mutual Global Context (mGC) block (Jia et al. 2021) and the Adaptive context and scale-aware feature aggregation module (ACS) with a feature alignment metric block (FAM) (Zhang et al. 2022) are context-based methods for one-shot object detection. Both have been tested on the COCO and PASCAL datasets, and the findings show that mGC performs better at recognizing seen and unseen objects on both datasets. The mGC block (Jia et al. 2021) focuses on important regions of an image by utilizing contextual information. This block improves the quality of features extracted from the image, specifically enhancing the region proposal network (RPN), which identifies potential object locations. Once feature maps have been extracted and enriched with contextual information by the mGC block, the RPN produces enhanced Regions of Interest (RoIs). Finally, the RoIs are categorized by a metric-based detector, which can handle both seen and unseen classes without requiring any further training. ACS+FAM (Zhang et al. 2022) preserves crucial details during the OSOD process by integrating both global and local contextual information. The ACS module builds a robust understanding of an object’s surroundings and scale variations by combining context enrichment with conditioned multi-scale interaction. Meanwhile, the FAM block tackles spatial misalignment by performing feature alignment with a spatial transformer network (STN) (Jaderberg et al. 2015), enhancing the overall accuracy and resilience of the OSOD system. The CSSI model (Yang et al. 2024) proposes a novel method for one-shot object detection that requires no fine-tuning. It introduces the ATP module, which uses a transformer encoder to enhance long-range spatial interactions across different scales by aggregating features in a size-aware manner. Additionally, the GCC module extracts semantically consistent spatial correlations by analyzing a complete 4D correlation tensor, complemented by inter-channel interactions through the CCL branch.
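Attention blocks of this kind re-weight spatial features against a scene-level summary before region proposal. The following is a minimal NumPy sketch of the general idea, not the published mGC design; the global-average context vector and the residual enhancement scheme are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_context_block(feat):
    """Re-weight a C x H x W feature map by attention against its own global context.

    Each spatial position is scored by how strongly it responds to the global
    average feature; positions that agree with the scene-level context are
    amplified through a residual connection before region proposal.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)          # C x N spatial positions
    context = flat.mean(axis=1)            # global context vector of length C
    attn = softmax(context @ flat)         # one attention weight per position (sums to 1)
    return (flat + flat * attn).reshape(c, h, w)

feat = np.random.default_rng(1).normal(size=(8, 4, 4))
enhanced = global_context_block(feat)
```

Feeding such context-enhanced maps to the RPN is what lets one-shot detectors concentrate proposals on regions consistent with the scene as a whole rather than on isolated local responses.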
Few-shot object detection (FSOD) aims to extract semantic knowledge from limited object instances of novel categories within a target domain (Li et al. 2023). FSOD is applicable in scenarios where only limited labeled data is available for training an object detection model. Context has received more attention in FSOD, and four approaches have been introduced: the Semantic Relation Reasoning Few-Shot Detector (SRR-FSD) (Zhu et al. 2021), Context-Transformer (Yang et al. 2020), Dense Relation Distillation with Context-aware Aggregation (DCNet) (Hu et al. 2021), and the Instance Context Network (ICNet) (Ran et al. 2023). SRR-FSD (Zhu et al. 2021) combines semantic relations with visual information to improve the stability and robustness of FSOD across different shot variations.
The detector utilizes semantic embeddings, obtained from textual data, to represent object classes. The model incorporates a relation reasoning module that considers the relations between different classes based on their semantic embeddings. The fundamental idea is that when visual data is limited, knowledge of the relationships between classes facilitates the detection of new objects. Context-Transformer (Yang et al. 2020) overcomes the limited data diversity encountered in conventional transfer learning for FSOD by leveraging source-domain object knowledge and extracting context from the target domain’s limited training set. Context-Transformer consists of two submodules: affinity discovery and context aggregation. Affinity discovery generates contextual fields for a given target image using default prior boxes and then exploits the relations between these boxes and contextual fields. Context aggregation uses these relationships as a guide and inserts critical contexts into each box. This allows Context-Transformer to generate context-aware representations for each prior box, enabling the detector to resolve few-shot confusion using significant contextual cues. DCNet (Hu et al. 2021) has two main modules: a Dense Relation Distillation Module (DRD) and a Context-aware Feature Aggregation Module (CFA).
The DRD module utilizes support features through a pixel-wise matching technique. To address the issue of scale variation, the CFA module adaptively collects features from various resolutions during RoI pooling. Another approach, the Instance Context Network (ICNet) (Ran et al. 2023), integrates an instance-level context extraction module that utilizes a self-attention mechanism to improve feature representation. The method first extracts context information from the instances, which helps enhance the features of regions of interest. These enhanced features are then embedded into a transfer learning detection module, allowing for more effective differentiation between object classes.
4.5.1 Results on zero-shot, one-shot, and few-shot object detection
A comprehensive analysis of ZSD, OSOD, and FSOD approaches is shown in Table 11. The improvement of networks at detecting objects in cases with insufficient training data, known as corner cases (Heidecker et al. 2024), demonstrates the advantageous impact of context in such situations. The findings demonstrate that incorporating global semantic context plays a positive role in detecting unseen objects in zero-shot object detection. In one-shot object detection, focusing on important regions of an image to detect both seen and unseen classes, and building a robust understanding of an object’s surroundings and scale variations, are benefits of leveraging context. In few-shot object detection, improved stability and robustness across different shot variations and the ability to tackle limited data diversity and scale variation are further advantages gained from using context. Nevertheless, given the scarcity of research in this domain, there remains an opportunity to exploit contextual information further.
Table 11 Zero-shot, one-shot, and few-shot object detection
4.6 Camouflaged object detection (COD)
Camouflaged object detection (COD) aims to identify objects, as shown in Fig. 43, that are seamlessly embedded in their surroundings, making them difficult to distinguish (Fan et al. 2020). The focus is on developing algorithms that can effectively identify such objects even in challenging backgrounds, where the object’s appearance may match or mimic the surrounding environment.
Fig. 43
Detecting the frog and crab camouflaged in their environments is very challenging
Based on the inclusion and exclusion criteria, six papers were found on the application of context in camouflaged object detection. They have been tested on different datasets, including CAMO, Chameleon, COD10K, and NC4K. The Context-aware Cross-level Fusion Network (C2F-Net) (Sun et al. 2021) has three main components: Multi-Scale Channel Attention (MSCA), an Attention-induced Cross-level Fusion Module (ACFM), and a Dual-branch Global Context Module (DGCM). MSCA effectively captures information at multiple scales by taking into account both global and local contexts; ACFM fuses features from multiple levels, with a particular emphasis on high-level features, using attention mechanisms guided by MSCA; and DGCM makes additional use of the global context contained within the fused features. Together, these modules improve the model’s ability to detect camouflaged objects by enhancing feature representations at various scales via attention-guided fusion and global context. The Boundary-guided Context-aware Network (BCNet) (Xiao et al. 2023) deals with inaccurate contours in COD. BCNet incorporates a High-resolution Feature Enhancement Module (HFEM) to extract multi-scale data while preserving precise cues, surpassing the limitations of prior approaches. Furthermore, to enhance segmentation with more precise contours, a Boundary-guided Feature Interaction Module (BFIM) is specifically engineered to uncover complementary information between camouflaged objects and their boundaries. These modules contribute to BCNet’s performance in producing high-quality segmentation maps, demonstrating gains in accuracy as well as computational efficiency. Another approach, CamoFocus (Khan et al. 2024), introduces two main modules: the Feature Split and Modulation (FSM) module, which separates and modulates foreground and background features using a supervisory mask, and the Context Refinement Module (CRM), which refines these features through cross-scale interactions. This approach improves semantic representation while maintaining lower computational complexity. The Pixel-Centric Context Perception Network (PCPNet) (Song et al. 2023) uses a CNN-based encoder with a Vital Component Generation (VCG) module to extract rich spatial and semantic features. It employs a parameter-free Pixel Importance Estimation (PIE) function to prioritize pixels in complex backgrounds, guiding the network during decoding. A Local Continuity Refinement Module (LCRM) further refines detection results for improved accuracy. In another paper (Wen et al. 2024), the method combines the Swin Transformer (Swin-T) and EfficientNet-B7 models to enhance efficiency in feature extraction and segmentation. It introduces three key modules: a Masked-Edge Attention Module for efficient edge detection using the Fourier transform, a Joint Dense Skip Attention Module for aggregating multi-level feature information, and an Object Attention Module to minimize discrepancies between encoder and decoder outputs. This method focuses on extracting both shallow and deep semantic information to improve the detection of camouflaged objects. Finally, the Discriminative Context-Aware Network (DiCANet) (Ike et al. 2024) enhances camouflaged object detection through a two-stage approach. It features an Adaptive Restoration Block (ARB) that intelligently weights feature channels and pixels, prioritizing informative data while suppressing noise via channel and pixel attention mechanisms. Following this, a Cascaded Detection Module (CDB) refines object predictions by enlarging the receptive field, producing accurate saliency maps with clear boundaries.
4.6.1 Results on camouflaged object detection
The analysis of camouflaged object detection approaches is shown in Table 12. The results are sorted for each dataset by MAE, where a lower MAE indicates better accuracy in locating and detecting camouflaged objects. For the other metrics (F-measure, S-measure, and E-measure), higher values are better, reflecting improved precision, structural similarity, and alignment with the ground truth. The results demonstrate that integrating context can mitigate difficulties in COD arising from the diverse appearances of camouflaged objects, low boundary contrast, and inaccurate contours. Based on the evaluation metrics in Table 12, each model exhibits specific strengths. C2F-Net and BCNet generally achieve higher accuracy in detecting camouflaged objects, with strong F-measure and S-measure scores indicating robust handling of multi-scale and boundary-aware features. CamoFocus stands out for its computational efficiency while maintaining competitive accuracy, as seen in its balanced performance across datasets. PCPNet effectively prioritizes complex background pixels but has room for improvement in boundary precision. DiCANet excels at generating clear boundaries, reflected in its high E-measure scores, though its complexity may increase computational demands. Each approach highlights unique advances in COD, addressing challenges such as multi-scale representation, boundary precision, and efficient feature refinement. In summary, CamoFocus performs best on the CAMO, COD10K, and NC4K datasets, while PCPNet performs best on the Chameleon dataset based on the sorted MAE values. Considering the scarcity of context-based camouflaged object detectors, it is reasonable to anticipate a greater number of scholarly articles exploring the use of context to improve object detection.
Table 12 Camouflaged object detection
5 Conclusion, research gaps, and limitations
5.1 Conclusion
In this systematic literature review, we comprehensively surveyed various aspects of context and noteworthy context-based object detection approaches within seven distinct categories: general object detection, small object detection, video object detection, zero-shot object detection, one-shot object detection, few-shot object detection, and camouflaged object detection. In Sect. 2, we started by defining aspects of context, including context in computer vision and human vision, context levels, contextual interactions, higher-order and pairwise relations, and a comprehensive analysis of context types. Then, after collecting papers from three databases and applying the inclusion and exclusion criteria described in Sect. 3, we conducted in Sect. 4 a systematic analysis of 117 object detection papers that utilized context in their architectures to improve performance. All 117 papers were reviewed and compared based on context level, context type, backbone, architecture, methods and modules, dataset, mAP, and other evaluation metrics. Most papers have focused on four types of context: spatial, size, semantic, and temporal, either individually or in combination, such as the spatio-temporal context. Moreover, the combination of local and global context has garnered more attention than their individual usage. To integrate contextual information, papers have proposed distinct modules and mechanisms that can be added to their architectures. In all seven reviewed categories, context has played a significant role in improving model performance.
In general object detection (Sect. 4.2), different approaches, including graph-based (Sect. 4.2.1), hierarchical (Sect. 4.2.2), multi-scale (Sect. 4.2.4), RPN-based (Sect. 4.2.5), attention-based (Sect. 4.2.6), and other approaches (Sect. 4.2.7), have integrated context into their architectures through various modules and mechanisms during network training. Instead of adding context to context-free architectures, context data augmentation methods (Sect. 4.2.3) augment the available contextual information.
In small object detection (Sect. 4.3), various versions of the YOLO network have been used more frequently than other architectures. The combination of YOLO with contextual information has led to acceptable results in detecting small objects.
In video object detection (Sect. 4.4), the investigated methods have effectively utilized context to improve their performance by tackling problems such as low quality, background distractors, occlusions, motion blur, and degraded object appearance.
In zero-shot, one-shot, and few-shot object detection (Sect. 4.5), where the number of training samples is very limited, networks attempt to focus on regions with a higher likelihood of object presence through the integration of context. They also aim to detect objects that have not been seen during training by using global context and background information.
The methods presented in camouflaged object detection (Sect. 4.6) tackle the difficulties in COD stemming from the diverse appearances of camouflaged objects, low boundary contrast, and inaccurate contours by integrating context into their architectures.
We answer the research questions outlined in the introduction (Sect. 1) as follows:
- RQ1. Which context types have been predominantly used in different categories of object detection? Figs. 40, 41, and 42 show the distribution of context types and levels in general object detection, small object detection, and video object detection. Based on Fig. 11, more than 14 types of context can be utilized in object detection, but most articles have focused on size, spatial, temporal, and combinations of them. This preference can stem from the alignment of these context types with the unique challenges posed by specific detection categories; for instance, spatial and temporal context naturally aid in tracking movement in video detection or locating small objects within crowded scenes. Additionally, the frequent use of these contexts could reflect their well-established effectiveness in enhancing model performance. Furthermore, in all three categorizations, a combination of local and global context is more commonly utilized than each individually. Tables 11 and 12 also demonstrate a similar trend regarding the distribution of context in zero-shot, one-shot, few-shot, and camouflaged object detection. The limited exploration of other context types suggests potential for future research to investigate broader context types. This could offer new insights and solutions to complex scenarios where underutilized contexts, such as environmental or spectral data, might further enhance object detection accuracy and adaptability.
- RQ2. What approaches are applicable for integrating context in object detection? In general object detection, to integrate context into algorithms, methods have utilized different modules and mechanisms in various approaches such as graph-based approaches, hierarchical approaches, context augmentation, multi-scale approaches, RPN-based approaches, attention-based approaches, etc. Although the majority of papers integrate context into context-free methods, some approaches have been proposed to facilitate object detection through context data augmentation. In other categories, including small object detection, video object detection, zero-shot, one-shot, few-shot object detection, and camouflaged object detection, various proposed methods have also utilized different modules to integrate context into architectures. Given that methods have been tested on different datasets, it cannot be concluded which approach is the best. All approaches have been thoroughly reviewed in detail in Sect. 4.
- RQ3. Why are certain backbone networks and architectures most commonly used in recent context-based object detectors? Based on Figs. 40, 41, 42, and Tables 11 and 12, ResNet, VGG16, and Faster RCNN have been more commonly utilized in object detection approaches. This preference can be attributed to the proven effectiveness of these architectures in balancing computational efficiency with high performance, especially for complex contextual features. ResNet and VGG16, for instance, offer robust feature extraction capabilities, essential for capturing fine-grained context information. Faster R-CNN, a two-stage detector, is favored for its high accuracy and ability to integrate additional contextual modules, making it particularly suitable for scenarios requiring nuanced contextual understanding.
- RQ4. What are the best performing context-based methods on the most widely used datasets, including COCO and PASCAL VOC? What about for one-stage and two-stage object detectors? In the domain of general object detection, FNM (Barnea and Ben-Shahar 2019) demonstrates superiority on the COCO dataset, while Feature Refinement (Ma and Wang 2023) excels on VOC07, and the Cascade Region Proposal (Zhong et al. 2020) stands out as the most effective for VOC12. On the COCO dataset, Feature Refinement (Ma and Wang 2023) proves to be the best method for medium and large objects, while GCE (Peng et al. 2022) stands out as the best for small objects. In the realm of one-stage object detection, CCAGNet (Miao et al. 2022) emerges as the top performer on VOC07, whereas GCA RCNN (Zhang et al. 2021) demonstrates superiority on the COCO dataset.
- RQ5. To what extent can context improve object detection in scenarios where the number of training samples is very limited, such as in few-shot object detection, or when objects are indistinguishable from the background, as in camouflaged object detection? The number of papers using context to enhance performance in zero-shot, one-shot, few-shot, and camouflaged object detection is much lower compared to other categories. However, the superior performance of the approaches proposed in the reviewed papers over context-free methods shows that context can have a significant impact on training networks constrained by limited training data. Additionally, context can enhance a network’s capability to detect objects that are indiscernible in camouflaged object detection.
5.2 Research gaps
Based on the findings from the papers, we have identified and categorized the following research gaps related to context-based object detection:
- Many current object detection models demonstrate strong performance within specific domains; however, the generalization of context-based detectors across domains remains underexplored. For instance, approaches like SG-YOLO and VCANet have shown effective domain-specific adaptations for underwater object detection and small objects on roads but lack the ability to generalize across diverse domains. Future research could focus on evaluating and enhancing cross-domain robustness, potentially incorporating adaptive mechanisms like self-supervised learning or domain adaptation techniques to improve performance across varied contexts.
- Most models rely on static context modules that lack adaptability to the varying contexts within a single scene or across tasks. Modular, flexible context integration, where context types such as spatial, temporal, or multi-modal information can be dynamically added, could enhance model versatility and improve performance in complex, real-world environments, particularly in situations with shifting contextual needs.
- Many models are optimized for specific object sizes, often excelling with either small or large objects but not both. Real-world scenes typically include a range of object sizes, which current models are not fully equipped to handle in a unified, efficient manner. Research into developing multi-scale, adaptive models that can dynamically detect objects across various scales in real-time could greatly benefit high-stakes applications like autonomous navigation.
- Video-based object detection for few-shot or camouflaged scenarios could benefit from utilizing long-range temporal context across frames. Long-term dependencies can provide cues to improve detection in challenging conditions, like rapid movement or occlusion. Integrating robust temporal context mechanisms, such as those in Context Faster R-CNN and TACF, could enhance model accuracy for tracking and detection in video sequences.
- Most studies focus on spatial, temporal, or semantic contexts, with limited exploration of other context types, such as environmental (weather or lighting) and behavioral cues (e.g., predicting actions based on object interactions). Further research into these novel contextual types, especially in applications like surveillance and autonomous vehicles, could offer new insights and improve model performance in complex scenarios.
- Current approaches that use scale context in object detection often overlook depth as a factor influencing object size. This assumption of fixed size relationships can lead to inaccuracies in layered scenes, where objects appear at different scales depending on their distance from the camera. Future research could address this by incorporating depth-aware scale context, allowing models to adapt expected object size based on depth cues. Such an approach would enhance robustness, particularly in complex environments with varying object distances.
- Although contextual information has enhanced the performance of zero-shot, one-shot, few-shot, and camouflaged object detectors, there is a scarcity of published studies on these topics. This lack could be remedied by devoting additional research efforts to these domains.
- More multimodal networks for integrating different types of context, such as text and audio, could be implemented to improve object detection performance, especially in complex scenes or under challenging conditions. Recently, large vision-language models (VLMs), such as those developed for multimodal applications, have shown potential in leveraging textual and visual information to enhance contextual understanding. Exploring these VLMs for object detection tasks may offer new pathways to integrate richer contextual cues, potentially improving detection accuracy in scenarios where traditional methods struggle.
- A notable gap in the available papers is the limited consideration of uncertainty and statistical rigor in evaluating object detection models. While performance metrics like mAP and APs are frequently reported, few studies include crucial statistical insights, such as error bars or confidence intervals, to assess the significance and reliability of their results. Addressing this gap by incorporating uncertainty quantification and statistical testing would provide a clearer understanding of model robustness and result validity, enhancing the reliability of findings in object detection research.
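One lightweight way to report such uncertainty is a percentile bootstrap over per-image scores. The sketch below is illustrative: the per-image AP values are synthetic, and aggregating mAP this way is a simplifying assumption rather than the COCO evaluation protocol.

```python
import numpy as np

def bootstrap_ci(per_image_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean detection score.

    Resampling images with replacement yields an uncertainty estimate
    around the reported mean, instead of a single point value.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_scores)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Hypothetical per-image AP values for a detector on a 200-image test set.
ap = np.clip(np.random.default_rng(3).normal(0.55, 0.15, size=200), 0, 1)
mean_ap, (lo, hi) = bootstrap_ci(ap)
```

Reporting the interval alongside the mean would let readers judge whether the small mAP differences between competing context-based detectors are statistically meaningful.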
5.3 Limitations
- While this review employed a Boolean search criterion targeting the term “context” within titles, we recognize that this may not capture all relevant papers, especially those that address context implicitly. For instance, multi-modal approaches utilizing LiDAR, thermal, or text-based inputs, such as visual question answering systems, often integrate spatial, semantic, or other contextual information without explicitly mentioning “context.” Additionally, multi-task learning frameworks can provide implicit context by sharing feature representations across tasks. Examples of such approaches that may not have been fully captured include TIDE for semantic contextual adaptation (Kerssies et al. 2022), MMDetection3D for LiDAR-based context integration (Contributors 2020), and MTLNet for multi-task learning in complex environments (He et al. 2019). Future reviews could expand the search criteria to better capture these implicit uses of context in object detection research.
- This review highlights the impact of context on improving object detection performance across various categories. However, a limitation is the lack of a direct comparison between context-based and context-free approaches in the studies analyzed. Most reviewed papers do not present baseline results for context-free models, making it difficult to quantify the precise contribution of context in isolation. Addressing this gap could provide a clearer understanding of context’s value and limitations. Future research would benefit from systematic studies that evaluate and report on the performance differences between context-inclusive and context-free approaches across object detection categories.
- We have concentrated on seven categories of context-based object detection, including general object detection, video object detection, small object detection, camouflaged object detection, zero-shot, one-shot, and few-shot object detection. Other approaches, such as salient object detection, RGB-D object detection, and those listed in the exclusion section of Table 2, could be explored as future research directions, as they have not been extensively covered thus far and hold significant potential for further exploration.
References
- Afouras T, Asano YM, Fagan F, Vedaldi A, Metze F (2022) Self-supervised object detection from audio-visual correspondence. In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 10575–10586)
- Afrl U (2009) Wright-Patterson air force base (wpafb) dataset
- Aharon S, Louis-Dupont, Ofri Masad, Yurkova K, Lotem Fridman, Lkdci, Eran-Deci (2021) Super-gradients. GitHub. https://zenodo.org/record/7789328
- Akita K, Ukita N (2023) Context-aware region-dependent scale proposals for scale-optimized object detection using super-resolution. IEEE Access 11:122141–122153
- Ao Wang HC (2024) Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458
- Ardeshir S, Zamir AR, Torroella A, Shah M (2014) Gis-assisted object detection and geospatial localization. In: Computer vision–eccv 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part vi 13 (pp. 602–617)
- Bansal A, Sikka K, Sharma G, Chellappa R, Divakaran A (2018) Zero-shot object detection. Proceedings of the European conference on computer vision (eccv) (pp. 384–400)
- Banuls A, Mandow A, Vázquez-Martín R, Morales J, García-Cerezo A (2020) Object detection from thermal infrared and visible light cameras in search and rescue scenes. In: 2020 IEEE international symposium on safety, security, and rescue robotics (ssrr) (pp. 380–386)
- Bao SY, Sun M, Savarese S (2011) Toward coherent object detection and scene layout understanding. Image Vis Comput 29(9):569–579
- Bar M, Ullman S (1996) Spatial context in recognition. Perception 25(3):343–352
- Bardool K, Tuytelaars T, Oramas J (2019) A systematic analysis of a context aware deep learning architecture for object detection. Bnaic/Benelearn, 2491
- Barnea E, Ben-Shahar O (2019) Contextual object detection with a few relevant neighbors. Computer vision–accv 2018: 14th asian conference on computer vision, perth, australia, december 2–6, 2018, revised selected papers, part ii 14 (pp. 480–495)
- Bay H, Tuytelaars T, Van Gool L (2006) Surf: Speeded up robust features. Computer vision–eccv 2006: 9th European conference on computer vision, graz, austria, may 7-13, 2006. proceedings, part i 9 (pp. 404–417)
- Beery S, Van Horn G, Perona P (2018) Recognition in terra incognita. Proceedings of the European conference on computer vision (eccv) (pp. 456–473)
- Beery S, Wu G, Rathod V, Votel R, Huang J (2020) Context r-cnn: Long term temporal context for per-camera object detection. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 13075–13085)
- Behrendt K, Novak L (2017). A deep learning approach to traffic lights: Detection, tracking, and classification. Robotics and automation (icra), 2017 IEEE international conference on
- Belongie CGS (2007) Context based object categorization: A critical survey. Citeseer
- Berg A (2016) Detection and tracking in thermal infrared imagery (Unpublished doctoral dissertation). Linköping University Electronic Press
- Berg T, Liu J, Woo Lee S, Alexander ML, Jacobs DW, Belhumeur PN (2014) Birdsnap: Large-scale fine-grained visual categorization of birds. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2011–2018)
- Bhalla S, Kumar A, Kushwaha R (2024) Feature-adaptive fpn with multiscale context integration for underwater object detection. Earth Sci Inf 6:1–17
- Biederman I, Mezzanotte RJ, Rabinowitz JC (1982) Scene perception: detecting and judging objects undergoing relational violations. Cogn Psychol 14(2):143–177
- Blake A, Kohli P, Rother C (2011) Markov random fields for vision and image processing. MIT Press
- Bochkovskiy A, Wang C-Y, Liao H-YM (2020) Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934
- Bogorny V, Engel PM, Alavares LO (2009) Enhancing the process of knowledge discovery in geographic databases using geo-ontologies. Database technologies: Concepts, methodologies, tools, and applications (pp. 2405–2426). IGI Global
- Brown PJ, Bovey JD, Chen X (1997) Context-aware applications: from the laboratory to the marketplace. IEEE Pers Commun 4(5):58–64
- Cao Y, Lu X, Zhu Y, Zhou X (2020) Context-based fine hierarchical object detection with deep reinforcement learning. 2020 7th international conference on information science and control engineering (icisce) (pp. 405–409)
- Carbonetto P, De Freitas N, Barnard K (2004) A statistical model for general contextual object recognition. Computer vision-eccv 2004: 8th European conference on computer vision, prague, czech republic, may 11-14, 2004. proceedings, part i 8 (pp. 350–362)
- Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. European conference on computer vision (pp. 213–229)
- Chen J, Chen X, Luo L, Wang G (2021) Contextual information fusion for small object detection. 2021 40th chinese control conference (ccc) (pp. 7971–7975)
- Chen G, Chen K, Zhang L, Zhang L, Knoll A (2021) Vcanet: Vanishing-point-guided context-aware network for small road object detection. Autom Innov 4:400–412
- Cheng M, Ge H, Ma S, He W, An Y, Zhou T (2023) Small object detection based on context information and attention mechanism. 2023 5th international conference on natural language processing (icnlp) (pp. 7–11)
- Chen S, Guhur P-L, Tapaswi M, Schmid C, Laptev I (2022) Language conditioned spatial relation reasoning for 3d object grounding. Adv Neural Inf Process Syst 35:20522–20535
- Chen Z, Huang S, Tao D (2018) Context refinement for object detection. Proceedings of the European conference on computer vision (eccv) (pp. 71–86)
- Chen Z-M, Jin X, Zhao B, Wei X-S, Guo Y (2020) Hierarchical context embedding for region-based object detection. Computer vision–eccv 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part xxi 16 (pp. 633–648)
- Chen Y, Song P, Liu H, Dai L, Zhang X, Ding R, Li S (2023) Achieving domain generalization for underwater object detection by domain mixup and contrastive learning. Neurocomputing 528:20–34
- Chen C, Yu J, Ling Q (2022) Sparse attention block: aggregating contextual information for object detection. Pattern Recogn 124:108418
- Chen Z, Zhang J, Tao D (2021) Recursive context routing for object detection. Int J Comput Vis 129(1):142–160
- Chen Z, Zhang J, Xu Y, Tao D (2023) Transformer-based context condensation for boosting feature pyramids in object detection. Int J Comput Vis 131(10):2738–2756
- Chen Y, Zhao M, Tan X, Tang H, Sun D (2019) Accurate and efficient object detection with context enhancement block. 2019 IEEE international conference on multimedia and expo (icme) (pp. 1726–1731)
- Choi MJ, Torralba A, Willsky AS (2011) A tree-based context model for object recognition. IEEE Trans Pattern Anal Mach Intell 34(2):240–252
- Chu W, Cai D (2018) Deep feature based contextual model for object detection. Neurocomputing 275:1035–1042
- Contributors M (2020) MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d
- Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. Proc. of the IEEE conference on computer vision and pattern recognition (cvpr)
- Corsel CW, van Lier M, Kampmeijer L, Boehrer N, Bakker EM (2023) Exploiting temporal context for tiny object detection. Proceedings of the IEEE/cvf winter conference on applications of computer vision (pp. 79–89)
- Cui L, Lv P, Jiang X, Gao Z, Zhou B, Zhang L, Xu M (2020) Context-aware block net for small object detection. IEEE Trans Cybernet 52(4):2300–2313
- Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. 2005 IEEE computer society conference on computer vision and pattern recognition (cvpr’05) (Vol. 1, pp. 886–893)
- Ballard DH, Brown CM (1982) Computer vision. Prentice-Hall, Englewood Cliffs, New Jersey, pp 76–77
- Deng L, Luo S, He C, Xiao H, Wu H (2024) Underwater small and occlusion object detection with feature fusion and global context decoupling head-based yolo. Multimedia Syst 30(4):208
- Deng H, Wang C, Li C, Hao Z (2024) Fine grained dual level attention mechanisms with spacial context information fusion for object detection. Pattern Anal Appl 27(3):75
- Hoiem D (2006) Putting objects in perspective. CVPR 2006
- Dimitropoulos K, Hatzilygeroudis I (2022) Context representation and reasoning in robotics: an overview. Advances in Artificial Intelligence-based Technologies: Selected Papers in Honour of Professor Nikolaos G. Bourbakis, Vol 1:79–92
- Ding P, Zhang J, Zhou H, Zou X, Wang M (2020) Pyramid context learning for object detection. J Supercomput 76:9374–9387
- Divvala SK, Hoiem D, Hays JH, Efros AA, Hebert M (2009) An empirical study of context in object detection. 2009 IEEE conference on computer vision and pattern recognition (pp. 1271–1278)
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T others (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
- Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q (2019) Centernet: Keypoint triplets for object detection. Proceedings of the IEEE/cvf international conference on computer vision (pp. 6569–6578)
- Du B, Huang Y, Chen J, Huang D (2023) Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 13435–13444)
- Du D, Qi Y, Yu H, Yang Y, Duan K, Li G, Tian Q (2018) The unmanned aerial vehicle benchmark: Object detection and tracking. Proceedings of the European conference on computer vision (eccv) (pp. 370–386)
- Dvornik N, Mairal J, Schmid C (2018) Modeling visual context is key to augmenting object detection datasets. Proceedings of the European conference on computer vision (eccv) (pp. 364–380)
- Dvornik N, Mairal J, Schmid C (2019) On the importance of visual context for data augmentation in scene understanding. IEEE Trans Pattern Anal Mach Intell 43(6):2014–2028
- Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (n.d.) The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
- Everingham M, Winn J (2012) The pascal visual object classes challenge 2012 (voc2012) development kit. Pattern Anal Stat Model Comput Learn, Tech Rep 2007(1-45):5
- Fan D-P, Ji G-P, Sun G, Cheng M-M, Shen J, Shao L (2020) Camouflaged object detection. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 2777–2787)
- Fan D-P, Ji G-P, Cheng M-M, Shao L (2021) Concealed object detection. IEEE Trans Pattern Anal Mach Intell 44(10):6024–6042
- Fan B, Shao M, Li Y, Li C (2022) Global contextual attention for pure regression object detection. Int J Mach Learn Cybern 13(8):2189–2197
Article Google Scholar - Fan Q, Tang C-K, Tai Y-W (2022) Few-shot video object detection. European conference on computer vision (pp. 76–98)
- Fang P, Shi Y (2018) Small object detection using context information fusion in faster r-cnn. 2018 IEEE 4th international conference on computer and communications (iccc) (pp. 1537–1540)
- Fauvel M, Chanussot J, Benediktsson JA (2012) A spatial-spectral kernel-based approach for the classification of remote-sensing images. Pattern Recogn 45(1):381–392
- Felzenszwalb PF, Huttenlocher DP (2005) Pictorial structures for object recognition. Int J Comput Vis 61:55–79
- Felzenszwalb P, McAllester D, Ramanan D (2008) A discriminatively trained, multiscale, deformable part model. 2008 IEEE conference on computer vision and pattern recognition (pp. 1–8)
- Feng Y-Z, Sun D-W (2012) Application of hyperspectral imaging in food safety inspection and control: a review. Crit Rev Food Sci Nutr 52(11):1039–1058
- Galleguillos C, Belongie S (2010) Context based object categorization: a critical survey. Comput Vis Image Underst 114(6):712–722
- Galleguillos C, Rabinovich A, Belongie S (2008) Object categorization using co-occurrence, location and appearance. 2008 IEEE conference on computer vision and pattern recognition (pp. 1–8)
- Gao J, Li P, Chen Z, Zhang J (2020) A survey on deep learning for multimodal data fusion. Neural Comput 32(5):829–864
- Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. Conference on computer vision and pattern recognition (cvpr)
- Georgousis S, Kenning MP, Xie X (2021) Graph deep learning: state of the art and challenges. IEEE Access 9:22106–22140
- Gidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware cnn model. Proceedings of the IEEE international conference on computer vision (pp. 1134–1142)
- Girshick R (2015) Fast r-cnn. Proceedings of the IEEE international conference on computer vision (pp. 1440–1448)
- Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587)
- Girum KB, Créhange G, Lalande A (2021) Learning with context feedback loop for robust medical image segmentation. IEEE Trans Med Imaging 40(6):1542–1554
- Gkioxari G, Girshick R, Dollár P, He K (2018) Detecting and recognizing human-object interactions. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8359–8367)
- Gong Y, Xiao Z, Tan X, Sui H, Xu C, Duan H, Li D (2019) Context-aware convolutional neural network for object detection in vhr remote sensing imagery. IEEE Trans Geosci Remote Sens 58(1):34–44
- Groenen I, Rudinac S, Worring M (2023) Panorams: automatic annotation for detecting objects in urban context. IEEE Trans Multimed 26:1281
- Gu X, Zhang Q, Lu Z (2022) Weakly supervised object detection with symmetry context. Symmetry 14(9):1832
- Guan L, Wu Y, Zhao J (2018) Scan: Semantic context aware network for accurate small object detection. Int J Comput Intell Syst 11(1):951–961
- Guo J, Yuan C, Zhao Z, Feng P, Luo Y, Wang T (2020) Object detector with enriched global context information. Multimed Tools Appl 79:29551–29571
- Gupta A, Dollar P, Girshick R (2019) Lvis: a dataset for large vocabulary instance segmentation. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 5356–5364)
- Han W, Lei J, Wang F, Feng Z, Liang R (2023) Temporal aggregation with context focusing for few-shot video object detection. 2023 IEEE international conference on systems, man, and cybernetics (smc) (pp. 2196–2201)
- Han L, Wang P, Yin Z, Wang F, Li H (2021) Context and structure mining network for video object detection. Int J Comput Vis 129(10):2927–2946
- Hang Z, Fan L, Ping K, Xiaofeng G, Mingyun H, Heng T (2022) Small object detection algorithm based on context information and attention mechanism. 2022 19th international computer conference on wavelet active media technology and information processing (iccwamtip) (pp. 1–6)
- He F, Gao N, Li Q, Du S, Zhao X, Huang K (2020) Temporal context enhanced feature aggregation for video object detection. Proceedings of the aaai conference on artificial intelligence (Vol. 34, pp. 10941–10948)
- He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. Proceedings of the IEEE international conference on computer vision (pp. 2961–2969)
- He T, Shen C, Tian Z, Gong D, Sun C, Yan Y (2019) Knowledge adaptation for efficient semantic segmentation. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 578–587)
- He X, Tang C, Liu X, Zhang W, Sun K, Xu J (2023) Object detection in hyperspectral image via unified spectral-spatial feature aggregation. IEEE Trans Geosci Remote Sens 61:1–13. https://doi.org/10.1109/TGRS.2023.3307288
- He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)
- He X, Zheng X, Hao X, Jin H, Zhou X, Shao L (2024) Improving small object detection via context-aware and feature-enhanced plug-and-play modules. J Real-Time Image Proc 21(2):44
- Heidecker F, Bieshaar M, Sick B (2024) Corner cases in machine learning processes. AI Perspect Adv 6(1):1
- Heitz G, Koller D (2008) Learning spatial context: Using stuff to find things. Computer vision–eccv 2008: 10th European conference on computer vision, Marseille, France, October 12-18, 2008, proceedings, part i 10 (pp. 30–43)
- Hoiem D, Efros AA, Hebert M (2005) Geometric context from a single image. Tenth IEEE international conference on computer vision (iccv’05) (Vol. 1, pp. 654–661)
- Hoiem D, Efros AA, Hebert M (2008) Putting objects in perspective. Int J Comput Vis 80:3–15
- Hossain SN, Hassan MZ, Masba MMA (2022) Automatic license plate recognition system for bangladeshi vehicles using deep neural network. Proceedings of the international conference on big data, iot, and machine learning: Bim 2021 (pp. 91–102)
- Hu H, Bai S, Li A, Cui J, Wang L (2021) Dense relation distillation with context-aware aggregation for few-shot object detection. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 10185–10194)
- Ibañez-Guzman J, Laugier C, Yoder J-D, Thrun S (2012) Autonomous driving: context and state-of-the-art. Springer
- Ike CS, Muhammad N, Bibi N, Alhazmi S, Eoghan F (2024) Discriminative context-aware network for camouflaged object detection. Front Artif Intell 7:1347898
- Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K (2015) Spatial transformer networks. Advances in neural information processing systems 28. Annual conference on neural information processing systems (pp. 7–12)
- Jain V, Learned-Miller E (2010) Fddb: A benchmark for face detection in unconstrained settings (Tech. Rep.). UMass Amherst technical report
- Jain R, Sinha P (2010) Content without context is meaningless. Proceedings of the 18th acm international conference on multimedia (pp. 1259–1268)
- Ji S-J, Ling Q-H, Han F (2023) An improved algorithm for small object detection based on yolo v4 and multi-scale contextual information. Comput Electr Eng 105:108490
- Ji H, Ye K, Wan Q, Shen L (2022) Reasonable object detection guided by knowledge of global context and category relationship. Expert Syst Appl 209:118285
- Jia S, Lu T, Zhang H (2021) One shot object detection with mutual global context. Proceedings of the 2021 4th international conference on artificial intelligence and pattern recognition (pp. 165–171)
- Jiang Y, Peng T, Tan N (2019) Cp-ssd: Context information scene perception object detection based on ssd. Appl Sci 9(14):2785
- Jiaxuan H, Lei Y, Junping Y (2022) Lightened context extraction network for object detection. 2022 19th international computer conference on wavelet active media technology and information processing (iccwamtip) (pp. 1–6)
- Jin R, Lin G, Wen C, Wang J (2020) Motion context network for weakly supervised object detection in videos. IEEE Signal Process Lett 27:1864–1868
- Jing X, Liu X, Liu B (2024) Composite backbone small object detection based on context and multi-scale information with attention mechanism. Mathematics 12(5):622
- Jocher G (2020) Ultralytics yolov5. Retrieved from https://github.com/ultralytics/yolov5
- Jocher G, Chaurasia A, Qiu J (2023) Yolo by ultralytics. Ultralytics
- Jocher G, Qiu J (2024) Ultralytics yolo11. Retrieved from https://github.com/ultralytics/ultralytics
- Kaiming H, Xiangyu Z, Shaoqing R, Sun J (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. European conference on computer vision
- Katti H, Peelen MV, Arun S (2019) Machine vision benefits from human contextual expectations. Sci Rep 9(1):2112
- Kaya EC, Alatan AA (2018) Improving proposal-based object detection using convolutional context features. 2018 25th IEEE international conference on image processing (icip) (pp. 1308–1312)
- Kerssies T, Kılıçkaya M, Vanschoren J (2022) Evaluating continual test-time adaptation for contextual and semantic domain shifts. arXiv preprint arXiv:2208.08767
- Khan A, Khan M, Gueaieb W, El Saddik A, De Masi G, Karray F (2024) Camofocus: Enhancing camouflage object detection with split-feature focal modulation and context refinement. Proceedings of the IEEE/cvf winter conference on applications of computer vision (pp. 1434–1443)
- Kim Y, Kim T, Kang B-N, Kim J, Kim D (2018) Ban: focusing on boundary context for object detection. Asian conference on computer vision (pp. 555–570)
- Kim J, Koh J, Lee B, Yang S, Choi JW (2021) Video object detection using object’s motion context and spatio-temporal feature aggregation. 2020 25th international conference on pattern recognition (icpr) (pp. 1604–1610)
- Kim H, Lee D, Park S, Ro YM (2024) Weather-aware drone-view object detection via environmental context understanding. 2024 IEEE international conference on image processing (icip) (pp. 549–555)
- Kitchenham B (2004) Procedures for performing systematic reviews. Keele, UK, Keele University 33(2004):1–26
- Krišto M, Ivasic-Kos M, Pobar M (2020) Thermal object detection in difficult weather conditions using yolo. IEEE Access 8:125459–125476
- Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst, 25
- Ladicky L, Russell C, Kohli P, Torr PH (2010) Graph cut based inference with co-occurrence statistics. European conference on computer vision (pp. 239–253)
- Lai Q, Vong C-M, Shi S-Q, Chen CP (2024) Towards precise weakly supervised object detection via interactive contrastive learning of context information. IEEE Trans Emerging Top Comput Intell
- Lalonde J-F, Narasimhan SG, Efros AA (2008) What does the sky tell us about the camera? Computer vision–eccv 2008: 10th European conference on computer vision, Marseille, France, October 12-18, 2008, proceedings, part iv 10 (pp. 354–367)
- Lan Y, Duan Y, Liu C, Zhu C, Xiong Y, Huang H, Xu K (2022) Arm3d: Attention-based relation module for indoor 3d object detection. Comput Vis Media 8(3):395–414
- Law H, Deng J (2018) Cornernet: Detecting objects as paired keypoints. Proceedings of the European conference on computer vision (eccv) (pp. 734–750)
- Le T-N, Nguyen TV, Nie Z, Tran M-T, Sugimoto A (2019) Anabranch network for camouflaged object segmentation. J Comput Vis Image Understand 184:45–56
- Lee SY (2022) Task specific attention is one more thing you need for object detection
- Lee D, Kim J, Jung K (2021) Improving object detection quality by incorporating global contexts via self-attention. Electronics 10(1):90
- Leng J, Liu Y (2022) Context augmentation for object detection. Appl Intell 52(3):2621–2633
- Leng J, Liu Y, Zhang T, Quan P (2018) Context learning network for object detection. 2018 IEEE international conference on data mining workshops (icdmw) (pp. 667–673)
- Leng J, Ren Y, Jiang W, Sun X, Wang Y (2021) Realize your surroundings: exploiting context information for small object detection. Neurocomputing 433:287–299
- Leroy A, Faure S, Spotorno S (2020) Reciprocal semantic predictions drive categorization of scene contexts and objects even when they are separate. Sci Rep 10(1):8447
- Li Z, Cui X, Wang L, Zhang H, Zhu X, Zhang Y (2021) Spectral and spatial global context attention for hyperspectral image classification. Remote Sens 13(4):771
- Li C, Li L, Jiang H, Weng K, Geng Y, Li L others (2022) Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976
- Li Y, Shao M, Fan B, Zhang W (2022) Multi-scale global context feature pyramid network for object detector. Signal Image Video Process, pp 1–9
- Li N, Song F, Zhang Y, Liang P, Cheng E (2022) Traffic context aware data augmentation for rare object detection in autonomous driving. 2022 international conference on robotics and automation (icra) (pp. 4548–4554)
- Li J, Wei Y, Liang X, Dong J, Xu T, Feng J, Yan S (2016) Attentive contexts for object detection. IEEE Trans Multimedia 19(5):944–954
- Li W, Wei H, Wu Y, Yang J, Ruan Y, Li Y, Tang Y (2023) Tide: Test time few shot object detection. arXiv preprint arXiv:2311.18358
- Li J, Zhang C, Yang B (2022) Global contextual dependency network for object detection. Future Internet 14(1):27
- Liang H, Zhou H, Zhang Q, Wu T (2022) Object detection algorithm based on context information and self-attention mechanism. Symmetry 14(5):904
- Lieskovská E, Jakubec M, Jarina R, Chmulík M (2021) A review on speech emotion recognition using deep learning and attention mechanism. Electronics 10(10):1163
- Lim J-S, Astrid M, Yoon H-J, Lee S-I (2021) Small object detection using context and attention. 2021 international conference on artificial intelligence in information and communication (icaiic) (pp. 181–186)
- Lim C-G, Jeong Y-S, Choi H-J (2019) Survey of temporal information extraction. J Inf Process Syst 15(4):931–956
- Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125)
- Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision (pp. 2980–2988)
- Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Zitnick CL (2014) Microsoft coco: Common objects in context. Computer vision–eccv 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part v 13 (pp. 740–755)
- Lin K-Y, Tseng Y-H, Chiang K-W (2022) Interpretation and transformation of intrinsic camera parameters used in photogrammetry and computer vision. Sensors 22(24):9602
- Liu Z, Cheng J (2023) Cb-fpn: object detection feature pyramid network based on context information and bidirectional efficient fusion. Pattern Anal Appl 26(3):1441–1452
- Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: Single shot multibox detector. Computer vision–eccv 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, proceedings, part i 14 (pp. 21–37)
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/cvf international conference on computer vision (pp. 10012–10022)
- Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. Proceedings of the IEEE international conference on computer vision (pp. 3730–3738)
- Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M (2020) Deep learning for generic object detection: a survey. Int J Comput Vis 128:261–318
- Liu Y, Wang R, Shan S, Chen X (2018) Structure inference net: Object detection using scene-level context and instance-level relationships. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6985–6994)
- Long Y, Gong Y, Xiao Z, Liu Q (2017) Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans Geosci Remote Sens 55(5):2486–2498
- Lowe DG (1999) Object recognition from local scale-invariant features. Proceedings of the seventh IEEE international conference on computer vision (Vol. 2, pp. 1150–1157)
- Lu W, Xu W, Wu Z, Xu Y, Wei Z (2020) Video object detection based on non-local prior of spatiotemporal context. 2020 Eighth international conference on advanced cloud and big data (cbd) (pp. 177–182)
- Luo Z, Branchaud-Charron F, Lemaire C, Konrad J, Li S, Mishra A, Jodoin P-M (2018) Mio-tcd: A new benchmark dataset for vehicle classification and localization. IEEE Trans Image Process 27(10):5129–5141
- Luo H, Huang L, Shen H, Li Y, Huang C, Wang X (2019) Object detection in video with spatial-temporal context aggregation. arXiv preprint arXiv:1907.04988
- Luo H-W, Zhang C-S, Pan F-C, Ju X-M (2019) Contextual-yolov3: Implement better small object detection based deep learning. 2019 International conference on machine learning, big data and business intelligence (mlbdbi) (pp. 134–141)
- Lv Y, Zhang J, Dai Y, Li A, Liu B, Barnes N, Fan D-P (2021) Simultaneously localize, segment and rank the camouflaged objects. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 11591–11601)
- Ma S, An W, Yang X, Hou Z (2022) An object detection algorithm with multi-scale context information based on yolov4. 2022 4th International conference on natural language processing (icnlp) (pp. 13–19)
- Ma R, Fan W, Dong J, Hao Y, Shi W, Wu D (2024) Object detection algorithm of drone image combining attention mechanism and context. 2024 IEEE 4th international conference on electronic technology, communication and information (icetci) (pp. 209–213)
- Ma Y, Wang Y (2023) Feature refinement with multi-level context for object detection. Mach Vis Appl 34(4):49
- Ma C, Zhuo L, Li J, Zhang Y, Zhang J (2023) Occluded prohibited object detection in x-ray images with global context-aware multi-scale feature aggregation. Neurocomputing 519:1–16
- Mac Aodha O, Cole E, Perona P (2019) Presence-only geographical priors for fine-grained image classification. Proceedings of the IEEE/cvf international conference on computer vision (pp. 9596–9606)
- Marques O, Barenholtz E, Charvillat V (2011) Context modeling in computer vision: techniques, implications, and applications. Multimed Tools Appl 51:303–339
- Marr D, Nishihara HK (1978) Representation and recognition of the spatial organization of three-dimensional shapes. Proc R Soc London Ser B Biol Sci 200(1140):269–294
- Mei H, Yang X, Wang Y, Liu Y, He S, Zhang Q, Lau RW (2020) Don’t hit me! glass detection in real-world scenes. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 3687–3696)
- Mensink T, Gavves E, Snoek CG (2014) Costa: Co-occurrence statistics for zero-shot classification. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2441–2448)
- Miao S, Du S, Feng R, Zhang Y, Li H, Liu T, Fan W (2022) Balanced single-shot object detection using cross-context attention-guided network. Pattern Recogn 122:108258
- Miao C, Xie L, Wan F, Su C, Liu H, Jiao J, Ye Q (2019) Sixray: A large-scale security inspection x-ray benchmark for prohibited item discovery in overlapping images. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 2119–2128)
- Mishra A, Alahari K, Jawahar C (2013) Image retrieval using textual cues. Proceedings of the IEEE international conference on computer vision (pp. 3040–3047)
- Munneke J, Brentari V, Peelen MV (2013) The influence of scene context on object recognition is independent of attentional focus. Front Psychol 4:552
- Narasimhan SG, Nayar SK (2002) Vision and the atmosphere. Int J Comput Vis 48:233–254
- Nazir M, Haque HMU, Saleem K (2022) A semantic knowledge based context-aware formalism for smart border surveillance system. Mob Netw Appl 27(5):2036–2048
- Neseem M, Reda S (2021) Adacon: Adaptive context-aware object detection for resource-constrained embedded devices. 2021 IEEE/acm international conference on computer aided design (iccad) (pp. 1–9)
- Nie J, Pang Y, Zhao S, Han J, Li X (2020) Efficient selective context network for accurate object detection. IEEE Trans Circ Syst Video Technol 31(9):3456–3468
- Niu Y, Cheng W, Shi C, Fan S (2023) Yolov8-cgrnet: a lightweight object detection network leveraging context guidance and deep residual learning. Electronics 13(1):43
- Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42:145–175
- Oliva A, Torralba A (2007) The role of context in object recognition. Trends Cogn Sci 11(12):520–527
- Oreski G (2023) Yolo*c: adding context improves yolo performance. Neurocomputing 555:126655
- Osokin A, Sumin D, Lomakin V (2020) Os2d: One-stage one-shot object detection by matching anchor features. Computer vision–eccv 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part xv 16 (pp. 635–652)
- Pan J, Kanade T (2013) Coherent object detection with 3d geometric context from a single image. Proceedings of the IEEE international conference on computer vision (pp. 2576–2583)
- Panagakis Y, Kossaifi J, Chrysos GG, Oldfield J, Nicolaou MA, Anandkumar A, Zafeiriou S (2021) Tensor methods in computer vision and deep learning. Proc IEEE 109(5):863–890
- Peng J, Wang H, Yue S, Zhang Z (2022) Context-aware co-supervision for accurate object detection. Pattern Recogn 121:108199
- Perko R, Leonardis A (2010) A framework for visual-context-aware object detection in still images. Comput Vis Image Underst 114(6):700–711
- Qin Y, Gu X, Tan Z (2022) Visual context learning based on textual knowledge for image-text retrieval. Neural Netw 152:434–449
- Qiu H, Li H, Wu Q, Meng F, Xu L, Ngan KN, Shi H (2020) Hierarchical context features embedding for object detection. IEEE Trans Multimedia 22(12):3039–3050
- Qiu L, Xiong Z, Wang X, Liu K, Li Y, Chen G, Cui S (2022) Ethseg: An amodel instance segmentation network and a real-world dataset for x-ray waste inspection. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 2283–2292)
- Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie S (2007) Objects in context. 2007 IEEE 11th international conference on computer vision (pp. 1–8)
- Ran S, Duan D, Peng L, Hu F, Zhong W (2023) A few-shot object detection method based on instance context. 2023 China automation congress (cac) (pp. 9247–9252)
- Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788)
- Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263–7271)
- Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767
- Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst, 28
- Ren S, He K, Girshick R, Sun J (2016) Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Article Google Scholar - Roh B, Shin J, Shin W, Kim S (2021) Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv preprint arXiv:2111.14330
- Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115:211–252
Article MathSciNet Google Scholar - Russell B, Torralba A, Liu C, Fergus R, Freeman W (2007) Object recognition by scene alignment. Adv Neural Inf Process Syst, 20
- Saha U, Ahamed IU, Hossain MI (2024) Yolov8 for bangla license plate recognition: Advancing real-time object detection in localized contexts. In: 2024 7th International conference on informatics and computational sciences (icicos) (pp. 413–418)
- Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229
- Shaw GA, Burke HK (2003) Spectral imaging for remote sensing. Lincoln Lab J 14(1):3–28
Google Scholar - Shelhamer E, Long J, Darrell T et al (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651
Article Google Scholar - Shen Z, Liu Z, Li J, Jiang Y-G, Chen Y, Xue X (2017) Dsod: Learning deeply supervised object detectors from scratch. Proceedings of the IEEE international conference on computer vision (pp. 1919–1927)
- Shrivastava A, Gupta A (2016) Contextual priming and feedback for faster r-cnn. Computer vision–eccv 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, proceedings, part i 14 (pp. 330–348)
- Singhal A, Luo J, Zhu W (2003) Probabilistic spatial context models for scene content understanding. 2003 IEEE computer society conference on computer vision and pattern recognition, 2003. proceedings. (Vol. 1, pp. I–I)
- Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos. Proceedings ninth IEEE international conference on computer vision (pp. 1470–1477)
- Skurowski P, Abdulameer H, Błaszczyk J, Depta T, Kornacki A, Kozieł P (2018) Animal camouflage analysis: Chameleon database. Unpublished manuscript, 2(6), 7
- Song Z, Kang X, Wei X, Li S (2023) Pixel-centric context perception network for camouflaged object detection. IEEE Trans Neural Netw Learn Syst
- SRMIST (2023) Utdac2020 dataset [Open Source Dataset]. Roboflow. https://universe.roboflow.com/srmist-vx65l/utdac2020-kkoqh (visited on 2024-10-21)
- Stapic Z, López EG, Cabot AG, de Marcos Ortega L, Strahonja V (2012) Performing systematic literature review in software engineering. Central European conference on information and intelligent systems (p. 441)
- Sun M, Bao Y, Savarese S (2010) Object detection with geometrical context feedback loop. Bmvc (Vol. 1, p. 2)
- Sun Y, Chen G, Zhou T, Zhang Y, Liu N (2021) Context-aware cross-level fusion network for camouflaged object detection. arXiv preprint arXiv:2105.12555
- Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 10781–10790)
- Tian Z, Shen C, Chen H, He T (2019) Fcos: Fully convolutional one-stage object detection. Proceedings of the IEEE/cvf international conference on computer vision (pp. 9627–9636)
- Tian Y, Shi J, Li B, Duan Z, Xu C (2018) Audio-visual event localization in unconstrained videos. Proceedings of the European conference on computer vision (eccv) (pp. 247–263)
- Tong K, Wu Y, Zhou F (2020) Recent advances in small object detection based on deep learning: a review. Image Vis Comput 97:103910
Article Google Scholar - Unal ME, Kovashka A (2021) Context for object detection via lightweight global and mid-level representations. 2020 25th international conference on pattern recognition (icpr) (pp. 8423–8430)
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst, 30
- Venkataramanan A, Laviale M, Figus C, Usseglio-Polatera P, Pradalier C (2021) Tackling inter-class similarity and intra-class variance for microscopic image-based classification. International conference on computer vision systems (pp. 93–103)
- Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. cvpr 2001 (Vol. 1, pp. I–I)
- Wang S, Bai M, Mattyus G, Chu H, Luo W, Yang B, Urtasun R (2017) Torontocity: Seeing the world with a million eyes. 2017 IEEE international conference on computer vision (iccv) (pp. 3028–3036)
- Wang C-Y, Bochkovskiy A, Liao H-YM (2023) Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 7464–7475)
- Wang N, Cai A, Zhang S (2018) The study of rnn enhanced convolutional neural network for fast object detection based on the spatial context multi-fusion features. 2018 11th international symposium on computational intelligence and design (iscid) (Vol. 1, pp. 136–140)
- Wang S, Fidler S, Urtasun R (2015) Holistic 3d scene understanding from a single geo-tagged image. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3964–3972)
- Wang T, He X, Cai Y, Xiao G (2019) Learning a layout transfer network for context aware object detection. IEEE Trans Intell Transp Syst 21(10):4209–4224
Article Google Scholar - Wang J, Hu X (2021) Convolutional neural networks with gated recurrent connections. IEEE Trans Pattern Anal Mach Intell
- Wang B, Ji R, Zhang L, Wu Y (2022) Bridging multi-scale context-aware representation for object detection. IEEE Trans Circ Syst Video Technol
- Wang C-Y, Liao H-YM (2024) Yolov9: Learning what you want to learn using programmable gradient information
- Wang Y, Ma Y (2022) Multi-scale context enhancement network for object detection. 2022 IEEE 2nd international conference on software engineering and artificial intelligence (seai) (pp. 6–11)
- Wang A, Sun Y, Kortylewski A, Yuille AL (2020) Robust object detection under occlusion with context-aware compositionalnets. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 12645–12654)
- Wang H, Tang J, Liu X, Guan S, Xie R, Song L (2022) Ptseformer: Progressive temporal-spatial enhanced transformer towards video object detection. European conference on computer vision (pp. 732–747)
- Wang H, Xu J, Li L, Tian Y, Xu D, Xu S (2018) Multi-scale fusion with context-aware network for object detection. 2018 24th international conference on pattern recognition (icpr) (pp. 2486–2491)
- Wang C, Yeh I, Liao H (2021) You only learn one representation: Unified network for multiple tasks. arXiv preprint arXiv:2105.04206
- Wang X, Zhu Z (2023) Context understanding in computer vision: a survey. Comput Vis Image Underst 229:103646
Article Google Scholar - Wei Y, Ma Y (2022) Zero-shot object detection with multi-label context. Seke (pp. 142–146)
- Wei Y, Tao R, Wu Z, Ma Y, Zhang L, Liu X (2020) Occluded prohibited items detection: An x-ray security inspection benchmark and de-occlusion attention module. Proceedings of the 28th acm international conference on multimedia (pp. 138–146)
- Wen L, Du D, Cai Z, Lei Z, Chang M-C, Qi H, Lyu S (2020) Ua-detrac: a new benchmark and protocol for multi-object detection and tracking. Comput Vis Image Underst 193:102907
Article Google Scholar - Wen Y, Ke W, Sheng H (2024) Camouflaged object detection based on deep learning with attention-guided edge detection and multi-scale context fusion. Appl Sci 14(6):2494
Article Google Scholar - Wertheimer M (2017) Investigations into the doctrine of shape. Gestalt Theory 39(1):79–89
Article Google Scholar - Wu B, Iandola F, Jin PH, Keutzer K (2017) Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 129–137)
- Wu Y, Lim J, Yang M-H (2013) Online object tracking: A benchmark. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2411–2418)
- Wu K, Zhang Y, Xie Z, Guo D, An X (2021) Ddfpn: Context enhanced network for object detection. Future Gener Comput Syst 124:133–141
Article Google Scholar - Xi Y, Zheng J, He X, Jia W, Li H, Xie Y, Li X (2020) Beyond context: exploring semantic similarity for small object detection in crowded scenes. Pattern Recogn Lett 137:53–60
Article Google Scholar - Xia G-S, Bai X, Ding J, Zhu Z, Belongie S, Luo J, Zhang L (2018) Dota: A large-scale dataset for object detection in aerial images. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3974–3983)
- Xia Y, He Y, Hao X, Yin B (2020) Context-based feature fusion network for object detection. Proceedings of the 4th international conference on advances in image processing (pp. 15–20)
- Xiang Y, Mottaghi R, Savarese S (2014) Beyond pascal: A benchmark for 3d object detection in the wild. IEEE winter conference on applications of computer vision (pp. 75–82)
- Xiao J, Chen T, Hu X, Zhang G, Wang S (2023) Boundary-guided context-aware network for camouflaged object detection. Neural Comput Appl 35(20):15075–15093
Article Google Scholar - Xiao J, Guo H, Zhou J, Zhao T, Yu Q, Chen Y, Wang Z (2023) Tiny object detection with context enhancement and feature purification. Expert Syst Appl 211:118665
Article Google Scholar - Xiao Y, Wang X, Zhang P, Meng F, Shao F (2020) Object detection based on faster r-cnn algorithm with skip pooling and fusion of contextual information. Sensors 20(19):5490
Article Google Scholar - Xiao Z, Xie P, Wang G (2021) Ecnet: Edge-aware context-aggregation network for transparent and reflective object detection. 2021 IEEE international conference on artificial intelligence and computer applications (icaica) (pp. 77–81)
- Xie S, Liu C, Gao J, Li X, Luo J, Fan B, Peng Y (2020) Diverse receptive field network with context aggregation for fast object detection. J Vis Commun Image Represent 70:102770
Article Google Scholar - Xu X, Luo X, Ma L (2020) Context-aware hierarchical feature attention network for multi-scale object detection. 2020 IEEE international conference on image processing (icip) (pp. 2011–2015)
- Yang H, Cai S, Deng B, Ye J, Lin G, Zhang Y (2024) Context-aware and semantic-consistent spatial interactions for one-shot object detection without fine-tuning. IEEE Trans Circ Syst Video Technol
- Yang Y, Chen L, Zhang J, Long L, Wang Z (2023) Ugc-yolo: underwater environment object detection based on yolo with a global context block. J Ocean Univ China 22(3):665–674
Article Google Scholar - Yang A, Lin S, Yeh C-H, Shu M, Yang Y, Chang X (2023) Context matters: distilling knowledge graph for enhanced object detection. IEEE Trans Multimedia
- Yang S, Luo P, Loy C-C, Tang X (2016) Wider face: A face detection benchmark. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5525–5533)
- Yang Z, Wang Y, Chen X, Liu J, Qiao Y (2020) Context-transformer: Tackling object confusion for few-shot detection. Proceedings of the aaai conference on artificial intelligence (Vol. 34, pp. 12653–12660)
- Yu F, Chen H, Wang X, Xian W, Chen Y, Liu F, Darrell T (2020) Bdd100k: A diverse driving dataset for heterogeneous multitask learning. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 2636–2645)
- Yu T, Chen C, Zhou Y, Hu X (2022) Improving surveillance object detection with adaptive omni-attention over both inter-frame and intra-frame context. Proceedings of the asian conference on computer vision (pp. 2697–2712)
- Zagoruyko S, Lerer A, Lin T-Y, Pinheiro PO, Gross S, Chintala S, Dollár P (2016) A multipath network for object detection. arXiv preprint arXiv:1604.02135
- Zeng X, Li Z, Zhang W (2021) An adaptive learning-based weakly supervised object detection via context awareness. 2021 2nd international conference on big data & artificial intelligence & software engineering (icbase) (pp. 331–335)
- Zhang H, Chen E (2024) Bi-afn++ca: Bi-directional adaptive fusion network combining context augmentation for small object detection. Appl Intell 54(1):614–628
Article Google Scholar - Zhang W, Dong C, Zhang J, Shan H, Liu E (2022) Adaptive context- and scale-aware aggregation with feature alignment for one-shot object detection. Neurocomputing 514:216–230
Article Google Scholar - Zhang W, Fu C, Xie H, Zhu M, Tie M, Chen J (2021) Global context aware rcnn for object detection. Neural Comput Appl 33:11627–11639
Article Google Scholar - Zhang Z, Gong P, Sun H, Wu P, Yang X (2023) Dynamic local and global context exploration for small object detection. Icassp 2023-2023 IEEE international conference on acoustics, speech and signal processing (icassp) (pp. 1–5)
- Zhang J, Han F, Chun Y, Liu K, Chen W (2021) Detecting objects from no-object regions: a context-based data augmentation for object detection. Int J Comput Intell Syst 14(1):1871–1879
Article Google Scholar - Zhang C, Kim J (2019) Modeling long- and short-term temporal context for video object detection. 2019 IEEE international conference on image processing (icip) (pp. 71–75)
- Zhang T-Y, Li J, Chai J, Zhao Z-Q, Tian W-D (2022) Improved yolov5 network with attention and context for small object detection. International conference on intelligent computing (pp. 341–352)
- Zhang L, Wang Y, Chen H, Li J, Zhang Z (2021) Visual relationship detection with region topology structure. Inf Sci 564:384–395
Article MathSciNet Google Scholar - Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4203–4212)
- Zhang S, Wu G, Costeira JP, Moura JM (2017) Understanding traffic density from large-scale web camera data. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5898–5907)
- Zhao L, Liu C, Qu H (2022) Transmission line object detection method based on contextual information enhancement and joint heterogeneous representation. Sensors 22(18):6855
Article Google Scholar - Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881–2890)
- Zheng W-S, Gong S, Xiang T (2011) Quantifying and transferring contextual information in object detection. IEEE Trans Pattern Anal Mach Intell 34(4):762–777
Article Google Scholar - Zhong Y, Han X, Zhang L (2018) Multi-class geospatial object detection based on a position-sensitive balancing framework for high spatial resolution remote sensing imagery. ISPRS J Photogramm Remote Sens 138:281–294
Article Google Scholar - Zhong Y, Jia T, Zhao J, Wang X, Jin S (2017) Spatial-spectral-emissivity land-cover classification fusing visible and thermal infrared hyperspectral imagery. Remote Sens 9(9):910
Article Google Scholar - Zhong Q, Li C, Zhang Y, Xie D, Yang S, Pu S (2020) Cascade region proposal and global context for deep object detection. Neurocomputing 395:170–177
Article Google Scholar - Zhu C, Chen F, Ahmed U, Shen Z, Savvides M (2021) Semantic relation reasoning for shot-stable few-shot object detection. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (pp. 8782–8791)
- Zhu H, Chen X, Dai W, Fu K, Ye Q, Jiao J (2015) Orientation robust object detection in aerial images using deep convolutional neural network. 2015 IEEE international conference on image processing (icip) (pp. 3735–3739)
- Zhu Z, Liang D, Zhang S, Huang X, Li B, Hu S (2016) Traffic-sign detection and classification in the wild. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2110–2118)
- Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159
- Zhu H, Wei H, Li B, Yuan X, Kehtarnavaz N (2020) A review of video object detection: datasets, metrics and methods. Appl Sci 10(21):7834
Article Google Scholar - Zhu P, Wen L, Du D, Bian X, Fan H, Hu Q, Ling H (2021) Detection and tracking meet drones challenge. IEEE Trans Pattern Anal Mach Intell 44(11):7380–7399
Article Google Scholar - Zolghadr E, Furht B (2016) Context-based scene understanding. Int J Multimed Data Eng Manag (IJMDEM) 7(1):22–40
Article Google Scholar