Object Detection in Video Surveillance Based on Multiscale Frame Representation and Block Processing by a Convolutional Neural Network

A method for detecting objects in high-resolution images is proposed that is based on representing an image as a set of its copies of decreasing scale, splitting each level of the image pyramid except for the top one into overlapping blocks, detecting objects in the blocks, and analyzing objects at the boundaries of adjacent blocks to merge them. The number of pyramid levels is determined by the size of the image and of the input layer of the convolutional neural network (CNN). At all levels except for the top one, a block splitting is performed, and the use of overlap improves the correct classification of objects that are divided into fragments located in adjacent blocks. The decision to merge such fragments is made based on the analysis of the intersection-over-union metric and membership in the same class. The proposed approach is evaluated on 4K and 8K images. To carry out experiments, a database is prepared with annotated objects of two classes, person and vehicle, in such images. Networks of the You Only Look Once (YOLO) family of the third and fourth versions are used as CNNs. A quantitative assessment of the object detection efficiency is performed using the mAP metric for various combinations of parameters, such as the threshold confidence level of the CNN and the percentage of intersection of blocks in the hierarchical representation of images. The results of the investigations are presented.


INTRODUCTION
The development of digital video cameras has led to the ever-increasing use of devices that form images and videos in 4K and larger formats. The larger side of a 4K image can range from 3600 to 4500 pixels and, for 8K, from 7600 to 8500 pixels [5]. Moreover, the quality of rendering the features of objects increases significantly, including small-size objects and objects located at a considerable distance from the camera during shooting.
To detect objects, one either performs a comparison with a reference set of features or uses convolutional neural networks (CNNs). The first approach involves comparing the input image with features of previously prepared images, such as the brightness level of each pixel or features formed according to certain rules, for example, histograms of oriented gradients, Haar features, etc. [4]. Since the choice of an optimal set of informative features for the accurate description of arbitrary images is not a completely solved problem, this approach requires the development and testing of adapted algorithms to form object descriptors for each applied problem. Therefore, in recent years, thanks to the rapid development of computing technologies, a different approach has been increasingly used, based on CNNs pretrained on large image databases [7]. The convolution operations with different types of filters and the set of layers in their architecture make it possible to form effective feature sets, whose size decreases from the input to the output of the CNN, for describing images of objects with a fuzzy structure and many variations within a class. However, at the first stage of applying a CNN, the input image is scaled to the size of the input layer, which reduces the information content of objects and, in the case of small-size objects, makes their detection in the image impossible. Therefore, the development of methods and algorithms for detecting small-size objects in high-resolution images is an urgent task.
In this work, we propose a method for detecting objects in high-resolution images on the basis of pyramidal-block analysis and the use of CNNs for detecting objects in each block. This method does not impose restrictions on the CNN model and improves the detection accuracy compared to the baseline model. Examples of 4K and 8K resolution images show that the method provides an increase in the mAP for CNNs of the You Only Look Once (YOLO) family: YOLOv3 and YOLOv4.

EXISTING METHODS OF OBJECT DETECTION IN HIGH-RESOLUTION IMAGES
In [18], the authors proposed an approach for localizing objects in high-resolution images that is based on block processing with overlapping and the use of a CNN to distinguish areas of objects in each block. The main disadvantage of this method is the high probability of combining several different objects located on the boundaries of the processed blocks into a single area and missing an object due to its fragmentation into blocks.
In [22], the authors presented an algorithm that is fed the original image scaled to a size of 227 × 227. Objects are detected on the reduced copy using the AlexNet CNN. To detect objects of smaller size, the original image was divided into blocks of sizes ranging from 600 down to 75 pixels and processed by the spatial correlation network (SC-Net). The overlap in this case varied from 25 to 50% of the block size. For this algorithm, the authors obtained a recall of 80% on the SUN2012 database [20]. However, the recall metric does not account for false positives, and the database used by the authors contained almost no 4K images.
In [15], an approach for human detection was considered, in which the input image was divided into large intersecting blocks, and, when people were detected, the image was divided into smaller blocks to refine the bounding boxes of the detected regions of interest. The regions were combined based on the vertical shape of a human body. This means that the algorithm was not adapted for other classes of objects. The accuracy of the algorithm was estimated using the PEVID-UHD database [8] and a closed database containing a slightly larger number of objects. For the first dataset, the average accuracy was 90.7% and, for the second, 75.4%. However, the use of block overlap in the range from 20 to 50 pixels is not sufficient in some cases; moreover, it is not adaptive to the size of the input image.
In [17], Unel et al. proposed an algorithm for detecting small-size objects of the classes "vehicle" and "person" in 2K images of size 1920 × 1080 obtained from small unmanned aerial vehicles, intended for further implementation on the graphics processors of mobile devices. In this case, a PeleeNet CNN was used for block processing of the original image with a mutual overlap of 25%. The test results obtained on the VisDrone2018 database show that the mAP metric reaches a value of 36.67% for this algorithm. However, the authors did not assess the effect of block overlap on the accuracy of the algorithm. The number of missed objects of the class person was significantly higher than for the class vehicle. Hence, the performance of this algorithm in detecting small-size objects is much worse even in images with resolution below 4K.
To solve the problem posed, we should distinguish the following modifications of CNN models among the well-known models differing in architecture, accuracy, and efficiency: region-based convolutional neural networks (R-CNNs) [2], residual networks (ResNet) [6], and YOLO [12]. The first version of the R-CNN model is based on preliminary selection of regions in the image, calculation of features using a CNN (for example, AlexNet), and the use of a classifier to identify objects. This model was developed further in the Fast R-CNN [3] and Faster R-CNN [14] versions. The Fast R-CNN architecture is aimed at reducing the number of regions in the image by applying an RoI (region of interest) pooling layer. In this layer, max pooling is applied to reduce the dimension, so that all regions of potential objects have the same fixed dimension. In Faster R-CNN, two steps are executed. In the first of them, one applies a deep fully connected network (region proposal network) to determine regions in the image that may contain objects. In the second step, the Fast R-CNN detector is used, which searches for objects in the proposed regions. The Faster R-CNN model is not robust to noisy images and requires significant computational costs.
The ResNet CNN architecture uses shortcut connections to minimize the performance degradation with an increase in the number of CNN layers, provided that the accuracy limit is reached in some layer of the network. The error rate is 3.57% in the top-5 metric. In [16], the Inception-ResNet model is presented as a development of the Inception model by introducing shortcut connections, which, if necessary, allow one to skip a layer and, accordingly, nullify its effect on the result of the detector operation. This makes it possible to change the network architecture so that a finite number of layers is determined for a particular problem during training, which, in turn, allows one to reduce the error rate to 3.1% in the top-5 metric. However, the presented models are characterized by high time costs. Therefore, to solve many problems, one uses truncated versions of these architectures with a reduced number of layers (for example, ResNet-34, ResNet-50, etc.), which, however, do not achieve the accuracy of the baseline model.
The YOLO family of CNNs consists of one-pass networks and makes it possible to localize and identify objects with a single model. Among the existing versions of these CNNs, according to the results of testing in the top-5 metric, YOLO of the third version is distinguished by high accuracy (93.8%) [13]; for feature extraction, it uses an improved Darknet-53 architecture containing 53 layers and 23 residual connections. The next version of this CNN is YOLOv4 [1].
The structure of YOLOv4 can be divided into three main blocks: the feature extraction block, which reveals the characteristic features of objects in the input image; the block for collecting feature maps, which gathers and transmits feature maps from different levels of the neural network; and the detection and classification block, which generates output feature maps at different scales, thus allowing one to predict the coordinates of the bounding boxes and classify the contents of each cell of the input image. This structure of the YOLOv4 network is aimed at increasing the running speed and optimizing parallel computing. Accordingly, these modifications of the YOLO CNN are promising for processing high-resolution images.

METHOD OF OBJECT DETECTION ON THE BASIS OF MULTISCALE REPRESENTATION AND BLOCK PROCESSING BY A CONVOLUTIONAL NEURAL NETWORK
The proposed method requires a pyramidal representation of the original image as a set of its copies of decreasing scale. As one moves up the pyramid, the scale (size and resolution) of the image decreases. At each level of the pyramid, the image is divided into blocks, in which detection is performed using a CNN. To ensure high efficiency of detecting small-size objects, it is necessary to minimize their splitting into parts before processing; that is, it is important that each object fall entirely within at least one of the blocks fed to the CNN input. Therefore, overlapping blocks are used. In this case, the higher the overlap value, the larger the objects that can be detected without being divided into parts, which, however, requires additional computational costs. After processing by the CNN, a procedure is needed for combining the regions obtained at all levels.
Based on this method, the object detection algorithm for high-resolution images can be formalized as follows.
Step 1. Initialization of the original data: the size of the input image, the type of CNN used, and the size of the input layer.
Step 2. Determination of the number of levels P of the image pyramid, taking into account that the dimensions of the top level should be close to the dimensions of the input layer of the CNN, according to the formula

P = [log₂(max(W, H)/S)] + 1,

where W and H are the width and height of the input image, S is the size of the input layer of the CNN, and [*] denotes the nearest integer.
This approach eliminates the fragmentation of objects at the minimum scale and allows one to increase their correct classification as a whole.
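Step 2 can be sketched as follows, under the assumption (made here for illustration) that each pyramid level halves the image until the larger side approaches the CNN input-layer size; the function and variable names are not from the paper:

```python
from math import log2

def pyramid_levels(width, height, input_size):
    """Number of pyramid levels: halve the image until the larger side
    is close to the CNN input-layer size; round(...) plays the role of
    the nearest-integer operator [*]."""
    return round(log2(max(width, height) / input_size)) + 1
```

For a 3840 × 2160 frame and a 1024 × 1024 input layer this gives three levels, and for a 7680 × 4320 frame it gives four.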
Step 3. Decomposition of the image at level p into overlapping blocks. The number of blocks is determined by the equalities

n_x = (W_p − S)/d + 1,  n_y = (H_p − S)/d + 1,

where p = 1, …, P is the number of the pyramid level, W_p and H_p are the dimensions of the image at this level, and d is the shift of a block (in pixels). When n_x or n_y is rounded downward, the last incomplete block is combined with the preceding block and is scaled to the size of the input layer of the CNN, while, when rounding upward, this block is filled with zeros.
The overlap of blocks can be calculated by the formula

α = (S − d)/S × 100%.

Step 4. Detection of candidate regions that may contain an object or a fragment of an object. To do this, each block is processed by the CNN, and each detected candidate region is described by the feature set

R = {x₁, y₁, x₂, y₂, Cl, c},

where x₁ and y₁ are the coordinates of the upper left corner of the found region in the original image, x₂ and y₂ are the coordinates of the lower right corner of the found region in the original image, Cl is the class of the selected object, and c is the CNN confidence score.
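The block decomposition of Step 3 can be sketched as follows; this is a minimal illustration in which blocks have the input-layer size and are shifted by a fixed number of pixels, and the last incomplete block is simply aligned to the image border instead of being merged with the preceding block or zero-padded as described above:

```python
def block_origins(dim, block, shift):
    """Top-left coordinates of overlapping blocks along one image axis."""
    origins = list(range(0, dim - block + 1, shift))
    if origins[-1] + block < dim:
        # simplified handling of the last incomplete block:
        # align it to the image border so nothing is left uncovered
        origins.append(dim - block)
    return origins

def overlap_percent(block, shift):
    """Mutual overlap of adjacent blocks, in percent."""
    return 100.0 * (block - shift) / block
```

For example, a 4096-pixel-wide level split into 1024-pixel blocks with a shift of 768 pixels yields five block positions per row with a 25% overlap.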
In the block processing of the detection results, it is possible that the same object or its fragments are detected with a shift of coordinates at different scales of the image pyramid or in adjacent blocks. In addition, the features of a smaller fragment of an object may differ from the features of the whole object and may even be closer to the descriptors of a different class. Therefore, the merging process additionally involves postprocessing. To this end, it is proposed to use a rule according to which all regions obtained in the previous step are analyzed, and two regions are merged if they belong to the same class and

Δx ≤ T_x,  Δy ≤ T_y,

where Δx and Δy are the differences between the dimensions of the candidate regions along the x and y axes, respectively, and T_x and T_y are threshold levels.
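The merging rule can be sketched as follows; the condition combines membership in the same class, a positive intersection over union, and size-difference thresholds, with names such as `should_merge`, `t_x`, and `t_y` introduced here for illustration rather than taken from the paper:

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def should_merge(ra, rb, t_x, t_y, t_iou=0.0):
    """Merge two candidate regions (x1, y1, x2, y2, class) if they share
    a class, overlap, and are close in size along both axes."""
    (ax1, ay1, ax2, ay2, a_cls) = ra
    (bx1, by1, bx2, by2, b_cls) = rb
    dx = abs((ax2 - ax1) - (bx2 - bx1))  # size difference along x
    dy = abs((ay2 - ay1) - (by2 - by1))  # size difference along y
    return (a_cls == b_cls
            and iou((ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2)) > t_iou
            and dx <= t_x and dy <= t_y)
```

In practice the thresholds would be tuned together with the overlap α, since larger overlaps produce more duplicated fragments to merge.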

METHOD OF INVESTIGATION
To carry out experiments, we developed a technique that implements the proposed method and implies further analysis of the detected objects preliminarily marked in the input image. The preselected objects are described by their coordinates in the image and the membership in the classes person or vehicle. The class vehicle is a composite class, because it includes such objects as bus, car, and truck. The preparation of annotated databases of 4K and 8K images is performed using the labelImg tool [9]. Figure 1 shows an example of object annotation for an 8K image.
The main stages of the method are demonstrated in Fig. 2 and include the following: formation of an image pyramid until the dimensions of the top level approach the dimensions of the input layer of the CNN; division into overlapping blocks at all levels of the image representation; application of the CNN to detect objects in each block; localization of the detected objects in frame coordinates; and post-processing of the results obtained in the previous step to analyze objects at the boundaries of blocks and describe each object by its coordinates and class. If the overlap between an annotated object and a detected object is > 0.5 and the objects belong to the same class, then the object is considered correctly detected.
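The localization of block-level detections in frame coordinates, one of the stages listed above, amounts to shifting a box by the block origin and undoing the pyramid downscaling; the helper below is an illustrative sketch, not the authors' implementation:

```python
def to_frame_coords(box, block_origin, scale):
    """Map a box detected inside a block back to original-frame coordinates.
    `scale` is the downscaling factor of the pyramid level (1, 2, 4, ...);
    `block_origin` is the block's top-left corner at that level."""
    x1, y1, x2, y2 = box
    ox, oy = block_origin
    return ((x1 + ox) * scale, (y1 + oy) * scale,
            (x2 + ox) * scale, (y2 + oy) * scale)
```

For a box (10, 20, 30, 40) found in a block with origin (100, 200) at a level downscaled by a factor of 2, this returns (220, 440, 260, 480).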
The quality of the object detection algorithm is assessed using the mean average precision (mAP) metric, which implies averaging the values of the known average precision (AP) metric over classes.
At the first stage, the numbers of true positive (TP) and false positive (FP) detections are calculated based on the Intersection over Union metric for the rectangles describing the found objects and those annotated in the images of the database. Then the precision and recall are calculated:

Precision = TP/(TP + FP),  Recall = TP/(TP + FN),

where FN is the number of omissions.
Based on the obtained values, the average precision AP is calculated for each class over eleven threshold levels of recall r ∈ {0, 0.1, …, 1}. Thus, for each of the eleven levels of r, we determine the interpolated value of precision, and the AP is formed by averaging these values. At the last stage, when calculating the mAP, the AP values obtained at the previous step are averaged over the two classes of objects.
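The eleven-point AP described above can be sketched as follows, assuming the standard PASCAL-style interpolation (the maximum precision among points with recall at least r) at recall levels r ∈ {0, 0.1, …, 1}; the function name is illustrative:

```python
def eleven_point_ap(precisions, recalls):
    """Average the interpolated precision over the eleven recall levels."""
    ap = 0.0
    for i in range(11):
        r = i / 10.0
        # interpolated precision: best precision among points with recall >= r
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += max(candidates, default=0.0) / 11.0
    return ap
```

The mAP is then the mean of the per-class AP values, here over the classes person and vehicle.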
To make the best use of the capabilities of the CNN when detecting and identifying small-size objects, it is necessary to determine the threshold confidence level of the CNN and the size of the input layer that ensure the highest performance. The results of experiments have confirmed that, to ensure high accuracy when detecting small objects in high-resolution images, it is necessary to minimize their scaling. For example, for an input layer size of 608 × 608, the mAP value is 0.21, while, for a size of 1024 × 1024, the highest value of mAP = 0.47 is achieved on a test set of images with 4K resolution, the threshold confidence level being T = 50%.

RESULTS OF EXPERIMENTS
The proposed method was implemented in the Python programming language using the PyTorch machine learning framework, with the help of which the main elements of the CNN architecture were synthesized; this made it possible to conduct experiments to assess the detection accuracy. Basic operations on images are implemented using the OpenCV computer vision library. To increase the detection speed, batch processing of image blocks was performed for each pyramid level using the CUDA technology, which allows parallel processing of data on a graphics processor.

Detection of Objects in 4K Images
In total, we annotated 6049 objects of two classes, person and vehicle, with sizes from 24 × 14 to 308 × 763 pixels in 4K images. Then, we evaluated the accuracy of the algorithm for different block overlap values α and threshold confidence levels T of the CNN, varied with a step of 5%. Note that, for the minimum scale of representation of the original image, the threshold value corresponds to the most effective one for the CNN; in this case, T = 50%. The results of the experiments are presented in Table 1. The analysis of these results shows that, for any values of α and T, the mAP of the algorithm exceeds the mAP of the CNN. The maximum gain, mAP = 0.759, is obtained for the best combination of these parameters. Figure 3 demonstrates fragments of processed 4K images from the prepared database. Using the algorithm, we detected nine objects (Fig. 3a), the minimum size of one of these objects being 33 × 13, whereas the application of the YOLOv3 CNN alone allowed us to detect an object of size 149 × 54 (Fig. 3b).
The performance of the algorithm was also tested on other 4K images that were not included in the annotated database. Figure 4a shows an example of object detection in a 4K video frame using the YOLOv3 CNN with an input layer size of 1024 × 1024. In this case, 50 objects were detected, the minimum size of the objects being 41 × 20 pixels. Figure 4b shows the result of object detection by the algorithm with the parameters determined during the experiments: an input layer size of 1024 × 1024 and the number of pyramid levels, overlap α, and threshold T found above. In this case, 80 objects were detected, the minimum size being 33 × 11. Figure 4 also shows that the maximum gain in detection is achieved precisely for small-size objects, which are located in the upper and right parts of the image.
Analysis of the results of detection at different image scales (Figs. 5a-5c) and of the combination of regions (Fig. 5d) for enlarged fragment 2 in Fig. 4b, which contains three objects, confirms that pyramidal processing makes it possible to detect objects of different sizes at different levels and increases the accuracy of the algorithm as a whole. The YOLOv3 CNN did not detect a person in the selected fragment (Fig. 4a). Figure 6 shows an example of merging candidate regions at the fifth step of the algorithm for a fragment of Fig. 4b, which testifies to the efficiency of this stage.
After the fourth step of the algorithm, six candidate regions were found (Fig. 6a); the application of the merging procedure allowed us to correctly localize a person and improve the result for a car (Fig. 6b), and further post-processing made it possible to filter out the false positive localization of the front part of the car. The YOLOv3 CNN model did not detect a person in this region of the video frame (see Fig. 4a).

Detection of Objects in 8K Images
On the set of 8K images from [10, 11, 19], 4200 objects were annotated, the smallest of them being of size 38 × 10. The YOLOv4 CNN was implemented on the PyTorch framework, and files with the weighting coefficients from [21] were used. Table 2 presents the mAPs determined for various combinations of the parameters: the threshold confidence level T of the CNN and the percentage of mutual intersection of blocks in the pyramidal image representation. Table 2 shows that, under the given conditions, the optimal combination of parameters is T = 70% and α = 30%, for which the maximum value mAP_max = 0.6087 is obtained, the size of the smallest detected object being 39 × 17. When using YOLOv4 alone with an input layer of 1024 × 1024 and T = 70%, mAP = 0.295 and the size of the smallest detected object is 44 × 27. Thus, on the database of 8K images, the proposed approach allows one to detect objects of smaller size and to increase the detection efficiency by more than two times. Using the YOLOv4 CNN, we detected 50 objects in the image (Fig. 7b) with a minimum size of 319 × 124 pixels, while, using the considered technique, we detected 132 objects (Fig. 7c), the smallest of them being 42 × 36 pixels. In addition, the advantage of the proposed technique for 8K images is that it can significantly increase the detection efficiency when many objects are located close together and overlap each other. In this case, the features of the objects differ significantly from those obtained as a result of training the CNN, and the reduction of an 8K image to the size of the input layer further reduces the information content of such objects. The applied technique, using block and hierarchical processing, preserves the original features of each object and ensures their correct detection. Figure 8 demonstrates examples of detecting closely spaced people in an 8K image.
The use of the YOLOv4 CNN made it possible to detect one person with a size of 235 × 110 (Fig. 8a), and the approach proposed ensured the detection of 210 objects, the minimum size being 41 × 21 (Fig. 8b).

CONCLUSIONS

We have proposed an algorithm for detecting objects in high-resolution (4K and 8K) images, which includes the following basic steps: forming an image pyramid until the dimensions of the top level approach the dimensions of the input layer of the CNN; division into overlapping blocks at all levels of the image representation; the use of a CNN to detect objects in each block; and post-processing of the results obtained in the previous step for the analysis of objects at the boundaries of blocks.
We used two CNN architectures of the YOLO family, of the third and fourth versions, with an input layer size of 1024 × 1024 to ensure high efficiency of detecting small objects in images with 4K and 8K resolution. The proposed method has been implemented in Python using the PyTorch machine learning framework, the OpenCV computer vision library, and the CUDA technology. To carry out experiments evaluating the effectiveness of the method, we prepared a set of high-resolution images with annotated small and medium objects of two classes: person and vehicle. The experiments have shown that the mAP performance score can be more than doubled for the data used. Thus, the results obtained indicate that the approach under consideration is promising for solving applied problems of detecting small objects in high-resolution images. In some cases, when objects of the same class are located near each other, our algorithm may mistakenly combine them. Therefore, our forthcoming work will be aimed at solving this problem.

COMPLIANCE WITH ETHICAL STANDARDS
This article is a completely original work of its authors; it has not been published before and will not be submitted to other publications unless the PRIA Editorial Board decides not to accept it for publication.