PS-Net: Progressive Selection Network for Salient Object Detection

Low-level features contain abundant details, while high-level features carry rich semantic information; integrating multi-scale features appropriately is therefore significant for salient object detection (SOD). However, the direct concatenation or addition adopted by most methods ignores the differing contributions of multi-scale features. Besides, most SOD models fail to dynamically adjust their receptive fields to fit objects of various sizes. To tackle these problems, we propose a Progressive Selection Network (PS-Net) that selects features progressively at multiple levels: it dynamically extracts high-level features and encourages them to guide low-level features so as to suppress the background response of the original features. First, we propose a Pyramid Feature Dynamic Extraction (PFDE) module that selects appropriate receptive fields to extract high-level features through cascaded Feature Dynamic Extraction (FDE) modules. Besides, a Self-Interaction Attention (SIA) module is designed to extract detailed information from low-level features. Finally, we design a Scale Aware Fusion (SAF) module to fuse these features, adequately exploiting high-level features to refine low-level features gradually. Compared with 19 state-of-the-art methods on 6 public benchmark datasets, the proposed method achieves remarkable performance in both quantitative and qualitative evaluation, and extensive ablation studies and discussions demonstrate its effectiveness and superiority.


Introduction
Salient Object Detection (SOD) aims to locate the most conspicuous regions in an image. As a preprocessing step, it has been widely applied in various computer vision tasks, such as object recognition [1], image editing [2], image retrieval [3], semantic segmentation [4,5] and visual tracking [6]. Earlier SOD algorithms mainly used conventional methods to generate saliency maps [7], often relying on heuristic priors (e.g., color [8] and texture [9]). However, such hand-crafted features struggle to capture the latent semantic information in images, so they fail to yield satisfactory results for images with complex backgrounds. Recently, with the development of deep learning, SOD has made prominent progress. Owing to their powerful capability to extract low-level and high-level information simultaneously [10,11], CNNs have emerged as an important trend for SOD, especially in complicated cases.
Although CNNs have achieved excellent performance in SOD, many challenges remain.
(1) Many saliency studies have revealed that multi-scale features are essential for SOD [11][12][13]. Specifically, low-level features contain abundant details but are full of background noise (Fig. 1(b)). On the contrary, high-level features have rich semantic information, which is helpful for locating the salient objects and suppressing the background noise (Fig. 1(c)). Therefore, it is critical to aggregate these features properly to generate satisfactory saliency maps. Existing approaches tackle the problem by integrating multiple features layer-by-layer [10,12,14], often by direct concatenation [14] or addition [15,16], which ignores both the guidance relationship in which semantic information refines details and the differences in the features' contributions. (2) Besides, multi-scale context information is not effectively extracted and utilized within every single block. When extracting high-level features, salient objects and their surroundings are both necessary for generating the final saliency maps [17]. Recently, some methods have been proposed to integrate multi-scale context information [15,18]. However, their receptive fields cannot be dynamically adjusted to fit objects of different sizes, resulting in poor sensitivity to changes in the size of salient objects.
Since the attention mechanism [19,20] has been widely and successfully used to improve model performance, many attention-based networks have been proposed for SOD. Therefore, to deal with the above problems, we propose a saliency model, PS-Net, that selects features progressively at multiple levels. PS-Net employs the attention mechanism to effectively integrate selected low-level appearance features and high-level semantic features to generate saliency maps in a supervised way. First, in order to extract richer low-level detail features, we propose a Self-Interaction Attention (SIA) module for pixel-level fusion, which fuses the global and local information of low-level features to ensure that the attention score of each pixel is computed both globally and locally, with particular focus on boundaries. Besides, because the sizes of salient objects vary greatly, we propose a Pyramid Feature Dynamic Extraction (PFDE) module for the effective utilization of multi-scale context information within every single block. Unlike direct concatenation or addition, the PFDE module takes advantage of an attention mechanism, the Feature Dynamic Extraction (FDE) module, to dynamically adjust the receptive field in every single block to adapt to salient objects of distinct sizes (Fig. 1(d)). Finally, considering the guidance relationship in which semantic information refines details, and the different contributions of high-level and low-level features, we propose the Scale Aware Fusion (SAF) module. A spatial attention mechanism is introduced to encourage high-level features to guide low-level features and to fuse them by self-learning, suppressing the background response of the original features (Fig. 1(e)).
To verify the performance of PS-Net, we report experimental results on 6 popular SOD datasets and visualize some saliency maps. We also conduct a series of ablation experiments to evaluate the effect of each module. The quantitative and visual results demonstrate that PS-Net obtains better saliency maps. Our contributions are summarized as follows: (1) We introduce a Self-Interaction Attention module to extract richer detailed features, ensuring that the attention score of each pixel is computed both globally and locally, with particular focus on boundaries. (2) We propose a Pyramid Feature Dynamic Extraction module to dynamically adjust the receptive field in every single block to adapt to salient objects of distinct sizes. (3) Considering the different contributions of high-level and low-level features, we design the Scale Aware Fusion module for effective feature fusion, in which a spatial attention mechanism is introduced to suppress the background response of the original features. (4) Compared with 19 state-of-the-art methods on 6 public benchmark datasets, the proposed method achieves remarkable performance in both quantitative and qualitative evaluation; extensive ablation studies and discussions demonstrate its effectiveness and superiority.

Related Works
In this section, we introduce related works from two aspects. First, we review several representative SOD methods; then we describe applications of attention mechanisms in various visual fields.

Salient Object Detection
Early-stage saliency methods are mainly based on hand-crafted priors to estimate salient objects, such as color contrast [8], local contrast [21], and background priors [9]. In recent years, deep learning has emerged as a promising alternative for SOD, mainly because CNN-based saliency models allow flexible feature utilization and possess powerful end-to-end capabilities. Zhao et al. [22] proposed a fully connected CNN to integrate global context information for saliency detection. Liu et al. [23] generate prediction maps by refining edges in low-level features. Wang et al. [24] propose a model that adds low-level detail features to predict images of different scales. Hu et al. [25] concatenate multi-layer features for saliency detection. Zhang et al. [26] build a directional message-passing model to better integrate multi-scale features. Qin et al. [27] proposed a boundary-aware model that predicts boundaries simultaneously. Wu et al. [14] proposed a cascaded partial decoder that utilizes attention mechanisms to refine high-level features. Some methods also construct deep network architectures to optimize saliency maps [28,29].
The above studies demonstrate that the extraction of effective features plays a crucial role in generating a complete saliency map. Therefore, we propose a saliency model, PS-Net, that selects features progressively at multiple levels. It selectively integrates multi-scale information to generate low-level saliency feature maps guided by high-level semantic information.

Attention Mechanism
The essence of the attention mechanism is to highlight salient information and suppress useless information; it is mainly divided into spatial attention and channel attention. Attention mechanisms have been proven beneficial in visual tasks such as image classification [30], image captioning [31], and visual question answering [32].
Chen et al. [31] propose the SCA-CNN network, which combines spatial and channel attention for image captioning. Li et al. [33] use the attention mechanism to let global context guide target detection. Liu et al. [34] construct a pixel-wise contextual attention model that attends to the informative context position of each pixel. Chen et al. [35] embed a reverse attention module in a top-down pathway to predict saliency maps. Zhang et al. [36] build a progressive attention model that sequentially generates attention features for saliency detection through channel and spatial attention mechanisms.
The above studies demonstrate that the attention mechanism is of great help in SOD. However, when integrating convolutional features, most existing methods treat multi-scale features without distinction.
On the contrary, PS-Net integrates global and pixel-level attention guidance, combining the feature extraction capability of multi-scale information with the feature selection capability of the attention mechanism.

Method
In this section, we describe how each component is constructed and elucidate its effect on saliency detection. The overall architecture of the proposed method is illustrated in Fig. 2.

Pyramid Feature Dynamic Extraction module
In the feature extraction network, convolution operations at different levels correspond to feature extraction at different scales, which directly affects the representation capability of the model. As discussed in the introduction, low-level features contain more detailed information whilst high-level features contain affluent semantic information. Therefore, in order to better extract the semantic information in the high-level features, we propose the Pyramid Feature Dynamic Extraction (PFDE) module, inspired by Atrous Spatial Pyramid Pooling (ASPP) [37].
For each convolutional layer containing deep semantic information, combining multi-scale information can produce more robust feature representations. ASPP concatenates the feature maps generated by dilated convolutions with different rates, so the resulting maps encode multi-scale information under different receptive fields without distinction, which causes information redundancy and can even degrade performance. Consequently, it is necessary to mine multi-scale information for more effective fusion.
In the proposed PFDE module, as shown in Fig. 3, we use four parallel dilated convolutions with dilation rates of 1, 3, 5 and 7 to capture information at different scales. After this, we design a Feature Dynamic Extraction (FDE) module to fuse the differently scaled features. As shown in Fig. 4, global and local attention mechanisms are introduced to dynamically select the appropriate scale features and fuse them by self-learning. Given two features f_1, f_2 of size h × w × c with different receptive fields, where h × w represents the spatial dimension and c denotes the number of channels, the FDE module first applies element-wise addition to merge f_1 and f_2 into the mixed feature f_m. Then f_m locates salient objects from different receptive fields through global and local attention mechanisms, respectively, which dynamically adapt to various sizes of salient objects through self-learning. Specifically, f_m passes through a global average pooling layer and a fully connected layer followed by a repeat function to obtain the global attention map f_g, which has the same resolution as f_m. In parallel, f_m goes through an average pooling layer and a convolution layer to obtain the local attention map f_a. The common feature f_c is then combined with f_g and with f_a by element-wise addition, respectively, and the fused feature map f_f is obtained as a weighted sum: f_g = Repeat(FC(GAP(f_m))), f_a = Conv(AvP(f_m)), f_f = σ(δ(f_g + f_c)) × f_1 + σ(δ(f_a + f_c)) × f_2, (1) where GAP refers to the global average pooling layer, AvP denotes the average pooling layer, FC is the fully connected layer, δ denotes the ReLU function and σ represents the sigmoid operation.
We employ three cascaded Feature Dynamic Extraction (FDE) modules to obtain the final fused feature of the four branches.
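To make the FDE fusion concrete, the following is a minimal numpy sketch of the operations described above. The paper does not specify the exact layer configurations, so the FC layer is stood in for by a plain weight matrix, the local convolution by a 2 × 2 average pooling followed by nearest-neighbor upsampling, and the common feature f_c is assumed to equal the mixed feature f_m; all of these are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fde_fuse(f1, f2, fc_weight):
    """Sketch of Feature Dynamic Extraction (FDE) fusion.

    f1, f2: (h, w, c) feature maps from branches with different receptive
    fields (h, w assumed even). fc_weight: a hypothetical (c, c) matrix
    standing in for the fully connected layer FC.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    relu = lambda x: np.maximum(x, 0.0)

    f_m = f1 + f2          # mixed feature via element-wise addition
    f_c = f_m              # assumption: common feature taken as f_m

    # Global attention: GAP over spatial dims -> FC -> repeat back to (h, w, c)
    gap = f_m.mean(axis=(0, 1))                       # (c,)
    f_g = np.broadcast_to(gap @ fc_weight, f_m.shape)

    # Local attention: 2x2 average pooling, then nearest-neighbor upsampling,
    # standing in for AvP followed by a convolution
    h, w, c = f_m.shape
    pooled = f_m.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    f_a = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)

    # Weighted sum: sigmoid gates over ReLU-activated attention maps
    # select between the two branches, as in Eq. (1)
    return sigmoid(relu(f_g + f_c)) * f1 + sigmoid(relu(f_a + f_c)) * f2
```

Cascading three such fusions over the four dilated branches then yields the final PFDE output.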

Self-Interaction Attention Module
As seen in Fig. 1, the saliency map of low-level features contains many details, some of which are beneficial for SOD while others are counterproductive. As shown in Fig. 5, where several salient objects and their corresponding boundaries are displayed, unclear boundaries of salient objects remain a challenge even for the latest high-performing methods. In order to thoroughly extract the detailed information from the low-level features and explicitly learn salient object boundaries so as to better locate and sharpen salient objects, we propose the Self-Interaction Attention (SIA) module.
In the SIA module, the score of each pixel is obtained by comparison with all other positions. Specifically, for the shallow feature f_w of size h × w × c, it is necessary to highlight the channels that focus on foreground information and suppress the channels dominated by background noise, since each channel focuses on a different feature. Each channel can be regarded as a boundary detector, so we calculate the maximum value and the average value at the same time to obtain a soft attention: f_s = [σ(GAP(f_w)) + σ(GMP(f_w))] × f_w, (6) where GMP refers to the global max-pooling layer and σ denotes the softmax function. GMP only pays attention to the most significant part, while GAP treats all pixels equally and inevitably introduces noise, so we train f_s to make a soft choice between them.
In addition, to ensure that the attention score of each pixel is calculated both locally and globally, we add two terms for global and local information extraction (Fig. 6). The global term has the same structure as described above, where the softmax function is combined with global average pooling of the spatial average matrix. For the local term, we use average pooling to compute the local information similarity, where a 2 × 2 pooling layer is applied to obtain the attention score of each local pixel.
Considering that local information should be independent across positions, we use the sigmoid function when calculating the local attention.
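The channel-reweighting step of Formula (6) can be sketched as follows: each channel is scored by both its global average (GAP) and its global maximum (GMP), the two score vectors pass through softmax, and their sum rescales the channels. This is a numpy illustration of the formula only, not the authors' full module, which also includes the global and local pixel-attention terms.

```python
import numpy as np

def sia_channel_attention(f_w):
    """Soft channel attention of the SIA module (Formula 6).

    f_w: low-level feature of shape (h, w, c).
    Returns f_s = [softmax(GAP(f_w)) + softmax(GMP(f_w))] * f_w.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    gap = f_w.mean(axis=(0, 1))   # (c,) average response per channel
    gmp = f_w.max(axis=(0, 1))    # (c,) peak response per channel
    score = softmax(gap) + softmax(gmp)   # soft channel weights
    return f_w * score            # broadcasts over spatial dimensions
```

Channels with strong foreground responses thus receive larger weights, while weakly responding (background-dominated) channels are suppressed.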

Scale Aware Fusion module
Due to multiple downsampling operations, high-level features have rich semantic information but lose a lot of detail. Meanwhile, low-level features retain rich details together with background noise on account of their limited receptive field. In order to refine the details of the semantic features and suppress the background noise of the detail features, we propose the Scale Aware Fusion (SAF) module.
As shown in Fig. 7, taking into account the attention guidance relationship and the different contributions of the multi-scale features, a spatial attention mechanism is introduced to dynamically select the appropriate scale features and fuse them. Specifically, this module first applies element-wise addition to merge the semantic feature f_h and the detailed feature f_l into the common feature f_t. Then f_t passes through a series of convolution layers to obtain two weight maps f_A and f_B. Finally, the fused feature map f is obtained as a weighted sum: f_A, f_B = D(conv(f_t)), f = f_A × f_h + f_B × f_l, where conv is a cascade of convolution, batch normalization and ReLU, and D represents the channel-splitting operation.
This attention fusion algorithm can effectively avoid the pollution caused by background noise.We cascade multiple SAF modules sequentially to make the semantic features and detailed features fully merged.Finally, the boundary of the high-level feature is sharpened and the background noise of the low-level feature is suppressed.
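The SAF fusion above can be sketched in a few lines of numpy. The conv block is passed in as an arbitrary callable because the paper does not give its exact configuration; any function mapping (h, w, c) to (h, w, 2c) stands in for the cascade of convolution, batch norm and ReLU.

```python
import numpy as np

def saf_fuse(f_h, f_l, conv):
    """Sketch of the Scale Aware Fusion (SAF) module.

    f_h: high-level (semantic) feature, shape (h, w, c).
    f_l: low-level (detail) feature, shape (h, w, c).
    conv: callable (h, w, c) -> (h, w, 2c); a hypothetical stand-in for
          the paper's convolution + batch norm + ReLU cascade.
    """
    f_t = f_h + f_l                    # common feature via element addition
    w = conv(f_t)                      # (h, w, 2c) joint weight tensor
    c = f_h.shape[-1]
    f_A, f_B = w[..., :c], w[..., c:]  # channel splitting D
    return f_A * f_h + f_B * f_l       # weighted sum of the two scales
```

With a conv stand-in that emits constant 0.5 weights, the output reduces to the plain average of the two features; the learned weights instead vary per pixel, letting the semantic feature guide where the detail feature is trusted.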

Loss
In SOD tasks, the binary cross-entropy (BCE) loss is usually used to evaluate the gap between the generated saliency map and the ground truth: L_BCE = −(1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} [G_ij log S_ij + (1 − G_ij) log(1 − S_ij)], where H and W refer to the height and width of the image, respectively, G_ij denotes the ground truth of pixel (i, j) and S_ij represents the predicted probability of belonging to salient regions.
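A direct numpy implementation of this per-pixel average (clipping predictions away from 0 and 1 to keep the logarithms finite, a standard numerical precaution rather than part of the definition):

```python
import numpy as np

def bce_loss(S, G, eps=1e-7):
    """Binary cross-entropy between predicted saliency map S (probabilities
    in [0, 1]) and binary ground truth G, averaged over all H*W pixels."""
    S = np.clip(S, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(G * np.log(S) + (1.0 - G) * np.log(1.0 - S)))
```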
However, BCE treats every pixel equally and cannot focus smoothly on foreground regions, which compounds the imbalance between foreground and background caused by scale variation. To deal with this problem, the loss needs to meet two conditions: (1) it should be insensitive to changes in object size; (2) it should pay more attention to foreground regions. Therefore, we introduce the consistency enhancement loss (CEL) [38]: L_CEL = (FP + FN) / (FP + 2TP + FN), where TP, FP and FN represent true positives, false positives and false negatives, respectively. FP + FN is the difference between the union and the intersection of the predicted map and the ground truth, while FP + 2TP + FN is the sum of the union and the intersection.
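A numpy sketch of CEL follows. To keep the loss differentiable in practice, TP, FP and FN are computed as soft counts over the predicted probabilities, a common relaxation that the text does not spell out and which is assumed here.

```python
import numpy as np

def cel_loss(S, G):
    """Consistency enhancement loss (CEL): (FP + FN) / (FP + 2*TP + FN),
    i.e. (union - intersection) / (union + intersection), using soft counts.

    S: predicted saliency probabilities in [0, 1]; G: binary ground truth.
    """
    TP = float((S * G).sum())          # soft true positives
    FP = float((S * (1 - G)).sum())    # soft false positives
    FN = float(((1 - S) * G).sum())    # soft false negatives
    return (FP + FN) / (FP + 2 * TP + FN)
```

Because every term is a sum over the whole map normalized by the union-plus-intersection, the loss depends on the overlap ratio rather than the pixel count, making it insensitive to object size, which is exactly condition (1) above.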

Datasets
We evaluate the proposed model on six public saliency detection benchmark datasets: ECSSD [39], DUT-OMRON [9], HKU-IS [40], PASCAL-S [41], DUTS [42] and SOD [43], all human-labeled with pixel-wise ground truth for quantitative evaluation. DUTS is currently the largest SOD dataset, including 10553 training images (DUTS-TR) and 5019 test images (DUTS-TE). DUT-OMRON contains 5168 images with complex backgrounds and high content variety. ECSSD consists of 1000 natural-looking pictures with complex content. HKU-IS is composed of 4447 challenging images with multiple disconnected salient objects. PASCAL-S includes 850 challenging pictures. SOD contains 300 images with complex backgrounds and multiple foreground objects.

Evaluation Criteria
To quantitatively evaluate the effectiveness of the proposed model, we adopt precision-recall (PR) curves, the F-measure (F-m) score, the Mean Absolute Error (MAE), and the mean E-measure (E-m) score as performance measures. MAE is the average pixel-wise absolute difference between the prediction map and the ground truth: MAE = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} |P_ij − G_ij|, where P refers to the predicted saliency map and G denotes the ground truth. The F-measure is a comprehensive criterion computed as a weighted combination of precision and recall. The E-measure combines local pixel values with the global mean to evaluate the similarity between the predicted map and the ground truth. For the PR curve, precision and recall values are computed from the predicted map and the ground truth under thresholds ranging from 0 to 255.
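The two simplest criteria can be sketched directly in numpy. The F-measure is shown at a single fixed threshold with beta^2 = 0.3, the value conventional in the SOD literature; the paper does not state its exact thresholding protocol, so this is an illustrative assumption.

```python
import numpy as np

def mae(P, G):
    """Mean Absolute Error between prediction P and ground truth G."""
    return float(np.mean(np.abs(P - G)))

def f_measure(P, G, threshold=0.5, beta2=0.3):
    """Single-threshold F-measure: weighted harmonic mean of precision
    and recall on the binarized prediction, with beta^2 = 0.3."""
    B = P >= threshold
    tp = float(np.logical_and(B, G == 1).sum())
    precision = tp / max(B.sum(), 1)
    recall = tp / max((G == 1).sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

Sweeping the threshold from 0 to 255 and recording each (precision, recall) pair yields the PR curve described above.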

Implementation Details
Following most existing state-of-the-art methods [24,26,34,36], we use DUTS-TR as the training dataset. We exclude methods that use other datasets for training, such as RADF [25] and RAS [35], which are trained on MSRA-10K [8]. During training, we crop the images to 224 × 224 and apply random cropping and random rotation for data augmentation to avoid over-fitting. The learning rate follows the poly strategy with the power set to 0.9. To ensure convergence, the model was trained on an NVIDIA GTX 1080 Ti GPU with a batch size of 8. We adopted a two-step training strategy to train different components separately. Specifically, we deploy VGG-16 pretrained on ImageNet as the backbone and initialize the other convolution layers at random. We first freeze the backbone network and train the other layers for 50 epochs with a large initial learning rate, and then train the whole network for 50 epochs with a small initial learning rate.

Quantitative Comparison
To fully compare the proposed model with existing models, the experimental results under different metrics are listed in Table 1. The results show that our method exhibits excellent performance, which validates the effectiveness of the proposed model. In addition, Fig. 10 shows the PR curves of the above algorithms on the 6 datasets.
The results reveal that our method is the most prominent in most cases, indicating that our model is highly competitive.

Qualitative Evaluation
To further illustrate the advantages of the proposed method, we provide some visual examples from different methods. Representative examples are shown in Fig. 8, covering various scenarios: large salient objects (1st and 2nd rows), small objects (3rd and 4th rows), multiple salient objects (5th and 6th rows), and low contrast between the salient object and the background (7th and 8th rows). Compared with other methods, the saliency maps produced by our method are more complete and more accurate. Additionally, our method captures salient boundaries well thanks to the Self-Interaction Attention module. As shown in Fig. 9, our method also performs well on salient objects with background disturbance thanks to the Scale Aware Fusion module, which takes into account the attention guidance relationship and the features' different contributions.

Ablation Study
To illustrate the effectiveness of each module of the proposed model, we conduct an ablation study on the ECSSD dataset. As shown in Table 2, the proposed model containing all components (i.e., PFDE, SIA, and SAF) achieves the best performance, which demonstrates that each component is necessary for obtaining the best saliency detection results.
To verify that the performance improvement of the proposed model is not simply caused by increased model complexity, we design a network based on the Baseline with complexity similar to PS-Net by adding channels, called Base-C in Table 2. The experiment shows that PS-Net achieves a notable improvement over Base-C (36% in terms of MAE).
We adopt the model that only uses up-sampled high-level features as the baseline, and then add each module progressively. First, to verify the function of each part of the SIA module more precisely, we extract low-level features after the first part of the SIA module (Formula 6) and after the whole SIA module, respectively. Integrating high-level and low-level features by addition, the baseline MAE improves from 0.071 to 0.066 and 0.064, respectively. Furthermore, we add the PFDE module with the FDE module replaced by simple addition, called PFDE-w in Table 2; this yields a 15% reduction in MAE compared with the baseline. On this basis, the MAE score improves by a further 38% after adding FDE to the PFDE module. Finally, the combination with SAF achieves the best result.

Fig. 1
Fig. 1 Motivating examples for the proposed PS-Net. (a) Image. (b) Original low-level feature. (c) Original high-level feature. (d) Features extracted after the PFDE module. (e) Features extracted after the SAF module

Fig. 5 Fig. 6
Fig. 5 Examples of boundaries of several salient objects. From left to right: original images, boundaries of the ground truth, boundaries of the proposed method, boundaries of GateNet, and boundaries of PAGE

Fig. 7
Fig. 7 Detailed structure of Scale Aware Fusion Module

Fig. 8 Fig. 9
Fig. 8 Qualitative comparison of the proposed model with other state-of-the-art methods. The saliency maps produced by our model are clearer and more accurate than the others, and our results are more consistent with the ground truths

Table 1
Performance comparison with 19 state-of-the-art methods over 6 datasets. MAE (smaller is better), mean E-measure (E-m, larger is better) and F-measure (F-m, larger is better) are used to measure model performance. The best three results are shown in red, blue, and green

Table 2
Ablation study for different modules on the ECSSD dataset

Fig. 10 Precision-recall curves on six common saliency datasets

Conclusion
In this paper, we propose a Progressive Selection Network (PS-Net) for effective salient object detection. Taking into account the characteristics of multi-scale features, we design the PFDE module to aggregate high-level features dynamically. To refine saliency edges, we propose the SIA module to extract low-level features. Besides, considering the different contributions of high-level and low-level features, we propose the SAF module, which exploits high-level features to guide low-level features. Extensive experiments on 6 datasets validate that the proposed model outperforms 19 state-of-the-art methods under different evaluation metrics.