
Unsupervised domain adaptation for VHR urban scene segmentation via prompted foundation model-based hybrid training joint-optimized network.

Lyu, Shuchang; Zhao, Qi; Sun, Yaxuan; Cheng, Guangliang; He, Yiwei; Wang, Guangbiao; Ren, Jinchang; Shi, Zhenwei

Authors

Shuchang Lyu

Qi Zhao

Yaxuan Sun

Guangliang Cheng

Yiwei He

Guangbiao Wang

Jinchang Ren

Zhenwei Shi



Abstract

Unsupervised Domain Adaptation for Remote Sensing Semantic Segmentation (UDA-RSSeg) aims to adapt a model trained on source domain data to target domain samples, thereby minimizing the need for annotated data across diverse remote sensing scenes. In urban planning and monitoring, UDA-RSSeg on Very-High-Resolution (VHR) images has garnered significant research interest. While recent deep learning techniques have achieved considerable success on the UDA-RSSeg task for VHR urban scenes, the domain shift issue remains a persistent challenge. Specifically, there are two primary problems: (1) severe inconsistencies in feature representation across domains with notably different data distributions, and (2) a domain gap caused by the representation bias toward source domain patterns when translating features into predictive logits. To solve these problems, we propose a prompted foundation model based hybrid training joint-optimized network (PFM-JONet) for UDA-RSSeg on VHR urban scenes. Our approach integrates the "Segment Anything Model" (SAM) as the prompted foundation model, leveraging its robust, generalized representation capabilities to alleviate feature inconsistencies. On top of the features extracted by the SAM encoder, we introduce a mapping decoder that converts SAM encoder features into predictive logits. Additionally, a prompted segmentor is employed to generate class-agnostic maps, which guide the mapping decoder's feature representations. To optimize the entire network efficiently in an end-to-end manner, we design a hybrid training scheme that integrates feature-level and logits-level adversarial training strategies alongside a self-training mechanism, enhancing the model from diverse, compatible perspectives. To evaluate the proposed PFM-JONet, we conduct extensive experiments on urban scene benchmark datasets, including ISPRS (Potsdam/Vaihingen) and CITY-OSM (Paris/Chicago). On the ISPRS datasets, PFM-JONet surpasses previous state-of-the-art (SOTA) methods by 1.60% in mean IoU across four adaptation tasks; on the CITY-OSM adaptation task, it outperforms the SOTA by 4.84% in mean IoU. These results demonstrate the effectiveness of our method, and further visualization and analysis reinforce its interpretability. The code of this paper is available at https://github.com/CV-ShuchangLyu/PFM-JONet.
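For illustration only, the sketch below shows how a hybrid training step of this general kind could be wired together in PyTorch: a foundation-model encoder, a lightweight mapping decoder producing class logits, two patch discriminators for feature-level and logits-level adversarial alignment, and a confidence-thresholded self-training term on unlabeled target images. All module names, loss weights, and the pseudo-labeling rule are placeholder assumptions, not the authors' implementation; the official code is at the GitHub link above.

    # Minimal sketch, assuming a PyTorch setup; not the authors' PFM-JONet code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MappingDecoder(nn.Module):
        """Toy stand-in for a mapping decoder: turns encoder features into class logits."""
        def __init__(self, in_ch=256, num_classes=6):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(in_ch, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, num_classes, 1),
            )
        def forward(self, feat):
            return self.head(feat)

    class PatchDiscriminator(nn.Module):
        """Small patch discriminator for feature- or logits-level domain alignment."""
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(64, 1, 4, stride=2, padding=1),
            )
        def forward(self, x):
            return self.net(x)

    def hybrid_training_step(encoder, decoder, d_feat, d_logit,
                             src_img, src_lbl, tgt_img,
                             opt_seg, opt_disc,
                             adv_weight=0.01, pseudo_thresh=0.9):
        """One illustrative step mixing supervised, adversarial, and self-training losses."""
        bce = F.binary_cross_entropy_with_logits

        # ---- segmentation (generator) update ----
        opt_seg.zero_grad()
        src_feat, tgt_feat = encoder(src_img), encoder(tgt_img)
        src_logit, tgt_logit = decoder(src_feat), decoder(tgt_feat)

        # supervised cross-entropy on labeled source images
        loss_sup = F.cross_entropy(src_logit, src_lbl)

        # self-training: keep only confident target pseudo-labels (placeholder rule)
        with torch.no_grad():
            conf, pseudo = tgt_logit.softmax(dim=1).max(dim=1)
        loss_self = (F.cross_entropy(tgt_logit, pseudo, reduction="none")
                     * (conf > pseudo_thresh)).mean()

        # adversarial terms: push target features / logits toward the source distribution
        adv_f = d_feat(tgt_feat)
        adv_l = d_logit(tgt_logit.softmax(dim=1))
        loss_adv = bce(adv_f, torch.ones_like(adv_f)) + bce(adv_l, torch.ones_like(adv_l))

        (loss_sup + loss_self + adv_weight * loss_adv).backward()
        opt_seg.step()

        # ---- discriminator update: source treated as real (1), target as fake (0) ----
        opt_disc.zero_grad()
        loss_d = 0.0
        for disc, s, t in ((d_feat, src_feat, tgt_feat),
                           (d_logit, src_logit.softmax(dim=1), tgt_logit.softmax(dim=1))):
            out_s, out_t = disc(s.detach()), disc(t.detach())
            loss_d = loss_d + bce(out_s, torch.ones_like(out_s)) + bce(out_t, torch.zeros_like(out_t))
        loss_d.backward()
        opt_disc.step()

In such a sketch, opt_seg would cover the decoder (and any trainable encoder adapters, if the foundation-model encoder is otherwise frozen), while opt_disc covers both discriminators; the prompted segmentor guidance described in the abstract is omitted here for brevity.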

Citation

LYU, S., ZHAO, Q., SUN, Y., CHENG, G., HE, Y., WANG, G., REN, J. and SHI, Z. 2025. Unsupervised domain adaptation for VHR urban scene segmentation via prompted foundation model based hybrid training joint-optimized network. IEEE transactions on geoscience and remote sensing [online], 63, article number 4409117. Available from: https://doi.org/10.1109/tgrs.2025.3564216

Journal Article Type Article
Acceptance Date Apr 22, 2025
Online Publication Date Apr 24, 2025
Publication Date Dec 31, 2025
Deposit Date May 5, 2025
Publicly Available Date May 5, 2025
Journal IEEE transactions on geoscience and remote sensing
Print ISSN 0196-2892
Electronic ISSN 1558-0644
Publisher Institute of Electrical and Electronics Engineers (IEEE)
Peer Reviewed Peer Reviewed
Volume 63
Article Number 4409117
DOI https://doi.org/10.1109/TGRS.2025.3564216
Keywords Unsupervised domain adaptation; Semantic segmentation; Hybrid training; Prompted foundation model; Very-high-resolution images; Urban scene
Public URL https://rgu-repository.worktribe.com/output/2801786
