MF-BHNet: A Hybrid Multimodal Fusion Network for Building Height Estimation Using Sentinel-1 and Sentinel-2 Imagery

Siyuan Wang, Zhenfeng Shao, Bowen Cai, Dongyang Hou, Qing Ding, and Jiaming Wang
Abstract— The integration of Sentinel-1 synthetic aperture radar (SAR) imagery and Sentinel-2 optical imagery has shown great promise in mapping large-scale building height. Effectively fusing the complementary features of SAR and optical imagery is a key challenge in enhancing building height estimation performance. However, SAR imagery and optical imagery exhibit significant heterogeneity, which makes obtaining accurate building heights a challenging problem. In this article, we propose a hybrid multimodal fusion network (MF-BHNet) for building height estimation using Sentinel-1 SAR imagery and Sentinel-2 optical imagery. First, we design a hybrid multimodal encoder to mine modality-specific features and model intermodal correlation. In particular, an intramodal encoder (IME) is designed to reconstruct valuable intramodal information, and a transformer-based cross-modal encoder (CME) is used to model intermodal correlation and capture contextual information. Then, a coarse-to-fine progressive multimodal fusion method is proposed to fuse SAR features and optical features to improve building height estimation performance. We construct a building height dataset by introducing superior building footprints to validate our method. Experimental results demonstrate that MF-BHNet outperforms 11 compared state-of-the-art methods, achieving the lowest root-mean-square error (RMSE) of 3.6421 m. Besides, compared with four publicly available building height products, the mapping result of the proposed method has significant advantages in terms of spatial detail and accuracy.

Index Terms— Building height, data synergy, deep learning, remote sensing.

I. INTRODUCTION

BUILDING height characterizes the vertical structure of urban development, which influences urban volume ratio, material stock, and the built-up environment [1], [2]. It is also key to evaluating the urban ecological environment and development status, and it plays an important role in analyzing urban landscape structure [3], evaluating urban human livability [4], and estimating economic development patterns [5]. Especially in recent years, with the accelerated urbanization process in developing countries, the increase of regional high-rise and high-density buildings has also caused socio-environmental problems, such as increased energy consumption, urban congestion, and residential segregation [6], [7], [8]. Therefore, accurate mapping of building height over large areas is essential for a comprehensive understanding of urban development processes.
With the development of Earth observation technology, increasing attention has been paid to the extraction of building height information using remote sensing data. In many cases, building height can be measured via ultrahigh-resolution stereo image pairs [9], [10], but collecting large-scale stereo images is expensive. Furthermore, building height can be determined from the building shadow length in optical remote sensing imagery [11], [12]. However, shadow extraction is susceptible to occlusion and atmospheric conditions. Recently, the quantitative analysis of the Sentinel satellite constellation has provided new perspectives on building height estimation, with the advantage of global coverage as well as the synergy of Sentinel-1 synthetic aperture radar (SAR) and Sentinel-2 optical imagery [5], [13], [14]. Machine learning methods, such as the support vector machine and random forest, are increasingly used for large-scale building height mapping with Sentinel-1/2 data [15], [16]. However, these methods are inadequate for mining the deep nonlinear features of multisource data, and their performance largely depends on the quality of the feature engineering.

Deep learning, especially convolutional neural networks (CNNs), which learn the representation of images layer by layer, has significant advantages in urban information extraction and large-scale mapping [17], [18], [19], [20]. Inspired by this, several studies have explored the use of deep learning to collaborate with Sentinel-1/2 data for building height estimation [21], [22], [23].
Capturing effective SAR and optical fusion features is the key to improving building height estimation performance. However, the current feature cascading [23] or band cascading [22] constitutes only loose feature fusion, which not only results in information redundancy but also makes it difficult to eliminate the discrepancy between different modalities. Besides, valuable feature perception is the basis for effective fusion, and current methods are insufficient for capturing the critical information of a single modality.

To this end, this article proposes a hybrid multimodal fusion network (MF-BHNet) for building height estimation. First, a hybrid multimodal encoder is proposed for mining intramodal features and modeling intermodal correlation. Specifically, we design an intramodal encoder (IME) by introducing optimized convolutional units to capture modality-specific critical information and reduce redundancy. Furthermore, a transformer-based cross-modal encoder (CME) is used to model intermodal correlation and perceive contextual information. Then, a coarse-to-fine progressive multimodal fusion method is proposed to fuse the complementary information of SAR images and optical images to improve the precision of building height estimation. Thus, the semantic, spatial detail, and height response information of the buildings are fused at different levels, which not only helps in estimating the building height but also improves the quality of the building footprint.

The main contributions of this article can be summarized as follows.

1) We propose a novel hybrid multimodal fusion network for estimating building height by combining Sentinel-1 SAR imagery and Sentinel-2 optical imagery. With the global coverage of the Sentinel constellation, the method has the potential to produce large-scale building height maps.
2) A hybrid multimodal encoder is designed to mine modality-specific features and model intermodal correlation. In this way, the bottleneck of multimodal fusion due to the insufficient representation of unimodal information can be avoided.
3) A coarse-to-fine progressive multimodal fusion method is proposed to bridge the heterogeneity of SAR and optical imagery. By multiscale feature fusion (MFF) and multimodal semantic fusion (MSF), the discrepancy of heterogeneous data is effectively eliminated, promoting the exploitation of complementary information.
4) A building height dataset is constructed, and extensive comparative experiments are carried out against 11 state-of-the-art methods. Experimental results indicate that our method has a mean error of 3.6421 m, which is 27.6% better than the suboptimal method. In addition, compared with four publicly available building height products, our mapping has significant superiority in detail.

The rest of this article is organized as follows. Section II provides an overview of related work. Section III outlines the proposed methodological framework. The experimental setup is presented in Section IV. Section V describes our extensive experiments and analysis. Finally, Section VI summarizes the conclusion.

II. RELATED WORK

In this section, we review work closely related to this article, including building height estimation using optical remote sensing imagery and large-scale building height mapping using SAR imagery from the Sentinel-1/2 constellation. The former mainly utilizes high-resolution optical images, with heights estimated by monocular measurement or stereoscopic estimation. The latter tends to favor national/planetary-scale mapping studies via Sentinel-1 SAR imagery and Sentinel-2 optical imagery.

A. Building Height Estimation Using Optical Remote Sensing Imagery

With the development of computer vision, deep learning has been applied to building height estimation from optical remote sensing imagery. According to the estimation principle, these methods can be divided into monocular image estimation-based methods and multiview/stereo image-based methods.

1) Monocular Image Estimation-Based Methods: The idea of using monocular images for building height estimation comes from monocular depth estimation in computer vision, which aims to estimate the depth of objects in a scene from a single 2-D image. In fact, both involve recovering depth information from 2-D image projections of a 3-D scene. Inspired by this, several studies have explored the application of CNNs to superhigh-resolution optical images [24], [25] and street view images [26]. For example, Liu et al. [25] constructed a regression model with an encoder–decoder structure to learn the mapping relationship from a single aerial image to the digital surface model (DSM) using 0.15-m spatial resolution optical images. Ureña-Pliego et al. [26] attempted to learn building height information from a single streetscape image based on a segmentation model. These methods do not require additional auxiliary data and learn the implicit mapping relationship between 2-D and 3-D. However, this remains a very challenging task due to the poor interpretability of the mapping relationships and the limited generalization ability of the models.

2) Multiview/Stereo Image-Based Methods: These methods mainly exploit the inherent relationship between multiview/stereo images and building height to drive the optimization of deep learning models. For example, Cao and Huang [27] proposed a multispectral, multiview, and multitask deep network for building height estimation, introducing 2.5-m spatial resolution ZY-3 multiview imagery to reduce the uncertainty of building height estimation. Using the stereo observation capability of China's GaoFen-7 satellite, Chen et al. [28] first generate DSM data from rear-view and front-view images. Then, building roofs are extracted with a semantic segmentation network, and finally, the height of each building is estimated based on the DSM data and building outline data.
Relying on superhigh-resolution satellite imagery, these methods are able to map fine building heights with exciting results. However, it is difficult to acquire superhigh-resolution multiview/stereo imagery, which is currently not applicable to large-scale mapping tasks. Therefore, the above studies have only been validated in a few cities.

B. Large-Scale Building Height Mapping Using Sentinel-1 SAR Imagery and Sentinel-2 Optical Imagery

The Sentinel-1/2 satellites provide publicly available 10-m spatial resolution imagery with global coverage, achieving an ideal balance between spatial coverage and resolution for large-scale mapping needs [29], [30]. Moreover, the backscatter values of Sentinel-1 SAR data correlate with building height [31]. Inspired by this, Li et al. [5] constructed a building height indicator called VVH using the polarization bands VV and VH of Sentinel-1 SAR imagery. Yang and Zhao [16] developed a building height dataset with a spatial resolution of 1 km for China in 2017, using a spatially informed Gaussian process regression model. However, SAR imagery lacks clear spatial details, which makes it difficult to perform fine-scale building height mapping using SAR data alone.

To capture detailed building boundaries, several studies have explored using complementary information from SAR and optical imagery to map building height. For example, Frantz et al. [32] applied a support vector model to map building height in Germany by integrating the shading features of Sentinel-2 optical imagery and the backscatter features of Sentinel-1 SAR imagery. Li et al. [15] combined Sentinel-1 SAR data, Landsat optical imagery, and other ancillary data to map 3-D building structures based on a random forest model. Considering that deep learning has powerful feature-fitting capabilities, some studies have attempted to use deep learning to collaborate with Sentinel-1/2 data for joint mapping [21], [22], [23]. Yadav et al. [21] used two CNNs to capture the features of SAR and optical images separately, followed by cascading various levels of features to decode building height maps. Li et al. [22] developed a multitask network for building height mapping at a 100-m resolution, utilizing all the bands of SAR and optical images as combined inputs. Further, Cai et al. [23] proposed a building height estimation network (BHE-Net) for 10-m resolution mapping, which cascades the features of the two modalities at a high level, followed by refining the features using an attention mechanism.

In summary, the Sentinel data facilitate building height mapping over large areas, and band cascading or feature cascading is commonly used for collaborative building height mapping with Sentinel-1/2 imagery. However, SAR imagery and optical imagery have significant heterogeneity, and previous coarse fusion methods do not fully exploit the multimodal information. Moreover, current methods are insufficient for extracting critical information from a single modality, which is a major limitation on the effectiveness of fusion. Therefore, how to design a reasonable fusion strategy is a key issue to be solved: it is necessary to fully exploit the complementary information of multimodal data while taking modality-specific information into account.

III. METHODOLOGY

Given a paired Sentinel-1 SAR image $x_1$ and Sentinel-2 optical image $x_2$ with the same spatial extent, $x_1 \in \mathbb{R}^{2\times H\times W}$, where the spatial resolution is $H \times W$ and the two channels represent the polarization bands VV and VH. The optical image $x_2 \in \mathbb{R}^{4\times H\times W}$ includes the red, green, blue, and near-infrared (NIR) bands. Our goal is to predict the corresponding pixel-level building height map of size $H \times W$.

A. Overview

The overall framework of the proposed MF-BHNet is shown in Fig. 1; it is a dual-branch multitask learning network. MF-BHNet takes SAR imagery and optical imagery as inputs to the two branches and obtains the height and footprint of the buildings through the two prediction branches. In the workflow, we first use an IME to obtain optimized modality-specific critical features from the SAR and optical images, respectively. Internally, a multimodal feature-semantic fusion module is designed to fuse SAR and optical features from coarse to fine, combining MFF and MSF. Then, a transformer-based CME is proposed to model intermodal associations via global context learning. Specifically, MF-BHNet consists of four components: the IME, the CME, the multimodal feature-semantic fusion module, and the decoder.

The IME is designed as a dual-branch CNN to extract critical modality-specific details from the SAR and optical data, respectively. In each branch, we design a new residual unit to reduce information redundancy. We adopt unshared weights to learn domain-specific information for each single modality.

The CME is implemented as a transformer for globally modeling intermodal correlation. It encodes the unimodal feature maps and the semantic-fusion feature maps into patches to produce sequential inputs. By modeling the global context on top of the CNN features, the problem of losing spatial details in a pure transformer is avoided.

The multimodal feature-semantic fusion module facilitates the transfer and interaction of multimodal information at both the feature and semantic levels. In particular, the MFF utilizes a feature cascade to convey the low-level spatial structure information of the buildings. The MSF first refines the semantic features of each single modality and then performs channel shuffling of the multimodal features to achieve semantic fusion.

The decoder is designed to convert multiscale features into building heights and building footprints. It follows the same twinned design with unshared weights as the local feature encoder. The decoder upsamples the global context-encoded features and combines them with high-resolution local feature maps. In this way, low-level spatial structural features are fully exploited together with high-level semantic information.

In general, MF-BHNet attends to both intramodal critical information extraction and intermodal hierarchical fusion. Further experiments show that our design can effectively facilitate the joint exploitation of SAR images and optical images. The following describes the hybrid multimodal encoder, the multimodal feature-semantic fusion module, the decoder, and the loss function in detail.
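To make the data flow described above concrete, the following is a minimal PyTorch sketch of the dual-branch, dual-decoder structure (the article states that the models are built with PyTorch). All module bodies are stand-ins, and every class and variable name here is our own illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class MFBHNetSkeleton(nn.Module):
    """Illustrative data flow of the dual-branch, dual-decoder design.

    The real IME, CME, fusion modules, and decoders are described in
    Sections III-B to III-D; here they are placeholders.
    """
    def __init__(self, sar_channels=2, opt_channels=4):
        super().__init__()
        self.ime_sar = nn.Conv2d(sar_channels, 64, 3, padding=1)  # stand-in for the SAR IME branch
        self.ime_opt = nn.Conv2d(opt_channels, 64, 3, padding=1)  # stand-in for the optical IME branch
        self.fuse = nn.Conv2d(128, 64, 1)                         # stand-in for MFF/MSF + CME
        self.height_head = nn.Conv2d(64, 1, 1)                    # regression decoder: 1 output channel
        self.footprint_head = nn.Conv2d(64, 2, 1)                 # segmentation decoder: building/background

    def forward(self, x_sar, x_opt):
        f_s = self.ime_sar(x_sar)   # modality-specific SAR features
        f_o = self.ime_opt(x_opt)   # modality-specific optical features
        f = self.fuse(torch.cat([f_s, f_o], dim=1))  # coarse-to-fine fusion (placeholder)
        return self.height_head(f), self.footprint_head(f)

# Shapes follow the problem setup: x1 in R^{2xHxW}, x2 in R^{4xHxW}.
h, fp = MFBHNetSkeleton()(torch.randn(1, 2, 256, 256), torch.randn(1, 4, 256, 256))
print(h.shape, fp.shape)  # torch.Size([1, 1, 256, 256]) torch.Size([1, 2, 256, 256])
```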
Fig. 1. Overall framework of MF-BHNet. The IME captures single-modality key features, the CME uses a transformer to model intermodal correlation, and the decoder generates the predictions. Inside the network, the multiscale feature fusion and MSF modules are learned together with the main structure.
To obtain multiple feature maps with four different resolutions, the outputs of the 7 × 7 convolution and the various residual blocks are retained. These feature maps are denoted as $\{F_1^o, F_2^o, F_3^o, F_4^o\}$ and $\{F_1^s, F_2^s, F_3^s, F_4^s\}$ from bottom to top. In particular, $F_1^s \in \mathbb{R}^{(H/2)\times(W/2)\times D_1}$, $F_2^s \in \mathbb{R}^{(H/4)\times(W/4)\times D_2}$, $F_3^s \in \mathbb{R}^{(H/8)\times(W/8)\times D_3}$, and $F_4^s \in \mathbb{R}^{(H/16)\times(W/16)\times D_4}$, where $D_1$, $D_2$, $D_3$, and $D_4$ represent the respective channel counts. Note that the SAR- and optical-derived feature maps have the same shape.

2) Cross-Modal Encoder: Beyond intramodal learning, we design a CME to model the intermodal correlation and facilitate the interaction of multimodal information. The CME is derived from a transformer [35], which processes the multimodal input sequences and generates output sequences by self-attention. By globally capturing features from multiple modalities simultaneously, the CME can learn long-range dependencies of the multimodal data. As shown in Fig. 1, the CME requires three different feature maps provided by the IME: the optical modal feature map, the SAR modal feature map, and the MSF feature map, defined as $F_o$, $F_s$, and $F_m$, respectively.

Initially, the CME serializes these feature maps into fixed 2-D patches $\{o_p^i \in \mathbb{R}^{P^2\cdot C} \mid i = 1, \ldots, n\}$, $\{s_p^i \in \mathbb{R}^{P^2\cdot C} \mid i = 1, \ldots, n\}$, and $\{m_p^i \in \mathbb{R}^{P^2\cdot C} \mid i = 1, \ldots, n\}$, where $C$ denotes the channel count, $P$ denotes the patch size, and $n = HW/P^2$ denotes the number of patches per feature map. We aggregate these patches to obtain a multimodal aggregation representation, denoted as $\{f_p^i \in \mathbb{R}^{P^2\cdot C} \mid i = 1, \ldots, N\}$, where $N$ is the total number of patches.

Afterward, learnable positional embedding information is assigned to each patch. The vectorized patch $f_p^i$ is mapped to a latent $D$-dimensional embedding space through a linear projection. To incorporate spatial information, the patch embedding is augmented with position information, which is defined as

$$g_0 = \left[f_p^1 e;\; f_p^2 e;\; \ldots;\; f_p^N e\right] + e_{\mathrm{pos}} \tag{2}$$

where $e \in \mathbb{R}^{(P^2\cdot C)\times D}$ denotes the patch embedding projection and $e_{\mathrm{pos}} \in \mathbb{R}^{N\times D}$ denotes the positional embedding.

Finally, we perform self-attention encoding. The transformer consists of $L$ self-attention blocks that capture contextual dependencies. Each self-attention block incorporates a multihead self-attention (MSA) and a multilayer perceptron (MLP). The output of layer $l$ can be expressed as

$$g_l' = \mathrm{MSA}(\mathrm{LN}(g_{l-1})) + g_{l-1} \tag{3}$$

$$g_l = \mathrm{MLP}(\mathrm{LN}(g_l')) + g_l' \tag{4}$$
Fig. 5. Framework of MSF, where the encoder represents the IME. (a) MSF module. (b) SMM.
where $\mathrm{LN}(\cdot)$ is layer normalization and $g_l$ is the encoded image representation. The output of the last self-attention block is regarded as a global representation with shape $(HW/P^2) \times D$. We reshape it to recover the spatial order, denoted as $G \in \mathbb{R}^{(H/P)\times(W/P)\times D}$.

It should be noted that the transformer is powerful in global context modeling but ignores spatial details at the lower levels of the image. We address this problem by embedding high-level feature maps from the CNN into patches, which makes it possible to learn both local details and global contextual information. With the cross-modal global sensing of the CME, the intermodal correlation between the SAR data and optical data is effectively captured.
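The patch serialization of (2) and the pre-LN self-attention blocks of (3)-(4) can be sketched as follows. This is an illustrative implementation that assumes, for concreteness, 64-channel feature maps of size 32 × 32, a patch size $P = 2$, and an embedding dimension $D = 256$; none of these values are specified in the excerpt above.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One CME block implementing (3)-(4): pre-LN MSA and MLP with residuals."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, g):
        h = self.ln1(g)
        g = g + self.msa(h, h, h, need_weights=False)[0]  # g'_l = MSA(LN(g_{l-1})) + g_{l-1}
        return g + self.mlp(self.ln2(g))                  # g_l = MLP(LN(g'_l)) + g'_l

class CMESketch(nn.Module):
    """Serializes the SAR, optical, and MSF feature maps into patches,
    adds a learnable positional embedding as in (2), and applies L blocks."""
    def __init__(self, c=64, patch=2, n=256, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Linear(patch * patch * c, dim)       # the projection e
        self.pos = nn.Parameter(torch.zeros(1, 3 * n, dim))  # e_pos for N = 3n patches
        self.blocks = nn.ModuleList(SelfAttentionBlock(dim) for _ in range(depth))
        self.patch = patch

    def forward(self, f_o, f_s, f_m):
        unfold = nn.Unfold(self.patch, stride=self.patch)
        # each map -> n patches of length P^2 * C; aggregate the three modalities
        tokens = torch.cat([unfold(f).transpose(1, 2) for f in (f_o, f_s, f_m)], dim=1)
        g = self.embed(tokens) + self.pos
        for blk in self.blocks:
            g = blk(g)
        return g

g = CMESketch()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(g.shape)  # torch.Size([1, 768, 256]): N = 3 * (32*32)/2^2 patches, D = 256
```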
C. Multimodal Feature-Semantic Fusion

Optical imagery provides spectral and textural information, while SAR imagery contains valuable geometric information about ground objects; the two are highly complementary. Joint optical and SAR imagery can help detect the contours of buildings. More importantly, the combination of the two sensors improves the ability of dual-polarized radar backscatter to estimate building height [32]. However, current building height estimation methods rarely consider fine-grained multimodal fusion. To solve this problem, we propose a multimodal progressive feature-semantic fusion method to improve the performance of SAR images and optical images in building height estimation. In the following, we present MFF and MSF, respectively.

1) Multiscale Feature Fusion: MF-BHNet uses a CNN as its local feature encoder, which represents image features at multiple scales, and its low-level feature maps contain rich detail information [36], such as texture, edges, and height. In general, low-level features are relatively coarse; thus, access to a wider range of information is often more important. Therefore, we adopt an MFF strategy that uses feature cascades to increase the information of the fused feature map at low complexity.

We first concatenate the feature maps of the SAR branch and the optical branch and then use a 1 × 1 convolution to reduce the channel dimensions and obtain a preliminary feature map. Afterward, batch normalization and ReLU activation are performed to improve the representation. The four levels of features obtained by the local feature encoder are $\{F_1^o, F_2^o, F_3^o, F_4^o\}$ and $\{F_1^s, F_2^s, F_3^s, F_4^s\}$. In particular, the $i$th-level features from the optical and SAR images are denoted as $F_i^o$ and $F_i^s$, $i \in [1, 2, 3, 4]$, respectively. The feature fusion can be written as

$$F_i' = \mathrm{ReLU}\left(\mathrm{BatchNorm}\left(C_{1\times 1}\left(F_i^o \odot F_i^s\right)\right)\right) \tag{5}$$

where $F_i'$ denotes the fused features, $\odot$ denotes feature concatenation along the channel dimension, and $C_{1\times 1}$ denotes 1 × 1 convolution. We perform feature fusion at all four levels to obtain the multiscale fusion features, denoted as $\{F_1', F_2', F_3', F_4'\}$.
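Equation (5) translates directly into a small module. The sketch below assumes equal channel counts for the two branches at each level, which follows from the statement above that the SAR- and optical-derived feature maps share the same shape.

```python
import torch
import torch.nn as nn

class MFF(nn.Module):
    """Multiscale feature fusion of (5): channelwise concatenation of the
    i-th SAR and optical feature maps, then 1x1 convolution, BatchNorm, ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),  # C_{1x1} after concatenation
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_opt, f_sar):
        return self.fuse(torch.cat([f_opt, f_sar], dim=1))  # F'_i

# applied independently at the four levels i = 1..4
f_opt, f_sar = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128)
print(MFF(64)(f_opt, f_sar).shape)  # torch.Size([1, 64, 128, 128])
```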
It is noteworthy that, as described earlier, we design the SC-RU structure to reduce redundancy, so that MFF does not introduce a large amount of redundancy. It is a simple yet effective fusion strategy.

2) Multimodal Semantic Fusion: The aforementioned MFF is a relatively coarse approach aimed at obtaining richer multimodal information. However, to achieve a more detailed semantic description of buildings, it is necessary to perform fine-grained fusion of information from the different modalities. To this end, we design an MSF module that facilitates the extraction of intramodal discriminative information while taking fine-grained cross-modal interactions into account.

The MSF strategy is inspired by [37] and employs an attention mechanism to refine the modality-specific features. The diagram of MSF is shown in Fig. 5(a). First, the optical features and SAR features are refined with a semantic mining module (SMM) to highlight valuable spatial and channel features and suppress redundant information. Subsequently, spatial and channel fusion strategies are utilized to combine the spatial and channel features of the different modalities, respectively. Finally, a channel shuffle operation is introduced to facilitate intermodal information interaction.

Specifically, an attention mechanism-driven SMM is introduced to capture the key modality-specific features, as shown in Fig. 5(b). Two independent SMMs are used to model the spatial attention and channel attention of each modality, respectively. The processing flow of each SMM branch is described as follows.

To simplify the description, we define the input to the SMM as a feature map $X$ with $C$ channels and a 2-D shape of $H \times W$. First, $X$ is divided into $G$ groups along the channel dimension, i.e., $X = \{X_1, \ldots, X_k, \ldots, X_G\}$, where $X_k \in \mathbb{R}^{C/G\times H\times W}$. Then, each grouped sub-feature $X_k$ is separated into two streams along the channel dimension, i.e., $X_{k1}, X_{k2} \in \mathbb{R}^{C/2G\times H\times W}$. One processing stream captures critical channel information, while the other acquires spatially focused regions. For the channel attention stream, we first embed global information using global average pooling to generate the channel statistic $s \in \mathbb{R}^{C/2G\times 1\times 1}$, which is computed as follows:

$$s = F_{\mathrm{gap}}(X_{k1}) = \frac{1}{H\times W}\sum_{m=1}^{H}\sum_{n=1}^{W} X_{k1}(m, n). \tag{6}$$

Then, the channel-refined feature is obtained through a simple gating mechanism $F_c(\cdot)$ with a sigmoid activation function. It can be formulated as

$$X_{k1}' = \sigma(F_c(s)) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1} \tag{7}$$

where $X_{k1}'$ denotes the channel refinement result, $W_1 \in \mathbb{R}^{C/2G\times 1\times 1}$ and $b_1 \in \mathbb{R}^{C/2G\times 1\times 1}$ are the parameters used to scale and shift $s$ to enhance the feature representation of $X_{k1}$, and $\sigma(\cdot)$ is the sigmoid function. To refine the spatial features, we first obtain spatial statistics by applying group normalization (GN) to $X_{k2}$. Then, the spatially refined output is computed as

$$X_{k2}' = \sigma(W_2 \cdot \mathrm{GN}(X_{k2}) + b_2) \cdot X_{k2} \tag{8}$$

where $X_{k2}'$ denotes the spatially refined feature, and $W_2 \in \mathbb{R}^{C/2G\times 1\times 1}$ and $b_2 \in \mathbb{R}^{C/2G\times 1\times 1}$ are used for scaling and shifting to enhance the feature representation of $X_{k2}$.

By capturing the modality-specific critical features through semantic mining, the multimodal high-level information can be fused more efficiently. We define $X_{k1}^s$ and $X_{k1}^o$ to represent the channel-refined features of the SAR imagery and optical imagery, respectively. Similarly, $X_{k2}^s$ and $X_{k2}^o$ represent the spatially refined features of the SAR imagery and optical imagery, respectively. The features of the different modalities are then fused through channel fusion and spatial fusion. We cascade the channel features $X_{k1}^s$ and $X_{k1}^o$ to obtain the channel fusion feature $X_{k1}^m = (X_{k1}^s \odot X_{k1}^o) \in \mathbb{R}^{C/G\times H\times W}$, where $\odot$ represents channelwise concatenation. Similarly, the spatial fusion feature $X_{k2}^m = (X_{k2}^s \odot X_{k2}^o) \in \mathbb{R}^{C/G\times H\times W}$ is obtained.

To combine the spatial and channel features, we cascade $X_{k1}^m$ and $X_{k2}^m$ to obtain the sub-feature $X^m = (X_{k1}^m \odot X_{k2}^m) \in \mathbb{R}^{2C/G\times H\times W}$. Afterward, the sub-features are aggregated and channel shuffle is performed to obtain the semantic fusion feature $S \in \mathbb{R}^{2C\times H\times W}$. In this way, cross-group information is allowed to flow along the channel dimension, which enables information from both modalities to be fully exchanged and fused. Since deeper features contain more significant semantic information, we perform MSF between the highest SAR feature $F_4^s$ and the highest optical feature $F_4^o$. The semantic fusion feature $S_4$ can be computed as

$$S_4 = \mathrm{MSF}\left(F_4^s, F_4^o\right). \tag{9}$$

In summary, the spatial detail and semantic information provided by the multimodal data can be effectively exploited through MFF and MSF.
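A simplified sketch of the SMM and the channel shuffle is given below. It follows (6)-(8) for one modality and then, as a shorthand for the channel/spatial fusion described above, simply concatenates the two refined modality features before shuffling; the group count and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SMM(nn.Module):
    """Semantic mining of (6)-(8) for one modality: split each channel group
    into a channel-attention stream and a spatial-attention stream."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        half = channels // (2 * groups)
        self.fc = nn.Conv2d(half, half, 1)          # W1, b1 of (7) acting on the statistic s
        self.gn = nn.GroupNorm(half, half)          # GN of (8)
        self.w2 = nn.Parameter(torch.ones(1, half, 1, 1))
        self.b2 = nn.Parameter(torch.zeros(1, half, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.g, c // self.g, h, w)
        x1, x2 = x.chunk(2, dim=1)                  # X_k1, X_k2
        s = x1.mean((2, 3), keepdim=True)           # global average pooling, (6)
        x1 = torch.sigmoid(self.fc(s)) * x1         # channel refinement, (7)
        x2 = torch.sigmoid(self.w2 * self.gn(x2) + self.b2) * x2  # spatial refinement, (8)
        return torch.cat([x1, x2], dim=1).reshape(b, c, h, w)

def channel_shuffle(x, groups):
    """Cross-group information flow used by MSF after concatenating modalities."""
    b, c, h, w = x.shape
    return x.reshape(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

# MSF in brief: S4 = shuffle(concat(SMM(F4_sar), SMM(F4_opt)))
f4_sar, f4_opt = torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16)
s4 = channel_shuffle(torch.cat([SMM(64)(f4_sar), SMM(64)(f4_opt)], dim=1), groups=8)
print(s4.shape)  # torch.Size([1, 128, 16, 16])
```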
D. Decoder

In this article, we design a building height decoder and a building footprint decoder, which are used independently for building height estimation and building footprint extraction, respectively. The underlying idea is that building height information tends to be continuous, while the semantic category is a discrete property. With only one decoder, the building boundary may be negatively affected by the height estimation. Therefore, with a dual-decoder design that does not share weights, it is possible to use the predicted footprint to refine the boundaries of the building height maps, yielding a more accurate 3-D product.

To obtain pixel-by-pixel results, the top-level feature of the encoder is usually upsampled to full resolution, i.e., to a 2-D shape of $H \times W$. For example, FCN [38], PSANet [39], and DeepLabV3 [40] all upsample the feature map of the last layer of the encoder, which is a common strategy in semantic segmentation models. Moreover, additional processing is often added before upsampling to obtain finer information; for example, DeepLabV3 uses spatial pyramid pooling and convolutional kernels with various dilation rates to extract a more expressive representation. Unfortunately, the lack of height response information at the lower levels makes it difficult for such a regression model to converge. Therefore, similar to U-Net, we fuse the low-level spatial detail features with the high-level semantic features via skip connections. In this way, accurate building heights and footprints can be obtained. Concretely, the global representation $G$ from the CME is first upsampled and fused with the third-level fusion feature:

$$G' = \mathrm{Upsample}(G) \tag{10}$$

$$G'' = \mathrm{Conv2dReLU}\left(F_3' \odot G'\right). \tag{11}$$

The following three upsampling decoding blocks work similarly, taking the result of the previous fusion stage and connecting it to the preceding low-level features. By cascading four upsampling decoding blocks, the spatial resolution of the feature map is gradually increased to $H \times W$, consistent with the input image. Finally, we use a 1 × 1 convolution layer to adjust the number of channels and yield the final pixel-level output. Note that for the building height decoding branch, we set the number of output channels to 1; for the building footprint decoding branch, it is set to 2 (i.e., building and background).
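One decoding step of (10)-(11) can be sketched as follows; the interpolation mode and channel widths are our assumptions, as the excerpt does not specify them.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoding step in the spirit of (10)-(11): upsample the deeper
    feature, concatenate the encoder skip feature, then Conv2d + ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, g, skip):
        g = self.up(g)                              # G' = Upsample(G)
        return self.conv(torch.cat([g, skip], 1))   # G'' = Conv2dReLU(F' (+) G')

# four cascaded blocks restore H x W; a final 1x1 conv sets the channel
# count (1 for the height branch, 2 for the footprint branch)
g, skip = torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32)
print(UpBlock(256, 128, 128)(g, skip).shape)  # torch.Size([1, 128, 32, 32])
```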
E. Loss Function

We develop a multitask learning loss to jointly optimize the building height regression and building footprint segmentation tasks. The mean square error (MSE) loss is a widely used regression loss function that calculates the mean of the squared differences between the predicted and measured values. It has been shown that the MSE loss has a favorable effect on building height estimation [23], [27], so it is used as the regression loss for MF-BHNet. The MSE loss can be defined as

$$\mathcal{L}_{\mathrm{mse}} = \frac{1}{n}\sum_{i=1}^{n}\left(h_i - \hat{h}_i\right)^2 \tag{12}$$

where $h_i$ denotes the reference building height, $\hat{h}_i$ denotes the predicted building height, and $n$ denotes the number of samples.

The cross-entropy loss is widely used for semantic segmentation tasks. However, the cross-entropy loss tends to be overconfident and is susceptible to noisy samples, and in large-scale building mapping, noisy samples will inevitably exist. Therefore, we adopt a label smoothing loss to alleviate the problem of noisy data. The label smoothing loss uses a parameter $\alpha$ to smooth the true label, and the smoothed label is represented as

$$y_i^{LS} = y_i(1 - \alpha) + \alpha/n \tag{13}$$

where $y_i$ denotes the true label and $y_i^{LS}$ denotes the smoothed true label. The corresponding label smoothing loss is the cross-entropy computed against the smoothed labels. The overall objective combines the regression and segmentation terms, where $\alpha$ and $\beta$ are the hyperparameters that balance the optimization objectives of the network.
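In PyTorch terms, the two losses can be sketched as below. The weighted-sum form of the total objective is our reconstruction from the statement that $\alpha$ and $\beta$ balance the two tasks (both are set to 10 in Section IV-B); the smoothing strength shown is a placeholder, not the paper's value.

```python
import torch
import torch.nn as nn

# Regression branch: MSE of (12).
mse_loss = nn.MSELoss()

# Segmentation branch: cross-entropy with label smoothing in the spirit of (13);
# PyTorch's built-in smoothing redistributes alpha uniformly over the classes.
seg_loss = nn.CrossEntropyLoss(label_smoothing=0.1)  # 0.1 is our guess, not the paper's value

def total_loss(pred_h, ref_h, pred_fp, ref_fp, alpha=10.0, beta=10.0):
    """Weighted multitask objective; the paper sets both weights to 10."""
    return alpha * mse_loss(pred_h.squeeze(1), ref_h) + beta * seg_loss(pred_fp, ref_fp)

pred_h, ref_h = torch.randn(2, 1, 256, 256), torch.randn(2, 256, 256)
pred_fp, ref_fp = torch.randn(2, 2, 256, 256), torch.randint(0, 2, (2, 256, 256))
print(total_loss(pred_h, ref_h, pred_fp, ref_fp))
```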
IV. EXPERIMENTAL SETUP

A. Datasets

Following the dataset construction process in [23], a basic building height dataset covering 62 major cities in China is constructed. We collected Sentinel-1 SAR images, Sentinel-2 optical images, and the corresponding reference building height and building footprint data. The following describes each data product.

1) Sentinel-1/2 Data: The Sentinel-1 satellite provides comprehensive and free global SAR imagery in two polarization bands, VV and VH, with a spatial resolution of 10 m. To ensure complete coverage of the sample area, we merge data from both orbital directions. We set a minimum backscatter coefficient threshold of −30 dB to exclude speckle noise from the observations. Sentinel-2 is a high-resolution multispectral imaging mission consisting of the Sentinel-2A and Sentinel-2B satellites and providing 13 bands. For this study, we selected images from band 2 (blue), band 3 (green), band 4 (red), and band 8 (NIR), all with a spatial resolution of 10 m. The images underwent radiometric calibration and atmospheric correction to produce bottom-of-atmosphere reflectance data.

2) Reference Building Height Data: The reference building height is derived from publicly available field survey data from Autonavi Map ([Link] These data provide the extents of buildings and the corresponding number of floors in major Chinese cities and are released in vector form. Notably, the reliability of the data has been verified in the literature [27], with an MSE of 1.19 m. In our experiments, we transformed the number of floors into building height using experience and building codes: the height of each floor in buildings with fewer than 33 floors is set to 3 m, while buildings with more than 33 floors are assigned 5 m per floor. The data were converted from vector to raster format with a spatial resolution of 10 m.

3) Reference Building Footprint Data: Additional building footprint data are required to obtain fine-grained boundaries because the reference building height data are partially missing. Compared with the dataset used in [23], we extend it to include more detailed building footprint data. The China Building Rooftop Area (CBRA) dataset, which is the first high-resolution (2.5 m), multiyear (2016–2021) building roof area dataset for China [41], is selected as the reference footprint data. The accuracy of CBRA has been validated on 250 000 test samples in urban areas. Compared with the WSF2019 dataset used in [23], CBRA has clearer building contours that are not disturbed by roads.

Fig. 6 illustrates the dataset construction process. First, Sentinel-1 and Sentinel-2 images for the year 2020 are obtained from Google Earth Engine (GEE) [42]. Meanwhile, the reference building height and footprint data were uploaded to GEE and converted to 10-m resolution raster data. High-quality samples were then retained based on a spatial consistency assessment of the reference building height and footprint data, following the methodology described in [23]. All images were segmented into 256 × 256 pixel tiles. Each sample pair includes Sentinel-1 (VV and VH), Sentinel-2 (R, G, B, and NIR), reference building height data, and reference building footprint data. In addition, the dataset was expanded by including pure image pairs, such as forests and water systems, at a 5:1 ratio, producing a total of 3565 images.
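The floor-to-height rule above reduces to a one-line conversion; how a building with exactly 33 floors is treated is our assumption, since the text only distinguishes "fewer than" and "more than" 33 floors.

```python
def floors_to_height_m(n_floors: int) -> float:
    """Floor count to height: 3 m per floor up to 33 floors,
    5 m per floor above that (the boundary case is our assumption)."""
    return n_floors * (3.0 if n_floors <= 33 else 5.0)

print(floors_to_height_m(10), floors_to_height_m(40))  # 30.0 200.0
```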
B. Implementation Details

We build and train our network models with the PyTorch library. The experiments are conducted on an Ubuntu system with an NVIDIA 3090 graphics card with 24 GB of memory. The AdamW optimizer is used with an initial learning rate of 0.001. The StepLR scheduler is used with a step size of 8 epochs and a gamma of 0.95. The number of epochs and the batch size are set to 400 and 32, respectively. The training and validation sets are divided in an 8:2 ratio. During training, the hyperparameters $\alpha$ and $\beta$, which balance the optimization objectives of the model, are both set to 10. Since the building segmentation loss is not of the same order of magnitude as the regression loss, this setting better facilitates multitask learning.

C. Evaluation Metric

In this section, evaluation metrics are defined to quantify the performance of the proposed method. Mean relative error (Rel), root-mean-square error (RMSE), and threshold accuracy ($\delta_\gamma$) are defined to evaluate the effectiveness of building height estimation:

$$\mathrm{Rel} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|h_i - \hat{h}_i\right|}{h_i} \tag{17}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(h_i - \hat{h}_i\right)^2} \tag{18}$$

$$\delta_\gamma = \max\left(\frac{h_i}{\hat{h}_i}, \frac{\hat{h}_i}{h_i}\right) < 1.25^\gamma, \quad \gamma \in \{1, 2, 3\} \tag{19}$$

where $h_i$ is the reference building height, $\hat{h}_i$ is the predicted building height, and $n$ is the number of samples.

RMSE is a widely used metric for evaluating regression problems; it provides an overall measure of the deviation between the predicted and actual values. Rel measures the relative error between the predicted and measured values, while $\delta_\gamma$ measures the degree of proximity between the predicted and reference building heights as the ratio between the two. Note that lower Rel and RMSE indicate smaller errors, while a higher $\delta_\gamma$ indicates more precise results.
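The three height metrics of (17)-(19) can be computed as follows; restricting the evaluation to pixels with a positive reference height is our assumption about how non-building pixels are excluded.

```python
import torch

def height_metrics(pred, ref, eps=1e-6):
    """Rel, RMSE, and threshold accuracies delta_1..3 of (17)-(19),
    computed over pixels with a positive reference height."""
    mask = ref > 0
    p, r = pred[mask], ref[mask]
    rel = ((p - r).abs() / (r + eps)).mean()
    rmse = ((p - r) ** 2).mean().sqrt()
    ratio = torch.maximum(p / (r + eps), r / (p + eps))
    deltas = [(ratio < 1.25 ** g).float().mean().item() for g in (1, 2, 3)]
    return rel.item(), rmse.item(), deltas

pred, ref = torch.rand(256, 256) * 30, torch.rand(256, 256) * 30
print(height_metrics(pred, ref))
```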
TABLE I
Performance of 12 Building Height Estimation Methods. Lower Values of Rel and RMSE Indicate Better Performance, While Higher Values of the Other Metrics Indicate Better Performance
For building footprint segmentation, the mean intersection over union (mIoU) and mean F1-score (mF1) metrics are used to evaluate the performance. The specific formulas are defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{20}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{21}$$

$$\mathrm{mIoU} = \frac{1}{N_c}\sum_{i=1}^{N_c}\frac{TP_i}{TP_i + FN_i + FP_i} \tag{22}$$

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{23}$$

$$\mathrm{mF1} = \frac{1}{N_c}\sum_{i=1}^{N_c} F1_i \tag{24}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively, and $N_c$ denotes the number of categories. For the building height estimation task, mIoU indicates the "shape accuracy" of the height predictions: a higher mIoU means that the building areas are located more accurately, which can lead to more detailed height estimates. It should be noted that we do not mask the building heights with the predicted building footprints, to better reflect the validity of the building height estimates.

D. Methods for Comparison

To verify the effectiveness of the proposed MF-BHNet, we compared it with 11 representative methods. Specifically, BHE-Net [23] and MBHR-Net [21] are CNN-based building height estimation methods, and both use Sentinel-1/2 data. In particular, BHE-Net is a dual-branch U-Net framework that learns the features of the different modalities separately through partially interacting encoders. It adopts two decoders to produce the building height and building footprint outputs, respectively. MBHR-Net follows a dual-encoder single-decoder structure: it uses two encoders to extract the SAR image features and optical image features, respectively, and then cascades the two modality features. In MBHR-Net, the building height is generated via a single decoder, and the building footprint cannot be directly obtained.

Moreover, nine state-of-the-art pixel-level prediction networks were selected for comparative experiments, as current deep learning-based building height estimation models are commonly adapted from the field of semantic segmentation. Specifically, we selected FCN [38], U-Net [43], SegNet [44], LinkNet [45], DeepLabV3+ [46], and SFNet [47], which use CNN architectures; SegFormer [48] and Lawin [49], which use transformer frameworks; and TransUnet [50], which employs a hybrid CNN-Transformer architecture. We redesign these nine networks with a dual-branch encoder–decoder structure to simultaneously produce building height and building footprint outputs. In the subsequent experiments, we denote them as Net_HE.

V. RESULTS AND DISCUSSION

A. Performance Comparison

1) Comparisons With Other State-of-the-Art Methods: To evaluate the performance of MF-BHNet, we conducted comparative experiments using the constructed dataset. Table I shows the performance of the 12 methods for building height estimation and building footprint segmentation; the best results are highlighted in bold.

Furthermore, compared with BHE-Net and MBHR-Net, which are specialized height estimation networks, the RMSE of MF-BHNet is reduced by 1.752 and 3.579 m, respectively. The current methods rarely consider intramodal discriminative information or fine-grained multimodal information interactions, which limits their performance. In contrast, our method has significant advantages owing to the designed multimodal feature-semantic fusion module. With progressive fusion from coarse to fine, modality-specific valuable knowledge and modality-complementary information are effectively mined.
Fig. 7. Building height estimation in China: (a) Beijing, (b) Guangzhou, (c) Baoding, and (d) Shaoxing.
Fig. 8. Intercomparison of building height mapping with Wu et al. [54], Cai et al. [23], Huang et al. [55], and Ma et al. [56].
In addition, we present estimation maps for four distinct urban areas in China: Beijing, Guangzhou, Baoding, and Shaoxing. These areas are chosen for their varying economic levels and urban height distributions. For example, Beijing and Guangzhou are more developed and have more high-rise buildings, while Baoding and Shaoxing have lower buildings. Fig. 7 shows the building height mapping results. It can be seen that our method is optimal in all scenarios. As shown in Fig. 7(a) and (b), MF-BHNet provides more accurate heights and footprints for tall buildings. In contrast, BHE-Net and U-Net_HE, which perform better among the comparison methods, produce false alarms in the image pixels neighboring tall buildings. Fig. 7(c) and (d) demonstrates the difficulty the comparison methods have in distinguishing building boundaries in dense and low-rise areas. Instead, our method provides better spatial detail in building height mapping.

By intramodal critical information mining and multimodal fusion, our method utilizes building boundary information from the optical images to refine the distribution of building heights. In summary, the proposed method effectively improves the accuracy of building height estimation and has significant advantages over the compared methods.
TABLE II
Performance of Methods With Different Modules
2) Benefits of the Transformer in MF-BHNet: Encoding–decoding segmentation networks built on top of CNNs, such as U-Net, have become the standard for pixel-level prediction tasks. However, the inherent locality of convolutional operations limits their ability to model explicit global contextual dependencies. Transformer architectures can effectively model global contextual dependencies in images, compensating for the purely local modeling of convolutional networks. This is crucial for capturing the semantic features of complex urban scenes [51].

The results in Table I demonstrate the significant role of the transformer in our approach. Compared with the CNN-only U-Net_HE method, the RMSE of MF-BHNet is reduced by 1.5 m. However, simply introducing a self-attention architecture does not always lead to better regression results. For example, the performance of SegFormer_HE and Lawin_HE, which use purely self-attention designs, is lower than that of CNN models such as U-Net_HE and LinkNet_HE. This could be because SegFormer_HE and Lawin_HE model the global context from the lowest level and lack detailed localization information from low-resolution features, leading to poor building height regression. To address this, MF-BHNet uses a vision transformer (ViT) to encode the multimodal feature maps from the CNN in patches, combining local feature representation with global context dependencies.

Some studies have also explored combining CNNs and self-attention for more robust feature learning. For instance, the CNN-Transformer hybrid architecture of TransUnet_HE achieved a suboptimal RMSE of 5.0283 m. However, metrics such as $\delta_1$ and $\delta_2$ for TransUnet_HE are still unsatisfactory, indicating inaccurate building height predictions with many over- or underestimations. Optimizing the model architecture alone is thus of limited benefit for accurately predicting building heights. In contrast, our method effectively mines and fuses the multimodal information, reducing the RMSE by 1.38 m compared with TransUnet_HE.

Overall, the transformer architecture in MF-BHNet enhances the model's understanding of building structure and improves height prediction accuracy. However, the challenges of computational overhead and local information modeling still need to be addressed. The key direction for improving MF-BHNet is rational network design and a multimodal fusion strategy that fully leverages the advantages of the transformer.

B. Ablation Study

1) Module Validity: To determine the impact of the different components of MF-BHNet, ablation studies are conducted. Specifically, we investigate the impact of four modules: the multiscale feature fusion, the multimodal semantic fusion, the intramodal encoder, and the cross-modal encoder, denoted as MFF, MSF, IME, and CME, respectively. Table II presents the experimental results of methods equipped with different modules. The baseline represents the raw network with all of our improvements removed.

It can be seen that methods with a single module outperform the baseline. For example, MFF captures richer modal information by fusing multilevel features and reduces the RMSE by 21.2% compared with the baseline. MSF reduces the RMSE by 11.9% over the baseline, indicating that semantic fusion can facilitate the joint prediction from SAR and optical data. However, the improvement from a single module is limited because it only considers a few factors. When MFF is coupled with MSF, a striking effect is produced: the RMSE of MFF + MSF is reduced by 28.8% compared with the baseline, while $\delta_1$ is improved by 48.4%. This indicates that the proposed feature fusion and semantic fusion can fully utilize the complementary nature of the multimodal data and effectively improve building height estimation. By combining all modules, MF-BHNet achieves the highest $\delta_1$ of 50.8%, indicating that the predicted building heights are close to the reference building heights. In addition, MF-BHNet also achieves the best building segmentation results, confirming that it is an effective multitask learning framework. The experimental results demonstrate that MF-BHNet can integrate the advantages of each module to improve the accuracy of building height estimation.

2) Effectiveness of SAR and Optical Data Fusion: It has been found that both optical and SAR imagery can be used for building height estimation, but their fusion provides better performance. However, this conclusion is based on machine learning models, and deep learning models have stronger feature representation capabilities. Therefore, it is necessary to investigate the performance of deep learning-driven multimodal collaboration. We investigate the performance of band-level fusion, pixel-level fusion, and feature-level fusion methods. For clarity, the comparison methods are defined as follows.
TABLE III
Performance Comparison of Data Fusion Strategies for Building Height Estimation
Fig. 9. Validation of building height of 55 cities in China. (a) RMSE distribution. (b) R-square distribution.
Fig. 10. Noisy samples. (a) Reference height data mismatch. (b) Reference height data partially missing.
Fig. 12. Difference between predicted and ground truth. Four regions from (a) Wuhan, (b) Hong Kong, (c) Macau, and (d) Changsha.
The results of BHE-Map are more blurred, and there are some false positives between neighboring buildings. The 30-m product by Huang et al. [55] and the 150-m GBH-2020 tend to cluster image pixel heights and cannot distinguish between individual buildings. In summary, our method demonstrates superior mapping results compared with the published building height products. As an end-to-end mapping method, it provides a novel and more efficient solution for rapidly mapping building height at national and even planetary scales.

2) Generalizability in Different Cities of China: We visualize the relationship between reference building heights and estimated building heights for 55 cities in China to reveal the generality of MF-BHNet across different geographical areas and urban layouts. These cities have relatively complete reference data of high quality. Fig. 9 shows the RMSE and R-square ($R^2$) for the different cities.

Overall, most of the cities exhibit high goodness-of-fit and low prediction errors, indicating the generalizability of our method across different cities. Specifically, 15 cities have RMSE values below 1 m, demonstrating very high prediction accuracy; these include megacities such as Beijing, Shanghai, Guangzhou, and Shenzhen, as well as other large cities such as Hangzhou, Tianjin, Wuhan, and Xi'an. This suggests that the MF-BHNet model performs exceptionally well in high-density urban building areas. Furthermore, the $R^2$ values of 20 cities exceed 0.9, meaning that the model is able to explain more than 90% of the building height variability in these cities.

In contrast, Macau has the largest RMSE value of 3.906, while Lhasa, Yantai, and Sanya have the lowest $R^2$ values of less than 0.65, indicating a relatively weaker fit of the model in these locations. Macau, Sanya, and Yantai are situated in coastal areas with complex and variable topography, and the influence of island/peninsular geography on building forms may affect the estimation accuracy of the model. The buildings in Lhasa are located in a mountainous highland area with a complex topographic environment. Therefore, it may be necessary to explore the integration of additional geographic data (e.g., topographic information) to enhance the model's ability to predict building heights in complex geographic environments.

D. Limitation and Future Study

1) Analysis of Limited Reference Data: As in other published studies, the building height estimates in this work rely on reference data. However, the reference building height data and the reference footprint data come from different sources in this study. Our observations indicate that the reference footprint data have a high degree of completeness, while the reference height data have significant deficiencies, as shown in Fig. 10. To explore the inconsistency in the reference height data, we use the reference building footprint data as the ground truth for the distribution of buildings. Fig. 11 presents the percentage of missing data in the reference building height data for the different cities.
that most cities have inconsistency values between 0.25 and multimodal encoder to extract intramodal valuable informa-
0.35. Some cities, such as Eerduosi, Loyang, and Ningbo, tion and model intermodal correlation. In this way, better
have more than 35% of the reference height data missing, intermodal feature fusion is facilitated by perceiving modal
meaning that over one-third of the building height information information more comprehensively from local to global. Fur-
is unavailable for these cities. To avoid this inconsistency from thermore, a coarse-fine progressive multimodal fusion method
misleading the model, we applied a masking operation during is proposed to bridge the heterogeneity problem of SAR
model training to ignore the effect of the inconsistent image images and optical images. Through the coarse-to-fine multi-
pixels on the loss calculation. For future work, methods such scale feature fusion and MSF, the complementary information
as transfer learning or semisupervised learning can be explored provided by SAR and optical data can be effectively integrated
to supplement the training data for cities with severe missing and fused. Besides, a building height dataset is constructed
height data. by introducing new building footprint data. The experimental
2) Failure Cases: The aforementioned experiments demonstrate that our method achieves the most competitive performance. Inevitably, our method also produces false positives in a few regions. Thus, we conduct a detailed error analysis to identify where the model underperforms and potential areas for improvement. Specifically, we select a few regions where the RMSE is greater than 5 m and generate plots of the absolute error between the predicted results and the ground truth. Fig. 12 shows the Sentinel-2 imagery, predictions, ground truth, and difference maps for the different regions.
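A minimal sketch of this failure-region screening follows, under the assumption that each region is available as a pair of prediction and ground-truth rasters; the variable names and input layout are hypothetical.

```python
# Sketch of the failure-region screening: keep regions whose RMSE
# exceeds 5 m and compute their absolute-error (difference) maps.
# Variable names and the input dict layout are hypothetical.
import numpy as np

def screen_failure_regions(regions, rmse_threshold=5.0):
    """regions: {name: (pred_raster, ref_raster)} -> {name: |pred - ref|}."""
    failures = {}
    for name, (pred, ref) in regions.items():
        rmse = np.sqrt(np.mean((pred - ref) ** 2))
        if rmse > rmse_threshold:
            failures[name] = np.abs(pred - ref)  # map inspected alongside imagery
    return failures
```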
The predicted results are generally in good agreement with the true values in the overall trend. However, there are also issues of omissions, false positives, and underestimations. For example, in Fig. 12(b), the buildings on the small islands around the land are omitted. This is due to the low spatial resolution of the Sentinel-2 imagery, which makes it difficult for small buildings on isolated islets to be clearly recognized. Meanwhile, some buildings are underestimated, as shown in the lower left of Fig. 12(c). We found that the high and undulating terrain in this area may prevent the SAR sensor from fully receiving the backscattered signals from the buildings, weakening the backscattering. Furthermore, interference from extraneous background features can also affect the building extraction.
In addition, inaccuracies in the dataset can also cause problems for the model. For example, in the selected zoomed-in view of Fig. 12(a), the reference building height data and the image data are not consistent. The reference building heights were not collected at a single time cross section; hence, there is label noise, which may push the model in the wrong optimization direction.
To address the above issues, image super-resolution, multisource data integration, and noise-robust learning are promising future research directions. Image super-resolution methods can be used to improve the discriminability of building structures. The integration of multisource geographic data, such as land cover and terrain information, can be explored to improve the prediction accuracy of the model. In addition, noise-robust learning can be considered as a potential way to correct noisy supervision during model optimization.
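To make the last direction concrete, one classical noise-robust regression objective is the Huber loss, sketched below; it is offered only as an example of noise-robust learning and is not part of MF-BHNet.

```python
# Example of a noise-robust regression objective (Huber loss):
# quadratic for small residuals, linear beyond `delta`, so grossly
# mislabeled heights contribute only bounded gradients.
import torch

def huber_loss(pred: torch.Tensor, target: torch.Tensor,
               delta: float = 3.0) -> torch.Tensor:
    abs_err = (pred - target).abs()
    quad = torch.clamp(abs_err, max=delta)   # quadratic zone
    lin = abs_err - quad                     # linear tail for outliers
    return (0.5 * quad ** 2 + delta * lin).mean()
```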
VI. CONCLUSION

In this article, we present a novel hybrid multimodal fusion network for mapping accurate building height with Sentinel-1 SAR and Sentinel-2 optical imagery. First, we design a hybrid multimodal encoder to extract valuable intramodal information and model intermodal correlation. In this way, better intermodal feature fusion is facilitated by perceiving modal information more comprehensively from local to global. Furthermore, a coarse-fine progressive multimodal fusion method is proposed to bridge the heterogeneity problem of SAR images and optical images. Through the coarse-to-fine multiscale feature fusion and MSF, the complementary information provided by SAR and optical data can be effectively integrated and fused. Besides, a building height dataset is constructed by introducing new building footprint data. The experimental results show that our method outperforms the second-best of the 11 compared state-of-the-art methods by 27.6% in RMSE. Moreover, compared with four publicly available building height products, the mapping result of the proposed method has significant advantages in spatial detail and accuracy.

Our method has so far been tested only in selected cities in China. In the future, our work will be extended to the national scale to perform large-scale building height mapping. Furthermore, only satellite remote sensing imagery is currently considered, although crowdsourced geographic information data are also useful for building height estimation and validation. We will introduce crowdsourced data to compensate for the mapping limitations of remotely sensed imagery.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their comments, which improved this article. They would also like to thank Prof. Guang Zheng and Dr. Xiao Ma for providing the reference data.
Siyuan Wang received the M.S. degree in cartography and geographical information engineering from Central South University, Changsha, China, in 2023. He is currently pursuing the Ph.D. degree in photogrammetry and remote sensing with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China. His research interests include urban remote sensing and deep learning.

Qing Ding received the Ph.D. degree in resources and environment from Wuhan University, Wuhan, China, in 2023. He is currently a teacher with the College of Geo-Exploration Science and Technology, Jilin University, Changchun, China. His research interests include urban remote sensing mapping and change detection.