2. Literature Review
The application of remote sensing technology has advanced significantly in recent years,
especially in the areas of object segmentation and height extraction. High-resolution aerial
imagery and LiDAR (Light Detection and Ranging) have converged into a powerful means of
gathering three-dimensional data. The literature on this topic spans a variety of approaches,
from conventional geospatial methodologies to the most recent deep learning models. This
section reviews important works in the field and situates the current study within that larger
body of research.
2.1 Traditional Methods for Height Extraction
Conventional geospatial approaches extracted object heights from remote sensing data
using Digital Elevation Models (DEMs) and Digital Surface Models (DSMs). DSMs record the
elevations of surface objects such as buildings, trees, and other structures, whereas DEMs
represent the bare-earth surface (Weinstein et al., 2020). By subtracting the DEM from the
DSM, researchers can derive the height of objects above ground, a product often called a
normalized DSM (nDSM). Although this method is simple, it frequently requires a large amount
of manual preprocessing, especially when working with complex landscapes.
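
To make the workflow concrete, the following is a minimal sketch of this DSM-minus-DEM
subtraction using the open-source rasterio library; the file names are placeholders, and the
two rasters are assumed to be co-registered on the same grid.

```python
# Minimal sketch: derive a normalized DSM (nDSM) by subtracting a DEM
# from a DSM. File names are illustrative assumptions; both rasters are
# assumed to share the same extent, resolution, and CRS.
import numpy as np
import rasterio

with rasterio.open("dsm.tif") as dsm_src, rasterio.open("dem.tif") as dem_src:
    dsm = dsm_src.read(1).astype("float32")
    dem = dem_src.read(1).astype("float32")
    profile = dsm_src.profile  # reuse georeferencing for the output

# Per-pixel object height above the bare-earth surface.
ndsm = dsm - dem
ndsm[ndsm < 0] = 0  # clamp small negative artifacts from interpolation

profile.update(dtype="float32", count=1)
with rasterio.open("ndsm.tif", "w", **profile) as dst:
    dst.write(ndsm, 1)
```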
One of the main drawbacks of these conventional techniques is their inability to
automatically segment individual objects within the data. Every object (such as a tree or
building) must be identified manually, which adds effort to the process and increases the
possibility of human error. These techniques also typically struggle to distinguish overlapping
objects, such as adjacent buildings in an urban environment or trees in a dense forest (Wang et
al., 2017). Machine learning and deep learning have emerged as solutions to these problems by
automating the segmentation and classification of objects.
2.2 Machine Learning Approaches
The application of machine learning (ML) and deep learning (DL) techniques to LiDAR
and aerial imagery has transformed remote sensing. Convolutional Neural Networks (CNNs)
were among the first such models to be widely adopted, particularly for object recognition and
classification (Qi et al., 2017). CNNs achieve high recognition accuracy because they excel at
processing grid-like data structures such as aerial images; applied directly to unstructured
point clouds, however, they are far less effective.
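
This contrast in data structure can be illustrated directly. In the sketch below (file names
hypothetical), an aerial image loads as a regular (rows, columns, bands) grid that a CNN can
convolve over, whereas a LiDAR tile read with the laspy library yields only an unordered set of
(x, y, z) points.

```python
# Illustrative contrast between grid-structured imagery and an
# unstructured LiDAR point cloud. File names are hypothetical.
import numpy as np
import laspy
from PIL import Image

# An aerial image is a regular grid of shape (rows, cols, bands),
# so convolutions can slide over fixed pixel neighborhoods.
image = np.asarray(Image.open("aerial_tile.png"))
print(image.shape)  # e.g. (1024, 1024, 3)

# A LiDAR tile is an unordered set of (x, y, z) points with no fixed
# neighborhood structure, which is why standard CNNs do not apply
# directly and point-based models were needed.
las = laspy.read("lidar_tile.las")
points = np.vstack((las.x, las.y, las.z)).T
print(points.shape)  # e.g. (2_500_000, 3); row order carries no meaning
```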
PointNet and its successor, PointNet++, were developed to overcome this limitation.
PointNet++ (Qi et al., 2017) is a deep learning model designed specifically to operate on point
clouds; its ability to segment and classify individual points according to their spatial
relationships makes it well suited to applications such as object detection and height extraction
from LiDAR data. Although PointNet++ has demonstrated strong performance in many
settings, it still requires substantial labeled training data and computational resources, which
may be prohibitive for practitioners or smaller research teams.
Transformer models, originally developed for Natural Language Processing (NLP), have
in recent years been adapted for computer vision (Dosovitskiy et al., 2020). Because the
transformer architecture processes data in parallel and scales well, it is highly effective for
large-scale tasks such as multi-modal data fusion. The advent of Vision Transformers (ViTs)
has made it practical to combine LiDAR and aerial imagery, further broadening the possibilities
of deep learning in remote sensing (Dosovitskiy et al., 2020). However, the complexity and
high computational requirements of these models mean that, in practice, typically only
well-funded research organizations can apply them.
2.3 Segmentation Models
Accurately segmenting objects from aerial imagery and LiDAR point clouds is a major
challenge in height extraction. Models such as the Segment Anything Model (SAM) were
developed in response to this challenge (Kirillov et al., 2023). SAM is a foundation model that
detects and segments objects in an image with very little human intervention. Because it
adapts well to varied data types, it has been applied successfully in a number of domains,
including medical imaging and remote sensing.
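
As an illustration, the sketch below follows the public API of the segment-anything package to
generate masks automatically over an aerial tile; the checkpoint and image paths are
assumptions, not part of the cited work.

```python
# Hedged sketch of automatic mask generation with SAM, following the
# segment-anything repository's public API. Checkpoint and image paths
# are assumptions.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.asarray(Image.open("aerial_tile.png").convert("RGB"))
masks = mask_generator.generate(image)  # list of dicts, one per segment

# Each entry carries a boolean mask plus metadata such as segment area
# and a predicted-quality score.
for m in masks[:3]:
    print(m["area"], m["predicted_iou"], m["segmentation"].shape)
```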
Although SAM and related models, such as DeepForest (Weinstein et al., 2020), have
shown encouraging results in object detection from remote sensing data, they are not without
drawbacks. One of SAM's main disadvantages is its heavy reliance on labeled data for
fine-tuning. Furthermore, SAM's effectiveness generally deteriorates on larger or more intricate
objects, such as dense forest canopies or tall buildings, especially at lower resolutions
(Kirillov et al., 2023). Similarly, DeepForest, a tool designed specifically for tree-canopy
detection, has demonstrated excellent accuracy in identifying individual tree crowns but
struggles with overlapping canopies in dense stands (Weinstein et al., 2020).
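
For comparison, a minimal DeepForest prediction looks roughly like the following, based on the
package's documented 1.x API; the input image path is a placeholder.

```python
# Minimal sketch of tree-crown detection with the DeepForest package
# (Weinstein et al., 2020), per its 1.x API; the image path is assumed.
from deepforest import main

model = main.deepforest()
model.use_release()  # load the prebuilt NEON tree-crown model

# Returns a pandas DataFrame of predicted crown boxes:
# xmin, ymin, xmax, ymax, label, score (pixel coordinates).
boxes = model.predict_image(path="aerial_tile.png")
print(boxes.head())
```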
2.4 Multi-Modal Data Fusion and Large Language Models
The integration of multi-modal data, which combines information from multiple sensors
(e.g., LiDAR, RGB, infrared), is a developing trend in remote sensing. By drawing on the
strengths of each data source, this approach improves the accuracy of object recognition and
classification: LiDAR provides precise height information, for instance, while RGB imagery
captures the color and texture needed for more nuanced object recognition (Yang et al., 2024).
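
In a height-extraction pipeline, such fusion can be as simple as overlaying an RGB-derived
segment mask on a LiDAR-derived nDSM and summarizing the heights within each footprint.
The sketch below uses synthetic data, and the 95th-percentile summary is one plausible choice
rather than a prescribed method.

```python
# Illustrative fusion of an RGB-derived segment mask with LiDAR-derived
# heights: summarize the nDSM within each object's footprint. The mask
# and nDSM arrays are assumed to be co-registered on the same grid.
import numpy as np

def object_height(ndsm: np.ndarray, mask: np.ndarray) -> float:
    """Estimate an object's height as a high percentile of nDSM values
    inside its mask, which is robust to edge pixels and noise."""
    values = ndsm[mask]
    return float(np.percentile(values, 95)) if values.size else 0.0

# Example with synthetic data: a 100x100 height grid and one mask.
ndsm = np.random.rand(100, 100) * 2.0
ndsm[40:60, 40:60] += 12.0          # a ~12 m "building"
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
print(f"estimated height: {object_height(ndsm, mask):.1f} m")
```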
Recent developments in large language models (LLMs) have also begun to influence
remote sensing. Models such as SkyEyeGPT and EarthGPT demonstrate that multi-modal
LLMs can process and interpret remote sensing data, including aerial images and LiDAR (Yang
et al., 2024). Because they accept both textual and visual inputs, these models are highly
adaptable to tasks that require multi-step analysis or complex reasoning. Although still
experimental, the integration of LLMs with remote sensing data offers intriguing prospects for
the future of object detection and height extraction.
2.5 Gaps in the Literature
Despite the rapid advances in deep learning and transformer-based models, the
literature still lacks accessible, computationally efficient approaches that researchers without
substantial funding or machine learning expertise can employ. However powerful the current
models are, their training and fine-tuning require large labeled datasets and substantial
computational infrastructure, to which smaller institutions and individual researchers
frequently lack access.
The method proposed in this paper addresses this gap by using commonly available
geospatial tools to extract object heights from aerial imagery and LiDAR data in a procedural
manner. The proposed approach requires less processing power and no specialized machine
learning expertise, making it more accessible than deep learning-based models. By providing a
practical, replicable methodology, this work aims to bridge conventional techniques and the
most recent developments in GeoAI.