DETERMINATION OF ELEVATIONS FOR EXCAVATION
OPERATIONS USING DRONE TECHNOLOGIES
by
Yuhan Jiang
Milwaukee, Wisconsin
August 2020
ABSTRACT
DETERMINATION OF ELEVATIONS FOR EXCAVATION
OPERATIONS USING DRONE TECHNOLOGIES
Yuhan Jiang
This research project has three major tasks. First, the high-resolution ortho-image
and elevation-map datasets were acquired using the low-high ortho-image pair-based 3D-
reconstruction method. In detail, a vertical drone path is designed first to capture a 2:1
scale ortho-image pair of a construction site at two different altitudes. Then, to
simultaneously match the pixel pairs and determine elevations, the developed pixel
matching and virtual elevation algorithm provides the candidate pixel pairs in each virtual
plane for matching, and the four-scaling patch feature descriptors are used to match them.
Experimental results show that 92% of pixels in the pixel grid were strongly matched,
where the accuracy of elevations was within ±5 cm.
Second, the acquired high-resolution datasets were applied to train and test the
ortho-image encoder and elevation-map decoder, where the max-pooling and up-
sampling layers link the ortho-image and elevation-map in the same pixel coordinate.
This convolutional encoder-decoder was supplemented with an input ortho-image
overlapping disassembling and output elevation-map assembling algorithm to crop the
high-resolution datasets into multiple small-patch datasets for model training and testing.
Experimental results indicated that the 128×128-pixel small-patch size had the best elevation estimation performance, where 21.22% of the selected points exactly matched the "ground truth" and 31.21% of points were accurately matched within ±5 cm.
ACKNOWLEDGMENTS
Yuhan Jiang
I would like to thank my advisor, Dr. Yong Bai, for guiding me and supporting
me throughout every single step of writing this dissertation. He has contributed so much
to my practical understanding of construction engineering and management.
I would like to thank my committee members, Dr. Saeed Karshenas and Dr.
Wenhui Sheng, for their invaluable advice and words of encouragement.
I would like to thank the MU employees who assisted me: Mr. Matthew Derosier,
for maintaining the workstation system; Dr. Anna P. Scanlon, Mr. Quinn Furumo and Mr.
Logan Newstrom for writing support.
Finally, but most importantly, I am thankful for my lovely family. Their belief in
me has sustained me throughout my life and helped me to reach my goals.
TABLE OF CONTENTS
2.3.3 Image Processing and Computer Vision with Deep Learning ..............................9
3.6 Image Processing and Computer Vision with Deep Learning .......................................... 35
4.3 Pixel Grid Matching and Elevation Determination Algorithm Design ............................. 47
4.3.2 Low-high Ortho-image Pair Pixel Matching and Virtual Elevation Algorithm . 50
4.3.3 Low-high Ortho-image Pair Pixel Grid and Elevation-map Algorithm ............. 52
4.4 Pixel Grid Matching and Elevation Determination Experiment Design ........................... 56
6.4.2 Elevation Estimation Deep Learning Model Training and Validation ............... 86
7.4.2 Vegetation Identifying Deep Learning Model Training and Validation ........... 111
LIST OF TABLES
Table 22 Vegetation Identifying Model Training Parameters and Results .................................................. 111
LIST OF FIGURES
Figure 2 Search Configuration for ASCE library (left) ScienceDirect (right) ............................................... 12
Figure 4 An example of site plan with grid lines and contour lines .............................................................. 15
Figure 8 Prediction of drone applications in COENG and AUTCON for 2019 and 2020 ............................ 17
Figure 13 Four-channel RGB-D matrix, red, green, blue pixel, and gray depth value .................................. 27
Figure 30 Virtual depth-elevation model and pixel matching and elevation determination flowchart .......... 50
Figure 31 Pseudocode of low-high ortho-image pair pixel matching and virtual elevation algorithm ......... 52
Figure 41 Serpentine style drone path for roadway construction project ...................................................... 66
Figure 55 Workflow of the ortho-image disassembling and elevation-map assembling algorithm .............. 83
Figure 60 Data A: ground truth patches and model prediction patches (w/ early stopping) ......................... 88
Figure 61 Data CI: overlapping assembly of model predictions (w/ early stopping) .................................... 88
Figure 63 Data AO: predictions with different patch size and different epochs ............................................ 90
Figure 64 Data AO: point cloud comparison between predictions and ground truth .................................... 90
Figure 65 Patch size comparison Ⅰ: predictions for each training dataset ................................................... 92
Figure 66 Patch size comparison Ⅱ: additional predictions for data A and B .............................................. 93
Figure 67 Epochs comparison Ⅰ: predictions for early stopping vs 100 epochs .......................................... 94
Figure 68 Epochs comparison Ⅱ: training and validation loss (128×128-pixel vs 256×256-pixel) ............. 95
Figure 75 Testing dataset of ortho-image, label-image and elevation-map pair ......................................... 102
Figure 77 CNN-based image classification model with 32×32-pixel patch ................................................ 104
Figure 82 Training and validation results Ⅰ: loss and accuracy w/ early stopping trials ........................... 112
Figure 83 Training and validation results Ⅱ: model predictions of data AM w/ early stopping trials ....... 113
Figure 84 Training and validation results Ⅲ: loss and accuracy of 50-epoch ............................................. 114
Figure 85 Training and validation results Ⅳ: assembly of model predictions ............................................ 115
Figure 87 Vegetation identifying results: vegetation index and mapped prediction error ........................... 117
Figure 88 Vegetation removing results Ⅰ: modified label-image and elevation-map ................................ 119
Figure 90 Vegetation removing results Ⅲ: elevation differential statistic summary .................................. 120
INTRODUCTION
1.1 Background
Excavations on construction sites are multi-scale scenes for surveying and measuring – they vary
from the larger area cut/fill projects with aerial-range measurements to the pit/trench excavation projects
with close-range measurements (Nunnally 2004; Barazzetti et al. 2010; Nex and Remondino 2014; Spence
and Kultermann 2016). Surveying plays a crucial role in determining the construction site’s geometrical
data – elevations and locations – which is important for measuring earth cut/ fill volume and designing the
excavation plan (Nunnally 2004; Peurifoy and Garold 2014). Additionally, elevations also benefit construction professionals in optimizing earth-moving paths (Seo et al. 2011; Gwak et al. 2018), designing temporary hauling roads (Yi and Lu 2016), estimating cost and time duration (Hola and Schabowicz 2010), and designing the site safety facilities (Wang, Zhang and Teizer 2015).
In the past decade, surveying on construction sites has shifted from contact methods to non-contact methods. Historically, surveying operations on construction sites were accomplished by contact methods – total station, GPS, measuring tape, level and theodolite (Nichols and Day 2010). Those tools help construction professionals acquire enough of a site's geometrical information – distances, angles, points' positions and elevations – for measuring operations by drawing a site plan and calculating earth cut/fill quantities with the four-point method (Nunnally 2004; Peurifoy and Garold 2014; Spence and Kultermann 2016). Within the past decade, non-contact surveying methods were developed and applied in construction surveying, including terrestrial laser scanning, vehicle-borne/Unmanned Aerial Vehicle (UAV)-borne LiDAR (Du and Teng 2007; Takahashi et al. 2017; Kwon et al. 2017; Maghiar and Mesta 2018) and close-range/aerial photogrammetry (Nassar and Jung 2012; Siebert and Teizer 2014; Sung and Kim 2016; Takahashi et al. 2017; Kwon et al. 2017; Maghiar and Mesta 2018). These non-contact surveying methods help construction professionals to acquire a point cloud – a set of coordinated points – for creating a construction site's digital terrain model (DTM), from which earthwork quantities can then be measured.
Although current surveying methods could achieve a precise result, their weaknesses are
noticeable (Du and Teng 2007; Takahashi et al. 2017; Maghiar and Mesta 2018). The contact methods rely
on surveyors' movement from one target point to the next on the construction site, which leads to a time-consuming outdoor procedure and a high probability of interfering with other construction operations. The non-contact methods avoid this conflict and reduce the surveying time to some extent by scanning targets from one ground station to the next, or along a well-designed flight path; however, they produce a huge amount of unfiltered targets, such as vegetation and other objects attached to the construction site. Moreover, processing the scanned data in non-contact methods is not fast. A previous study reported that estimating on-site soil volume after drone photogrammetry takes one processing day when the point cloud is generated by Agisoft PhotoScan, the geometry model is created from the point cloud in Autodesk ReCap, and the soil volume is estimated with Autodesk Civil 3D (Haur et al. 2018). In addition, the airborne LiDAR system will not be a reasonable piece of surveying equipment for construction applications until its price drops to an affordable level (Guo et al. 2017). Thus, quickly and
accurately determining elevations of a construction site in real-time is still a challenge for the construction
industry.
A potential approach to minimizing the processing time for determining construction site elevations with an image-based 3D-reconstruction method is to reduce the number of images that need to be processed. Over the past decades, previous research attempted 3D-reconstruction from a single-frame image with other geometrical reference information and confirmed that it is an ill-posed problem without any reference information (Van den Heuvel 1998; Hassner and Basri 2006; Saxena et al. 2008). In recent years, researchers have continuously developed innovative approaches to estimate relative depth from a single image by taking advantage of convolutional neural networks (CNNs) and deep learning (Eigen et al. 2014; Liu et al. 2015; Laina et al. 2016; Zhou et al. 2017). For the construction industry, using advanced artificial intelligence (AI) technologies to automatically determine elevations directly from an image of a construction site is an interesting research topic and a meaningful challenge. Once it is overcome, real-time 3D-reconstruction of a construction site becomes possible, and the degree of automation in excavation operations can then be improved.
Recently, small-sized drones (a system of a quadcopter, a gimbal and a small digital camera) have increasingly been regarded as a valid, cheap alternative remote imaging platform to large UAVs in civil engineering applications. Their small dimensions make them easily navigable in cluttered outdoor environments and indoor environments (Takahashi et al. 2017; Siebert and Teizer 2014). In the drone application of construction site elevation determination, the main challenge is measuring vertical distances (depths) from the camera to the construction site ground surface. As shown in Figure 1, the gimbal attached to the drone allows the camera to face any desired orientation. Specifically, when the camera's principal ray is perpendicular to the construction site ground surface plane, the captured image is the top-view of the construction site (Siebert and Teizer 2014).
[Figure 1: Drone camera model for ortho-imaging – camera lens O, focal length f, image point p(x, y, f) in the camera coordinate system, target point P(X, Y, Z) in the world coordinate system, flight height H above the ground, and the sensor, image and site height/width dimensions]
Using the camera model in Figure 1 to determine the distance from the camera lens to the ground is an ill-posed problem; it needs at least one additional overlapping ortho-image taken from another position, and the spatial relationship between the two positions must be known. For example, the traditional aerial photogrammetry method needs an ortho-image series with a high overlapping ratio to complete the image-based terrain 3D-reconstruction task (Nassar and Jung 2012; Siebert and Teizer 2014), which makes it impossible to generate and output the elevation data quickly. In contrast, the classic left-right stereo-vision method, which is designed for determining depths of forward-facing objects, is the fastest multiple-image-based 3D-reconstruction method (Sung and Kim 2016; Sophian et al. 2017). The stereo-vision method performs a two-frame image-based 3D-reconstruction based on the triangulation model and saves all depth information in a depth-map (a grayscale image) as the result. However, the stereo-vision method is limited to measuring distances from objects' surfaces to the stereo camera system at close range, because its measurable depth range is limited by its small baseline (the distance between the two cameras). Furthermore, stitching stereo-vision results makes it no different from the traditional aerial photogrammetry method, and the multiple-overlapping-ortho-image-based method has been confirmed to be ineffective on ground surfaces with large slopes (Westoby et al. 2012; Zhao and Lin 2016). Thus, the construction industry still waits for a more rapid and simpler image-based 3D-reconstruction method for determining construction site elevations, which will help construction professionals to manage their crews and avoid excess waste during excavation operations.
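To make the role of the known spatial relationship concrete, the short sketch below works through the simplified pinhole geometry of two nadir ortho-images taken at known flight heights over the same point. It is only an illustration of the underlying geometry, under the assumption that the two camera stations are vertically aligned and lens distortion is ignored; it is not the pixel matching and virtual elevation algorithm developed in this research.

```python
# Simplified two-height nadir geometry (illustration only; not the
# dissertation's pixel matching and virtual elevation algorithm).
# A ground point at elevation h above the takeoff datum and horizontal
# offset X from the nadir projects to image offset x = f * X / (H - h)
# when the camera hovers at height H with its principal ray pointing down.

def elevation_from_pair(x_low, x_high, h_flight_low, h_flight_high):
    """Solve h from the two projections of the same ground point.

    x_low, x_high               : signed image offsets of the matched point from
                                  the principal point in the low/high ortho-images.
    h_flight_low, h_flight_high : flight heights of the two camera stations (m).
    """
    # x_low * (H_low - h) = x_high * (H_high - h)  ->  solve for h
    return (x_low * h_flight_low - x_high * h_flight_high) / (x_low - x_high)


# Example: a point 1.0 m above the datum, 3.0 m from the nadir, f normalized to 1.
f, X, h_true = 1.0, 3.0, 1.0
H_low, H_high = 10.0, 20.0
x_low = f * X / (H_low - h_true)     # larger offset in the low image
x_high = f * X / (H_high - h_true)
print(elevation_from_pair(x_low, x_high, H_low, H_high))   # -> 1.0
```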
Previous research results have shown the feasibility of using deep learning methods to recover the
relative depth information for each pixel of an image of indoor scenes (Eigen et al 2014; Liu et al. 2015;
Laina et al. 2016), outdoor scenes (Chen et al. 2016; Li and Snavely 2018) and scenes from automatic
driving applications (Garg et al. 2016). In addition, convolutional neural networks (CNNs) have been
verified as effective and reliable in micro-scale scenes, such as estimating the surface height map from a
single image of a foam mat and a mouse pad (Zhou et al. 2017). Using a deep learning method to estimate construction site elevations amounts to figuring out the relationship between the elevation values' features and the ortho-images' features of construction sites, which serves as the reference information in the single-image 3D-reconstruction problem. To preserve the pixel-to-pixel relationship, the elevation values are best saved in a grayscale image, which is referred to as an elevation-map in this research project. However, the challenge is not limited to finding an effective deep learning model from previous research or developing a new deep learning model for this specific task; it also requires creating comprehensive construction site ortho-image and elevation-map pair datasets, because no dataset is available for training the deep learning model. Thus, an ortho-image-based 3D-reconstruction method should be developed in advance for creating such datasets. In addition, image-based 3D-reconstruction methods cannot exclude vegetation and other ground-attached objects on the rough construction site when determining the ground elevations. This is because the light rays are reflected on the surface of vegetation instead of the "real" ground surface. In contrast, the contact surveying methods with total station, GPS, level and theodolite can obtain the expected elevations, as all selected target points are on the "real" ground surface. Thus, to improve the effectiveness of the image-based 3D-reconstruction method in construction site elevation determination, automatically detecting and removing the vegetation and other obstacles from the raw surveying results and determining the "real" ground elevations are needed; they are important for construction professionals to make optimized decisions in excavation operations, which heavily depend on the elevation information.
Previous research also shows the feasibility of deep learning methods in object detection tasks
using image (Schneider et al. 2018), video (Kang et al. 2018), and image segmentation tasks (Noh et al.
2015; Badrinarayanan et al. 2017). The shortcoming of current deep learning-based object detection methods is that they use low-resolution images for training the object detector; "ImageNet" (Deng et al. 2009) images are resized down to as small as 256×256 pixels, while the largest input size is limited to about 800×1000 pixels (Han et al. 2015). This low-resolution issue is caused by the limitations of computer system hardware. However, images exported directly from a drone's camera, such as the ortho-image captured by a DJI Phantom 4 Pro V2.0, are as large as 3648×4864 pixels, which is far larger than the small resolution of 256×256 pixels. Using a small-resolution image dataset to train an object detection or image classification deep learning model can cause the loss of detail information. Reducing the image size also impacts image segmentation, because the number of pixels representing an object decreases accordingly.
In summary, to improve the speed and accuracy of image-based methods for determining construction site elevations for excavation operations, there is a need to develop a reliable method to collect construction site ortho-image and elevation-map pair datasets, to develop an innovative way to train a construction site elevation estimation deep learning model with high-resolution ortho-image and elevation-map pair datasets, and also to develop a method to automatically identify and remove the vegetation from the elevation results.
This dissertation includes eight chapters. This chapter is an introduction to the research project.
Chapter 2: Objectives, Scope, and Methodology. This chapter describes the primary objectives of the research project, its scope, and the research methodology.
Chapter 3: Literature Review. This chapter presents the findings from a comprehensive literature review on drone applications in construction operations and image-based 3D-reconstruction methods. This chapter also summarizes the challenges and opportunities of drone and image-based methods to determine construction site elevations.
Chapter 4: Pixel Grid Matching and Elevation Determination Algorithm Design and Testing. This chapter presents an effective, rapid and easily implementable two-frame-image-based 3D-reconstruction method for determining construction site elevations.
Chapter 5: Ortho-Image and Elevation-Map Dataset Design and Acquisition Using Drone. This chapter details how to use the developed low-high ortho-image pair-based elevation determination method to acquire high-resolution construction site ortho-image and elevation-map pair datasets for training the deep learning model.
Chapter 6: Ortho-Image and Deep Learning-Based Elevation Estimation Algorithm Design and Testing. This chapter presents a single-frame ortho-image-based 3D-reconstruction method for construction site elevation estimation.
Chapter 7: Vegetation Identifying and Removing Algorithm Design and Testing. This chapter presents a CNN-based image classification method to identify vegetation objects on the raw construction site using the high-resolution ortho-image and determine the "real" ground elevations.
Chapter 8: Conclusions and Recommendations. This chapter summarizes the procedures of the
developed methods, concludes the findings of the testing experiments, and recommends potential
improvements for future research on ortho-image and deep learning-based method in determining the
elevation of construction sites. Contributions of this research project are outlined as well.
The aim of this research is to advance the construction site elevation determination method into a non-contact, robust and rapid approach by taking advantage of drone technologies and eliminating the shortfalls of current construction surveying methods. This research project uses drone technologies such as the gimbal-mounted camera to obtain a stable ortho-image, the onboard altimeter and GPS to determine how high above the ground the drone flies, and the camera model to calculate the geometrical data.
The primary goal of this proposed research is to advance drone applications in the determination of construction site elevations (surface heights). The general idea is to reduce the required number of images in ortho-image-based 3D-reconstruction. As this research project progressed, the required number of ortho-images decreased from the multi-frame images of the traditional drone photogrammetry method, to the two-frame images of the developed drone-based low-high ortho-image pair-based method, and finally to a single-frame ortho-image with the developed deep learning-based elevation estimation method.
This goal has been realized through achieving the specified research objectives that are described as
follows:
1. To develop and test an elevation determination method that generates a construction site's elevation-map with a drone-based low-high ortho-image pair – two ortho-images captured at a high and a low position separately by a drone's camera facing down to the ground.
2. To create high-resolution construction site ortho-image and elevation-map pair datasets by the elevation determination method developed in the 1st objective.
3. To develop and test an elevation estimation deep learning model with the datasets created in the 2nd objective for estimating a construction site's elevations from its corresponding ortho-image.
4. To develop and test a deep learning-based vegetation identifying method to automatically detect and remove the vegetation obstacles from the elevation-map results of the 1st or the 3rd objectives.
In this research project, the proposed elevation determination methods were all based on ortho-images acquired by a drone system, which means that these elevation determination methods focus on 3D-reconstruction of a construction site's ground surface and exclude the vertical-side surfaces of all attached objects; this makes them much simpler than traditional drone photogrammetry. Nevertheless, the proposed methods were effective in determining elevation changes across vertical slopes, which is important to excavation operations.
This research project considered using a one-frame ortho-image to cover as much of a construction site as possible, which means it may not be able to cover an entire large site such as a roadway construction site. In the experiments, a drone system (DJI Phantom 4 Pro V2.0) flew at 10-20 m and 20-40 m above the ground, which provided measurable elevation ranges of [-5, 5] m and [-10, 10] m and area coverages of 8.47×8.47 m² and 17.6×17.6 m², respectively. In addition, this research project also considered the possibility of stitching the ortho-image and elevation-map results, and stitching experiments were conducted as well.
The construction site ortho-image and elevation-map pair datasets were collected from a lake beach site at Atwater Park, Shorewood, Wisconsin. The dataset acquisition took place during 2019 under safe flight conditions. The construction site ortho-images were transformed from the 10-m flight height ortho-images at a high resolution of 1568×1568 pixels. The generated elevation values were saved in same-sized 8-bit grayscale images, as elevation-maps with the same 1568×1568-pixel resolution. Then, the construction site ortho-image and elevation-map pair datasets were built up. In addition, the high-resolution label-images were 8-bit grayscale images used for training the deep learning-based image classification (vegetation identifying) model, but they were saved in a 1568-row and 1568-column spreadsheet format. In this research project, these high-resolution image datasets were not resized down for training the proposed deep learning models; instead, a high-resolution image disassembling and model prediction assembling approach was developed and applied.
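A minimal sketch of this kind of overlapping disassembly is shown below; the 128×128-pixel patch size comes from the experiments reported later, while the 64-pixel stride, the border handling, and the array shapes are illustrative assumptions rather than the exact parameters of the disassembling and assembling algorithm.

```python
import numpy as np

def disassemble(image, patch=128, stride=64):
    """Crop a high-resolution image (H x W [x C]) into overlapping square
    patches and record their top-left corners so that the model predictions
    can later be re-assembled into a full-size elevation-map."""
    patches, corners = [], []
    h, w = image.shape[:2]
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            patches.append(image[r:r + patch, c:c + patch])
            corners.append((r, c))
    return np.array(patches), corners

# Example with a 1568x1568 ortho-image (zeros used as placeholder pixels).
ortho = np.zeros((1568, 1568, 3), dtype=np.uint8)
patches, corners = disassemble(ortho)
print(patches.shape)   # (529, 128, 128, 3); a real implementation must also
                       # cover the last 32 rows/columns left by this stride
```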
The first phase of this research is an extensive literature review. The literature survey includes drone applications in construction operations, image-based 3D-reconstruction methods, and the theory of image processing and computer vision with deep learning. The reviewed literature includes journal papers and conference proceedings.
A DJI Phantom 4 Pro V2.0 (a quadcopter drone equipped with an automatic flight control system, GPS, altimeter, gyroscope, inertial measurement unit and other sensors) was used to hover at the desired position over the experiment site. The flight altitude data was directly read from the drone's remote controller, which sets ±0.00 at the drone's takeoff point. In addition, the 3-axis gimbal enhances the camera's stability; its pitch was set to -90° when capturing ortho-images of the experiment site.
A modified stereo-vision triangulation method was designed for construction site elevation determination in this research project. To automatically implement this computer vision method, image processing operations of translation, rotation, resizing and subpixel-level image correspondence matching were conducted as well. In addition, a deep learning-based method with a convolutional encoder-decoder model was used to estimate the elevation values from an ortho-image, and a convolutional neural network-based image classification method was developed to identify the objects on the construction site.
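For illustration, a minimal Keras sketch of a convolutional encoder-decoder of this kind is given below, where max-pooling and up-sampling keep the output elevation-map patch in the same pixel coordinate as the input ortho-image patch. The layer counts, filter sizes, loss, and the 0-1 output scaling are illustrative assumptions, not the exact architecture trained in this project.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, UpSampling2D

def build_encoder_decoder(patch=128):
    """Minimal convolutional encoder-decoder: an RGB ortho-image patch in,
    a same-sized single-channel elevation-map patch out."""
    model = Sequential([
        # encoder: convolutions + max-pooling shrink the spatial size
        Conv2D(32, (3, 3), activation='relu', padding='same',
               input_shape=(patch, patch, 3)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),
        # decoder: convolutions + up-sampling restore the original size
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        UpSampling2D((2, 2)),
        Conv2D(32, (3, 3), activation='relu', padding='same'),
        UpSampling2D((2, 2)),
        # one output channel, scaled 0-1 like a normalized 8-bit elevation-map
        Conv2D(1, (3, 3), activation='sigmoid', padding='same'),
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

model = build_encoder_decoder()
model.summary()   # output shape: (None, 128, 128, 1)
```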
The computer system hardware and software environment consisted of Python 3.6.8, OpenCV 3.4.2, Keras 2.3.1, TensorFlow-GPU 1.14, CUDA 10.0 and cuDNN 7.6.4.38 on a workstation with 2×Xeon Gold [email protected] CPUs, 96GB (8GB×12) DDR4 2666 MHz memory and 4×11GB GPUs.
Field experiments were conducted at a lake beach site (Atwater Park, Shorewood, WI, USA). Ortho-images were captured in March, June and September 2019. Elevation data were generated with the proposed ortho-image pair-based elevation determination method. Label-images were manually drawn with a "Label-App" (programmed with Python 3.6.8 by the author) based on the corresponding ortho-images.
The Pearson correlation method was used to evaluate the relationship between the ortho-image pair matching quality and the ortho-image pair capturing quality, such as the translation distance and rotation degree in the alignment of an ortho-image pair. Furthermore, descriptive statistics, histograms and contour plots were applied to evaluate the elevation differential between the deep learning model predictions and the ground truth.
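A brief sketch of this kind of evaluation with SciPy and NumPy is shown below; the arrays are made-up placeholders, not experimental data.

```python
import numpy as np
from scipy import stats

# Placeholder arrays: per-pair matching quality scores and the translation
# distances measured during ortho-image pair alignment.
matching_quality = np.array([0.95, 0.90, 0.88, 0.93, 0.85])
translation_dist = np.array([2.1, 5.4, 7.8, 3.2, 9.6])        # pixels

r, p_value = stats.pearsonr(matching_quality, translation_dist)
print(r, p_value)

# Descriptive statistics of elevation differentials (prediction - ground truth).
diff = np.array([-0.03, 0.01, 0.06, -0.02, 0.04])              # meters
print(diff.mean(), diff.std(), np.percentile(diff, [25, 50, 75]))
hist, bin_edges = np.histogram(diff, bins=5)                   # histogram input
```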
2.3.6 Summary
Conclusions were drawn based on the results of the data analysis. The major findings included the effectiveness of the developed low-high ortho-image pair-based method in determining the elevations of the experiment site and creating the ortho-image and elevation-map pair datasets for training the deep learning model; the comparison of the effectiveness of small-patch size and model training epochs in the deep learning-based elevation estimation network model and object classification network model with the high-resolution image datasets; and the effectiveness of the proposed vegetation removing method.
LITERATURE REVIEW
The aim of this literature review is to seek an innovative approach to advance construction site elevation determination into a non-contact, robust and rapid method by taking advantage of drone technologies and eliminating the shortfalls of current methods. The specific objective is to investigate the potential of minimizing the processing time of drone and image-based 3D-reconstruction by reducing the number of images that need to be processed during the geometrical data determination process. To achieve that objective, two rounds of literature searches were conducted, which reversed the sequence of searches applied in previous review articles (Chan and Owusu 2017; Nasirian et al. 2019). Additionally, the theory of image processing and computer vision with deep learning was surveyed.
The first-round used the powerful “Google Scholar” engine to search the related terms of
construction surveying such as “3D geometric measurement,” “3D modeling,” “3D reconstruction,” “3D
mapping,” “3D terrain surface reconstruction,” “Digital Terrain Model (DTM),” “scene depth recovery,”
and “image surface height recovery.” This search round was not limited to the construction field, and all
journal articles and conference proceedings were searched. After that, the overall status of 3D-reconstruction methods and technologies in different disciplines was clear, and comparisons of those approaches were made. The second-round search was limited to the Journal of Construction Engineering and Management (COENG) and Automation in Construction (AUTCON), because those two journals are accepted as the top-ranked publications in the construction field (Wing 1997; Nasirian et al. 2019). This search round was designed to justify the reliability of using a drone on construction sites, and to find out which drone platform
and sensor have been accepted and adopted in previous construction field research. The terms “Unmanned
Aerial Vehicle,” “UAV,” “Drone,” and “MAV” were searched based on the first-round results. In detail: in
"ASCE Library" and "ScienceDirect," the "Advanced Search" tools were used (see Figure 2) to search a term in "Anywhere," keeping the other options blank; if the term occurred in an article, the article was returned in the search results; and review articles were excluded from the search results.
Table 1 lists the second-round search results with each screening step. The initial 15 COENG
citations were manually screened and two duplicated articles were removed; the 139 AUTCON citations
were exported to “RefWorks,” and duplicate articles were detected and removed by the RefWorks function,
then 74 unique articles remained. Because the terms could be matched in any part of an article, the returned citations included articles that did not mainly discuss drone applications – for example, the selected terms occurred only in the literature review or related work section or in future research suggestions, or the abbreviation UAV or MAV referred to something else. Thus, full-paper reading was conducted to filter those articles. After reading the full papers, one COENG article and 27 AUTCON articles were retained, as that research either discussed drone applications and problems or used drones to acquire data. Additionally, the regression analysis method was adopted to predict the number of drone publications in COENG and AUTCON for the years 2019 and 2020.
Excavation is an essential construction activity to create the required planes and spaces for
buildings and infrastructure facilities – such as footings, foundations, and underground utilities – to provide
enough construction operating spaces for laborers and equipment (Spence and Kultermann 2016).
According to the depth and area, excavations can be classified into three types: a) mass excavation, which usually removes large amounts of earth over a large depth and horizontal extent, such as a building's basement; b) structural excavation, which removes earth in a confined area within a vertical extent and may need a support system during excavation; c) grading, which reconfigures the construction site's landform from its irregular shape (natural/current grade) to the designed shape (finish grade).
In each type of excavation, the construction professional needs a geometry model of the
construction site to accurately measure the volume of the earth to be excavated—if the current elevation is
higher than the required elevation—or placed—if the current elevation is lower than the required elevation
(Kraig et al. 2008; Spence and Kultermann 2016). Typically, a construction site is not an ideal level plane but usually has an irregular topography; in geometry, it can be divided into small elements, like trapezoidal bodies and cones (see Figure 3). After the vegetation and topsoil are removed — exposing soil, rocks or mixed materials — the construction site has a rough surface (Nichols and Day 2010).
Knowing the site’s elevations is important to the construction professionals because excavation
operations are complicated. Excavation operations are not limited to earthmoving activities such as
grubbing, clearing, scraping, excavating, hauling, backfilling, compacting and finishing. They also include
the important preparing works such as surveying, measuring, and planning before the actual earthmoving.
Additionally, excavation operations require some management work, namely safety inspection, progress monitoring and quality control during excavating. Finally, documenting and recording work is required as well.
An excavation plan links the earthmoving, surveying and measuring, progress monitoring and
quality controlling operations. The quality of the excavation planning depends on the accuracy of the site surveying and measuring; accurate surveying leads to a good excavation plan and lowers the project cost by balancing the cut/fill materials in a large-area project or a roadway project (Seo et al. 2011; Peurifoy and Garold 2014).
The fundamental data used to build a 3D geometry model is a point cloud, which is a set of
vertices (xi, yi, zi) in a three-dimensional coordinate system (Remondino 2003; Nassar and Jung 2012; Rusu
and Cousins 2011). Additionally, each point in the point cloud has its color features— either RGB (red,
green, blue) or BGR (blue, green, red), depending on the programming platform used. Specifically, in OpenCV — open source computer vision, a library of programming functions mainly aimed at real-time computer vision — the color format is the BGR sequence, where each color channel has a value of 0 ~ 255; while in OpenGL — a cross-platform application programming interface (API) for rendering 2D and 3D vector graphics — the color format of "glColor3f(0.0, 0.0, 0.0)" is the RGB sequence, where each channel has a value of 0.0 ~ 1.0.
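A small OpenCV illustration of the channel-ordering difference described above (the file name is a placeholder):

```python
import cv2
import numpy as np

img = cv2.imread('ortho_image.jpg')          # OpenCV loads images as BGR, uint8 0-255
b, g, r = img[0, 0]                          # top-left pixel, channel order B, G, R
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # reorder channels to RGB for other libraries

# OpenGL-style color values are RGB floats in 0.0-1.0
rgb_float = rgb[0, 0].astype(np.float32) / 255.0
```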
In contact surveying, target points are selected by surveyors to represent excavation objects’
geometry features. The excavation objects are modeled in a combination of geometry elements (Figure 3)
or a site plan (Figure 4), then the lengths, widths, heights, and slopes of excavation objects are measured
from the site plan or the geometry model to calculate the volume of excavation (Nichols and Day 2010).
For a large-area grading project, the site plan usually needs to be divided into equal-sized grids. The
deviations between the current and required elevations at the four-corners represent the grid’s cut / fill
quantity (Peurifoy and Garold 2014; Gwak et al. 2018). In non-contact surveying, by contrast, a target object is recorded as a point cloud, and a construction site is modeled in a Triangulated Irregular Network (TIN) model (see Figure 5) (Tsai 1993; Shewchuk 2002; Hearn et al. 2004; Sung and Kim 2016). Similar to the grid corners in the site plan, each triangle in the TIN has three corners, each carrying two elevations. Commercial software, such as Autodesk Civil 3D, can semi-automatically accomplish the TIN modeling work from external point cloud data.
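As a concrete illustration of the four-point grid calculation mentioned above, the sketch below evaluates one grid cell by averaging the elevation deviations at its four corners; the corner elevations and cell size are made-up numbers, and this is only the basic single-cell form of the method.

```python
def cell_cut_fill(corner_current, corner_required, cell_area):
    """Four-point method for one grid cell: average the elevation deviations
    at the four corners and multiply by the cell area.
    A positive result is a cut volume, a negative result is a fill volume."""
    deviations = [c - r for c, r in zip(corner_current, corner_required)]
    return sum(deviations) / 4.0 * cell_area

# Example: a 10 m x 10 m cell, current corner elevations vs. required grade.
current  = [101.2, 100.8, 100.5, 100.9]   # meters
required = [100.0, 100.0, 100.0, 100.0]
print(cell_cut_fill(current, required, cell_area=100.0))   # 85.0 m^3 of cut
```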
Ideally, a point cloud used to 3D-model a construction site should contain only the minimum required number of target points, such as the corners of grids or triangles, to represent the site's geometric shape. However, that is impossible for existing image-based feature matching technologies without manual selection or screening. Thus, total station and GPS surveying are still the most accurate surveying methods for determining the elevations of a construction site; on the other hand, they are both time-consuming and expensive. The details of existing surveying approaches will be discussed later.
Figure 4 An example of site plan with grid lines and contour lines
Unmanned Aerial Vehicles (UAVs) were developed and used in military applications in the past, because their weight, size and high insurance cost limited their commercial applications (Van Blyenburgh 1999; Siebert and Teizer 2014). With the development of precise GPS, gyroscopes and inexpensive inertial measurement units (IMUs), the performance of UAVs has been significantly improved, especially in payload, flight endurance, stability, reliability and safety (Nex and Remondino 2014; Siebert and Teizer 2014). Micro aerial vehicles (MAVs) are increasingly regarded as the valid, cheap alternative to UAVs in civil applications, and their small dimensions make them accessible to spatial conditions that are inaccessible to large UAVs, such as cluttered outdoor settings and indoor environments.
The DJI Phantom series quadcopters are the most successful consumer MAV products and are becoming synonymous with "drone" to the public. A drone integrated with a gimbal-mounted camera (see Figure 6) is becoming the most popular remote imaging platform with diverse applications (Nex and Remondino 2014; Siebert and Teizer 2014; Takahashi et al. 2017), because the gimbal enhances the camera's stabilization in 3 axes (pitch, roll, yaw) and also makes its rotation controllable in the pitch-axis.
[Figure 6: Drone gimbal-mounted camera model – camera lens O, focal length f, principal point p and image point in the image plane, flight height H (distance above ground), distance Z to the target point P(X, Y, Z) in the world coordinate system, and the sensor, image and site height/width dimensions]
Specifically, when the pitch-axis of the gimbal is at -90°, the camera faces straight down to the ground and its principal ray is perpendicular to the ground level plane; the captured image is then called a plan view (Zhang et al. 2015), orthophoto (Westoby et al. 2012; Siebert and Teizer 2014), or ortho-image. The ground sample distance (GSD) defined in Eq. 1 is the spatial resolution of an ortho-image taken by a drone at a specific height. The GSD has the unit of m/pixel or cm/pixel, meaning that each pixel of the image corresponds to a real-world distance in meters or centimeters.
$$\text{Spatial Resolution: } GSD = \min(GSD_h, GSD_w) = \min\left(\frac{Z \cdot h_{sensor}}{f \cdot h_{image}},\; \frac{Z \cdot w_{sensor}}{f \cdot w_{image}}\right) \qquad \text{Eq. 1}$$
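As a numerical check of Eq. 1, the short function below evaluates the GSD for the DJI Phantom 4 Pro V2.0 parameters that appear in Chapter 1 and Table 5 (8.8 mm × 13.2 mm sensor, 8.8 mm focal length, 3648×4864-pixel ortho-image):

```python
def gsd(Z, f, sensor_h, sensor_w, image_h, image_w):
    """Ground sample distance (Eq. 1): the finer of the per-axis resolutions,
    in the same length unit as Z, f and the sensor dimensions."""
    gsd_h = Z * sensor_h / (f * image_h)
    gsd_w = Z * sensor_w / (f * image_w)
    return min(gsd_h, gsd_w)

# DJI Phantom 4 Pro V2.0 hovering 20 m above the ground.
print(gsd(Z=20.0, f=0.0088, sensor_h=0.0088, sensor_w=0.0132,
          image_h=3648, image_w=4864) * 100)   # ~0.55 cm/pixel, matching Table 5
```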
Table 2 and Table 3 summarize all drone research articles in COENG and AUTCON. The first drone application article appeared in 2007 in AUTCON (Metni and Hamel 2007). There was a large increase in drone application articles published from 2015 to 2018 (see Figure 7). Figure 8 uses second-order polynomial regression, fitted to the data from 2014 to 2018, to predict the number of drone application articles to be published; the predicted yearly increase is 6 articles in COENG and AUTCON for 2019 and 2020. Comparing COENG and AUTCON, AUTCON includes more drone research than COENG. Among those 28 research articles, different types of drones (see Figure 9) have been used as data acquisition platforms, and the type of data obtained depends on the type of sensor installed (see Table 3):
[Figure 7: Number of drone application articles published in COENG and AUTCON by year, 2006-2018]
Figure 8 Prediction of drone applications in COENG and AUTCON for 2019 and 2020
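A minimal NumPy sketch of the second-order polynomial fit behind the Figure 8 prediction is shown below; the yearly article counts are illustrative placeholders, not the exact values plotted in the figure.

```python
import numpy as np

years  = np.array([2014, 2015, 2016, 2017, 2018])
counts = np.array([1, 3, 5, 9, 10])            # placeholder yearly article counts

coeffs = np.polyfit(years, counts, deg=2)      # second-order polynomial fit
predict = np.poly1d(coeffs)
print(predict(2019), predict(2020))            # extrapolated article counts
```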
Table 2 Drone research articles in COENG and AUTCON (ID | Reference | Year | Journal | Application; application categories: Surveying/Modeling, Safety, Progress Monitoring, Inspection, Others)

1 | Aguilar et al. 2019 | 2019 | AUTCON | Building / Hybrid Point Cloud: Drone and Terrestrial Photogrammetry
2 | Bang et al. 2017 | 2017 | AUTCON | High-resolution Construction Site Panorama Generating
3 | Chen et al. 2018 | 2018 | AUTCON | Building / Hybrid Point Cloud: Drone Photogrammetry and Laser Scanning
4 | Ellenberg et al. 2016 | 2016 | AUTCON | Bridge Deck Delamination
5 | Freimuth and König 2018 | 2018 | AUTCON | Visual Inspection
6 | Hamledari et al. 2017 | 2017 | AUTCON | Indoor Under-construction Components Detecting
7 | Han and Golparvar-Fard 2017 | 2017 | AUTCON | Construction Performance Analytics
8 | Inzerillo et al. 2018 | 2018 | AUTCON | Road Pavement / Point Cloud: Drone and Close-range Photogrammetry; Road Pavement Distress
9 | Han et al. 2018 | 2018 | COENG | Construction Progress Monitoring
10 | Kim, D. et al. 2019 | 2019 | AUTCON | Determine Distance Between Mobile Construction Resources
11 | Kim, H. and Kim 2018 | 2018 | AUTCON | Concrete Mixer Truck / Point Cloud: Drone Photogrammetry
12 | Kim, K. et al. 2017 | 2017 | AUTCON | Hazard Detection
13 | Li, D. and Lu 2018 | 2018 | AUTCON | Construction Site / Hybrid Point Cloud: Drone Photogrammetry and Laser Scanning
14 | Li, F. et al. 2018 | 2018 | AUTCON | Indoor Path Planning
15 | Metni and Hamel 2007 | 2007 | AUTCON | Concrete Crack
16 | Moon et al. 2019 | 2019 | AUTCON | Construction Site / Hybrid Point Cloud: Drone Photogrammetry and Laser Scanning
17 | Morgenthal et al. 2019 | 2019 | AUTCON | Bridge Structural Elements
18 | Omar and Nehdi 2017 | 2017 | AUTCON | Concrete Bridge Decks
19 | Park et al. 2019 | 2019 | AUTCON | Construction Site / Hybrid Point Cloud: Drone and UGV Photogrammetry
20 | Phung et al. 2017 | 2017 | AUTCON | Planar Surfaces; Path Planning
21 | Roca et al. 2013 | 2013 | AUTCON | Building Facades
22 | Seo et al. 2018 | 2018 | AUTCON | Bridge Inspection
23 | Siebert and Teizer 2014 | 2014 | AUTCON | Construction Site / DTM
24 | Wang et al. 2016 | 2016 | AUTCON | On Road Vehicles Detecting and Tracking
25 | Yang et al. 2018 | 2018 | AUTCON | Construction Site / DTM; Path Planning
26 | Zakeri et al. 2016 | 2016 | AUTCON | Asphalt Pavement
27 | Zhang et al. 2015 | 2015 | AUTCON | As-built Construction Site Status
28 | Zhong et al. 2018 | 2018 | AUTCON | Concrete Crack

Subtotal by application category: Surveying/Modeling 9, Safety 2, Progress Monitoring 3, Inspection 11, Others 6
1. The most popular drone brand adopted in these articles is “DJI”, with the “DJI Phantom”
series (Wang et al. 2016; Seo et al. 2018; Yang et al. 2018; Moon et al. 2019), “DJI Inspire”
(Omar and Nehdi 2017; Li, D. and Lu 2018; Aguilar et al. 2019) and “DJI Mavic” (Park et al.
2019). Before quadcopter drones were created, remote-control helicopters were being used in
bridge inspection (Metni and Hamel 2007). Although less common than the quadcopter drone,
the more powerful 6-rotor (Ellenberg et al. 2016) and 8-rotor drones (Roca et al. 2013; Zhong
et al. 2018; Morgenthal et al. 2019) have also been applied in construction research.
2. The most common sensor in drone applications is the "Optical Camera." Two "Thermal Cameras" were applied in bridge inspection (Ellenberg et al. 2016; Omar and Nehdi 2017), and an "RGB-D Camera," the Kinect, was used in the inspection of building facades (Roca et al. 2013). Interestingly, none of those articles used drone-borne LiDAR technology.
3. The most used 2D image style is the RGB image, captured directly from the "Optical Camera," and the most used 3D model style is the "Point Cloud" generated by photogrammetry or the SfM method. The tendency to use "Point Cloud" instead of "RGB" images started in 2016 (see Figure 10), but for some specific applications, processing 2D images is more effective and faster than processing the 3D model, such as planar surface inspection (Phung et al. 2017).
4. The two most common applications are "Inspection" and "Surveying/Modeling" (see Figure 11). The 11 inspections listed in Table 2 were mainly based on 2D images – "RGB" and "Thermal" images. Generating a construction site's point cloud is another important drone application on construction sites. The point cloud is the foundation for other construction research that is highly dependent on geometrical data, such as indoor drone path planning (Li, F. et al. 2018) and road pavement distress detection (Inzerillo et al. 2018).
[Figure 9: Drone platforms used in COENG and AUTCON articles from the initial issues to Feb 3, 2019 – helicopter, DJI quadcopter, 3DR, other quadcopters, 6-rotor, 8-rotor, and unknown UAV]
[Figure 10: Data types (RGB image vs. point cloud) used in drone application articles in COENG and AUTCON by year]
[Figure 11: Drone application categories (surveying/modeling, safety, progress monitoring, inspection, others) in COENG and AUTCON articles by year]
Other than those 28 research articles in COENG and AUTCON, the first round of searching also returned some drone applications that are beneficial to excavation operations. Drones started service as a safety visual inspection tool in excavation operations (Irizarry et al. 2012; Gheisari et al. 2014; Ashour et al. 2016; Gheisari and Esmaeili 2016; Kim et al. 2016). A real-time video stream of a construction site was captured by the drone and transferred to each safety responsibility official for visually detecting hazards and interacting with workers through communication speakers (Irizarry et al. 2012; Gheisari et al. 2014). Using drones to prevent excavation accidents is based on the knowledge that sharing the real-time construction site conditions with all construction participants helps them take advantage of the real-time conditions to avoid hazards (Toole 2002). Furthermore, with the site 3D point cloud, Wang et al. (2015) developed an algorithm to automatically extract height data from the 3D point cloud to identify
and locate fall hazards at an excavated pit. Another approach to prevent excavation accidents by using
Additionally, previous research also extended drone application to other fields, such as RFID
materials tracking on construction sites (Hubbard 2015) and construction quality control (Wang, Sun et al.
2015). Thus, if a drone system could acquire the real-time elevations, then the efficiency of construction
site safety management and quality control would be improved. Real-time quality control is important
because 6~12% of project cost is wasted on reworking defective components that are detected late during construction (Josephson and Hammarlund 1999). Improved planning, real-time inspection and feedback can ensure the quality of construction works, reduce the project duration and avoid cost overruns as well.
Figure 12 and Table 4 summarize the existing approaches and their general procedures in construction site surveying and modeling. Surveying (data scanning) is the starting procedure of every construction work, especially in excavation operations, as it determines a construction site's elevations and locations. Construction site surveying methods have experienced a progression from manual to automatic, from contact to non-contact, and from small to large scene size as well (Nex and Remondino 2014). Modeling is the second procedure, which processes the raw data from the surveying results and creates a geometry model, or a site plan, of the site. The surveying techniques are compared in terms of scanning result, measurable area and distance range, capacity, advantages and disadvantages.
[Figure 12: General procedures of existing surveying and modeling approaches – a 3D laser scanner (TOF method) or a digital camera (SfM method on digital images / triangulation method on an ortho-image series) produces an unordered point cloud for 3D mesh modeling, while total station and GPS produce target points for manual site plan drafting]
Measuring tape, level and theodolite are manual surveying tools for small size construction sites.
Their surveying results are angular deviations, horizontal, vertical and slope distances between two target
points (Nichols and Day 2010). They are suitable for surveying the range of 0 ~ 100 m, and the surveying
capacity is about 1,000 target points (Remondino and El-Hakim 2006; Nex and Remondino 2014).
Total station and GPS surveying devices are semi-automatic surveying equipment and are the most popular surveying methods on construction sites. They extend the surveying scene size to 10 km and 1,000 km, respectively (Remondino and El-Hakim 2006; Nex and Remondino 2014). A total station—an electronic theodolite integrated with an electronic distance measurement (EDM) unit—records the distance, the angle and the height difference between two target points. GPS surveying records many selected target points with GPS coordinates. Both have a surveying capacity of about 1,000 target points, which is the same as the manual tools.
These manual tools and semi-automatic equipment are used in contact surveying methods, which
rely on surveyors’ movement on the construction site and the placing of the surveying device on the target
points in a sequence. That means the target points are manually selected by surveyors, and their accuracy
could be guaranteed. However, the manual tools and the total station need at least two cooperating
surveyors to complete the surveying task on the construction site. This time-consuming outdoor procedure
leads to a high probability of interfering with other construction operations on the construction site. In addition, they cannot provide timely progress data after the excavation starts.
Remote sensor-based surveying methods are continuously being developed and tested in the construction industry; they include 3D Laser Scanning — by Terrestrial Laser (Du and Teng 2007) or Drone-borne LiDAR (Tulldahl and Larsson 2014; Guo et al. 2017) — and Photogrammetry — by Close-range Photogrammetry (Arias et al. 2005; Barazzetti et al. 2010; Sung and Kim 2016) and Drone Photogrammetry, known as Drone-SfM in computer vision (Nassar and Jung 2012; Siebert and Teizer 2014).
These non-contact surveying methods help construction professionals to obtain a 3D point cloud
and generate a construction site’s DTM (Digital Terrain Model). They can scan targets in the distance range
of 100 m by setting up several ground stations, and up to 5 km using a pre-planned drone path. Their
surveying capacity is more than 10 million raw points, although those points are recorded without manually selecting target points or excluding non-target points (Nex and Remondino 2014). In addition, those remote surveying procedures avoid interfering with other construction operations and also reduce the outdoor surveying time.
The Terrestrial Laser Scanning (or ground-based 3D laser scanning) has been adopted in
construction surveying for several years. It needs to be set up on a tripod at a fixed location in front of the
target object, and the time of flight (TOF) method is used to determine distances from the scanner to
targets, at a high speed of 10,000 ~ 100,000 points per second (Du and Teng 2007). In engineering practice, multi-station laser scanning and vehicle-based laser scanning solve the coverage limitations.
Although the drone-borne LiDAR system has been used for 3D habitat mapping in forest ecosystems (Guo et al. 2017), there are no published articles in COENG and AUTCON (see Table 3) using drone-borne LiDAR to replace ground-based or vehicle-based laser scanning. This is because even a small-sized LiDAR device still needs a powerful UAV, such as the DJI Matrice series, to carry it, which is impossible for a small drone. Additionally, the investment (see Table 5) is another issue: drone-borne LiDAR systems are quoted at $60,000 to $280,000, which is too expensive to be adopted in construction site surveying until the price drops to a reasonable level.
Among the reviewed literature, there are two interesting applications of ground-based 3D laser scanning:
1. Using ground-based 3D laser scanning as the baseline for evaluating drone photogrammetry. Takahashi et al. (2017), Maghiar et al. (2018) and Moon et al. (2019) designed experiments to compare drone photogrammetry with ground-based 3D laser scanning, and the results confirmed the measurement accuracy of drone photogrammetry.
2. Merging the point clouds from ground-based 3D laser scanning and drone photogrammetry to
create an integrated point cloud (Kwon et al. 2017; Li, D. and Lu 2018; Chen et al. 2018;
Moon et al. 2019). Kwon et al. (2017) tested the hybrid scanning method, which merged
ground-based 3D laser scanning with drone photogrammetry to generate a 3D point cloud for
an under-construction bridge. The laser scanner scanned the sides of the target from multiple stations; the drone's camera scanned the top view of the target, which is hard to reach with the ground laser scanner. Additionally, the side views can also be supplemented with ground-camera images and vehicle-mounted camera images (Barazzetti et al. 2010; Sung and Kim 2016; Inzerillo et al. 2018; Aguilar et al. 2019; Park et al. 2019).
Table 5 Price and calculated spatial resolution of drone-borne LiDAR and drone photogrammetry systems (System | Price | Drone Platform | Sensor | Calculated Spatial Resolution | Manufactory Accuracy)

Matrice 200 and 210 LiDAR | $60,000 for education | DJI M200 | Velodyne PUCK-LITE | @10 m: 0.4°×3.14 rad/180°×10 m = 6.9 cm or 2.7 inch; @5 m: 0.4°×3.14 rad/180°×5 m = 3.49 cm or 1.4 inch | ±4.6 cm @50 m
Snoopy-V-Series | $280,000 for education | DJI M600 | Unknown laser sensor | - | ±3.2 cm @50 m
DJI Phantom 4 Pro V2.0 | $1,799 retail | DJI Phantom 4 Pro | 8.8 mm × 13.2 mm CMOS, focal length = 8.8 mm | @20 m: GSD = 0.54 cm/px; @40 m: GSD = 1.08 cm/px | -
DJI Inspire 2 (X4S) | $4,249 retail | DJI Inspire 2 | 8.8 mm × 13.2 mm CMOS, focal length = 8.8 mm | @40 m: GSD = 1.08 cm/px | -
DJI Inspire 2 (X5S-15mm) | $10,309 retail | DJI Inspire 2 | 13 mm × 17.3 mm CMOS, focal length = 15 mm | @40 m: GSD = 0.88 cm/px | -
DJI Inspire 2 (X5S-45mm) | $10,309 retail | DJI Inspire 2 | 13 mm × 17.3 mm CMOS, focal length = 45 mm | @40 m: GSD = 0.29 cm/px | -
Based on Table 2, it is clear that drone photogrammetry has a flexible range in object 3D-reconstruction, especially in the construction field. It is not limited to construction site surveying; it has also been used to perform building 3D-reconstruction (Aguilar et al. 2019; Chen et al. 2018) and to create 3D texture models of construction equipment (Kim, H. and Kim 2018) and pavement surface mesh models (Inzerillo et al. 2018). Table 5 compares the spatial resolution and price of drone photogrammetry and drone-borne LiDAR. Based on that comparison, drone photogrammetry is the most reasonable option for excavation operations. However, determining the elevations of a construction site in real-time during excavation is still a challenge, because drone photogrammetry needs at least one workday to transform the raw images into a coarse 3D model for measuring with commercial photogrammetry software such as Agisoft PhotoScan (Haur et al. 2018). The advantages and shortfalls of drone photogrammetry are discussed further below.
Table 6 categorizes the image-based 3D-reconstruction methods by the type of sensor and the scanning principle. One such sensor is the synthetic aperture radar (SAR). Its scanning results are SAR images, and the modeling result is a DEM (digital elevation model) (Kirscht and Rinke 1998) or a DTM (Nico et al. 2005), depending on the application environment. In infrastructure construction, the ground-based SAR system (GB-SAR) has been used in dam deformation monitoring (Huang, Q., et al. 2017) and landslide monitoring (Noferini et al. 2007). However, SAR has the same issue
The infrared radiation (IR) distance sensor, also known as an RGB-Depth camera, is a device that uses TOF to determine distances between the sensor and target objects. The scanning results are four-channel RGB-D images (see Figure 13). The resolution of the distance sensor is lower than that of the color sensor, such as a 320×240-pixel depth resolution for the old Kinect V1 and 512×424 pixels for the latest Kinect V2. The accuracy of the depth measurement gets worse as the distance increases; 1.5 ~ 3 m is an accepted distance range for measurement (Litomisky 2012). In most cases, this sensor is only used indoors, such as for indoor scene 3D-reconstruction (Holz et al. 2011; Huang, A.S. 2017) and body activity capturing (Guo 2018).
Figure 13 Four-channel RGB-D matrix, red, green, blue pixel, and gray depth value
The stereo camera system is the most common close-range photogrammetry device for measuring the distance between the cameras and target objects. It is made up of two cameras with the same specifications and parameters; the only difference between the two cameras is the baseline T between their spatial positions. The scanning result is an image pair (see Figure 14); the left image (x_l, y_l) and right image (x_r, y_r) are in the same plane and perpendicular to each camera's principal ray. The triangulation method (see Eq. 2) is used to calculate distances between the targets and cameras, because the distance in front of the cameras has a
negative relationship with the disparity, $Disparity = x_l^p - x_r^p$. With this relationship, it is feasible to generate a small-scene depth-map from the stereo camera image pair by traversing all common pixels of those two images.

$$\left.\begin{aligned} \frac{x_l^p - c_x}{f} &= \frac{T_l}{Z} \\ \frac{c_x - x_r^p}{f} &= \frac{T_r}{Z} \\ T_l + T_r &= T \end{aligned}\right\} \Rightarrow \frac{(x_l^p - c_x) + (c_x - x_r^p)}{f} = \frac{T}{Z} \;\Leftrightarrow\; Z = \frac{f \cdot T}{x_l^p - x_r^p} = \frac{f \cdot T}{Disparity}, \quad (x_l^p - x_r^p \neq 0) \qquad \text{Eq. 2}$$

Where $P$ is a target point in front of cameras $O_l$ and $O_r$; $p_l(x_l^p, y_l^p)$ and $p_r(x_r^p, y_r^p)$ are the image points of $P$; $(c_x, c_y)$ is the principal point of the cameras, ideally the center of the image plane; $T$ is the distance between the cameras, named the baseline; $T_l$ and $T_r$ are the horizontal offsets of $P$ from the left and right camera centers ($T_l + T_r = T$); $f$ is the focal length of the cameras; and $Z$ is the distance between $P$ and the cameras.
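A direct numerical reading of Eq. 2 (the focal length, baseline, and pixel coordinates below are illustrative values):

```python
def depth_from_disparity(f_px, baseline, x_left, x_right):
    """Eq. 2: Z = f * T / (x_l - x_r), with f in pixels, the baseline T in
    meters, and x_left/x_right the column coordinates of the matched point."""
    disparity = x_left - x_right
    if disparity == 0:
        raise ValueError("zero disparity: the point is infinitely far away")
    return f_px * baseline / disparity

# Example: 1000-pixel focal length, 0.12 m baseline, 24-pixel disparity.
print(depth_from_disparity(1000.0, 0.12, 640.0, 616.0))   # 5.0 m
```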
Structure from Motion (SfM), in which 3D structure can be resolved from unordered images (Ullman 1979; Westoby et al. 2012) and which is often used in drone photogrammetry, is replacing traditional aerial photogrammetry/aerial triangulation for generating the DTM. Both methods use cameras to capture multiple overlapping images and then match the same target objects in adjacent image pairs to generate the DTM. Figure 15 shows a three-frame aerial triangulation model, where the camera moves from left to right without rotation; Figure 16 shows a two-frame SfM model, where the camera moves from C0 to C1 with a translation (t = [x y z]^T) and a rotation (R). That is more complex than the stereo camera model (see Figure 14), which only has an x-axis translation between the two cameras. So, a key task in SfM is to find the external position and orientation parameters [t R] that align each subsequent camera station's coordinates to the initial station's coordinate system. Similar to the stereo camera model, the matched point pairs in those two images can be found along the epipolar line pairs, although local feature matching methods — such as SIFT and SURF, which will be discussed later — have replaced the epipolar line method for extracting keypoint pairs from image pairs in SfM. Pix4D, Autodesk ReCap Pro, Agisoft PhotoScan and DroneDeploy are commercial drone photogrammetry software packages, and OpenSfM is an open-source SfM library written in Python.
[Figures 15 and 16: drone path with overlapping image frames (Image 1, Image 2, Image 3), principal rays, and the overlapping site coverage]
Compared with SAR, IR, and stereo camera systems, a single digital camera with drone photogrammetry / drone-SfM has several unbeatable advantages for scanning excavation objects on construction sites:
1. A small-sized digital camera can be mounted on a drone with a 3-axis gimbal, so that the camera can move over a construction site without interfering with other construction operations.
2. The digital camera has a larger spatial resolution than RGB-Depth cameras, so a single image can cover a larger portion of the site in greater detail.
3. Fewer images are required to cover an entire construction site, which helps to reduce the error caused by image matching and reduces the processing time as well (Schenk 1999; Kaehler and Bradski 2016).
However, the overlapping ortho-images should have a rough surface, a relatively small slope, and should be captured under uniform brightness:
1. Only objects with a rough enough surface can be recorded as complicated textured images, while objects that lack a sufficient quantity of unique detectable features, such as transparent materials (i.e., windows), reflective materials (i.e., windows, mirrors, glossy paint, water surfaces, snow), or objects with uniform surfaces and little variation, are unable to generate sufficient feature keypoints due to low contrast (Lowe 2004; Solem 2012; Westoby et al. 2012). Applying SfM to 3D-reconstruct uniformly textured objects, such as vehicles, therefore tends to fail.
2. The error rate of SfM also depends on the object's surface slope. It has a low error rate for relatively flat ground surfaces (Haur et al. 2018), while its error rate increases as the ground slope increases. Zhao and Lin (2016) concluded that the error rate has a positive relation to the slope in the range of 55° to 90°. That means it is unable to handle steep, or near vertical, topography (Westoby et al. 2012).
3. Dense vegetation and shading also reduce the SfM method's accuracy. Westoby et al. (2012) stated that the SfM method has questionable accuracy in dense vegetation areas. Zhao and Lin (2016) found that the accuracy of the SfM method has a negative relationship with the hillshade value in the range of 0 to 170.
Fortunately, most construction sites meet those requirements. After removing the vegetation and topsoil, the soil, rocks, or mixed materials are exposed, as seen in Figure 17, which is a richly textured surface. The uniform-brightness ortho-image requirement can be satisfied by using a drone to take the images on a sunny day. In general, SfM works for vertical surfaces, such as in building 3D-reconstruction cases (Aguilar et al. 2019; Chen et al. 2018), because the camera can be angled toward any object; but it is better to limit the drone to a narrow flight region to avoid interference with excavation operations for safety reasons. Therefore, the main challenge of applying ortho-imaging based 3D-reconstruction for construction site elevation determination is that the slope of the ground surface might be too steep for reliable matching.
Figure 18 shows the general procedures of 3D-reconstruction with image feature keypoint-based methods. The ortho-image series (Figure 15) can be either captured in a sequence or extracted from a video, where adjacent ortho-images require a minimum of 70% and 40% overlap in longitudinal and transverse coverage, respectively (Siebert and Teizer 2014). A previous experiment (Takahashi et al. 2017) tested higher longitudinal overlapping ratios from 80% to 90%, but the increasing overlapping ratio did not lead to a significant enhancement in measurement accuracy. However, the extra unnecessary images require additional time for image matching. Haur et al. (2018) reported that the time required to estimate on-site soil volume by drone photogrammetry is one workday, as the point cloud is produced using Agisoft PhotoScan, the geometry model is created from the point cloud using Autodesk ReCap, and the soil volume is then computed from that geometry model.
[Figure 18: keypoint-based 3D-reconstruction pipeline: images/video → SIFT/SURF feature points → bundle adjustment and 3D scene reconstruction (sparse point cloud) → PMVS/CMVS (dense point cloud)]
The next step is extracting and matching keypoints from adjacent ortho-image pairs, which is
called local feature detection and description in computer vision (Kaehler and Bradski 2016). The most
common and widely used image local features are: the Scale-invariant Feature Transform (SIFT) – a
famous feature detection algorithm in computer vision to detect and describe local features in images,
which was patented in Canada by the University of British Columbia and published by David Lowe (Lowe
1990; Lowe 2004) – and the Speeded Up Robust Features (SURF) – another patented local feature detector
and descriptor, which is several times faster than SIFT and more robust against different image
transformations than SIFT (Bay et al. 2008). Figure 19 is a SIFT example: the green dots are SIFT keypoints, and the matched keypoint pairs are linked with green lines. Although those green keypoints were matched and located correctly, the noticeable weaknesses of SIFT still exist: a) sparse keypoints are selected by rules, and the keypoints are randomly distributed in an image; b) the number of detected keypoints in an image is less than the number of pixels in the image; and c) the number of matched keypoints in the image pair is much less than the number of detected keypoints. Similarly, the result from SURF may also not be dense enough to represent a construction site's geometrical features, because most candidate points are excluded by a number of criteria, such as low contrast and points on edges (Solem 2012). This is why the SfM method does not work well with steeply sloped ground, as the points on edges have been removed. After obtaining the sparse point cloud, Patch-based Multi-view Stereo (PMVS) / Clustering Views for Multi-view Stereo (CMVS) (Furukawa and Ponce 2010) is used to generate the dense point cloud.
The Normalized Cross Correlation (NCC) matching method is used to determine the relations
between a reference patch and a target patch (Lewis 1995), which are not limited to grayscale values or
gradient values (Solem 2012). It might be a suitable approach to match the customized pixel pairs which
are dense and uniformly spread in the overlaps of adjacent image pairs, since that has been verified in
PMVS (Furukawa and Ponce 2010). This is because reference pixels can be manually selected in the
preferred styles such as implementing a densely packed and uniformly spaced pixel grid, and the best
matched corresponding target pixels will be determined from the candidate target pixels. The NCC method
calculates the correlation between two equally sized image patches I_{x,y∈W}(x, y) and I′_{x′,y′∈W′}(x′, y′) (Kaehler and Bradski 2016), where an image patch is a rectangular region (W) centered on the pixel of interest with size (2R+1)×(2R+1). Its formula is defined as Eq. 3: by subtracting the means Ī and Ī′ and scaling by the standard-deviation terms √Σ_{x,y∈W}[I(x,y) − Ī]² and √Σ_{x′,y′∈W′}[I′(x′,y′) − Ī′]², the NCC method becomes robust to image brightness changes (Solem 2012). However, compared to SIFT/SURF, the patch-based NCC is worse at handling image scaling, rotation, and projection transformations, because the rectangular patch is not invariant to scale or rotation, and the patch size affects the matching results as well (Solem 2012).

\[
NCC(I, I') = \frac{\sum_{x,y \in W}\,[I(x,y) - \bar{I}]\,[I'(x',y') - \bar{I}']}{\sqrt{\sum_{x,y \in W}[I(x,y) - \bar{I}]^2}\;\sqrt{\sum_{x',y' \in W'}[I'(x',y') - \bar{I}']^2}} \qquad \text{Eq. 3}
\]
Where, I_{x,y∈W}(x, y) is the patch for the reference image pixel (x, y); I′_{x′,y′∈W′}(x′, y′) is the patch for the target image pixel (x′, y′); Ī = (1/N)Σ_{x,y∈W} I(x, y) is the mean of patch I_{x,y∈W}; Ī′ = (1/N)Σ_{x′,y′∈W′} I′(x′, y′) is the mean of patch I′_{x′,y′∈W′}; W and W′ are rectangular patches of size (2R+1)×(2R+1), centered on the reference pixel (x, y) or the candidate target pixel (x′, y′).
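A minimal NumPy sketch of the NCC computation in Eq. 3 is given below; it only illustrates the formula and is not the implementation developed in this research.

```python
# Minimal sketch of Eq. 3: NCC between two equally sized patches.
import numpy as np

def ncc(patch_ref, patch_tgt):
    """Normalized cross-correlation of two (2R+1)x(2R+1) patches, in [-1, 1]."""
    a = patch_ref.astype(np.float64) - patch_ref.mean()
    b = patch_tgt.astype(np.float64) - patch_tgt.mean()
    denom = np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

# usage (R = 3): ncc(ref_img[v-3:v+4, u-3:u+4], tgt_img[v2-3:v2+4, u2-3:u2+4])
```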
In traditional drone photogrammetry, the overlapping images are captured at a constant altitude (see Figure 15). These images then have the same constant scene scale and spatial resolution, and the objects in the overlapping parts have the same size, because the images all obey the pinhole camera model (see Figure 6). In contrast, Daftry et al. (2015) proposed a novel SfM framework that uses multi-scale images, captured at various depths (distances) from a building, to enhance the accuracy of facade 3D-reconstruction. Additionally, Matthies et al. (1997, 2007), Li, R., et al. (2002), Xiong et al. (2005), and Meng et al. (2013) continuously applied descent images to determine the topography of landing terrain for spacecraft. Those descent images were captured along the landing path at different times and different altitudes, so they are multi-scale images of the same terrain surface. That image-based 3D-reconstruction result is suitable for choosing a safe landing area in planetary landing exploration tasks (Meng et al. 2013). Thus, capturing images at different altitudes may be an approach to enhance image-based 3D-reconstruction.
Considering the success of descent image-based research, Figure 20 shows a modified, faster drone photogrammetry approach for construction site elevation determination, which uses an ortho-image that covers the entire building construction site (like image 1) captured at the special altitude Z = f·H_site/h_image. Then another image (like image 2) is captured above this altitude, which also covers the site. Meng et al. (2013) discussed that the low camera altitude should be half of the high camera altitude, so that the disparity of neighboring pixels around the image center can be detected at the 0.5-pixel level. If this multi-scale image-based method can be automatically implemented with a computer system and program, then the required number of images for the ortho-image based 3D-reconstruction method is minimized to two.
[Figure 20: modified drone photogrammetry: a vertical drone path captures image 1 and image 2 over the construction site at two altitudes]
Additionally, multiple image-based construction site elevation determination can be improved in regard to model alignment. Currently, the point cloud or TIN mesh model generated from drone photogrammetry / drone-SfM is a scale model, which needs at least three ground control points (GCPs) to align it to the real-world coordinate system (Westoby et al. 2012). Furthermore, the modeling error should be less than 50 millimeters in the elevation coordinate compared with the real-world coordinates (Takahashi et al. 2017). Then the developed image-based 3D-reconstruction method can be adopted for determining elevations on construction sites.
Recovering 3D geometrical data from a single image without reference information is an ill-posed problem (Van den Heuvel 1998; Hassner and Basri 2006; Saxena et al. 2008). The methods discussed in the previous section used either the equipment's physical properties or the camera model and epipolar geometry as the reference information. However, previous research has shown the feasibility of using deep learning methods to recover the relative depth information for each pixel of an image of indoor scenes (Eigen et al. 2014; Liu et al. 2015; Laina et al. 2016), outdoor scenes (Chen et al. 2016; Li and Snavely 2018), and scenes from autonomous driving applications (Garg et al. 2016). In addition, convolutional neural networks (CNNs) have been verified as effective and reliable in micro-scale scenes, such as estimating the surface height map from a single image of a foam mat or mouse pad (Zhou et al. 2017).
However, the challenge of training a deep learning model is acquiring a dataset. Previous research has also discussed approaches for creating enough data for training a CNN model. In the autonomous driving application, the CNN model is trained to determine objects' distances from a single forward-facing view of a car; the dataset of depth and front-view image pairs is created by a stereo camera system installed on the front of the car (Garg et al. 2016). Another interesting approach to creating a labeled dataset is using artificial images, such as generating different view images from a photogrammetry-based concrete mixer truck 3D-model and using these images to train a construction equipment object detector (Kim and Kim 2018). Thus, to guarantee the accuracy of CNN-based depth recovery or image surface height estimation, multiple image-based 3D-reconstruction methods are still important for acquiring datasets.
Therefore, for the construction industry, using advanced artificial intelligence (AI) technologies, such as deep learning, to automatically determine elevations directly from an image of the construction site is an interesting research topic and a meaningful challenge. Once overcome, real-time 3D-reconstruction of a construction site becomes possible, and the degree of automation of construction operations can then be further improved.
Currently, a digital image is easily acquired by various cameras and other image sensors, especially mobile phones and consumer digital cameras, and is conveniently displayed and shared with smart phones and computers. Behind all of this, the most essential point is that a digital image is formatted as matrices (see Figure 21). This approach samples an image at the centers of rectangular grids; the color, or intensity, at each of these center points is converted into a numerical value, and apart from the color / intensity, everything else is discarded when storing the image in a computer (Hearn et al. 2004; Szeliski 2010). Thus, a computer can read and write a gray digital image as a one-layer (one-channel) two-dimensional matrix of gray pixel values. Similarly, the RGB image and the RGB-D image, which have been discussed in previous sections, can be read and written as three-layer and four-layer (channel) two-dimensional matrices, respectively.
[Figure 21: A. A matrix of gray pixel values. B. Three-layer RGB matrix; each layer is a two-dimensional matrix of red, green, or blue pixel values. C. Four-layer RGB-D matrix; each layer is a two-dimensional matrix of red, green, or blue pixel values, plus the gray depth values.]
In this matrix format, each grid cell is called a pixel, and the numerical value is called the pixel value. Pixels can be read and written by the pixel coordinate, i.e., the row index and column index of the matrix. In OpenCV-Python, the row index increases from the top to the bottom of the image plane and the column index increases from left to right (see Figure 14). For example, the pixel value '22' (in the third row, fourth column) of the gray image I in Figure 21-A can be read by value = I[2,3] or written by I[2,3] = 22, because the row and column indices start from zero.
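The following small NumPy example (not from the dissertation, and using an arbitrary toy matrix rather than Figure 21-A) illustrates reading and writing a pixel value through the matrix indices.

```python
# Toy illustration of pixel access in OpenCV-Python (NumPy) index order.
import numpy as np

gray = np.array([[ 5,  8,  9, 10],
                 [25, 24, 23, 22],
                 [69, 68, 67, 66]], dtype=np.uint8)   # toy grayscale image

value = gray[2, 3]       # row index 2 (third row), column index 3 (fourth column)
gray[2, 3] = 22          # write a new pixel value at the same location
print(value, gray[2, 3])  # prints: 66 22
```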
Figure 22 shows the basic set of 2D planar transformations. With this pixel coordinate (in matrix form), image 2D transformations, such as rotation or translation, can be accomplished by matrix multiplications (Hearn et al. 2004; Szeliski 2010). In detail, 2D points, i.e., pixel coordinates in an image, can be denoted using a pair of values, x = [x y]^T, or using homogeneous coordinates as x = [x y 1]^T; an image can then be translated and rotated as in Eq. 4.
\[
\text{translation: } \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}_{image@H'} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}_{image@H}
\qquad
\text{rotation: } \begin{bmatrix} x'' \\ y'' \\ 1 \end{bmatrix}_{image@H''} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}_{image@H'}
\qquad \text{Eq. 4}
\]

Where, [x y 1]^T, [x′ y′ 1]^T, and [x″ y″ 1]^T are homogeneous coordinates; t_x and t_y are the distances translated; θ is the angle rotated in the anticlockwise direction.
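A short NumPy sketch of Eq. 4 is shown below; the translation distances and rotation angle are arbitrary example values.

```python
# Sketch of Eq. 4: translate and then rotate a pixel location using
# homogeneous coordinates and matrix multiplication (example values).
import numpy as np

tx, ty, theta = 10.0, 5.0, np.deg2rad(30)
T = np.array([[1, 0, tx],
              [0, 1, ty],
              [0, 0, 1]], dtype=np.float64)            # translation
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])      # anticlockwise rotation

p  = np.array([100.0, 50.0, 1.0])                       # point [x, y, 1]^T
p2 = R @ (T @ p)                                        # translate, then rotate
print(p2[:2])
```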
A kernel is a small convolution matrix (see Table 7), which is used for smoothing (blurring), sharpening, edge detection, and other image processing operations. These operations are accomplished by performing a convolution between a kernel and an image (Kaehler and Bradski 2016). The filtering operation is defined in Eq. 5.
Table 7 lists common kernels: identity, edge detection, sharpen, box blur (normalized), and 3×3 Gaussian blur (approximation).

\[
g(x, y) = \sum_{k,l} f(x + k,\; y + l)\, h(k, l) \qquad \text{Eq. 5}
\]

Where, g(x, y) is the filtered image; f(x, y) is the original image; h(k, l) is the filter kernel.

[Figure 23: the original image on the left is filtered (convolved) with the filter kernel in the middle to yield the filtered image on the right; the light blue pixels indicate the source neighborhood for the light green destination pixel.]
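As an illustration of Eq. 5, the hedged OpenCV-Python sketch below convolves an image with a kernel; the sharpen kernel shown is a common choice and not necessarily the exact kernel tabulated in Table 7, and the file names are placeholders.

```python
# Hedged example of kernel filtering (Eq. 5) with OpenCV.
import cv2
import numpy as np

img = cv2.imread("ortho_1.jpg", cv2.IMREAD_GRAYSCALE)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float32)    # a common sharpen kernel
filtered = cv2.filter2D(img, -1, sharpen)               # -1 keeps the source depth
cv2.imwrite("ortho_1_sharpened.jpg", filtered)
```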
Image pooling, or down-sampling, is an operation to condense image features after convolution. It is similar to an image scale transformation. Figure 24 is an example of max pooling, which uses the maximum pixel value to stand for the feature of each 2×2 block of 4 pixels. Another common pooling is mean pooling, which uses the mean of the 4 pixel values instead.
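A minimal NumPy sketch of 2×2 max pooling and mean pooling is shown below; it is only an illustration, not the pooling layer used later in this research.

```python
# Minimal sketch of 2x2 max / mean pooling (odd rows or columns are cropped).
import numpy as np

def pool_2x2(img, mode="max"):
    h, w = img.shape
    blocks = img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
print(pool_2x2(a, "max"))    # each output pixel is the max of a 2x2 block
print(pool_2x2(a, "mean"))   # mean pooling uses the average instead
```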
Image features are defined based on their applications, such as edges, corner/interest points, and blobs/regions of interest. The most basic feature is a pixel's color or intensity. Assuming all pixels' colors / intensities were unique in two images, the pixels' colors / intensities could be used to match the corresponding objects in those two images, as the same objects would have the same pixel values. However, directly comparing pixel values is an ineffective method for matching pixel pairs in two adjacent images captured by a drone, because the environmental conditions affect the images' brightness.
Previous sections have described that the normalized cross-correlation matching method is a suitable approach to match the image pairs. The correlation matching methods calculate the correlation between two equally sized image patches I_{x,y∈W}(x, y) and I′_{x′,y′∈W′}(x′, y′) rather than between two pixels only (Solem 2012).
In image-based 3D-reconstruction, matching keypoint pairs is the most essential processing step. The SIFT and SURF detection and matching results are not dense enough to represent a construction site, because most points are dismissed based on several criteria, such as low contrast and points on edges (Solem 2012). That is why drone photogrammetry / drone-SfM using ortho-images does not work well on steeply sloped surfaces, as the points on edges have been removed. Figure 25 shows an example of template matching using the SIFT keypoint and homography method. The green lines show the same points in the images captured at different positions. The white outline in the right image indicates the edges of the template. The method uses matched SIFT keypoints to calculate a 3×3 perspective transformation matrix, and then uses the matrix to transform the four corners of the template (the left image) to the corresponding four points in the right image.
Image gradients are directional changes in the intensities/colors of an image. Gradients in the x-axis and y-axis directions are computed as in Eq. 6. Gradients extract feature information from images, for example for edge detection (Solem 2012). They can also be used in feature and texture matching for images with different brightness or captured with different cameras, which is another approach to solving the brightness issue.
The histogram of oriented gradients (HOG) is a feature descriptor used in object detection, which is particularly suited to human detection in images (Dalal and Triggs 2005). Memarzadeh et al. (2013) detected construction equipment and workers in a construction site video stream using HOG plus color features. Kim, H. and Kim, H. (2018) also developed a concrete truck detector with HOG features and an SVM classifier.
\[
\nabla f = \begin{bmatrix} g_x \\ g_y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\[6pt] \dfrac{\partial f}{\partial y} \end{bmatrix} \qquad \text{Eq. 6}
\]

Where, ∂f/∂x is the derivative with respect to x (the gradient in the x direction), and ∂f/∂y is the derivative with respect to y (the gradient in the y direction).
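The following OpenCV-Python sketch approximates Eq. 6 with Sobel derivatives; it is an illustration rather than the author's implementation, and the file name is a placeholder.

```python
# Sketch of Eq. 6 using Sobel derivatives (a common discrete approximation).
import cv2
import numpy as np

img = cv2.imread("ortho_1.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float64)
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # df/dx, gradient in x
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # df/dy, gradient in y
magnitude = np.hypot(gx, gy)                     # gradient magnitude
orientation = np.arctan2(gy, gx)                 # gradient direction (radians)
```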
In general, object detection includes the tasks of object classification and object localization. The results are usually marked with different-colored rectangular boxes identifying the different objects' categories and their locations in the original image. An image classification task only needs to identify the main object in the image or in a specific image region. Image segmentation is more detailed than object detection: its result is a label-image of the same size as the input, which uses several colors to mark the different objects' categories at each pixel instead of the texture of the original image.
3.6.3.1 Limitations
Currently, detecting vegetation from a photogrammetric point cloud based on vegetation indexes and points' spatial geometrical relations (Anders et al. 2019; Cunliffe et al. 2016) has limitations, because it only allows the points to be classified into a ground subset and a non-ground (vegetation) subset. In addition, the vegetation index methods are effective in identifying green and yellow vegetation but ineffective with other colors, such as withered vegetation and shaded vegetation, which also results in misclassification issues.
Previous research results have shown the feasibility of deep learning methods for object detection using images (Schneider et al. 2018), video (Kang et al. 2018), and point clouds (Engelcke et al. 2017), and for image segmentation (Noh et al. 2015; Badrinarayanan et al. 2017). The limitation of the current deep learning-based methods is that they use low-resolution images for training the object detector: ImageNet images (Deng et al. 2009) are resized down to as small as 256×256 pixels, while the high-resolution setting is limited to about 800×1,000 pixels (Han et al. 2015). This issue is caused by the limitations of computer hardware. An image directly exported from a digital camera, such as one captured with the DJI Phantom 4 Pro V2.0, is as large as 3,648×4,864 pixels, which is far larger than 256×256 pixels. Reducing the image size impacts the effectiveness of image segmentation because the number of pixels representing each single object decreases as the image is resized down. For example, if the image is halved three times (a factor-of-eight reduction), an 8×8-pixel patch in the original image becomes a single pixel in the shrunken image.
3.6.3.2 Solutions
A potential approach to avoid resizing down the high-resolution image is to separately identify each 8×8-pixel small-patch of the original image and assemble them into the high-resolution result. The result of this approach is the same as the result of image segmentation using the low-resolution image. In addition, the small image patch size is better for training a deep learning-based image classification model, where the image classification task only needs to identify the main object in the image. In detail, the main object is distinguished from the background; no matter what other objects are included in the background, the small image patch will be identified as the main object's category. Therefore, a possible procedure to segment a high-resolution image without resizing it is to disassemble it into small-patches while recording the sequence IDs, classify each small-patch with a pre-trained deep learning-based image classification model, assign the class-label to each small-patch, and assemble the labeled small-patches back into the full-size segmentation result.
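A sketch of this disassemble-classify-assemble idea is given below; classify_patch is a hypothetical placeholder for any pre-trained patch classifier, and the 8×8 patch size and label encoding are illustrative assumptions.

```python
# Sketch of segmenting a high-resolution image patch by patch.
# classify_patch(patch) is a hypothetical callable returning a class id.
import numpy as np

def segment_by_patches(image, classify_patch, patch_size=8):
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.uint8)
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size]
            # every pixel of the patch receives the patch's predicted class
            labels[top:top + patch_size, left:left + patch_size] = classify_patch(patch)
    return labels

# usage: label_img = segment_by_patches(high_res_img, model_predict_class, 8)
```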
This literature review mainly discussed the feasibilities, weaknesses, and research opportunities of drone technologies and image-based 3D-reconstruction methods for determining construction site elevations. The review was carried out in several steps. First, this review evaluated the drone-related publications in the Journal of Construction Engineering and Management and Automation in Construction. Drone-related research in the top-ranked construction research publications has been increasing rapidly since 2015. From this comprehensive review and quantitative assessment, the following interesting findings were obtained:
1. The "DJI Phantom" series and "DJI Inspire" series drones are the most popular drones that have been used in the reviewed studies.
2. An optical camera is the most reasonable sensor to acquire RGB images for inspection applications.
3. The point clouds generated by photogrammetry / SfM methods have been benefiting drone applications in construction.
4. No published article mentioned the usage of drone-borne LiDAR technology at this moment.
Secondly, this review compared the current construction surveying techniques and the image-based 3D-reconstruction methods by reviewing the relevant literature from Google Scholar. From this comparison:
1. Drone photogrammetry / SfM is the current practice for construction site 3D-reconstruction tasks, which involve a complex procedure of data collection and data processing with photogrammetry software to generate the DTM for construction sites, while future work should streamline this procedure.
2. 3D-reconstruction using specific equipment's physical properties, such as LiDAR, or using the camera model and epipolar geometry, can achieve state-of-the-art results, while the cost and time performance is not good enough for determining elevations on construction sites. Also, a robust algorithm for extracting the desired features and eliminating the noisy features from an image pair is still a challenge with the current image processing and computer vision methods.
3. Taking a single ortho-image over a construction site with a drone system, and then using this single image to estimate the construction site elevations, could be a feasible approach with deep learning, which would reduce the drone's flying time and minimize the risk of a drone crash on a construction site.
Therefore, considering the success of the descent image cases, acquiring an ortho-image pair with a drone at a low altitude and a high altitude may be a feasible approach for faster construction site elevation determination, where the low ortho-image is entirely contained in the overlap and the overlapping parts have different scales. In Chapters 4 and 5, this goal was implemented by addressing three tasks: 1) to determine the distance from the ground surface to the drone, a modified triangulation model is required; 2) to get the most accurate dense corresponding matching results by NCC, the proposed image patch feature descriptors should be sensitive to the different scales in the ortho-image pair, and the patch size should be self-adjusting between small patches for complex textured regions and large patches for poorly textured regions; and 3) to rapidly and automatically compute distances, an innovative approach and algorithms need to be developed for generating matched pixel pairs in a dense pixel grid style while simultaneously determining the distances. After that, Chapter 6 discusses how to use a single ortho-image to determine geometrical data using deep learning.
4.1 Introduction
This chapter presents a modified stereo-vision triangulation method for construction site elevation determination, which uses a drone's camera to capture a low-high ortho-image pair instead of a left-right ortho-image pair of a construction site. This low-high ortho-image pair triangulation method is designed to enlarge the baseline distance and increase the measurable depth range compared to the classic stereo-vision method. The proposed method focuses on 3D-reconstruction of the ground surface of a construction site and excludes the side surfaces of the attached objects, which makes it simpler than traditional drone photogrammetry. In detail, the low ortho-image, which covers the entirety or at least a large portion of the construction site, is captured at half the height of the high ortho-image (see Figure 26). Then the entire low ortho-image is contained in the overlap of the ortho-image pair. Additionally, if a construction site is larger than a single ortho-image frame, the proposed method can stitch its results from adjacent ortho-image pairs with a very narrow overlapping strip, compared to the high overlapping ratio required in traditional drone photogrammetry.
[Figure 26: method overview: the drone camera at H and at ½H captures an ortho-image pair of the construction site; pixel grid sampling and the pixel matching and virtual elevation algorithm produce the point cloud, the ortho-image, and the elevation-map in the same pixel coordinate]
A key requirement of the proposed method is performing subpixel-level image correspondence matching. Most currently developed image matching methods are based on extracting and matching feature keypoints. The problem is that the extracted feature points are not evenly distributed throughout the image pair. In addition, traditional image-based 3D-reconstruction methods, such as SfM, separate image matching and geometrical data recovery into two sequential processes, which wastes some computing resources. Therefore, the author developed a low-high ortho-image pair pixel matching and virtual elevation algorithm, which aims to generate matched pixel pairs in a pixel grid while simultaneously determining the elevation data based on the low-high ortho-image pair triangulation method and the virtual elevation plane method. Additionally, testing was conducted on a construction site. Experimental results were evaluated and presented in this chapter to show the efficiency of the proposed method and the developed algorithms on a real construction site.
Using a single ortho-image (the size of a target object in an ortho-image has an inverse relationship with the drone flight height) to determine geometrical data is an ill-posed problem; more reference information is needed, such as additional ortho-images and camera positions (Eigen et al. 2014). The developed low-high ortho-image pair triangulation model is shown in Figure 27, where the drone moves vertically along its camera's principal ray without any horizontal shift or rotation.
[Figure 27: low-high ortho-image pair triangulation model: camera O′ at altitude H and camera O at H/2 share the same principal ray; the baseline is T = H/2; image points p(x, y, αf) on Image@H/2 and p′(x′, y′, αf) on Image@H correspond to the ground target point P(X, Y, Z); e and e′ are the principal points (C_x, C_y); Ground ±0.00 is the drone takeoff plane. Coordinates are expressed in the image coordinate (x, y).]
The low ortho-image (𝐼𝑚𝑎𝑔𝑒@𝐻/2) is captured at the low camera position 𝑂. This low altitude
𝐻/2 = 𝛼𝑓 𝐻𝑠𝑖𝑡𝑒 ⁄ℎ𝑖𝑚𝑎𝑔𝑒 should be high enough to capture the entire construction site. The high ortho-image
(𝐼𝑚𝑎𝑔𝑒@𝐻) is captured at the high camera position 𝑂′ with altitude 𝐻. These two ortho-images have the
same principal point 𝑒 with a 2:1 scaling relation, and the altitude differential between the low-high camera
stations is the triangulation baseline 𝑇 = 𝐻/2. The 𝑊𝑜𝑟𝑙𝑑 𝐶𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒(𝑋, 𝑌, 𝑍) is set at 𝐶𝑎𝑚𝑒𝑟𝑎 𝑂 𝐶𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒,
where the Z-axis is aligned with the camera’s principal ray downward to the ground. If image point pair
p(x, y, αf) and p′(x′, y′, αf) are matched, then the target point P(X, Y, Z) can be calculated by Eq. 7. In particular, when x′ = x/2 and y′ = y/2, Z = H/2, which means the target point is on Ground ±0.00. Therefore, a target point's elevation (relative to Ground ±0.00) can be determined by Elevation = H/2 − Z.
\[
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
\dfrac{H}{2 f_x}\,\dfrac{x x'}{x - x'} \\[8pt]
\dfrac{H}{2 f_y}\,\dfrac{y y'}{y - y'} \\[8pt]
\dfrac{H}{2}\,\dfrac{x'}{x - x'} \;\text{ or }\; \dfrac{H}{2}\,\dfrac{y'}{y - y'}
\end{bmatrix}
\qquad \text{Eq. 7}
\]

Where, P(X, Y, Z) (in the Camera O coordinate) is a target point on the construction site; p(x, y, αf) (in the Camera O coordinate) is the image point of P on Image@H/2; p′(x′, y′, αf) (in the Camera O′ coordinate) is the image point of P on Image@H; f_x and f_y are the focal lengths of the image in the x and y directions, and ideally f_x = f_y = αf; α is the factor that converts sensor size (mm) to image size (pixels).

From the similar triangles at the low camera position O,

\[
\frac{X}{x} = \frac{Z}{f_x} \qquad (\text{Eq. 7-1})
\]

From △Op′e′ ≌ △O′PZ,

\[
\frac{X}{x'} = \frac{Z + H/2}{f_x} \qquad (\text{Eq. 7-2})
\]

Subtracting (Eq. 7-1) from (Eq. 7-2),

\[
X\left(\frac{1}{x'} - \frac{1}{x}\right) = \frac{H}{2}\,\frac{1}{f_x}
\;\Rightarrow\;
X = \frac{H}{2 f_x}\,\frac{x x'}{x - x'} \qquad (\text{Eq. 7-3-1})
\]

Similarly,

\[
Y = \frac{H}{2 f_y}\,\frac{y y'}{y - y'} \qquad (\text{Eq. 7-3-2})
\]

From (Eq. 7-1),

\[
Z = \frac{X}{x} f_x \qquad (\text{Eq. 7-4-1})
\qquad\qquad
Z = \frac{Y}{y} f_y \qquad (\text{Eq. 7-4-2})
\]

Substituting (Eq. 7-3-1) into (Eq. 7-4-1),

\[
Z = \frac{H}{2}\,\frac{x'}{x - x'} \qquad (\text{Eq. 7-5-1})
\qquad\qquad
Z = \frac{H}{2}\,\frac{y'}{y - y'} \qquad (\text{Eq. 7-5-2})
\]

Thus, combining [X, Y, Z]^T gives Eq. 7.
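A minimal Python sketch of Eq. 7 is given below; it is not the dissertation's program, and the function name and argument order are illustrative assumptions.

```python
# Sketch of Eq. 7: recover the world coordinate and elevation of a target
# point from a matched image point pair p(x, y) on Image@H/2 and
# p'(x', y') on Image@H (assumed function name and units).
def low_high_triangulation(x, y, xp, yp, H, fx, fy):
    """Image coordinates in pixels, H in meters; returns (X, Y, Z, Ele.)."""
    if x == xp or y == yp:
        raise ValueError("x = x' or y = y': depth is undefined (zero disparity).")
    X = H * x * xp / (2.0 * fx * (x - xp))
    Y = H * y * yp / (2.0 * fy * (y - yp))
    Z = H / 2.0 * xp / (x - xp)          # equivalently H/2 * y'/(y - y')
    elevation = H / 2.0 - Z              # relative to Ground +/- 0.00
    return X, Y, Z, elevation

# Check: when x' = x/2 and y' = y/2 the point lies on the takeoff plane,
# i.e. Z = H/2 and elevation = 0.
```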
The low-high ortho-image pair is easily acquired in a very short time without interfering with other construction operations. This is because the drone's small dimensions and its automatic flight control system and sensors make it easily navigable in cluttered outdoor environments and allow it to hover at the desired positions. The drone flight altitude can be read directly from the remote controller, which sets ±0.00 at the drone takeoff point. The 3-axis gimbal helps keep the camera lens stably pointed downward.
Since the impacts of wind and GPS signal interference cannot be eliminated, a slight horizontal shift and rotation may occur during the drone's movement from the low position to the high position, which makes the high ortho-image's principal point slightly different from that of the low ortho-image. Thus, it is necessary to align the low-high ortho-image pair to the same center with a slight image rotation and translation.
The reference image (Image@H/2) and target image (Image@H) have a 2:1 scale, necessitating the creation of separate patch feature descriptors for each. Figure 28 indicates the four scaling directions for generating the four features g_{u*,v*}(u, v) for a reference pixel p(u, v). The example shows that g_{u−1,v−1}(u, v) matches with the target pixel value 5.75, meaning that the reference pixel is the bottom-right corner of the target pixel. Thus, the reference pixel and the target subpixel p′(u′+0.5, v′+0.5) are matched. Eq. 8 states the other three cases.
[Figure 28: four-scaling patch feature generation: the 2×2 neighborhoods of the reference pixel p(u, v) in the four scaling directions are average-pooled into the four features g_{u−1,v−1}, g_{u,v−1}, g_{u−1,v}, g_{u,v}; the matched feature (5.75 in the example) identifies the corresponding target subpixel in the target Image@H. Coordinates are expressed in the pixel coordinate (u, v).]
If the target pixel p′(u′, v′) matches with the scaling-created pixel g_{u*,v*}(u, v), then the returned target subpixel and target point p′(x′, y′) are:

\[
\begin{aligned}
g_{u-1,v-1}(u,v) &: \; p'(u'+0.5,\,v'+0.5) \;\rightarrow\; (u' - w/2 + 0.75,\; v' - h/2 + 0.75)\\
g_{u,v-1}(u,v)   &: \; p'(u'+0.0,\,v'+0.5) \;\rightarrow\; (u' - w/2 + 0.25,\; v' - h/2 + 0.75)\\
g_{u-1,v}(u,v)   &: \; p'(u'+0.5,\,v'+0.0) \;\rightarrow\; (u' - w/2 + 0.75,\; v' - h/2 + 0.25)\\
g_{u,v}(u,v)     &: \; p'(u'+0.0,\,v'+0.0) \;\rightarrow\; (u' - w/2 + 0.25,\; v' - h/2 + 0.25)
\end{aligned}
\qquad \text{Eq. 8}
\]
The 𝑃𝑖𝑥𝑒𝑙 𝐶𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒 and 𝐼𝑚𝑎𝑔𝑒 𝐶𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒 can be converted by (Eq. 8-1) and (Eq. 8-2). The
𝑃𝑖𝑥𝑒𝑙 𝐶𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒 is a 2D-coordinate with the origin on the upper-left pixel. The 𝐼𝑚𝑎𝑔𝑒 𝐶𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒 is a 3D-
coordinate with the origin on the image center (𝑤/2, ℎ/2), and a fixed z-axis value (𝑓, focal length).
\[
\text{Convert Pixel Coordinate } (u, v) \text{ to Image Coordinate } (x, y):\;
\begin{cases}
x = u - \dfrac{w}{2} + 0.5\\[4pt]
y = v - \dfrac{h}{2} + 0.5
\end{cases}
\qquad (\text{Eq. 8-1})
\]

\[
\text{Convert Image Coordinate } (x, y) \text{ to Pixel Coordinate } (u, v):\;
\begin{cases}
u = \mathrm{int}\!\left(x + \dfrac{w}{2}\right)\\[4pt]
v = \mathrm{int}\!\left(y + \dfrac{h}{2}\right)
\end{cases}
\qquad (\text{Eq. 8-2})
\]

Thus, from (Eq. 8-1), the reference point p(x, y) is

\[
p(x, y) = \left(u - \frac{w}{2} + 0.5,\; v - \frac{h}{2} + 0.5\right) \qquad (\text{Eq. 8-3})
\]

In (Eq. 8-1), the Pixel Coordinate is adjusted by 0.5 to obtain its Image Coordinate; similarly, a 0.5 subpixel needs an adjustment of 0.25 to obtain its Image Coordinate. Thus, from Eq. 8, the target pixel p′(u′, v′) can be converted to the target point p′(x′, y′):

\[
p'(x', y') =
\begin{cases}
\left(u' - \dfrac{w}{2} + 0.75,\; v' - \dfrac{h}{2} + 0.75\right)\\[4pt]
\left(u' - \dfrac{w}{2} + 0.25,\; v' - \dfrac{h}{2} + 0.75\right)\\[4pt]
\left(u' - \dfrac{w}{2} + 0.75,\; v' - \dfrac{h}{2} + 0.25\right)\\[4pt]
\left(u' - \dfrac{w}{2} + 0.25,\; v' - \dfrac{h}{2} + 0.25\right)
\end{cases}
\qquad (\text{Eq. 8-4})
\]
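For illustration, the hedged helpers below mirror Eq. 8-1 and Eq. 8-2; they are not the author's PX2IM / IM2PX implementation, and w and h denote the image width and height in pixels.

```python
# Hedged coordinate-conversion helpers mirroring Eq. 8-1 and Eq. 8-2.
def px2im(u, v, w, h):
    """Pixel coordinate (u, v) -> image coordinate (x, y), Eq. 8-1."""
    return u - w / 2 + 0.5, v - h / 2 + 0.5

def im2px(x, y, w, h):
    """Image coordinate (x, y) -> pixel coordinate (u, v), Eq. 8-2."""
    return int(x + w / 2), int(y + h / 2)

# round trip for a 1824x1824 ortho-image
x, y = px2im(912, 456, 1824, 1824)     # (0.5, -455.5)
print(im2px(x, y, 1824, 1824))         # (912, 456)
```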
In this research project, the single-pixel feature descriptor is extended to a patch feature descriptor, using the target patch u′v′ = I′_{u′,v′∈(2R+1)×(2R+1)}(u′, v′) to represent the target pixel/point in the target image. The patch size R is self-adapting and depends on the previous matching result; R is increased during the matching process until the minimum threshold is satisfied. Similarly, g_{u*,v*}(u, v) is extended to the patches u5v5, u0v5, u5v0, and u0v0, which are reference patches of size (2R+1)×(2R+1) used to represent the reference pixel/point in its four scaling directions. Each reference patch is generated from a patch I_{u,v∈[2×(2R+1)]×[2×(2R+1)]}(u, v) in the reference image with the average pooling operation.
As the reference patch descriptors have the same size as the target patch descriptor, the NCC method can be used to match them. In detail, using the NCC method: a) calculate the four NCC values between the reference patch descriptors u5v5, u0v5, u5v0, u0v0 and the target patch descriptor u′v′; b) choose the largest NCC value as the matched scaling direction; and c) calculate the subpixel location of the target pixel/point by Eq. 8. Figure 29 shows an example of a target patch descriptor and the four-scaling reference patch descriptors with R = 1, in which the 3×3 reference patches are scaled from 6×6-pixel patches. Thus, with the predefined image scaling, the patch-based NCC method is effective for the low-high ortho-image pair matching.
[Figure 29: example of a target patch descriptor and the four-scaling reference patch descriptors with R = 1. u5v5, u0v5, u5v0, and u0v0 are reference patch descriptors of size 3×3, generated from 6×6 patches of the reference image by average pooling; u′v′ is the 3×3 target patch descriptor. Patches are expressed in the pixel coordinate (u, v).]
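The sketch below illustrates how the four-scaling reference patch descriptors could be generated by 2×2 average pooling; the exact window offsets are inferred from Figure 29 and should be treated as assumptions, and this is not the dissertation's implementation. In use, the direction whose descriptor has the largest NCC with the (2R+1)×(2R+1) target patch centered on p′(u′, v′) would be chosen, as described above.

```python
# Sketch (assumed offsets) of the four-scaling reference patch descriptors.
import numpy as np

def avg_pool_2x2(a):
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def four_scaling_descriptors(ref_img, u, v, R):
    n = 2 * R + 1                       # descriptor size, e.g. 3 for R = 1
    descriptors = {}
    for name, (du, dv) in {"u5v5": (0, 0), "u0v5": (1, 0),
                           "u5v0": (0, 1), "u0v0": (1, 1)}.items():
        top, left = v - n + dv, u - n + du            # start of the 2n x 2n window
        window = ref_img[top:top + 2 * n, left:left + 2 * n].astype(np.float64)
        descriptors[name] = avg_pool_2x2(window)      # (2R+1) x (2R+1) descriptor
    return descriptors
```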
4.3.2 Low-high Ortho-image Pair Pixel Matching and Virtual Elevation Algorithm
The virtual depth-elevation plane model is shown in Figure 30; it avoids using a brute-force algorithm that matches all pixels in the target image to determine the corresponding target pixel for a reference pixel. In detail, a construction site is divided into several discrete virtual planes, and the drone takeoff plane is set as the origin plane (Depth = 0). The Depth-axis has positive values below the origin plane, while the Elevation-axis has positive values above the origin plane. The origin plane has distance H/2 to the drone's low-altitude position, so a real-world point P on a virtual plane Depth has distance Z = H/2 + Depth to the drone.
Figure 30 Virtual depth-elevation model and pixel matching and elevation determination flowchart
Substituting Z = H/2 + Depth into Eq. 7 results in x′ = f(x, Depth, H/2) and y′ = f(y, Depth, H/2). Thus, for a given reference point p(x, y) and a fixed H/2, each virtual plane generates a candidate target point p_i′(x_i′, y_i′) = f(x, y, Depth_i, H/2). If the reference point matches with the candidate target point p_i′(x_i′, y_i′) on virtual plane Depth_i, then the real-world point P is located on that virtual plane with the specific Ele_i = −Depth_i. The Elevation Coordinate (X′, Y′, Ele.) can be determined by Eq. 9.

\[
\begin{bmatrix} X' \\ Y' \\ Ele. \end{bmatrix}
=
\begin{bmatrix} x \cdot GSD \\ -y \cdot GSD \\ -Depth \end{bmatrix},
\quad \text{where } Ele. \in \left(-\frac{H}{2}, \frac{H}{2}\right)
\qquad \text{Eq. 9}
\]
In Figure 30, Z is the distance from the drone to point P, H/2 is the distance from the drone to its takeoff plane, and Depth is the distance from point P to the drone takeoff plane. Thus,

\[
Z = \frac{H}{2} + Depth \qquad (\text{Eq. 9-1-1})
\]

Substituting (Eq. 9-1-1) into (Eq. 7-5-1) and (Eq. 7-5-2),

\[
\frac{H}{2}\,\frac{x'}{x - x'} = \frac{H}{2} + Depth
\;\Rightarrow\;
x' = x \cdot \frac{1 + \dfrac{2}{H}Depth}{2 + \dfrac{2}{H}Depth} \qquad (\text{Eq. 9-2-1})
\]

\[
\frac{H}{2}\,\frac{y'}{y - y'} = \frac{H}{2} + Depth
\;\Rightarrow\;
y' = y \cdot \frac{1 + \dfrac{2}{H}Depth}{2 + \dfrac{2}{H}Depth} \qquad (\text{Eq. 9-2-2})
\]

Thus, for each Depth, the target point p′(x′, y′) has the following relationship with the reference point p(x, y), Depth, and H/2:

\[
p'(x', y') = f\!\left(x, y, Depth, \frac{H}{2}\right) \qquad (\text{Eq. 9-3})
\]

Assume the candidate target point p′(x′, y′) at virtual plane Depth matches with the reference point p(x, y); then the real-world point P falls on that virtual plane Depth:

\[
X' = X = x \cdot GSD \qquad (\text{Eq. 9-4-1})
\qquad\qquad
Y' = -Y = -y \cdot GSD \qquad (\text{Eq. 9-4-2})
\]

As x, y, and Depth can be determined by (Eq. 9-3), combining (Eq. 9-1-1), (Eq. 9-4-1), and (Eq. 9-4-2) gives Eq. 9.
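A small sketch of Eq. 9-2 is shown below: for a reference image point and a hypothesized virtual plane Depth, it returns the candidate target point on Image@H. The function name is illustrative and this is not the author's f(x, y, Depth, H/2).

```python
# Sketch of Eq. 9-2 (assumed function name): candidate target point on
# Image@H for a reference point p(x, y) and a hypothesised virtual plane.
def candidate_target_point(x, y, depth, H):
    scale = (1.0 + 2.0 * depth / H) / (2.0 + 2.0 * depth / H)
    return x * scale, y * scale          # (x', y') on Image@H

# Sanity check: on the takeoff plane (depth = 0) the scale is exactly 1/2,
# i.e. x' = x/2 and y' = y/2, which gives Z = H/2 and Ele. = 0 in Eq. 7.
print(candidate_target_point(400.0, -300.0, 0.0, 40.0))   # (200.0, -150.0)
```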
The proposed matching procedure is stated in the flowchart in Figure 30, which shows that a given reference point/pixel can be matched to a target point/pixel while returning a virtual plane value/elevation value simultaneously. For matching a series of point/pixel pairs, Depth_guess is set to the previous point/pixel's virtual plane value. A while-loop starts at Depth_guess and moves to the adjacent virtual planes by adding and subtracting Depth_step, until the best or most acceptable matching result is returned. The pseudocode of the low-high ortho-image pair pixel matching and virtual elevation algorithm is presented in Figure 31.
Assume NCC_MATCH_SCALING_LABEL(u, v, u', v', R) returns the largest NCC value and its scaling direction label Label_scaling from the match results of reference p(u, v) and target p(u', v') and their feature descriptors u5v5, u0v5, u5v0, u0v0 and u'v' of size (2R+1)×(2R+1);
Assume IM2PX(x, y) returns the Pixel Coordinate p(u, v) from the Image Coordinate p(x, y);
Assume PX2IM(u, v) returns the Image Coordinate p(x, y) from the Pixel Coordinate p(u, v);
Assume SUBPX2IM(u, v, Label_scaling) returns the Image Coordinate from the Pixel Coordinate p(u, v) and Label_scaling.

IMAGE_PAIR_MATCHING_VIRTUAL_ELEVATION(Img_H/2, Img'_H, p(u, v), Depth_guess, Depth_step, R, H/2)
1  Initial Depth = Depth_guess; NCC_max = 0; R_adj.ratio- = 1; R_adj.ratio+ = 1; Depth_current- = Depth_guess; Depth_current+ = Depth_guess + Depth_step
2  p(x, y) = PX2IM(u, v)
3  while Depth_current- > -H/2 or Depth_current+ < H/2
4      if Depth_current- > -H/2
5          p'(x', y') = f(x, y, Depth_current-, H/2); p(u', v') = IM2PX(x', y')
6          NCC_current-, Label_scaling- = NCC_MATCH_SCALING_LABEL(u, v, u', v', R × R_adj.ratio-)
7          if NCC_current- > NCC_max
8              NCC_max = NCC_current-; Depth_matched = Depth_current-; p'_matched(x', y') = SUBPX2IM(u', v', Label_scaling-)
9      if Depth_current+ < H/2
10         p'(x', y') = f(x, y, Depth_current+, H/2); p(u', v') = IM2PX(x', y')
11         NCC_current+, Label_scaling+ = NCC_MATCH_SCALING_LABEL(u, v, u', v', R × R_adj.ratio+)
12         if NCC_current+ > NCC_max
13             NCC_max = NCC_current+; Depth_matched = Depth_current+; p'_matched(x', y') = SUBPX2IM(u', v', Label_scaling+)
14     if NCC_max < Threshold_low
15         R_adj.ratio- += 0.2
16         R_adj.ratio+ += 0.2
17     else
18         Depth_current- -= Depth_step
19         Depth_current+ += Depth_step
20 return p(x, y), p'_matched(x', y'), Depth_matched, NCC_max

Figure 31 Pseudocode of low-high ortho-image pair pixel matching and virtual elevation algorithm
Figure 32 explains the pixel grid formation for sampling a low-high ortho-image pair for matching. In detail, pixels are selected at an interval of Grid Size, and each selected pixel in the pixel grid shares its elevation data with its neighbors within a Grid Size × Grid Size patch to create the Ele_map. A Margin at each image edge guarantees that all selected pixels have their patch descriptors.
The pseudocode of pixel grid and elevation-map formation algorithm is presented in Figure 33.
Assume A[N] returns a new List A[] with size N;
Assume MEDIAN(A[]) returns the median value of List A[];
Assume A[].APPEND(B) adds element B to the end of List A[];
Assume ELE2GRAY(Ele.) returns a Gray value 0~255 from the Elevation -H/2~H/2;
Assume Img(u, v).COPY() returns a same-size Grayscale Image/Array;
Assume SQDIFF_NORMED(p(u, v), p'[n](u', v')) returns the best matched p'[i] for p by Normalized Sum of Squared Differences (SSD);
Assume CUT(Img(u, v), Margin) returns Img(u, v)'s central region without the Margin region.

PIXEL_GRID_ELEVATION_MAP(Img_H/2(u, v), Img'_H(u', v'), Grid_size, H/2)
1  Initial v = Margin; Depth_map(u, v) = Img_H/2(u, v).COPY(); Ele_map(u, v) = Img_H/2(u, v).COPY()
2  while v <= Img_H_height - Margin
3      u = Margin; p_List(x, y)[5]; p_List'(x', y')[5]; Depth_List[5]; NCC_List[5]
4      while u <= Img_H_width - Margin
5          Depth_guess = Depth_map(SQDIFF_NORMED(p(u, v), [p(u - Grid_size, v), p(u - Grid_size, v - Grid_size), p(u, v - Grid_size), p(u + Grid_size, v - Grid_size)]))
6          p0(x, y), p0'(x', y'), Depth0, NCC0 = IMAGE_PAIR_MATCHING_VIRTUAL_ELEVATION(Img_H/2, Img'_H, p(u, v), Depth_guess, Depth_step, R, H/2)
7          p1(x, y), p1'(x', y'), Depth1, NCC1 = IMAGE_PAIR_MATCHING_VIRTUAL_ELEVATION(Img_H/2, Img'_H, p(u - s, v), Depth_guess, Depth_step, R, H/2)
8          p2(x, y), p2'(x', y'), Depth2, NCC2 = IMAGE_PAIR_MATCHING_VIRTUAL_ELEVATION(Img_H/2, Img'_H, p(u + s, v), Depth_guess, Depth_step, R, H/2)
9          p3(x, y), p3'(x', y'), Depth3, NCC3 = IMAGE_PAIR_MATCHING_VIRTUAL_ELEVATION(Img_H/2, Img'_H, p(u, v - s), Depth_guess, Depth_step, R, H/2)
10         p4(x, y), p4'(x', y'), Depth4, NCC4 = IMAGE_PAIR_MATCHING_VIRTUAL_ELEVATION(Img_H/2, Img'_H, p(u, v + s), Depth_guess, Depth_step, R, H/2)
11         p(x, y), p'(x', y'), Depth, NCC = MEDIAN(p_List(x, y)), MEDIAN(p_List'(x', y')), MEDIAN(Depth_List), MEDIAN(NCC_List)
12         Pixel_Grid[pair(p(u, v), p(x, y), p'(x', y'), Depth, NCC)].APPEND(pair(p(u, v), p(x, y), p'(x', y'), Depth, NCC))
13         X' = x × GSD; Y' = -y × GSD; Ele. = -Depth
14         Point_Cloud[P(X', Y', Ele.)].APPEND(P(X', Y', Ele.))
15         Ele_map(u - Grid_size/2 : u + Grid_size/2, v - Grid_size/2 : v + Grid_size/2) = ELE2GRAY(Ele.)
16         Depth_map(u - Grid_size/2 : u + Grid_size/2, v - Grid_size/2 : v + Grid_size/2) = Depth
17         u += Grid_size
18     v += Grid_size
19 Ortho_image(u, v) = CUT(Img_H/2(u, v), Margin); Elevation_Map(u, v) = CUT(Ele_map(u, v), Margin)
20 return Pixel_Grid[pair(p(u, v), p(x, y), p'(x', y'), Depth, NCC)], Point_Cloud[P(X', Y', Ele.)], Ortho_image(u, v), Elevation_Map(u, v)
In lines 6 to 10, the pixel matching and virtual elevation algorithm is repeated 5 times, for pixel p(u, v) and its neighbors, to enhance the matching results using their median values. The distance (s) to the neighboring pixels can be adjusted from 1 to Grid Size. After all selected pixels are traversed and matched, the matched results are stored in the Pixel_Grid, and a 3D point cloud (Point_Cloud) of the construction site is created as well. Furthermore, an ortho-image and elevation-map (which stores elevation data as 0~255 grayscale values) pair is created by separately cutting off the margins of the low ortho-image and the Ele_map, so that the pixels in the ortho-image and the elevation-map share the same pixel coordinate. The pseudocode of the pixel grid enhancement algorithm is presented in Figure 34.
Assume ROTATE(Img, D) returns a new Img'(u', v') by anticlockwise rotating Img(u, v) by D degrees;
Assume PIXEL_GRID_ELEVATION_MAP(Img_H/2, Img'_H, Grid_size, H/2) returns Pixel_Grid_Array[row(x, y, Depth, NCC)];
Assume ARRAY.SORT_1_2() sorts the Array by its 1st and 2nd columns;
Assume ARRAY.COL[i:j] returns the Array's i-th to j-th columns;
Assume A[].LowFence() returns the Low Fence Q1 - 1.5 × (Q3 - Q1) of List A[];
Assume LEN(A[]) returns the size of List A[].

PIXEL_GRID_ENHANCEMENT(Img_H/2(u, v), Img'_H(u', v'), Grid_size, H/2)
1  Initial Img_0 = Img_H/2; Img_90 = ROTATE(Img_0, 90); Img_180 = ROTATE(Img_0, 180); Img_270 = ROTATE(Img_0, 270);
   Img'_0 = Img'_H; Img'_90 = ROTATE(Img'_0, 90); Img'_180 = ROTATE(Img'_0, 180); Img'_270 = ROTATE(Img'_0, 270)
2  for i in [0, 90, 180, 270]
3      Pixel_Grid_Array_i[row(x, y, Depth, NCC)] = PIXEL_GRID_ELEVATION_MAP(Img_i, Img'_i, Grid_size, H/2)
4  R_Pixel_Grid_1[row(x, y, Depth, NCC)] = Pixel_Grid_Array_0[row(x, y, Depth, NCC)].SORT_1_2();
   R_Pixel_Grid_2[row(x, y, Depth, NCC)] = Pixel_Grid_Array_90[row(-y, x, Depth, NCC)].SORT_1_2();
   R_Pixel_Grid_3[row(x, y, Depth, NCC)] = Pixel_Grid_Array_180[row(-x, -y, Depth, NCC)].SORT_1_2();
   R_Pixel_Grid_4[row(x, y, Depth, NCC)] = Pixel_Grid_Array_270[row(y, -x, Depth, NCC)].SORT_1_2()
5  for i in [1, 2, 3, 4]
6      x_i[], y_i[], D_i[], W_i[] = R_Pixel_Grid_i[row(x, y, Depth, NCC)].COL[1:4]
7      Q_i = MAX(W_i.LowFence(), 0.001)
8  N = LEN(x_1[]); x_list[N]; y_list[N]; Depth_list[N]; C_list[N]
9  for i in [1, ..., N]
10     x_list.APPEND(x_1[i]); y_list.APPEND(y_1[i]); Depth_list.APPEND(MEDIAN(D_{i|Wi>=Qi}[])); C_list.APPEND(1_{W1>=Q1} 2_{W2>=Q2} 3_{W3>=Q3} 4_{W4>=Q4})
11     Enhanced_Pixel_Grid_Array[row(x, y, Depth, C)].APPEND(row(x_list[i], y_list[i], Depth_list[i], C_list[i]))
12     X' = x_list[i] × GSD; Y' = -y_list[i] × GSD; Ele. = -Depth_list[i]; p(u, v) = IM2PX(x, y)
13     Point_Cloud[P(X', Y', Ele.)].APPEND(P(X', Y', Ele.))
14     Ele_map(u - Grid_size/2 : u + Grid_size/2, v - Grid_size/2 : v + Grid_size/2) = ELE2GRAY(Ele.)
15 Ortho_image(u, v) = CUT(Img_H/2(u, v), Margin); Elevation_Map(u, v) = CUT(Ele_map(u, v), Margin)
16 return Enhanced_Pixel_Grid_Array[row(x, y, Depth, C)], Point_Cloud[P(X', Y', Ele.)], Ortho_image(u, v), Elevation_Map(u, v)

Figure 34 Pseudocode of the pixel grid enhancement algorithm
In the pixel grid and elevation-map algorithm, the selected pixels are traversed row by row, and each pixel uses the previous pixel's Depth value as the input variable Depth_guess for matching its own Depth value. To make the algorithm robust, the low-high ortho-image pair is rotated 90°, 180°, and 270° in a counterclockwise fashion, and the pixel grid and elevation-map algorithm is repeated to generate four Pixel_Grid_i results, starting from each corner of the ortho-image pair; the four results are then transformed back to the original coordinate. In each Pixel_Grid_i, if a selected pixel has an NCC value W_i[u, v] larger than 0.001 and larger than W_i.LowerFence_i (the lower fence of the NCC values of all selected pixels in Pixel_Grid_i; any NCC value less than the lower fence is considered an outlier), it is considered a strongly matched pixel pair; otherwise it is a weakly matched pixel pair. Combining the four Pixel_Grid_{i=1,2,3,4} matching results, each selected pixel/point has one of the 16 matching quality conditions listed in Table 8. For example, the 2nd condition means a selected pixel has strongly matched results in the original, 90°-rotation, and 180°-rotation ortho-image pairs and a weakly matched result for the 270° rotation; it is then assigned "123" as its matching quality label.
Table 8 Matching quality conditions

Matching Quality | # | W1 | W2 | W3 | W4 (each compared with Q_i = MAX(W_i.LowerFence, 0.001)) | Matching Quality Label C_list (1_{W1≥Q1} 2_{W2≥Q2} 3_{W3≥Q3} 4_{W4≥Q4}) | Enhanced Depth Depth_list = MEDIAN(D_{i|Wi≥Qi}[])
Strongest | 1 | ≥ | ≥ | ≥ | ≥ | 1234 | MEDIAN(D1, D2, D3, D4)
strong | 2 | ≥ | ≥ | ≥ | < | 123 | MEDIAN(D1, D2, D3)
strong | 3 | ≥ | ≥ | < | ≥ | 124 | MEDIAN(D1, D2, D4)
strong | 4 | ≥ | < | ≥ | ≥ | 134 | MEDIAN(D1, D3, D4)
strong | 5 | < | ≥ | ≥ | ≥ | 234 | MEDIAN(D2, D3, D4)
weak | 6 | ≥ | ≥ | < | < | 12 | MEDIAN(D1, D2)
weak | 7 | ≥ | < | ≥ | < | 13 | MEDIAN(D1, D3)
weak | 8 | ≥ | < | < | ≥ | 14 | MEDIAN(D1, D4)
weak | 9 | < | ≥ | ≥ | < | 23 | MEDIAN(D2, D3)
weak | 10 | < | ≥ | < | ≥ | 24 | MEDIAN(D2, D4)
weak | 11 | < | < | ≥ | ≥ | 34 | MEDIAN(D3, D4)
weaker | 12 | ≥ | < | < | < | 1 | D1
weaker | 13 | < | ≥ | < | < | 2 | D2
weaker | 14 | < | < | ≥ | < | 3 | D3
weaker | 15 | < | < | < | ≥ | 4 | D4
Weakest | 16 | < | < | < | < | 0 | MEDIAN(D1, D2, D3, D4)

* W_i is the set of NCC values of all selected pixels in Pixel_Grid_i; W_i.LowerFence_i = Q1 − 1.5 × (Q3 − Q1) is the lower fence of the NCC values of the selected pixels in Pixel_Grid_i, and any NCC value less than the lower fence is considered an outlier. 1_{W1≥Q1} means W_1[u, v] is a strongly matched result in the original ortho-image pair, so "1" is assigned to assemble the matching quality label C_1[u, v]; similarly, "2" is for the 90° rotation, "3" for the 180° rotation, and "4" for the 270° rotation of the ortho-image pair. D_{1|W1≥Q1}[u, v] means pixel p(u, v) is strongly matched in the original ortho-image pair; if not, D_{1|W1<Q1}[u, v] is not used to enhance Depth[u, v].
In Table 8, "weakest" means the four-rotation matching results of a pixel are all weakly matched. Another kind of weak matching occurs at the center of the ortho-image pair. When the reference point (x, y) is close to the center (0, 0), the target point (x′, y′) is also close to (0, 0) and becomes insensitive to Depth variation (see Eq. 9-2-1 and Eq. 9-2-2). This leads the pixel matching and virtual elevation algorithm to generate the same NCC value from different Depth values. Fortunately, the center region is usually a flat plane used for drone takeoff, so its elevation can easily be confined to its neighbors' elevations. Therefore, the Depth of a weakest pixel can be inherited from an adjacent pixel (C≥1) that has the closest texture feature. The updated pixel is assigned a new matching label C=5 and participates in enhancing the remaining pixels.
The developed algorithms were programmed in Python 3.6.8 and verified on the construction site shown in Figure 35. This beach site includes a stairway, a boardwalk with a rest area, several garbage cans, and vegetation. The elevation differentials between selected points were measured for evaluating the developed method; for example, the height of the bottom stair (above the ground) is 22.86 cm (9 inches), and the heights of the other reference points were measured in the same way.
During this research, a DJI Phantom 4 Pro V2.0 (focal length = 8.8 mm, Sensor_height = 8.8 mm) took off at point G, flew to point C, and captured ortho-images at five selected camera stations. At stations CA and CI, the drone captured ortho-image series at heights of 10 m, 20 m, and 40 m, which have flat central regions. At stations CG and CJ, ortho-images at heights of 10 m and 20 m were captured, which have concavo-convex central regions. At station CH, ortho-images at heights of 20 m and 40 m were captured, which are used for stitching with the other ortho-images and experimental results. Thus, four 10-20 ortho-image pairs and three 20-40 ortho-image pairs were assembled. Additionally, three pre-processing steps were implemented to generate the low-high ortho-image pairs shown in Figure 36: a) shrink the original images (4864×3648 pixels) to half resolution; b) cut the images to a square shape (1824×1824 pixels); and c) align the high ortho-image to the low ortho-image's center.
[Figure 36: low-high ortho-images captured at camera stations CA, CG, CH, CI, and CJ at altitudes of 10 m, 20 m, and 40 m]
The experimental parameter configurations for the developed method are explained in Table 9. The experimental elevation range was first set as [−H/4, H/4]; then, each pixel in an elevation-map used a grayscale value in [0, 255] to represent its elevation data. There are 200 major virtual planes and 1000 minor virtual planes in the range of [−H/4, H/4]. The pixel matching and virtual elevation algorithm searches all major planes. If two adjacent major planes return the same matching value, the algorithm adjusts Depth_step to Depth_step/5 and searches the five minor planes between those two major planes to find the best matching result. Additionally, the Grid Size for the 20-40 ortho-image pairs is 3/4 of that for the 10-20 ortho-image pairs. After running the developed Python program, the output includes a matched pixel grid with matching quality labels (see Figure 37), an ortho-image and elevation-map pair (see Figure 38 and Figure 39), and a 3D point cloud.
Table 9 Experimental parameter configurations for the developed method

| Parameters | Value (10-20 pair) | Value (20-40 pair) | Comments |
| --- | --- | --- | --- |
| Ortho-image Size | 1824×1824 pixels | 1824×1824 pixels | - |
| Grid Size | 32 pixels | 24 pixels | Pixel Grid formation, see Figure 32 |
| Initial Patch Size | R = 19 pixels; Window Size = 39×39 | R = 19 pixels; Window Size = 39×39 | Window = (2R + 1) × (2R + 1), see Figure 29; R × R_adj.ratio* is self-adapting, see Figure 31 |
| Maximum Patch Size | R × R_adj.ratio* = 76 pixels | R × R_adj.ratio* = 76 pixels | R_adj.ratio* ∈ [1, 4], see Figure 31 |
| Margin Size | 128 pixels | 96 pixels | 4 × Maximum(Grid Size, R), see Figure 32 |
| Expected Output Size | 1568×1568 pixels | 1632×1632 pixels | Ortho-image Size − 2 × Margin Size |
| Pixel Grid Number | 2500 | 4761 | (Expected Output Size / Grid Size + 1)² |
| Ground Sample Distance | 0.54 cm/pixel | 1.08 cm/pixel | Eq. 1 |
| Horizontal Space Resolution | 8.47×8.47 m² | 17.6×17.6 m² | GSD × Output Image Size |
| Elevation Range | [-5 m, 5 m] | [-10 m, 10 m] | [−H/4, H/4] |
| Virtual Plane Number | 200 | 200 | Virtual Plane formation, see Figure 30 |
| Major Depth Step | 0.05 m | 0.10 m | Depth_Step = (H/2) / 200, see Figure 31 |
| Minor Depth Step | 0.01 m | 0.02 m | Depth_Step / 5 |
| Elevation-map | gray(u,v) = 255 × (Ele(u,v) + H/4) / (H/2); Ele(u,v) = (H/2) × gray(u,v) / 255 − H/4 | same | An 8-bit grayscale image, see Figure 32 |
| Distance to Neighbor Pixels | s = Grid Size / 2 | s = Grid Size / 2 | s ∈ [0, Grid Size], see Figure 33 |
| Image Center Region | Radius = 192 pixels | Radius = 192 pixels | Pixels are confined to matching between the virtual planes Depth_guess − Depth_Step and Depth_guess + Depth_Step |
| Strong-matching Threshold | Maximum(Lower Fence_i, 0.001) | Maximum(Lower Fence_i, 0.001) | see Table 8 and Figure 34 |
| Matched Pixel Label/Mark | Label 0: red dot; labels 1, 2, 3, 4: pink dot; labels 12, 13, 14, 23, 24, 34: blue dot; labels 123, 124, 134, 234: cyan dot; label 1234: green dot | same | see Table 8 |
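As a cross-check of the derived entries in Table 9, the short sketch below recomputes them from the configuration values; it assumes Eq. 1 is the standard ground sample distance relation GSD = sensor height × altitude / (focal length × image height), and the function name is illustrative.

```python
def derived_parameters(H, ortho_size=1824, grid_size=32, R=19, shrink=2,
                       focal_mm=8.8, sensor_height_mm=8.8, image_height_px=3648):
    """Recompute the derived rows of Table 9 for a low-high pair flown at H/2 and H.

    Assumes Eq. 1 is the standard GSD relation; all other formulas follow the
    Comments column of Table 9.
    """
    # GSD of the half-resolution low ortho-image taken at altitude H/2, in cm/pixel.
    gsd_cm = (sensor_height_mm / 1000.0) * (H / 2.0) / (
        (focal_mm / 1000.0) * (image_height_px / shrink)) * 100.0

    margin = 4 * max(grid_size, R)                       # Margin Size
    output_size = ortho_size - 2 * margin                # Expected Output Size
    grid_number = (output_size // grid_size + 1) ** 2    # Pixel Grid Number
    major_step = (H / 2.0) / 200.0                       # Major Depth Step
    return {
        "GSD (cm/pixel)": round(gsd_cm, 2),
        "Margin Size (px)": margin,
        "Expected Output Size (px)": output_size,
        "Pixel Grid Number": grid_number,
        "Elevation Range (m)": (-H / 4.0, H / 4.0),
        "Major Depth Step (m)": major_step,
        "Minor Depth Step (m)": major_step / 5.0,
    }

# For the 10-20 pair (H = 20 m, Grid Size = 32): margin 128 px, output 1568 px,
# 2500 grid pixels, and depth steps of 0.05 m / 0.01 m, as in Table 9.
print(derived_parameters(H=20))
```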
The pixel grid matching results are listed in Figure 37. These experimental results show that the
developed four-scaling patch feature descriptor and pixel grid matching algorithm can generate the dense
pixel grids from the low-high ortho-image pairs, where strongly matched pixel pairs are evenly distributed
throughout each low-high ortho-image pair, even in the poorly textured beach regions and dense vegetation
regions.
Figure 37 Pixel grid matching results and NCC value distributions: 10-20 CA (77 red dots), 20-40 CA (31 red dots), 10-20 CG (0 red dots), 20-40 CH (14 red dots), 10-20 CI (0 red dots), and 20-40 CI (20 red dots)
The developed patch feature descriptors have a self-adapting mechanism (𝑅 × 𝑅𝑎𝑑𝑗.𝑟𝑎𝑡𝑖𝑜∗, line 14 in Figure 31), which uses a large 𝑃𝑎𝑡𝑐ℎ 𝑆𝑖𝑧𝑒 to improve the matching results in poorly textured regions, such as the area with the red umbrella. Furthermore, the shaded regions of the umbrella on the rest area, and most shaded regions of the tall tree on the rest area and the beach, are well matched, which overcame the impact of environment brightness changes. In comparison, the SIFT method only matched 432 sparse keypoints.
The NCC value distributions for each rotation of each ortho-image pair are different (see Figure
37) because their different starting corners result in different 𝐷𝑒𝑝𝑡ℎ𝑔𝑢𝑒𝑠𝑠 for the remaining selected pixels
(line 5 in Figure 33). In these results, the number of outliers on the boxplots has a positive correlation with
the number of weakest pixels, such as the 10-20 CA, which has the largest number of weakest pixels (77 of
2500). In addition, Table 10 shows that the strongly matched ratio is within [92.52%, 98.64%], while the weakest matching ratio is only within [0.00%, 3.08%] in the experimental results. Therefore, the developed pixel grid enhancement algorithm, which repeats the matching from the four starting corners of the squared ortho-image (line 3 in Figure 34), is effective in improving the matching quality.
The weakest matched pixel pairs, marked as red dots, primarily occurred in the regions of single trees, because their heights change suddenly from their surroundings within a very small region, in contrast to the umbrella in 10-20 CG, which is strongly matched. Plants have limited impact on elevation determination; further work should consider removing plants and restoring the ground surface under them. Other weakest matched pixel pairs occur on the ground next to the upper-right corner of the rest area in 10-20 CA, because the rest area and the beach have low-contrast texture caused by the shade of the nearby tall tree. However, the 10-20 CA elevation results in Figure 37 show that their elevations were determined well by the developed method, because the incorrect elevations were replaced by the pixel grid and elevation-map enhancement algorithm.
The elevation data (converted from the grayscale elevation-maps) are shown in Figure 38 and Figure 39, where the elevations were aligned to the ortho-image center as the elevation origin.
Figure 38 Elevation results and X/Y-profiles at stations CA and CI (10-20 and 20-40 pairs, with their overlaps). Ortho-images are shown in RGB color; the blue line is the x-profile and the red line is the y-profile (unit: m); elevation data are shown in the jet colormap (unit: m); 10-20 CA, 20-40 CA, 10-20 CI, and 20-40 CI were aligned to the image center as ±0.00. In the overlap plots, the red line is the 10-20 profile and the green line is the 20-40 profile (unit: m).
The experimental results show that the developed method is valid in flat central regions, such as stations CA and CI shown in Figure 38, and also works well in the concavo-convex central regions (see Figure 39). Furthermore, the developed method can handle steep and near-vertical topography, such as the vertical side of the garbage can at station CJ, the edge of the rest area at station CA, the umbrella at stations CA and CG, and the stairways at station CI. This is better than traditional drone photogrammetry-based 3D-reconstruction methods.
Figure 39 Elevation results and X/Y-profiles for 10-20 CG, 20-40 CH, and 10-20 CJ. Ortho-images are shown in RGB color; the blue line is the x-profile and the red line is the y-profile (unit: m); elevation data are shown in the jet colormap (unit: m); 10-20 CG was aligned to a point on the rest area; 20-40 CH and 10-20 CJ were aligned to the image center as ±0.00.
The overlapped X/Y-profiles of the 10-20 pairs and 20-40 pairs at stations CI and CA match in most parts in Figure 38. As the 20-40 pairs' GSD and 𝐷𝑒𝑝𝑡ℎ𝑠𝑡𝑒𝑝 are twice those of the 10-20 pairs (see Table 9), it is reasonable that the lower-altitude ortho-image pairs capture more detailed elevation variations, such as edges, salient features, and concave features. Furthermore, for the common objects appearing in different low-high ortho-image pairs, the developed method generated accurate elevation results.
The measured elevation differentials between the selected points are compared with the true elevation differentials in Table 11. The measurement differences are within [-4.36, 4.86] cm for the 10-20 ortho-image pairs and [-2.39, 2.76] cm for the 20-40 ortho-image pairs, which satisfy the 5.00 cm error standard (Takahashi et al. 2017). Therefore, the developed method is robust at different camera stations.
The disassembled discrete virtual elevation plane results for the selected points are compared in Table 12. Based on the virtual plane model (see Figure 30) and the pixel matching and virtual elevation algorithm (see Figure 31), the matched result should fall within a three-virtual-plane range, i.e., within the interval [−𝐷𝑒𝑝𝑡ℎ_𝑆𝑡𝑒𝑝, 𝐷𝑒𝑝𝑡ℎ_𝑆𝑡𝑒𝑝]. In this chapter, the experimental site was divided into 200 major virtual planes in the range of [-5, 5] m for the 10-20 ortho-image pairs and [-10, 10] m for the 20-40 ortho-image pairs. The designed interval between two major virtual planes is 5.00 cm for the 10-20 pairs and 10.00 cm for the 20-40 pairs. Three of the nine experimental results fell into the lower interval [−𝐷𝑒𝑝𝑡ℎ_𝑆𝑡𝑒𝑝, 0], and six of nine fell into the upper interval [0, 𝐷𝑒𝑝𝑡ℎ_𝑆𝑡𝑒𝑝]; they all matched within the expected discrete virtual planes based on the true elevation data. In other words, the developed virtual elevation algorithms are sensitive to major plane changes. Therefore, the matched pixel grids from the low-high ortho-image pairs contain accurate elevation data.
5.1 Introduction
Image and deep learning based methods have been applied to determine the relative depth
information for each pixel of an image of indoor scenes (Eigen et al 2014; Liu et al. 2015; Laina et al.
2016), outdoor scenes (Chen et al. 2016; Li and Snavely 2018) and scenes from automatic driving
applications (Garg et al. 2016). The main challenge of training a deep learning model is acquiring the
suitable dataset. In the application of determining construction site elevation, the elevation data can be
acquired by either contact or non-contact methods discussed in the literature review, but the challenge is
linking the ortho-image's pixels with the elevation values in the same pixel coordinate. Chapter 4 provides a possible approach that stores the elevation values in an equal-size 8-bit grayscale image, referred to as an elevation-map, which uses 0 as the elevation lower boundary and 255 as the elevation upper boundary. In Figure 40, the elevation-map is represented in the viridis colormap for better visualization, and the X/Y-profiles show the elevation changes at the selected point (elevation unit: m). Acquiring elevation data for each individual pixel of the ortho-image is impractical. To save time, Chapter 4 also provides the pixel grid formation, which lets a patch, such as a 32×32-pixel patch, share the same grayscale value / elevation value. For example, the 1st selected 𝑝𝑖𝑥𝑒𝑙 (16,16) shares its grayscale value / elevation value with the patch 𝐸𝑙𝑒𝑣𝑎𝑡𝑖𝑜𝑛_𝑚𝑎𝑝[0:31, 0:31]. Therefore, this chapter summarizes the findings in using the developed low-high ortho-image pair-based method to create the high-resolution ortho-image and elevation-map datasets.
The developed method in Chapter 4 captures an H-H/2 ortho-image pair at two flight altitudes. For example, the low-height ortho-images have Image Size = 3648×4864 pixels, GSD = 0.27 cm/pixel, and Site Size = 9.85×13.13 m² with H/2 = 10 m, and the high-height ortho-images captured at H = 20 m have twice the GSD and cover four times the site area. After the processes of the developed elevation determination method in Chapter 4, the expected output ortho-image is transformed (see Table 13) from the low-height ortho-image into a high-resolution ortho-image with Image Size = 1568×1568 pixels, GSD = 0.54 cm/pixel, and Site Size = 8.47×8.47 m².
Same as the high-resolution ortho-image, the high-resolution 8-bit grayscale elevation-map also
has the Image Size=1568×1568-pixel, GSD=0.54 cm/pixel, Site Size=8.47×8.47 m2 with H/2=10 m. After
the processes of the proposed elevation determination method, each pixel of the generated elevation-map
has a grayscale value ranging from 0 to 255 to represent the elevation value in [−𝐻/4, 𝐻/4], which is [-5 m, 5 m] with H/2 = 10 m, as given by Eq. 10:

\[
Elevation@(u,v) = Elevation\_map[u,v] \times \frac{H/2}{255} - \frac{H}{4} = Elevation\_map[u,v] \times \frac{10\,\mathrm{m}}{255} - 5\,\mathrm{m} \quad \text{(Eq. 10)}
\]
In addition, the pixel grid formation is set as a 32×32-pixel patch sharing the same grayscale value / elevation value. For example, the 1st selected 𝑝𝑖𝑥𝑒𝑙 (16,16) shares its grayscale value / elevation value with the patch 𝐸𝑙𝑒𝑣𝑎𝑡𝑖𝑜𝑛_𝑚𝑎𝑝[0:31, 0:31]. Furthermore, the elevation-map is aligned to the ortho-image center as the elevation origin (±0.00 m).
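A minimal sketch of the grayscale-to-elevation conversion of Eq. 10 and its inverse, assuming NumPy arrays; the function names are illustrative.

```python
import numpy as np

def elevation_from_map(elevation_map, H=20.0):
    """Convert an 8-bit grayscale elevation-map to elevations in meters (Eq. 10)."""
    return elevation_map.astype(np.float64) * (H / 2.0) / 255.0 - H / 4.0

def map_from_elevation(elevation, H=20.0):
    """Inverse conversion: elevations in [-H/4, H/4] m back to 8-bit grayscale values."""
    gray = 255.0 * (np.asarray(elevation) + H / 4.0) / (H / 2.0)
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

# Example: a grayscale value of 127 corresponds to approximately -0.02 m when H = 20 m.
print(elevation_from_map(np.array([[127]], dtype=np.uint8), H=20.0))
```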
The developed method in Chapter 4 has an adjustable measuring space range, which depends on
the drone flight altitude and camera parameters. Raising the drone’s altitude will increase the ortho-image
pair’s coverage, which is better for getting the overall construction site topography. On the other hand, to
get detailed structure shapes, it is better to use a lower-altitude ortho-image pair, which uses more pixels to represent small objects. The author recommends using a DJI Phantom 4 Pro V2.0, which gives an 8.47×8.47 m² coverage with a 10-20 ortho-image pair and a 17.6×17.6 m² coverage with a 20-40 ortho-image pair (see Table 9).
Where the construction site is larger than a single image frame, such as in roadway projects, a series of ortho-image pairs can be captured along a serpentine-style path (see Figure 41).
In detail, the drone is planned to take off and reach the desired altitudes 𝐻/2 and 𝐻 to capture the low-high ortho-image pair at the takeoff station. After the drone finishes the high ortho-image capture at altitude 𝐻, it moves forward to the next station, where the distance between two stations should give the adjacent low ortho-images enough overlap for image stitching. The drone takes the high ortho-image at altitude 𝐻 at the 2nd station first, then moves downward and captures the second low ortho-image once it reaches the desired altitude 𝐻/2. After that, the drone continues to move forward and repeat the previous steps until it has acquired enough low-high ortho-image pairs to cover the entire construction site. This designed path guarantees that each low-high ortho-image pair has the same center. Furthermore, it is convenient to add and modify ortho-image pairs beyond the originally planned acquisitions.
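The following is a minimal sketch of generating the capture stations and altitude sequence for such a serpentine path; the alternating low-first/high-first order and the spacing values are illustrative assumptions, and the station spacing must be chosen so that adjacent low ortho-images overlap enough for stitching.

```python
def serpentine_stations(n_lanes, stations_per_lane, spacing_m, lane_offset_m, H=20.0):
    """Generate (x, y, altitude) waypoints for capturing low-high ortho-image pairs
    along a serpentine path; spacing values here are placeholders only."""
    waypoints = []
    station_index = 0
    for lane in range(n_lanes):
        cols = range(stations_per_lane)
        if lane % 2 == 1:                      # reverse every other lane (serpentine pattern)
            cols = reversed(list(cols))
        for c in cols:
            x, y = c * spacing_m, lane * lane_offset_m
            if station_index % 2 == 0:         # low (H/2) first, then high (H)
                waypoints += [(x, y, H / 2.0), (x, y, H)]
            else:                              # high (H) first, then low (H/2)
                waypoints += [(x, y, H), (x, y, H / 2.0)]
            station_index += 1
    return waypoints

# Example: two lanes of three stations, 6 m apart (illustrative spacing).
for wp in serpentine_stations(2, 3, spacing_m=6.0, lane_offset_m=6.0):
    print(wp)
```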
It is important to avoid the drone’s shift and rotation as much as possible during the vertical
moving and capturing of the low-high ortho-image pair at each station. In this research project, the pre-
processing steps of image rotation and image translation are based on the SIFT keypoints. The high ortho-
images were rotated within the range of [-2.862, 0.321] degrees, with the minimum absolute rotation of 0.260 degrees in 10-20 CA and the maximum absolute rotation of -2.862 degrees in 10-20 CG. The high ortho-images were translated within the range of [-3.99, 13.02] pixels in the x-direction and [-22.57, 13.16] pixels in the y-direction, with the minimum absolute translation of [1.83, 1.30] pixels (x/y) in 20-40 CI and the maximum absolute translations of [13.02, 13.16] pixels in 10-20 CA and [-1.85, -22.57] pixels in 10-20 CG.
Table 14 shows the correlations between the absolute rotation degrees and translation distances and the number of weakest pixels in Table 10. Based on the correlation results, the X-direction translation (in the image width direction) has a significant positive correlation with the number of weakest pixels, i.e., it degrades the pixel matching quality, while the Y-direction translation (in the image height direction) and the rotation have no significant correlation with the matching quality. The maximum X-direction translation occurred in 10-20 CA, which has the largest number of weakest pixels. Therefore, minimizing the shift in the image width direction is the most important factor in acquiring high-quality low-high ortho-image pairs.
The algorithm parameters configuration in Table 9 should be adapted for the elevation
determination on a construction site. The proposed pixel grid formation simplifies the ortho-image pairs’
matching. Reducing the 𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒 can generate more detailed results while also raising the computing time, but the additional computing cost of matching all pixels (𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒 = 1) provides no additional benefit from the extra detail.
To save time and avoid wasting computing resources, it is better to add an early-stop function to the pixel matching and virtual elevation algorithm (Figure 31): if 𝑁𝐶𝐶𝑐𝑢𝑟𝑟𝑒𝑛𝑡+ < 𝛼 × 𝑁𝐶𝐶𝑚𝑎𝑥, stop matching at 𝐷𝑒𝑝𝑡ℎ𝑐𝑢𝑟𝑟𝑒𝑛𝑡+; if 𝑁𝐶𝐶𝑐𝑢𝑟𝑟𝑒𝑛𝑡− < 𝛼 × 𝑁𝐶𝐶𝑚𝑎𝑥, stop matching at 𝐷𝑒𝑝𝑡ℎ𝑐𝑢𝑟𝑟𝑒𝑛𝑡−. The author recommends using 𝛼 = 0.7, which balances computing time and accuracy. The matching time is also affected by the site shape; a relatively flat site takes less time than one with many elevation changes. The tested matching of an 11449-pixel grid (𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒 = 16) in the 20-40 ortho-image pairs takes slightly longer than 12 minutes on the experimental computer (Python 3.6.8, Intel® Xeon® Gold 5122 CPU @ 3.60 GHz). The computing times for the 2500-pixel grid (𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒 = 32) in the 10-20 ortho-image pairs and the 4761-pixel grid (𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒 = 24) in the 20-40 ortho-image pairs are around 2 to 5 minutes, which are the recommended pixel grid configurations.
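A minimal sketch of the bidirectional search with the recommended α = 0.7 early stop; ncc is again a hypothetical scoring callback for a candidate pixel pair on a given virtual plane.

```python
def match_with_early_stop(ncc, depth_guess, depth_step, n_planes=200, alpha=0.7):
    """Search upward (+) and downward (-) from depth_guess, stopping a direction
    once its NCC score falls below alpha times the best score seen so far."""
    best_depth, ncc_max = depth_guess, ncc(depth_guess)
    for sign in (+1, -1):                      # Depth_current+ and Depth_current- directions
        for k in range(1, n_planes + 1):
            depth = depth_guess + sign * k * depth_step
            score = ncc(depth)
            if score > ncc_max:
                best_depth, ncc_max = depth, score
            elif score < alpha * ncc_max:      # early stop for this direction
                break
    return best_depth, ncc_max
```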
The configuration of 200 major planes and 1000 minor planes balanced the accuracy and
computing time. Table 12 shows 3 of 9 experimental results fell into the lower interval [−𝐷𝑒𝑝𝑡ℎ_𝑆𝑡𝑒𝑝, 0], and
6 of 9 fell into the upper interval [0,𝐷𝑒𝑝𝑡ℎ_𝑆𝑡𝑒𝑝]; they are all matched within the expected discrete virtual
planes based on the true elevation data. The 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝑙𝑜𝑤 (line 14 in Figure 31) was set to 0.4 in this research; the author's recommended range is [0.3, 0.7]. Raising it can improve the matching accuracy in poorly textured ortho-image pairs but can also introduce errors; for example, the noise points in 10-20 CA (see Figure 38) were matched on the wrong virtual planes. Additionally, the distance 𝑠 (in lines 7, 8, 9, and 10 in Figure 33) was set to 𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒/2 to balance the smoothing of the elevation-map and the retention of detailed elevation changes.
Additionally, as the drone's shift and rotation cannot be totally eliminated, the image center region in Table 9 is an important parameter. Pixels in this region are limited to matching within the upper and lower adjacent virtual planes [𝐷𝑒𝑝𝑡ℎ𝑔𝑢𝑒𝑠𝑠 − 𝐷𝑒𝑝𝑡ℎ𝑆𝑡𝑒𝑝, 𝐷𝑒𝑝𝑡ℎ𝑔𝑢𝑒𝑠𝑠 + 𝐷𝑒𝑝𝑡ℎ𝑆𝑡𝑒𝑝]. This setting helps avoid incorrect elevation results for the center-region pixels and keeps their elevation results close to their surroundings; otherwise, due to the reason discussed in section 4.3.3.3, incorrect matching would happen there. In addition, this research also applied this setting to pixels on the y-axis (𝑥 ≤ 𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒). With the repeated matching in the pixel grid and elevation-map enhancement algorithm (line 3 in Figure 34), the noise points on both the X-axis and Y-axis are reduced. The author recommends using a circular center region with a radius of 192 pixels, as listed in Table 9.
The high-resolution ortho-image and elevation-map datasets are set up to train and test the proposed deep learning-based method in this research. The selected construction site is a lake beach (Atwater Park, Shorewood, WI, USA), which includes a stairway, a boardwalk with a rest area, several garbage cans, and vegetation (see Figure 35, Figure 42, Figure 56, and Figure 80). The ortho-images were captured at this site from March 2019 to September 2019; thus, different vegetation growing conditions were covered by the datasets.
In Figure 43, Figure 44, Figure 45, the 1st and 2nd column ortho-images were taken in Atwater
Park (Shorewood, WI, USA) during different seasons. In detail, a) Data A and B were taken on 3/24/2019,
when the vegetation had not recovered yet; b) Data C and D were taken on 6/5/2019, when the vegetation
was growing; c) Data AC, AO, and CA were taken in September 2019, when the vegetation was fully grown; and d) Data B, D, AC, and AO show the same wooden platform, which is different from Data CA. The 3rd-column high-resolution 24-bit RGB ortho-images have Image Size = 1568×1568 pixels, GSD = 0.54 cm/pixel, and Site Size = 8.47×8.47 m² with H/2 = 10 m. The 4th-column high-resolution 8-bit grayscale elevation-maps also have Image Size = 1568×1568 pixels, GSD = 0.54 cm/pixel, and Site Size = 8.47×8.47 m². After the processes of the developed elevation determination method, each pixel of the generated elevation-maps has a grayscale value from 0 to 255 to represent an elevation value in [−5 m, 5 m].
In Figure 46, the 1st and 2nd column ortho-images were taken in Atwater Park (Shorewood, WI,
USA). In detail, data CG details the umbrella, data CI details the stairways, and data CJ details the garbage cans.
In Figure 47, the 1st and 2nd column ortho-images were taken in Atwater Park (Shorewood, WI,
USA). The 3rd-column high-resolution 24-bit RGB ortho-images have Image Size = 1632×1632 pixels, GSD = 1.08 cm/pixel, and Site Size = 17.6×17.6 m² with H/2 = 20 m. The 4th-column high-resolution 8-bit grayscale elevation-maps also have Image Size = 1632×1632 pixels, GSD = 1.08 cm/pixel, and Site Size = 17.6×17.6 m². After the processes of the developed elevation determination method, each pixel of the generated elevation-maps has a grayscale value from 0 to 255 to represent an elevation value in [−10 m, 10 m].
(Figure 47 panels: CA, CH, and CI, captured on 9/19/2019.)
As elevation data are saved in the grayscale elevation-map format (see Figure 40), it is convenient
to stitch adjacent elevation-maps into a larger elevation-map by simply selecting two corresponding points
in their associated ortho-images as the boundary and aligning the elevation data at the selected boundary.
Figure 48 shows the results of up-down stitching and left-right stitching. Although the 10-20 CJ and 10-20
CI ortho-images have different exposure values, the combined elevation-maps are smooth at their junctions.
The accuracy of the developed method was not impacted by the brightness of the environments.
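A minimal sketch of the left-right stitching step, assuming NumPy arrays, a single user-selected corresponding point in each elevation-map as the boundary, and equal image heights; the function and variable names are illustrative.

```python
import numpy as np

def stitch_left_right(left_map, right_map, left_pt, right_pt, H=20.0):
    """Stitch two 8-bit elevation-maps side by side.

    left_pt and right_pt are the (row, col) of the same physical point selected
    in the two associated ortho-images; the right map's grayscale values are
    shifted so both maps agree on the elevation at that boundary point.
    Assumes both maps have the same number of rows.
    """
    scale = (H / 2.0) / 255.0                      # meters per grayscale step (Eq. 10)
    offset_gray = int(left_map[left_pt]) - int(right_map[right_pt])

    # Shift the right map's grayscale values (i.e., its elevations) to match the left map.
    right_aligned = np.clip(right_map.astype(np.int32) + offset_gray, 0, 255).astype(np.uint8)

    # Crop each map at the selected column and place the two parts side by side.
    left_part = left_map[:, :left_pt[1]]
    right_part = right_aligned[:, right_pt[1]:]
    stitched = np.concatenate([left_part, right_part], axis=1)
    return stitched, offset_gray * scale           # stitched map and elevation shift in meters
```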
In Figure 48, point clouds were converted from each pixel of the stitched elevation-maps by Eq. 9. The overall shape of the experiment site was well reconstructed, and small objects, such as the single tree in 20-40 CH, were also well reconstructed. The side points of vertical surfaces are missing because the developed method only used top-view ortho-images; the missing side points have no impact on determining the elevations of a construction site. When the drone flew at 10 m, some small objects' side surfaces were recorded in the ortho-image, such as the garbage can in ortho-image 10 CI, because the reflected rays converged through the camera lens instead of passing parallel into the lens. Increasing the altitude or flying the drone directly over these objects can eliminate this effect, and the horizontal position of a point on a vertical side surface can be corrected by Eq. 7 if necessary. What's more, these small and isolated objects have limited impact on determining the construction site elevations.
There are noticeable elevation errors on the edge of the red umbrella in 20-40 CA, where the pixel pairs were weakly matched as pink and blue dots in Figure 37. This is different from the single tree, whose pixel pairs were all strongly matched as green dots (see Figure 37). The state-of-the-art SIFT method is invalid there, as no keypoint was matched in Figure 37. However, there are several approaches that can be used to address this issue:
1. Lower the drone flight altitude to the 10-20 m pair to capture more detail, as in 10-20 CG in Figure 37.
2. Decrease the 𝐺𝑟𝑖𝑑 𝑆𝑖𝑧𝑒 to densely match more pixel pairs and smooth vertical changes, which, like a lower altitude, uses more pixels to represent an object.
3. Remove the weakly matched pixels, then use the PMVS method for dense reconstruction.
4. Fix it with additional processes, such as using a convolutional neural network first to distinguish the umbrella surface from other surfaces and then assigning the correct elevation values to it.
6.1 Introduction
The method developed in Chapters 4 and 5 uses a low-high ortho-image pair to generate the output as an ortho-image and elevation-map pair. In the case of automatic driving, the forward-facing view has the camera's principal ray perpendicular to the objects in front of the car. Thus, the images captured in front of self-driving cars and those captured above construction site surfaces share a common characteristic: objects at the same depth level / elevation level have common texture features in the forward-facing view / ortho-image. Therefore, capturing an ortho-image over a construction site by drone and then using this single image to estimate the site elevations is a feasible approach, which reduces drone flying time and avoids the hazards of drone crashes on the construction site.
In this chapter, a convolutional encoder-decoder network model is proposed to estimate elevations from the ortho-images of a construction site, linking each pixel of the ortho-image with the same-coordinate pixel of an elevation-map (see Figure 49). This chapter also evaluates the effectiveness of the single-image-frame-based 3D-reconstruction method, which requires far fewer images for estimating elevations than the developed low-high ortho-image pair method in Chapter 4. To explain how to estimate site elevations from a single-frame drone-based ortho-image, the rest of this chapter presents the dataset acquisition, model designs, training and testing, field experiments, and result discussions.
(Figure 49: disassemble the ortho-image, estimate pixel elevations pixel-to-pixel, and assemble the elevation-map.)
Considering the computing capacity of the workstation system, in Figure 50 the 1st to 5th columns list the possible model input and output small-patch examples of 32×32, 64×64, 128×128, 256×256, and 512×512 pixels, which are cropped from [0:31, 0:31], [0:63, 0:63], [0:127, 0:127], [0:255, 0:255], and [0:511, 0:511] of the high-resolution 1536×1536-pixel ortho-image and elevation-map (the 6th column),
respectively. For the elevation-map small-patches, each larger patch contains four times more elevation
values than the smaller patch. For example, the 64×64-pixel small-patch contains elevation values from
𝑝𝑖𝑥𝑒𝑙 (16,16), 𝑝𝑖𝑥𝑒𝑙 (16,48), 𝑝𝑖𝑥𝑒𝑙 (48,16) and 𝑝𝑖𝑥𝑒𝑙 (48,48), while the 32×32-pixel patch only contains the
elevation value from 𝑝𝑖𝑥𝑒𝑙(16,16). Thus, a smaller patch size is better for the deep learning model to learn
the local features from the input and output dataset. On the other hand, a larger patch size is better for
learning the global features from the input and output dataset.
To create the 32×32, 64×64, 128×128, 256×256, and 512×512-pixel patches, the stride is set as 16, 32, 64, or 96 pixels for moving these square boxes over the ortho-image and elevation-map (larger strides are used to avoid workstation system memory shortages), and the number of patches can be determined by Eq. 11, where “⌊ ⌋” is the floor function. Moreover, in order to make the deep learning model robust to different image orientations, the ortho-image and elevation-map are rotated by 90, 180, and 270 degrees to increase the dataset by four times. Table 15 lists the detailed parameters for creating the small-patch datasets from a high-resolution ortho-image and elevation-map pair.
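A short sketch of the patch-count computation of Eq. 11 combined with the four-rotation augmentation; the stride pairing below is inferred from the dataset sizes quoted in the next paragraph.

```python
def patch_count(image_size, patch_size, stride, rotations=4):
    """Patches per axis is floor((image - patch) / stride) + 1 (Eq. 11); the total is
    squared for both axes and multiplied by the number of 90-degree rotations."""
    per_axis = (image_size - patch_size) // stride + 1
    return per_axis ** 2 * rotations

# Reproduces the dataset sizes used in this chapter (1536x1536 source images):
for patch, stride in [(32, 16), (64, 32), (128, 32), (256, 64), (512, 96)]:
    print(patch, patch_count(1536, patch, stride))   # 36100, 8836, 8100, 1764, 484
```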
An ortho-image acquired by the drone has RGB 3-channel. Considering that the color textures are
important in distinguishing different objects on the construction site, the texture information is kept rather
than using a grayscale image. Therefore, using a high-resolution ortho-image can produce the model
training datasets with 𝑠ℎ𝑎𝑝𝑒 (36100,32,32,3), 𝑠ℎ𝑎𝑝𝑒 (8836,64,64,3), 𝑠ℎ𝑎𝑝𝑒 (8100,128,128,3), 𝑠ℎ𝑎𝑝𝑒 (1764,256,256,3),
or 𝑠ℎ𝑎𝑝𝑒 (484,512,512,3), where the first number is the quantity of the small-patches.
The elevation-map generated from the low-high ortho-image pair-based method only has one
channel. Disassembling a high-resolution elevation-map can produce the small-patch datasets with
𝑠ℎ𝑎𝑝𝑒 (36100,32,32,1), 𝑠ℎ𝑎𝑝𝑒 (8836,64,64,1), 𝑠ℎ𝑎𝑝𝑒 (8100,128,128,1), 𝑠ℎ𝑎𝑝𝑒 (1764,256,256,1), or 𝑠ℎ𝑎𝑝𝑒 (484,512,512,1).
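A minimal NumPy sketch of the overlapping disassembly that produces arrays of these shapes; the function name is illustrative.

```python
import numpy as np

def disassemble(image, patch, stride, rotations=(0, 1, 2, 3)):
    """Crop an HxWxC image into overlapping patch x patch tiles for each 90-degree
    rotation, returning an array of shape (N, patch, patch, C)."""
    tiles = []
    for k in rotations:                              # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, k)
        h, w = rotated.shape[:2]
        for r in range(0, h - patch + 1, stride):
            for c in range(0, w - patch + 1, stride):
                tiles.append(rotated[r:r + patch, c:c + patch])
    return np.stack(tiles)

# Example: a 1536x1536x3 ortho-image with 32-pixel patches and a 16-pixel stride
# yields shape (36100, 32, 32, 3); the matching elevation-map yields (36100, 32, 32, 1).
```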
In this research project, the proposed deep learning model is a convolutional encoder-decoder
network model (see Figure 51), which has an equal number of max pooling layers and up sampling layers.
This type of model is referred to as an “hourglass-like” model, which has been widely used in image
segmentation, such as SegNet (Badrinarayanan et al. 2017). Another “hourglass-like” model uses
deconvolution network in the decoder, such as DeconvNet (Noh et al. 2015), where each deconvolution
(also known as transposed convolution) layer is the opposite operation of normal convolution (Chollet 2015). During this research project, the convolutional decoder and the deconvolutional decoder were compared, and their generated results did not have any significant difference. What's more, the proposed model is different from SegNet, in which the up-sampling layer is the first layer in the decoder; the proposed model's decoder starts with a convolution layer (see Table 16).
In the encoder block, the five convolution layers learn the model input ortho-image patch as
feature-maps; each convolution layer contains a 2D convolution operation with zero-padding (see Figure
52), and the layer output has the same size as the layer input (Chollet 2015). In detail, Figure 52 shows an
example of a zero-padded convolution operation. The original input is 5×5 in size, which has been padded
to 7×7; the 3×3 kernel convolution has a 5×5 output, which has the same size as the original input. If the
original input is not padded with zero, the convolution output is the filled 3×3 region only. Additionally,
each max pooling layer next to a convolution layer performs a max pooling operation (see Figure 53), which reduces the layer input (the convolution layer output) to half its size as the layer output. For example, Figure 53 shows an example of how max pooling (2×2 filter and strides = 2) works in the model.
In the decoder block, the five convolution layers interpret the feature-maps to model an elevation-
map output; each convolution layer contains a 2D convolution operation with zero-padding as well; the up
sampling layers are the reverse operations of the max pooling operations, which enlarge the layer input to twice its size as the layer output. Figure 53 shows an example of how up sampling works in the model.
To make this encoder-decoder model able to interpret an ortho-image patch and predict an
equivalent size elevation-map patch, the intersection part of the encoder-decoder is proposed as a 512-
channel feature-map, which is generated from the “max_pooling2d_5” layer (see Table 16). For example,
the encoder generates a 4×4×512 feature-map for the 128×128×3 input (see Figure 51). This intermediate feature-map is required by the model output. Based on the dataset creation, each elevation-map shares a common integer from 0 to 255 (8-bit grayscale value) in every 32×32 patch; thus, a 256-channel feature-map with size 4×4 is required for the decoder to generate the 128×128×1 output. That can be explained as each channel being the probability of an integer from 0 to 255. As a 128×128 elevation-map patch contains 16 (4×4) elevation values, at least a 4×4×256 feature-map is required for the decoder. The proposed “conv2d_5” layer uses 512 filters (see Table 16), which doubles the required channel number. The 512-channel feature-map can be understood as each channel being the probability of an element in the list [0.0, 0.5, 1.0, …, 254.5, 255.0]. What's more, in this research project, enlarging the intersection feature-map to 1024 channels made no difference compared with the 512-channel configuration. In addition, five max pooling layers is the maximum number for the encoder, because the smallest model input, the 32×32-pixel patch, is transformed into a 1×1-pixel feature-map after the fifth max pooling layer.
Table 16 Model architecture for the 32×32, 64×64, 128×128, 256×256, and 512×512-pixel patch trials

| Block | Layer (type, kernel) | Strides | Padding | Activation | Filters | Parameters | Output rows/columns (patch 32 / 64 / 128 / 256 / 512) | Output channels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Input | input_1 (Input Layer) | - | - | - | 3 | 0 | 32 / 64 / 128 / 256 / 512 | 3 |
| Encoder | conv2d_1 (Conv2D 3×3) | 1 | same | ReLU | 64 | 1792 | 32 / 64 / 128 / 256 / 512 | 64 |
| Encoder | max_pooling2d_1 (Max Pooling 2×2) | 2 | same | - | - | 0 | 16 / 32 / 64 / 128 / 256 | 64 |
| Encoder | conv2d_2 (Conv2D 3×3) | 1 | same | ReLU | 128 | 73856 | 16 / 32 / 64 / 128 / 256 | 128 |
| Encoder | max_pooling2d_2 (Max Pooling 2×2) | 2 | same | - | - | 0 | 8 / 16 / 32 / 64 / 128 | 128 |
| Encoder | conv2d_3 (Conv2D 3×3) | 1 | same | ReLU | 256 | 295168 | 8 / 16 / 32 / 64 / 128 | 256 |
| Encoder | max_pooling2d_3 (Max Pooling 2×2) | 2 | same | - | - | 0 | 4 / 8 / 16 / 32 / 64 | 256 |
| Encoder | conv2d_4 (Conv2D 3×3) | 1 | same | ReLU | 512 | 1180160 | 4 / 8 / 16 / 32 / 64 | 512 |
| Encoder | max_pooling2d_4 (Max Pooling 2×2) | 2 | same | - | - | 0 | 2 / 4 / 8 / 16 / 32 | 512 |
| Encoder | conv2d_5 (Conv2D 3×3) | 1 | same | ReLU | 512 | 2359808 | 2 / 4 / 8 / 16 / 32 | 512 |
| Encoder | max_pooling2d_5 (Max Pooling 2×2) | 2 | same | - | - | 0 | 1 / 2 / 4 / 8 / 16 | 512 |
| Decoder | conv2d_6 (Conv2D 3×3) | 1 | same | ReLU | 512 | 2359808 | 1 / 2 / 4 / 8 / 16 | 512 |
| Decoder | up_sampling2d_1 (Up Sampling 2×2) | 1 | - | - | - | 0 | 2 / 4 / 8 / 16 / 32 | 512 |
| Decoder | conv2d_7 (Conv2D 3×3) | 1 | same | ReLU | 512 | 2359808 | 2 / 4 / 8 / 16 / 32 | 512 |
| Decoder | up_sampling2d_2 (Up Sampling 2×2) | 1 | - | - | - | 0 | 4 / 8 / 16 / 32 / 64 | 512 |
| Decoder | conv2d_8 (Conv2D 3×3) | 1 | same | ReLU | 256 | 1179904 | 4 / 8 / 16 / 32 / 64 | 256 |
| Decoder | up_sampling2d_3 (Up Sampling 2×2) | 1 | - | - | - | 0 | 8 / 16 / 32 / 64 / 128 | 256 |
| Decoder | conv2d_9 (Conv2D 3×3) | 1 | same | ReLU | 128 | 295040 | 8 / 16 / 32 / 64 / 128 | 128 |
| Decoder | up_sampling2d_4 (Up Sampling 2×2) | 1 | - | - | - | 0 | 16 / 32 / 64 / 128 / 256 | 128 |
| Decoder | conv2d_10 (Conv2D 3×3) | 1 | same | ReLU | 64 | 73792 | 16 / 32 / 64 / 128 / 256 | 64 |
| Decoder | up_sampling2d_5 (Up Sampling 2×2) | 1 | - | - | - | 0 | 32 / 64 / 128 / 256 / 512 | 64 |
| Output | conv2d_11 (Conv2D 3×3) | 1 | same | Sigmoid | 1 | 577 | 32 / 64 / 128 / 256 / 512 | 1 |

Total parameters: 10,179,713; trainable parameters: 10,179,713; non-trainable parameters: 0. Layer output shapes are given as (rows, columns, channels) for each patch size trial.
Furthermore, each convolutional layer also includes an activation function, which performs the
non-linear transformation of the features generated from the convolution operation (Dettmers 2015). In the
proposed model, the input and output datasets, 24-bit RGB ortho-image and 8-bit grayscale elevation-map
pairs with value range [0,255] are normalized to the range [0,1] by dividing them by 255. Thus, the
activation function should progressively change from 0 to 1 with no discontinuity for generating the output.
The Rectified Linear Unit activation function (ReLU), 𝑓(𝑥) = 𝑚𝑎𝑥(0, 𝑥), is a very popular choice for use in
hidden layers; it is faster than many other activation functions, such as Sigmoid. The ReLU function does not always output a non-zero value, so fewer neurons are activated and there is less dependence between features
(Nair and Hinton 2010). In addition, the Sigmoid activation function (also known as Logistic), 𝑓(𝑥) = 1/(1 +
𝑒𝑥𝑝(−𝑥) ) is used in the output layer to generate the continuous values for the elevation-map, instead of
using SoftMax function to classify the objects in SegNet (Badrinarayanan et al. 2017). The detailed model
layers and each layer's output shape for each patch size trial are shown in Table 16, where the layer types and kernel sizes are also listed.
This research project uses the “Sequential model API” in Keras to set up the convolutional encoder-decoder network model. When compiling the model, it uses “rmsprop” as the optimizer and “mean_squared_error” as the loss function (Chollet 2015); “validation_split” is set to 0.05, which means that 95% of the dataset is used for training the model and 5% is used for validation. In this research project, the efficiency of “early stopping” compared to non-stopping has been evaluated. The “early stopping” technique stops model training when the monitored quantity has stopped improving (Chollet 2015), such as when the training loss or validation loss has not decreased for 10 epochs. This research project compared the “early stopping” trials with trials trained for the full 100 epochs, as discussed below.
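A condensed Keras sketch of the encoder-decoder in Table 16 together with the compile and fit settings described above; the early-stopping callback wiring is an assumption consistent with the 10-epoch patience mentioned in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_encoder_decoder(patch=128):
    """Convolutional encoder-decoder of Table 16 (five max pooling / five up sampling layers)."""
    model = keras.Sequential([layers.InputLayer(input_shape=(patch, patch, 3))])
    # Encoder: Conv2D (3x3, ReLU, 'same') followed by 2x2 max pooling, five times.
    for filters in (64, 128, 256, 512, 512):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2, padding="same"))
    # Decoder: Conv2D followed by 2x2 up sampling, five times (convolution first, unlike SegNet).
    for filters in (512, 512, 256, 128, 64):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.UpSampling2D(2))
    # Output: single-channel elevation-map patch in [0, 1] via the Sigmoid activation.
    model.add(layers.Conv2D(1, 3, padding="same", activation="sigmoid"))
    model.compile(optimizer="rmsprop", loss="mean_squared_error")
    return model

model = build_encoder_decoder(128)
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
# x: (N, 128, 128, 3) ortho-image patches / 255.0; y: (N, 128, 128, 1) elevation-map patches / 255.0
# model.fit(x, y, batch_size=128, epochs=100, validation_split=0.05, callbacks=[early_stop])
```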
What's more, this research project uses “same” padding for the max pooling layers. As the model input sizes are 32, 64, 128, 256, and 512, which are divisible by 32 (2⁵), the padding setting in max pooling should have no impact on the result, because in each max pooling layer the layer input size is halved and the layer output size is still an integer divisible by 2. However, the model results varied with this setting: using “same” padding generated better results than “valid” padding.
The input layer and output layer of the proposed model (see Figure 51) indicate that the trained
model predicts an elevation-map patch from an input ortho-image patch. A model prediction example is
shown in Figure 54, where the edged area of each prediction patch differs from the center area. This is because zero-padding is used in the convolution operations. A normal convolution operation shrinks the input image down to the filled center region in Figure 52; in this research project, the added padding operation enlarges the image with “0” values before the convolution operation (see Figure 52). The zero-padded convolution then ensures that the output maintains the same size as the input. However, the added “0” values produce unwanted features at the edges of the prediction patches, so the side-by-side assembly of such patches produces visible “gridlines” in the assembled elevation-map.
Figure 55 shows the workflow of the ortho-image disassembling and elevation-map assembling
algorithm, which generates the elevation-map without unexpected “gridlines”. This algorithm needs to
disassemble the ortho-image into several overlapping patches. The required number of patches is
determined by Eq. 12. When assembling the elevation-map, only selected parts of each patch will be used.
Compared with the side-by-side approach, the proposed overlapping algorithm replaces the patch edges
with other predictions’ center regions. Additionally, the assembly of the elevation-map has the same GSD
as the ortho-image. Then, the 3D geometry data can be reconstructed by Eq. 13.
(Figure 55: the ortho-image is disassembled into overlapping patches and the sequence number of each patch is recorded; the elevation-map of each patch is predicted by the deep learning model; the elevation-map is then assembled from the patch predictions by sequence number.)
\[
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix} x \cdot GSD \\ -y \cdot GSD \\ Elevation \end{bmatrix}
=
\begin{bmatrix}
\left(u - \frac{Image\,Width}{2}\right) \cdot GSD \\
-\left(v - \frac{Image\,Height}{2}\right) \cdot GSD \\
Elevation\_map[u,v] \times \frac{Elevation\,Range}{255} + Elevation\,Lower\,Boundary
\end{bmatrix}
\quad \text{(Eq. 13)}
\]
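A minimal sketch of the overlapping assembly (keeping only each prediction's center region) and the Eq. 13 conversion to 3D points; the center-crop margin and the row-major patch ordering are illustrative assumptions.

```python
import numpy as np

def assemble_overlapping(patches, image_size, patch, stride):
    """Assemble predicted elevation-map patches (N, patch, patch, 1) generated with the
    given stride; only each prediction's center region overwrites the canvas, which
    replaces the zero-padding artifacts at the patch edges. Assumes the patches are
    ordered row-major, as produced by the disassembling step (no rotations); the
    outer margin of the canvas remains unfilled in this simplified sketch."""
    canvas = np.zeros((image_size, image_size), dtype=np.float32)
    margin = (patch - stride) // 2               # keep roughly the central stride x stride area
    i = 0
    for r in range(0, image_size - patch + 1, stride):
        for c in range(0, image_size - patch + 1, stride):
            center = patches[i, margin:patch - margin, margin:patch - margin, 0]
            canvas[r + margin:r + patch - margin, c + margin:c + patch - margin] = center
            i += 1
    return canvas

def to_point_cloud(elevation_map, gsd_m, elevation_range=10.0, lower_boundary=-5.0):
    """Eq. 13: convert elevation-map pixels (u, v) to (X, Y, Z) in meters."""
    h, w = elevation_map.shape
    v, u = np.mgrid[0:h, 0:w]
    X = (u - w / 2.0) * gsd_m
    Y = -(v - h / 2.0) * gsd_m
    Z = elevation_map * elevation_range / 255.0 + lower_boundary
    return np.stack([X, Y, Z], axis=-1)
```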
In this research project, the experimental datasets, the ortho-image and elevation-map pairs, are selected from Chapter 5. In addition, the edges of the ortho-images and elevation-maps are removed so that their width (1,536 pixels) and height (1,536 pixels) are exactly divisible by 32, 64, 128, 256, and 512, because the various patch size configurations will be compared. Figure 56 shows the spatial layout of the experiment site.
(Figure 56: spatial layout of the experiment site (top view), showing the vegetation blocks, shrub blocks, trees, and the regions covered by data A, C, B, D, AC, AO, CA, CG, and CI.)
Figure 57 lists the model training and validation datasets. The ortho-images were taken in Atwater
Park (Shorewood, WI, USA) during different seasons. Data A and B were taken on 3/24/2019, when the
vegetation had not recovered yet. Data C and D were taken on 6/5/2019, when the vegetation was growing.
Additional data AC, CA, CG, and CI were taken in September 2019, when the vegetation was fully grown. Data AC shows the same wooden platform as B and D; data CA is another wooden platform on this site; data CG and CI detail the umbrella and the stairways. In addition, Figure 58 includes an additional ortho-image and elevation-map pair, which is used for quantitatively evaluating the trained model. Furthermore, the elevation-maps were aligned by picking a point on the wooden platform / path and setting it as the elevation origin (±0.00 m).
(Figure 57: model training and validation ortho-image and elevation-map pairs, 1536×1536 pixels: A and B (3/24/2019), C and D (6/5/2019), AC (9/5/2019), and CA, CG, and CI (9/19/2019).)
(Figure 58: model testing ortho-image and elevation-map pair: AO (9/5/2019).)
The model training parameters including batch sizes, epochs and dataset numbers are listed in
Table 17. The “100 epochs” and “early stopping” were shared for the five different patch size trials. Eight
ortho-image and elevation-map pairs (see Figure 57) and their 4-rotations were used to train the model.
Thus, the total number of datasets is eight times the number listed in the last column of Table 15. The
dataset numbers varied for the five different patch size trials, because the system memory limitation
resulted in different strides being used for creating datasets. What’s more, in this research project, when
training the model, the “batch size=32” was used in 512×512-pixel patch trial and “batch size=128” was
used in the other trials. This is because of the single GPU’s memory limitation (11GB or 10.24GiB); an
additional 3.38 GiB memory and 2.29 GiB memory are needed for each GPU with batch size 128 and 64
respectively. Fortunately, the small batch size in the 512×512-pixel patch trial only results in a longer model training time.
The loss results of model training and model validation (also known as model testing) for each trial are shown in Figure 59. The five different patch size trials stopped at different epochs (see Table 17). The 128×128-pixel patch trial stopped at the 18th epoch, the earliest of the trials, while the 256×256-pixel patch trial stopped at the 35th epoch, taking the most epochs for the validation loss to stop improving.
Furthermore, the validation results of each patch size are shown in Figure 60, where the “ground
truths” are the elevation-map patches used in training the model, the predictions are the model outputs
generated from the trained model with the corresponding inputs. The “ground truths” and model predictions
are shown in the same viridis colormap range; the more similar the colors, the more accurate the predictions. Visually, the model predictions are not a constant color (grayscale value) for each 32×32-pixel patch, unlike the “ground truth” elevation-map patches: the developed model decodes an elevation value for each pixel of the input patch instead of a single elevation value for the whole patch. The model output results show that the trained model can distinguish different objects, such as the wooden paths that are distinguished from the ground in the 256×256 and 512×512 trials. The trained model also shows the ability to correct elevation value errors that occur on the wooden path in the 256×256 and 512×512 trials. In detail, the wooden paths of the 256×256 and 512×512 patches in the “ground truth” have incorrect elevation values, while the predictions for the wooden paths are correct.
(Figure 60 layout: rows are the ortho-image patches, the elevation-map patches (ground truth), and the model output patches (predictions); columns are the patch sizes 32×32, 64×64, 128×128, 256×256, and 512×512.)
Figure 60 Data A: ground truth patches and model prediction patches (w/ early stopping)
For the five “early stopping” different patch size trials, the minimum model training loss occurred
on the 128×128-pixel patch trial at its 18th epoch (see Figure 59). The 128×128-pixel patch also has a small model validation loss, while the minimum model validation loss occurred on the 64×64-pixel patch trial at its 27th epoch. The Data A predictions in Figure 60 indicate that the 128×128-pixel patch trial has better
performance than other patch sizes in the “early stopping” trials, and the overlapping assembled predictions
in Figure 61 confirms that the 128×128-pixel patch has the best performance in the “early stopping” trial
for Data CI as well. That may be because the 128×128-pixel patch balances the local features of each
32×32 patch and contains global features to connect each single 32×32 patch as well. The detailed
comparisons of the different patch sizes will be stated in the discussion section.
Figure 61 Data CI: overlapping assembly of model predictions (w/ early stopping)
Another model training was conducted without “early stopping”; the epoch 18 to epoch 100 model training losses and validation losses of the five different patch size trials are shown in Figure 62. The 128×128-
pixel patch has the minimum model training loss of 8.74E-06 at 100 epochs, which is smaller than 1.82E-
04 at the “early stopping” trial. The 64×64-pixel patch and 256×256-pixel patch trials have a more stable
decreasing trend and smaller values for training loss compared to the extreme size patches 32×32 and
512×512. Therefore, using the 128×128-pixel patch for the developed convolutional encoder-decoder
network model has the best model training and validation performance, followed by the 64×64-pixel patch trial.
The testing data AO in Figure 58 is different from the training data AC in Figure 57; they were captured on the same day but with different flight paths and sequences. The drone landed after capturing the AC low-high ortho-image pair and took off again to capture the AO pair; for the AC pair, the 10 m ortho-image was captured first, followed by the 20 m ortho-image, but for the AO pair, the 20 m ortho-image was captured first.
Figure 63 contains the model predictions for data AO. Visually, in the “early stopping” trials the 128×128-pixel patch has the best result, followed by the 64×64-pixel patch, while in the 100-epochs trials the 128×128 and 256×256 patches are better than the others. The 100-epochs results are more
detailed than the “early stopping” ones. These 2D predictions can be easily converted to 3D point clouds by
Eq. 13 with the selected 2,304 (48×48) points (strides=32 pixels in column and row directions).
(Figure 63 layout: columns are the ortho-image and the prediction patches of 32×32, 64×64, 128×128, 256×256, and 512×512 pixels, for the early stopping and 100-epochs trials.)
Figure 63 Data AO: predictions with different patch size and different epochs
Figure 64 overlaps the 128×128 and 256×256 prediction point clouds (one pixel is one point) with
the “ground truth”, which is converted from the elevation-map and plotted with RGB cubes. The model
predictions have a similar shape to the “ground truth” and are more accurate than the “ground truth” at the edges of the wooden platform and the garbage cans.
Figure 64 Data AO: point cloud comparison between predictions and ground truth
As the model training and testing results show, the 128×128-pixel patch and the 64×64-pixel patch
are better than the other patch sizes in the “early stopping” trials. Figure 65 shows the overlapping
assembly of model predictions with the “ground truth” elevation-map of the eight model training datasets
between these two patch sizes. In addition, several interesting points were selected to show their X/Y-profiles.
Each dataset in Figure 65 contains the ground surface, large objects or structures, and small objects. For the
ground surface 3D-reconstruction, the 128×128-pixel patch has the best performance, as seen with the
selected points in data A, the Y-profiles of data C, D and AC. What’s more, the tiny and sparse grass on the
ground has no impact on the 3D-reconstruction of the ground surface shape, as seen in the X-profiles of
data B and D (see Figure 65). The trained model with 128×128-pixel patch correctly identifies that these
regions are ground surface and not vegetation. For large object 3D-reconstruction, the 128×128-pixel patch
also has the best performance, seen in the Y-profile of the umbrella in data CG, the Y-profile of the
stairways in data CI, and the wooden platforms and wooden paths in all of the training datasets. For small
objects, both the 128×128-pixel and 64×64-pixel patches have good performance in the 3D-reconstruction
of small objects’ shapes, such as the X/Y-profiles of the garbage can in data B and D.
In general, the 128×128-pixel patch has a better performance with the “early stopping” setting at
the 18th epoch than the 64×64-pixel patch trial with 27 epochs. However, training the developed model with
the smallest 32×32-pixel patch has given a potential function to correct the elevation errors in the “ground
truth”, such as the wooden path edge in the center region of data A, the wooden platform corner in data B,
and the gap between the platform and the garbage can in data B (see Figure 66). However, the large patch
size 256×256 and 512×512 trials retained these errors. Therefore, the median size 128×128-pixel patch is
the best option for balancing the local features and global features, each elevation value in the 32×32-pixel
(Residual panel titles: Data AC, CA, CG, and CI with 64×64 and 128×128 patches; Data A and B with 32×32, 256×256, and 512×512 patches; see Figure 65 and Figure 66.)
Aside from the ground surfaces, the vegetation surfaces and wooden surfaces are the two major
textures in the experiment site (see Figure 56). The vegetation surfaces were captured during different
seasons; the vegetation blocks show different colors in data A, B, C, D, AC and CG (see Figure 65). In
Figure 65, the selected point in data AC is on the ground surface. The neighboring vegetation blocks were
3D-reconstructed well in the X-profile of data AC (128×128-pixel patch), in which the real vegetation
blocks' surface heights ranged from 0.6 m to 0.9 m on September 5th, 2019. The X-profiles of data A and B also crossed the withered vegetation blocks, where the 128×128-pixel patch results match the “ground truth”. In addition, data CG and CI contain denser foliage in the shrub blocks, which are
different from the vegetation blocks. The Y-profile of data CG and X-profile of data CI are matched with
the “ground truth”. The wooden surfaces and ground surfaces were captured in different brightness
environments and their colors varied in Figure 65. When creating the experiment datasets, all wooden
surfaces (except the stairways) were set as elevation = ±0.00 m. They were all 3D-reconstructed well in the
model predictions. Furthermore, the Y-profile of data CI shows the 3D-reconstructed stairways are matched
with the “ground truth”, and the selected point in data CA has the correct elevation differential to the
wooden platform as well. Thus, the developed model that trained with 3-channel RGB ortho-images is
robust in complex textured regions for the “early stopping” 128×128-pixel patch trial.
In addition, there are three kinds of poorly textured regions in the experiment dataset, including
shaded spots, shaded strips and shaded blocks. For small spots of shade, such as the garbage can’s shade in
data D (see Figure 65), the 128×128-pixel patch generated the correct predictions. For large shade blocks,
such as the tree’s shade and umbrella’s shade on the wooden platform in data CA and CG respectively (see
Figure 65), the 128×128-pixel patch has the correct predictions. The “early stopping” 128×128-pixel patch
trial has inconsistent performance for the shaded strips. The selected point in data AC is on the shade of the
vegetation block. The ground surface was identified as vegetation in the 64×64-pixel patch trial, but was
correctly identified using the 128×128-pixel patch. However, the 64×64-pixel patch trials are more aligned
with the “ground truth” than the 128×128-pixel patch trials, such as the Y-profiles of the shaded ground
surface close to the wooden platform in data CA and the shaded area next to the bottom stairs in data
CI (see Figure 65). Fortunately, adding model training epochs can improve the prediction accuracy (see
Figure 67), which will be discussed in the next section. Therefore, using the 128×128×3 RGB ortho-image
input patch and 128×128×1 grayscale elevation-map pair datasets to train the developed convolutional
encoder-decoder network model gives good performance in both complex-textured and poorly textured regions.
(Figure 67 panels: Data CA and Data CI, 128×128 patch, early stopping vs. 100 epochs.)
The validations of data CA and CI (see Figure 67) indicated that it is worth continuing to train the
model after the “early stopping” point to improve the performance of the 128×128-pixel patch. This is due
to the model not being trained well enough at the 18th epoch, though it still has the potential to narrow down the
variations of the validation loss (see Figure 68). In addition, the comparison of testing results (see Figure
63) shows that the 128×128-pixel and 256×256-pixel are better and smoother than other patch sizes in the
100 epochs trials, and the comparison of the two 100 epochs validation loss curves in Figure 68 confirmed
that the 128×128-pixel patch is more stable and can reach a stable trend earlier than the 256×256-pixel
patch. Thus, the well-trained 128×128-pixel patch has both the best model training and the best prediction performance.
Furthermore, the quantitative evaluation of the model validation accuracy and testing accuracy
were conducted by measuring the point cloud (see Figure 64). In detail, for each validation result of
128×128-pixel patch “early stopping” trial and 128×128-pixel patch 100 epochs trial, 2,304 (48×48) points
(the centers of each 32×32-pixel patch) are selected from the corresponding ortho-images, elevation-maps
and overlapping assembled model predictions (1536×1536-pixel). Then the 3D point clouds were generated
by Eq. 13 with the selected 2,304 points. For each model training and validation data from A to CI, the
variable “ELE-DIFF-EARLY” was created as the elevation differential between “ground truth” and
128×128-pixel patch “early stopping”, and variable “ELE-DIFF-100” was created as the elevation
differential between “ground truth” and 128×128-pixel patch 100 epochs. Both variables have 18,432
(2,304×8) samples. For the testing data AO, the same variables were created and named “AO-DIFF-EARLY” and “AO-DIFF-100”.
The descriptive statistics for the four variables are listed in Table 18. For the model training and
validation results, the 99% Confidence Interval (CI) of elevation differential is reduced from (1.6, 2.1) cm
to (0.6, 0.8) cm by adding the model training epochs. In addition, the trained model has a good result in
predicting the elevations for the testing data AO, the elevation differential has a 99% CI (2.14, 4.08) cm.
Therefore, additional model training epochs have a positive effect on the model accuracy; after 100 epochs, the model is referred to as the “well-trained” model.
Table 18 Descriptive statistics of the elevation differentials (unit: m)

| Sample | N | Mean | StDev | SE Mean | 99% CI for μ | Minimum | Q1 | Median | Q3 | Maximum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELE-DIFF-EARLY | 18432 | 0.018495 | 0.133267 | 0.000982 | (0.015966, 0.021024) | -2.31373 | -0.03922 | 0 | 0.07843 | 1.84314 |
| ELE-DIFF-100 | 18432 | 0.0073 | 0.058502 | 0.000431 | (0.006190, 0.008410) | -1.05882 | 0 | 0 | 0.03922 | 0.94118 |
| AO-DIFF-EARLY | 2304 | 0.0424 | 0.17635 | 0.00367 | (0.03293, 0.05187) | -1.17647 | -0.03922 | 0.03922 | 0.11765 | 1.01961 |
| AO-DIFF-100 | 2304 | 0.03111 | 0.18096 | 0.00377 | (0.02140, 0.04083) | -1.05882 | -0.03922 | 0 | 0.07843 | 1.21569 |
Figure 69 shows the distributions of the elevation differential between the “ground truth”, model
validation results and model testing results. The histogram of “ELE-DIFF-100-CM” shows that 94% of
points from the model training datasets have an elevation error less than 10 cm in the “well-trained” model
(100 epochs). The two histograms “AO-DIFF-100-CM” and “AO-DIFF-EARLY-CM” of the testing data
AO show that the “well-trained” model has a significant improvement over the “early stopping” model.
The “well-trained” model's prediction accuracy is 52.43%, compared to 47.05% for the “early stopping” model, where an accurate elevation measurement is defined as a measurement error equal to or less than 5.0 cm (Takahashi et al. 2017). The worst predictions (error > 25 cm or error < -25 cm) account for 9.64%
and 12.37% in the “well-trained” model and “early stopping” model respectively.
In addition, the prediction contour-maps are shown in Figure 70, where the elevation differentials were mapped as well. Most of the worst predictions of the “well-trained” model are on the edges of the wooden platform and the garbage cans. This is because the “ground truths” at these locations are incorrect, and the model predictions have corrected them. Excluding these errors, the model prediction accuracy would be even higher. Thus, the “well-trained” model has at least a 52.43% accuracy in estimating the construction site elevations.
Figure 71 shows two 20m ortho-images which have the same GSD=0.54 cm/pixel as the model
training dataset. The blue garbage can (17.65 cm lower than the wooden platform) is a new object not used in training the developed model, and the original images were cut to 3584×4864 pixels without resizing. The Y-profile at the blue garbage can is -13.7 cm, which is close to the true value with an error of 3.95
cm <5.0 cm. The Y-profile on the top of the umbrella is 3.196 m, which is accurately matched with its true
value of 3.20 m. Thus, the developed model trained with ortho-images captured at a height of 10 m can be used for the 3D-reconstruction of ortho-images captured at a height of 20 m; the trained model is able to generate accurate elevations for the 20 m ortho-images as well. However, its performance worsens for ortho-images captured at higher altitudes.
(Figure 71 panels: Data CJ and Data AN, 20 m ortho-images, 128×128-pixel patch, 100 epochs; prediction elevation data and mesh models.)
Furthermore, the top view of the experiment site in Figure 56 was captured at a flight height of 100 m. The elevation prediction results of the “well-trained” models with 128×128-pixel and 32×32-pixel patches are shown in Figure 72. The 32×32-pixel patch results show that the ground surfaces, wooden surfaces, and shrub blocks are well reconstructed, but the vegetation blocks are assigned incorrect elevations. The bad prediction of the 128×128-pixel patch occurs around (500, 2000), where the shaded wooden path was not included in the model training datasets. Thus, to make the model suitable for complex construction site situations, a comprehensive dataset (ortho-image and elevation-map pairs) is required. This dataset should include the different textures of construction sites, because the top-layer materials of construction sites include, but are not limited to, vegetation, water, snow, sand, rock, soil, concrete, asphalt, buildings, and structures. For training a precise deep learning model, the number of datasets should be large enough to cover these different site conditions.
(Figure 72 panels: elevation data from the 128×128-pixel and 32×32-pixel patch models at 100 epochs, and the point cloud from the 32×32-pixel, 100-epochs model.)
7.1 Introduction
The performance of the image-based elevation determination methods in Chapter 4 and Chapter 6 is affected by the plants and other ground covers on the construction site when determining the ground elevations. This is because the light rays are reflected from the surface of the vegetation instead of from the “real”
ground surface. Therefore, to improve the effectiveness of the image-based surveying methods,
automatically detecting and removing the vegetation and other obstacles from their raw surveying results
and determining the “real” ground elevations, are necessary and important for construction professionals
who heavily depend on elevation data in earthwork operations and facility layout.
In this chapter, a convolutional neural network (CNN) model is designed to classify small-sized image patches from a drone-based high-resolution construction site ortho-image into vegetation or other object categories (see Figure 73). Then, a vegetation removing algorithm is used to determine
the ground elevations covered by vegetation from the elevation-map. Experiments are conducted to
evaluate the effectiveness of the proposed method with high-resolution ortho-image and label-image pair
datasets. The label-image is marked at each pixel with an 8-bit grayscale value [0,255] to represent up to
256 objects’ categories. To explain how to determine the “real” ground elevation covered by vegetation
from a drone-based high-resolution ortho-image and elevation-map pair, the rest of this chapter presents the
research results of dataset acquisition and creation, model architecture designs, model training and testing, and field experiments with result discussions.
(Figure 73: workflow from the construction site ortho-image, through the CNN-based image classification with the disassembling and assembling algorithm and the vegetation removing algorithm, to the 3D construction site model without vegetation.)
Figure 74 shows the graphical user interface of the “Label-App” which is designed for labeling an
ortho-image with 8-bit values [0, 255] and programmed using Python 3.6.8 and matplotlib 3.1.1 library.
The computer mouse is used to select vertexes on the ortho-image for identifying each object. The
keyboard is used to create a new class-label or select a predefined class-label such as “240-shade” in the
left side of the label-image. The label-image is shown in the “terrain” colormap for better visualization. Same as the ortho-image, the label-image also has a high resolution of 1,568×1,568 pixels and is saved in two file formats: a 1,568×1,568-pixel grayscale image file for visualization and a 1,568-row by 1,568-column spreadsheet file for training the deep learning model. Saving a spreadsheet file is necessary because interpolated grayscale values appear at the edges of different objects in the image file.
Data AO, shown below, will be used as the testing dataset in this research project. The point cloud is generated using the selected central points of each 32×32-pixel patch of the ortho-image (textures) and elevation-map (elevation values). The high-resolution images used in this research project are 1,536×1,536 pixels, which are generated by removing 16 pixels from each margin of the 1,568×1,568-pixel images. This process allows each high-resolution image to be cropped into exactly divisible integer numbers of 8×8-pixel, 16×16-pixel, 32×32-pixel, or 64×64-pixel patches.
[Figure 75: Data AO (9/5/2019) high-resolution ortho-image, label-image and point cloud views (front, right)]
As a result, the high-resolution images cannot be directly used in training a deep learning model. The author
proposed to disassemble the high-resolution ortho-image and label-image pair into multiple overlapping
patch pairs with sizes of 8×8-pixel, 16×16-pixel, 32×32-pixel or 64×64-pixel. Figure 76 shows examples of
these four different small-sized patch pairs of ortho-images and label-images.
[Figure 76: Example ortho-image and label-image patch pairs (8×8, 16×16, 32×32 and 64×64-pixel) with class-labels/values Vegetation/130, Vegetation/130, Vegetation/130 and Sand/80]
In addition, when cropping these small-patches, the strides are set as 4, 8, 16 and 32-pixel,
respectively. Moreover, in order to make the proposed deep learning model more robust to different image
orientations, the high-resolution ortho-images and label-images are rotated by 90, 180 and 270
degrees, which increases the datasets by four times. Table 19 lists the number of small-patch datasets
generated from a high-resolution ortho-image and label-image pair.
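As an illustration of the dataset counts referenced above, the following sketch shows one way the overlapping small-patches and the four rotations could be generated; the function name and the zero-filled placeholder image are assumptions, and the 32×32-pixel, 16-pixel-stride case reproduces the 36,100-patch count:

```python
import numpy as np

def make_patch_dataset(image, patch, stride):
    """Crop an H x W (x C) image into overlapping patch x patch tiles with the given stride."""
    tiles = []
    for r in range(0, image.shape[0] - patch + 1, stride):
        for c in range(0, image.shape[1] - patch + 1, stride):
            tiles.append(image[r:r + patch, c:c + patch])
    return np.stack(tiles)

# Illustrative 1,536 x 1,536 ortho-image with RGB 3-channel.
ortho = np.zeros((1536, 1536, 3), dtype=np.uint8)

patches = []
for k in range(4):                       # 0, 90, 180 and 270-degree rotations
    rotated = np.rot90(ortho, k)
    patches.append(make_patch_dataset(rotated, patch=32, stride=16))
patches = np.concatenate(patches)        # shape (36100, 32, 32, 3) for the 32x32-pixel case
```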
An ortho-image acquired by the drone has RGB 3-channel. Considering that the color textures are
important in distinguishing different objects on a construction site, the proposed deep learning model uses
the color ortho-image patches as model input data. Therefore, using a high-resolution ortho-image can
produce the model training datasets with shape (586756, 8, 8, 3), shape (145924, 16, 16, 3), shape (36100, 32, 32, 3), or
shape (8836, 64, 64, 3), where the first number is the quantity of the small-patches.
A label-image generated from the “Label-App” only has one channel. Disassembling a high-resolution
label-image can produce the small-patch datasets with shape (586756, 8, 8, 1), shape (145924, 16, 16, 1),
shape (36100, 32, 32, 1), or shape (8836, 64, 64, 1). Then, the maximum-frequency class-label/value in each small-patch
is determined and set as the class-label/value for that label-image patch. For example, in Figure 76,
the “green” region is bigger than the “yellow” region of the 64×64-pixel label-image patch; thus, the class-label
“sand”/value “80” is assigned to that small-patch. By doing that, the small-patch datasets are
transformed into a class vector (integers), such as [130, 95, …, 130] with shape (586756, 1), shape (145924, 1),
shape (36100, 1), or shape (8836, 1). Additionally, the class vector needs to be converted to a binary class matrix
with shape (586756, 256, 1), shape (145924, 256, 1), shape (36100, 256, 1), or shape (8836, 256, 1) as the model training
datasets (Chollet 2015). For example, an integer “130” is converted to a binary class vector
[0.0, 0.0, …, 1.0 (index 130), …, 0.0 (index 255)] with shape (256, 1), and the class vector is converted to a binary class
matrix with the corresponding shape listed above.
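A minimal sketch of this conversion, assuming Keras's to_categorical utility (Chollet 2015) and illustrative array names; the trailing singleton axis used in the document could be added with a reshape:

```python
import numpy as np
from keras.utils import to_categorical

# label_patches: shape (N, 32, 32, 1) integer class values from the disassembled label-image
label_patches = np.full((36100, 32, 32, 1), 130, dtype=np.uint8)   # illustrative data

# The majority (most frequent) class value in each patch becomes that patch's class-label.
flat = label_patches.reshape(label_patches.shape[0], -1)
class_vector = np.array([np.bincount(p.astype(np.int64), minlength=256).argmax() for p in flat])

# Convert the integer class vector to a binary (one-hot) class matrix with 256 classes,
# e.g., 130 -> [0, 0, ..., 1 at index 130, ..., 0].
class_matrix = to_categorical(class_vector, num_classes=256)        # shape (N, 256)
# class_matrix = class_matrix.reshape(-1, 256, 1)                   # optional trailing axis
```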
The CNN-based image classification model architecture is presented in Figure 77, which includes
a feature learning block and a classification block. In the feature learning block, three convolution layers
learn the ortho-image patches (model input) as feature-maps (layer outputs). Three max pooling layers
reduce the feature-maps (layer inputs) to half their size as their layer outputs without losing important
features. For example, the 8×8-pixel, 16×16-pixel, 32×32-pixel and 64×64-pixel patches are reduced to
1×1-pixel, 2×2-pixel, 4×4-pixel and 8×8-pixel feature-maps, respectively, after the 3rd max pooling layer.
[Figure 77: CNN-based image classification model and the overlapping disassembling and assembling workflow.
Workflow: disassembling the ortho-image into overlapping patches and recording their sequence IDs; determining each patch’s class-label using the CNN-based image classification model; assigning the class-label to each patch as the label-image prediction and assembling the patches by their sequence IDs.
Model: Input: ortho-image patch; Output_0: binary class vector (e.g., veg 95%, sand 2%, …); Output_1: class-label from the Argmax function; Output_2: label-image patch; Output_3: assembled high-resolution label-image.
Feature learning block: Convolution + ReLU and Max Pooling (×3); Classification block: Flatten + Dropout, Dense + ReLU + Dropout (×2), and Dense + SoftMax.]
In the classification block, the flatten layer transforms the feature-map (layer input) into a feature-vector
(layer output), which can be used in the classification block. Three fully connected layers (also
known as dense layers) transform the feature-vectors (layer inputs) into a binary class vector as the model
prediction output.
The detailed model layers for the four different patch sizes are shown in Table 20, where the types
of layers are described in the Keras 2.3 style (Chollet 2015). After each convolutional layer and dense layer,
there is an activation function (layer) which performs a non-linear transformation of the input features
from the previous convolutional or dense layer (Dettmers 2015). As the model input datasets are
normalized from the value range [0, 255] to [0.0, 1.0] by dividing them by 255, the activation function should
change progressively from 0.0 to 1.0 with no discontinuity. Therefore, the Rectified Linear Unit activation
function (ReLU), f(x) = max(0, x), is used in the hidden layers. Because the ReLU function does not always
output a non-zero value, which results in fewer neurons being utilized and less dependence between features
(Nair and Hinton 2010), it is faster than the Sigmoid activation function. In addition, the SoftMax
activation function is used in the 3rd dense layer to calculate the probabilities of the 256 class-labels in the
binary class matrix/vector. Finally, the dropout layers randomly set half of the input units to “0” at each
update during training, which helps prevent model overfitting (Chollet 2015).
Table 20. Model architecture for 8×8, 16×16, 32×32 and 64×64-pixel patches (output shapes are row × column for each patch size, plus channels)
Block | Layer (type and filter size) | Stride | Padding | Activation | 8×8 | 16×16 | 32×32 | 64×64 | Channels
Input | input_1 (Input Layer) | - | - | - | 8×8 | 16×16 | 32×32 | 64×64 | 3
Feature learning block | conv2d_1 (64, Conv2D 3×3) | 1 | same | ReLU | 8×8 | 16×16 | 32×32 | 64×64 | 64
Feature learning block | max_pooling2d_1 (Max Pooling 2×2) | 2 | - | - | 4×4 | 8×8 | 16×16 | 32×32 | 64
Feature learning block | conv2d_2 (128, Conv2D 3×3) | 1 | same | ReLU | 4×4 | 8×8 | 16×16 | 32×32 | 128
Feature learning block | max_pooling2d_2 (Max Pooling 2×2) | 2 | - | - | 2×2 | 4×4 | 8×8 | 16×16 | 128
Feature learning block | conv2d_3 (256, Conv2D 3×3) | 1 | same | ReLU | 2×2 | 4×4 | 8×8 | 16×16 | 256
Feature learning block | max_pooling2d_3 (Max Pooling 2×2) | 2 | - | - | 1×1 | 2×2 | 4×4 | 8×8 | 256
Feature learning block | dropout_1 (Dropout 0.5) | - | - | - | 1×1 | 2×2 | 4×4 | 8×8 | 256
Classification block | flatten_1 (Flatten) | - | - | - | 256 | 1,024 | 4,096 | 16,384 | -
Classification block | dense_1 (Dense) | - | - | ReLU | 256 | 1,024 | 2,048 | 4,096 | -
Classification block | dropout_2 (Dropout 0.5) | - | - | - | 256 | 1,024 | 2,048 | 4,096 | -
Classification block | dense_2 (Dense) | - | - | ReLU | 256 | 512 | 1,024 | 1,024 | -
Classification block | dropout_3 (Dropout 0.5) | - | - | - | 256 | 512 | 1,024 | 1,024 | -
Output | dense_3 (Dense) | - | - | SoftMax | 256 | 256 | 256 | 256 | -
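For reference, the 32×32-pixel column of Table 20 could be written in Keras 2.3 style roughly as follows; this is a sketch based only on the table, and any hyperparameter it does not state is an assumption:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    # Feature learning block (output sizes follow the 32x32-pixel column of Table 20)
    Conv2D(64, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),                       # 32x32 -> 16x16
    Conv2D(128, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),                       # 16x16 -> 8x8
    Conv2D(256, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),                       # 8x8 -> 4x4
    Dropout(0.5),
    # Classification block
    Flatten(),                                  # 4 x 4 x 256 = 4,096
    Dense(2048, activation='relu'),
    Dropout(0.5),
    Dense(1024, activation='relu'),
    Dropout(0.5),
    Dense(256, activation='softmax'),           # probabilities of the 256 class-labels
])
```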
For compiling the developed CNN-based model, the author uses “adam” as the optimizer and
“categorical_crossentropy” as the loss function (Chollet 2015). “Validation_split” is set to 0.05, which
means that 95% of the small-patch datasets are used for training the model and 5% of the small-patch datasets are
used for validating it. In addition, the “early stopping” callback is used to stop the model training when the
monitored quantity of validation accuracy has stopped improving during the past 5 epochs (Chollet 2015).
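Continuing the sketch above, the compiling and training setup described in this paragraph might look like the following; the monitored metric name, batch size, epoch limit and the placeholder training arrays are assumptions:

```python
import numpy as np
from keras.callbacks import EarlyStopping

# 'model' is the CNN sketched above; x_train / y_train stand in for the normalized
# ortho-image patches and the binary class matrix (random/zero placeholders here).
x_train = np.random.rand(1000, 32, 32, 3)
y_train = np.zeros((1000, 256))
y_train[:, 130] = 1.0

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Stop training when validation accuracy has not improved during the past 5 epochs.
early_stop = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)

history = model.fit(x_train, y_train,
                    batch_size=128,            # assumed batch size
                    epochs=50,
                    validation_split=0.05,     # 95% training / 5% validation
                    callbacks=[early_stop])
```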
In the developed CNN-based image classification model (see Figure 77), for an ortho-image patch
input, the model prediction output is a binary class vector (“Output_0”), which contains the probability
values of the 256 unique class-labels. With this model prediction, three post-processes need to be
conducted.
First, the “Argmax” function returns the index of the maximum probability value of the binary
class vector; this index is the class-label/value prediction (“Output_1”) for the input ortho-image patch. For
example, “veg” is the class-label prediction for the input ortho-image patch in Figure 77, because it has
the maximum probability value.
Second, the class-label/value prediction is assigned to each pixel of the small-patch as the label-image
patch prediction (“Output_2”) for the corresponding input ortho-image patch.
Third, the small-patch will be used as a part of the high-resolution label-image prediction result
(“Output_3”).
Figure 77 shows the workflow of the high-resolution ortho-image overlapping disassembling and
high-resolution label-image assembling algorithm, which makes the proposed CNN-based image
classification model work with the high-resolution image instead of resizing the original image down to
a low resolution.
This algorithm disassembles the ortho-image into several overlapping small-patches and records
their locations in their sequence ID. The number of small-patches is determined by Eq. 14. When
assembling the high-resolution label-image prediction, the small-patches are considered as corner patches,
edge patches or regular patches, and only the selected region (marked as filled rectangles) of each patch
will be used in the high-resolution label-image prediction. For example, 95-row and 95-column overlapping
small-patches with size 32×32-pixel are produced from a high-resolution 1,536×1,536-pixel ortho-image
for generating the 32×32-pixel label-image patch predictions; the same number of small-patches are
required to assemble a high-resolution 1,536×1,536-pixel label-image prediction, and for each regular
32×32-pixel label-image patch prediction, the used region is only a quarter of the regular patch, which is
the filled 16×16-pixel patch in Figure 77. In addition, with this developed algorithm, each 16×16-pixel
ortho-image patch is linked with a 16×16-pixel label-image patch prediction through a class-label
prediction. Therefore, using this method, the CNN-based image classification model with the overlapping
disassembling and assembling algorithm with 8×8-pixel, 16×16-pixel, 32×32-pixel or 64×64-pixel patches
(useful region 4×4-pixel, 8×8-pixel, 16×16-pixel or 32×32-pixel in each regular patch), is similar to producing a
label-image at a coarser effective resolution of one class-label per useful region, while still processing the full
high-resolution ortho-image.
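A simplified numpy sketch of this overlapping disassembling and assembling idea for the 32×32-pixel, 16-pixel-stride case is shown below; border handling is reduced to extending the corner and edge patches, the stand-in class-labels replace the model predictions, and all names are illustrative:

```python
import numpy as np

def disassemble(ortho, patch=32, stride=16):
    """Crop the high-resolution ortho-image into overlapping patches in row-major order."""
    rows = (ortho.shape[0] - patch) // stride + 1          # 95 for a 1,536-pixel image
    tiles = [ortho[r*stride:r*stride+patch, c*stride:c*stride+patch]
             for r in range(rows) for c in range(rows)]
    return np.stack(tiles), rows

def assemble(class_labels, rows, patch=32, stride=16):
    """Each patch's class-label fills its central stride x stride region (a quarter of a
    regular patch); corner and edge patches are extended to cover the image margins."""
    size = (rows - 1) * stride + patch                     # 1,536
    label_image = np.zeros((size, size), dtype=np.uint8)
    half = (patch - stride) // 2                           # 8-pixel offset to the central region
    for idx, label in enumerate(class_labels):
        r, c = divmod(idx, rows)
        r0, c0 = r * stride + half, c * stride + half
        r_lo = 0 if r == 0 else r0
        c_lo = 0 if c == 0 else c0
        r_hi = size if r == rows - 1 else r0 + stride
        c_hi = size if c == rows - 1 else c0 + stride
        label_image[r_lo:r_hi, c_lo:c_hi] = label
    return label_image

patches, rows = disassemble(np.zeros((1536, 1536, 3), dtype=np.uint8))   # illustrative input
# In practice: class_labels = np.argmax(model.predict(patches / 255.0), axis=1)  ("Argmax" step)
class_labels = np.full(rows * rows, 130, dtype=np.uint8)                 # stand-in predictions
label_image = assemble(class_labels, rows)                               # 1,536 x 1,536 prediction
```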
There are two approaches for removing the vegetation’s height from the raw surveying result
using the identified vegetation blocks in the label-image. The first measures an average height of the vegetation
blocks on the construction site and then directly subtracts this value from the raw elevation values of the
vegetation blocks. The second searches the neighboring ground blocks on the label-image and then interpolates
these surroundings’ elevation values as the “real” ground elevation under the vegetation. In this research
project, the vegetation removing algorithm (see Figure 78) is based on the second approach, because it is
more convenient for automatically determining the ground elevation without any manual participation, and
the result will be closer to the “true” ground elevation than the first option.
The developed vegetation removing algorithm traverses the label-image in row-loops and column-loops
(see Figure 79). In each row-loop, the SEARCH_VEG_REPLACE_GROUND
algorithm uses a size-adjustable window, which can be extended in the row direction, to search for the
minimum number of “ground” class-labels. Similarly, in each column-loop, the adjustable window is
extended in the column direction only. When sufficient “ground” class-labels appear in the search window, the
SEARCH_VEG_REPLACE_GROUND algorithm replaces the current “vegetation” patch’s elevation value with the
average elevation of the searched neighboring “ground” blocks. In addition, the label-image patch is updated
with the “ground” class-label, and the ortho-image patch is marked with a specific color as well.
[Figure 79: Label-image traversal of the SEARCH_VEG_REPLACE_GROUND algorithm, showing the patches and the current patch, the row-loop and column-loop ordering, and the adjustable search windows in the row and column directions]
Figure 80 shows the overall condition of the experimental site, which is located at Atwater Park.
Ten high-resolution ortho-image and label-image pairs are shown in Figure 81.
[Figure 81: Ten high-resolution ortho-image and label-image pairs; capture dates include B: 3/24/2019, C: 6/5/2019, D: 6/5/2019, G: 9/5/2019, O: 9/5/2019, AD: 9/5/2019, AL: 9/5/2019, AM: 9/5/2019 and CG: 9/19/2019]
These ortho-images were captured during different seasons, and they contain ten categories of
different objects and surfaces (see Table 21). For the vegetation blocks, in data A and B, the vegetation had
not recovered yet; in data C and D, the vegetation was growing; and in data G, O, AD, AL, AM and CG,
the vegetation was fully grown, with heights of around 2 feet (60.96 cm). In addition, the ortho-image
and label-image pair in Figure 75 is used to test the well-trained model.
The small-patch datasets for training and testing the CNN-based image classification model are
created following the rules stated in the dataset creation section. Furthermore, each label-image patch is
assigned a class-label. For example, a 32×32-pixel label-image patch has 1,024 elements in total; if
513 of them have the value “220,” then the patch is assigned the “umbrella” class-label.
The model training parameters, including dataset numbers, batch sizes, and epochs, are listed
in Table 22.
The results of training loss, training accuracy, validation loss and validation accuracy with “early
stopping” for the four different patch sizes are shown in Figure 82; the trials stopped at different epochs (see
Table 22). The 64×64-pixel and 8×8-pixel patch trials stopped at the 13th epoch and were the earliest trials,
and the 32×32-pixel patch trial stopped at the 14th epoch. The 16×16-pixel patch trial took the most epochs
before “early stopping” was triggered.
Figure 82 Training and validation results Ⅰ: loss and accuracy w/ early stopping trials
Furthermore, the validation results are shown in Figure 83, where the “ground truths” are the
class-labels used in training the model, and the predictions are the class-label predictions generated from the
trained CNN-based image classification model. The smaller 8×8-pixel and 16×16-pixel patches are more
likely to have incorrect class-label predictions. The larger 32×32-pixel and 64×64-pixel patches are more
likely to form complex label-image patches with multiple class-labels in one label-image patch, and
their class-label predictions are more likely to be correct. This result matches the training accuracy and validation
accuracy results in Figure 82, where the 32×32-pixel and 64×64-pixel patches have better training accuracy and
validation accuracy than the 8×8-pixel and 16×16-pixel patches. However, it is hard to conclude whether the 32×32-pixel
or the 64×64-pixel patch has the best performance based on the “early stopping” loss and accuracy plots.
[Figure 83 content: example ortho-image patches, label-image patches and label-image prediction patches from data AM]
Patch size    | 8×8                | 16×16               | 32×32                 | 64×64
Ground truth  | Veg/130, Veg/130   | Veg/130, Shade/240  | Shade/240, Shade/240  | Shade/240, Veg/130
Prediction    | Veg/130, Sand/80   | Veg/130, Sand/80    | Shade/240, Shade/240  | Shade/240, Veg/130
Matched?      | Yes, No            | Yes, No             | Yes, Yes              | Yes, Yes
Figure 83 Training and validation results Ⅱ: model predictions of data AM w/ early stopping trials
The model training was conducted without “early stopping” as well. The 50-epoch model training
and validation results for the four different patch sizes are shown in Figure 84. The 64×64-pixel patch has
the largest model training accuracy of 0.9908 at the 50th epoch, which is better than the 0.9540 of the “early
stopping” trial. However, the 32×32-pixel patch has the best validation accuracy of 0.9304 at the 50th
epoch, which is better than the 0.9288 of the “early stopping” trial and also better than the validation accuracy
of 0.9219 for the 64×64-pixel patch at the 50th epoch. In addition, the 32×32-pixel patch has the smallest
validation loss as well. Thus, the author concludes that the 32×32-pixel patch gives the developed
CNN-based image classification model the best training and validation performance, followed
by the 64×64-pixel patch and the 16×16-pixel patch. The smallest 8×8-pixel patch has the worst
performance.
The extra training epochs after the “early stopping” point have no impact on the smaller 8×8-pixel
and 16×16-pixel patches based on the training accuracy and training loss in Figure 84, but they have
positive impacts on the 32×32-pixel and 64×64-pixel patches. However, the validation accuracy and loss
have nonsignificant improvement as the training epochs increase in the 32×32-pixel and 64×64-pixel patch
trials.
The cause of this issue can be explained from the assembled high-resolution model validation
results in Figure 85. Compared to the “early stopping” trial, the 50-epoch trial has incorrect model predictions on the
wooden platform in data AM and G, but it has better model predictions for the “withered” class-label in
data A and CG. Thus, the overall model validation accuracy is maintained around 93% for the 32×32-pixel
patch.
[Figure 85: Assembled high-resolution model validation results for data A, AM, G and CG]
The trained “early stopping” and 50-epoch models were tested with the high-resolution data AO in
Figure 75, which was disassembled in the same way as the model training datasets. For example, the 32×32-pixel
patch trial was tested with 9,025 small-patch datasets. The results of model testing loss and testing
accuracy and the assembled high-resolution label-image predictions are shown in Figure 86. The best
testing accuracy of 0.9435 is the 32×32-pixel patch with 50 epochs, the second-best testing accuracy of
0.9433 is the 64×64-pixel patch with 50 epochs, and the third-best testing accuracy of 0.9423 is the 32×32-pixel
patch with “early stopping.” These results are consistent with the model validation results, because the
“wood” class-label was getting worse while the “withered” class-label was getting better as the training
epochs increased.
Visually, in Figure 86 the 32×32-pixel patch with “early stopping” had the best prediction result,
followed by the 32×32-pixel patch with 50 epochs. The 64×64-pixel patch with 50 epochs produced a
reasonable result, too, but the 32×32-pixel patch was more accurate at the objects’ boundaries. The numbers
of each category of class-labels in the manually crafted label-image and in the assembled high-resolution
label-image prediction result (32×32-pixel patch with “early stopping”) are summarized in Table 23, where
93.57% (2,207,641 of 2,359,296 pixels) of class-labels were exactly matched between them. This accuracy is
not significantly different from the small-patch testing accuracy (94.23%). Thus, the developed overlapping small-patch
disassembling and assembling algorithm was effective, as it directly processes high-resolution
images.
Additionally, the class-label prediction errors (6.43% unmatched) were mapped in pixel
coordinates as shown in Figure 87. The majority of unmatched class-labels appeared in the “withered”
region of the manually crafted label-image. In this research project, the “withered” vegetation class-label is
defined as a ground surface category between “sand” and normal “vegetation.” However, it is hard to
distinguish “withered” from “sand” in the ortho-image, and most “withered” blocks are in reality “sand” blocks
in the manually crafted label-images (see Figure 81). In the early stage of this research project, the
author obtained a 0.9646 validation accuracy and a 0.9673 testing accuracy without adding the “withered”
class-label to these label-image datasets. Thus, most class-label errors can be avoided by merging the “withered”
class-label into the “sand” class-label.
Figure 87 Vegetation identifying results: vegetation index and mapped prediction error
Furthermore, the vegetation index ExG = 2G − R − B (Anders et al. 2019) result is shown in Figure
87, where 15.32% (361,221 of 2,357,926) of pixels were identified as vegetation. That result is close to
the 15.95% of “veg” class-labels in the manually crafted label-image, but it also contains a large number of
incorrect results from the yellow and green garbage cans. Thus, the vegetation index method is not suitable
for detailed vegetation detection at a complicatedly textured construction site with other green and
yellow objects.
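For reference, the ExG comparison could be computed as in the following sketch; the threshold value is an assumption, since the text does not state the one used:

```python
import numpy as np

def exg_vegetation_mask(ortho_rgb, threshold=20):
    """Excess-green index ExG = 2G - R - B (Anders et al. 2019); pixels above the
    threshold are treated as vegetation. The threshold here is only illustrative."""
    rgb = ortho_rgb.astype(np.int32)                     # avoid uint8 overflow
    exg = 2 * rgb[..., 1] - rgb[..., 0] - rgb[..., 2]
    return exg > threshold

mask = exg_vegetation_mask(np.zeros((1536, 1536, 3), dtype=np.uint8))
vegetation_share = mask.mean()                           # fraction of pixels flagged as vegetation
```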
Therefore, the author concluded that the developed CNN-based image classification model with
32×32-pixel ortho-image patch input data had a good accuracy (93%) for identifying the objects on the
construction site.
The developed vegetation removing algorithm and the SEARCH_VEG_REPLACE_GROUND algorithm in
Figure 78 were programmed using Python 3.6.8. The vegetation removing experiments were conducted
with the 32×32-pixel “early stopping” prediction result. The initial search window size is
(2 × qsize × ratio_q) × (2 × qsize × ratio_q) = (2×32×10) × (2×32×10) = 640×640-pixel, where qsize is the 32-pixel patch size used in
the CNN-based image classification model and ratio_q is set as 10. The maximum search window size
depends on winsize_max, which is set as half of the image width (768-pixel). The minimum required number
of “ground” class-label pixels in the search window is set as qsize × qsize × ratio_q = 32×32×10 = 10,240. In addition, the
label-image traversal stride is set as qsize/4 = 8-pixel, which means the high-resolution 1,536×1,536-pixel
label-image is traversed as 192 rows by 192 columns, i.e., 36,864 patches with a size of 8×8-pixel.
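The following is a simplified sketch of the search-and-replace idea with the parameter values listed above; it only shows the row-direction extension of the adjustable window, and the class values other than “ground”/95, “veg”/130 and “shade”/240 are omitted, so it is an illustration rather than the dissertation's implementation:

```python
import numpy as np

QSIZE, RATIO, STRIDE = 32, 10, 8
WIN_HALF = QSIZE * RATIO              # initial half-extent (approx. 640 x 640 window)
WIN_MAX = 768                         # half of the 1,536-pixel image width
MIN_GROUND = QSIZE * QSIZE * RATIO    # minimum number of "ground" pixels (10,240)
GROUND, VEG = 95, {130, 240}          # "ground"; "veg" and "shade" ("shrub" would be added here)

def replace_vegetation_row(label_image, elevation_map):
    """For each vegetation patch, extend the search window in the row direction until
    enough 'ground' pixels are found, then use their average elevation for the patch."""
    h, w = label_image.shape
    for r in range(0, h - STRIDE + 1, STRIDE):
        for c in range(0, w - STRIDE + 1, STRIDE):
            patch = (slice(r, r + STRIDE), slice(c, c + STRIDE))
            majority = int(np.bincount(label_image[patch].ravel().astype(np.int64)).argmax())
            if majority not in VEG:
                continue
            row_half = WIN_HALF
            while row_half <= WIN_MAX:
                r0, r1 = max(0, r - row_half), min(h, r + STRIDE + row_half)
                c0, c1 = max(0, c - WIN_HALF), min(w, c + STRIDE + WIN_HALF)
                ground = label_image[r0:r1, c0:c1] == GROUND
                if ground.sum() >= MIN_GROUND:
                    elevation_map[patch] = elevation_map[r0:r1, c0:c1][ground].mean()
                    label_image[patch] = GROUND          # update the label-image as well
                    break
                row_half += QSIZE      # enlarge the adjustable window in the row direction
    return label_image, elevation_map
```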
The assembled label-image prediction contains pixels of “shade,” 256-pixel of “shrub,” and 427,456-pixel of “veg,”
which are considered as vegetation blocks that need to be removed and replaced with the class-label
“ground”/value “95.” It also contains 266,304-pixel of “sand” and 26,624-pixel of “withered,” which are
considered as ground blocks whose class-labels need to be updated to “ground.”
Table 23 shows that the sum of the numbers of “shade,” “shrub,” “veg,” “sand” and “withered” pixels in the label-image
prediction is equal to the number of “ground” pixels in the vegetation-removed label-image, which
confirms that the developed algorithm successfully traversed the high-resolution label-image. In
addition, the label-image and ortho-image were successfully updated after removing the vegetation, where
the new “ground” blocks were marked with a pink color in the ortho-image (see Figure 88).
Figure 89 shows the updated point cloud, which was generated using the selected central points of
each 32×32-pixel patch of the vegetation-removed ortho-image (textures) and vegetation-removed
elevation-map (elevation values). Among the selected 2,304 points, the vegetation category (“shade,”
“shrub” and “veg”) has 447 points, which account for 19.40%; the ground category (“withered” and “sand”)
has 267 points, which account for 11.59%. The sample distribution of the selected 2,304 points is similar to
the class-label distribution of the high-resolution label-image prediction.
[Figure 89: Updated point cloud after vegetation removal (front, left, rear and right views)]
Figure 90 shows the elevation differentials between the original and updated elevations on the ground
and vegetation blocks. The non-ground and non-vegetation points (1,590) were excluded from the histogram.
The larger elevation changes (≥0.7 m) occur on the edges of the wooden platform and the garbage cans, where
the updated elevations are more accurate than the elevations determined by the image-based method. The
majority of the elevation changes on the vegetation blocks ranged from 0.1 to 0.6 m, and the maximum
elevation change on the vegetation blocks is 0.6 to 0.7 m, which is also shown as the peak point of the X-profile
in Figure 88. This result is similar to the measured vegetation height of 0.61 m on the experimental site.
There are some negative elevation changes (-0.3 to 0.0 m) at the left end of the vegetation histogram,
which correspond to the top-edge (Y > 3.8 and -3.8 < X < -2.2) vegetation block in the contour plot. This error appears
because this region conflicts with the requirement that a vegetation block be surrounded
by ground blocks. Fortunately, this kind of error can be avoided by adding the necessary neighboring
grounds, such as the central vegetation block in data AL in Figure 81, or by stitching with another
adjacent dataset. Because the SEARCH_VEG_REPLACE_GROUND algorithm can generate the correct ground
elevations for the vegetation-covered regions, the author concluded that the developed methods in this
chapter (see Figure 73) can automatically identify the vegetation and determine the ground elevations covered
by the vegetation.
8.1 Summary
This research project utilized drone-based ortho-imaging to advance the image-based 3D-reconstruction
method for determining construction site elevations, with consideration of the
effect of static vegetation. Major work and findings are presented in Chapters 4 to 7. A summary is
provided below.
Chapter 4 presents the developed low-high ortho-image pair-based 3D-reconstruction
method for automatic determination of construction site elevations using drone
technology. The input images are different from those of the traditional drone photogrammetry method
and the classic stereo-vision method: they are 2:1 scale-ratio quadcopter drone-based low-high ortho-image
pairs instead of same-scale image pairs. The general procedure of the developed low-high ortho-image
pair-based elevation determination method includes the following steps:
2. Matching the pixel grid and determining elevations simultaneously from the low-high ortho-image pair.
Chapter 5 discusses how to use the developed low-high ortho-image pair-based elevation
determination method to acquire a construction site high-resolution ortho-image and elevation-map dataset
for training the deep learning-based construction site elevation estimation model. Based on the acquired
dataset, a single-frame image-based 3D-reconstruction method for construction site elevation estimation
was developed and presented in Chapter 6, which only needs a drone-based ortho-image as the input. The
general procedure of the ortho-image and deep learning-based elevation estimation method includes the
following steps:
3. Using the trained convolutional encoder-decoder network model to predict the elevation-map for the input ortho-image.
Chapter 7 presents the developed method to identify the vegetation on a construction
site using a drone-based ortho-image and determine the “real” ground surface elevations from the raw
surveying results. The general procedure of the vegetation identifying and removing method includes the
following steps:
3. Using the trained CNN-based image classification model to generate the small-patches of the label-image prediction.
5. Modifying their elevation values with the surrounding grounds’ elevations in the same elevation-map.
6. Converting the modified elevation-map to elevation data or a 3D point cloud of the construction
site.
8.2 Contributions
The success of this research project contributes to the advancement of drone ortho-imaging and
deep learning methods in construction site surveying. First, it advanced the multiple-image-based 3D-surveying
techniques with drone ortho-imaging, which is a flexible option for conditions where drones
need to be kept away from target objects and a real-time as-built model is needed. The developed low-high
ortho-image pair-based elevation determination method also provides the high-resolution ortho-image
and elevation-map datasets of construction sites for conducting deep learning-based research, which is
highly dependent on elevation data. In addition, for earthwork operations, the generated pixel grid
results are easy to convert to a 2D site plan for updating the earthwork quantity and a 3D point cloud for
visualizing the construction site.
Second, it developed and tested a convolutional encoder-decoder network model for estimating construction site
elevations. The developed
ortho-image and deep learning-based elevation estimation method can generate the elevation values as an
elevation-map output using a single ortho-image input. The experiments were conducted to evaluate the
effectiveness of the convolutional neural networks (CNNs) in the determination of construction site
elevations. The developed input image disassembling and output image assembling algorithm enables the
training of a deep learning model with larger images instead of shrunken images, which could
result in losing image detail. The success of this research project makes it possible to generate elevation
data on a construction site much faster than with traditional survey methods, thus speeding up on-site
construction operations.
Third, it provided and verified a feasible approach of using a CNN to segment a high-resolution
ortho-image of a construction site. The developed model can be used for automatically identifying and
locating multiple static object categories in the raw surveying results. In addition, the developed method
can be extended to remove dynamic objects (i.e., moving objects like trucks, people, etc.) from high-resolution
ortho-imaging videos. With these advancements, this research project has proved that it is
possible to use drone technologies to make image-based construction surveying measurements of
construction site elevations faster and more automated.
8.3 Conclusions
Construction surveying plays a crucial role in determining construction sites’ elevations and
locations, which are important in earthwork operations and critical for making decisions. However,
accurately and quickly determining the elevations of a construction site in real time is still a challenge for the
construction industry. This research project utilized drone ortho-imaging, deep learning, computer
vision, image processing and image classification methods to simplify and speed up the image-based 3D-reconstruction
techniques for construction site elevation determination. The major findings of this
research project are summarized as follows:
1. By only using two frame ortho-images, the developed low-high ortho-image based elevation
determination method focuses on the 3D-reconstruction of the ground surface and excludes the
vertical side surfaces of any attached or sunken objects on a construction site, which makes it
simpler than traditional drone photogrammetry. This method maximizes the overlap of the
ortho-image pair, where the entire low ortho-image is contained in the overlap. It only took 2
to 5 minutes to determine elevations for 2,500 points in the 10-20 m trial or 4,761 points in
the 20-40 m trial with Python 3, and this time could be reduced by using a faster programming
language such as C++. In addition, the generated results, the ortho-image and elevation-map
pairs, were easily stitched using a very narrow overlapping strip, which is much less than the
image overlap required by the traditional drone photogrammetry method.
2. For the automatic matching of the 2:1 scale ortho-image pair, the four-scaling reference patch
feature descriptors for the low ortho-image were designed first to have the same size as the
target patch feature descriptor for the high ortho-image. Then, the NCC method was used to
match the pixel grid and return the corresponding 0.5-pixel coordinate from the high ortho-image
for the given pixel from the low ortho-image. The developed pixel grid matching and
elevation determination algorithms were robust even for poorly textured surfaces and large
sloped surfaces, and also effective in indirectly lit environments. They can give an accurate pixel
grid match for the low-high ortho-image pair at least 92% of the time and can produce an
accurate elevation result for the strongly matched pixel grid within the acceptable measuring
error of ±5 cm.
3. The input image overlapping disassembling and output image assembling algorithm, which runs
in parallel with the deep learning models, was developed; it made the workstation system
more efficient for training a deep learning model with high-resolution images instead of shrinking the
images and losing image details. By disassembling the datasets into multiple small patches,
the number of datasets was significantly increased as well. With a suitable patch size, such
as the 128×128-pixel patch, the developed deep learning models balanced the global features
and local features, and they could even be well-trained earlier than with larger patch sizes. The
experimental results showed that the 128×128-pixel patch trial stopped training the
convolutional encoder-decoder network model at the 18th epoch, while the 256×256-pixel
patch trial stopped at the 35th epoch. In addition, the smaller 32×32-pixel patch contains the
maximum local features, which was important for changes at edges and corners.
4. The acquired high-resolution datasets were used to train and test the convolutional
encoder-decoder network model for estimating construction site elevations. The results
showed that the 128×128-pixel patch had the best prediction performance when the elevation
values were shared in the elevation-map with a 32×32-pixel patch. Adding model training
epochs had a positive relationship with the model prediction accuracy. The testing results
showed that the “well-trained” model had a 52.43% accuracy in elevation estimation with a
±5.0 cm error and a 66.15% accuracy with a ±10.0 cm error. Compared with the 94% accuracy
(error ±10.0 cm) in model training, it still has potential for improving the deep learning
model’s prediction accuracy.
5. Experimental results showed that the developed CNN-based image classification model using the
32×32-pixel patch had the best performance, with 94% accuracy in identifying each main
object’s class-label from each small-patch ortho-image of the construction site. The testing
results showed that the developed method, which disassembled the 1,536×1,536-pixel high-resolution
image into 9,025 overlapping small-patches for image classification and assembled
the predictions back into a high-resolution label-image, was effective: the different objects’
categories were marked with different colors in the assembled high-resolution label-image
predictions with a high accuracy of 93%, and the edges of different objects were well
determined.
8.4 Recommendations
The following recommendations are suggested for implementing the results of this research
project and for future research on multiple/single ortho-image based 3D-reconstruction for construction site
elevation determination and excavation operations. First, this research project focused on determining the
elevations of a construction site; the ability to convert the ortho-image and elevation-map
results to a 3D point cloud was addressed in this research project, and the textured point cloud was a
part of the generated results of the developed method as well. Transforming the point cloud results into
earthwork volume quantities can be implemented by using Autodesk Civil 3D to create a TIN
(Triangulated Irregular Network) mesh model and estimating the earthwork volume from the mesh model.
As the selected and matched pixels/points were in the intersection points of the regular grid as the site plan
formation, the volume can be estimated by the four-point-method (using four corners of each grid cell) or
three-point-method (using three corners of each triangular cell) when the current elevation and the designed
elevation of each intersection point are known. Figure 91 a and c show a 25-point 3D mesh model and 2D
site plan demo (programmed with C++ and OpenGL), where the x/y-axis ranges from 0 to 80 m, and the
current elevation ranges from 0 to 35 m. In this demo, 32 triangular cells were generated, each of which is a
small ground area (Area = 20×20/2 = 200 m²) whose volume can be estimated by (Ele₁ + Ele₂ + Ele₃) × Area/3,
where Eleᵢ is the elevation of each corner point. Figure 91 d and e show two earthwork estimation demos, where
the designed sites are two different flat planes (gray), and the volume estimation results for each triangular
cell with these two different design planes were calculated and indicated in the 3D surface model. In detail,
when the design plane is 10 m, part of the selected triangular cell (ID=20) needs to be cut (pink
surface/yellow edges) and part of it needs to be filled (yellow surface/pink edges), and the total earthwork
quantity is balanced in this triangular cell, as the volume = 0 m³ (see Figure 91 d); when the design plane is
20 m, the selected triangular cell needs to be filled with 2,000 m³ (see Figure 91 e). Moreover, for
monitoring and estimating the earthwork quantities at an active excavation site, two low-high
ortho-image pairs could be captured at two different times. Then, the generated ortho-images and elevation-
maps can be used to determine the elevation changes and calculate the volume changes between those two
time points. The critical process is to align the two ortho-images and elevation-maps to the same
coordinate. Fortunately, it can be easily handled by the 2D image rotation and translation (see Figure 92).
The elevation changes for any selected point, can be determined by subtracting the latest elevation-map
from the previous elevation-map. Figure 92 shows an elevation monitoring demo, where the ortho-images
and elevation-maps were aligned to the same image center, and the x/y-profiles of the center point were
overlapped to show the elevation changes. Furthermore, this research project utilized an experimental site
at a lake beach, which simulated most conditions of a real construction site, including large sloped surfaces.
It is recommended that the developed methods and algorithms be evaluated at a real excavation site
in the future.
[Figure 91: (a, c) 25-point 3D mesh model and 2D site plan demo; (d, e) earthwork estimation demos with 10 m and 20 m design planes]
[Figure 92: Elevation monitoring demo with aligned ortho-images (1824×1824-pixel) and aligned elevation-maps (1568×1568-pixel) for data CG and CF; elevation comparisons at the picked point on the center of the ortho-images (red curves: CG; green curves: CF)]
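The three-point volume estimate described above can be illustrated with a short sketch; the corner elevations below are assumed values chosen so that the balanced-cell and 2,000 m³ results of the demo are reproduced:

```python
def triangle_cut_fill(elevations, design_elevation, area=200.0):
    """Volume for one triangular cell: positive = fill needed, negative = cut needed.
    'elevations' are the current elevations of the cell's three corner points (m);
    'area' is the plan area of the cell (m^2), e.g. 20 x 20 / 2 = 200 m^2."""
    current_volume = sum(elevations) * area / 3.0        # (Ele1 + Ele2 + Ele3) x Area / 3
    design_volume = design_elevation * area
    return design_volume - current_volume

# Assumed corner elevations of 5, 10 and 15 m (mean 10 m) for the selected cell.
print(triangle_cut_fill([5.0, 10.0, 15.0], design_elevation=10.0))   # 0.0 -> balanced cell
print(triangle_cut_fill([5.0, 10.0, 15.0], design_elevation=20.0))   # 2000.0 m^3 of fill
```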
Second, the image patch-based NCC matching approach can be improved in the developed low-
high ortho-image pair-based elevation determination method. In this research project, the patch-based NCC
method was used to determine whether the reference pixel/patch in the reference image was strongly
matched with the candidate pixel/patch in the target image or not. The used reference and target patches
were grayscale single-channel, while future research could use the RGB 3-channel reference patch and
target patch to increase redundant features and enhance the matching accuracy. In addition, the developed
four-scaling reference patch feature descriptors were used to make the reference patch the same size
as the target patch, while the developed convolutional encoder-decoder network model could be used to
generate a predicted target patch for each reference patch by removing one up-sampling layer in the encoder
block. Then, the target patch prediction can be compared with the candidate patches of each virtual
plane in the target image. For a single low-high ortho-image pair, this proposed approach can be used in
dense pixel grid matching after getting the initial pixel grid matching results. When the training datasets are
big enough, the well-trained model can be used to generate the target patch prediction for new low-high
ortho-image pairs.
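A minimal sketch of the patch-based NCC comparison, extended to RGB 3-channel patches as suggested above; the acceptance threshold is an assumption, not the value used in this research project:

```python
import numpy as np

def ncc(reference_patch, candidate_patch):
    """Normalized cross-correlation between two equally sized patches.
    Works for grayscale (H, W) or RGB (H, W, 3) patches; returns a value in [-1, 1]."""
    a = reference_patch.astype(np.float64).ravel()
    b = candidate_patch.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_match(reference_patch, candidate_patches, threshold=0.9):
    """Return the index of the strongest candidate if its NCC score passes the threshold."""
    scores = [ncc(reference_patch, c) for c in candidate_patches]
    best = int(np.argmax(scores))
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])
```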
Third, the developed drone-based ortho-image and deep learning-based method can be used to
estimate construction site elevations if the convolutional encoder-decoder network model is well-trained
with datasets of similarly textured objects at sites. In this ortho-image-based 3D-reconstruction method, the
model training datasets are the reference information to estimate the construction site elevations. Thus, the
performance of the developed method relies on the quality and quantity of the model training datasets. Higher
quality means more comprehensive texture features and geometry shape features, while a larger quantity helps to
build the ability to ignore incorrect elevation values in the dataset and noise in the model predictions. In
this research project, the elevation estimation model training dataset was limited to 10-m drone-based
ortho-images, which only contain a few objects in a single image frame. In addition, the formation of the
elevation-map only contains a single elevation value in each 32×32-pixel patch. Therefore, adding a sixth
convolution layer or adding filters to the fifth convolution layer of the developed CNN encoder model
yields no significant improvement in the model prediction. Future research can assign more elevation values
to each 32×32-pixel patch by dense pixel grid matching, and then the developed model may need
additional CNN encoder layers and CNN decoder layers to connect the added elevation features. In
addition, increasing the drone flight altitude can enlarge an image’s spatial resolution (i.e., its ground
coverage) and include more objects. Therefore, future research can train the developed model with more
datasets at altitudes other than the 10 m of these ortho-images. Furthermore, to increase the accuracy of the
elevation estimation, future research could use image classification to assign a class-label to each patch (32×32-pixel). The class-label
can be used as the additional reference information (feature-map) to increase the accuracy of the elevation
prediction.
Fourth, this research project only considered removing the static vegetation blocks from the
image-based construction site surveying results. The model training datasets only contain the static objects
on a construction site, such as the static vegetation block and static structures. The CNN-based image
classification model was developed to identify the predefined static objects using the drone-based still
ortho-images. The experimental results confirmed that the developed classification model can be applied in
construction site surveying. However, additional work needs to be done before it can be used
for monitoring earthwork operations on a construction site in real time. An active excavating construction
site is much more complex than a still construction site. Dynamic objects such as excavators, dozers,
trucks and workers on the construction site have impacts on accurately determining the site elevations using
non-contact surveying methods. For example, a moving dozer may be included in overlapping ortho-image
pairs at different locations, which will lead to incorrect image pair matching using the
traditional drone photogrammetry method. Even if the dozer is stopped during the capture of the ortho-image
pair, it can still have an impact on the results of construction site surveying because the results contain the height
of the construction equipment. Thus, future research is needed to extend the CNN-based image classification
model training dataset to include all the potential static and dynamic objects at not only a still construction
site but also an active excavation site.
BIBLIOGRAPHY
Aguilar, R., Noel, M. F., and Ramos, L. F. (2019). “Integration of reverse engineering and non-linear
numerical analysis for the seismic assessment of historical adobe buildings.” Automation in
construction, 98, 1-15.
Anders, N., Valente, J., Masselink, R., and Keesstra, S. (2019). “Comparing Filtering Techniques for
Removing Vegetation from UAV-Based Photogrammetric Point Clouds.” Drones, 3(3), 61.
Arias, P., Herraez, J., Lorenzo, H., and Ordonez, C. (2005). “Control of structural problems in cultural
heritage monuments using close-range photogrammetry and computer methods.” Computers &
structures, 83(21-22), 1754-1766.
Ashour, R., Taha, T., Mohamed, F., Hableel, E., Kheil, Y. A., Elsalamouny, M., ... and Cai, G. (2016).
“Site inspection drone: A solution for inspecting and regulating construction sites.” Proc. 2016 IEEE
59th International Midwest Symposium on Circuits and Systems (MWSCAS), IEEE, Abu Dhabi,
doi:10.1109/mwscas.2016.7870116.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017). “SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39(12), 2481-2495.
Bang, S., Kim, H., and Kim, H. (2017). “UAV-based automatic generation of high-resolution panorama at
a construction site with a focus on preprocessing for image stitching.” Automation in construction,
84, 70-80.
Barazzetti, L., Remondino, F., and Scaioni, M. (2010). “Automation in 3D reconstruction: Results on
different kinds of close-range blocks.” Proc. International Archives of Photogrammetry, Remote
Sensing and Spatial Information Sciences, ISPRS, Newcastle upon Tyne, 55-61.
Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). “Speeded-up robust features (SURF).” Computer
vision and image understanding,110(3), 346-359.
Bernardini, S., Fox, M., and Long, D. (2014). “Planning the Behaviour of Low-Cost Quadcopters for
Surveillance Missions.” Proc. 24th International Conference on Automated Planning and
Scheduling(ICAPS), AAAI, Portsmouth, 445-453.
Chan, A. P. C., and Owusu, E. K. (2017). “Corruption Forms in the Construction Industry: Literature
Review.” Journal of Construction Engineering and Management, 143(8), 4017057.
Chen, K., Lu, W., Xue, F., Tang, P., and Li, L. H. (2018). “Automatic building information model
reconstruction in high-density urban areas: Augmenting multi-source data with architectural
knowledge.” Automation in construction, 93, 22-34.
Chen, W., Fu, Z., Yang, D., and Deng, J. (2016). “Single-image depth perception in the wild.” Proc.
Conference on 30th Neural Information Processing Systems (NIPS 2016), NeurIPS, Barcelona, 730-
738.
Chollet, F. (2015). “Keras: The Python Deep Learning library.” <https://keras.io/> (Aug. 7, 2019).
Cunliffe, A. M., Brazier, R. E., and Anderson, K. (2016). “Ultra-fine grain landscape-scale quantification
of dryland vegetation structure with drone-acquired structure-from-motion photogrammetry.”
Remote Sensing of Environment, 183, 129-143.
Daftry, S., Hoppe, C., and Bischof, H. (2015). “Building with drones: Accurate 3D facade reconstruction
using MAVs.” Proc. 2015 IEEE International Conference on Robotics and Automation (ICRA),
IEEE, Seattle, 3487- 3494.
Dalal, N., and Triggs, B. (2005). “Histograms of oriented gradients for human detection.” Proc. 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE,
San Diego, doi: 10.1109/CVPR.2005.177.
Deng, J., Dong, W., Socher, R., Li, L., Li, K. and Li., F. (2009). “ImageNet: A large-scale hierarchical
image database.” Proc. 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE,
Miami, 248-255.
Du, J. C., and Teng, H. C. (2007). “3D laser scanning and GPS technology for landslide earthwork volume
estimation.” Automation in construction, 16(5), 657-663.
Eigen, D., Puhrsch, C., and Fergus, R. (2014). “Depth map prediction from a single image using a multi-
scale deep network.” Proc. 28th Conference on Neural Information Processing Systems (NIPS
2014), NeurIPS, Montréal, 2366-2374.
Ellenberg, A., Kontsos, A., Moon, F., and Bartoli, I. (2016). “Bridge deck delamination identification from
unmanned aerial vehicle infrared imagery.” Automation in construction, 72, 155-165.
Engelcke, M., Rao, D., Wang, D. Z., Chi Hay Tong, and Posner, I. (2017). “Vote3Deep: Fast object
detection in 3D point clouds using efficient convolutional neural networks.” Proc. 2017 IEEE
International Conference on Robotics and Automation (ICRA) , IEEE, Singapore, 1355-1361.
Erickson, M. S., Bauer, J. J., and Hayes, W. C. (2013). “The accuracy of photo-based three-dimensional
scanning for collision reconstruction using 123D catch.” Proc. SAE 2013 World Congress &
Exhibition, doi:10.4271/2013-01-0784
Freimuth, H., and König, M. (2018). “Planning and executing construction inspections with unmanned
aerial vehicles.” Automation in construction, 96, 540-553.
Furukawa, Y., Ponce, J. (2010). “Accurate, Dense, and Robust Multiview Stereopsis.” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 32 (8), 1362-1376.
Garg, R., BG, V. K., Carneiro, G., and Reid, I. (2016). “Unsupervised CNN for single view depth
estimation: Geometry to the rescue.” Proc. 14th European Conference on Computer Vision (ECCV
2016), Springer, Amsterdam, 740-756.
Gheisari, M., and Esmaeili, B. (2016). “Unmanned Aerial Systems (UAS) for Construction Safety
Applications.” Proc. Construction Research Congress 2016 , ASCE, Historic San Juan, 2642-2650.
Gheisari, M., Irizarry, J., and Walker, B. N. (2014). “UAS4SAFETY: The potential of unmanned aerial
systems for construction safety applications.” Proc. Construction Research Congress 2014 , ASCE,
Atlanta, 1801-1810.
Guo, H., Yu, Y., Ding, Q., and Skitmore, M. (2018). “Image-and-Skeleton-Based Parameterized Approach
to Real-Time Identification of Construction Workers’ Unsafe Behaviors.” Journal of Construction
Engineering and Management, 144(6). doi: 10.1061/(ASCE)CO.1943-7862.0001497.
Guo, Q., Su, Y., Hu, T., Zhao, X., Wu, F., Li, Y., ... and Zheng, Y. (2017). “An integrated UAV-borne lidar
system for 3D habitat mapping in three forest ecosystems across China.” International Journal of
Remote Sensing, 38(8-10), 2954-2972.
Gwak, H. S., Seo, J., and Lee, D. E. (2018). “Optimal cut-fill pairing and sequencing method in earthwork
operation.” Automation in Construction, 87, 60-73.
Hamledari, H., McCabe, B., and Davari, S. (2017). “Automated computer vision-based detection of
components of under-construction indoor partitions.” Automation in Construction, 74, 78-94.
Han, J., Zhang, D., Cheng, G., Guo, L. and Ren, J. (2015). “Object Detection in Optical Remote Sensing
Images Based on Weakly Supervised Learning and High-Level Feature Learning.” IEEE
Transactions on Geoscience and Remote Sensing, 53(6), 3325-3337.
Han, K., Degol, J. and Golparvar-Fard, M. (2018). “Geometry- and Appearance-Based Reasoning of
Construction Progress Monitoring.” Journal of Construction Engineering and Management, 144 (1),
04017110.
Han, K. K., and Golparvar-Fard, M. (2017). “Potential of big visual data and building information
modeling for construction performance analytics: An exploratory study.” Automation in
Construction, 73, 184-198.
Hassner, T., and Basri, R. (2006). “Example based 3D reconstruction from single 2D images.” Proc. 2006
Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06), IEEE, New York,
doi: 10.1109/CVPRW.2006.76.
Haur, C. J., Kuo, L. S., Fu, C. P., Hsu, Y. L., and Da Heng, C. (2018). “Feasibility Study on UAV-assisted
Construction Surplus Soil Tracking Control and Management Technique.” Proc. 5th Annual
International Conference on Material Science and Environmental Engineering (MSEE2017), IOP
Publishing, Xiamen, doi:10.1088/1757-899X/301/1/012145.
Hearn, D., Baker, M. P., and Baker, M. P. (2004). Computer Graphics with OpenGL (3rd edition), Pearson
Prentice Hall, Upper Saddle River, NJ.
Hola, B., and Schabowicz, K. (2010). “Estimation of earthworks execution time cost by means of artificial
neural networks.” Automation in Construction, 19(5), 570-579.
Holz, D., Holzer, S., Rusu, R. B., and Behnke, S. (2011). “Real-time plane segmentation using RGB-D
cameras.” Proc., 15th RoboCup International Symposium 2011, Springer, Istanbul, pp. 306-317.
Huang, A. S., Bachrach, A., Henry, P., Krainin, M., Maturana, D., Fox, D., and Roy, N. (2017). “Visual
odometry and mapping for autonomous flight using an RGB-D camera.” Proc., the 15th
International Symposium of Robotic Research (ISRR), Springer, Flagstaff, 235-252.
Huang, Q., Luzi, G., Monserrat, O., and Crosetto, M. (2017). “Ground-based synthetic aperture radar
interferometry for deformation monitoring: a case study at Geheyan Dam, China.” Journal of
Applied Remote Sensing, 11(3). doi: 10.1117/1.JRS.11.036030.
Hubbard, B., Wang, H., Leasure, M., Ropp, T., Lofton, T., Hubbard, S., and Lin, S. (2015). “Feasibility
study of UAV use for RFID material tracking on construction sites.” Proc., 51st ASC Annual
International Conference. ASC, College Station.
Inzerillo, L., Di Mino, G., and Roberts, R. (2018). “Image-based 3D reconstruction using traditional and
UAV datasets for analysis of road pavement distress.” Automation in Construction, 96, 457-469.
Irizarry, J., Gheisari, M., and Walker, B. N. (2012). “Usability assessment of drone technology as safety
inspection tools.” Journal of Information Technology in Construction, 17, 194-212.
Josephson, P. E., and Hammarlund, Y. (1999). “The causes and costs of defects in construction: A study of
seven building projects.” Automation in construction, 8(6), 681-687.
Kaehler, A., and Bradski, G. (2016). Learning OpenCV 3: Computer Vision in C++ with the OpenCV
Library, O'Reilly Media, Sebastopol, CA.
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., and
Ouyang, W. (2018). “T-CNN: Tubelets With Convolutional Neural Networks for Object Detection
From Videos.” IEEE Transactions on Circuits and Systems for Video Technology, 28(10), 2896-
2907.
Kim, D., Liu, M., Lee, S., and Kamat, V. R. (2019). “Remote proximity monitoring between mobile
construction resources using camera-mounted UAVs.” Automation in Construction, 99, 168-182.
Kim, H., and Kim, H. (2018). “3D reconstruction of a concrete mixer truck for training object detectors.”
Automation in construction, 88, 23-30.
Kim, K., Kim, H., and Kim, H. (2017). “Image-based construction hazard avoidance system using
augmented reality in wearable device.” Automation in Construction, 83, 390-403.
Kim, S., Irizarry, J., and Costa, D. B. (2016). “Potential factors influencing the performance of unmanned
aerial system (UAS) integrated safety control for construction worksites.” Proc., Construction
Research Congress 2016, ASCE, Historic San Juan, 2614-2623.
Kirscht, M., and Rinke, C. (1998). “3D Reconstruction of Buildings and Vegetation from Synthetic
Aperture Radar (SAR) Images.” Proc., IAPR Workshop on Machine Vision Application (MVA’98),
IAPR, Makuhari, 228-231.
Kraig, K., Clifford, J. S., Christine, F., Richard, E. M. (2008). Construction Management Fundamentals,
McGraw-Hill, New York, NY.
Kwon, S., Park, J. W., Moon, D., Jung, S., and Park, H. (2017). “Smart Merging Method for Hybrid Point
Cloud Data using UAV and LIDAR in Earthwork Construction.” Procedia Engineering, 196, 21-28.
Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016). “Deeper depth prediction
with fully convolutional residual networks.” Proc., 4th International Conference on 3D Vision (3DV
2016), IEEE, Stanford, 239-248.
Lewis, J.P. (1995). “Fast Template Matching.” Proc., Vision Interface 95, Canadian Image Processing and
Pattern Recognition Society, Quebec City, 120-123.
Li, D., and Lu, M. (2018). “Integrating geometric models, site images and GIS based on Google Earth and
Keyhole Markup Language.” Automation in Construction, 89, 317-331.
Li, F., Zlatanova, S., Koopman, M., Bai, X., and Diakité, A. (2018). “Universal path planning for an indoor
drone.” Automation in Construction, 95, 275-283.
Li, R., Ma, F., Xu, F., Matthies, L. H., Olson, C. F., and Arvidson, R. E. (2002). “Localization of Mars rovers
using descent and surface-based image data.” Journal of Geophysical Research: Planets, 107, doi:
10.1029/2000JE001443.
Li, Z., and Snavely, N. (2018). “MegaDepth: Learning single-view depth prediction from internet photos.”
Proc., 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake
City, 2041-2050.
Liu, F., Shen, C., and Lin, G. (2015). “Deep convolutional neural fields for depth estimation from a single
image.” Proc., 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015),
IEEE, Boston,5162-5170.
Lowe, D. G. (1999). “Object recognition from local scale-invariant features.” Proc., 7th IEEE International
Conference on Computer Vision (ICCV’99), IEEE, Kerkyra, 1150-1157.
Lowe, D. G. (2004). “Distinctive image features from scale-invariant keypoints.” International Journal of
Computer Vision, 60(2), 91-110.
Maghiar, M., and Mesta, D. (2018). “Measurement comparison of city roadway intersection models
obtained via laser-scanning and photogrammetry.” Proc., 54th ASC Annual International
Conference, ASC, Minneapolis, 568-576.
Matthies, L., Maimone, M., Johnson, A., Cheng, Y., Willson, R., Villalpando, C., Goldberg, S., Huertas,
A., Stein, A., Angelova, A. (2007). “Computer vision on Mars.” International Journal of Computer
Vision, 75 (1),67-92.
Matthies, L.H., Olson, C.F., Tharp, G., Laubach, S. (1997). “Visual localization methods for mars rovers
using lander, rover, and descent imagery.” <https://trs.jpl.nasa.gov/bitstream/handle/2014/22227/97-
0695.pdf?sequence=1> (May 8, 2018).
Memarzadeh, M., Golparvar-Fard, M., and Niebles, J. C. (2013). “Automated 2D detection of construction
equipment and workers from site video streams using histograms of oriented gradients and colors.”
Automation in Construction, 32, 24-37.
Meng, C., Zhou, N., Xue, X., and Jia, Y. (2013). “Homography-based depth recovery with descent
images.” Machine Vision and Applications, 24(5), 1093-1106.
Metni, N., and Hamel, T. (2007). “A UAV for bridge inspection: Visual servoing control law with
orientation limits.” Automation in Construction, 17(1), 3-10.
Moon, D., Chung, S., Kwon, S., Seo, J., and Shin, J. (2019). “Comparison and utilization of point cloud
generated from photogrammetry and laser scanning: 3D world model for smart heavy equipment
planning.” Automation in Construction, 98, 322-331.
Morgenthal, G., Hallermann, N., Kersten, J., Taraben, J., Debus, P., Helmrich, M., and Rodehorst, V.
(2019). “Framework for automated UAS-based structural condition assessment of bridges.”
Automation in Construction, 97, 77-95.
Nair, V., and Hinton, G. (2010). “Rectified linear units improve restricted Boltzmann machines.” Proc., 27th
International Conference on Machine Learning (ICML 2010), International Machine Learning Society
(IMLS), Haifa, 807-814.
Nasirian, A., Arashpour, M., and Abbasi, B. (2019). “Critical Literature Review of Labor Multiskilling in
Construction.” Journal of Construction Engineering and Management, 145(1), 4018113.
Nassar, K., and Jung, Y. (2012). “Structure-From-Motion Approach to the Reconstruction of Surfaces for
Earthwork Planning.” Journal of Construction Engineering and Project Management, 2 , 1-7.
Nex, F., and Remondino, F. (2014). “UAV for 3D mapping applications: a review.” Applied Geomatics,
6(1), 1-15.
Nichols, H., and Day, D. (2010). Moving the Earth: The Workbook of Excavation (6th edition), McGraw-
Hill, New York, NY.
Nico, G., Leva, D., Fortuny-Guasch, J., Antonello, G., and Tarchi, D. (2005). “Generation of digital terrain
models with a ground-based SAR system.” IEEE Transactions on Geoscience and Remote Sensing,
43(1), 45-49.
Noferini, L., Pieraccini, M., Mecatti, D., Macaluso, G., Atzeni, C., Mantovani, M., ... and Tagliavini, F.
(2007). “Using GB-SAR technique to monitor slow moving landslide.” Engineering Geology, 95(3-
4), 88-98.
Noh, H., Hong, S., and Han, B. (2015). “Learning Deconvolution Network for Semantic Segmentation.”
Proc., 2015 IEEE International Conference on Computer Vision (ICCV 2015), IEEE, Santiago,
1520–1528.
Nunnally, S. W. (2010). Construction Methods and Management (8th edition), Pearson Prentice Hall, Upper
Saddle River, NJ.
Omar, T., and Nehdi, M. L. (2017). “Remote sensing of concrete bridge decks using unmanned aerial
vehicle infrared thermography.” Automation in Construction, 83, 360-371.
Park, J., Kim, P., Cho, Y. K., and Kang, J. (2019). “Framework for automated registration of UAV and
UGV point clouds using local features in images.” Automation in Construction, 98, 175-182.
Peurifoy, R. L., and Oberlender, G. D. (2014). Estimating Construction Costs (6th edition), McGraw-Hill, New
York, NY.
Phung, M. D., Quach, C. H., Dinh, T. H., and Ha, Q. (2017). “Enhanced discrete particle swarm
optimization path planning for UAV vision-based surface inspection.” Automation in Construction,
81, 25-33.
Remondino, F. (2003). “From point cloud to surface: the modeling and visualization problem.” Proc.,
International Workshop on Visualization and Animation of Reality-based 3D Models, ISPRS,
Engadin.
Remondino, F., and El‐Hakim, S. (2006). “Image‐based 3D modelling: a review.” The Photogrammetric
Record, 21(115), 269-291.
Roca, D., Lagüela, S., Díaz-Vilariño, L., Armesto, J., and Arias, P. (2013). “Low-cost aerial unit for
outdoor inspection of building façades.” Automation in Construction, 36, 128-135.
Rusu, R. B., and Cousins, S. (2011). “3D is here: Point Cloud Library (PCL).” Proc., 2011 IEEE
International Conference on Robotics and Automation, IEEE, Shanghai, doi:
10.1109/ICRA.2011.5980567.
Saxena, A., Chung, S. H., and Ng, A. Y. (2008). “3-D Depth Reconstruction from a Single Still Image.”
International Journal of Computer Vision, 76(1), 53-69.
Schneider, S., Taylor, G. W., and Kremer, S. (2018). “Deep Learning Object Detection Methods for
Ecological Camera Trap Data.” Proc., 2018 15th Conference on Computer and Robot Vision (CRV),
IEEE, Toronto, 321-328.
Seo, J., Duque, L., and Wacker, J. (2018). “Drone-enabled bridge inspection methodology and application.”
Automation in Construction, 94, 112-126.
Seo, J., Lee, S., Kim, J., and Kim, S. K. (2011). “Task planner design for an automated excavation system.”
Automation in Construction, 20(7), 954-966.
Shewchuk, J. R. (2002). “Delaunay refinement algorithms for triangular mesh generation.” Computational
Geometry, 22(1-3), 21-74.
Siebert, S., and Teizer, J. (2014). “Mobile 3D mapping for surveying earthwork projects using an
Unmanned Aerial Vehicle (UAV) system.” Automation in Construction, 41, 1-14.
Solem, J. E. (2012). Programming Computer Vision with Python: Tools and algorithms for analyzing
images, O'Reilly Media, Sebastopol, CA.
Sophian, A., Sediono, W., Salahudin, M. R., Shamsuli, M. S. M., and Za’aba, D. Q. A. A. (2017).
“Evaluation of 3D-Distance Measurement Accuracy of Stereo-Vision Systems.” International
Journal of Applied Engineering Research, 12(16), 5946-5951.
Spence, W. P., and Kultermann, E. (2016). Construction Materials, Methods and Techniques (4th edition),
Cengage Learning, Alexandria, VA.
Sung, C., and Kim, P. Y. (2016). “3D terrain reconstruction of construction sites using a stereo camera.”
Automation in Construction, 64, 65-77.
Szeliski, R. (2010). Computer vision: algorithms and applications, Springer, New York, NY.
Takahashi, N., Wakutsu, R., Kato, T., Wakaizumi, T., Ooishi, T., and Matsuoka, R. (2017). “Experiment
on UAV photogrammetry and terrestrial laser scanning for ICT-integrated construction.” Proc.,
2017 International Conference on Unmanned Aerial Vehicles in Geomatics, ISPRS, Bonn, 371-377.
Toole, T. M. (2002). “Construction Site Safety Roles.” Journal of Construction Engineering and
Management, 128(3), 203-210.
Tsai, V. J. (1993). “Delaunay triangulations in TIN creation: an overview and a linear-time algorithm.”
International Journal of Geographical Information Science, 7(6), 501-524.
Tulldahl, H. M., and Larsson, H. (2014). “Lidar on small UAV for 3D mapping.” Proc., The International
Society for Optical Engineering, SPIE, doi: 10.1117/12.2068448.
Ullman, S. (1979). “The Interpretation of Structure from Motion.” Proceedings of the Royal Society of
London. Series B, Biological Sciences, 203(1153), 405-426.
Van Blyenburgh, P. (1999). “UAVs: an overview.” Air and Space Europe, 1(5-6), 43-47.
Van den Heuvel, F. A. (1998). “3D reconstruction from a single image using geometric constraints.” ISPRS
Journal of Photogrammetry and Remote Sensing, 53(6), 354-368.
Wang, J., Sun, W., Shou, W., Wang, X., Wu, C., Chong, H. Y., Liu, Y., and Sun, C. (2015). “Integrating
BIM and LiDAR for Real-time Construction Quality Control.” Journal of Intelligent and Robotic
Systems, 79(3-4), 417-432.
Wang, J., Zhang, S., and Teizer, J. (2015). “Geotechnical and safety protective equipment planning using
range point cloud data and rule checking in building information modeling.” Automation in
Construction, 49, 250-261.
Wang, L., Chen, F., and Yin, H. (2016). “Detecting and tracking vehicles in traffic by unmanned aerial
vehicles.” Automation in Construction, 72, 294-308.
Westoby, M. J., Brasington, J., Glasser, N. F., Hambrey, M. J., and Reynolds, J. M. (2012). “‘Structure-
from-Motion’ photogrammetry: A low-cost, effective tool for geoscience applications.”
Geomorphology, 179, 300-314.
Wing, C. K. (1997). “The ranking of construction management journals.” Construction Management and
Economics, 15(4), 387-398.
Xiong, Y., Olson, C. F., and Matthies, L. H. (2005). “Computing depth maps from descent images.” Machine
Vision and Applications, 16(3), 139-147.
Yang, C., Tsai, M., Kang, S., and Hung, C. (2018). “UAV path planning method for digital terrain model
reconstruction – A debris fan example.” Automation in Construction, 93, 214-230.
Yi, C., and Lu, M. (2016). “A mixed-integer linear programming approach for temporary haul road design
in rough-grading projects.” Automation in Construction, 71, 314-324.
Zakeri, H., Nejad, F. M., and Fahimifar, A. (2016). “Rahbin: A quadcopter unmanned aerial vehicle based
on a systematic image processing approach toward an automated asphalt pavement inspection.”
Automation in Construction, 72, 211-235.
Zhang, S., Teizer, J., Pradhananga, N., and Eastman, C. M. (2015). “Workforce location tracking to model,
visualize and analyze workspace requirements in building information models for construction
safety planning.” Automation in Construction, 60, 74-86.
Zhao, W. Q., and Lin, Z. (2016). “SfM Precise Surface Measurement: Evaluation of Resolution and
Accuracy and Error Analysis.” Geography and Geo-Information Science, 32(6), 25-31. (in Chinese).
Zhong, X., Peng, X., Yan, S., Shen, M., and Zhai, Y. (2018). “Assessment of the feasibility of detecting
concrete cracks in images acquired by unmanned aerial vehicles.” Automation in Construction, 89,
49-57.
Zhou, X., Zhong, G., Qi, L., Dong, J., Pham, T. D., and Mao, J. (2017). “Surface height map estimation
from a single image using convolutional neural networks.” Proc., 8th International Conference on
Graphic and Image Processing (ICGIP 2016), Society of Photo-Optical Instrumentation Engineers
(SPIE), Tokyo, 1022524.