Derives histograms and descriptive statistics (min, max, mean, r.m.s, etc.) for ODM or grid/raster data sets and stores the results graphically (SVG) and numerically (XML).
Data analysis is of crucial importance for ALS data processing. Besides 3D views and 2D maps, histograms of specific data attributes (including standard statistical parameters) are useful tools for analyzing certain data characteristics (e.g. distribution of heights, amplitudes, return numbers, gradients...) and for checking the data quality (e.g. strip differences considered to be normally distributed with expectation value = 0). Whereas 2D maps visualize the spatial distribution of certain data attributes, histograms condense the entire information in a bar plot and the descriptive statistical measures summarize and characterize the whole data sample. Thus, decisions upon the necessity of additional processing steps (e.g. strip adjustment) no longer rely on a pure visual data inspection but they are supported (or even advised) by quantified statistical measures. The following statistical parameters are provided:
Module Histo operates on either vector data sets (OPALS Data Manager, ODM) or regular grids/rasters in GDAL supported data formats. In both cases, a single input file or multiple input files (of the same type) can be specified (parameter inFile). By default, the histograms and statistics are calculated for the heights (z) of the ODM point cloud or the first band of the grid/raster dataset, respectively. However, any other attribute stored in the ODM (as additional info) or any other raster band (zero-based band index or band name) can be used instead (parameter attribute). If multiple attributes are specified, the module derives separate histograms and statistics for each attribute or raster band. The respective data samples are sorted into bins (classes) of equidistant width. The width can be specified either by the desired number of bins (parameter nBins, default: 20) or by a specific bin width (parameter binWidth). By default, the histogram is limited to the range between the 0.02 and 0.98 quantiles, which moves outliers to the underflow and overflow bins. This behavior can be changed with the sampleRange parameter by specifying relative or absolute sample range values. Relative values can be specified either in percentage (e.g. 5%) or in quantile (e.g. q:0.05) notation. Please note that the quantiles are only estimated by default, which may result in slightly inaccurate histogram limits, as visible in Figure 1. If exact limits are required, the exactComputation mode has to be activated. If the histogram should cover the full sample range from \(x_{min}\) to \(x_{max}\) (without underflow and overflow bins), one can either specify the quantiles 0 and 1 (-sampleRange q:0 q:1) or use the min and max labels (-sampleRange min max). See example 3 for further details.
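The binning scheme described above can be illustrated with a minimal pure-Python sketch. It is not the module's implementation: the function names `quantile` and `histogram` are hypothetical, the quantiles are computed exactly from a sorted copy of the data (comparable to the exact computation mode), and the assignment of the upper limit to the last bin is an assumption made for this illustration.

```python
import math


def quantile(sorted_values, q):
    """Exact quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def histogram(values, n_bins=20, q_low=0.02, q_high=0.98):
    """Sort values into n_bins equidistant bins between two quantiles.

    Values below the lower limit are counted in the underflow bin,
    values above the upper limit in the overflow bin.
    Returns (lower_limit, bin_width, counts, underflow, overflow).
    """
    data = sorted(values)
    lower = quantile(data, q_low)
    upper = quantile(data, q_high)
    if upper <= lower:
        raise ValueError("sample range is empty, cannot derive a bin width")
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    underflow = overflow = 0
    for v in data:
        if v < lower:
            underflow += 1
        elif v > upper:
            overflow += 1
        else:
            # assumption: the upper limit itself belongs to the last bin
            idx = min(int((v - lower) / width), n_bins - 1)
            counts[idx] += 1
    return lower, width, counts, underflow, overflow
```

Calling `histogram(values, q_low=0.0, q_high=1.0)` corresponds to the full sample range from \(x_{min}\) to \(x_{max}\), in which case the underflow and overflow counts are zero.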
For processing integer attributes (EchoNumber, Classification, etc.) the module provides a specific integer processing mode (see procMode) in which the bin width is constrained to integer values. Furthermore, the bin borders are shifted by half of the bin width, so that the bin centers correspond to integer values (see EchoNumber histogram in Figure 1). By default, the module automatically switches to real or integer processing mode based on the type of the ODM attribute or the raster band type. Nevertheless, this behavior can be overruled by the procMode parameter. For non-continuous attributes like classification values, it can be useful to skip empty bins within the histogram (see parameter skipEmptyBins and example 5).
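The half-bin shift of the integer processing mode can be sketched as follows. This is an illustration only; the function name `integer_bins` and its return layout are hypothetical and do not reflect the module's internal data structures.

```python
def integer_bins(values, bin_width=1):
    """Bin integer values so that every bin center is an integer.

    The first bin border is placed half a bin width below the smallest
    value, hence the bin centers are min, min + bin_width, ...
    Returns (edges, centers, counts).
    """
    if bin_width < 1 or int(bin_width) != bin_width:
        raise ValueError("bin width must be a positive integer")
    lo, hi = min(values), max(values)
    first_edge = lo - bin_width / 2.0
    n_bins = int((hi - first_edge) // bin_width) + 1
    edges = [first_edge + i * bin_width for i in range(n_bins + 1)]
    centers = [lo + i * bin_width for i in range(n_bins)]
    counts = [0] * n_bins
    for v in values:
        counts[int((v - first_edge) // bin_width)] += 1
    return edges, centers, counts
```

For echo numbers 1 to 4 and a bin width of 1, the borders are 0.5, 1.5, 2.5, 3.5 and 4.5, so each bar is centered on its echo number.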
To limit the memory consumption, the module uses several approximation strategies that avoid storing all values in a sorted vector or list. Using the extended \(P^2\) algorithm (Jain and Chlamtac, 1985) for estimating quantiles and three data passes, the median and sigma(mad) can be computed with decent precision. Nevertheless, in certain situations the exact quantiles, median and sigma(mad) values are required. This can be achieved by activating the exact computation mode (parameter exactComputation). In this mode, the complete data series needs to be stored in a sorted vector, which might not be possible for huge data sets. On the other hand, the exact computation mode only requires one data pass, which is typically faster than the non-exact mode. If the module cannot allocate the vector for the full data series, it automatically falls back to the non-exact mode and outputs a warning.
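For reference, the exact variant of the robust measures can be sketched in a few lines of Python. The sketch keeps the complete data series in memory, as the exact computation mode does, and does not implement the \(P^2\) estimator. The scale factor 1.4826, which makes the median absolute deviation consistent with the standard deviation of a normal distribution, is the commonly used definition; that the module uses exactly this factor is an assumption here, and the function name `median_and_sigma_mad` is hypothetical.

```python
import statistics


def median_and_sigma_mad(values):
    """Return the exact median and the MAD-based robust sigma.

    sigma_mad = 1.4826 * median(|x - median(x)|)
    (1.4826 is the usual normal-consistency factor, assumed here).
    """
    data = list(values)
    med = statistics.median(data)
    mad = statistics.median(abs(v - med) for v in data)
    return med, 1.4826 * mad
```

A single gross outlier hardly affects these measures: for the sample 1, 2, 3, 4, 100 the median is 3 and the median absolute deviation is 1, whereas the mean and the standard deviation are dominated by the value 100.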
Unless otherwise specified, all available data are considered for the calculations. However, in some situations it might be advantageous to restrict the input data. This can be achieved by specifying a data window (parameter limit) and/or a selection condition (parameter filter). The filter string must correspond to the OPALS Filters syntax. Please note that filters based on certain point attributes (e.g.: Echo, Class, ...) are not practicable for grid/raster datasets, as the 3D grid points (x, y, grid value) do not contain additional point attributes. It is advisable to use the limit parameter (e.g.: limit xmin ymin xmax ymax) to specify a data window rather than specifying a region filter (e.g.: parameter filter "Region[xmin ymin xmax ymax]"). In the latter case, the region filter is applied to the entire dataset (all ODM data points or the entire grid), whereas in the former case a spatial sub-selection is applied beforehand, resulting in better program performance. This is especially important when statistics are to be derived for a series of small patches of the same data set by running Module Histo repeatedly. It is even possible to combine limits and filters, in which case the window query is applied first and, subsequently, all points within the window area are checked w.r.t. the (potentially complex) region filter polygon.
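The order of evaluation (window query first, filter second) can be sketched as follows. The sketch is a simplification under stated assumptions: points are plain dictionaries, the function name `select` is hypothetical, and the window test is a linear scan, whereas a real data manager would typically answer the window query via a spatial index without visiting all points.

```python
def select(points, limit=None, point_filter=None):
    """Return the points inside `limit` that also pass `point_filter`.

    points       : iterable of dicts with at least the keys 'x' and 'y'
    limit        : (xmin, ymin, xmax, ymax) or None
    point_filter : callable(point) -> bool, or None
    """
    selected = []
    for p in points:
        if limit is not None:
            xmin, ymin, xmax, ymax = limit
            if not (xmin <= p["x"] <= xmax and ymin <= p["y"] <= ymax):
                continue  # outside the window: the filter is never evaluated
        if point_filter is not None and not point_filter(p):
            continue
        selected.append(p)
    return selected
```

With a small window, the potentially expensive filter is evaluated only for the few points inside the window instead of for the entire dataset.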
Finally, the results (histogram and statistics) are stored as a complex object separately for each attribute. For the Python and C++ class implementations the results are directly accessible for further use (e.g. to decide on further processing steps based on statistical measures). Beyond that, graphical output (parameter plotFile) is provided as Scalable Vector Graphics which can be displayed in standard web browsers (Firefox, Opera, IE...). If an output parameter file (parameter outParamFile) is specified, the (numerical) results are additionally written to an XML file.
Possible values:
  automatic ... derives mode from attribute type
  real ........ allows arbitrary bin width values
  integer ..... limits bin width and centers bins to natural numbers
For features aggregated into bins like minority and majority, the bin center is used as representative value. The same applies to procMode=integer, where the bin limits are shifted by half the bin size so that the corresponding integer value is located in the bin center. With procMode=real the bin limits are multiples of the bin size, which is suitable for processing features of floating point (real) type. In automatic mode the module selects real or integer procMode based on the attribute type.
The data used in the following examples are located in the $OPALS_ROOT/demo/ directory. Example 1 shows several histogram variants based on an ALS point cloud (fullwave.odm) whereas example 2 features histograms of regular grids (strip11/21.tif).
As a prerequisite for the following example, the ALS point cloud data must be imported into the ODM. To achieve that, change to the demo directory and type:
Now we are ready to derive histograms based on the resulting ODM, which stores the 3D coordinates (x, y, z) and additional attributes (gps time, amplitude, echo width, echo nr, echo qualifier) for each point. The following example demonstrates how to analyze different point attributes in the form of histograms:
The following SVG plot files are created:
Please note that the names of the respective SVG plot files (not specified explicitly in the above examples) are derived from the input file and attribute names. The results are shown in Figure 1.
A numerical representation of the histogram and statistics can be obtained by specifying an output parameter file as shown in the first example (parameter outParamFile). The resulting XML file histo.xml contains the following output (excerpt):
Example 2 shows how to derive histograms of grid datasets. To generate the underlying grids, please perform the following preprocessing steps:
This procedure first imports the data of strips 11 and 21 (strip11/21.las) into separate ODM files and generates surface models as well as additional "sigmaZ" (=smoothness) and "excenter" (=extrapolation/occlusion) layers for each strip (strip11.tif, strip11_sigmaZ.tif, strip11_excen.tif). Subsequently, a strip difference model (diff_11_21.tif) and a respective grid mask (diff_11_21_mask.tif) are derived. The examples below (results cf. Figure 2) show the distribution of the grid heights and the excenter layer of the surface model of strip 11. Furthermore, the height differences between strips 11 and 21 are analyzed in two variants, one using all available height differences in the strip overlap area, and the other one considering a grid mask to exclude all rough and occluded pixels.
Please further note that wildcards are supported for specifying multiple input datasets.
By default, the module uses the 2% and 98% quantiles to limit the histogram. This typically moves outliers to the underflow and overflow bins and presents the relevant data with appropriate resolution. The following example shows the effect on the z and amplitude attributes for the demo dataset strip11.
Since the dataset contains long ranges and high amplitude values, the full range histograms are not very meaningful (upper images of Figure 3), whereas the distribution of the attributes is clearly visible using the standard sample range (lower images of Figure 3).
The main benefit of running Module Histo in Python is that the histogram as well as all standard statistics (min, mean, median, ...) are directly accessible after running the module via the Python API. This is exemplified in the sample script $OPALS_ROOT/demo/histoDemo.py:
To run the script, type:
The script queries the output parameter histogram provided by Module Histo as a complex Python type (class) and uses its access functions to print the min, median, and 95% height quantile of dataset <fullwave.odm>. The following output is printed to the screen:
This example demonstrates the skipEmptyBins parameter and its effect on the histogram. For non-continuous (integer) attributes like classification values, a continuous histogram that covers the full data range might be inappropriate, since the magnitude of those values is typically irrelevant. When the skipEmptyBins parameter is activated as shown below, only occupied bins are plotted, directly next to each other and regardless of the bin borders. To make non-continuous histograms easily recognizable, a uniform gap is inserted between all bars.
The skipEmptyBins parameter not only influences the visual presentation of the histogram but also the actual bin table within the OPALS log as well as the Python histogram object.
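The effect on the bin table can be illustrated with a short Python sketch. It only mimics the idea of listing occupied bins; the function name `occupied_bins` is hypothetical and the class codes in the usage note are arbitrary sample values.

```python
from collections import Counter


def occupied_bins(values):
    """Return (value, count) pairs for occupied integer bins only.

    Empty bins between the smallest and largest value are omitted,
    which mimics the idea behind skipEmptyBins.
    """
    counts = Counter(values)
    return sorted(counts.items())
```

For the sample class codes 2, 2, 9, 26, 2 the resulting table has three rows (2, 9 and 26), whereas a continuous integer histogram from 2 to 26 would contain 25 bins, 22 of them empty.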
Ressl, C., Kager, H. and Mandlburger, G., 2008. Quality checking of ALS projects using statistics of strip differences. In: IAPRS, XXXVII, pp. 253-260.
Jain, R. and Chlamtac, I., 1985. The \(P^2\) algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM, 28(10), pp. 1076-1085.