Python Bindings of the OPALS Datamanager Library

The Python binding (pyDM) of the C++ OPALS Datamanager Library exposes its functionality to Python. Apart from some minor naming differences and a few small differences caused by the language itself (mainly due to the Python iterator concept), the Python binding matches the C++ interface almost completely.

Before using the DM Library you should at least read the section The C++ OPALS Datamanager Library, since it describes the central implementation concepts of the library. Then have a look at the examples section, which should give you a good starting point for your development. Each example is given in C++ and Python, which makes the high similarity of the two interfaces easy to see.

Differences of the DM API in C++ and Python

There are three major differences, which are described in detail in the following:

  • The C++ library makes use of the loose coupling principle. This means that the implementation is fully hidden from the API user. The functionality is exposed through interfaces (abstract classes) only, which is why all interface class names start with an I. DM objects are created either through special 'New' functions (DM::IPoint::New, DM::IWindow::New, ...) or, in case of complex object construction, through factory objects (DM::IPolygonFactory, DM::IAddInfoLayoutFactory, ...). All DM objects are therefore created on the heap, which leaves their destruction to the developer. To support the developer and to avoid memory leaks, the C++ library comes with a smart pointer class (DM::Handle) that automatically takes care of object destruction. For each interface a corresponding smart pointer class exists (e.g. DM::IPoint / DM::PointHandle, DM::IBox / DM::BoxHandle). This effort is not necessary in Python, since the language natively manages object lifetimes (reference counting and garbage collection). Hence, the Python classes are named like the C++ classes without the leading I. A few examples are listed in the table in the section below.
  • The static 'New' functions (construction of objects) in C++ were translated to the native Python object constructors ('__init__'). The following example shows the creation of a DM point (coordinates set to 0/0/0) in C++

    DM::PointHandle pt = DM::IPoint::New(0,0,0);

    and in python

    pt = pyDM.Point(0,0,0)
  • Although both Python and C++ support iterators, there are significant differences in their usage. The following code snippet shows how to iterate over all points of an ODM in C++

    for(auto it = odm->beginPoints(); it != odm->endPoints(); ++it)
    {
        const DM::IPoint &pt = *it;
        // do something with DM point pt
    }

    and in python

    for pt in odm.points():
        # do something with DM point pt
        pass
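
The iteration difference can be illustrated without pyDM. The following minimal sketch uses a hypothetical stand-in class (FakeOdm is not part of pyDM and only mimics the shape of the odm.points() call). It shows that the Python for loop is shorthand for the explicit iter()/next() protocol, which takes the place of the begin/end iterator pair used in C++.

```python
# Stand-in for an ODM. FakeOdm and its contents are hypothetical and only
# mimic the shape of the pyDM call odm.points().
class FakeOdm:
    def __init__(self, coords):
        self._coords = list(coords)

    def points(self):
        # A generator plays the role of the C++ beginPoints()/endPoints() pair.
        for xyz in self._coords:
            yield xyz


odm = FakeOdm([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)])

# Idiomatic Python: the for statement drives the iterator.
visited_for = [pt for pt in odm.points()]

# Equivalent explicit form: next() raises StopIteration at the end,
# which corresponds to reaching endPoints() in C++.
visited_manual = []
it = iter(odm.points())
while True:
    try:
        pt = next(it)
    except StopIteration:
        break
    visited_manual.append(pt)

print(visited_for == visited_manual)
```

Because the end of the sequence is signalled by the iterator itself, no separate end iterator has to be compared in Python.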

Class Naming Schema in C++ and Python

C++ class name      Python class name    Description
DM::IPoint          pyDM.Point           3d point object
DM::IBox            pyDM.Box             3d box object
DM::IDatamanager    pyDM.Datamanager     ODM object

Performance of the Python Binding

Python is a powerful scripting language and therefore an efficient tool for prototyping. Thanks to the huge number of extensions (many optimised C++ libraries provide a Python binding, e.g. SciPy, NumPy, ...), Python has turned into a programming language for real-life software projects. In terms of performance, however, Python cannot compete with C++ or similar languages. This is of minor concern in 'low performance' applications or if time-critical sections are computed by optimised libraries 'outside' of Python. The Python bindings of the OPALS modules represent the latter case. The Python binding of the DM provides low-level access to the objects within an ODM and allows manipulating or processing ODMs in a similar way as existing OPALS modules do. Considering the aforementioned statements, processing huge point clouds on a point-by-point basis within Python is not recommended. The bindings are useful for testing new processing strategies on small test sets, but for huge data sets (billions of points) it is recommended to switch to C++ or to use high-level functionality only (e.g. Datamanager.getHistogramSet, RConverter, etc.). Depending on the task to perform, single point processing in Python is 3 to 10 times slower than in C++ (without any optimisation in C++).

Increasing Performance using NumPy

As mentioned before, the pyDM interface provides low-level access to points and their attributes, which can be quite slow. For certain operations it is much faster to retrieve points and attributes as NumPy arrays and to use optimised NumPy functions for processing. For example, OPALS uses NumPy arrays to pass attribute information (as files) to R within the tree-based classification. To support different attribute data types, pyDM works with dictionaries of one-dimensional NumPy arrays (similar to the pandas DataFrame class).
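
The effect can be tried out with NumPy alone. The following sketch is not part of the OPALS examples: it uses synthetic random heights in place of a 'z' array retrieved from an ODM and compares a point-by-point loop with the equivalent NumPy reduction. The measured ratio depends on hardware and array size, so no particular speed-up is implied.

```python
import timeit

import numpy as np

# Synthetic heights standing in for the 'z' array of a NumPy dict;
# the values are random and not taken from any real ODM.
rng = np.random.default_rng(seed=0)
z = rng.uniform(270.0, 275.0, size=100_000)


def mean_loop(values):
    # Point-by-point processing in the Python interpreter.
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_numpy(values):
    # The same reduction executed inside NumPy's compiled code.
    return values.mean()


t_loop = timeit.timeit(lambda: mean_loop(z), number=3)
t_numpy = timeit.timeit(lambda: mean_numpy(z), number=3)
print(f"loop: {t_loop:.4f} s, numpy: {t_numpy:.4f} s")
```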

As shown in the example DM_numpy_spatial_query.py, the results of spatial queries can be retrieved as a dictionary of NumPy arrays (subsequently referred to as NumPy dict) using the pyDM.NumpyConverter class. Attributes of interest need to be defined in the corresponding query layout. The x, y and z coordinates are also added to the dict as NumPy arrays if withCoordinates is enabled. Consider the following code snippet

print("\nPerform spatial query")
result = pyDM.NumpyConverter.searchPoint(dm, queryPoly, layout, withCoordinates=True) # polygon query without filter
#result = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, filter=filter) # window query with filter
print("result=", result) # print returned object

which will print

Perform spatial query
result= {'x': array([529600.38699913, 529601.09499931, 529600.65399933, ...,
529648.82299805, 529649.56400299, 529649.1210022 ]), 'y': array([5338600.28499985, 5338600.32199955, 5338600.47299957, ...,
5338649.74700165, 5338649.83300018, 5338649.95500183]), 'z': array([271.796 , 271.81800002, 271.81199998, ..., 271.85600001,
271.90499997, 271.87 ]), 'Id': array([2147509245, 6442456307, 6442456304, ..., 6442477376, 6442477381,
6442477377], dtype=int64), 'GPSTime': array([37986.15342899, 37986.16094689, 37986.15343149, ...,
37985.89848596, 37985.90600406, 37985.89848866]), 'Amplitude': array([118., 118., 147., ..., 122., 134., 129.], dtype=float32), 'Classification': array([2, 2, 2, ..., 2, 2, 2], dtype=uint8)}

As can be seen, the array dtype matches the attribute type in the ODM (see also ODM as a database table).
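
Once the data is available as a NumPy dict, it can be processed with ordinary NumPy indexing. The sketch below builds a small dict by hand with the same keys and dtypes as the printed result above (the values are illustrative only, and no pyDM call is involved) and keeps only the points of one class by applying a boolean mask to every column.

```python
import numpy as np

# Hand-made NumPy dict with the same keys and dtypes as the printed result
# above; the values are illustrative only.
result = {
    "x": np.array([529600.39, 529601.09, 529600.65, 529648.82]),
    "y": np.array([5338600.28, 5338600.32, 5338600.47, 5338649.75]),
    "z": np.array([271.796, 271.818, 271.812, 271.856]),
    "Amplitude": np.array([118.0, 118.0, 147.0, 122.0], dtype=np.float32),
    "Classification": np.array([2, 2, 5, 2], dtype=np.uint8),
}

# Boolean mask selecting all points with Classification == 2.
is_class2 = result["Classification"] == 2

# Apply the same mask to every column to keep the rows aligned.
subset = {name: column[is_class2] for name, column in result.items()}

print("points kept:", subset["z"].size)
print("mean z:", subset["z"].mean())
print("max amplitude:", subset["Amplitude"].max())
```

Indexing with a boolean mask preserves the dtype of each column, so the subset can be processed in the same way as the full query result.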

A central feature of the ODM is that it supports null values for attributes. To this end, the ODM uses separate null flags (rather than a specific no data value). So when querying NumPy dicts, one needs to consider how null values should be translated to NumPy arrays. pyDM supports three different ways:

  • no null values: If the attributes do not contain any null values, no special handling is required for the conversion to NumPy arrays. It's the fastest conversion method, but an exception will be thrown if any null values occur.
  • no data value: This method indicates null values by a specific no data value. The user can provide one value for all attributes or a list of values that matches the query layout. Besides standard integer or floating-point values, the labels 'max' and 'min' are also supported. In the latter case, data type specific minimum and maximum values are used (see ODM as a database table for the minimum and maximum values of the different data types). It's the user's responsibility to select an appropriate no data value that doesn't occur in the data.
  • masked arrays: MaskedArray is a special NumPy class that combines a data array and a mask array within one object. This option has the advantage that no no data value needs to be defined. On the other hand, masked arrays consume more memory and are a bit slower to create. This method is enabled by the 'mask' label.

The null value conversion method is controlled by the noDataObj parameter. The DM_numpy_null_value.py example demonstrates all three methods of null value handling by creating a few random points with two attributes. Whereas _attr1 is set for all points, _attr2 is set for only half of the points. Hence, the correct null value handling method is essential when querying _attr2.

As shown in the following code snippet, only _attr1 (referred to as shortLayout) can be converted without specifying the noDataObj parameter. Trying to retrieve both attributes (without noDataObj) will lead to an exception.

# We know that shortLayout, which only contains _attr1, doesn't have any null values.
# Hence, there is no need to provide noDataObj
resultNoNulls = pyDM.NumpyConverter.searchPoint(dm, queryWin, shortLayout, withCoordinates=True)
resultNoNulls2 = None
try:
    # since _attr2 is not set for all points, an exception will occur; catch it and continue
    resultNoNulls2 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True)
except Exception as e:
    print("Exception occurred as expected:", e)
assert resultNoNulls2 is None # just to make sure that resultNoNulls2 is still None
print("\tno nulls=", resultNoNulls)

To overcome this exception, one needs to set the noDataObj parameter. As mentioned above, it is possible to provide one value for all attributes or a list of values matching the number of layout columns. The code example below uses 'min' in the first call and '[0, numpy.nan]' in the second call. 'min' results in the lowest possible value of the corresponding attribute type, which is 0 for _attr1 (type: uint16) and -3.40282347e+38 for _attr2 (type: float). The second call uses 0 and NaN (Not a Number; only possible for floating-point data types) as no data values. Attention: pyDM doesn't check whether any attribute values are equal to the no data values. If this is the case, subsequent processing might interpret set attribute values as null values.

# use the minimum of the corresponding data type for all attributes as no data value
resultNoData1 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, noDataObj='min')
# use 0 as no data value for attr1 and NaN (=Not a Number) for attr2 (only possible for float and double attributes)
resultNoData2 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, noDataObj=[0, numpy.nan])
print("\tnodata1=", resultNoData1)
print("\tnodata2=", resultNoData2)
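
The data type limits quoted above can be checked with NumPy itself. The sketch below assumes that the 'min' and 'max' labels correspond to the limits reported by numpy.iinfo and numpy.finfo; this agrees with the values quoted above for uint16 and float, but is otherwise an assumption. It also illustrates, with made-up arrays and hypothetical null flags, why a no data value that occurs as a regular attribute value is ambiguous.

```python
import numpy as np

# Type limits as reported by NumPy for the two attribute types used above.
uint16_min = np.iinfo(np.uint16).min      # lowest uint16 value
uint16_max = np.iinfo(np.uint16).max      # highest uint16 value
float32_min = np.finfo(np.float32).min    # most negative finite float32
print(uint16_min, uint16_max, float32_min)

# Made-up attribute column in which 0 is a legitimate measurement.
attr1 = np.array([0, 7, 12, 0], dtype=np.uint16)
# Hypothetical null flags: only the last entry is really null.
is_null = np.array([False, False, False, True])

# Encode nulls with the no data value 0 ...
nodata = 0
encoded = np.where(is_null, nodata, attr1).astype(np.uint16)
# ... and try to recover them: the genuine 0 at index 0 is now
# indistinguishable from the null at index 3.
recovered = encoded == nodata
print("true nulls:     ", is_null)
print("recovered nulls:", recovered)

# NaN must be tested with isnan(), because NaN never compares equal to itself.
attr2 = np.array([1.5, np.nan, 3.0], dtype=np.float32)
print("NaN entries:", np.isnan(attr2))
```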

To overcome the aforementioned problem, one can retrieve attributes as masked NumPy arrays. This doesn't require the definition of an appropriate no data value, since the null flag status is stored in a separate mask array. MaskedArray objects are natively supported by NumPy, which makes them both flexible and efficient to use. The only downside of masked arrays is the higher memory consumption due to the additional mask array. As shown below, use 'mask' in the noDataObj parameter to retrieve the corresponding objects.

# use masked arrays for all attributes
resultMasked1 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, noDataObj='mask')
# use the maximum value as no data for attr1 and a MaskedArray for attr2
resultMasked2 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, noDataObj=['max', 'mask'])
print("\tmasked1=", resultMasked1)
print("\tmasked2=", resultMasked2)
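
Masked arrays can be processed with the regular numpy.ma functionality. The following sketch constructs a MaskedArray by hand (made-up values and a hypothetical null mask, no pyDM call involved) and shows a few common operations that skip the masked entries.

```python
import numpy as np

# Made-up attribute values with a hypothetical null mask (True = null).
attr2 = np.ma.masked_array(
    data=np.array([1.5, 0.0, 3.5, 0.0], dtype=np.float32),
    mask=np.array([False, True, False, True]),
)

print("valid count:", attr2.count())         # number of non-null entries
print("mean of valid:", attr2.mean())        # masked entries are ignored
print("valid values:", attr2.compressed())   # plain ndarray without nulls

# Convert to a plain array with an explicit no data value when required.
filled = attr2.filled(np.nan)
print("filled:", filled)
```

Statistics such as mean() ignore the masked entries, so the placeholder values stored underneath the mask do not influence the result.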

Defining the Value Type of Output NumPy Arrays

In some situations it might be beneficial to retrieve the NumPy arrays with a certain value type, instead of the ODM attribute type. Although NumPy arrays can be easily converted (see astype functionality for details), it's an extra step that requires additional memory. Therefore, pyDM provides the valueType parameter when retrieving NumPy arrays. This allows setting one specific value type for all returned objects (setting different value types for different attributes is not possible).

As shown below, valueType can be set as NumPy dtype or as pyDM.ColumnType:

# returns float64/double arrays, as defined by NumPy dtype
resultType1 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, noDataObj=numpy.nan,
                                              valueType=numpy.dtype(float))
# returns float64/double arrays, as defined by pyDM.ColumnType
resultType2 = pyDM.NumpyConverter.searchPoint(dm, queryWin, layout, withCoordinates=True, noDataObj=numpy.nan,
                                              valueType=pyDM.ColumnType.double_)
print("\tvalueType1=", resultType1)
print("\tvalueType2=", resultType2)
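
The extra memory needed by a later conversion can be seen with plain NumPy. The sketch below uses a made-up uint8 column and shows that astype creates a second, independent array, so both copies exist in memory at the same time.

```python
import numpy as np

# Made-up uint8 column standing in for an attribute such as Classification.
classification = np.array([2, 2, 5, 2], dtype=np.uint8)

# astype() allocates a new array; the original stays in memory as well.
as_double = classification.astype(np.float64)

print(classification.dtype, classification.nbytes, "bytes")
print(as_double.dtype, as_double.nbytes, "bytes")
print("shares memory:", np.shares_memory(classification, as_double))
```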

Notes on the Python Documentation

Due to the high similarity of the C++ and the Python DM API, the C++ documentation of the DM is 'copied' to Python. The Python documentation may therefore contain links to the C++ documentation. If a function is not documented, it is well worth having a look at the C++ documentation, where it may be described.
Please note that the Python documentation is derived in an external process, which is why the Python DM module only partly contains the corresponding doc strings. So always have a look at the external documentation.

Examples

Examples demonstrating the usage of the python DM API can be found here
