Parallel Processing

When working with large data sets, it may become necessary to parallelise processing beyond the built-in module multithreading support. In opals, four concepts of different complexity are currently realised, which can be controlled by the parameter distribute of the workflow scripts. In the following, the different concepts, their complexity and their activation are described.

  • Concept 1 - single core processing: -distribute 1 or -distribute None
    This is the standard option for the workflow scripts. All modules are run sequentially and only in-module parallelisation using multiple threads is used.
  • Concept 2 - multi core processing: -distribute <processCount>
    This concept is used for powerful computers where module-based multithreading cannot fully load the CPU. Depending on the processCount, a corresponding number of process workers are started which run modules in parallel (if possible).
  • Concept 3 - multi machine processing using ipp: -distribute ipp
    With ipp (ipyparallel), multiple machines can take part in the processing. All I/O operations are performed on a file server (i.e. no local file access is performed). More details on how to set up an ipp environment with controllers and engines can be found below.
  • Concept 4 - multi machine processing using ipp and opalsTask: -distribute ipp
    In case the network and/or the file server become a bottleneck, opalsTask can make better use of the local resources of the participating machines. In this concept, the input files are copied to a local temp directory, processed locally, and the results are finally copied back to the file server. For more information about opalsTask, see the corresponding documentation.

In the following sections, each concept and its application is explained in detail. But first, to better understand the different aspects of parallelisation, the differences between multithreading and multi(core)processing are discussed.

Multithreading vs. multi(core)processing

There are a few key differences between multiprocessing and multithreading. A process can be imagined as an execution unit of a whole program while threads represent execution units/segments of a process. Processes are independent of each other and have individual memory space allocated to them. Threads on the other hand share memory.

Multithreading and multiprocessing have different applications in the opals framework. A classic example of multiprocessing would be the use of Module Import to import multiple strips into multiple ODM files. As each of the Module Import instances can run independently, using multiple processes speeds up the computation. An example of multithreading, on the other hand, would be a "divide and conquer" approach to writing a raster file, which can be divided into smaller tiles. Those are then handled by different threads and later combined into the finished raster file.

Though the theoretical speedup of a program using multiprocessing techniques is given by Amdahl's Law, in reality other factors limit the maximum possible speedup. In opals scripts like the package opalsQuality, for example, certain parts of the script must finish before others can start, which of course increases the runtime. There is also some unavoidable overhead of the Python multiprocessing framework that has to be taken into account.
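Amdahl's Law itself is easy to evaluate; a minimal sketch (the helper name is our own, not an opals function), where p is the parallelisable fraction of the total runtime and n the number of workers:

```python
# Amdahl's law: theoretical speedup with n workers when a fraction p
# of the runtime can be parallelised.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even a highly parallel script (p = 0.9) is capped at 10x:
print(round(amdahl_speedup(0.9, 6), 2))      # 4.0 with 6 workers
print(round(amdahl_speedup(0.9, 10**6), 2))  # 10.0 at the limit
```

The serial fraction (1 - p) dominates quickly, which is why adding ever more processes yields diminishing returns.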

Concept 1: single core processing

As explained above, this is the standard processing mode when running an opals workflow script. During computation, only one core of the computer where the script is run is utilised. Each module call within the script is executed sequentially, i.e. one after the other. As this is quite inefficient on modern computers with 8+ cores, utilising concepts 2-4 is recommended to achieve shorter runtimes through optimal usage of the resources available.

Concept 2: multi core processing

If computations are run on a single computer but multiple cores should be utilised, multi core processing is the right choice. If you want to run the script _grid.py using 6 processes, for example, you can do so by setting the -distribute parameter to 6. E.g. for the demo data:

strip11.laz
strip12.laz
strip21.laz
strip22.laz
strip31.laz
strip32.laz
_grid.py -inFile strip??.laz -distribute 6

The runtime of this particular call will be about 3.5 times faster than when using the standard distribute value 1. The speedup can of course vary, depending on the hardware you are using.

Though you can choose the value of -distribute freely, it does not make sense to choose a value higher than the number of cores your computer has. In most cases, choosing a value higher than the number of cores will impact the runtime negatively. For most opals scripts, a good value for the number of processes is the number of input files you have (again, keeping your number of cores in mind). Running the above example of _grid.py with more than 6 processes does not increase the speedup, as for this script there are only n different tasks to be performed at once, n being the number of input files.
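This rule of thumb can be sketched as a small helper (the function and its name are our own illustration, not part of opals): one worker per input file, capped at the core count.

```python
import os

def suggested_workers(in_files, cpu_count=None):
    # one worker per input file, capped at the number of cores
    cpu_count = cpu_count or os.cpu_count() or 1
    return max(1, min(len(in_files), cpu_count))

strips = ["strip11.laz", "strip12.laz", "strip21.laz",
          "strip22.laz", "strip31.laz", "strip32.laz"]
print(suggested_workers(strips, cpu_count=16))  # 6 -- extra workers would sit idle
```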

A schematic representation of the distribution of module runs using different numbers of workers/processes can be seen in Fig.1.

Fig. 1: Possible distribution of module runs using different values for -distribute

Example of a possible speedup using multicore processing

As said in the opening paragraph, not only the number of processes can be chosen but also the number of threads that a program uses. The number of threads is controlled by the common parameter nbThreads and can be set in the .cfg file like so (excerpt of the .cfg file of opalsQuality):

#############################################################################
########################### common parameters ###############################
#############################################################################
nbThreads=4
#############################################################################
####################### parameters for opalsGrid ###########################
#############################################################################
gridSize=1
interpolation=movingPlanes
neighbours=8
. . .

To give an idea of the magnitude of speedup achievable, different combinations of numbers of processes/threads were tested on the script _grid.py and the results are depicted below in Fig.2. The demo dataset Loosdorf (to be integrated as a use case example) containing 12 strips was used as the input. The numbers of processes (i.e. the value of the parameter -distribute) were [1,2,4,8,12]. For each configuration, each of the following numbers of threads was used to run _grid.py: [1,2,4,8,16,32]. The computer used to run these tests has 16 cores and 32 threads.

Fig. 2: Speedup of _grid.py, varying number of processes/threads

In order to interpret these results it is important to note that two runtimes of one and the same configuration can differ by up to 10-20 seconds. Having said that, for this dataset an increase in the number of processes reduced the runtime in all configurations. Regarding the number of threads, however, no such clear statement can be made. Increasing the number of threads up to 4 decreased the runtime in every test; with 8 or more threads the picture is less clear. While the general trend indicates more threads --> more speedup, there are cases where using more threads actually slows down the program. This is due to the shared memory space between threads described above, which does not exist between processes: the threads (while each handling a smaller workload) can get in each other's way and perform worse than fewer threads would.

Which number of processes/threads to use

The main takeaway of the table in Fig.2 is that increasing the number of processes is in most cases a good way to speed up your program (again keeping the specs of your hardware in mind). One should, on the other hand, be more careful about increasing the number of threads. The speedup may also be limited by the type of data drive you are using: an HDD (Hard Disk Drive), or an SSD (Solid State Drive) which can operate much faster. Additional processes are typically more demanding on the main memory (RAM) of your computer than additional threads, so the amount of RAM available is something to keep in mind, as it can also become a bottleneck that limits the speedup of your program.

Example script showing how the multiprocessor object works

The instantiation of a multiprocessor object is normally done by helper scripts analysing the input argument -distribute. It can, however, easily be instantiated and used in your own scripts. This example is supposed to showcase how the multiprocessor in the opals framework works and to display the most important functions used by scripts like opalsQuality. Another purpose of the script is to demonstrate that using multiprocessing does not prevent querying modules after they have finished, just as when running a module with .run().

from opals.tools import processor
from opals import Info
import os

if __name__ == '__main__':
    # First the input files are specified
    infPath = r"D:\fhacksto\opals\distro\demo"  # path to your opals installation needs to be set
    fileNames = ["strip11.laz", "strip21.laz", "strip31.laz"]
    inFiles = [infPath + os.sep + inFile for inFile in fileNames]
    # a multiprocessor object is instantiated using 3 workers. This is normally done during the analysis of the input
    # arguments by the scripts opalsargparse.py and analyzeInput.py, controlled by the -distribute parameter. To make
    # this work as a stand-alone script however, we need to instantiate it ourselves.
    proc = processor.multiprocess(3)
    # we then create an instance of opalsInfo, setting exactComputation to 1 as we want to extract some statistics later.
    mod = Info.Info(exactComputation=1)
    # We can now loop over our .laz files and set each of them as the input parameter for opalsInfo
    for inf in inFiles:
        print(f"***Starting processing of {inf}...***")
        mod.inFile = inf
        # Using the multiprocessor function "submit" we hand our module to the processor. Using the parameter
        # "finishedText" we can set a message that is printed when the module has finished running. As soon as the
        # module instance is submitted, it is assigned to a worker and executed in a separate process while the loop
        # continues running.
        proc.submit(mod, finishedText=f'***Finished processing of {inf} ***')
    # the member function "join" waits until all processes are finished so we can start analyzing the outputs.
    proc.join()
    # with indexing "[i]" (just like you would with e.g. a list) you can access the instances of the modules after
    # they finish. The order is determined by their submission time, NOT the finish time. I.e. the position of a
    # certain file in the input list corresponds to the index of the module in the processor which used said file
    # as the input.
    # The next two lines of code showcase how you can query the CRS of the first submitted file.
    fileCRS = proc[0].statistic[0].getCoordRefSys()
    print(f"The first processed file has the wkt-string {fileCRS}")
    # The next snippet of code shows how you can iterate over the processor (also like you would with e.g. a list)
    # to determine the file with the highest point density of your dataset.
    maxPDens = 0
    for mod in proc:
        filePDens = mod.statistic[0].getPointDensity()
        if filePDens > maxPDens:
            maxPDens = filePDens
            maxPDensFile = mod.inFile
    print(f"The file {maxPDensFile} has the highest point density ({maxPDens}) of the dataset.")

For more information about the member functions of the multiprocess class please visit the multiprocess Class Reference.

Concept 3: multi machine processing using ipp

When processing large projects with multiple machines available, it makes sense to be able to use those machines. With the above described form of multiprocessing this is not possible, which is why a third option for the -distribute parameter was implemented, namely ipp. ipp is short for ipyparallel, a framework for parallel computing.

To use ipyparallel and the distribute option ipp, ipyparallel needs to be installed.

pip install ipyparallel

First, a so-called ipcontroller is started using the following command, inserting the IP address of your machine.

ipcontroller --ip=%IP_ADDRESS%

This creates several folders and files in the local user directory. Engines (workers) can then be started using a .json file, located at C:\Users\%USER%\.ipython\profile_default\security. In a separate shell (or multiple), execute the following command.

ipengine --file=C:\\Users\\%USER%\\.ipython\\profile_default\\security\\ipcontroller-engine.json

Doing this 3 times using 3 different opals shells results in 3 engines (workers) that can execute tasks (see Figure 3). Provided they can communicate via ssh, engines can be started on different machines using the same .json file. The file obviously has to be accessible to all machines, so it can either be copied to the machines individually or copied to a shared network location and accessed that way.

Fig. 3: Example of a controller and 3 engines.

Now that the controller and engines are set up, an opals workflow script where ipp is implemented can be called, and the tasks are distributed to those engines by setting the parameter -distribute to ipp.

Since in concept 3 all file access goes through a shared network and not through local storage, the network can quickly become a bottleneck. This problem is addressed in Concept 4.

Concept 4: multi machine processing using ipp and opalsTask

As explained in the paragraph above, when computing on multiple machines the full local computing power should be used. Therefore, in concept 4 the input files of a certain processing unit are copied from (e.g.) a server into a temp directory of the machine to which the task is assigned; computations are then run locally, and in the end the output files are copied back to the server. In opals, opalsTask was implemented to handle this process of creating temp directories and copying files in the background. For more details, please see the documentation for opalsTask. opalsTask is not (de-)activated with a script parameter, but is rather a property of the script itself, i.e. whether opalsTask is used to define the tasks needed for a certain workflow. The only difference between concept 4 and concept 3 is the ability to compute the tasks locally instead of over a shared network.
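The copy-in / process / copy-back pattern that opalsTask automates can be sketched generically with the Python standard library (the process() callback and the output naming are illustrative assumptions, not the opalsTask API):

```python
# Generic sketch of the concept-4 pattern: copy input to a local temp
# directory, compute there, copy the result back next to the server file.
import shutil
import tempfile
from pathlib import Path

def run_locally(server_file, process):
    with tempfile.TemporaryDirectory() as tmp:
        # 1. copy the input from the server to the local temp directory
        local_in = Path(tmp) / Path(server_file).name
        shutil.copy(server_file, local_in)
        # 2. run the computation on the local copy
        local_out = process(local_in)
        # 3. copy the result back to the server (illustrative ".out" naming)
        result = Path(server_file).with_suffix('.out')
        shutil.copy(local_out, result)
        return result
```

The temp directory is removed automatically once the task finishes, so only the copied-back result remains on the server.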

opalsTask is currently realised in the following scripts:

  • opalsQuality

Input for the processor: OPALS modules and Python Modules

So far we have only talked about OPALS modules such as opalsInfo and opalsImport as input for the processor, be it in single, multicore or multimachine processing mode. These OPALS modules can be submitted directly to the processor, or multiple modules can first be gathered into an opalsTask, which can then in turn be submitted to the processor. However, more complex structures can also be processed in parallel, in the form of so-called "Python Modules".

For different reasons, it can be desirable not to submit modules or tasks directly but instead to wrap the modules in Python code, which can then be submitted. There are two things that need to be taken care of in order for this to work.

Firstly, the processor needs to recognize the code as a "Python Module".
