Parallelizing Basins

locate_basins

Locating outlet basins is computationally expensive, so the raster, isnested, geojson, and save commands may take a long time to run when working with basins:

# May take a long time to run
segments.raster(..., basins=True)
segments.isnested()
segments.geojson(..., type="basins")
segments.save(..., type="basins")

When this is the case, you may be able to use parallelization to speed up the computation. The key command for parallelization is locate_basins. This command locates the terminal outlet basins and stores their locations internally for later use:

# This will still take a while
segments.locate_basins()

# But now these commands are fast,
# because the basins were pre-located
segments.raster(..., basins=True)
segments.isnested()
segments.geojson(..., type="basins")
segments.save(..., type="basins")

Setting the parallel option to True will cause the command to use multiple CPUs to locate the basins, which can potentially speed up this process:

# Potentially much faster
segments.locate_basins(parallel=True)

By default, the method uses one fewer process than the number of available CPUs (one CPU is reserved for the current process), but you can use the nprocess option to specify the number of parallel processes explicitly:

# Explicit number of processes
segments.locate_basins(parallel=True, nprocess=11)
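
If you would rather compute the process count than hard-code it, the following sketch mirrors the default described above (available CPUs minus one) and optionally caps it. The helper name pick_nprocess and its limit argument are hypothetical and not part of pfdf, and os.cpu_count() may not match what pfdf counts as "available" CPUs, so treat the result as an approximation:

```python
import os

def pick_nprocess(limit=None):
    """Hypothetical helper (not part of pfdf): CPUs - 1, at least 1, optionally capped."""
    cpus = os.cpu_count() or 1       # os.cpu_count() can return None
    nprocess = max(1, cpus - 1)      # leave one CPU for the current process
    if limit is not None:
        nprocess = min(nprocess, max(1, limit))
    return nprocess

# Usage, assuming `segments` is an existing Segments object:
# segments.locate_basins(parallel=True, nprocess=pick_nprocess(limit=4))
```

Capping the count can be useful on shared machines, where using every available CPU may not be appropriate.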

Note

Speed improvements will scale with the number of CPUs and the size of the watershed. Small watersheds will benefit minimally, and may even run slower due to parallelization overhead.
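
To check whether parallelization actually helps for a particular watershed, you can time the call. This is a minimal sketch using only the standard library; the helper name time_call is hypothetical and not part of pfdf:

```python
import time

def time_call(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage, assuming `segments` was just built and no basins are located yet:
# _, seconds = time_call(segments.locate_basins, parallel=True)
# print(f"Located basins in {seconds:.1f} s")
```

Because located basins are stored internally, a fair comparison of the serial and parallel variants probably requires timing each one on a freshly built Segments object.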

Tip

Filtering the network may delete internally saved basin locations. As such, we recommend calling locate_basins after any filtering.

Requirements

The use of the parallel option adds two requirements. First, the parallel option cannot be used in an interactive Python session (it will cause the terminal to crash). Instead, you must use the option in a script called from the command line. For example:

python -m my_parallel_script
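
If the same code may run both interactively and from the command line, one option is to enable the parallel option only outside an interactive session. The sketch below is a heuristic, not a pfdf feature: the helper name is hypothetical, and the check relies on the Python convention that sys.ps1 is defined only in interactive mode, which may not cover every environment (some notebooks, for example):

```python
import sys

def in_interactive_session():
    """Heuristic: True when Python was started interactively (REPL or `python -i`)."""
    return hasattr(sys, "ps1") or bool(sys.flags.interactive)

# Fall back to the serial computation when running interactively
use_parallel = not in_interactive_session()

# Usage, assuming `segments` is an existing Segments object:
# segments.locate_basins(parallel=use_parallel)
```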

Second, the code in the parallelized script must be protected by an if __name__ == "__main__" block. For example:

from pfdf.segments import Segments
from pfdf.raster import Raster

if __name__ == "__main__":
    flow = Raster('flow.tif')
    mask = Raster('mask.tif')
    segments = Segments(flow, mask)

    segments.locate_basins(parallel=True)
    segments.save('my-basins.shp', type="basins")

Neglecting this step will cause the routine to attempt to create an infinite number of parallel processes, crashing the terminal.
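
The same guard is needed with Python's standard multiprocessing module, which shows the pattern independently of pfdf. Whether pfdf uses multiprocessing internally is an assumption here, and the function names are illustrative only. On platforms that start workers by re-importing the main script, any unguarded top-level code runs again in every worker:

```python
from multiprocessing import Pool

def square(x):
    """Worker function; defined at module level so worker processes can import it."""
    return x * x

def run_parallel(values, nprocess=2):
    """Map square over values using a pool of worker processes."""
    with Pool(processes=nprocess) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    # Only the parent process enters this block. Without the guard, workers
    # that re-import this script would try to start pools of their own.
    print(run_parallel([1, 2, 3]))  # prints [1, 4, 9]
```

Keeping function definitions at module level and all executable steps inside the guarded block is the safest layout for any script that starts parallel processes.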

Tip

Read the parallel basins tutorial for a detailed example of basin parallelization.