Parallelization Tutorial

This tutorial shows how to locate outlet basins using multiple CPUs in parallel.

Introduction

Locating outlet basins is one of the most computationally demanding steps in many assessments. As such, parallelizing this task can sometimes improve runtime. This tutorial examines how and when to parallelize code that locates basins.

Prerequisites

Install pfdf

To run this tutorial, you must have installed pfdf 3+ with tutorial resources in your Jupyter kernel. The following line checks that this is the case:

import check_installation

pfdf and tutorial resources are installed

Previous Tutorials

You must run the Preprocessing Tutorial before this tutorial. This is because we’ll use the preprocessed datasets to derive a stream segment network for this tutorial. The following line checks the workspace for the preprocessed datasets:

from tools import workspace
workspace.check_preprocessed()

The preprocessed datasets are present

We also strongly recommend completing the Hazard Assessment Tutorial before this one, because this tutorial assumes familiarity with many of the concepts introduced there.

Example Network

Next, we’ll build an example stream segment network. This process is explored in detail in the Hazard Assessment Tutorial.

from tools import examples
segments = examples.build_segments()
print(segments)

Segments:
    Total Segments: 697
    Local Networks: 42
    Located Basins: False
    Raster Metadata:
        Shape: (1261, 1874)
        CRS("NAD83")
        Transform(dx=9.259259269219641e-05, dy=-9.25925927753796e-05, left=-117.99879629663243, top=34.23981481425213, crs="NAD83")
        BoundingBox(left=-117.99879629663243, bottom=34.123055554762374, right=-117.82527777792725, top=34.23981481425213, crs="NAD83")

Locating Basins

You can use the Segments.locate_basins command to locate the outlet basins, and this command is called implicitly by commands that require the basins. Since locating basins is computationally expensive, a Segments object will store the basin locations internally for later use. This allows subsequent commands that require the basins to proceed much more quickly. Note that these locations will be discarded if the network is later filtered in a way that changes the basins.

You can check if a Segments object has located basins using the located_basins property. This information is also displayed when you print the object. For example, we can check that our example network has not located the outlet basins:

print(segments.located_basins)

False

print(segments)

Segments:
    Total Segments: 697
    Local Networks: 42
    Located Basins: False
    Raster Metadata:
        Shape: (1261, 1874)
        CRS("NAD83")
        Transform(dx=9.259259269219641e-05, dy=-9.25925927753796e-05, left=-117.99879629663243, top=34.23981481425213, crs="NAD83")
        BoundingBox(left=-117.99879629663243, bottom=34.123055554762374, right=-117.82527777792725, top=34.23981481425213, crs="NAD83")

But if we locate the basins, we find this value changes to True:

segments = segments.copy()
segments.locate_basins()

print(segments.located_basins)

True

Note that if we remove a segment corresponding to one of the basins, then the basin locations are discarded. For example, segment 340 is one of the terminal segments in our network:

segments.remove(340, type='ids')
print(segments.located_basins)

False
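The caching behavior described in this section can be pictured with a small toy class. This is only an illustrative sketch and not pfdf's implementation: the ToySegments class, its terminal_ids attribute, and the computations counter are hypothetical names invented for this example.

```python
class ToySegments:
    """Toy stand-in for a network that caches basin locations (not pfdf code)."""

    def __init__(self, terminal_ids):
        self.terminal_ids = set(terminal_ids)  # hypothetical: IDs of terminal segments
        self._basins = None                    # cached result; None until located
        self.computations = 0                  # counts how often the expensive step runs

    @property
    def located_basins(self):
        return self._basins is not None

    def locate_basins(self):
        # Only do the expensive work when no cached result exists
        if self._basins is None:
            self.computations += 1
            self._basins = {i: f"basin-{i}" for i in sorted(self.terminal_ids)}
        return self._basins

    def remove(self, segment_id):
        # Removing a terminal segment changes the basins, so discard the cache
        if segment_id in self.terminal_ids:
            self.terminal_ids.discard(segment_id)
            self._basins = None


toy = ToySegments(terminal_ids=[340, 512])
toy.locate_basins()
toy.locate_basins()               # second call reuses the cached result
cached_runs = toy.computations    # the expensive step has run only once
toy.remove(340)                   # invalidates the cache
print(cached_runs, toy.located_basins)
```

The design point is that the expensive step runs once per unchanged network, and any edit that alters the basins resets the stored result so that it is recomputed on the next request.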

How to Parallelize

By default, the locate_basins command locates basins sequentially. You can instead use the parallel option to locate basins in parallel (using multiple CPUs), but this option imposes two restrictions:

First, the parallelized code must ultimately be run from a command-line Python script. It cannot be run in an interactive session. Second, the parallelized code must be protected by an if __name__ == "__main__" block.

For example, the following script illustrates a hazard assessment with parallelized basins:

from pfdf import watershed
from pfdf.raster import Raster
from pfdf.segments import Segments

if __name__ == "__main__":

    # Watershed analysis
    pass

    # Delineate a network (this step creates the segments object)
    pass

    # Hazard assessment models
    pass

    # Locate basins in parallel
    segments.locate_basins(parallel=True)

    # Export to file
    pass

Most of this code will look familiar to readers of the Hazard Assessment Tutorial. However, there are two critical changes: the if __name__ == "__main__": block on line 5, and the locate_basins command on line 17.

We begin with the locate_basins command. Here, the key point is that we’ve set the parallel option to True, thereby instructing the command to locate the basins using multiple CPUs. Since we used the parallel option, we’d need to run this script from the command line, using something like:

python path/to/my_parallel_script.py

Second, the code in the script must be protected by the if __name__ == "__main__": block on line 5 of the example script. This is essential because the Python interpreter re-imports the script for each activated CPU. If this block is missing, the re-imported script will reach the code that activates multiple CPUs and will attempt to activate even more CPUs. Each of these CPUs will then re-import the script, resulting in an infinite loop that will eventually crash the terminal. By contrast, code in an if __name__ == "__main__": block isn’t run when the script is re-imported, which prevents the infinite loop.
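The effect of the guard can be checked without activating any extra CPUs. The sketch below writes a tiny throwaway script and runs it twice with the standard library's runpy module: once under the name __main__ (as when launched from the command line) and once under the name __mp_main__, which is the name Python's multiprocessing module gives the main script when a spawned worker re-imports it. It assumes the workers are started in that spawn style. The script contents and the guard_demo.py file name are made up for illustration.

```python
import runpy
import tempfile
from pathlib import Path

# A throwaway script: one unguarded statement and one guarded statement
SCRIPT = '''
events = ["module-level code ran"]
if __name__ == "__main__":
    events.append("guarded code ran")
'''


def run_script(run_name):
    """Run the throwaway script under the given module name and return its events."""
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "guard_demo.py"  # hypothetical file name
        path.write_text(SCRIPT)
        namespace = runpy.run_path(str(path), run_name=run_name)
    return namespace["events"]


as_main = run_script("__main__")       # launched from the command line
as_worker = run_script("__mp_main__")  # re-imported by a spawned worker
print(as_main)
print(as_worker)
```

Only the run named __main__ reaches the guarded statement. In a real script, that guarded region is where the call that activates multiple CPUs belongs, so a re-import never reaches it.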

When to Parallelize

Runtime improvements scale with the number of CPUs and the size of the watershed, so large watersheds benefit more strongly from parallelization than small watersheds. For small watersheds, the time spent activating CPUs may exceed the performance boost from parallelization, causing the code to actually run slower. Keep this in mind when deciding whether or not to parallelize.
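This tradeoff can be made concrete with a rough back-of-the-envelope model. The sketch below assumes an idealized case in which the work divides evenly across CPUs and activation costs a fixed amount of time. The function name and all of the numbers (including the 10-second startup cost) are invented for illustration and are not measurements of pfdf.

```python
def estimated_parallel_seconds(sequential_seconds, n_cpus, startup_seconds):
    """Idealized estimate: fixed startup cost plus evenly divided work."""
    return startup_seconds + sequential_seconds / n_cpus


# Hypothetical small watershed: 5 seconds of sequential work
small = estimated_parallel_seconds(sequential_seconds=5, n_cpus=4, startup_seconds=10)

# Hypothetical large watershed: 20 minutes of sequential work
large = estimated_parallel_seconds(sequential_seconds=1200, n_cpus=4, startup_seconds=10)

print(small)  # 11.25 seconds, slower than the 5-second sequential run
print(large)  # 310.0 seconds, much faster than the 1200-second sequential run
```

Under these assumptions, the startup cost dominates the small case and becomes negligible in the large one. Real speedups are typically smaller than this ideal division suggests.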

Rule of thumb

Parallelization is often appropriate if it takes 10+ minutes to locate the basins.
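One way to apply this rule is to time a sequential run once and compare the result against the 10-minute mark. The helpers below are a minimal sketch: time_call and worth_parallelizing are hypothetical names, and the summation is a placeholder workload standing in for a sequential call such as segments.locate_basins().

```python
import time

RULE_OF_THUMB_SECONDS = 10 * 60  # the 10-minute rule of thumb, in seconds


def time_call(func):
    """Return the wall-clock seconds taken by a single call to func."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def worth_parallelizing(sequential_seconds, threshold=RULE_OF_THUMB_SECONDS):
    """Apply the rule of thumb to a measured sequential runtime."""
    return sequential_seconds >= threshold


# Placeholder workload; in practice, pass the sequential basin-locating call
elapsed = time_call(lambda: sum(range(100_000)))

print(worth_parallelizing(elapsed))   # the quick placeholder falls below the threshold
print(worth_parallelizing(15 * 60))   # a 15-minute run exceeds it
```

The threshold is a heuristic, so runtimes near 10 minutes are worth testing both ways.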