Move filter.py from xDEM to GeoUtils and integrate into raster class #691

Open
wants to merge 4 commits into base: main

Conversation

vschaffn
Contributor

@vschaffn vschaffn commented Apr 28, 2025

Resolves #690.

Context

The current filter.py module in xDEM provides a collection of filters that are highly useful for digital elevation models (DEMs). However, these filtering utilities are not intrinsically tied to elevation data—they are generally applicable to any geospatial raster data. Given this, it makes more architectural sense for them to live in GeoUtils.

Changes

  • Moved the filter.py module from xdem/ to geoutils/filters.py.
  • Removed gaussian_filter_cv to avoid a dependency on OpenCV.
  • Moved tests/test_filter.py in xDEM to tests/test_filters.py in GeoUtils.
  • Added a new _nan_safe_filter function, shared by the gaussian, mean, and median filters, to handle arrays containing NaN values.
  • Added a generic _filter function mapping the available filters.
  • Added the Raster.filter method, which extracts the array with NaNs, applies the _filter function, and rebuilds the Raster data.
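As a rough illustration of the dispatch pattern described in the list above (a sketch under assumed names: the `_FILTERS` mapping and the exact signature of `_filter` are not taken from the PR's code):

```python
import numpy as np
import scipy.ndimage

# Hypothetical mapping of filter names to SciPy implementations
_FILTERS = {
    "gaussian": scipy.ndimage.gaussian_filter,
    "mean": scipy.ndimage.uniform_filter,
    "median": scipy.ndimage.median_filter,
}


def _filter(array: np.ndarray, method: str, **kwargs) -> np.ndarray:
    """Apply the filter named by `method`, forwarding kwargs to SciPy."""
    if method not in _FILTERS:
        raise ValueError(f"Unknown filter method: {method}")
    return _FILTERS[method](array, **kwargs)


arr = np.arange(25, dtype=float).reshape(5, 5)
out = _filter(arr, method="median", size=3)
```

A `Raster.filter` method could then extract the masked array as NaNs, call `_filter`, and wrap the result back into a raster, as the last bullet describes.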

Tests

  • Added new tests for the Raster.filter method.

Documentation

  • Added a filter section in raster_class.md.

Note

Once this PR is approved, a linked PR removing the filters from xDEM will be opened.


:returns: the filtered array (same shape as input).
"""
return _nan_safe_filter(array, scipy.ndimage.median_filter, **kwargs)
Member

@rhugonnet rhugonnet May 1, 2025


EDIT: If I'm not mistaken, the implementation in xDEM doesn't fully work, and neither does the one here.

In xDEM, replacing by inf:

array = [1, 2, 3]
array2 = [1, 2, 3, np.inf, np.inf, np.inf, np.inf]
np.median(array)
Out[16]: np.float64(2.0)
np.median(array2)
Out[17]: np.float64(inf)

Here, replacing by 0:

array3 = [1, 2, 3, 0, 0, 0, 0]
np.median(array3)
Out[19]: np.float64(0.0)

There is no way to recover the correct median of 2 by dividing 0 by anything, as is done in _nan_safe_filter.

Member


Should we simply call generic_filter with nanmean and nanmedian inside?
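A minimal sketch of that suggestion, assuming NaN padding at the edges (illustrative, not the final implementation):

```python
import numpy as np
from scipy.ndimage import generic_filter

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, np.nan, 6.0],
                [7.0, 8.0, 9.0]])

# generic_filter passes each window's values to the callable;
# np.nanmedian ignores NaNs, and padding with NaN keeps the edge
# windows NaN-aware too.
filtered = generic_filter(arr, np.nanmedian, size=3, mode="constant", cval=np.nan)
```

The center pixel then becomes the median of its eight valid neighbours, with the NaN simply excluded rather than poisoning the result.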

Member


This would be the simplest solution, but in my experience generic_filter is really slow, so we might consider other options in the future. At the least, we could compare our implementation against it in the tests.

Member

@rhugonnet rhugonnet May 5, 2025


Yes that's unfortunately the case, see here (46 times slower to use generic_filter for median): https://scipy-user.scipy.narkive.com/YyMHNgYg/scipy-ndimage-filters-median-filter-with-nan-values

The other option is to use Numba by default, which should be fast for any custom operation like this.
This would mirror the behaviour of xDEM's terrain attributes, where I'm adding the option of using either a Numba or SciPy engine (see here; PR not finalized yet: GlacioHack/xdem#486).

For this PR, we can stick to SciPy only and generic_filter.

Member


There's also a way to write a short C filter to pass to SciPy, which is usually much faster.
See: http://notmatthancock.github.io/2020/04/19/scipy-generic-filter-low-level-callable.html

Member


Actually my bad: for the nanmean we already had a working filter using the trick of replacing NaNs with zeros and then dividing by the sum of valid pixels only, but I don't think that is exactly how it behaves in this PR's implementation.
We actually have an old working duplicate here: https://github.com/GlacioHack/xdem/blob/e1e2065ea52a7db3c8e88045d743aa1f6efd08bb/xdem/spatialstats.py#L2563

In short, to move forward in this PR, we need to remove the _nan_safe_filter() function (the trick only works for the mean, so it's not generalizable) and re-implement only the mean with it.
For the median, simply use generic_filter(np.nanmedian) (which is quite slow, but I think there is no other choice).
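The zero-fill mean trick from the old spatialstats duplicate can be sketched as follows (an illustrative re-implementation, not the linked code verbatim; the helper name `nan_safe_mean_filter` is hypothetical): filter both a zero-filled copy and the validity mask, then divide, so each window is normalized by its count of valid pixels only.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def nan_safe_mean_filter(array: np.ndarray, size: int = 3) -> np.ndarray:
    """NaN-aware mean filter: window sums of valid values / valid counts."""
    nan_mask = np.isnan(array)
    filled = np.where(nan_mask, 0.0, array)  # NaNs contribute 0 to the sum
    valid = (~nan_mask).astype(float)        # 1 where valid, 0 where NaN
    # uniform_filter divides both quantities by the same window area,
    # so the normalization cancels in the ratio below.
    sums = uniform_filter(filled, size=size)
    counts = uniform_filter(valid, size=size)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts  # NaN only where a window has no valid pixel


arr = np.array([[1.0, 2.0, 3.0],
                [4.0, np.nan, 6.0],
                [7.0, 8.0, 9.0]])
out = nan_safe_mean_filter(arr, size=3)
```

This is exactly why the trick does not generalize to the median: a sum can be corrected by a count afterwards, but an order statistic over a window polluted with zeros cannot.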

Member

@rhugonnet rhugonnet left a comment


Great addition, thanks a lot for all the work! 😄

It would be great to add more quantitative tests to ensure the results are as expected with/without NaNs. If I didn't make a mistake in my comment above, the mean/median filters might be erroneous right now.

Maybe we should add a set of tests manually checking the operation in a given window, for each filter type, to ensure there is no error?

Example for the mean:

arr = np.random.normal(size=(5, 5))

arr_filtered = _filter(arr, method="mean", size=3)
assert np.isclose(np.nanmean(arr[1:4, 1:4]), arr_filtered[2, 2])

They should all be easy to test manually like this, except for the gaussian one, that needs to use the logic in the StackOverflow post to correct the division factors.
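The division-factor correction mentioned above for the gaussian case follows the same normalization logic as the mean trick: smooth a zero-filled copy and divide by the smoothed validity mask. A sketch, with the helper name `nan_safe_gaussian` being hypothetical:

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def nan_safe_gaussian(array: np.ndarray, sigma: float) -> np.ndarray:
    """NaN-aware gaussian smoothing via weight renormalization."""
    nan_mask = np.isnan(array)
    filled = np.where(nan_mask, 0.0, array)
    # Smoothed validity mask = the fraction of gaussian weight that fell
    # on valid pixels; dividing by it corrects the bias from the zeros.
    weights = gaussian_filter((~nan_mask).astype(float), sigma=sigma)
    smoothed = gaussian_filter(filled, sigma=sigma)
    with np.errstate(invalid="ignore", divide="ignore"):
        return smoothed / weights


# Sanity check: a constant field with a NaN hole should stay constant.
arr = np.full((5, 5), 3.0)
arr[2, 2] = np.nan
smoothed = nan_safe_gaussian(arr, sigma=1.0)
```

A manual test like the constant-field check above is one way to verify the correction without reproducing the full weight algebra.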

My last comment: should we add the size/radius parameter directly in Raster.filter()'s parameters, to make it easier to grasp from the user's perspective? We also need to specify in the docstring that all other kwargs refer to scipy.ndimage filters 😉.

@rhugonnet
Member

rhugonnet commented May 7, 2025

I've been working on a very similar problem during the past week in GlacioHack/xdem#486, so while I still know the intricacies of the topic, here's how we could add Numba filters in GeoUtils to dramatically improve the speed of the median (and we could re-use the same code structure for nmad, nanmax, and any other nanfunc):

from typing import Literal
import numpy as np
from numba import jit, prange
from scipy.ndimage import generic_filter

@jit(parallel=True)
def median_filter_numba(dem: np.ndarray, window_size: int):

    # Get input shapes
    N1, N2 = dem.shape

    # Define ranges to loop through given padding
    row_range = N1 - window_size + 1
    col_range = N2 - window_size + 1

    # Allocate output array
    outputs = np.full((row_range, col_range), fill_value=np.nan, dtype=np.float32)

    # Loop over every pixel concurrently by using prange
    for row in prange(row_range):
        for col in prange(col_range):

            outputs[row, col] = np.nanmedian(dem[row: row+window_size, col:col+window_size])

    return outputs


def median_filter(dem: np.ndarray, window_size: int, engine: Literal["scipy", "numba"]):

    # Using SciPy, options already include edge behaviour
    if engine == "scipy":
        return generic_filter(dem, np.nanmedian, size=window_size, mode="constant", cval=np.nan)
    # Using Numba, need to define edge behaviour ourselves
    elif engine == "numba":
        # We pad the input array with NaNs for half the window size
        hw = int((window_size - 1) / 2)
        dem = np.pad(dem, pad_width=((hw, hw), (hw, hw)), constant_values=np.nan)
        return median_filter_numba(dem, window_size)


############
# TEST SPEED
############

from time import time
rnd = np.random.default_rng(42)
nb_it = 10  # Several iterations to be more precise measuring CPU time
window_size = 11
sizes = [50, 100, 500, 1000]  # Array sizes

# Call Numba once in the void to trigger JIT compile
_ = median_filter(np.ones((10, 10)), window_size=3, engine="numba")

list_df = []
for s in sizes:
    dem = rnd.normal(size=(s, s))

    # Get computation times
    t1 = time()
    for i in range(nb_it):
        filt1 = median_filter(dem, window_size=window_size, engine="scipy")
    t2 = time()
    for i in range(nb_it):
        filt2 = median_filter(dem, window_size=window_size, engine="numba")
    t3 = time()

    # Check output is equal
    assert np.allclose(filt1, filt2)

    # Print elapsed times
    print(f"Size: {s}")
    print(f"Elapsed SciPy: {t2 - t1}")
    print(f"Elapsed Numba: {t3 - t2}")
Output:

Size: 50
Elapsed SciPy: 0.24158883094787598
Elapsed Numba: 0.005011320114135742
Size: 100
Elapsed SciPy: 0.9490830898284912
Elapsed Numba: 0.012121915817260742
Size: 500
Elapsed SciPy: 23.553598880767822
Elapsed Numba: 0.3076012134552002
Size: 1000
Elapsed SciPy: 94.9611337184906
Elapsed Numba: 1.2290620803833008

Quite a striking improvement, nearly a factor of 100! 😉

Labels
None yet
3 participants