Optimize performance of Raster.get_stats() #693

Open
@rhugonnet

Description

While working on #492, I noticed a couple of issues with Raster.get_stats():

  1. The function always computes every statistic, even when only one is requested, here:
    stats_dict = self._statistics(band=band, counts=counts)
    This makes it ~15 times slower on CPU when asking for a single stat.
  2. The function duplicates the data in RAM here:
    mdata = np.ma.filled(data.astype(float), np.nan)
    Normally we can skip this by calling functions that work on masked arrays directly. NumPy has no np.ma.percentile, so for percentiles we can use SciPy's masked-array statistics instead, via mquantiles: https://docs.scipy.org/doc/scipy/reference/stats.mstats.html#statistical-functions-for-masked-arrays-scipy-stats-mstats
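
To make both points concrete, here is a minimal sketch of what this could look like. It is not the current geoutils code: `masked_stats` and `_STAT_FUNCS` are hypothetical names, and the set of statistics is only illustrative. Each statistic is a callable that is evaluated only when requested, and everything operates on the masked array without the `astype(float)` plus NaN fill. The `alphap=1, betap=1` arguments are chosen on the assumption that we want the same linear interpolation as NumPy's default percentile.

```python
import numpy as np
from scipy.stats import mstats


def _percentile(marr, q):
    # mquantiles ignores masked values; alphap=1, betap=1 gives the same
    # linear interpolation as np.percentile's default method.
    # axis=None flattens the input (mquantiles accepts at most 2D arrays).
    res = mstats.mquantiles(marr, prob=[q / 100.0], alphap=1, betap=1, axis=None)
    return float(res[0])


# Hypothetical registry: one callable per statistic, evaluated lazily.
_STAT_FUNCS = {
    "mean": lambda m: float(np.ma.mean(m)),
    "median": lambda m: float(np.ma.median(m)),
    "min": lambda m: float(np.ma.min(m)),
    "max": lambda m: float(np.ma.max(m)),
    "std": lambda m: float(np.ma.std(m)),
    "valid_count": lambda m: int(np.ma.count(m)),
    "p90": lambda m: _percentile(m, 90),
}


def masked_stats(marr, names):
    """Compute only the requested statistics on a masked array."""
    unknown = [n for n in names if n not in _STAT_FUNCS]
    if unknown:
        raise ValueError(f"Unknown statistic(s): {unknown}")
    return {n: _STAT_FUNCS[n](marr) for n in names}


# Small example: the last value is masked and must not affect the results.
example = np.ma.masked_array(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 100.0]],
    mask=[[False, False, False], [False, False, True]],
)
single = masked_stats(example, ["mean"])
several = masked_stats(example, ["median", "p90", "valid_count"])
```

Note that mquantiles still sorts a compressed copy of the valid values internally, so this avoids the explicit float cast and NaN fill rather than every temporary allocation.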

I've had to move these functions out of the Raster class and into a stats/ module so they can be re-used on PointCloud objects, so it is better to wait until that PR is merged before tackling these changes.

Labels: enhancement, performance
