
Optimize Face Bounds Computation #877

Merged · philipc2 merged 28 commits into main on Aug 9, 2024
Conversation

@philipc2 (Member) commented Aug 5, 2024

Closes #730

Overview

  • Decorated the NumPy functions (allclose, isclose, cross, dot) used by the helper functions that build the face bounds with Numba
  • Added a global option to enable or disable FMA (fused multiply-add) instructions

Expected Usage

```python
import uxarray as ux

grid_path = "/path/to/grid.nc"
data_path = "/path/to/data.nc"

uxds = ux.open_dataset(grid_path, data_path)

# this is how you use this function
some_output = uxds.some_function()

# this is another way to use this function
other_output = uxds.some_function(some_param=True)
```
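The snippet above is the PR template's generic placeholder. A more concrete sketch of the intended usage, based on this conversation, is below; the `uxgrid.bounds` accessor is an assumption inferred from the FaceBounds benchmarks, not confirmed API:

```python
import uxarray as ux

uxds = ux.open_dataset("/path/to/grid.nc", "/path/to/data.nc")

# Accessing the bounds triggers the (now Numba-accelerated) face-bounds
# construction; accessor name assumed from the FaceBounds benchmarks below.
bounds = uxds.uxgrid.bounds
```

The new FMA option (referred to in this thread as `is_FMA_enable`) defaults to disabled; enabling it trades speed for a more accurate cross product, as discussed later in this conversation.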

@philipc2 philipc2 added the run-benchmark Run ASV benchmark workflow label Aug 5, 2024
@philipc2 philipc2 self-assigned this Aug 5, 2024
github-actions bot commented Aug 5, 2024

ASV Benchmarking

Benchmark Comparison Results

Benchmarks that have improved:

| Change | Before [1790e0e] | After [2678371] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
| - | 13.8±0.03s | 1.44±0.02s | 0.1 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc')) |
| - | 2.09±0.01s | 186±5ms | 0.09 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc')) |
| - | 19.5±0.08s | 1.69±0.01s | 0.09 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc')) |
| - | 74.6±0.4ms | 7.81±0.05ms | 0.1 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc')) |
| - | 401M | 347M | 0.87 | mpas_ocean.Integrate.peakmem_integrate('480km') |

Benchmarks that have stayed the same:

| Change | Before [1790e0e] | After [2678371] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
|  | 1.63±0.01s | 1.64±0.01s | 1.00 | import.Imports.timeraw_import_uxarray |
|  | 621±5ms | 639±7ms | 1.03 | mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('120km') |
|  | 39.1±0.4ms | 39.3±0.8ms | 1.00 | mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('480km') |
|  | 1.83±0.06ms | 1.60±0.2ms | ~0.87 | mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('120km') |
|  | 471±7μs | 478±6μs | 1.01 | mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('480km') |
|  | 1.18±0μs | 1.21±0μs | 1.03 | mpas_ocean.ConstructTreeStructures.time_ball_tree('120km') |
|  | 292±1ns | 305±6ns | 1.04 | mpas_ocean.ConstructTreeStructures.time_ball_tree('480km') |
|  | 778±3ns | 834±3ns | 1.07 | mpas_ocean.ConstructTreeStructures.time_kd_tree('120km') |
|  | 279±1ns | 287±2ns | 1.03 | mpas_ocean.ConstructTreeStructures.time_kd_tree('480km') |
|  | 393M | 397M | 1.01 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('120km', False) |
|  | 383M | 385M | 1.00 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('120km', True) |
|  | 355M | 354M | 1.00 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('480km', False) |
|  | 354M | 354M | 1.00 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('480km', True) |
|  | 1.21±0s | 1.22±0.01s | 1.01 | mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', False) |
|  | 58.5±0.6ms | 58.5±0.4ms | 1.00 | mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', True) |
|  | 97.4±0.3ms | 94.7±0.4ms | 0.97 | mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', False) |
|  | 5.30±0.1ms | 5.41±0.1ms | 1.02 | mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', True) |
|  | 266M | 266M | 1.00 | mpas_ocean.Gradient.peakmem_gradient('120km') |
|  | 243M | 244M | 1.00 | mpas_ocean.Gradient.peakmem_gradient('480km') |
|  | 2.68±0.02ms | 2.64±0.01ms | 0.98 | mpas_ocean.Gradient.time_gradient('120km') |
|  | 293±4μs | 285±3μs | 0.97 | mpas_ocean.Gradient.time_gradient('480km') |
|  | 365M | 365M | 1.00 | mpas_ocean.Integrate.peakmem_integrate('120km') |
|  | 186±10ms | 177±1ms | 0.95 | mpas_ocean.Integrate.time_integrate('120km') |
|  | 11.8±0.1ms | 12.1±0.02ms | 1.02 | mpas_ocean.Integrate.time_integrate('480km') |
|  | 351±3ms | 347±3ms | 0.99 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'exclude') |
|  | 349±3ms | 347±2ms | 0.99 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'include') |
|  | 350±0.5ms | 353±4ms | 1.01 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'split') |
|  | 21.8±0.2ms | 22.4±0.3ms | 1.03 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'exclude') |
|  | 22.9±0.2ms | 22.2±0.2ms | 0.97 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'include') |
|  | 22.2±0.2ms | 22.4±0.3ms | 1.01 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'split') |
|  | 54.9±0.05ms | 55.0±0.2ms | 1.00 | mpas_ocean.RemapDownsample.time_inverse_distance_weighted_remapping |
|  | 44.3±0.2ms | 44.3±0.2ms | 1.00 | mpas_ocean.RemapDownsample.time_nearest_neighbor_remapping |
|  | 360±1ms | 361±1ms | 1.00 | mpas_ocean.RemapUpsample.time_inverse_distance_weighted_remapping |
|  | 264±0.9ms | 264±0.9ms | 1.00 | mpas_ocean.RemapUpsample.time_nearest_neighbor_remapping |
|  | 234M | 239M | 1.02 | quad_hexagon.QuadHexagon.peakmem_open_dataset |
|  | 232M | 239M | 1.03 | quad_hexagon.QuadHexagon.peakmem_open_grid |
|  | 7.10±0.2ms | 7.02±0.2ms | 0.99 | quad_hexagon.QuadHexagon.time_open_dataset |
|  | 5.99±0.02ms | 6.03±0.04ms | 1.01 | quad_hexagon.QuadHexagon.time_open_grid |

Benchmarks that have got worse:

| Change | Before [1790e0e] | After [2678371] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
| + | 243M | 376M | 1.55 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc')) |
| + | 303M | 377M | 1.25 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc')) |
| + | 321M | 381M | 1.19 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc')) |
| + | 318M | 378M | 1.19 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc')) |

@philipc2 philipc2 mentioned this pull request Aug 6, 2024
@philipc2 (Member, Author) commented Aug 7, 2024

@hongyuchen1030

For the first step in optimizing, I've looked at decorating some of the commonly called functions with Numba and used a Numba implementation of isclose. This looks to be about a 15-20% performance improvement to begin with.
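For illustration, a minimal sketch of what a Numba-compiled `isclose` replacement can look like (an assumption for illustration, not the exact implementation in this PR):

```python
import numpy as np
from numba import njit

@njit(cache=True)
def _isclose(a, b, rtol=1e-05, atol=1e-08):
    # Same tolerance test as np.isclose, but compiled, so it can be called
    # from other @njit-decorated helpers without dropping back into Python.
    return np.abs(a - b) <= (atol + rtol * np.abs(b))
```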

Next step will be to eliminate any unnecessary coordinate conversion calls, such as _xyz_to_lonlat_rad

This might look like us providing both the latlon_rad and cartesian points in the function call to avoid needing to convert. Something like:

```python
def _point_within_gca(pt_latlon,
                      pt_xyz,
                      gca_a_latlon,
                      gca_a_xyz,
                      gca_b_latlon,
                      gca_b_xyz,
                      is_directed=False):
```

Hopefully once we've moved these conversions out of each function call, we can see if the remaining logic can be decorated with numba.
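A sketch of the resulting caller-side pattern: convert each point once up front and pass both representations down. The conversion helper here is assumed for illustration, not the actual uxarray code:

```python
import numpy as np

def _lonlat_rad_to_xyz(latlon):
    # Assumed helper: (lon, lat) in radians -> normalized Cartesian point.
    lon, lat = latlon
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

# Convert once, then reuse both representations in every helper call.
pt_xyz = _lonlat_rad_to_xyz(pt_latlon)
gca_a_xyz = _lonlat_rad_to_xyz(gca_a_latlon)
gca_b_xyz = _lonlat_rad_to_xyz(gca_b_latlon)

inside = _point_within_gca(pt_latlon, pt_xyz,
                           gca_a_latlon, gca_a_xyz,
                           gca_b_latlon, gca_b_xyz)
```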

@philipc2 (Member, Author) commented Aug 7, 2024

@hongyuchen1030

```python
if np.allclose(angle, np.pi, rtol=0, atol=ERROR_TOLERANCE):
```

I had a question about this and other parts of the code where the Cartesian coordinates are converted to latlon in radians. Are we able to use the node_lon and node_lat that are part of the Grid (in degrees) instead of needing to run the conversion? Additionally, is radians here just a preference or can we use degrees?

@hongyuchen1030 (Collaborator) commented Aug 7, 2024

> Next step will be to eliminate any unnecessary coordinate conversion calls, such as _xyz_to_lonlat_rad
>
> This might look like us providing both the latlon_rad and cartesian points in the function call to avoid needing to convert. Something like:
>
> ```python
> def _point_within_gca(pt_latlon,
>                       pt_xyz,
>                       gca_a_latlon,
>                       gca_a_xyz,
>                       gca_b_latlon,
>                       gca_b_xyz,
>                       is_directed=False):
> ```

I have also been considering this design today. It will be highly beneficial to call the coordinate conversion only once, both for performance and accuracy.

We can extract all functions that take node coordinates as input and ensure that both Cartesian and spherical coordinates are required as parameters.

> I had a question about this and other parts of the code where the Cartesian coordinates are converted to latlon in radians. Are we able to use the node_lon and node_lat that are part of the Grid (in degrees) instead of needing to run the conversion? Additionally, is radians here just a preference or can we use degrees?

That is another reason why I have been advocating for better documentation and conventions for the coordinate system.

Currently, we have the following coordinate systems:

  1. Spherical coordinates in degrees, for users.
  2. Spherical coordinates in radians, because the trigonometry functions in NumPy (np.sin, np.cos, etc.) operate in radians, along with some other reasons we settled on long ago.
  3. Cartesian coordinates with normalization.
  4. Cartesian coordinates without normalization (we still haven't confirmed whether the unnormalized node data in MPAS and other grids equal normalized_coordinates * earth_radius).

Before we attempt to rewrite everything, we must ensure which coordinate systems we are using across the entire project.
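As a concrete example of what the check in item 4 would involve, a sketch (variable names assumed for illustration):

```python
import numpy as np

# Are the unnormalized node coordinates just the normalized coordinates
# scaled by one constant sphere radius?
xyz = np.stack([node_x, node_y, node_z], axis=-1)
radii = np.linalg.norm(xyz, axis=-1)

# True would confirm: unnormalized == normalized_coordinates * earth_radius
print(np.allclose(radii, earth_radius))
```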

@philipc2 (Member, Author) commented Aug 7, 2024

> I have also been considering this design today. It will be highly beneficial to call the coordinate conversion only once, both for performance and accuracy.

Agreed, was thinking of this as well.

> That is another reason why I have been advocating for better documentation and conventions for the coordinate system.

We have some documentation here in our user guide. https://uxarray.readthedocs.io/en/latest/user-guide/representation.html

Any suggestions on how we could improve this?

> Currently, we have the following coordinate systems:

Wondering if it would make sense to include node_lon_rad and node_lat_rad extensions to how we represent our coordinates. I've seen some other grids store the coordinates in both degrees and radians, so this might be a worthwhile extension (and would help with some of the conversions we are doing).

Hopefully the normalization inconsistencies will be resolved with #878

@hongyuchen1030 (Collaborator) commented

> We have some documentation here in our user guide. https://uxarray.readthedocs.io/en/latest/user-guide/representation.html
>
> Any suggestions on how we could improve this?
[screenshot of the coordinate descriptions in the representation user guide]

  1. I understand we already specify the convention underneath; however, node_lat has a description specifying the units while the others do not. It would be great to keep them uniform. Additionally, placing the note before the detailed description would enhance clarity.

  2. Just a note: I think we didn't use the [-180, 180] convention until we updated the coordinate conversion. The old conversion function used the [0, 360) convention, and all my algorithms are based on [0, 2π) as well. I highly recommend using [0, 2π) or [0, 360), since it provides an easier way to handle the antimeridian faces (see the conversion sketch after this list). This also necessitates a thorough re-check after any major functions are updated.

  3. It might also be beneficial to specify that we have a unique definition for the antimeridian points and the pole points.
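For reference, the two longitude conventions map into each other with a simple modular shift, e.g.:

```python
import numpy as np

def lon_to_0_360(lon_deg):
    # [-180, 180] -> [0, 360): antimeridian faces no longer straddle a sign flip.
    return np.mod(lon_deg, 360.0)

def lon_to_pm180(lon_deg):
    # [0, 360) -> [-180, 180)
    return np.mod(lon_deg + 180.0, 360.0) - 180.0
```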

> Wondering if it would make sense to include node_lon_rad and node_lat_rad extensions to how we represent our coordinates. I've seen some other grids store the coordinates in both degrees and radians, so this might be a worthwhile extension (and would help with some of the conversions we are doing).

I think what we've discovered so far is that the degrees are solely for the users, while all calculations use radians since the numpy trigonometry functions operate in radians. I'm always open to utilizing storage space to reduce time complexity.

@philipc2 (Member, Author) commented Aug 7, 2024

> I understand we already specify the convention underneath; however, node_lat has a description specifying the units while the others do not. It would be great to keep them uniform. Additionally, placing the note before the detailed description would enhance clarity.

That's a good catch!

> Just a note: I think we didn't use the [-180, 180] convention until we updated the coordinate conversion. The old conversion function used the [0, 360) convention, and all my algorithms are based on [0, 2π) as well. I highly recommend using [0, 2π) or [0, 360), since it provides an easier way to handle the antimeridian faces. This also necessitates a thorough re-check after any major functions are updated.

We use -180 to 180 since it makes certain visualization tasks (e.g., projections and the use of the antimeridian package) simpler. Most of our datasets are also already in the -180 to 180 range.

> I think what we've discovered so far is that the degrees are solely for the users, while all calculations use radians since the numpy trigonometry functions operate in radians. I'm always open to utilizing storage space to reduce time complexity.

This isn't fully the case. Many of our internal functions (e.g., BallTree) use degrees.

> numpy trigonometry functions operate in radians

One option could be to use SciPy's degree implementations, but I'm not sure how much its implementation differs from NumPy's.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.sindg.html#scipy.special.sindg

@philipc2 (Member, Author) commented Aug 7, 2024

@hongyuchen1030

Would you mind explaining this code snippet for me?

```python
d_a_max = (
    np.clip(d_a_max, 0, 1)
    if np.isclose(d_a_max, [0, 1], atol=ERROR_TOLERANCE).any()
    else d_a_max
)
```

@hongyuchen1030 (Collaborator) commented

> @hongyuchen1030
>
> Would you mind explaining this code snippet for me?
>
> ```python
> d_a_max = (
>     np.clip(d_a_max, 0, 1)
>     if np.isclose(d_a_max, [0, 1], atol=ERROR_TOLERANCE).any()
>     else d_a_max
> )
> ```

Sure, do you want to hop on a quick call?

@hongyuchen1030 (Collaborator) commented

> We use -180 to 180 since it makes certain visualization tasks (e.g., projections and the use of the antimeridian package) simpler. Most of our datasets are also already in the -180 to 180 range.

I understand the benefits of using the -180 to 180 range for visualization tasks and consistency with existing datasets. My algorithms do not expose the longitude range to users, nor do they modify existing longitudes, so this shouldn't pose a significant issue. However, I wanted to bring this up because if we require functions to take node_lat and node_lon from the grid, we may need to quickly check if longitude conversion is necessary. This note could be helpful for troubleshooting future bugs.

> This isn't fully the case. Many of our internal functions (e.g., BallTree) use degrees.

Is this because we directly use the grid.node_lat and grid.node_lon without calling the numpy trigonometric functions?

> One option could be to use SciPy's degree implementations, but I'm not sure how much its implementation differs from NumPy's.

I compared the source code for Scipy's sindg and Numpy's sin. Here’s what I found:

  1. Numpy’s sin is a wrapper for the compiled c_sin, with the source code for s_sin available here.
  2. Scipy’s sindg was implemented recently in this .h file, and it still converts some variables into radians.

I haven't delved deeply into the internal workings of both functions, but we could perform a simple benchmark to compare their running times and compatibility if needed.
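A simple benchmark along those lines might look like the sketch below (timings, of course, depend on the machine):

```python
import timeit

import numpy as np
from scipy.special import sindg

deg = np.random.uniform(-180.0, 180.0, 1_000_000)
rad = np.deg2rad(deg)

# Time 20 repetitions of each implementation on the same points.
t_np = timeit.timeit(lambda: np.sin(rad), number=20)
t_sp = timeit.timeit(lambda: sindg(deg), number=20)
print(f"np.sin (radians): {t_np:.3f}s   scipy.special.sindg (degrees): {t_sp:.3f}s")

# Compatibility check: do the two implementations agree numerically?
print(np.allclose(np.sin(rad), sindg(deg)))
```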

@philipc2 (Member, Author) commented Aug 7, 2024

@hongyuchen1030

After some further optimizations, this is the point that I've gotten to. It looks like the pyfma implementation still contributes a significant slowdown, taking up the majority of the execution time.

FMA Disabled

~5.73s execution time

[profiler screenshot]

FMA Enabled

~14.1s execution time

[profiler screenshot]

Do you think anything from here could possibly act as an alternative to using pyfma?

https://docs.scipy.org/doc/scipy/reference/linalg.blas.html

@philipc2 (Member, Author) commented Aug 8, 2024

A few more refinements, and with FMA disabled I'm able to get down to 3.74s.

[profiler screenshot]

@philipc2 (Member, Author) commented Aug 8, 2024

[benchmark screenshot]

From the benchmarks run above, it looks like these changes give almost an 8-10x performance improvement across the board.

I was also able to compute the bounds on a 30km MPAS atmosphere grid (655k faces, 1,310,000 nodes) in about 40 seconds, which is a significant improvement and wouldn't have been possible without these optimizations.

@hongyuchen1030 (Collaborator) commented

> @hongyuchen1030
>
> After some further optimizations, this is the point that I've gotten to. It looks like the pyfma implementation still contributes a significant slowdown, taking up the majority of the execution time.
>
> FMA Disabled: ~5.73s execution time
>
> FMA Enabled: ~14.1s execution time
>
> Do you think anything from here could possibly act as an alternative to using pyfma?
>
> https://docs.scipy.org/doc/scipy/reference/linalg.blas.html

I've been giving this case a lot of thought, considering my research findings and the intricacies of floating point operations.

Given the technical details involved with floating point operations, I won't delve into too much detail to avoid overwhelming you. Here's a summary of my thoughts:

  1. Maintain Current Approach: We can keep the current implementation as it is, but set is_FMA_enable to False globally and use the basic np.cross (see the sketch after this list). We'll document this choice, informing users that they can opt for a more accurate method by enabling FMA.
  2. Adjust Error Tolerance: The immediate task would be to reset the error tolerance for intersection-point snapping; this warrants a separate issue opened specifically for it. And since we are not allowed to use any compiled files in our project code base, we cannot do much about it now.
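A minimal sketch of what option 1 could look like (the flag and helper names are assumptions based on this thread, not the actual uxarray code):

```python
import numpy as np
import pyfma

ENABLE_FMA = False  # proposed global default; referred to above as is_FMA_enable

def _cross(a, b):
    """Cross product of two 3-vectors, optionally using fused multiply-adds."""
    if not ENABLE_FMA:
        return np.cross(a, b)
    # Each component a_i*b_j - a_j*b_i is evaluated with one fused
    # multiply-add, i.e. a single rounding, reducing cancellation error.
    return np.array([
        pyfma.fma(a[1], b[2], -a[2] * b[1]),
        pyfma.fma(a[2], b[0], -a[0] * b[2]),
        pyfma.fma(a[0], b[1], -a[1] * b[0]),
    ])
```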

Rationale:

  1. Algorithm Accuracy: While CDO uses Kahan's Algorithm with FMA, my research indicates there's a more accurate method available. This method works with and without FMA, though the non-FMA version requires more flops. We can explore this choice further as we develop in Python.
  2. Floating Point Operations: Standard floating point operations generally produce acceptable results. We can monitor their reliability by calculating the condition number. I'm working on an adaptive algorithm to address this in the future.
  3. Task Complexity: As you might have noticed, all these floating point operations in Python inevitably lead to compiled C. Although all these packages use the same compiled C function, their binding and wrapping are different, making them difficult to connect properly.

Conclusion:

Since FMA isn't currently ready for the Python package, we can focus on the general case first and handle the ill-conditioned cases later. The package should primarily use normal floating point operations, and when an ill-conditioned case arises, it's not overly time-consuming to apply the more accurate operations to this subset.

If you have any questions or need further clarification, please don't hesitate to ask. I'm more than happy to discuss and explain everything.

@philipc2 (Member, Author) commented Aug 8, 2024

> Maintain Current Approach: We can keep the current implementation as it is, but set is_FMA_enable to False globally and use the basic np.cross. We'll document this choice, informing users that they can opt for a more accurate method by enabling FMA.

I think this option makes the most sense to me and would also fit the expectations of our users.

> And since we are not allowed to use any compiled files in our project code base, we cannot do much about it now.

I wouldn't say that we are not allowed; it's more that we haven't explored it yet. It might be worth considering.

I appreciate the write-up and am glad that we're making some good improvements to both the performance and accuracy of our operators.

@philipc2 philipc2 changed the title DRAFT: Optimize Face Bounds Computation Optimize Face Bounds Computation Aug 9, 2024
@philipc2 philipc2 linked an issue Aug 9, 2024 that may be closed by this pull request
@philipc2 philipc2 added the scalability Related to scalability & performance efforts label Aug 9, 2024
@philipc2 philipc2 added this to the Scalability & Performance milestone Aug 9, 2024
@philipc2 philipc2 marked this pull request as ready for review August 9, 2024 05:57
@philipc2 (Member, Author) commented Aug 9, 2024

Merging this so that we can continue our work with #785

Overall, I was able to get about a 10-12x performance improvement by using Numba without major rewrites to the algorithm. There are still more optimizations we could do, but they would require significant rewrites, which should not be a priority right now.

@philipc2 philipc2 merged commit 1a68daa into main Aug 9, 2024
15 of 16 checks passed
@hongyuchen1030 (Collaborator) commented

> Merging this so that we can continue our work with #785
>
> Overall, I was able to get about a 10-12x performance improvement by using Numba without major rewrites to the algorithm. There are still more optimizations we could do, but they would require significant rewrites, which should not be a priority right now.

Thank you very much for your excellent work! I will open another issue and PR about our rewriting decisions for the helper functions.

@philipc2 (Member, Author) commented Aug 9, 2024

> > Merging this so that we can continue our work with #785
> >
> > Overall, I was able to get about a 10-12x performance improvement by using Numba without major rewrites to the algorithm. There are still more optimizations we could do, but they would require significant rewrites, which should not be a priority right now.
>
> Thank you very much for your excellent work! I will open another issue and PR about our rewriting decisions for the helper functions.

Sounds good!

Labels

run-benchmark (Run ASV benchmark workflow), scalability (Related to scalability & performance efforts)
Development

Successfully merging this pull request may close these issues:

  • Optimize Face Bounds Construction (#730)