
"Numpy-isation" of irradiance decomposition functions #1455

Open
alexisstgc opened this issue May 11, 2022 · 17 comments · May be fixed by #1456
Open

"Numpy-isation" of irradiance decomposition functions #1455

alexisstgc opened this issue May 11, 2022 · 17 comments · May be fixed by #1456

Comments

@alexisstgc
Copy link

Problem
I generate GHI data as maps and want to run it through irradiance decomposition models (specifically dirindex) to obtain DNI and DHI data. My data has three dimensions (time, latitude, longitude), so passing it through pvlib's built-in functions, which require pandas.DataFrame input, is neither easy nor efficient.

Solution
I have already adapted all the necessary functions to accept numpy arrays of any dimension, with the time dimension last (see the linked PR). Results are consistent with the output of pvlib's functions, and the computation is extremely fast compared to the alternative below. However, I am looking for some advice:

  • Is there another solution I did not consider?
  • Is numpy the best choice, or should I be using xarray so that the time dimension could be at any position?
  • Is this something that could/should be implemented throughout the whole of pvlib? Is it interesting?

Alternatives I've considered
I tried looping over all the lat/lon points and calling dirindex for each of them with multiprocessing, but it took too long and was too computationally heavy for the volume of data I was using.

alexisstgc linked a pull request (#1456) on May 11, 2022 that will close this issue.
@wholmgren (Member)

Hi @alexisstgc thanks for bringing this up. It's a tricky and important problem.

see the linked PR

The linked PR currently only shows a single line changed with a comment.

Is there another solution I did not consider?

A small number of pvlib functions return OrderedDict if the user provides higher-dimensional input. The idea was that results could still be accessed through key lookup (out['ghi'] works equally well with a DataFrame and a dict). Obviously we could use regular dicts now that they preserve order, but this code predates that feature.
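
For illustration (a minimal sketch, not pvlib code):

import pandas as pd

out_dict = {'ghi': [1000.0], 'dni': [800.0]}
out_df = pd.DataFrame(out_dict)

# the same key lookup works on both containers
print(out_dict['ghi'])  # a list
print(out_df['ghi'])    # a Series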

I've gone back and forth on whether I prefer labeled results or just a tuple. There was an issue with some discussion of this a year or two ago, but I don't recall where.

Is numpy the best choice, or should I be using xarray so that the time dimension could be at any position?

I am reluctant to endorse xarray because I worry that another user is going to come along and ask for support for something else (maybe something that hasn't been invented yet!). I'd prefer to maintain type consistency with inputs whenever possible.

Is this something that could/should be implemented throughout the whole of pvlib? Is it interesting?

I am generally in favor of pvlib functions accepting N-dimensional data, broadcasting as appropriate, and outputting something sensible.

There was also some related discussion regarding solar position calculations here: Unidata/MetPy#1440 (comment)

@kandersolar (Member)

Is there another solution I did not consider?

It is a bit clunky, but the general task of applying a function that is vectorized only across time to data with other (e.g. spatial) dimensions can often be achieved by reshaping the data into a very long 1-D representation, running the function, and then reshaping the result back to N-D.
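
Roughly, with made-up shapes and a stand-in function:

import numpy as np

def time_vectorized(x):
    # stand-in for a function that is vectorized over time only
    return 0.5 * x

ghi = np.random.rand(24, 10, 20)                 # hypothetical (time, lat, lon) cube
flat = ghi.reshape(24, -1).ravel(order='F')      # point 0's full series, then point 1's, ...
out = time_vectorized(flat)
result = out.reshape(10 * 20, 24).T.reshape(ghi.shape)  # back to (time, lat, lon)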

There was an issue with some discussion of this a year or two ago, but I don't recall where.

Maybe #959

@alexisstgc (Author) commented May 12, 2022

Hello @wholmgren and @kanderso-nrel! Thank you for your kind replies. I forgot to mention that this is my first time contributing to a public library, so don't hesitate to say something if you notice me doing something wrong or forgetting something 🤷‍♂️ 😅

The linked PR currently only shows a single line changed with a comment.

Yes, I pushed my changes just now, but I still have some work to do to pass the tests and the code-quality checks 😉

reshaping the data into a very long 1-D representation

Of course, how could I not think of this... The only problem I foresee is the times DatetimeIndex that has to be provided along with the multidimensional GHI data... But I will try it as soon as I can.
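
Something like this might work for the bookkeeping (a hedged sketch with hypothetical shapes; whether downstream functions tolerate a repeating index is another question):

import numpy as np
import pandas as pd

times = pd.date_range('2022-05-12', periods=24, freq='h')
ghi = np.random.rand(24, 10, 20)                 # (time, lat, lon)

n_points = 10 * 20
flat_ghi = ghi.reshape(24, -1).ravel(order='F')  # point 0's series, then point 1's, ...
flat_times = pd.DatetimeIndex(np.tile(times.values, n_points))  # index repeated to match
flat_series = pd.Series(flat_ghi, index=flat_times)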

@alexisstgc (Author)

BTW, in my PR I made some additional changes, and I would also like feedback on these:

  • I reduced the precision of some variables. Because of the size of my input data, it is important for me to keep variables as lightweight as possible, so I changed most of them from the default float64 to float32. But I suppose for other users, the more precision the better? (See the memory sketch after this list.)
  • I changed the bins of _dirint_bins() from int64 to int8, since they only hold integers from 0 to 7.
  • In dirindex, I encountered some problems when dni_dirint_clearsky is equal to zero, so I put in a workaround that seems reasonable.
  • In _dirint_coeffs, NaNs were previously inserted when no bin was assigned. I wanted to avoid NaNs as much as possible, so I put a 1 there instead, which falls back to the disc data.
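
For a sense of the memory impact (illustrative numbers only, not pvlib code):

import numpy as np

n = 1_000_000
print(np.zeros(n, dtype=np.float64).nbytes)  # 8000000 bytes
print(np.zeros(n, dtype=np.float32).nbytes)  # 4000000 bytes
print(np.zeros(n, dtype=np.int8).nbytes)     # 1000000 bytes; plenty for bin labels 0..7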

@mikofski (Member)

I had a couple of thoughts about this topic:

  1. I would prefer an even more core set of functions that deal strictly with NumPy arrays only; that would allow broadcasting, etc. These ultra-core functions could be wrapped in pandas glue to give consistent behavior with the existing functions.
  2. I'm not sure the OP's request is suitable for pvlib because, as Kevin alluded to, even NumPy is really dealing with a single 1-D array internally in memory; the array is allocated in chunks in the L2 cache and fed to the SIMD registers chunk by chunk. So if you have a 3-D spatial array as a time series, it might be so big that you would be better off with a different vectorization method like Dask and/or joblib (a rough sketch follows below); see: https://joblib.readthedocs.io/en/latest/auto_examples/parallel/distributed_backend_simple.html
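
A rough sketch of the joblib pattern from point 2 (hypothetical function and shapes, just the pattern):

from joblib import Parallel, delayed
import numpy as np

def decompose_point(series):
    # stand-in for a per-pixel decomposition call (e.g. dirindex on one lat/lon point)
    return 0.5 * series

ghi = np.random.rand(24, 10, 20)        # hypothetical (time, lat, lon)
points = ghi.reshape(24, -1).T          # one time series per spatial point

out = Parallel(n_jobs=-1)(delayed(decompose_point)(p) for p in points)
result = np.stack(out, axis=1).reshape(ghi.shape)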

@wholmgren (Member)

I would prefer an even more core set of functions that deal strictly with NumPy arrays only; that would allow broadcasting, etc. These ultra-core functions could be wrapped in pandas glue to give consistent behavior with the existing functions.

Wondering what this would look like in practice... Let's start with the simple case of disc:

I0 = get_extra_radiation(datetime_or_doy, 1370., 'spencer')
kt = clearness_index(ghi, solar_zenith, I0, min_cos_zenith=min_cos_zenith,
                     max_clearness_index=1)
am = atmosphere.get_relative_airmass(solar_zenith, model='kasten1966')
if pressure is not None:
    am = atmosphere.get_absolute_airmass(am, pressure)
Kn, am = _disc_kn(kt, am, max_airmass=max_airmass)
dni = Kn * I0
bad_values = (solar_zenith > max_zenith) | (ghi < 0) | (dni < 0)
dni = np.where(bad_values, 0, dni)
output = OrderedDict()
output['dni'] = dni
output['kt'] = kt
output['airmass'] = am
if isinstance(datetime_or_doy, pd.DatetimeIndex):
    output = pd.DataFrame(output, index=datetime_or_doy)
return output

get_extra_radiation probably needs some refactoring to support that use case. I think the other calls and operations already work with arbitrary input. Would you return before the OrderedDict? Before isinstance?

I think the OP's problem is well-posed. And this is not the first time we've seen users struggle against multi-dimensional input with pvlib functions. The array/Series changes in the linked PR seemed reasonable to me, though I didn't give it a full review. As @kanderso-nrel said, reshaping numpy arrays to work with pvlib functions is clunky - maybe workable in isolation, but probably not if you want to stitch steps together.

@mikofski (Member)

get_extra_radiation probably needs some refactoring to support that use case.

Many core functions might have to be refactored. Maybe even get_*_airmass().

I think the other calls and operations already work with arbitrary input.

I was hoping that this might be true, by some dumb luck and good architecture, but we'll see.

Would you return before the OrderedDict?

No, a pure NumPy version of disc would only take NumPy or array-like input, and would only return NumPy structured arrays, so there would never be a need for OrderedDict.

Before isinstance?

Yes; the isinstance check would be unnecessary, since there would be only one contract and one dispatch.

Here's a quick cheat to broadcast the output into a structured array:

dtype = np.dtype([('dni', np.float64), ('kt', np.float64), ('airmass', np.float64)])
shape = dni.shape
output = np.empty(shape, dtype)  # instead of OrderedDict()
output['dni'] = dni
output['kt'] = kt
output['airmass'] = am

This works as long as dni, kt, and am all have the same shape.

@adriesse (Member)

Good initiative and ideas. I like the idea of core numpy-only functions.

@wholmgren (Member)

The structured array documentation seems to suggest this is a bad fit for our use case:

Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. They are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.

Users looking to manipulate tabular data, such as stored in csv files, may find other pydata projects more suitable, such as xarray, pandas, or DataArray. These provide a high-level interface for tabular data analysis and are better optimized for that use. For instance, the C-struct-like memory layout of structured arrays in numpy can lead to poor cache behavior in comparison.

I've not used structured arrays except for tinkering, so I don't have any personal experience to add to that.

I am still not understanding what the API is for users that want a DataFrame with columns dni, kt, and airmass (the pandas glue). How does that work in practice with the lower-level disc function?

@mikofski (Member)

Haha 🤣 I must be missing something totally obvious. I use struct arrays all the time, and I believe they are by far the fastest. See this gist: by my test, structured arrays are at least 200x faster than xarray and 13x faster than pandas. Also, as in my quick-cheat example above, broadcasting becomes trivial. For very simple struct arrays, I don't think the syntax is that hard; you just need to define the dtype in advance.

The API would be something like:

f_high_level(x_in: DataFrame, xarray, etc.) -> DataFrame, xarray, etc.
    # convert input to numpy
    # call low-level numpy functions
    # convert numpy back to the input type

f_low_level(x_in: numpy) -> numpy
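
Concretely, the glue might look something like this (hypothetical names, with stand-in arithmetic rather than the real DISC model):

import numpy as np
import pandas as pd

def disc_core(ghi, solar_zenith, doy):
    # ultra-core: numpy in, numpy out (placeholder math only; solar_zenith
    # and doy are kept just to mirror the real signature)
    kt = ghi / 1000.0
    dni = 800.0 * kt
    return dni, kt

def disc_frame(ghi, solar_zenith, times):
    # pandas glue: convert inputs to numpy, call the core, rebuild a DataFrame
    dni, kt = disc_core(np.asarray(ghi), np.asarray(solar_zenith),
                        times.dayofyear.to_numpy())
    return pd.DataFrame({'dni': dni, 'kt': kt}, index=times)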

@wholmgren (Member)

I was surprised to find that I had already written 2D array tests for clearness index. The airmass functions are tested with 1D numpy arrays. get_extra_radiation works with 2D array input, but it's not tested. So, no surprise then, that disc works with N-D array input:

In [43]: pvlib.irradiance.disc(np.array([[0, 250], [500, 750]], dtype=np.float32),
    ...:                       np.array([[100, 70], [40, 20]], dtype=np.float32),
    ...:                       np.array([[100, 100], [100, 100]], dtype=np.float32))
Out[43]:
OrderedDict([('dni',
              array([[  0.        , 414.87658824],
                     [165.11445531, 307.28344503]])),
             ('kt',
              array([[0.        , 0.53562295],
                     [0.47828513, 0.58485246]], dtype=float32)),
             ('airmass',
              array([[       nan, 2.89994597],
                     [1.30367947, 1.0634042 ]]))])

Why is float32 preserved for kt but not the others? It turns out the type promotion happens in get_absolute_airmass: float32 * int yields float64 if the int is large enough to cause precision issues, and disc has a default pressure=101325 (no decimal point, so it's an int). We can make get_absolute_airmass more robust by grouping the pressure normalization with parentheses (see the sketch below). Or the user can provide a float pressure input. And/or we can change disc's default pressure to a float.
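
To see the grouping fix in isolation (a hedged sketch; the exact promotion behavior depends on the NumPy version and its casting rules):

import numpy as np

am = np.ones(3, dtype=np.float32)
pressure = 101325                  # disc's default is an int

a = am * pressure / 101325.        # can promote to float64, as observed above
b = am * (pressure / 101325.)      # scalar ratio computed first; stays float32
print(a.dtype, b.dtype)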

In [64]: pvlib.irradiance.disc(np.array([[0, 250], [500, 750]], dtype=np.float32),
    ...:                       np.array([[100, 70], [40, 20]], dtype=np.float32),
    ...:                       np.array([[100, 100], [100, 100]], dtype=np.float32),
    ...:                       pressure=101325.)
Out[64]:
OrderedDict([('dni',
              array([[  0.     , 414.87656],
                     [165.1144 , 307.28342]], dtype=float32)),
             ('kt',
              array([[0.        , 0.53562295],
                     [0.47828513, 0.58485246]], dtype=float32)),
             ('airmass',
              array([[      nan, 2.899946 ],
                     [1.3036796, 1.0634042]], dtype=float32))])

Ok, now I want to use dask arrays...

In [21]: out_da = irradiance.disc(da.from_array(ghi), da.from_array(solar_zenith), da.from_array(doy), pressure=101325.)

In [22]: out_da
Out[22]:
OrderedDict([('dni',
              dask.array<where, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>),
             ('kt',
              dask.array<minimum, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>),
             ('airmass',
              dask.array<minimum, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>)])

In [23]: out_da['dni'].compute()
Out[23]:
array([[  0.     , 414.87656],
       [165.1144 , 307.28342]], dtype=float32)

It just works!

What if we used a structured array instead of a dict? Following @mikofski's example of a structured array...

In [43]: output = np.empty(shape, dtype)

In [44]: output['dni'] = out_da['dni']

In [45]: output['kt'] = out_da['kt']

In [46]: output['airmass'] = out_da['airmass']

In [47]: output
Out[47]:
array([[(  0.        , 0.        ,        nan),
        (414.8765564 , 0.53562295, 2.89994597)],
       [(165.11439514, 0.47828513, 1.30367959),
        (307.28341675, 0.58485246, 1.0634042 )]],
      dtype=[('dni', '<f8'), ('kt', '<f8'), ('airmass', '<f8')])

It appears that compute() is called under the hood somewhere. This is not good. So I still think we're better off sticking with dicts or tuples instead of a structured array.

So, at least for this function, we mostly have a documentation and testing problem. I'm not sure what to do about the API, but it seems a little silly to copy all those parameters and all that documentation just so we can be explicit that "function disc accepts and returns pandas, while function disc_array_like accepts anything array-like and returns a dict".

@mikofski (Member) commented May 21, 2022

This is great! Good architecture always pays off with serendipity!

It appears that compute() is called under the hood somewhere. This is not good. So I still think we're better off sticking with dicts or tuples instead of a structured array.

I'm not sure I follow what this means. What is compute() and why is this not good? I think if we're not considering the ultra-core functions that are all numpy, there's really no need to return a structured array; it's really no different from the ordered dict. I thought it was faster, but timing-wise the ordered dict is faster:

import numpy as np
import pvlib as pvl

ghi = np.clip(np.exp(np.random.randn(10000,3))*100, a_max=1000, a_min=0)
ze = np.clip(np.exp(np.random.randn(10000,3))*10, a_max=80, a_min=0)
doy = np.random.randint(364, size=(10000,3))

x = pvl.irradiance.disc(ghi, ze, doy)

y = np.empty((10000,3), [('dni', float), ('kt', float), ('airmass', float)])
y['dni'] = x['dni']
y['kt'] = x['kt']
y['airmass'] = x['airmass']

%timeit x['dni']
# 39.6 ns ± 6.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit y['dni']
# 93.6 ns ± 1.8 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

But can an ordered dict...

# do this
y[['dni', 'kt']]  # all dni and kt

# or this
y[:100, 2]  # all fields, but only column 2 up to row 100

# or this?
pd.DataFrame(y[:, 1])  # make a dataframe from all fields for column 1

Anyway, like I said, I don't really see the need for structured arrays unless we switch to pure numpy functions that don't use anything else, only numpy.

@adriesse (Member)

Anyway, like I said, I don't really see the need for structured arrays unless we switch to pure numpy functions that don't use anything else, only numpy.

Do I understand correctly that if there is a single return "value", then there is agreement that returning a normal numpy array is fine? And that the primary motivation for structured arrays is the case of multiple return "values", so that they can be accessed by name?

I would be in favor of using tuples and forgetting about the names.

@wholmgren (Member) commented May 24, 2022

Do I understand correctly that if there is a single return "value", then there is agreement that returning a normal numpy array is fine?

Not quite. I am interpreting the request for "numpy-isation" as a request for "de-pandas-isation" and for array duck typing. Maybe I am not listening well!

Anyways, I am advocating for core functions that respect the user's choice of input type. So, in the case of a single return value, if I provide, say, a dask array in, then I'll get a dask array out. Same for xarray or cupy. Or pandas (but not explicitly bundled)! Core functions that strictly return numpy would be great for pure numpy users, but I think we can do better. The main thing is that core pvlib should not attempt to bundle things up into pandas objects (or anything else).

The __array_function__ protocol was introduced a few years ago. It's now possible to write functions that look like they are "pure numpy", but where numpy dispatches the actual work to the input object's own library. We don't have to do anything to take advantage of it other than use newer versions of our dependencies and write code that uses the numpy API instead of the pandas API. What we cannot do is explicitly cast the return values into a numpy array or structured array.
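
For example (a hypothetical function, just to show the pattern):

import numpy as np

def scale_and_clip(x, upper=2.0):
    # written against the numpy API only; np.minimum dispatches through the
    # array protocols, so a pandas Series comes back as a Series and a dask
    # array comes back as a lazy dask array, with no explicit casting
    return np.minimum(0.5 * x, upper)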

The array function protocol is not perfect (see NEP 37), but I think if we write non-opinionated functions then more and more things will just work over time, and there have been more recent developments in this direction.

I would be in favor of using tuples and forgetting about the names.

I am coming around to this. Lots of numpy and scipy functions return tuples of arrays. My hesitancy is over how to make this work for most of core pvlib without doubling the scope of the core API or going through a painful deprecation process.

What is compute() and why is this not good?

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation. It's also a rude API.

Like I said above, maybe I am not listening. Do you really want functions that always return numpy no matter what? Or do you only care that functions 1. work with numpy inputs and 2. return a tuple/dict of numpy outputs?

@adriesse (Member)

I'd have to spend a lot of time researching Python topics to understand everything discussed and mentioned in this issue, so I am not really qualified to comment, but anyway...

I do like it when a function that receives Series arguments returns a Series, but typically I'm building a DataFrame as I go and assigning an array to a column works just as well.

@mikofski (Member)

Like I said above, maybe I am not listening. Do you really want functions that always return numpy no matter what? Or do you only care that functions 1. work with numpy inputs and 2. return a tuple/dict of numpy outputs?

You've opened my eyes. I just want

  1. the numpy API
  2. ability to broadcast automatically as much as possible with the minimum of work like numpy does
  3. the speed and memory enhancements of numpy intrinsics like SIMD/cache etc.

Up to now I've been assuming numpy was the only way to get all three of these features, but now I realize there could be other backend tools that handle these dispatches even better. I'm all for that.

I would be in favor of using tuples and forgetting about the names.

I'm okay with either; I'm not married to numpy structured arrays. Maybe I was wrong about their usefulness, since everyone seems to keep burying them now that other tools are here. I thought they worked faster, but maybe not in the way that's most needed, or it might not be as important. For grabbing a field of data, a tuple or dictionary seems just as good, and more generic.

Some APIs, like HDF5, automatically return numpy structured arrays, and pandas has to_records(), which also returns a structured array.

@adriesse (Member)

@mikofski Beta was also better than VHS. :)
