
"Numpy-isation" of irradiance decomposition functions #1455

Open
alexisstgc opened this issue May 11, 2022 · 17 comments · May be fixed by #1456
Open

"Numpy-isation" of irradiance decomposition functions #1455

alexisstgc opened this issue May 11, 2022 · 17 comments · May be fixed by #1456

Comments

@alexisstgc
Copy link

Problem
I generate GHI data as maps and want to run it through irradiance decomposition models (specifically dirindex) to obtain DNI and DHI data. My data has three dimensions (time, latitude, longitude), so passing it through pvlib's built-in functions, which require pandas.DataFrame input, is neither easy nor efficient.

Solution
I have already adapted all the necessary functions to accept numpy arrays of any dimension, with the time dimension last (see the linked PR). Results are consistent with the output of pvlib's functions, and the computation is extremely fast compared to the alternative below. However, I am looking for some advice:

  • Is there another solution I did not consider?
  • Is numpy the best choice, or should I be using xarray so that the time dimension could be at any position?
  • Is this something that could/should be implemented throughout the whole of pvlib? Is it interesting?

Alternatives I've considered
I tried looping over all the lat/lon points and calling dirindex for each of them with multiprocessing, but it took too long and was too computationally heavy for the volume of data I was using.

alexisstgc linked a pull request (#1456) on May 11, 2022 that will close this issue.
@wholmgren (Member)

Hi @alexisstgc thanks for bringing this up. It's a tricky and important problem.

see the linked PR

The linked PR currently only shows a single line changed with a comment.

Is there another solution I did not consider?

A small number of pvlib functions return OrderedDict if the user provides higher-dimensional input. The idea was that results could still be accessed through key lookup (out['ghi'] works equally well with a DataFrame and a dict). Obviously we could use regular dicts now that they preserve order, but this code predates that feature.
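
For illustration (a minimal sketch, not pvlib code):

import pandas as pd

out_dict = {'ghi': [1000.0], 'dni': [800.0]}
out_df = pd.DataFrame(out_dict)

# the same key lookup works on both containers
print(out_dict['ghi'])  # a list
print(out_df['ghi'])    # a Series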

I've gone back and forth on whether I prefer labeled results or just a tuple. There was an issue with some discussion of this a year or two ago, but I don't recall where.

Is numpy the best choice, or should I be using xarray so that the time dimension could be at any position?

I am reluctant to endorse xarray because I worry that another user is going to come along and ask for support for something else (maybe something that hasn't been invented yet!). I'd prefer to maintain type consistency with inputs whenever possible.

Is this something that could/should be implemented throughout the whole of pvlib? Is it interesting?

I am generally in favor of pvlib functions accepting N-dimensional data, broadcasting as appropriate, and outputting something sensible.

There was also some related discussion regarding solar position calculations here: Unidata/MetPy#1440 (comment)

@kandersolar (Member)

Is there another solution I did not consider?

It is a bit clunky, but the general task of applying a function that is vectorized only across time to data with other (e.g. spatial) dimensions can often be achieved by reshaping the data into a very long 1-D representation, running the function, and then reshaping the result back to N-D.
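
Roughly, with made-up shapes and a stand-in function:

import numpy as np

def time_vectorized(x):
    # stand-in for a function that is vectorized over time only
    return 0.5 * x

ghi = np.random.rand(24, 10, 20)                 # hypothetical (time, lat, lon) cube
flat = ghi.reshape(24, -1).ravel(order='F')      # point 0's full series, then point 1's, ...
out = time_vectorized(flat)
result = out.reshape(10 * 20, 24).T.reshape(ghi.shape)  # back to (time, lat, lon)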

There was an issue with some discussion of this a year or two ago, but I don't recall where.

Maybe #959

@alexisstgc (Author) commented May 12, 2022

Hello @wholmgren and @kanderso-nrel! Thank you for your kind replies. I forgot to mention that this is my first time contributing to a public library, so don't hesitate to say something if you notice me doing something wrong or forgetting something 🤷‍♂️ 😅

The linked PR currently only shows a single line changed with a comment.

Yes, I pushed my changes just now, but I still have some work to do to pass the tests and the code-quality checks 😉

reshaping the data into a very long 1-D representation

Of course, how could I not think of this... The only problem I foresee is the times DatetimeIndex that has to be provided along with the multidimensional GHI data... But I will try it as soon as I can.
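
Something like this might work for the bookkeeping (a hedged sketch with hypothetical shapes; whether downstream functions tolerate a repeating index is another question):

import numpy as np
import pandas as pd

times = pd.date_range('2022-05-12', periods=24, freq='h')
ghi = np.random.rand(24, 10, 20)                 # (time, lat, lon)

n_points = 10 * 20
flat_ghi = ghi.reshape(24, -1).ravel(order='F')  # point 0's series, then point 1's, ...
flat_times = pd.DatetimeIndex(np.tile(times.values, n_points))  # index repeated to match
flat_series = pd.Series(flat_ghi, index=flat_times)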

@alexisstgc (Author)

BTW, in my PR I made some additional changes, and I would also like feedback on these:

  • I reduced the precision of some variables. Because of the size of my input data, it is important for me to keep variables as lightweight as possible, so I changed most of them from the default float64 to float32. But I suppose for other users, the more precision the better? (See the memory sketch after this list.)
  • I changed the bins of _dirint_bins() from int64 to int8, since they only hold integers from 0 to 7.
  • In dirindex, I encountered some problems when dni_dirint_clearsky is equal to zero, so I put in a workaround that seems reasonable.
  • In _dirint_coeffs, NaNs were previously inserted when no bin was assigned. I wanted to avoid NaNs as much as possible, so I put a 1 there instead, which falls back to the disc data.
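
For a sense of the memory impact (illustrative numbers only, not pvlib code):

import numpy as np

n = 1_000_000
print(np.zeros(n, dtype=np.float64).nbytes)  # 8000000 bytes
print(np.zeros(n, dtype=np.float32).nbytes)  # 4000000 bytes
print(np.zeros(n, dtype=np.int8).nbytes)     # 1000000 bytes; plenty for bin labels 0..7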

@mikofski (Member)

I had a couple of thoughts about this topic:

  1. I would prefer an even more core set of functions that deal strictly with NumPy arrays only; that would allow broadcasting, etc. These ultra-core functions could be wrapped in pandas glue to give consistent behavior with the existing functions.
  2. I'm not sure the OP's request is suitable for pvlib because, as Kevin alluded to, even NumPy is really dealing with a single 1-D array internally in memory; the array is allocated in chunks in the L2 cache and fed to the SIMD registers chunk by chunk. So if you have a 3-D spatial array as a time series, it might be so big that you would be better off with a different vectorization method like Dask and/or joblib (a rough sketch follows below); see: https://joblib.readthedocs.io/en/latest/auto_examples/parallel/distributed_backend_simple.html
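
A rough sketch of the joblib pattern from point 2 (hypothetical function and shapes, just the pattern):

from joblib import Parallel, delayed
import numpy as np

def decompose_point(series):
    # stand-in for a per-pixel decomposition call (e.g. dirindex on one lat/lon point)
    return 0.5 * series

ghi = np.random.rand(24, 10, 20)        # hypothetical (time, lat, lon)
points = ghi.reshape(24, -1).T          # one time series per spatial point

out = Parallel(n_jobs=-1)(delayed(decompose_point)(p) for p in points)
result = np.stack(out, axis=1).reshape(ghi.shape)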

@wholmgren (Member)

I would prefer an even more core set of functions that deal strictly with NumPy arrays only; that would allow broadcasting, etc. These ultra-core functions could be wrapped in pandas glue to give consistent behavior with the existing functions.

Wondering what this would look like in practice... Let's start with the simple case of disc:

I0 = get_extra_radiation(datetime_or_doy, 1370., 'spencer')
kt = clearness_index(ghi, solar_zenith, I0, min_cos_zenith=min_cos_zenith,
                     max_clearness_index=1)
am = atmosphere.get_relative_airmass(solar_zenith, model='kasten1966')
if pressure is not None:
    am = atmosphere.get_absolute_airmass(am, pressure)
Kn, am = _disc_kn(kt, am, max_airmass=max_airmass)
dni = Kn * I0
bad_values = (solar_zenith > max_zenith) | (ghi < 0) | (dni < 0)
dni = np.where(bad_values, 0, dni)
output = OrderedDict()
output['dni'] = dni
output['kt'] = kt
output['airmass'] = am
if isinstance(datetime_or_doy, pd.DatetimeIndex):
    output = pd.DataFrame(output, index=datetime_or_doy)
return output

get_extra_radiation probably needs some refactoring to support that use case. I think the other calls and operations already work with arbitrary input. Would you return before the OrderedDict? Before isinstance?

I think the OP's problem is well-posed. And this is not the first time we've seen users struggle against multi-dimensional input with pvlib functions. The array/Series changes in the linked PR seemed reasonable to me, though I didn't give it a full review. As @kanderso-nrel said, reshaping numpy arrays to work with pvlib functions is clunky - maybe workable in isolation, but probably not if you want to stitch steps together.

@mikofski (Member)

get_extra_radiation probably needs some refactoring to support that use case.

Many core functions might have to be refactored. Maybe even get_*_airmass().

I think the other calls and operations already work with arbitrary input.

I was hoping that this might be true, by some dumb luck and good architecture, but we'll see.

Would you return before the OrderedDict?

No, a pure NumPy version of disc would only take NumPy or array-like input, and would only return NumPy structured arrays, so there would never be a need for OrderedDict.

Before isinstance?

Yes; the isinstance check would be unnecessary, since there would be only one contract and one dispatch.

Here's a quick cheat to broadcast the output into a structured array:

dtype = np.dtype([('dni', np.float64), ('kt', np.float64), ('airmass', np.float64)])
shape = dni.shape
output = np.empty(shape, dtype)  # instead of OrderedDict()
output['dni'] = dni
output['kt'] = kt
output['airmass'] = am

This works as long as dni, kt, and am all have the same shape.

@adriesse (Member)

Good initiative and ideas. I like the idea of core numpy-only functions.

@wholmgren (Member)

The structured array documentation seems to suggest this is a bad fit for our use case:

Structured datatypes are designed to be able to mimic ‘structs’ in the C language, and share a similar memory layout. They are meant for interfacing with C code and for low-level manipulation of structured buffers, for example for interpreting binary blobs. For these purposes they support specialized features such as subarrays, nested datatypes, and unions, and allow control over the memory layout of the structure.

Users looking to manipulate tabular data, such as stored in csv files, may find other pydata projects more suitable, such as xarray, pandas, or DataArray. These provide a high-level interface for tabular data analysis and are better optimized for that use. For instance, the C-struct-like memory layout of structured arrays in numpy can lead to poor cache behavior in comparison.

I've not used structured arrays except for tinkering, so I don't have any personal experience to add to that.

I am still not understanding what the API is for users that want a DataFrame with columns dni, kt, and airmass (the pandas glue). How does that work in practice with the lower-level disc function?

@mikofski (Member)

Haha 🤣 I must be missing something totally obvious. I use struct arrays all the time, and I believe they are by far the fastest. See this gist: by my test, structured arrays are at least 200x faster than xarray and 13x faster than pandas. Also, as in my quick-cheat example above, broadcasting becomes trivial. For very simple struct arrays, I don't think the syntax is that hard; you just need to define the dtype in advance.

The API would be something like:

f_high_level(x_in: DataFrame, xarray, etc.) -> DataFrame, xarray, etc.
    # convert input to numpy
    # call low-level numpy functions
    # convert numpy back to the input type

f_low_level(x_in: numpy) -> numpy
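
Concretely, the glue might look something like this (hypothetical names, with stand-in arithmetic rather than the real DISC model):

import numpy as np
import pandas as pd

def disc_core(ghi, solar_zenith, doy):
    # ultra-core: numpy in, numpy out (placeholder math only; solar_zenith
    # and doy are kept just to mirror the real signature)
    kt = ghi / 1000.0
    dni = 800.0 * kt
    return dni, kt

def disc_frame(ghi, solar_zenith, times):
    # pandas glue: convert inputs to numpy, call the core, rebuild a DataFrame
    dni, kt = disc_core(np.asarray(ghi), np.asarray(solar_zenith),
                        times.dayofyear.to_numpy())
    return pd.DataFrame({'dni': dni, 'kt': kt}, index=times)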

@wholmgren (Member)

I was surprised to find that I had already written 2D array tests for clearness index. The airmass functions are tested with 1D numpy arrays. get_extra_radiation works with 2D array input, but it's not tested. So, no surprise then, that disc works with N-D array input:

In [43]: pvlib.irradiance.disc(np.array([[0, 250], [500, 750]], dtype=np.float32),
    ...:                       np.array([[100, 70], [40, 20]], dtype=np.float32),
    ...:                       np.array([[100, 100], [100, 100]], dtype=np.float32))
Out[43]:
OrderedDict([('dni',
              array([[  0.        , 414.87658824],
                     [165.11445531, 307.28344503]])),
             ('kt',
              array([[0.        , 0.53562295],
                     [0.47828513, 0.58485246]], dtype=float32)),
             ('airmass',
              array([[       nan, 2.89994597],
                     [1.30367947, 1.0634042 ]]))])

Why is float32 preserved for kt but not the others? It turns out the type promotion happens in get_absolute_airmass: float32 * int yields float64 if the int is large enough to cause precision issues, and disc has a default pressure=101325 (no decimal point, so it's an int). We can make get_absolute_airmass more robust by grouping the pressure normalization with parentheses (see the sketch below). Or the user can provide a float pressure input. And/or we can change disc's default pressure to a float.
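
To see the grouping fix in isolation (a hedged sketch; the exact promotion behavior depends on the NumPy version and its casting rules):

import numpy as np

am = np.ones(3, dtype=np.float32)
pressure = 101325                  # disc's default is an int

a = am * pressure / 101325.        # can promote to float64, as observed above
b = am * (pressure / 101325.)      # scalar ratio computed first; stays float32
print(a.dtype, b.dtype)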

In [64]: pvlib.irradiance.disc(np.array([[0, 250], [500, 750]], dtype=np.float32),
    ...:                       np.array([[100, 70], [40, 20]], dtype=np.float32),
    ...:                       np.array([[100, 100], [100, 100]], dtype=np.float32),
    ...:                       pressure=101325.)
Out[64]:
OrderedDict([('dni',
              array([[  0.     , 414.87656],
                     [165.1144 , 307.28342]], dtype=float32)),
             ('kt',
              array([[0.        , 0.53562295],
                     [0.47828513, 0.58485246]], dtype=float32)),
             ('airmass',
              array([[      nan, 2.899946 ],
                     [1.3036796, 1.0634042]], dtype=float32))])

Ok, now I want to use dask arrays...

In [21]: out_da = irradiance.disc(da.from_array(ghi), da.from_array(solar_zenith), da.from_array(doy), pressure=101325.)

In [22]: out_da
Out[22]:
OrderedDict([('dni',
              dask.array<where, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>),
             ('kt',
              dask.array<minimum, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>),
             ('airmass',
              dask.array<minimum, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>)])

In [23]: out_da['dni'].compute()
Out[23]:
array([[  0.     , 414.87656],
       [165.1144 , 307.28342]], dtype=float32)

It just works!

What if we used a structured array instead of a dict? Following @mikofski's example of a structured array...

In [43]: output = np.empty(shape, dtype)

In [44]: output['dni'] = out_da['dni']

In [45]: output['kt'] = out_da['kt']

In [46]: output['airmass'] = out_da['airmass']

In [47]: output
Out[47]:
array([[(  0.        , 0.        ,        nan),
        (414.8765564 , 0.53562295, 2.89994597)],
       [(165.11439514, 0.47828513, 1.30367959),
        (307.28341675, 0.58485246, 1.0634042 )]],
      dtype=[('dni', '<f8'), ('kt', '<f8'), ('airmass', '<f8')])

It appears that compute() is called under the hood somewhere. This is not good. So I still think we're better off sticking with dicts or tuples instead of a structured array.

So, at least for this function, we mostly have a documentation and testing problem. I'm not sure what to do about the API, but it seems a little silly to copy all those parameters and all that documentation just so we can be explicit that "function disc accepts and returns pandas, while function disc_array_like accepts anything array-like and returns a dict".

@mikofski (Member) commented May 21, 2022

This is great! Good architecture always pays off with serendipity!

It appears that compute() is called under the hood somewhere. This is not good. So I still think we're better off sticking with dicts or tuples instead of a structured array.

I'm not sure I follow what this means. What is compute() and why is this not good? I think if we're not considering the ultra-core functions that are all numpy, there's really no need to return a structured array; it's really no different from the ordered dict. I thought it was faster, but timing-wise the ordered dict is faster:

import numpy as np
import pvlib as pvl

ghi = np.clip(np.exp(np.random.randn(10000,3))*100, a_max=1000, a_min=0)
ze = np.clip(np.exp(np.random.randn(10000,3))*10, a_max=80, a_min=0)
doy = np.random.randint(364, size=(10000,3))

x = pvl.irradiance.disc(ghi, ze, doy)

y = np.empty((10000,3), [('dni', float), ('kt', float), ('airmass', float)])
y['dni'] = x['dni']
y['kt'] = x['kt']
y['airmass'] = x['airmass']

%timeit x['dni']
# 39.6 ns ± 6.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit y['dni']
# 93.6 ns ± 1.8 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

But can an ordered dict...

# do this
y[['dni', 'kt']]  # all dni and kt

# or this
y[:100, 2]  # all fields, but only column 2 up to row 100

# or this?
pd.DataFrame(y[:, 1])  # make a dataframe from all fields for column 1

Anyway, like I said, I don't really see the need for structured arrays unless we switch to pure numpy functions that don't use anything else, only numpy.

@adriesse (Member)

Anyway, like I said, I don't really see the need for structured arrays unless we switch to pure numpy functions that don't use anything else, only numpy.

Do I understand correctly that if there is a single return "value", then there is agreement that returning a normal numpy array is fine? And that the primary motivation for structured arrays is the case of multiple return "values", so that they can be accessed by name?

I would be in favor of using tuples and forgetting about the names.

@wholmgren (Member) commented May 24, 2022

Do I understand correctly that if there is a single return "value", then there is agreement that returning a normal numpy array is fine?

Not quite. I am interpreting the request for "numpy-isation" as a request for "de-pandas-isation" and for array duck typing. Maybe I am not listening well!

Anyways, I am advocating for core functions that respect the user's choice of input type. So, in the case of a single return value, if I provide, say, a dask array in, then I'll get a dask array out. Same for xarray or cupy. Or pandas (but not explicitly bundled)! Core functions that strictly return numpy would be great for pure numpy users, but I think we can do better. The main thing is that core pvlib should not attempt to bundle things up into pandas objects (or anything else).

The __array_function__ protocol was introduced a few years ago. It's now possible to write functions that look like they are "pure numpy", but where numpy dispatches the actual work to the input object's own library. We don't have to do anything to take advantage of it other than use newer versions of our dependencies and write code that uses the numpy API instead of the pandas API. What we cannot do is explicitly cast the return values into a numpy array or structured array.
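
For example (a hypothetical function, just to show the pattern):

import numpy as np

def scale_and_clip(x, upper=2.0):
    # written against the numpy API only; np.minimum dispatches through the
    # array protocols, so a pandas Series comes back as a Series and a dask
    # array comes back as a lazy dask array, with no explicit casting
    return np.minimum(0.5 * x, upper)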

The array function protocol is not perfect (see NEP 37), but I think if we write non-opinionated functions then more and more things will just work over time, and there have been more recent developments in this direction.

I would be in favor of using tuples and forgetting about the names.

I am coming around to this. Lots of numpy and scipy functions return tuples of arrays. My hesitancy is over how to make this work for most of core pvlib without doubling the scope of the core API or going through a painful deprecation process.

What is compute() and why is this not good?

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation. It's also a rude API.

Like I said above, maybe I am not listening. Do you really want functions that always return numpy no matter what? Or do you only care that functions 1. work with numpy inputs and 2. return a tuple/dict of numpy outputs?

@adriesse (Member)

I'd have to spend a lot of time researching Python topics to understand everything discussed and mentioned in this issue, so I am not really qualified to comment, but anyway...

I do like it when a function that receives Series arguments returns a Series, but typically I'm building a DataFrame as I go and assigning an array to a column works just as well.

@mikofski (Member)

Like I said above, maybe I am not listening. Do you really want functions that always return numpy no matter what? Or do you only care that functions 1. work with numpy inputs and 2. return a tuple/dict of numpy outputs?

You've opened my eyes. I just want

  1. the numpy API
  2. ability to broadcast automatically as much as possible with the minimum of work like numpy does
  3. the speed and memory enhancements of numpy intrinsics like SIMD/cache etc.

Up to now I've been assuming numpy was the only way to get all three of these features, but now I realize there could be other backend tools that handle these dispatches even better. I'm all for that.

I would be in favor of using tuples and forgetting about the names.

I'm okay with either; I'm not married to numpy structured arrays. Maybe I was wrong about their usefulness, since everyone seems to keep burying them now that other tools are here. I thought they worked faster, but maybe not in the way that's most needed, or it might not be as important. For grabbing a field of data, a tuple or dictionary seems just as good, and more generic.

Some APIs, like HDF5, automatically return numpy structured arrays, and pandas has to_records(), which also returns a structured array.

@adriesse (Member)

@mikofski Beta was also better than VHS. :)
