[bump minor] Pk1d covariance new #1067
Conversation
Ok, this sounds like a great improvement. I guess 10 NERSC minutes should be much better than the previous version, and at least similar structures are obtained. How small is small? E.g. is a smoothing step necessary to make any sense of the right/bottom third of the covariance? Could you run this on a significantly larger set of data? I'll go through the code changes now...
This looks fine overall; it needs somewhat more commenting, and it would be good to have some of the selections defined in one place for flexibility and easier maintenance.
return covariance_array

def compute_cov_not_vectorized(
this is not only not vectorized, but the actual complete old method, i.e. not only slower but also potentially doing slightly different things by actually summing up each individual mode every time. Maybe clarify that in the docstring.
That is what we are doing in the new method too, no?
I have added a comment on this function.
yes, but in the old one the approach is split in a completely different way, looping over every mode, while in the new one we precompute intermediate summaries that are then reused multiple times, and in addition do everything in a vectorized way...
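To make the distinction being discussed concrete, here is a minimal sketch (not picca's actual implementation; the array shapes and function names are invented for illustration) contrasting a per-mode accumulation loop with a vectorized version that precomputes the weighted sums once and reuses them:

```python
import numpy as np

def cov_not_vectorized(p1d, weights):
    """Old-style approach: loop over chunks and every (i, j) mode pair."""
    n_chunks, n_k = p1d.shape
    sum_wp = np.zeros((n_k, n_k))
    sum_w = np.zeros((n_k, n_k))
    for c in range(n_chunks):
        for i in range(n_k):
            for j in range(n_k):
                w = weights[c, i] * weights[c, j]
                sum_wp[i, j] += w * p1d[c, i] * p1d[c, j]
                sum_w[i, j] += w
    mean_pk = np.average(p1d, axis=0, weights=weights)
    return sum_wp / sum_w - np.outer(mean_pk, mean_pk)

def cov_vectorized(p1d, weights):
    """New-style approach: precompute weighted sums once, reuse them."""
    wp = weights * p1d
    sum_wp = wp.T @ wp            # sum over chunks of (w_i p_i)(w_j p_j)
    sum_w = weights.T @ weights   # sum over chunks of w_i w_j
    mean_pk = np.average(p1d, axis=0, weights=weights)
    return sum_wp / sum_w - np.outer(mean_pk, mean_pk)
```

Both routes compute the same estimator; the vectorized one just replaces the triple loop with two matrix products.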
Comment added
py/picca/pk1d/postproc_pk1d.py (Outdated)
mean_pk_product = np.outer(mean_pk, mean_pk)

sum_p1d_weights = np.nansum(p1d_weights, axis=0)
weights_sum_product = np.outer(sum_p1d_weights, sum_p1d_weights)
this is very confusing naming given that below you have weights_product_sum
Yes, but we do indeed have a sum of products and a product of sums. I renamed them with the "of"; it should be a little clearer.
maybe adding what is summed into the name could help, but I'm not sure...
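For readers following the naming discussion, a toy numpy example (values and variable names invented for illustration) of the two distinct quantities:

```python
import numpy as np

weights = np.array([[1.0, 2.0],
                    [3.0, 4.0]])  # shape (n_chunks, n_k)

# "product of sums": sum over chunks first, then take the outer product
sum_weights = np.nansum(weights, axis=0)              # [4., 6.]
product_of_sums = np.outer(sum_weights, sum_weights)  # [[16., 24.], [24., 36.]]

# "sum of products": outer product per chunk, then sum over chunks
sum_of_products = np.einsum("ci,cj->ij", weights, weights)
# [[10., 14.], [14., 20.]] -- a genuinely different quantity
```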
else:
    p1d_sub_table["weight"] = 1

p1d_sub_table = p1d_sub_table[p1d_sub_table["k_index"] >= 0]
are we storing negative k_index somewhere? Why is this cut needed?
This was a convention for the case where a k bin computed for the P1D does not fall inside the output k binning. I removed those bins because they do not contribute to the covariance.
This cut is needed because I made a method to compute the P1D groups in a vectorized way, and a -1 index would change the last k bin, whereas it should not change any bin.
I still don't fully get how this line changes anything. Wouldn't the selector always be true? Or are there NaNs or -1s in that table for bins that just aren't filled and that you want to remove here (in which case I would test for not having exactly that filler value, rather than via >= 0)?
With the current implementation, this just removes P1D pixels that were not associated with any wavenumber bin.
I previously made a method for which this cut was necessary; now it just reduces the input dataset a little by removing unnecessary pixels.
Comment added.
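A toy sketch (hypothetical values and binning, not picca's implementation) of why a -1 filler index is dangerous in vectorized accumulation, and why cutting it away is safe:

```python
import numpy as np

# Pixels whose wavenumber falls outside the output binning get k_index = -1
k_edges = np.array([0.0, 0.1, 0.2, 0.3])
k_values = np.array([0.05, 0.15, 0.25, 0.45])  # last one is out of range

k_index = np.digitize(k_values, k_edges) - 1
k_index[(k_values < k_edges[0]) | (k_values >= k_edges[-1])] = -1
# k_index is now [0, 1, 2, -1]

# Without the cut, the -1 wraps around in fancy indexing and silently
# pollutes the *last* k bin:
counts = np.zeros(len(k_edges) - 1)
np.add.at(counts, k_index, 1.0)
# counts is [1., 1., 2.] -- bin 2 received the out-of-range pixel too

# With the cut, only valid pixels contribute:
valid = k_index >= 0
counts_ok = np.zeros(len(k_edges) - 1)
np.add.at(counts_ok, k_index[valid], 1.0)
# counts_ok is [1., 1., 1.]
```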
""" | ||
|
||
select_z = (p1d_table["forest_z"] < zbin_edges[izbin + 1]) & ( | ||
p1d_table["forest_z"] > zbin_edges[izbin] |
this is the p1d_table of each chunk and not the mean_p1d_table of the combined stats, so my comments regarding indexing shouldn't apply here...
    nbins_k,
)
if number_worker == 1:
    output_cov = [func(*p1d_los) for p1d_los in p1d_los_table]
not sure if this should be *p1d_los or just p1d_los; I'd have guessed the latter. Was this tested on a single core as well?
this is *p1d_los because p1d_los is a table with 5 arrays that we give to func. Yes, I tested it.
but why isn't that needed below in the mapped version? Shouldn't that be a starmap then, or similar? But I guess if it works it works...
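As a side note on the map/starmap question, a minimal sketch (illustrative names only; a thread-backed Pool is used so it runs anywhere, but multiprocessing.Pool has the same starmap API) of why the single-worker path needs the explicit * unpacking while the pooled path uses starmap:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

def func(a, b, c):
    # stands in for the real worker that takes several arrays as arguments
    return a + b + c

p1d_los_table = [(1, 2, 3), (4, 5, 6)]

# single-worker path: each row is a tuple, so it must be unpacked with *
serial = [func(*p1d_los) for p1d_los in p1d_los_table]

# multi-worker path: starmap does the unpacking; a plain pool.map would
# pass each whole tuple as one argument and raise a TypeError here
with Pool(2) as pool:
    parallel = pool.starmap(func, p1d_los_table)
```

So the two code paths are consistent: the list comprehension unpacks by hand, and starmap unpacks for you.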
i_max = (izbin + 1) * nbins_k * nbins_k
cov_table["covariance"][i_min:i_max] = covariance_array

if compute_bootstrap:
is the bootstrap method tested? If yes, it would be nice to know its results; if not, please at least add comments that this isn't tested yet...
The bootstrap method works and gives very similar results to the classical covariance, which makes sense considering the number of chunks.
All the Y1 plots I am showing use the bootstrap.
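For context, here is a minimal sketch of a chunk-level bootstrap covariance (invented data and function names, not the picca implementation), showing why the bootstrap and direct estimates should agree closely when the number of chunks is large:

```python
import numpy as np

def bootstrap_cov(p1d, n_boot, rng):
    """Average the covariance over bootstrap resamplings of the chunks."""
    n_chunks, n_k = p1d.shape
    covs = np.empty((n_boot, n_k, n_k))
    for b in range(n_boot):
        idx = rng.integers(0, n_chunks, size=n_chunks)  # resample chunks with replacement
        covs[b] = np.cov(p1d[idx], rowvar=False)
    return covs.mean(axis=0)

rng = np.random.default_rng(0)
p1d = rng.normal(size=(500, 6))          # 500 chunks, 6 k bins
cov_boot = bootstrap_cov(p1d, n_boot=50, rng=rng)
cov_direct = np.cov(p1d, rowvar=False)
# with many chunks, the two estimates are close
```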
Ouch, this looks much more correlated than I expected... Can you change the colormap so that it actually goes from -1 to 1, and maybe also so that it's white/yellow for 0 correlation and diverging outside? I looked at it and at first glance thought green was zero, but that isn't actually right; it looks like, except when comparing the largest and smallest modes available, things are >50% correlated everywhere.
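The colormap suggestion can be sketched as follows (hypothetical data; the key points are the diverging colormap and pinning vmin/vmax to -1 and 1 so that zero correlation sits at the white midpoint):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical correlation matrix, just for illustration
rng = np.random.default_rng(1)
samples = rng.normal(size=(100, 8))
corr = np.corrcoef(samples, rowvar=False)

# Diverging colormap pinned to the full [-1, 1] range, so that zero
# correlation maps to the white midpoint of "RdBu_r"
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1.0, vmax=1.0)
fig.colorbar(im, ax=ax, label="correlation")
fig.savefig("correlation.png")
plt.close(fig)
```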
maybe we should also add a line for the covar computation here: picca/py/picca/tests/test_2_pk1d.py, lines 88 to 97 in 02f6c2e, and compare the output to a saved state. That would at least make sure this code is run regularly (e.g. useful for library version changes) and that we don't change things by accident, even if the covar of so few spectra will be garbage...
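The proposed saved-state comparison could look roughly like this (the file name and helper are placeholders, not the actual picca test):

```python
import numpy as np

def check_against_reference(covariance, reference_file):
    """Compare a freshly computed covariance to a stored reference state."""
    reference = np.load(reference_file)
    np.testing.assert_allclose(covariance, reference, rtol=1e-6)

# Illustration: save a reference state once, then compare later runs to it
reference_cov = np.eye(3)
np.save("reference_cov.npy", reference_cov)
check_against_reference(np.eye(3), "reference_cov.npy")  # passes silently
```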
Given that this PR seems to do what it's supposed to, I'd be fine with merging. Please consider adding comments to the code where there are still semi-open questions (so that in the future we don't e.g. wonder why we did certain steps and remove them). Also please consider adding a line to the test in your favourite config with the relevant output (or send me the line you'd like to test and I'll add it).
Comments and tests added; waiting for pytest. If successful, let's merge it.
if one would like to use the bootstrap in testing, maybe allow setting a seed in the command-line call via an additional argument and initialize an
Since the bootstrap here is just another run of the covariance computation, it is not essential to test it.
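The seed suggestion above could be sketched as follows (the argument name and usage are illustrative, not picca's actual CLI):

```python
import argparse
import numpy as np

# --seed is an illustrative flag name for making bootstrap draws reproducible
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None,
                    help="seed for the bootstrap random generator")
args = parser.parse_args(["--seed", "42"])  # simulate a command-line call

rng = np.random.default_rng(args.seed)
n_chunks = 10
# one bootstrap realisation: draw chunk indices with replacement
resampled = rng.integers(0, n_chunks, size=n_chunks)
```

With a fixed seed, every test run draws the same resampled chunk indices, so the bootstrap covariance can be compared to a saved reference.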
Two improvements: