
[bump minor] Pk1d covariance new #1067

Merged
merged 11 commits into from
Jul 15, 2024

Conversation

corentinravoux
Contributor

Two improvements:

  • Implementation of a new estimate of the covariance matrix, detailed here: https://www.overleaf.com/7592125432tksctmwfpngw#c4fed4
  • Covariance matrix calculation computed in a vectorized way and with memory optimization. It now runs in ~10 NERSC node-minutes for a Y1 sample.
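For reference, a weighted, vectorized covariance estimate over chunks can be sketched roughly as follows (a minimal toy illustration with hypothetical names, not the actual picca implementation; the exact estimator is defined in the linked Overleaf note):

```python
import numpy as np

def weighted_covariance(pk, weights):
    """Weighted covariance over chunks, fully vectorized.

    pk, weights: arrays of shape (n_chunks, nbins_k).
    C_ij = sum_c w_ci w_cj (p_ci - m_i)(p_cj - m_j) / sum_c w_ci w_cj,
    with m the per-bin weighted mean over chunks.
    """
    mean = (weights * pk).sum(axis=0) / weights.sum(axis=0)
    d = weights * (pk - mean)            # weighted residuals per chunk
    # matrix products replace the explicit double loop over (i, j) bins
    return (d.T @ d) / (weights.T @ weights)

rng = np.random.default_rng(0)
pk = rng.normal(size=(1000, 8))
cov = weighted_covariance(pk, np.ones_like(pk))
# with unit weights this reduces to the biased sample covariance
assert np.allclose(cov, np.cov(pk, rowvar=False, bias=True))
```

The vectorization replaces a per-mode double loop with two matrix products, which is where the large speed-up comes from.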

@corentinravoux
Contributor Author

[image: covariance matrix comparison]
Comparison of the covariance matrix calculation on a small example (left: before, right: after). They are not identical since the definition has changed, but their shapes are similar.

@corentinravoux corentinravoux self-assigned this Jul 5, 2024
@corentinravoux corentinravoux linked an issue Jul 5, 2024 that may be closed by this pull request
@Waelthus
Contributor

Waelthus commented Jul 5, 2024

Ok, this sounds like a great improvement. I guess 10 NERSC minutes should be much better than the previous version, and at least similar structures are obtained. How small is small? E.g. is a smoothing step necessary to make any sense out of the right/bottom 1/3 of the covariance?
Also, do we have an idea where the subtle differences come from? Is this just numerical, e.g. from a different order of operations, or is the difference in method actually analytically expected to produce slightly different results? E.g. at the strong correlation spikes around pixel [0, 8] (is that correlation due to the SiII/SiIII correlations? Or something to do with DLAs/continuum fitting? It looks like the longest-range correlation we have in here, but somewhat weaker in the new approach...)
I guess we cannot do inter-redshift-bin correlations just yet, but that should be ok...

Could you run this on a significantly larger set of data? I'll go through the code changes now...

@corentinravoux
Contributor Author

corentinravoux commented Jul 5, 2024

This is a very small example, like 2-3% of Y1, just for the test.
The main difference between the estimators is that the weights are now treated properly in the normalization of the whole covariance matrix, so its profile can change according to the average shape of the weights over the redshift range. I think it is hard to interpret.
For "the strong correlation spikes around pixel [0, 8]", I think this is due to the lack of modes, which increases the statistical error bar at very large scales. This can also be impacted by what you mention.

In the Y1 run, with systematics included (I will show it at the upcoming DESI meeting):
[image]

Contributor

@Waelthus Waelthus left a comment


This looks fine overall; it needs somewhat more commenting, and it would be good to have some of the selections defined in one place for flexibility/easier maintenance.

py/picca/pk1d/postproc_pk1d.py (resolved)
py/picca/pk1d/postproc_pk1d.py (resolved)
return covariance_array


def compute_cov_not_vectorized(
Contributor

this is not only not vectorized, but actually the complete old method, i.e. not only slower but also potentially doing slightly different things, by actually summing up each individual mode every time. Maybe clarify that in the docstring

Contributor Author

That is what we are doing in the new method too, no?
I have added a comment on this function

Contributor

yes, but in the old one the approach is split in a completely different way, by looping over every mode, while in the new one we precompute intermediate summaries that are then reused multiple times, and in addition do everything in a vectorized way...

Contributor Author

Comment added

py/picca/pk1d/postproc_pk1d.py (resolved)
mean_pk_product = np.outer(mean_pk, mean_pk)

sum_p1d_weights = np.nansum(p1d_weights, axis=0)
weights_sum_product = np.outer(sum_p1d_weights, sum_p1d_weights)
Contributor

this is very confusing naming given that below you have weights_product_sum

Contributor Author

Yes, but we indeed have a sum of products and a product of sums. I renamed them with the "of"; it should be a little clearer

Contributor

maybe adding what is summed into the name could help, but I'm not sure...
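To illustrate the distinction the names are trying to capture (a toy example with hypothetical names, not the picca code): the outer product of summed weights and the sum of per-chunk outer products are genuinely different matrices:

```python
import numpy as np

w = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # two chunks, two k bins

# product of sums: sum the weights over chunks first, then take the outer product
product_of_sums = np.outer(w.sum(axis=0), w.sum(axis=0))

# sum of products: per-chunk outer products, then sum over chunks
sum_of_products = sum(np.outer(row, row) for row in w)

assert product_of_sums.tolist() == [[16.0, 24.0], [24.0, 36.0]]
assert sum_of_products.tolist() == [[10.0, 14.0], [14.0, 20.0]]
```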

py/picca/pk1d/postproc_pk1d.py (outdated, resolved)
else:
p1d_sub_table["weight"] = 1

p1d_sub_table = p1d_sub_table[p1d_sub_table["k_index"] >= 0]
Contributor

are we storing negative k_index somewhere? Why is this cut needed?

Contributor Author

This was a convention for the case where a k bin computed for the p1d does not fall inside the output k binning. I removed those bins as they do not contribute to the covariance.

Contributor Author

This cut is needed because I made a method to compute the p1d groups in a vectorized way, and the -1 values would change the last k bin, while they should not change any bin

Contributor

I still don't fully get how this line changes anything. Wouldn't the selector always be true? Or are there NaNs or -1s in that table for bins that just aren't filled and that you want to remove here (in which case I would test for not having exactly that filler value, rather than via >= 0)?

Contributor Author

With the current implementation, this is just to remove p1d pixels which were not associated with any wavenumber bin.
I made a method previously for which this cut was necessary, but now it is just to reduce the input dataset a little, removing unnecessary pixels.
Comment added.

"""

select_z = (p1d_table["forest_z"] < zbin_edges[izbin + 1]) & (
p1d_table["forest_z"] > zbin_edges[izbin]
Contributor

this is the p1d_table of each chunk and not the mean_p1d_table of the combined stats, so my comments regarding indexing shouldn't apply here...

nbins_k,
)
if number_worker == 1:
output_cov = [func(*p1d_los) for p1d_los in p1d_los_table]
Contributor

not sure if this should be *p1d_los or just p1d_los; I'd have guessed the latter. Was this tested on a single core as well?

Contributor Author

this is *p1d_los because p1d_los is a table with 5 arrays that we pass to func. Yes, I tested it

Contributor

but why isn't that needed below in the mapped version? Shouldn't that be a starmap then, or similar? But I guess if it works, it works...
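The difference between the two calling conventions can be checked in isolation (a toy example; multiprocessing.Pool.starmap unpacks argument tuples the same way itertools.starmap does):

```python
from itertools import starmap

def func(a, b, c):
    return a + b + c

rows = [(1, 2, 3), (4, 5, 6)]

# serial branch: unpack each row explicitly with *row
serial = [func(*row) for row in rows]

# mapped branch: starmap unpacks the tuples for us; a plain map(func, rows)
# would call func((1, 2, 3)) and fail with a missing-arguments TypeError
mapped = list(starmap(func, rows))

assert serial == mapped == [6, 15]
```

So the serial list comprehension needs the explicit `*`, while a starmap-style parallel call does the unpacking itself.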

i_max = (izbin + 1) * nbins_k * nbins_k
cov_table["covariance"][i_min:i_max] = covariance_array

if compute_bootstrap:
Contributor

is the bootstrap method tested? If yes, it would be nice to know its results; if not, please at least add comments that this isn't tested yet...

Contributor Author

The bootstrap method works and gives results very similar to the classical covariance, which makes sense considering the number of chunks.
All the Y1 plots I am showing use the bootstrap.

@Waelthus
Contributor

Waelthus commented Jul 5, 2024

In Y1 run, associated to systematics (I will show it at the upcoming DESI meeting)

Ouch, this looks much more correlated than I expected... Can you change the colormap so that it actually goes from -1 to 1, and maybe also make it white/yellow at 0 correlation and diverging outside? I looked at it and at first glance thought green was zero, but that isn't actually right, and it looks like, except when comparing the largest and smallest modes available, things are >50% correlated everywhere

@corentinravoux
Contributor Author

Be careful interpreting that, because the systematic covariance matrix is computed in a very simplified way, thus boosting the correlations.
Bootstrap stat-only covariance:
[image]
stat + syst:
[image]
As you can see, with the simple syst cov model it is exaggerated

@Waelthus
Contributor

maybe we should also add a line for covar computation here:

shutil.copy(self._masterFiles + "/test_Pk1D/Pk1D.fits.gz",
self._branchFiles + "/Products/meanPk1D")
print(os.listdir(self._branchFiles + "/Products/meanPk1D"))
cmd = "picca_Pk1D_postprocess.py "
cmd += " --in-dir " + self._branchFiles + "/Products/meanPk1D"
cmd += " --output-file " + self._branchFiles + "/Products/meanPk1D/meanPk1D.fits.gz"
#- small sample => k,z-bins changed wrt default ones
cmd += " --zedge-min 2.1 --zedge-max 3.1 --zedge-bin 0.2"
cmd += " --kedge-min 0.015 --kedge-max 0.035 --kedge-bin 0.005"
picca_Pk1D_postprocess.main(cmd.split()[1:])

and compare the output to a saved state. That would at least make sure this code is run regularly (e.g. useful for library version changes) and that we don't change stuff by accident, even if the covariance of so few spectra will be garbage...

Contributor

@Waelthus Waelthus left a comment

Given that this PR seems to do what it's supposed to, I'd be fine with merging. Please consider adding comments to the code where there are still semi-open questions (so that we e.g. don't wonder why we did certain steps in the future and remove those). Also please consider adding a line to the test in your favourite config with the relevant output (or send me the line you'd like to test and I'd add it).

@corentinravoux
Contributor Author

Comments and tests added; waiting for pytest. If successful, let's merge it

@Waelthus
Contributor

if one would like to use the bootstrap in testing, maybe allow setting a seed in the command-line call via an additional argument, and initialize an np.random.default_rng with that seed if given (or without it if not); then the bootstrap could also run deterministically.
But it's not really needed here, and I'm glad there is a test at all
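A minimal sketch of that suggestion (hypothetical function and argument names, not the picca CLI): thread the seed through to np.random.default_rng, which accepts None for non-deterministic behaviour:

```python
import numpy as np

def bootstrap_indices(n_chunks, n_bootstrap, seed=None):
    """Draw chunk indices for bootstrap resampling.

    With seed=None the resampling is non-deterministic; with a fixed
    seed it is reproducible, which makes it usable in a regression test.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_chunks, size=(n_bootstrap, n_chunks))

a = bootstrap_indices(100, 5, seed=42)
b = bootstrap_indices(100, 5, seed=42)
assert (a == b).all()          # same seed, same resampled chunk sets
assert a.shape == (5, 100)
```

The seed would simply be forwarded from the command-line argument, defaulting to None so current behaviour is unchanged.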

@corentinravoux
Contributor Author

Here, since the bootstrap is just a rerun of the covariance matrix calculation, it is not essential to test it.
But yes, the seed for the bootstrap can be added later.
I think we can merge now

@Waelthus Waelthus added this pull request to the merge queue Jul 15, 2024
@Waelthus Waelthus removed this pull request from the merge queue due to a manual request Jul 15, 2024
@Waelthus Waelthus changed the title Pk1d covariance new [bump minor] Pk1d covariance new Jul 15, 2024
@Waelthus Waelthus added this pull request to the merge queue Jul 15, 2024
Merged via the queue into master with commit c2b4fad Jul 15, 2024
10 checks passed
@Waelthus Waelthus deleted the pk1d_covariance_new branch July 15, 2024 16:21
Successfully merging this pull request may close these issues.

Covariance measurement for Pk1d needs improvements