
DL1 writer tool (ctapipe-stage1-process) #1163

Merged · 156 commits merged into cta-observatory:master on May 19, 2020

Conversation

@kosack (Contributor) commented Oct 30, 2019

Overview:

This is a PR to start a discussion, containing a fully working DL1 writer tool that writes a proposed standard DL1 HDF5 format. This is currently all contained in ctapipe/tools/stage1.py.

This PR is provided so we can discuss the general workflow, usage, and data format produced, but please see the caveats below.

Caveats:

Currently, a lot of extra features are included in the stage1.py file that were needed to get everything working properly:

  • a configurable ImageCleaner class
  • a configurable DataChecker class (for performing cuts and tracking statistics)
  • a function to write core metadata
  • some new Container classes, and a few algorithms to fill them
  • some of the functions inside the Stage1Process Tool could also be moved to common places (e.g. the functions that write the SubarrayDescription in HDF5 format could go in a file like ctapipe/io/hdf5.py)

All of these should eventually move to standard places in ctapipe, but for now they are included here so that they can gradually be replaced by separate PRs that put them where needed. I suggest we avoid discussion/review of the implementations of those pieces until their subsequent PRs, but the overall idea of them can be discussed here if needed.

There are currently no unit tests for these features (they will come with the PRs mentioned above).

Ideally, by the time we are ready to merge this PR, it will be simplified down to a bare minimum, with the rest already part of ctapipe.

Usage:

If you want to test running this, you need to check out this branch, run make develop (so that the new executable ctapipe-stage1-process is generated), then run:

> ctapipe-stage1-process --help 

And follow instructions.

You can either specify a lot of things on the command-line, or better yet use a config file similar to this:

{
    "Stage1Process": {
        "config_file": "",
        "output_filename": "lapalma_proton_small.h5",
        "overwrite": true,
        "write_images": true,
        "image_extractor_type": "NeighborPeakWindowSum",
        "image_cleaner_type": "TailcutsImageCleaner"
    },
    "EventSource": {
        "allowed_tels": [],
        "input_url": "~/Data/CTA/Prod3/LaPalmaRefSim/proton_20deg_180deg_run18___cta-prod3-demo-2147m-LaPalma-baseline.simtel.gz",
        "skip_calibration_events": true
    },
    "TailcutsImageCleaner": {
        "boundary_threshold_pe": [
            ["type","*", 5.0],
            ["type", "LST*", 3.0],
            ["type", "MST*", 4.0]
        ],
        "min_picture_neighbors":[
            ["type","*",2]
        ],
        "picture_threshold_pe":  [
                ["type", "*", 10.0],
                ["type", "LST_LST_LSTCam", 6.0],
                ["type", "MST_MST_NectarCam", 6.0],
                ["id", 12, 15.0]
        ]
    },
    "ImageDataChecker": {
        "selection_functions": {
            "enough_pixels": "lambda im: np.count_nonzero(im) > 2",
            "enough_charge": "lambda im: im.sum() > 100"
        }
    },
    "ThresholdGainSelector": {
        "threshold": 4000
    }
}

Currently the config file overrides anything given on the command line (this is how traitlets.config works, but it is not ideal), so it is best to leave anything you may want to change on the command line out of the config file.
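For reference, a possible invocation with such a config file (assuming the standard traitlets --config option is wired up in this tool; check the --help output for the exact flag name):

> ctapipe-stage1-process --config stage1_config.json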

Output:

The output can be viewed using the ViTables GUI, or accessed using the tables (PyTables) Python module for the DL1 data, or Astropy Table or Pandas for the configuration data.

In ViTables it looks like this (if you enable the --write-images option to get the DL1a data as well as the parameters):
[screenshot: the HDF5 file tree as displayed in ViTables]

Example access to the image data:

import tables
import matplotlib.pyplot as plt

f = tables.open_file("lapalma_proton_small.h5")
ims_lst = f.root.dl1.event.telescope.images.tel_001

images = ims_lst.col("image")
images_mc = ims_lst.col("mc_photo_electron_image")

# plot residuals between reconstructed image and monte-carlo charge:
plt.pcolormesh(images - images_mc)

[image: pcolormesh of image residuals]

Here, the x axis is pixel_id and the y axis is the event count.

plt.hist((images - images_mc).ravel(), bins=100)

[image: histogram of image residuals]

@codecov bot commented Oct 30, 2019

Codecov Report

Merging #1163 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1163   +/-   ##
=======================================
  Coverage   91.11%   91.11%           
=======================================
  Files         183      183           
  Lines       12542    12542           
=======================================
  Hits        11428    11428           
  Misses       1114     1114           


@kosack (Contributor, Author) commented Oct 30, 2019

This is related to #554, #1066 and #1059.

@bregeon commented Oct 30, 2019

Just a quick question to better understand the context: how different is "ctapipe-stage1-process" from protopipe's "write_dl1.py"?

@vuillaut (Member) commented:

Just a quick question to better understand the context: how different is "ctapipe-stage1-process" from protopipe's "write_dl1.py"?

The DL1 structure is completely different.

@vuillaut (Member) commented:

Related to #1165.
In the R0/R1/DL0/DL1 containers, we map telescopes per tel_id. Why not keep the same scheme in the DL1 file structure, instead of merging them per telescope_type as is done now?
The files will be produced per telescope anyway, so it would make sense to have one table per tel_id, no?

@kosack (Contributor, Author) commented Oct 30, 2019

Just a quick question to better understand the context: how different is "ctapipe-stage1-process" from protopipe's "write_dl1.py"?

It will replace it in the next major protopipe release. But it's not fully the same thing yet - for example, there is only one cleaning in this version, while in protopipe there are two (one of them keeping only the biggest island), and thus two sets of parameters. So we need to think about how to do that in a nice way (perhaps always have a default set called "parameters/", with any additional ones as new datasets like "parameters_bigisland/").

@kosack (Contributor, Author) commented Oct 30, 2019

In the R0/R1/DL0/DL1 containers, we map telescopes per tel_id

That's for efficiency: the data from each telescope type has the same shape, so why not put it in the same table to avoid the overhead of many datasets? If you want to plot something per telescope, you still have the tel_id index column (and an actual pytables index).

In the lower levels you perhaps treat each telescope separately, but there isn't a good reason to in DL1, since those differences should be calibrated away (except for inter-telescope calibration).

Really, I would like to have all parameters and images in a single table each, but at least for the images the size changes with each telescope type, so that's not possible. For the parameters, the last revision did exactly that (and just allowed you to split by tel_type_id if you wanted - we could go back to that as well).

I guess it depends on whether we want to later do benchmarks by individual telescope, or by type? I would guess most are by type, so it's easier to have them already combined that way.
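For illustration, a minimal sketch of that kind of per-telescope selection with PyTables (the parameters table path here is hypothetical, not the final schema):

import tables

# sketch: select one telescope's rows from a per-type table via the
# tel_id index column (the table path below is a made-up example)
with tables.open_file("lapalma_proton_small.h5") as f:
    params = f.root.dl1.event.telescope.parameters.LST_LST_LSTCam
    tel_1 = params.read_where("tel_id == 1")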

@kosack (Contributor, Author) commented Oct 30, 2019

@vuillaut Though I suppose you are right: in real data, we'd split by tel_id, not type, so perhaps it is easier to do that. It's an easy change, but it then makes plotting e.g. some quantity for each telescope type more annoying, since you have to merge all telescope tables first. But that way, merging files is also not too hard, since a single telescope would write a table called LST_LST_LSTCam, with tel_id=1.

@kosack (Contributor, Author) commented Oct 31, 2019

Maybe I'll make an option to split by telescope type vs telescope id. For training or benchmarking, type is most useful, but for real data, id is better... (but then we have 2 variants of the data model... not sure that is nice either). Perhaps just an API function for easily converting between the two would be ok (e.g. something that merges all the per-telescope tables into a single per-type table on request)

@watsonjj (Contributor) commented:

Maybe I'll make an option to split by telescope type vs telescope id. For training or benchmarking, type is most useful, but for real data, id is better... (but then we have 2 variants of the data model... not sure that is nice either).

I think this would be a bad approach, for exactly the reasons you say.

Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes?

Alternatively, just a high-level method in the DL1Reader to obtain the table for a tel_id might be sufficient.

@kosack (Contributor, Author) commented Oct 31, 2019

Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes?

There is overhead for each table (metadata, etc.), but I'm not sure how much that is. If you think of this in database terms, nobody would ever store multiple tables with the same schema, since that's hard to manage - they would just introduce a new index. But here, since merging would be easier (just linking the telescope datasets into a single file), it may be the right way to go.
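As a sketch of the "merging by linking" idea: PyTables supports HDF5 external links, so a merged file can point at per-telescope tables in separate files without copying data (file and node names below are made up):

import tables

# sketch: link per-telescope tables from separate files into one merged file
with tables.open_file("merged.h5", mode="w") as f:
    f.create_group("/", "dl1")
    f.create_external_link("/dl1", "tel_001", "tel_001.h5:/dl1/telescope_events")
    f.create_external_link("/dl1", "tel_002", "tel_002.h5:/dl1/telescope_events")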

@maxnoe (Member) commented Oct 31, 2019

A third possibility would be to use hdf5 variable length arrays in just a single table.

From the data point of view, I like this the most.

I don't know what the performance impact will be.

@kosack (Contributor, Author) commented Oct 31, 2019

By the way, just to explain future plans: the idea is for this Tool to eventually be structured like this:
[diagram: planned modular structure of the Stage1 tool]

Here, the ImageProcessor is a Component that includes an ImageCleaner and an ImageParameterizer (such that you could also have multiple ImageProcessors if you want, like in the current protopipe, where one uses island cleaning and the other does not).

Currently the Monitoring data stream is not needed, since we are working only with Monte-Carlo, but it will be needed for real data.

It's not quite structured in such a modular way yet, but it's getting there.

@kosack (Contributor, Author) commented Oct 31, 2019

A third possibility would be to use hdf5 variable length arrays in just a single table.

We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc.

@kosack (Contributor, Author) commented Oct 31, 2019

Alternatively just a high level method in the DL1Reader to obtain the table for a telid might be sufficient

Yes, that's what I meant by an API function. I think it may be the right way to go, but it means you can't just use Pandas or PyTables in an easy way - you have to go through another API first, which is somewhat annoying.
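As a rough sketch of what such an API function could look like (the function name, table paths, and per-type layout are all hypothetical):

import pandas as pd

# hypothetical helper: hide the per-type layout behind one call that
# returns a single telescope's parameters (table paths are made up)
def read_telescope_parameters(filename, tel_id, tel_type_tables):
    frames = [
        pd.read_hdf(filename, key=f"/dl1/event/telescope/parameters/{name}")
        for name in tel_type_tables
    ]
    combined = pd.concat(frames, ignore_index=True)
    return combined[combined["tel_id"] == tel_id]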

@maxnoe (Member) commented Oct 31, 2019

We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc.

Reading arrays with pandas is always bad; it doesn't matter whether the rows have the same shape or different lengths.

This is what I tried:

import tables
import numpy as np
from tables import IsDescription, UInt64Col, UInt16Col


class DL1(IsDescription):
    array_event_id = UInt64Col()
    telescope_id = UInt16Col()


telescope_events = [
    (1, 1, np.random.normal(size=1440)),
    (1, 2, np.random.normal(size=1039)),
    (1, 3, np.random.normal(size=1855)),
    (2, 3, np.random.normal(size=1855)),
    (2, 4, np.random.normal(size=1855)),
    (3, 1, np.random.normal(size=1440)),
] 


with tables.open_file('test.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    # fixed-shape metadata table, one row per telescope event
    table = f.create_table('/dl1', 'telescope_events', DL1)

    # parallel variable-length array holding one image per row
    images = f.create_vlarray('/dl1', 'images', tables.Float32Atom())

    for event_id, tel_id, img in telescope_events:
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row.append()
        images.append(img)

@maxnoe (Member) commented Oct 31, 2019

This has the advantage that you can read the data that has the same shape for all telescopes nicely using pandas, and for the images you get a list of numpy arrays, which seems not that bad compared to multiple tables.
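For completeness, a sketch of reading that example file back (file and node names from the snippet above): the fixed-shape columns can go straight into a DataFrame, while slicing the VLArray yields a list of variable-length numpy arrays.

import tables
import pandas as pd

# read back the example file written above
with tables.open_file("test.hdf5") as f:
    events = f.root.dl1.telescope_events
    meta = pd.DataFrame({
        "array_event_id": events.col("array_event_id"),
        "telescope_id": events.col("telescope_id"),
    })
    images = f.root.dl1.images[:]  # list of 1D float32 arrays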

@maxnoe (Member) commented Oct 31, 2019

For me, reading the variable-length images is a little faster than reading from three same-shape tables:

In [1]: import h5py

In [2]: f = h5py.File('./vlarray.hdf5')

In [3]: %timeit images = f['dl1/images'][:]
3.02 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: f = h5py.File('tables.hdf5')

In [5]: %timeit images = [f[f'dl1/type_{i}']['image'][:] for i in range(3)]
3.91 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The variable-length array file is slightly larger; I don't know why.

Here is the code:

import tables
import numpy as np
from tables import IsDescription, UInt64Col, UInt16Col, Float32Col

# create some pseudo data
np.random.seed(0)

num_pixels = [1855, 1039, 2048]
telescope_types = [0] * 4 + [1] * 10 + [2] * 20
telescope_ids = np.arange(len(telescope_types))
n_events = 100


telescope_events = []
for event_id in range(n_events):
    n_telescopes = min(2 + np.random.poisson(10), len(telescope_ids))
    triggered = np.random.choice(telescope_ids, n_telescopes, replace=False)

    for tel_id in triggered:
        size = num_pixels[telescope_types[tel_id]]
        telescope_events.append((
            event_id, tel_id, np.random.normal(size=size)
        ))


# write out as one table with metadata and one variable length array column
class DL1(IsDescription):
    array_event_id = UInt64Col()
    telescope_id = UInt16Col()


with tables.open_file('vlarray.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    table = f.create_table('/dl1', 'telescope_events', DL1)

    images = f.create_vlarray('/dl1', 'images', tables.Float32Atom())

    for event_id, tel_id, img in telescope_events:
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row.append()
        images.append(img)


# write out as same-shape tables for each telescope type
def description(num_pixel):
    class DL1(IsDescription):
        array_event_id = UInt64Col()
        telescope_id = UInt16Col()
        image = Float32Col(shape=(num_pixel,))

    return DL1


with tables.open_file('tables.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    # one table per telescope type; named dict avoids shadowing the tables module
    type_tables = {}

    for event_id, tel_id, img in telescope_events:
        tel_type = telescope_types[tel_id]
        if tel_type not in type_tables:
            desc = description(num_pixels[tel_type])
            type_tables[tel_type] = f.create_table('/dl1', f'type_{tel_type}', desc)

        table = type_tables[tel_type]
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row['image'] = img
        row.append()

@vuillaut (Member) commented:

Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes?

There is overhead for each table (metadata, etc.), but I'm not sure how much that is. If you think of this in database terms, nobody would ever store multiple tables with the same schema, since that's hard to manage - they would just introduce a new index. But here, since merging would be easier (just linking the telescope datasets into a single file), it may be the right way to go.

Trying to compare the advantages and drawbacks of both approaches, here is what I can come up with:

  • Per tel_type
    • Advantages:
      • easy plotting and benchmarks across same telescope type
      • easy training for one telescope type
    • Drawbacks:
      • different data model between R1/DL0 and DL1
      • when processing tables from R1/DL0 (per tel_id), we won't be able to do that in parallel per table since that would be different processes writing in the same output.
  • Per tel_id
    • Advantages:
      • consistent data model from R1 to DL1
      • efficient analysis and writing
    • Drawbacks:
      • a bit more annoying benchmark and training per telescope type.
      • more metadata

In the per-tel_id case, getting a table of a whole tel_type is not that difficult; if you have a mapping of tel_ids per tel_type, it is two lines instead of one:

tel_ids = table_of_tel_ids[tel_type]
df = pd.concat([pd.read_hdf(filename, key=f'path_to_{tel_id}') for tel_id in tel_ids])

(something similar with pytables)

Of course, I like loading all my parameters at once, but I am not sure this outweighs the drawbacks.

@vuillaut (Member) commented:

A third possibility would be to use hdf5 variable length arrays in just a single table.

We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc.

There is also the compression issue. It seems varlen arrays are not (well) compressible.
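For context, a minimal sketch of enabling compression in PyTables; as far as I know, HDF5 stores variable-length payloads outside the chunked data path, so a compression filter effectively only touches the VLArray's index, not the image data itself:

import numpy as np
import tables

filters = tables.Filters(complevel=5, complib="blosc")
with tables.open_file("compression_test.h5", mode="w") as f:
    # fixed-shape array: chunked storage, so the filter compresses the data
    f.create_carray("/", "fixed",
                    obj=np.random.normal(size=(100, 1855)).astype("float32"),
                    filters=filters)
    # variable-length array: the payload bypasses the compression filter
    vla = f.create_vlarray("/", "varlen", tables.Float32Atom(), filters=filters)
    for _ in range(100):
        vla.append(np.random.normal(size=1855).astype("float32"))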

@vuillaut (Member) commented:

The variable-length array file is slightly larger; I don't know why.

That might well be compression.

@maxnoe (Member) commented Oct 31, 2019

That might well be compression.

I did not enable any compression for either.

@maxnoe requested a review from @bregeon on May 14, 2020
@bregeon previously approved these changes on May 14, 2020
@maxnoe (Member) commented May 18, 2020

@kosack I'm working on a fix for the extractor issue

@maxnoe requested a review from @vuillaut on May 18, 2020
@vuillaut (Member) left a review:

Hi.

The data model looks great. I actually don't have any more comments on it (we know it will evolve in the future, but I think it encompasses all the exchanges we had).

I have a couple of suggestions for improvements, but I don't think they are show-stoppers.

I will continue digging today.

"""
Generate DL1 (a or b) output files in HDF5 format from {R0,R1,DL0} inputs.

# TODO: add event time per telescope!
Member: still TODO?

)

# convert all values to strings, since hdf5 can't handle Times, etc.:
# TODO: add activity_stop_time?
Member: still TODO?

Member: no, done

Comment on lines +131 to +132
The config file should be in JSON or python format (see traitlets docs). For an
example, see ctapipe/examples/stage1_config.json in the main code repo.
Member: Note for later: getting this config file from the online Tool would be nice for users.

Member: What do you mean? Exporting the used config? That is supported by the Tool itself, isn't it?

Member: I mean having a way to dump the exhaustive config file without having to download the (non-exhaustive?) example from the repository. It's a proposed improvement, nothing blocking of course.

ctapipe/tools/stage1.py (resolved)
ctapipe/tools/stage1.py (resolved)
image_criteria = self.check_image(image_selected)
self.log.debug(
    "image_criteria: %s",
    list(zip(self.check_image.criteria_names[1:], image_criteria)),
Member: would it be interesting to write the criteria status, event by event, as columns in the DL1 parameters table? Overkill?

@vuillaut (Member) commented:

There is also an important discrepancy between true and reco parameters. Do you have an idea why?

[image: comparison of true vs. reconstructed parameter distributions]

@maxnoe (Member) commented May 19, 2020

This looks like a combination of noise and cleaning. Looking through some of the images, I noticed that the showers are much larger at sub-cleaning-threshold values (long tails with 1-2 pe per pixel).

That fits with most parameters here:

  • width/length gets smaller; the effect is stronger for width
  • intensity gets smaller (more pronounced for small showers, where a larger proportion of the light is in dim pixels)
  • concentration gets larger (more light in fewer pixels)
  • concentration core gets smaller (because the core is smaller, as per the first two points)

We could apply the same cleaning to the true image before calculating the true parameters, but that would be problematic when using a cleaning method that relies on time, as we don't have peak time for the true image right now.
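A minimal sketch of that first idea, using ctapipe's tailcuts cleaning on the noise-free image (geom, true_image, and the threshold values are placeholders; this is not what the PR currently does):

from ctapipe.image import tailcuts_clean, hillas_parameters

# sketch: clean the true image with the same tailcuts settings before
# parametrizing it (geom, true_image and thresholds are placeholders)
mask = tailcuts_clean(geom, true_image, picture_thresh=10, boundary_thresh=5)
cleaned = true_image.copy()
cleaned[~mask] = 0  # zero out pixels that fail the cleaning
true_params = hillas_parameters(geom, cleaned)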

@kosack (Contributor, Author) commented May 19, 2020

Yes, you probably have to plot those in bins of total intensity to better understand the distributions. The true_hillas parameters are just the ones computed from the noise-free image, so at low intensity, you're dominated by noise. Since this is a power-law spectrum, the plots you make are mostly dominated by low-energy/low-intensity images. I don't think this is a problem with the code itself.

@vuillaut (Member) commented May 19, 2020

We could apply the same cleaning to the true image before calculating the true parameters, but that would be problematic when using a cleaning method that relies on time, as we don't have peak time for the true image right now.

That would be a solution indeed, but we could also argue that the signal extraction (and thus the cleaning) is part of the benchmark. There is no wrong way to look at it; we just need to keep it in mind.

@vuillaut (Member) commented May 19, 2020

Ok, I will approve since I don't have any major/blocking feedback, just a couple of minor points that can be addressed here or later, as you prefer.
Maybe just remove the TODO if there is nothing TODO anymore ;-)
@maxnoe and @kosack, congrats on the massive work - the DL1 output looks really great.

@kosack merged commit 7e96382 into cta-observatory:master on May 19, 2020
@maxnoe (Member) commented May 19, 2020

Woohoo
