
DL1 writer tool (ctapipe-stage1-process) #1163

Merged · 156 commits merged into cta-observatory:master on May 19, 2020

Conversation

@kosack (Contributor) commented Oct 30, 2019

Overview:

This is a PR to start a discussion, containing a fully working DL1 writer tool that writes a proposed standard DL1 HDF5 format. This is currently all contained in ctapipe/tools/stage1.py.

This PR is provided so we can discuss the general workflow, usage, and data format produced, but please see the caveats below.

Caveats:

Currently, a lot of extra features are included in the stage1.py file that were needed to get everything working properly:

  • a configurable ImageCleaner class
  • a configurable DataChecker class (for performing cuts and tracking statistics)
  • a function to write core metadata
  • some new Container classes, and a few algorithms to fill them
  • some of the functions inside the Stage1Process Tool could also be moved to common places (e.g. the functions that write the SubarrayDescription in HDF5 format could go in a file like ctapipe/io/hdf5.py)

All of these should eventually move to standard places in ctapipe, but for now they are included here so that they can gradually be replaced by separate PRs that put them where needed. I suggest we avoid discussion/review of the implementations of those pieces until their subsequent PRs, but the overall idea of them can be discussed here if needed.

There are currently no unit tests for these features (they will come with the PRs mentioned above).

Ideally, by the time we are ready to merge this PR, it will be simplified down to a bare minimum, with the rest already part of ctapipe.

Usage:

If you want to test running this, you need to check out this branch, run make develop (so that the new executable ctapipe-stage1-process is generated), then run:

> ctapipe-stage1-process --help 

And follow instructions.

You can either specify a lot of things on the command-line, or better yet use a config file similar to this:

{
    "Stage1Process": {
        "config_file": "",
        "output_filename": "lapalma_proton_small.h5",
        "overwrite": true,
        "write_images": true,
        "image_extractor_type": "NeighborPeakWindowSum",
        "image_cleaner_type": "TailcutsImageCleaner"
    },
    "EventSource": {
        "allowed_tels": [],
        "input_url": "~/Data/CTA/Prod3/LaPalmaRefSim/proton_20deg_180deg_run18___cta-prod3-demo-2147m-LaPalma-baseline.simtel.gz",
        "skip_calibration_events": true
    },
    "TailcutsImageCleaner": {
        "boundary_threshold_pe": [
            ["type","*", 5.0],
            ["type", "LST*", 3.0],
            ["type", "MST*", 4.0]
        ],
        "min_picture_neighbors":[
            ["type","*",2]
        ],
        "picture_threshold_pe":  [
                ["type", "*", 10.0],
                ["type", "LST_LST_LSTCam", 6.0],
                ["type", "MST_MST_NectarCam", 6.0],
                ["id", 12, 15.0]
        ]
    },
    "ImageDataChecker": {
        "selection_functions": {
            "enough_pixels": "lambda im: np.count_nonzero(im) > 2",
            "enough_charge": "lambda im: im.sum() > 100"
        }
    },
    "ThresholdGainSelector": {
        "threshold": 4000
    }
}

Currently the config file overrides anything given on the command line (this is how traitlets.config works, but it is not ideal), so it is best to leave anything you may want to change on the command line out of the config file.
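For reference, a possible invocation with such a config file (assuming the standard traitlets --config option is wired up in this tool; check the --help output for the exact flag name):

> ctapipe-stage1-process --config stage1_config.json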

Output:

The output can be viewed using the ViTables GUI, or accessed using the tables (PyTables) Python module for the DL1 data, or Astropy Table or Pandas for the configuration data.

In ViTables it looks like this (if you enable the --write-images option to get the DL1a data as well as the parameters):
[screenshot: the HDF5 file tree as displayed in ViTables]

Example access to the image data:

import tables
import matplotlib.pyplot as plt

f = tables.open_file("lapalma_proton_small.h5")
ims_lst = f.root.dl1.event.telescope.images.tel_001

images = ims_lst.col("image")
images_mc = ims_lst.col("mc_photo_electron_image")

# plot residuals between reconstructed image and monte-carlo charge:
plt.pcolormesh(images - images_mc)

[image: pcolormesh of image residuals]

Here, the x axis is pixel_id and the y axis is the event count.

plt.hist((images - images_mc).ravel(), bins=100)

[image: histogram of image residuals]

@codecov bot commented Oct 30, 2019

Codecov Report

Merging #1163 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1163   +/-   ##
=======================================
  Coverage   91.11%   91.11%           
=======================================
  Files         183      183           
  Lines       12542    12542           
=======================================
  Hits        11428    11428           
  Misses       1114     1114           


@kosack (Contributor, Author) commented Oct 30, 2019

This is related to #554, #1066 and #1059.

@bregeon commented Oct 30, 2019

Just a quick question to better understand the context: how different is "ctapipe-stage1-process" from protopipe's "write_dl1.py"?

@vuillaut (Member) commented:

Just a quick question to better understand the context: how different is "ctapipe-stage1-process" from protopipe's "write_dl1.py"?

The DL1 structure is completely different.

@vuillaut (Member) commented:

Related to #1165.
In the R0/R1/DL0/DL1 containers, we map telescopes per tel_id. Why not keep the same scheme in the DL1 file structure, instead of merging them per telescope_type as is done now?
The files will be produced per telescope anyway, so it would make sense to have one table per tel_id, no?

@kosack (Contributor, Author) commented Oct 30, 2019

Just a quick question to better understand the context: how different is "ctapipe-stage1-process" from protopipe's "write_dl1.py"?

It will replace it in the next major protopipe release. But it's not fully the same thing yet - for example, there is only one cleaning in this version, while in protopipe there are two (one of them keeping only the biggest island), and thus two sets of parameters. So we need to think about how to do that in a nice way (perhaps always have a default set called "parameters/", with any additional ones as new datasets like "parameters_bigisland/").

@kosack (Contributor, Author) commented Oct 30, 2019

In the R0/R1/DL0/DL1 containers, we map telescopes per tel_id

That's for efficiency: the data from each telescope type has the same shape, so why not put it in the same table to avoid the overhead of many datasets? If you want to plot something per telescope, you still have the tel_id index column (and an actual pytables index).

In the lower levels you perhaps treat each telescope separately, but there isn't a good reason to in DL1, since those differences should be calibrated away (except for inter-telescope calibration).

Really, I would like to have all parameters and images in a single table each, but at least for the images the size changes with each telescope type, so that's not possible. For the parameters, the last revision did exactly that (and just allowed you to split by tel_type_id if you wanted - we could go back to that as well).

I guess it depends on whether we want to later do benchmarks by individual telescope, or by type? I would guess most are by type, so it's easier to have them already combined that way.
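For illustration, a minimal sketch of that kind of per-telescope selection with PyTables (the parameters table path here is hypothetical, not the final schema):

import tables

# sketch: select one telescope's rows from a per-type table via the
# tel_id index column (the table path below is a made-up example)
with tables.open_file("lapalma_proton_small.h5") as f:
    params = f.root.dl1.event.telescope.parameters.LST_LST_LSTCam
    tel_1 = params.read_where("tel_id == 1")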

@kosack (Contributor, Author) commented Oct 30, 2019

@vuillaut Though I suppose you are right: in real data, we'd split by tel_id, not type, so perhaps it is easier to do that. It's an easy change, but it then makes plotting e.g. some quantity for each telescope type more annoying, since you have to merge all telescope tables first. But that way, merging files is also not too hard, since a single telescope would write a table called LST_LST_LSTCam, with tel_id=1.

@kosack (Contributor, Author) commented Oct 31, 2019

Maybe I'll make an option to split by telescope type vs telescope id. For training or benchmarking, type is most useful, but for real data, id is better... (but then we have 2 variants of the data model... not sure that is nice either). Perhaps just an API function for easily converting between the two would be ok (e.g. something that merges all the per-telescope tables into a single per-type table on request)

@watsonjj (Contributor) commented:

Maybe I'll make an option to split by telescope type vs telescope id. For training or benchmarking, type is most useful, but for real data, id is better... (but then we have 2 variants of the data model... not sure that is nice either).

I think this would be a bad approach, for exactly the reasons you say.

Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes?

Alternatively, just a high-level method in the DL1Reader to obtain the table for a tel_id might be sufficient.

@kosack (Contributor, Author) commented Oct 31, 2019

Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes?

There is overhead for each table (metadata, etc.), but I'm not sure how much that is. If you think of this in database terms, nobody would ever store multiple tables with the same schema, since that's hard to manage - they would just introduce a new index. But here, since merging would be easier (just linking the telescope datasets into a single file), it may be the right way to go.
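As a sketch of the "merging by linking" idea: PyTables supports HDF5 external links, so a merged file can point at per-telescope tables in separate files without copying data (file and node names below are made up):

import tables

# sketch: link per-telescope tables from separate files into one merged file
with tables.open_file("merged.h5", mode="w") as f:
    f.create_group("/", "dl1")
    f.create_external_link("/dl1", "tel_001", "tel_001.h5:/dl1/telescope_events")
    f.create_external_link("/dl1", "tel_002", "tel_002.h5:/dl1/telescope_events")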

@maxnoe (Member) commented Oct 31, 2019

A third possibility would be to use hdf5 variable length arrays in just a single table.

From the data point of view, I like this the most.

I don't know what the performance impact will be.

@kosack (Contributor, Author) commented Oct 31, 2019

By the way, just to explain future plans: the idea is for this Tool to eventually be structured like this:
[diagram: planned modular structure of the Stage1 tool]

Here, the ImageProcessor is a Component that includes an ImageCleaner and an ImageParameterizer (such that you could also have multiple ImageProcessors if you want, like in the current protopipe, where one uses island cleaning and the other does not).

Currently the Monitoring data stream is not needed, since we are working only with Monte-Carlo, but it will be needed for real data.

It's not quite structured in such a modular way yet, but it's getting there.

@kosack (Contributor, Author) commented Oct 31, 2019

A third possibility would be to use hdf5 variable length arrays in just a single table.

We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc.

@kosack (Contributor, Author) commented Oct 31, 2019

Alternatively just a high level method in the DL1Reader to obtain the table for a telid might be sufficient

Yes, that's what I meant by an API function. I think it may be the right way to go, but it means you can't just use Pandas or PyTables in an easy way - you have to go through another API first, which is somewhat annoying.
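As a rough sketch of what such an API function could look like (the function name, table paths, and per-type layout are all hypothetical):

import pandas as pd

# hypothetical helper: hide the per-type layout behind one call that
# returns a single telescope's parameters (table paths are made up)
def read_telescope_parameters(filename, tel_id, tel_type_tables):
    frames = [
        pd.read_hdf(filename, key=f"/dl1/event/telescope/parameters/{name}")
        for name in tel_type_tables
    ]
    combined = pd.concat(frames, ignore_index=True)
    return combined[combined["tel_id"] == tel_id]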

@maxnoe (Member) commented Oct 31, 2019

We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc.

Reading arrays with pandas is always bad; it doesn't matter whether the rows have the same shape or different lengths.

This is what I tried:

import tables
import numpy as np
from tables import IsDescription, UInt64Col, UInt16Col


class DL1(IsDescription):
    array_event_id = UInt64Col()
    telescope_id = UInt16Col()


telescope_events = [
    (1, 1, np.random.normal(size=1440)),
    (1, 2, np.random.normal(size=1039)),
    (1, 3, np.random.normal(size=1855)),
    (2, 3, np.random.normal(size=1855)),
    (2, 4, np.random.normal(size=1855)),
    (3, 1, np.random.normal(size=1440)),
] 


with tables.open_file('test.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    # fixed-shape metadata table, one row per telescope event
    table = f.create_table('/dl1', 'telescope_events', DL1)

    # parallel variable-length array holding one image per row
    images = f.create_vlarray('/dl1', 'images', tables.Float32Atom())

    for event_id, tel_id, img in telescope_events:
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row.append()
        images.append(img)

@maxnoe (Member) commented Oct 31, 2019

This has the advantage that you can read the data that has the same shape for all telescopes nicely using pandas, and for the images you get a list of numpy arrays, which seems not that bad compared to multiple tables.
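For completeness, a sketch of reading that example file back (file and node names from the snippet above): the fixed-shape columns can go straight into a DataFrame, while slicing the VLArray yields a list of variable-length numpy arrays.

import tables
import pandas as pd

# read back the example file written above
with tables.open_file("test.hdf5") as f:
    events = f.root.dl1.telescope_events
    meta = pd.DataFrame({
        "array_event_id": events.col("array_event_id"),
        "telescope_id": events.col("telescope_id"),
    })
    images = f.root.dl1.images[:]  # list of 1D float32 arrays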

@maxnoe (Member) commented Oct 31, 2019

For me, reading the variable-length images is a little faster than reading from three same-shape tables:

In [1]: import h5py

In [2]: f = h5py.File('./vlarray.hdf5')

In [3]: %timeit images = f['dl1/images'][:]
3.02 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: f = h5py.File('tables.hdf5')

In [5]: %timeit images = [f[f'dl1/type_{i}']['image'][:] for i in range(3)]
3.91 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The variable-length array file is slightly larger; I don't know why.

Here is the code:

import tables
import numpy as np
from tables import IsDescription, UInt64Col, UInt16Col, Float32Col

# create some pseudo data
np.random.seed(0)

num_pixels = [1855, 1039, 2048]
telescope_types = [0] * 4 + [1] * 10 + [2] * 20
telescope_ids = np.arange(len(telescope_types))
n_events = 100


telescope_events = []
for event_id in range(n_events):
    n_telescopes = min(2 + np.random.poisson(10), len(telescope_ids))
    triggered = np.random.choice(telescope_ids, n_telescopes, replace=False)

    for tel_id in triggered:
        size = num_pixels[telescope_types[tel_id]]
        telescope_events.append((
            event_id, tel_id, np.random.normal(size=size)
        ))


# write out as one table with metadata and one variable length array column
class DL1(IsDescription):
    array_event_id = UInt64Col()
    telescope_id = UInt16Col()


with tables.open_file('vlarray.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    table = f.create_table('/dl1', 'telescope_events', DL1)

    images = f.create_vlarray('/dl1', 'images', tables.Float32Atom())

    for event_id, tel_id, img in telescope_events:
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row.append()
        images.append(img)


# write out as same-shape tables for each telescope type
def description(num_pixel):
    class DL1(IsDescription):
        array_event_id = UInt64Col()
        telescope_id = UInt16Col()
        image = Float32Col(shape=(num_pixel,))

    return DL1


with tables.open_file('tables.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    # one table per telescope type; named dict avoids shadowing the tables module
    type_tables = {}

    for event_id, tel_id, img in telescope_events:
        tel_type = telescope_types[tel_id]
        if tel_type not in type_tables:
            desc = description(num_pixels[tel_type])
            type_tables[tel_type] = f.create_table('/dl1', f'type_{tel_type}', desc)

        table = type_tables[tel_type]
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row['image'] = img
        row.append()

@vuillaut (Member) commented:

Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes?

There is overhead for each table (metadata, etc.), but I'm not sure how much that is. If you think of this in database terms, nobody would ever store multiple tables with the same schema, since that's hard to manage - they would just introduce a new index. But here, since merging would be easier (just linking the telescope datasets into a single file), it may be the right way to go.

Trying to compare the advantages and drawbacks of both approaches, here is what I can come up with:

  • Per tel_type
    • Advantages:
      • easy plotting and benchmarks across same telescope type
      • easy training for one telescope type
    • Drawbacks:
      • different data model between R1/DL0 and DL1
      • when processing tables from R1/DL0 (per tel_id), we won't be able to do that in parallel per table since that would be different processes writing in the same output.
  • Per tel_id
    • Advantages:
      • consistent data model from R1 to DL1
      • efficient analysis and writing
    • Drawbacks:
      • a bit more annoying benchmark and training per telescope type.
      • more metadata

In the per-tel_id case, getting a table of a whole tel_type is not that difficult; if you have a mapping of tel_ids per tel_type, it is two lines instead of one:

tel_ids = table_of_tel_ids[tel_type]
df = pd.concat([pd.read_hdf(filename, key=f'path_to_{tel_id}') for tel_id in tel_ids])

(something similar with pytables)

Of course, I like loading all my parameters at once, but I am not sure this outweighs the drawbacks.

@vuillaut (Member) commented:

A third possibility would be to use hdf5 variable length arrays in just a single table.

We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc.

There is also the compression issue. It seems varlen arrays are not (well) compressible.
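For context, a minimal sketch of enabling compression in PyTables; as far as I know, HDF5 stores variable-length payloads outside the chunked data path, so a compression filter effectively only touches the VLArray's index, not the image data itself:

import numpy as np
import tables

filters = tables.Filters(complevel=5, complib="blosc")
with tables.open_file("compression_test.h5", mode="w") as f:
    # fixed-shape array: chunked storage, so the filter compresses the data
    f.create_carray("/", "fixed",
                    obj=np.random.normal(size=(100, 1855)).astype("float32"),
                    filters=filters)
    # variable-length array: the payload bypasses the compression filter
    vla = f.create_vlarray("/", "varlen", tables.Float32Atom(), filters=filters)
    for _ in range(100):
        vla.append(np.random.normal(size=1855).astype("float32"))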

@vuillaut (Member) commented:

The variable-length array file is slightly larger; I don't know why.

That might well be compression.

@maxnoe (Member) commented Oct 31, 2019

That might well be compression.

I did not enable any compression for either.

@maxnoe requested a review from @bregeon on May 14, 2020
@bregeon previously approved these changes on May 14, 2020
@maxnoe (Member) commented May 18, 2020

@kosack I'm working on a fix for the extractor issue

@maxnoe requested a review from @vuillaut on May 18, 2020
@vuillaut (Member) left a review:

Hi.

The data model looks great. I actually don't have any more comments on it (we know it will evolve in the future, but I think it encompasses all the exchanges we had).

I have a couple of suggestions for improvements, but I don't think they are show-stoppers.

I will continue digging today.

"""
Generate DL1 (a or b) output files in HDF5 format from {R0,R1,DL0} inputs.

# TODO: add event time per telescope!
Member: still TODO?

)

# convert all values to strings, since hdf5 can't handle Times, etc.:
# TODO: add activity_stop_time?
Member: still TODO?

Member: no, done

Comment on lines +131 to +132
The config file should be in JSON or python format (see traitlets docs). For an
example, see ctapipe/examples/stage1_config.json in the main code repo.
Member: Note for later: getting this config file from the online Tool would be nice for users.

Member: What do you mean? Exporting the used config? That is supported by the Tool itself, isn't it?

Member: I mean having a way to dump the exhaustive config file without having to download the (non-exhaustive?) example from the repository. It's a proposed improvement, nothing blocking of course.

ctapipe/tools/stage1.py (resolved)
ctapipe/tools/stage1.py (resolved)
image_criteria = self.check_image(image_selected)
self.log.debug(
    "image_criteria: %s",
    list(zip(self.check_image.criteria_names[1:], image_criteria)),
Member: would it be interesting to write the criteria status, event by event, as columns in the DL1 parameters table? Overkill?

@vuillaut (Member) commented:

There is also an important discrepancy between true and reco parameters. Do you have an idea why?

[image: comparison of true vs. reconstructed parameter distributions]

@maxnoe (Member) commented May 19, 2020

This looks like a combination of noise and cleaning. Looking through some of the images, I noticed that the showers are much larger at sub-cleaning-threshold values (long tails with 1-2 pe per pixel).

That fits with most parameters here:

  • width/length gets smaller; the effect is stronger for width
  • intensity gets smaller (more pronounced for small showers, where a larger proportion of the light is in dim pixels)
  • concentration gets larger (more light in fewer pixels)
  • concentration core gets smaller (because the core is smaller, as per the first two points)

We could apply the same cleaning to the true image before calculating the true parameters, but that would be problematic when using a cleaning method that relies on time, as we don't have peak time for the true image right now.
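A minimal sketch of that first idea, using ctapipe's tailcuts cleaning on the noise-free image (geom, true_image, and the threshold values are placeholders; this is not what the PR currently does):

from ctapipe.image import tailcuts_clean, hillas_parameters

# sketch: clean the true image with the same tailcuts settings before
# parametrizing it (geom, true_image and thresholds are placeholders)
mask = tailcuts_clean(geom, true_image, picture_thresh=10, boundary_thresh=5)
cleaned = true_image.copy()
cleaned[~mask] = 0  # zero out pixels that fail the cleaning
true_params = hillas_parameters(geom, cleaned)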

@kosack (Contributor, Author) commented May 19, 2020

Yes, you probably have to plot those in bins of total intensity to better understand the distributions. The true_hillas parameters are just the ones computed from the noise-free image, so at low intensity, you're dominated by noise. Since this is a power-law spectrum, the plots you make are mostly dominated by low-energy/low-intensity images. I don't think this is a problem with the code itself.

@vuillaut (Member) commented May 19, 2020

We could apply the same cleaning to the true image before calculating the true parameters, but that would be problematic when using a cleaning method that relies on time, as we don't have peak time for the true image right now.

That would be a solution indeed, but we could also argue that the signal extraction (and thus the cleaning) is part of the benchmark. There is no wrong way to look at it; we just need to keep it in mind.

@vuillaut (Member) commented May 19, 2020

Ok, I will approve since I don't have any major/blocking feedback, just a couple of minor points that can be addressed here or later, as you prefer.
Maybe just remove the TODO if there is nothing TODO anymore ;-)
@maxnoe and @kosack, congrats on the massive work - the DL1 output looks really great.

@kosack merged commit 7e96382 into cta-observatory:master on May 19, 2020
@maxnoe (Member) commented May 19, 2020

Woohoo
