
Using new parquet in train #104 (Draft)

wants to merge 27 commits into main

Conversation

@cbutsko commented Sep 10, 2024

No description provided.

@kvantricht (Contributor) left a comment:

Good work! I've already added some comments. The process_parquet function is something that's hard for me to read just as code. Let's start by testing it thoroughly.

paper_eval.py (outdated, resolved)
paper_eval.py (outdated, resolved)
if (valid_position < cls.NUM_TIMESTEPS // 2) or (
valid_position > (available_timesteps - cls.NUM_TIMESTEPS // 2)
):
augment = True
Contributor:

In this case we silently fall into augmentation, which might not be what we want. Can't we put this logic inside `if not augment`, and in that case not choose valid_position as the center point but rather the point that keeps valid_position as close as possible to the center? Then it's always deterministic and we don't have to force going through the augmentation part.
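A minimal sketch of that deterministic alternative (a hypothetical helper, not the actual implementation; `NUM_TIMESTEPS = 12` is assumed here):

```python
from typing import List

NUM_TIMESTEPS = 12  # assumed window length, as on the dataset class


def deterministic_window(valid_position: int, available_timesteps: int) -> List[int]:
    """Pick a NUM_TIMESTEPS window that keeps valid_position as close as
    possible to the center, clamping at the series edges instead of
    falling through to augmentation."""
    half = NUM_TIMESTEPS // 2
    # ideal start centers valid_position; clamp it into the admissible range
    start = min(max(valid_position - half, 0), available_timesteps - NUM_TIMESTEPS)
    return list(range(start, start + NUM_TIMESTEPS))
```

When valid_position sits near an edge, the window simply slides until it fits, so the result is always deterministic.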

Author:

Good point. I tried to rewrite it here: b2f1aa3

# make sure that month for encoding gets shifted according to
# the selected timestep positions
month = (
pd.to_datetime(row_d["start_date"]) + pd.DateOffset(months=timestep_positions[0])
Contributor:

You could also just add timestep_positions[0] and avoid the overhead of the pd.DateOffset method. I don't think we're resilient in any case to situations where we're working with dekadal data. This is a potential risk if someone (including ourselves) goes down the dekadal track. Can we make it more universal here? Or, if needed for now, do a check and raise an error, so nobody blindly uses this while it won't work as expected.
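One way to make it more universal could look like this (a sketch; the `freq` parameter and its values are assumptions, not existing code):

```python
import pandas as pd


def shifted_month(start_date: str, first_timestep: int, freq: str = "month") -> int:
    """Shift start_date by first_timestep steps at the given temporal
    resolution and return the resulting 0-indexed calendar month."""
    start = pd.to_datetime(start_date)
    if freq == "month":
        shifted = start + pd.DateOffset(months=first_timestep)
    elif freq == "dekad":
        shifted = start + pd.Timedelta(days=10 * first_timestep)
    else:
        # fail loudly instead of silently mis-shifting unknown resolutions
        raise NotImplementedError(f"Unsupported temporal resolution: {freq}")
    return shifted.month - 1
```

For monthly data this reduces to adding `timestep_positions[0]` months; for dekadal data the shift is done in days and only then converted to a month.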

Author:

Good point. I tried to account for that here: ad05b3d
We will also need to think about renaming this month variable (since it can represent something else too), and about making sure that the other relative timestep positions (valid_position and timestep_ind) are computed not as months but in a more generic fashion.

Contributor:

You mean this has to be tackled in process_parquet?

Contributor:

I am not sure I understand what initial_start_date_position actually means and why it cannot just be the month inferred from start_date.

@cbutsko (Author) Sep 11, 2024:

Well, maybe I'm overthinking it...

Here's an example:

  1. We are working on a monthly basis, with start_date in October (hence, month = 10),
    and we want to shift it 4 timesteps forward. We can just compute (10 + 4), adding the modulo part % cls.NUM_TIMESTEPS so we don't get a bad month value.

  2. We are working on a dekadal basis with the same start date. I assume that our timestep indices are then not month chunks but 10-day intervals, so we have observations for ts0, ts1, ..., ts45 (for example). Our NUM_TIMESTEPS variable should be set to 36 instead of 12 (like Giorgia did). The valid_date should also translate into a 10-day chunk instead of a month, so that we can select 36 timesteps around the valid_position.
    Now our timestep_positions are returned in dekadal steps, so adding 4 dekadal steps to a month value doesn't make sense. We need to add apples to apples to get the date that accounts for the shift, and only then take its month and pass it to Presto. I realize now that this particular step is missing in my implementation.

Am I missing something?

Contributor:

Following most of it. But still, start_date is a real date, from which we can infer what would normally be the start month fed to Presto. Why do we need to translate it to a position? I agree that whatever we add to it due to the shift should account for the time "resolution". valid_position should be the translation of valid_date (which is irrespective of time resolution) into a position in the timesteps (which depends on the time resolution). I might be overthinking it just as much.

Author:

Okay, maybe something like this can work:

step_converted_to_month = np.ceil(timestep_positions[0] * (365 // NUM_TIMESTEPS) / 30)
month = (pd.to_datetime(start_date).month - 1 + step_converted_to_month) % 12

It can be a little imprecise in some cases, but never more than one month off.
Since it's just a one-liner, we can probably sacrifice that little bit of precision.
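A quick sanity check of that approximation against exact date arithmetic (a sketch assuming dekadal data with NUM_TIMESTEPS = 36 and 0-indexed months):

```python
import numpy as np
import pandas as pd

NUM_TIMESTEPS = 36  # dekadal assumption


def approx_month(start_date: str, first_step: int) -> int:
    # the proposed one-liner: convert a timestep shift to whole months
    step_in_months = np.ceil(first_step * (365 // NUM_TIMESTEPS) / 30)
    return int((pd.to_datetime(start_date).month - 1 + step_in_months) % 12)


def exact_month(start_date: str, first_step: int) -> int:
    # ground truth: shift by actual 10-day dekads, then read off the month
    shifted = pd.to_datetime(start_date) + pd.Timedelta(days=10 * first_step)
    return shifted.month - 1


def month_error(a: int, b: int) -> int:
    # circular distance, so December vs January counts as 1, not 11
    d = abs(a - b)
    return min(d, 12 - d)


errors = [
    month_error(approx_month("2020-01-01", s), exact_month("2020-01-01", s))
    for s in range(NUM_TIMESTEPS)
]
```

For this start date the error never exceeds one month, which matches the "little bit of precision" being sacrificed.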

Contributor:

Yeah, let's make a note that Giorgia can test this out properly when doing dekadal runs.

presto/eval.py (outdated, resolved)
@@ -275,6 +326,7 @@ def __init__(
dataframe = dataframe[(~dataframe.end_date.dt.year.isin(years_to_remove))]
self.target_function = target_function if target_function is not None else self.target_crop
self._class_weights: Optional[np.ndarray] = None
self.augment = augment
Contributor:

Should we add logging somewhere to show that augmentation is enabled when initializing a dataset like this?
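One possible place is the dataset constructor (a sketch; the class name and constructor signature here are assumptions):

```python
import logging

logger = logging.getLogger(__name__)


class WorldCerealLabelledDataset:
    def __init__(self, dataframe, augment: bool = False):
        self.df = dataframe
        self.augment = augment
        if augment:
            # make temporal jittering visible in the run logs
            logger.info(
                "Augmentation enabled: temporal jittering of timestep positions"
            )
```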

Author:

Good point. Addressed that here: ae27e25

presto/utils.py (outdated, resolved)
# add dummy value + rename stuff for compatibility with existing functions
df["OPTICAL-B8A"] = NODATAVALUE

# TODO: this needs to go away once the transition to new data is complete
Contributor:

Yes, and can we immediately switch to the Presto naming convention then as well?

@kvantricht (Contributor):
@cbutsko is this branch already part of croptype? Were you already training on the new parquet format, but maybe only locally?

@cbutsko (Author) commented Sep 23, 2024:

@kvantricht this branch is not yet part of croptype. My plan is to sync it with main now, and then merge it directly into croptype, as you suggested.

Contributor:

@cbutsko in croptype branch this file is removed. What does that mean for the changes here?

Author:

I merged the functionality of paper_eval into train.py. I think in this branch it doesn't really matter: I just wanted to align this branch with main and then merge it into croptype, resolving these conflicts there. Also, after discussing self-supervised training with Giorgia, it now seems more feasible to create two separate files with clearer functions, something like train_self_supervised.py and train_finetuned.py.

Contributor:

That sounds like a good plan!

@@ -32,7 +34,7 @@


class WorldCerealBase(Dataset):
_NODATAVALUE = 65535
# _NODATAVALUE = 65535
Contributor:

@cbutsko why is this commented out? And are the changes below, like the addition of get_timestep_positions, not part of the other branch?

_NODATAVALUE = 65535
# _NODATAVALUE = 65535
Y = "worldcereal_cropland"
BAND_MAPPING = {
Contributor:

These changes will all conflict with the croptype branch, I think?

@classmethod
def get_timestep_positions(cls, row_d: Dict, augment: bool = False) -> List[int]:
available_timesteps = int(row_d["available_timesteps"])
valid_position = int(row_d["valid_position"])
Collaborator:

Could you add a comment describing what valid_position represents?

@gabrieltseng (Collaborator) Sep 27, 2024:

I see it's created in utils.py, but it's still not obvious to me what it represents; could you add a comment there instead?

available_timesteps = int(row_d["available_timesteps"])
valid_position = int(row_d["valid_position"])

if not augment:
Collaborator:

Could you also make a note that augment here means temporal jittering? Am I right in understanding that MIN_EDGE_BUFFER is just used to determine how much temporal jittering is allowed? In that case, maybe it makes sense as a class attribute (or even as a default argument to this function).
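A sketch of that suggestion (the MIN_EDGE_BUFFER value and the jitter logic are assumptions, not the actual implementation; it assumes the series is long enough for a valid window):

```python
import random
from typing import Dict, List


class WorldCerealBase:
    NUM_TIMESTEPS = 12
    # Hypothetical class attribute: how far valid_position must stay from the
    # window edges. A subclass (e.g. a dekadal dataset with NUM_TIMESTEPS = 36)
    # can override it together with NUM_TIMESTEPS.
    MIN_EDGE_BUFFER = 2

    @classmethod
    def get_timestep_positions(cls, row_d: Dict, augment: bool = False) -> List[int]:
        available = int(row_d["available_timesteps"])
        valid_position = int(row_d["valid_position"])
        # admissible window starts keep valid_position at least MIN_EDGE_BUFFER
        # steps from both window edges, and the window inside the series
        lo = max(0, valid_position + cls.MIN_EDGE_BUFFER - cls.NUM_TIMESTEPS + 1)
        hi = min(available - cls.NUM_TIMESTEPS, valid_position - cls.MIN_EDGE_BUFFER)
        if augment:
            # augment == temporal jittering: random admissible window start
            start = random.randint(lo, hi)
        else:
            # deterministic: center the window on valid_position, then clamp
            start = min(max(valid_position - cls.NUM_TIMESTEPS // 2, lo), hi)
        return list(range(start, start + cls.NUM_TIMESTEPS))
```

With augment=False this is fully deterministic; with augment=True the window start is jittered, but valid_position can never get closer than MIN_EDGE_BUFFER steps to a window edge.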

@@ -306,7 +311,7 @@ def evaluate(
test_preds_np = test_preds_np >= self.threshold
prefix = f"{self.name}_{finetuned_model.__class__.__name__}"

catboost_preds = test_ds.df.worldcereal_prediction
catboost_preds = test_ds.df.worldcereal_prediction == 11
Collaborator:

What is the motivation for this change?

@@ -0,0 +1,1557 @@
{
Collaborator:

I assume this was an accidental commit?

@gabrieltseng (Collaborator):

Nice job @cbutsko! Threw in my own comments too. +1 to @kvantricht's point that process_parquet is doing a lot, but the docstring is really helpful.
