
[BUG] NVtabular.dataset.to_parquet(...) Improperly matched output dtypes detected in time, object and datetime64[ns] #1883

Open
Zachacy opened this issue Aug 15, 2024 · 1 comment
Labels
bug Something isn't working

Zachacy commented Aug 15, 2024

I tried to run the NVIDIA Merlin tutorial on Microsoft's News Dataset (MIND).
While running Step 5: Feature Engineering - time-based features, the following error occurred:

data_train = nvt.Dataset(os.path.join(data_input_path, "train.parquet"), engine="parquet",part_size="256MB")
data_valid = nvt.Dataset(os.path.join(data_input_path, "valid.parquet"), engine="parquet",part_size="256MB")

dict_dtypes={}
for col in cat_features.columns:
    dict_dtypes[col] = np.int64

for col in cont_features.columns:
    dict_dtypes[col] = np.float32

for col in labels:
    dict_dtypes[col] = np.float32

%%time
proc.fit(data_train)

%%time

proc.transform(data_train).to_parquet(output_path=output_train_path,  # <- this line raises the error
                                shuffle=nvt.io.Shuffle.PER_PARTITION,
                                dtypes=dict_dtypes,
                                out_files_per_proc=10,
                                cats = cat_features.columns,
                                conts = cont_features.columns,
                                labels = labels)
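One thing worth noting (a guess, not part of the tutorial): the `dict_dtypes` built above only covers the categorical, continuous, and label columns, so a datetime column such as `time` is never declared, and if it arrives as strings (dtype `object`) it will not match the `datetime64[ns]` the workflow's output schema expects. A minimal pandas sketch of casting such a column explicitly so the actual and expected dtypes agree (hypothetical data, not the MIND files):

```python
import pandas as pd

# Hypothetical minimal frame: the "time" column arrives as strings (dtype object)
df = pd.DataFrame({"time": ["2019-11-14 08:01:48", "2019-11-14 09:30:00"]})
assert df["time"].dtype == object

# Cast explicitly so the column's dtype matches the datetime64[ns] the schema expects
df["time"] = pd.to_datetime(df["time"])
assert str(df["time"].dtype) == "datetime64[ns]"
```

Applying the equivalent cast (in cudf the same `to_datetime` API exists) before handing the data to the workflow, or declaring the column in `dict_dtypes`, may avoid the mismatch; I have not verified this against the 22.04 container.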

/core/merlin/io/dataset.py:863: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
Failed to transform operator <nvtabular.ops.lambdaop.LambdaOp object at 0x7fa63bd86a00>
Traceback (most recent call last):
File "/nvtabular/nvtabular/workflow/workflow.py", line 485, in _transform_partition
raise TypeError(
TypeError: Improperly matched output dtypes detected in time, object and datetime64[ns]
distributed.worker - WARNING - Compute Failed
Function: _write_subgraph
args: (<merlin.io.dask.DaskSubgraph object at 0x7fa68c63f6d0>, ('part_0.parquet', 'part_1.parquet', 'part_2.parquet', 'part_3.parquet', 'part_4.parquet', 'part_5.parquet', 'part_6.parquet', 'part_7.parquet', 'part_8.parquet', 'part_9.parquet'), '/share/recommenders/MIND/processed_nvt/train', <Shuffle.PER_PARTITION: 0>, <fsspec.implementations.local.LocalFileSystem object at 0x7fa76da543a0>, ['time_hour', 'hist_cat_0', 'hist_subcat_0', 'hist_cat_1', 'hist_subcat_1', 'hist_cat_2', 'hist_subcat_2', 'hist_cat_3', 'hist_subcat_3', 'hist_cat_4', 'hist_subcat_4', 'hist_cat_5', 'hist_subcat_5', 'hist_cat_6', 'hist_subcat_6', 'hist_cat_7', 'hist_subcat_7', 'hist_cat_8', 'hist_subcat_8', 'hist_cat_9', 'hist_subcat_9', 'impr_cat', 'impr_subcat', 'impression_id', 'uid', 'time_minute', 'time_second', 'time_wd', 'time_day', 'time_day_week', 'time'], ['hist_count'], ['label'], 'parquet', 0, False, '')
kwargs: {}
Exception: "TypeError('Improperly matched output dtypes detected in time, object and datetime64[ns]')"
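For context, the `TypeError` is raised by a dtype check in `workflow.py` (`_transform_partition`): the workflow compares each transformed column's actual dtype against the dtype recorded in the output schema and raises on a mismatch. A rough illustration of that kind of check in plain Python (this is not NVTabular's actual code; `check_output_dtypes` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def check_output_dtypes(df, expected):
    """Raise if any column's actual dtype differs from the expected schema dtype."""
    for col, dt in expected.items():
        actual, wanted = df[col].dtype, np.dtype(dt)
        if actual != wanted:
            raise TypeError(
                f"Improperly matched output dtypes detected in {col}, "
                f"{actual} and {wanted}"
            )

# "time" holds strings, so its dtype is object rather than datetime64[ns]
df = pd.DataFrame({"time": ["08:01:48"]})
try:
    check_output_dtypes(df, {"time": "datetime64[ns]"})
except TypeError as e:
    print(e)  # Improperly matched output dtypes detected in time, object and datetime64[ns]
```

This matches the shape of the error message in the traceback: the offending column name followed by the actual and expected dtypes.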

My environment is based on [merlin-training:22.04].

Thanks!

@Zachacy Zachacy added the bug Something isn't working label Aug 15, 2024

rnyak commented Sep 24, 2024

@EmmaQiaoCh @minseokl fyi.
