[CSV data source enhancement] Support skipFirstNLines & skipLastNLines #1804

Open

ZhengshuaiPENG opened this issue Aug 4, 2022 · 1 comment

Assignees

Labels

byzer-lang 2.3.3 byzer-lang 2.4.0

Contributor

ZhengshuaiPENG commented Aug 4, 2022

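Requirement

The csv data source currently cannot skip leading or trailing lines at load time. Supporting this would help users handle csv files that are not fully standard and would make loading them more efficient.

The following parameters should be added to the where clause: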

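skipFirstNLines: a natural number; skips the first N lines at load time. The value must be validated, and an error reported for invalid input.
skipLastNLines: a natural number; skips the last N lines at load time. The value must be validated, and an error reported for invalid input.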
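
As a rough illustration of the semantics requested above (not the Byzer implementation, which lives in the linked PRs), the validation and skipping can be sketched in Python. The helper names `parse_skip_count` and `skip_lines` are hypothetical, and the sketch assumes "natural number" includes 0 and that where-clause values may arrive as strings.

```python
def parse_skip_count(name, raw):
    """Validate a where-clause value as a natural number (0, 1, 2, ...).

    Hypothetical helper: rejects negatives, decimals, and non-numeric text
    with an error that names the offending parameter.
    """
    text = str(raw).strip()
    if not text.isdecimal():
        raise ValueError(f"{name} must be a natural number, got {raw!r}")
    return int(text)


def skip_lines(lines, skip_first=0, skip_last=0):
    """Drop the first `skip_first` and last `skip_last` entries of `lines`."""
    first = parse_skip_count("skipFirstNLines", skip_first)
    last = parse_skip_count("skipLastNLines", skip_last)
    end = len(lines) - last
    # If the two skips cover the whole input, nothing is left to load.
    return lines[first:end] if end > first else []
```

For example, `skip_lines(["note", "id,name", "1,a", "total"], "1", "1")` keeps only the two middle lines, while a value such as `"-1"` or `"1.5"` raises a `ValueError`.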

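Note that both single-file directories and standard sharded-file directories must be supported.

A directory containing a single file: a csv is uploaded in the notebook and then loaded with a load statement.
A directory containing multiple data shard files: typically a directory produced by save csv.`$path`, which follows the Hadoop file storage layout.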
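
The two layouts above can be handled through one code path, sketched here in Python. This is only one possible interpretation: the issue does not say whether the skip applies per shard or across the whole directory, so the sketch assumes the shards, read in file-name order, form one logical file. The function names are hypothetical, and the skip counts are taken as already-validated integers.

```python
import os


def list_data_files(path):
    """Return the data files behind a load path, in name order."""
    if os.path.isfile(path):
        return [path]
    # Hadoop-style output directories also hold marker and checksum files
    # (for example _SUCCESS or .part-00000.crc) that are not data.
    return [
        os.path.join(path, name)
        for name in sorted(os.listdir(path))
        if not name.startswith(("_", "."))
        and os.path.isfile(os.path.join(path, name))
    ]


def load_lines(path, skip_first=0, skip_last=0):
    """Read all shards as one logical file, then trim both ends.

    Assumption: the skip counts apply to the concatenated content,
    not to each shard separately.
    """
    lines = []
    for file_path in list_data_files(path):
        with open(file_path, encoding="utf-8") as handle:
            lines.extend(handle.read().splitlines())
    end = len(lines) - skip_last
    return lines[skip_first:end] if end > skip_first else []
```

Reading everything into memory keeps the sketch short; a distributed implementation would instead need to know which shards hold the first and last N lines.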


ZhengshuaiPENG added the byzer-lang 2.3.3 label

ZhengshuaiPENG assigned ckeys

Contributor Author

ZhengshuaiPENG commented Sep 2, 2022 (edited)

SkipFirstNLines PR: https://github.com/byzer-org/byzer-lang/pull/1805, released in Byzer 2.3.3

SkipLastNLines PR: https://github.com/byzer-org/byzer-lang/pull/1842/files, expected to be released in Byzer 2.3.4

ZhengshuaiPENG added the byzer-lang 2.4.0 label
