[CSV data source enhancement] Support skipFirstNLines & skipLastNLines #1804

Open
ZhengshuaiPENG opened this issue Aug 4, 2022 · 1 comment

Comments

@ZhengshuaiPENG
Contributor

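Requirement

The csv data source currently cannot skip leading or trailing lines at load time. Supporting this would help users handle csv files that are not quite standard and would make loading them more efficient.

The following parameters should be added to the where clause:

  • skipFirstNLines: a natural number; skips the first N lines on load. Invalid values must be checked and reported as an error.
  • skipLastNLines: a natural number; skips the last N lines on load. Invalid values must be checked and reported as an error.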
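
The parameter check and the skip semantics described above can be sketched roughly as follows. This is an illustration in Python, not the Byzer implementation: the helper names `parse_skip_count` and `skip_lines` are hypothetical, the options are assumed to arrive as a string-valued dict, a missing option is assumed to default to 0, and 0 is assumed to count as a valid natural number.

```python
def parse_skip_count(options, name):
    """Read a skip option (e.g. "skipFirstNLines") and validate it.

    Hypothetical helper: assumes where-clause options arrive as a dict of
    strings and that a missing option means "skip nothing".
    """
    raw = options.get(name, "0")
    try:
        value = int(str(raw).strip())
    except ValueError:
        raise ValueError(f"{name} must be a natural number, got {raw!r}")
    if value < 0:
        raise ValueError(f"{name} must be a natural number, got {raw!r}")
    return value


def skip_lines(lines, skip_first, skip_last):
    """Drop the first skip_first and the last skip_last entries of a list."""
    end = len(lines) - skip_last
    # If the two skips together cover every line, nothing is left.
    if end <= skip_first:
        return []
    return lines[skip_first:end]
```

Rejecting bad values up front, rather than silently treating them as 0, matches the request that invalid parameters be reported as errors.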

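Note that both the single-file case and the standard partitioned-file directory case need to be supported:

  • Single file in the directory: a csv is uploaded in the notebook and then loaded with a load statement.
  • Multiple data part files in the directory: typically a directory produced by save csv.`$path`, following the Hadoop file storage layout.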
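
One way to picture the two cases is the Python sketch below. It rests on assumptions the issue does not settle: part files are treated as one logical file in sorted name order, so lines are skipped from the start of the first part and the end of the last part, and everything is read into memory for simplicity. The function names `list_data_files` and `load_lines` are hypothetical. Files whose names start with `_` or `.` (for example `_SUCCESS`) are ignored, as is conventional for Hadoop output directories.

```python
from pathlib import Path


def list_data_files(path):
    """Return the data files for a path: the file itself, or the part
    files of a directory, skipping Hadoop-style hidden files."""
    p = Path(path)
    if p.is_file():
        return [p]
    return sorted(
        f for f in p.iterdir()
        if f.is_file() and not f.name.startswith(("_", "."))
    )


def load_lines(path, skip_first=0, skip_last=0):
    """Read all lines, then drop leading and trailing lines.

    Assumption: the skip counts apply to the concatenation of all part
    files, not to each part file separately.
    """
    lines = []
    for f in list_data_files(path):
        lines.extend(f.read_text(encoding="utf-8").splitlines())
    end = len(lines) - skip_last
    if end <= skip_first:
        return []
    return lines[skip_first:end]
```

If the intended behaviour is instead to skip lines in every part file, the slicing would move inside the loop; the issue leaves that choice open.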
@ZhengshuaiPENG
Contributor Author

ZhengshuaiPENG commented Sep 2, 2022

SkipFirstNLines PR: https://github.com/byzer-org/byzer-lang/pull/1805, released in Byzer 2.3.3

SkipLastNLines PR: https://github.com/byzer-org/byzer-lang/pull/1842/files, expected to be released in Byzer 2.3.4
