Alternate filename formatting for split #1365

Closed
sloanlance opened this issue Aug 21, 2023 · 1 comment · Fixed by #1366

Comments

@sloanlance
Contributor

I recently enjoyed using split to break a large TSV file into several smaller ones using the -g option to group by field values. Using basic shell tools took a few steps to do this, but mlr could do it all in one. However, when mlr was done, I still had to write a loop to rename all the output files. It is…

# Fix file name formatting.
# Remove mlr's "split_" prefix, change "_" and "+" back to spaces.

for filename in split_*.tsv; do
  # Skip the literal pattern when no files match.
  [ -e "${filename}" ] || continue
  newFilename="$(sed -e 's/^split_//' -e 's/[+_]/ /g' <<< "${filename}")"
  mv -- "${filename}" "${newFilename}"
done

What I'd like to see split do better is…

  1. When --prefix '' is specified, DO NOT start the output filenames with an underscore. I.e., do not name the files _col+a+value_col+b+value.tsv.
  2. DO NOT use underscores between parts of the filename.
  3. DO NOT replace spaces with + signs.

To address those, I would change how the prefix is handled (item 1) and add an option to not change spaces to other characters (items 2 and 3).
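
A minimal shell sketch of the naming rules requested above, to make the three items concrete. It is not Miller's code: `build_name` and its argument order are hypothetical, and the joiner is passed in as a plain string rather than through any real `mlr` option.

```shell
# Hypothetical helper, not part of Miller:
#   build_name PREFIX JOINER SUFFIX PART...
# The body runs in a subshell, so its variables stay local.
build_name() (
  prefix="$1" joiner="$2" suffix="$3"
  shift 3
  name="" first=1
  for part in "$@"; do
    if [ "$first" -eq 1 ]; then
      name="$part"
      first=0
    else
      # Items 2 and 3: parts are joined with the caller's joiner and left unescaped.
      name="${name}${joiner}${part}"
    fi
  done
  # Item 1: only add the joiner after the prefix when the prefix is non-empty.
  if [ -n "$prefix" ]; then
    name="${prefix}${joiner}${name}"
  fi
  printf '%s.%s\n' "$name" "$suffix"
)

build_name "split" "_" "tsv" "col a value" "col b value"
build_name "" " " "tsv" "col a value" "col b value"
```

The first call prints `split_col a value_col b value.tsv`; the second, with an empty prefix and a space as the joiner, prints `col a value col b value.tsv`.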

I'm not very experienced with Go, but I'm keen to learn, so I would be happy to work on a solution for this and open a PR. If this is deemed to be a legitimate issue, that is.

@johnkerl changed the title from "alternate filename formatting for split" to "Alternate filename formatting for split" on Aug 21, 2023
@johnkerl
Owner

@sloanlance I'd love to look at a PR! :)

sloanlance pushed a commit to sloanlance/miller that referenced this issue Aug 21, 2023
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 21, 2023
* Don't use joiner string when prefix is empty.
* Add option to specify joiner string.
* Add option to not URL-escape file names.
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 21, 2023
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 22, 2023
I **_thought_** it'd be cool to apply URL-escaping to the file name prefix as well, just in case it included spaces or other characters.  I forgot that a common use for the prefix is to specify a directory path that will contain the file.  When the slashes ("`/`") of the path are URL-escaped, they become "`%2F`" and the directories will not be created.  So, I moved the prefix handling code to come after the URL-escaping.
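
A rough shell illustration of why the order matters. The `escape` function is only a stand-in that handles the two characters mentioned in this thread (space and slash); it is not the escaping routine Miller uses, and the prefix and key values are made up.

```shell
# Stand-in for URL-escaping: only handles the two characters discussed here.
escape() {
  printf '%s' "$1" | sed -e 's/ /+/g' -e 's|/|%2F|g'
}

prefix="out/by-color"   # hypothetical prefix that names a directory
key="dark red"          # hypothetical grouping value

# Escaping prefix and key together mangles the directory separator.
escaped_together="$(escape "${prefix}_${key}")"

# Escaping only the key, then prepending the prefix, keeps the path intact.
escaped_key_only="${prefix}_$(escape "${key}")"

printf '%s\n' "$escaped_together"   # out%2Fby-color_dark+red
printf '%s\n' "$escaped_key_only"   # out/by-color_dark+red
```

Only the second form still contains a real `/`, so only it can refer to a file inside the `out` directory.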
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 22, 2023
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 22, 2023
Trying to make the `return` statement cleaner, I thought it'd be good to add the file name suffix immediately after the file name is URL-escaped.  I'd forgotten that the suffix will not be added if the new `-e` option is used to skip URL-escaping.  So, I put the suffix back where I had it.
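
The same pitfall, sketched in shell under simplifying assumptions: `name_for` is a hypothetical helper, its third argument plays the role of the `-e` option, and the escaping is reduced to replacing spaces. The suffix is appended after the conditional so that both paths receive it.

```shell
# Hypothetical helper: name_for KEY SUFFIX SKIP_ESCAPE
name_for() (
  key="$1" suffix="$2" skip_escape="$3"
  if [ "$skip_escape" != "yes" ]; then
    # Stand-in for URL-escaping: only spaces are handled here.
    key="$(printf '%s' "$key" | sed 's/ /+/g')"
  fi
  # Appending the suffix outside the conditional covers both paths.
  printf '%s.%s\n' "$key" "$suffix"
)

name_for "dark red" "tsv" "no"    # escaped path: dark+red.tsv
name_for "dark red" "tsv" "yes"   # unescaped path: dark red.tsv
```

Had the suffix been appended inside the `if` branch, the second call would have produced a name without `.tsv`.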
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 22, 2023
Not strictly part of this issue, but while checking for docs to update as a result of my changes, I noticed that this document showed how to split data using the `put` and `tee` combination but did not mention the `split` verb.
sloanlance added a commit to sloanlance/miller that referenced this issue Aug 22, 2023
When I ran `make dev`, generating `data-diving-examples.md` failed.  The two `manpage.txt` files ended up empty, but `mlr.1` seems to be correct.
johnkerl pushed a commit that referenced this issue Aug 23, 2023
* #1365 - filename options for `split`

* Don't use joiner string when prefix is empty.
* Add option to specify joiner string.
* Add option to not URL-escape file names.

* #1365 - update documentation

* #1365 - don't URL-escape file name prefix

I **_thought_** it'd be cool to apply URL-escaping to the file name prefix as well, just in case it included spaces or other characters.  I forgot that a common use for the prefix is to specify a directory path that will contain the file.  When the slashes ("`/`") of the path are URL-escaped, they become "`%2F`" and the directories will not be created.  So, I moved the prefix handling code to come after the URL-escaping.

* #1365 - new `split` options for CLI help output

* #1365 - fix escape/suffix logic error

Trying to make the `return` statement cleaner, I thought it'd be good to add the file name suffix immediately after the file name is URL-escaped.  I'd forgotten that the suffix will not be added if the new `-e` option is used to skip URL-escaping.  So, I put the suffix back where I had it.

* #1365 - add `split` to the "10 minutes" document

Not strictly part of this issue, but while checking for docs to update as a result of my changes, I noticed that this document showed how to split data using the `put` and `tee` combination but did not mention the `split` verb.

* #1365 - updated manpage

When I ran `make dev`, generating `data-diving-examples.md` failed.  The two `manpage.txt` files ended up empty, but `mlr.1` seems to be correct.

---------

Co-authored-by: Mr. Lance E Sloan (sloanlance) <[email protected]>