.SD vignette based on SO answer #3412

Closed
MichaelChirico opened this issue Feb 17, 2019 · 6 comments · Fixed by #3572

@MichaelChirico
Member

This SO answer

Encouraged to include as vignette:

I want to be sure to add a bit about chaining, since I mentioned it but forgot to include it in the original answer.

Another hold-up is the use of Lahman data. I guess I can try and find a URL instead of adding a Suggests for that package...

@arunsrinivasan
Member

arunsrinivasan commented Feb 17, 2019

Sounds great! Why not go for a "special symbols" vignette altogether? Could you not use the same flights data we already use though?

@jangorecki
Member

A vignette only about special symbols sounds cryptic. Maybe a vignette about complex queries instead, which would cover special symbols and tricks like { inside j, or newcol := { ... }; printing from j; grouping columns being length 1 inside j; and combining columns with an .SD list: c(list(col=f1(col1)), lapply(.SD, f2), list(col5=f3(col1))).
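For reference, a minimal sketch of the tricks listed above. The toy table and its column names (g, x, y, z) are made up for illustration and are not taken from the SO answer or any existing vignette:

```r
library(data.table)

# Toy table; names are hypothetical, chosen only for this illustration.
DT = data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# { } inside j: run several statements; the last value is what j returns.
res1 = DT[, {
  total = sum(x)
  list(total = total, n = .N)
}, by = g]

# Grouping columns are length 1 inside j, whatever the group size.
res2 = DT[, list(len_g = length(g), n = .N), by = g]

# Combine named columns with lapply(.SD, ...) by concatenating lists.
res3 = DT[, c(list(n = .N), lapply(.SD, mean)), by = g, .SDcols = c("x", "y")]

# newcol := { ... }: compute a new column through intermediate steps.
DT[, z := {
  s = x + y
  s / 2
}]
```

Each by = g result has one row per group, in order of first appearance (a, then b), which makes the output easy to check by eye.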

@Henrik-P

Henrik-P commented Feb 18, 2019

Because the choice of data is mentioned: in general, I think minimal data sets work better in examples and vignettes, since it is so much easier to track the results of each step of code. To be honest, I don't think we need 253316 rows (flights) or 44963 (Pitching) to explain and demonstrate the actual functionality of .SD (or other data.table functions, for that matter) ;)

Benchmarking on large data sets could be provided in a separate vignette.

Just my 2c.

@MichaelChirico MichaelChirico self-assigned this Apr 3, 2019
@MichaelChirico
Member Author

@Henrik-P I have to disagree.

  • 253K and 45K rows are tiny in the grand scheme of things. They are trivially small to fread:
library(data.table)

# Reading the flights data (253K rows) takes a fraction of a second
system.time(fread('vignettes/flights14.csv'))
#   user  system elapsed 
#  0.102   0.012   0.037

library(Lahman)
Pitching = as.data.table(Pitching)
Pitching = Pitching[ , .(playerID, yearID, teamID, W, L, G, ERA)]

# Round trip: write the 45K-row Pitching table to a temp file, then read it back
tmp = tempfile()
system.time({fwrite(Pitching, tmp); fread(tmp)})
#    user  system elapsed 
#   0.020   0.002   0.023 

Could we use a subset of the data? Sure, if we're careful to preserve reproducibility. As for "tracking results", this is easily surmountable: the vignette writer can drill down to a specific group as needed.

  • Practical examples are strictly superior to contrived ones. Using foo and bar IMO is the quickest way to ensure a vignette stays unread. flights ✈️ data is pretty universal & easy to understand... Lahman baseball ⚾️ data is a bit more niche, but I don't think the basic analysis gets too abstruse, and things like ERA can be meaningfully conveyed in a few words.

@jangorecki
Member

They are not tiny for the human eye; I think this is what @Henrik-P means. If we can present the same functionality with 20-40 rows of data, it will be easier for readers to follow.

@MichaelChirico
Member Author

I'm going to stick with the original data. I think small data is good for the examples section in documentation; vignettes (to me) are about telling/writing a story, a sort of interactive blog post.

Since Lahman is on GitHub, I'll just download.file/load the data & direct users to the package & original website as proper citation.
