Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update R docs based on deprecated parameters/behaviour #9437

Merged
merged 6 commits into from
Aug 9, 2023

Conversation

joshbrowning2358
Copy link
Contributor

@joshbrowning2358 joshbrowning2358 commented Aug 3, 2023

This PR is meant to resolve one of the issues on the Roadmap features issue; in particular it does this task "Update the R tutorial "Discover your data" to reflect the deletion of RealCover".

Close #5293 .

@trivialfis
Copy link
Member

Thank you for the nice work, I will review the changes next week. Need some time to go through the document.


```{r, results='hide'}
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
```

> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **XGBoost** **R** package use `data.table`.
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **XGBoost's** **R** package use `data.table`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the TLS address?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think that was a typo. I reverted that change back to https


`Cover` measures the relative quantity of observations concerned by a feature.
`Cover` measures the relative percent of observations concerned by a feature.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, it's the hessian, which serves as a proxy, not sure how to phrase it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the text, let me know if that looks good now!


`Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).

#### Improvement in the interpretability of feature importance data.table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we can revise this part instead of removing it entirely? We still have the properties representing global feature importance, including weight, gain, cover, total_gain, and total_cover.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I removed this section is because it seems to discuss importance information that is now deprecated. The data and label args are now deprecated here, so the code that was here before can no longer compute RealCover or RealCover %. As far as I can tell, it seems like total_gain and total_cover are already discussed above under the name Gain and Cover (based on this) so I'm not sure what else would be useful to talk about.

Alternatively, I could file an issue to support those metrics again, or I could write some R code in this vignette to compute those values if that'd be useful!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, let's remove it.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating the document!

@trivialfis trivialfis merged commit 7f85484 into dmlc:master Aug 9, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

don't have "Split" "RealCover" "RealCover" col
2 participants