[G]VIF #548

Merged
merged 4 commits into from
Sep 14, 2023
2 changes: 1 addition & 1 deletion Project.toml
@@ -27,7 +27,7 @@ SpecialFunctions = "0.6, 0.7, 0.8, 0.9, 0.10, 1, 2.0"
StatsAPI = "1.4"
StatsBase = "0.33.5, 0.34"
StatsFuns = "0.6, 0.7, 0.8, 0.9, 1.0"
-StatsModels = "0.6.23, 0.7"
+StatsModels = "0.7.3"
Tables = "1"
julia = "1.6"

3 changes: 2 additions & 1 deletion src/GLM.jl
@@ -10,6 +10,7 @@ module GLM
import Base: (\), convert, show, size
import LinearAlgebra: cholesky, cholesky!
import Statistics: cor
using StatsAPI
import StatsBase: coef, coeftable, coefnames, confint, deviance, nulldeviance, dof, dof_residual,
loglikelihood, nullloglikelihood, nobs, stderror, vcov,
residuals, predict, predict!,
@@ -21,7 +22,7 @@ module GLM
export coef, coeftable, confint, deviance, nulldeviance, dof, dof_residual,
loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict,
fitted, fit, fit!, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
-cooksdistance, hasintercept, dispersion
+cooksdistance, hasintercept, dispersion, vif, gvif, termnames
Member

Maybe we should just reexport StatsModels? That sounds natural.

Member Author

The only "problem" is that breaking changes in StatsModels necessarily become breaking changes in GLM.

Member

Yeah, but there shouldn't be breaking changes in StatsModels minor releases, and anyway users who need these functions will do `using StatsModels`.


export
# types
2 changes: 1 addition & 1 deletion src/linpred.jl
@@ -362,7 +362,7 @@ fitted(m::LinPredModel) = m.rr.mu
predict(mm::LinPredModel) = fitted(mm)
residuals(obj::LinPredModel) = residuals(obj.rr)

-function formula(obj::LinPredModel)
+function StatsModels.formula(obj::LinPredModel)
Contributor

While we are at it: when is this method called? When I do:

julia> formula(lm(x, y))
ERROR: type LinearModel has no field fr

julia> formula(glm(x, y, Normal()))
ERROR: type GeneralizedLinearModel has no field fr

other methods are called, not this one.

Do we have tests for the different cases where no formula is present?

Member Author

Hmmm, will investigate. I thought we caught this when Milan removed TableRegressionModel.

Member Author

On current master:

julia> formula(lm(ones(10, 1),  randn(10)))
ERROR: ArgumentError: model was fitted without a formula
Stacktrace:
 [1] formula(obj::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}})
   @ GLM ~/Code/GLM.jl/src/linpred.jl:366
 [2] top-level scope
   @ REPL[13]:1

julia> formula(glm(ones(10, 1),  randn(10), Normal()))
ERROR: ArgumentError: model was fitted without a formula
Stacktrace:
 [1] formula(obj::GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}})
   @ GLM ~/Code/GLM.jl/src/linpred.jl:366
 [2] top-level scope
   @ REPL[14]:1

(will have to keep this in mind for the backport to 1.x where we still have TableRegressionModel)
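
As a rough illustration of the kind of test being asked about, here is a minimal, self-contained sketch of the guard-and-throw pattern. `ToyModel` and `toy_formula` are hypothetical stand-ins, not GLM types or functions; they only mirror the `=== nothing` check in the diff.

```julia
using Test

# Hypothetical stand-in for a fitted model; NOT GLM's actual type.
# `formula` is `nothing` when the model was fitted from a matrix.
struct ToyModel
    formula::Union{Nothing,String}
end

# Mirrors the guard in the diff: throw instead of returning `nothing`.
function toy_formula(m::ToyModel)
    m.formula === nothing && throw(ArgumentError("model was fitted without a formula"))
    return m.formula
end

# The matrix-fitted case must throw; the formula-fitted case returns the formula.
@test_throws ArgumentError toy_formula(ToyModel(nothing))
@test toy_formula(ToyModel("y ~ 1 + x")) == "y ~ 1 + x"
```

Going by the REPL output above, the corresponding check against GLM itself would presumably be `@test_throws ArgumentError formula(lm(ones(10, 1), randn(10)))`, and likewise for `glm`.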

obj.formula === nothing && throw(ArgumentError("model was fitted without a formula"))
return obj.formula
end
9 changes: 9 additions & 0 deletions test/runtests.jl
@@ -2011,3 +2011,12 @@ end
@test_throws ArgumentError lm(@formula(OptDen ~ Carb), form; method=:pr)
@test_throws ArgumentError glm(@formula(OptDen ~ Carb), form, Normal(); method=:pr)
end

@testset "[G]VIF" begin
duncan = RDatasets.dataset("car", "Duncan")
lm1 = lm(@formula(Prestige ~ 1 + Income + Education), duncan)
@test termnames(lm1)[2] == coefnames(lm1)
Contributor

  1. Do we have tests for the case where coefnames and termnames differ?
  2. Do we have a decision on what should be done in the case of lm(X, y) (i.e. a model fitted without a formula, which still prints variable names as x1 etc.)?

Member Author

  1. This falls back to StatsModels -- the test there is just making sure we've successfully imported and exported the symbol.
  2. On master, termnames will error based on there being no formula (formula will return nothing).

Member Author

I think termnames should not be defined if there is no formula -- there are only Terms when there is a formula.

Contributor
@bkamins, Sep 14, 2023

So what should a user do to perform VIF analysis for the model = lm(X, y) case?

Member Author

vif works, but not gvif. So I think they can still do vif. If they're able to construct a model matrix directly for something with nontrivial contrast coding, then they could probably also adapt the gvif source to extract the correct columns.
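
For the matrix-only case, the textbook definitions can be sketched without a formula. This is a rough sketch under stated assumptions, not the StatsModels implementation: `vif_from_matrix` and `gvif_from_matrix` are hypothetical helper names, `X` is assumed to hold only the non-intercept predictor columns (a constant column has zero variance and would give `NaN` correlations), and the caller must supply the column indices belonging to each term by hand.

```julia
using LinearAlgebra, Statistics

# Hypothetical helpers for illustration; NOT part of GLM.jl or StatsModels.jl.

# Classical VIF: the diagonal of the inverse correlation matrix of the predictors.
# `X` has one predictor per column and must not contain the intercept column.
vif_from_matrix(X::AbstractMatrix) = diag(inv(cor(X)))

# Generalized VIF for one term occupying columns `cols` of `X`:
# det(R11) * det(R22) / det(R), where R is the full correlation matrix,
# R11 the block for the term's own columns, and R22 the block for the rest.
# With `scale=true` the result is raised to 1 / (2 * df), df = length(cols).
function gvif_from_matrix(X::AbstractMatrix, cols::AbstractVector{<:Integer};
                          scale::Bool=false)
    R = cor(X)
    rest = setdiff(axes(X, 2), cols)
    g = det(R[cols, cols]) * det(R[rest, rest]) / det(R)
    return scale ? g^(1 / (2 * length(cols))) : g
end

# Usage: three predictors, the third made collinear with the first.
X = randn(100, 3)
X[:, 3] .+= X[:, 1]
vif_from_matrix(X)                    # one value per column
gvif_from_matrix(X, [1])              # equals the VIF of column 1
gvif_from_matrix(X, [2, 3]; scale=true)  # a two-column "term"
```

For a term with a single column the GVIF reduces to the ordinary VIF, which is the same relationship the `vif(lm1) ≈ gvif(lm1)` test relies on.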

@test vif(lm1) ≈ gvif(lm1)
lm2 = lm(@formula(Prestige ~ 1 + Income + Education + Type), duncan)
@test gvif(lm2; scale=true) ≈ [1.486330, 2.301648, 1.502666] atol=1e-4
Contributor

  1. Can you please add a comment on where these values are taken from?
  2. Also, do we have tests for vif/gvif for glm?
  3. Do we have tests for vif/gvif for models without a formula?
  4. Do we have tests for vif/gvif for models that have complex formulas, something like e.g. @formula(y~(1+a*(b+log(c)))&(1+d))? (Of course this is artificial, but I hope it is clear what I mean.)

Member Author

These are just the StatsModels tests carried forward to models actually fitted here. 😄 But I can add a cross reference.

end