Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Generalised Linear Models to FSharpStats #334

Merged
merged 54 commits into from
Oct 24, 2024
Merged

Conversation

LibraChris
Copy link
Contributor

Changes introduced in this PR

  • Implemented QR-based solver for GLMs
  • Added multiple tests for various GLM link and variance functions
  • Added functions for GLM statistics and documentation

Description

This PR introduces the GLM (Generalized Linear Models) module in FSharp.Stats. The key features include the implementation of an Iteratively Reweighted Least Squares (IRLS) solver on a QR-based solver for GLMs, enhancing the accuracy and performance of statistical modeling. Extensive tests have been added and updated to ensure robustness, covering various link functions and variance calculations. Documentation has been thoroughly updated to assist users in leveraging these new features effectively.

[Required] please make sure you checked that

  • The project builds without problems on your machine

[Optional]

  • Added unit tests regarding the added features

@LibraChris
Copy link
Contributor Author

Hey @bvenn, would you mind checking this out? Please squish all the commits if possible to avoid confusion. If there is anything I should add or change please let me know

}
)

module QR =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct location for a function like this? Are these functions so specialized that nobody ever needs to use them in another context? if not, i'd suggest putting them in the managed LinearAlgebra module, otherwise you can also set the whole QR module here to internal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While some of the functions in QR are highly specialised, some, qrAlternative and solveLinearQR, aren't. The module is needed for future development when I will add a solver based on singular value decomposition, but at the moment it is redundant.

I will move the the specified functions to LinearAlgebra but keep the rest there if that´s ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allrighty. You can then mark the whole specialized module as internal to avoid leaking any of that stuff in the public API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing that would also remove it from the testing I on the stepwise function. You can have a look and tell me if it is ok now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can expose internals to other projects in the repo by adding this attribute in the project file:

<ItemGroup> 
  <InternalsVisibleTo Include="TestProject1" /> 
</ItemGroup>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kMutagene hey, I changed the code accordingly. Could you have another look if its better now?

LibraChris added a commit to LibraChris/FSharp.Stats that referenced this pull request Jun 18, 2024
for i=0 to testingArray.Length-1 do
let expected = linkExpected.[i]
let actual = linkFActual.[i]
if isInf actual then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this test setup. I think you should base the first conditional on the expected value and not on the actual value, e.g. "If the expected value is infinity then actual should be infinity", same for the tests below

Copy link
Member

@kMutagene kMutagene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a maintenance standpoint - i cannot assess the implementation itself or the Linear Algebra though, so as long as your tests pass i think this is fine, test setup is a little weird but maybe i just do not understand the specifics

@LibraChris
Copy link
Contributor Author

Hey @bvenn, anything you would change before the merge?

@bvenn
Copy link
Member

bvenn commented Jun 22, 2024

Hey @bvenn, anything you would change before the merge?

Sorry, I was busy during the last weeks. I'll check out your PR this week.

Copy link
Member

@bvenn bvenn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GLM + documentation looks good, but I have a general thought: Why is the GLM module assigned to the Fitting-namespace rather than Testing? I know that GLMs are directly linked to regression analysis, but I personally would search at Testing when looking for GLMs. But maybe I'm wrong here, lets discuss.

Comment on lines +252 to +256
let qN = setqOfA i
let qLength = getVectorLength qN
let rValue = sqrt(qLength)
r[i,i] <- rValue
qLengths[i] <- qLength
Copy link
Member

@bvenn bvenn Jun 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I get whats happening here:

  1. qLengths is a zero-vector that gets filled iteratively for each i in line 262
  2. in line 256 setqOfA is called with the current i. However in l 247 qLength is called at position i which always should be 0, because it was initialized with 0. Just after the determination of qLength the value at this specific index is replaced in line 262.

Obviously I miss a step, but is there a possibility to create the qLength by Vector.init (...) rather than keeping it mutable and accessing/mutating it at multiple positions? If not, I'm happy to merge it, but I got confused during my attempt to understand whats happening here..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you additionally set into -brackets a reference to this implementation and a short description what the benefit, drawback is when using this implementation over others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have a direct reference since I had a look at the mathematical explanation for QR via Gram Schmidt and tried implementing it myself. The only real upside using this method instead of the in FSharp.Stats established one is the difference in the output Dimensions in R. An actual reference would be the numpy implantation of the reduced qr https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html .

I also think that this implemtation needs to be optimised in the future to perfect the GLM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I get whats happening here:

  1. qLengths is a zero-vector that gets filled iteratively for each i in line 262
  2. in line 256 setqOfA is called with the current i. However in l 247 qLength is called at position i which always should be 0, because it was initialized with 0. Just after the determination of qLength the value at this specific index is replaced in line 262.

Obviously I miss a step, but is there a possibility to create the qLength by Vector.init (...) rather than keeping it mutable and accessing/mutating it at multiple positions? If not, I'm happy to merge it, but I got confused during my attempt to understand whats happening here..

The mutation was done to simplify the function to match the mathematical explanation better. This is one of the things that could be updated when I get to it.

src/FSharp.Stats/Fitting/GeneralisedLinearModel.fs Outdated Show resolved Hide resolved
// Computes the link function value for a given parameter.
getLink = fun b -> Math.Pow(b,-2.)//1.0 / b
// Computes the inverse link function value for a given parameter.
getInvLink = fun a -> Math.Pow(a,(1./ -2.))//1.0 / a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the 1./2. and later -1.-1. are left on purpose to make the implementation easier to understand 👍

Comment on lines +261 to +264
let internal signFunction x =
if x > 0. then 1.
elif x = 0. then 0.
else -1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the build in sign function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

let b = System.Math.Log(a)
let c = endV - muV
let d = c / muV
let e = -b + d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e isn't used, is it correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

@bvenn
Copy link
Member

bvenn commented Jun 23, 2024

When watching the docs dotnet fsdocs watch --eval I noticed the API-Reference is broken. Nothing is displayed. I'm not sure if the cause lies within this PR or some error happened earlier. If you find time, you can check if it worked prior to your commits.

@bvenn
Copy link
Member

bvenn commented Jul 3, 2024

When watching the docs dotnet fsdocs watch --eval I noticed the API-Reference is broken. Nothing is displayed. I'm not sure if the cause lies within this PR or some error happened earlier. If you find time, you can check if it worked prior to your commits.

The issue seems to be older than this PR or it is due to my setup. Could you check the docs in your machine?

@kMutagene
Copy link
Member

When watching the docs dotnet fsdocs watch --eval I noticed the API-Reference is broken. Nothing is displayed. I'm not sure if the cause lies within this PR or some error happened earlier. If you find time, you can check if it worked prior to your commits.

The issue seems to be older than this PR or it is due to my setup. Could you check the docs in your machine?

Try using the build target (build.cmd watchdocs), which should make sure that the necessary binaries are built before creating the docs

@bvenn
Copy link
Member

bvenn commented Jul 3, 2024

I still get the following error during build watchdocs:

"System.Xml.XmlException: The ',' character, hexadecimal value 0x2C, cannot be included in a name. Line 15278, position 99"
I think I fixed an issue similar to this within the last months.

@LibraChris your branch is 42 commits behind, if you could rebase, the issue may be gone

Rename PersonOfZ to PearsonOfZScore
Comment on lines 36 to 44
<Compile Include="GeneralisedLinearModels.fs" />
<Compile Include="Quantile.fs" />
<Compile Include="Rank.fs" />
<Compile Include="Main.fs" />
</ItemGroup>
<ItemGroup>
<EmbeddedResource Include="data/glm_test_lungcap.csv" CopyToOutputDirectory="PreserveNewest" />
<EmbeddedResource Include="data/glm_test_energy.csv" CopyToOutputDirectory="PreserveNewest" />
<EmbeddedResource Include="data/glm_test_cheese.csv" CopyToOutputDirectory="PreserveNewest" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix these indentations

Comment on lines 269 to 277
/// Solves a linear system of equations using QR decomposition.
///
/// Parameters:
/// - A: The coefficient matrix of the linear system.
/// - t: The target vector of the linear system.
///
/// Returns:
/// - mX: The solution vector of the linear system.
/// - r: The upper triangular matrix obtained from QR decomposition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!
I'm going to merge this PR if you just could add proper XML docs like seen in:

/// <summary>
/// Calculates the slope of a regression line through the origin
/// </summary>
/// <param name="xData">vector of x values</param>
/// <param name="yData">vector of y values</param>
/// <returns>Slope of the regression line through the origin</returns>
/// <example>
/// <code>
/// // e.g. days since a certain event
/// let xData = vector [|0.;1.;2.;3.;4.;5.;|]
/// // some measured feature, that theoretically is zero at day 0
/// let yData = vector [|1.;5.;9.;13.;17.;18.;|]
///
/// // Estimate the slope of the regression line.
/// let coefficients =
/// LinearRegression.OLS.Linear.RTO.fit xData yData
/// </code>
/// </example>

The same for line 222 f.

Updated XML documentation for qrAlternative and solveLinearQR
@bvenn bvenn merged commit 57f8939 into fslaborg:developer Oct 24, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants