JuliaAI · ablaom · Feb 23, 2024
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,10 @@
-*DS_Store
 Manifest.toml
+.ipynb_checkpoints
+*~
+\#*
+.DS_Store
+sandbox/
+/docs/build/
+/docs/site/
+/docs/Manifest.toml
+.vscode
diff --git a/README.md b/README.md
@@ -9,16 +9,11 @@ machine learning models into
 | :-----------: | :------: |
 | [![Build Status](https://github.com/JuliaAI/MLJModelInterface.jl/workflows/CI/badge.svg)](https://github.com/JuliaAI/MLJModelInterface.jl/actions) | [![codecov.io](http://codecov.io/github/JuliaAI/MLJModelInterface.jl/coverage.svg?branch=master)](http://codecov.io/github/JuliaAI/MLJModelInterface.jl?branch=master) |
 
+[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://juliaai.github.io/MLJModelInterface.jl/stable/)
 
-[MLJ](https://github.com/alan-turing-institute/MLJ.jl) is a framework
-for evaluating, combining and optimizing machine learning models in
-Julia. A third party package wanting to integrate their supervised or
-unsupervised machine learning models must import the module
-`MLJModelInterface` defined in this package. 
 
-### Instructions
-
-- [Quick-start guide](https://alan-turing-institute.github.io/MLJ.jl/dev/quick_start_guide_to_adding_models/) to adding models to MLJ
-
-- [Detailed API
-  specification](https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/)
+[MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/) is a framework for evaluating,
+combining and optimizing machine learning models in Julia. A third party package wanting
+to integrate their machine learning models into MLJ must import the module
+`MLJModelInterface` defined in this package, as described in the
+[documentation](https://juliaai.github.io/MLJModelInterface.jl/stable/).
diff --git a/docs/Project.toml b/docs/Project.toml
@@ -0,0 +1,3 @@
+[deps]
+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
diff --git a/docs/make.jl b/docs/make.jl
@@ -0,0 +1,46 @@
+using Documenter
+using MLJModelInterface
+import MLJModelInterface as MMI
+
+makedocs(;
+         modules=[MLJModelInterface, ],
+         format=Documenter.HTML(),
+         pages=[
+             "Home" => "index.md",
+             "Quick-start guide" => "quick_start_guide.md",
+             "The model type hierarchy" => "the_model_type_hierarchy.md",
+             "New model type declarations" => "type_declarations.md",
+             "Supervised models" => "supervised_models.md",
+             "Summary of methods" => "summary_of_methods.md",
+             "The form of data for fitting and predicting" => "form_of_data.md",
+             "The fit method" => "the_fit_method.md",
+             "The fitted_params method" => "the_fitted_params_method.md",
+             "The predict method" => "the_predict_method.md",
+             "The predict_joint method" => "the_predict_joint_method.md",
+             "Training losses" => "training_losses.md",
+             "Feature importances" =>  "feature_importances.md",
+             "Trait declarations" => "trait_declarations.md",
+             "Iterative models and the update! method" => "iterative_models.md",
+             "Implementing a data front end" => "implementing_a_data_front_end.md",
+             "Supervised models with a transform method" =>
+                 "supervised_models_with_transform.md",
+             "Models that learn a probability distribution" => "fitting_distributions.md",
+             "Serialization" => "serialization.md",
+             "Document strings" => "document_strings.md",
+             "Unsupervised models" => "unsupervised_models.md",
+             "Static models" => "static_models.md",
+             "Outlier detection models" => "outlier_detection_models.md",
+             "Convenience methods" => "convenience_methods.md",
+             "Where to place code implementing new models" => "where_to_put_code.md",
+             "How to add models to the MLJ Model Registry" => "how_to_register.md",
+             "Reference" => "reference.md",
+         ],
+         sitename="MLJModelInterface",
+         warnonly = [:cross_references, :missing_docs],
+)
+
+deploydocs(
+    repo = "github.com/JuliaAI/MLJModelInterface.jl",
+    devbranch="dev",
+    push_preview=false,
+)
diff --git a/docs/src/convenience_methods.md b/docs/src/convenience_methods.md
@@ -0,0 +1,15 @@
+# Convenience methods
+
+```@docs; canonical=false
+MMI.table
+MMI.matrix
+MMI.int
+MMI.UnivariateFinite
+MMI.classes
+MMI.decoder
+MMI.select
+MMI.selectrows
+MMI.selectcols
+```
+
+
diff --git a/docs/src/document_strings.md b/docs/src/document_strings.md
@@ -0,0 +1,52 @@
+# Document strings
+
+To be registered, MLJ models must include a detailed document string
+for the model type, and this must conform to the standard outlined
+below. We recommend you simply adapt an existing compliant document
+string, consulting the requirements below if you're unsure, or using
+them as a checklist. Here are examples of compliant doc-strings (go to the
+end of the linked files):
+
+- Regular supervised models (classifiers and regressors): [MLJDecisionTreeInterface.jl](https://github.com/JuliaAI/MLJDecisionTreeInterface.jl/blob/master/src/MLJDecisionTreeInterface.jl) (see the end of the file)
+
+- Transformers: [MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/blob/dev/src/builtins/Transformers.jl)
+
+A utility function is available for generating a standardized header
+for your doc-strings (but you provide most detail by hand):
+
+```@docs
+MLJModelInterface.doc_header
+```
+
+## The document string standard
+
+Your document string must include the following components, in order:
+
+- A *header*, closely matching the example given above.
+
+- A *reference describing the algorithm* or an actual description of
+  the algorithm, if necessary. Detail any non-standard aspects of the
+  implementation. Generally, defer details on the role of
+  hyperparameters to the "Hyperparameters" section (see below).
+
+- Instructions on *how to import the model type* from MLJ (because a user can already inspect the doc-string in the Model Registry, without having loaded the code-providing package).
+
+- Instructions on *how to instantiate* with default hyperparameters or with keywords.
+
+- A *Training data* section: explains how to bind a model to data in a machine with all possible signatures (e.g., `machine(model, X, y)` but also `machine(model, X, y, w)` if, say, weights are supported); the role and scitype requirements for each data argument should be itemized.
+
+- Instructions on *how to fit* the machine (in the same section).
+
+- A *Hyperparameters* section (unless there aren't any): an itemized list of the parameters, with defaults given.
+
+- An *Operations* section: each implemented operation (`predict`, `predict_mode`, `transform`, `inverse_transform`, etc.) is itemized and explained. This should include operations with no data arguments, such as `training_losses` and `feature_importances`.
+
+- A *Fitted parameters* section: explains what is returned by `fitted_params(mach)` (the same as `MLJModelInterface.fitted_params(model, fitresult)`; see later), with the fields of that named tuple itemized.
+
+- A *Report* section (if `report` is non-empty): explains what, if anything, is included in `report(mach)` (the same as the `report` return value of `MLJModelInterface.fit`), with the fields itemized.
+
+- An optional but highly recommended *Examples* section, which includes MLJ examples, but which could also include others if the model type also implements a second "local" interface, i.e., defined in the same module. (Note that each module referring to a type can declare separate doc-strings which appear concatenated in doc-string queries.)
+
+- A closing *"See also"* sentence which includes a `@ref` link to the raw model type (if you are wrapping one).
+
+
diff --git a/docs/src/feature_importances.md b/docs/src/feature_importances.md
@@ -0,0 +1,7 @@
+# Feature importances
+
+```@docs; canonical=false
+MLJModelInterface.feature_importances
+```
+
+Trait values can also be set using the `metadata_model` method; see below.
diff --git a/docs/src/fitting_distributions.md b/docs/src/fitting_distributions.md
@@ -0,0 +1,20 @@
+# Models that learn a probability distribution
+
+
+!!! warning "Experimental"
+
+	The following API is experimental. It is subject to breaking changes during minor or major releases without warning. Models implementing this interface will not work with MLJBase versions earlier than 0.17.5.
+
+Models that fit a probability distribution to some `data` should be
+regarded as `Probabilistic <: Supervised` models with target `y = data`
+and `X = nothing`.
+
+The `predict` method should return a single distribution.
+
+A working implementation of a model that fits a `UnivariateFinite`
+distribution to some categorical data using [Laplace
+smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
+controlled by a hyperparameter `alpha` is given
+[here](https://github.com/JuliaAI/MLJBase.jl/blob/d377bee1198ec179a4ade191c11fef583854af4a/test/interface/model_api.jl#L36).
+
+
diff --git a/docs/src/form_of_data.md b/docs/src/form_of_data.md
@@ -0,0 +1,47 @@
+# The form of data for fitting and predicting
+
+The model implementer does not have absolute control over the types of
+data `X`, `y` and `Xnew` appearing in the `fit` and `predict` methods
+they must implement. Rather, they can specify the *scientific type* of
+this data by making appropriate declarations of the traits
+`input_scitype` and `target_scitype` discussed later under [Trait
+declarations](@ref).
+
+*Important Note.* Unless it genuinely makes little sense to do so, the
+MLJ recommendation is to specify a `Table` scientific type for `X`
+(and hence `Xnew`) and an `AbstractVector` scientific type (e.g.,
+`AbstractVector{Continuous}`) for targets `y`. Algorithms requiring
+matrix input can coerce their inputs appropriately; see below.
+
+
+## Additional type coercions
+
+If the core algorithm being wrapped requires data in a different or
+more specific form, then `fit` will need to coerce the table into the
+form desired (and the same coercions applied to `X` will have to be
+repeated for `Xnew` in `predict`). To assist with common cases, MLJ
+provides the convenience method
+[`MMI.matrix`](@ref). `MMI.matrix(Xtable)` has type `Matrix{T}` where
+`T` is the tightest common type of elements of `Xtable`, and `Xtable`
+is any table. (If `Xtable` is itself just a wrapped matrix,
+`Xtable=Tables.table(A)`, then `A=MMI.matrix(Xtable)` will be returned
+without any copying.)
+
+Alternatively, a more performant option is to implement a data
+front-end for your model; see [Implementing a data front-end](@ref).
+
+Other auxiliary methods provided by MLJModelInterface for handling tabular data
+are: `selectrows`, `selectcols`, `select` and `schema` (for extracting
+the size, names and eltypes of a table's columns). See [Convenience
+methods](@ref) below for details.
+
+
+## Important convention
+
+It is to be understood that the columns of table `X` correspond to
+features and the rows to observations. So, for example, the predict
+method for a linear regression model might look like `predict(model,
+w, Xnew) = MMI.matrix(Xnew)*w`, where `w` is the vector of learned
+coefficients.
+
+
diff --git a/docs/src/how_to_register.md b/docs/src/how_to_register.md
@@ -0,0 +1,15 @@
+# How to add models to the MLJ model registry
+
+The MLJ model registry is located in the [MLJModels.jl
+repository](https://github.com/JuliaAI/MLJModels.jl). To
+add a model, follow these steps:
+
+- Ensure your model conforms to the interface defined above.
+
+- Raise an issue at
+  [MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/issues)
+  and point out where the MLJ-interface implementation is, e.g. by
+  providing a link to the code.
+
+- An administrator will then review your implementation and work with
+  you to add the model to the registry.
diff --git a/docs/src/implementing_a_data_front_end.md b/docs/src/implementing_a_data_front_end.md
@@ -0,0 +1,112 @@
+# Implementing a data front-end
+
+!!! note
+
+	It is suggested that packages implementing MLJ's model API, that later implement a data front-end, should tag their changes in a breaking release. (The changes will not break the use of models for the ordinary MLJ user, who interacts with models exclusively through the machine interface. However, it will break usage for some external packages that have chosen to depend directly on the model API.)
+
+```julia
+MLJModelInterface.reformat(model, args...) -> data
+MLJModelInterface.selectrows(::Model, I, data...) -> sampled_data
+```
+
+Models optionally overload `reformat` to define transformations of
+user-supplied data into some model-specific representation (e.g., from
+a table to a matrix). Computational overheads associated with multiple
+`fit!`/`predict`/`transform` calls (on MLJ machines) are then avoided
+when memory resources allow. The fallback returns `args` (no
+transformation).
+
+The `selectrows(::Model, I, data...)` method is overloaded to specify
+how the model-specific data is to be subsampled, for some observation
+indices `I` (a colon, `:`, or instance of
+`AbstractVector{<:Integer}`). In this way, implementing a data
+front-end also allows more efficient resampling of data (in user calls
+to `evaluate!`).
+
+After detailing formal requirements for implementing a data front-end,
+we give a [Sample implementation](@ref). A simple [implementation](https://github.com/Evovest/EvoTrees.jl/blob/94b58faf3042009bd609c9a5155a2e95486c2f0e/src/MLJ.jl#L23)
+also appears in the EvoTrees.jl package.
+
+Here "user-supplied data" is what the MLJ user supplies when
+constructing a machine, as in `machine(model, args...)`, which
+coincides with the arguments expected by `fit(model, verbosity,
+args...)` when `reformat` is not overloaded.
+
+Overloading `reformat` is permitted for any `Model`
+subtype, except for subtypes of `Static`. Here is a complete list of
+responsibilities for such an implementation, for some
+`model::SomeModelType` (a sample implementation follows after):
+
+- A `reformat(model::SomeModelType, args...) -> data` method must be
+  implemented for each form of `args...` appearing in a valid machine
+  construction `machine(model, args...)` (there will be one for each
+  possible signature of `fit(::SomeModelType, ...)`).
+
+- Additionally, if not included above, there must be a single argument
+  form of reformat, `reformat(model::SomeModelType, arg) -> (data,)`,
+  serving as a data front-end for operations like `predict`. It must
+  always hold that `reformat(model, args...)[1] == reformat(model,
+  args[1])[1]`.
+
+The fallback is `reformat(model, args...) = args` (i.e., slurps provided data).
+
+*Important.* `reformat(model::SomeModelType, args...)` must always return a tuple, even if this has length one. The length of the tuple need not match `length(args)`.
+
+- `fit(model::SomeModelType, verbosity, data...)` should be
+  implemented as if `data` is the output of `reformat(model,
+  args...)`, where `args` is the data an MLJ user has bound to `model`
+  in some machine. The same applies to any overloading of `update`.
+
+- Each implemented operation, such as `predict` and `transform` - but
+  excluding `inverse_transform` - must be defined as if its data
+  arguments are `reformat`ed versions of user-supplied data. For
+  example, in the supervised case, `data_new` in
+  `predict(model::SomeModelType, fitresult, data_new)` is
+  `reformat(model, Xnew)`, where `Xnew` is the data provided by the MLJ
+  user in a call `predict(mach, Xnew)` (`mach.model == model`).
+
+- To specify how the model-specific representation of data is to be
+  resampled, implement `selectrows(model::SomeModelType, I, data...)
+  -> resampled_data` for each overloading of `reformat(model::SomeModelType,
+  args...) -> data` above. Here `I` is an arbitrary abstract integer
+  vector or `:` (type `Colon`).
+
+*Important.* `selectrows(model::SomeModelType, I, args...)` must always
+return a tuple of the same length as `args`, even if this is one.
+
+The fallback for `selectrows` is described at [`selectrows`](@ref).
+
+
+## Sample implementation
+
+Suppose a supervised model type `SomeSupervised` supports sample
+weights, leading to two different `fit` signatures, and that it has a
+single operation `predict`:
+
+	fit(model::SomeSupervised, verbosity, X, y)
+	fit(model::SomeSupervised, verbosity, X, y, w)
+
+	predict(model::SomeSupervised, fitresult, Xnew)
+
+Without a data front-end implemented, suppose `X` is expected to be a
+table and `y` a vector, but suppose the core algorithm always converts
+`X` to a matrix with features as rows (each record corresponds to
+a column in the matrix).  Then a new data front-end might look like
+this:
+
+	const MMI = MLJModelInterface
+
+	# for fit:
+	MMI.reformat(::SomeSupervised, X, y) = (MMI.matrix(X)', y)
+	MMI.reformat(::SomeSupervised, X, y, w) = (MMI.matrix(X)', y, w)
+	MMI.selectrows(::SomeSupervised, I, Xmatrix, y) =
+		(view(Xmatrix, :, I), view(y, I))
+	MMI.selectrows(::SomeSupervised, I, Xmatrix, y, w) =
+		(view(Xmatrix, :, I), view(y, I), view(w, I))
+
+	# for predict:
+	MMI.reformat(::SomeSupervised, X) = (MMI.matrix(X)',)
+	MMI.selectrows(::SomeSupervised, I, Xmatrix) = (view(Xmatrix, :, I),)
+
+With these additions, `fit` and `predict` are refactored, so that `X`
+and `Xnew` represent matrices with features as rows.