DataFrames dependency #347

Closed
andyferris opened this issue Nov 8, 2018 · 4 comments

@andyferris
Member

andyferris commented Nov 8, 2018

I was wondering if the direct dependency on DataFrames was strictly necessary, or if the Tables.jl interface could make it completely unnecessary at some point?

I'm not against sensible defaults (and this choice may be the most convenient for the vast majority of users) but this is more of a query about composable design as well as letting users create (and package) software with reduced dependencies.

I'm not sure how to make this work. Plots.jl has an interface for specifying the current default backend. Requires.jl lets you specify stuff when e.g. DataFrames is also loaded. We'd need to aim for people's REPL sessions to be intuitive and user-friendly, and yet allow our "applications" to be fully specified in their behavior (in this case that might mean explicitly defining the sink type on all reads or directly using CSV.File).

@Nosferican
Contributor

I am in favor of dropping the dependency. When the result has to be "collected" from a suitable stream into a table and is not sent to a sink, the default could just be a Vector{NamedTuple}.
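To make the Vector{NamedTuple} idea concrete, here is a minimal sketch that uses only Base. The function name `read_rows` is hypothetical and is not part of CSV.jl. It ignores quoting, escaping, and type detection, so it only illustrates the shape of the default result, not how CSV.jl parses files.

```julia
# Hypothetical helper, NOT part of CSV.jl: turn a tiny comma-separated
# string into a Vector of NamedTuples without any package dependency.
function read_rows(text::AbstractString)
    lines = split(strip(text), '\n')
    # The first line supplies the column names.
    header = Tuple(Symbol.(strip.(split(lines[1], ','))))
    # Each remaining line becomes one NamedTuple; all fields stay Strings.
    return map(lines[2:end]) do line
        fields = Tuple(String.(strip.(split(line, ','))))
        NamedTuple{header}(fields)
    end
end

rows = read_rows("a,b\n1,x\n2,y")
# rows[1].a == "1" and rows[2].b == "y"
```

As far as I know, Tables.jl already treats a vector of NamedTuples as a row table, so a result of this shape could still be handed to any Tables.jl-compatible sink afterwards.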

@quinnj
Member

quinnj commented Aug 22, 2019

Closing for now as we'll keep the DataFrames dependency for the foreseeable future. I think a long-term master plan would be to replace the DelimitedFiles stdlib w/ the core machinery in this package, and that would involve decoupling from DataFrames.

@quinnj quinnj closed this as completed Aug 22, 2019
@quinnj
Member

quinnj commented Aug 22, 2019

Oh, the other thing I remember was a great conversation I had with @KristofferC and @StefanKarpinski at JuliaCon about conditional/optional dependency support in Pkg.jl. If/when we get that, it would also be pretty easy to decouple CSV.jl and DataFrames.jl.

@kdheepak

Would you be open to considering a PR that adds a dependency on Requires and makes the dependency on DataFrames lazy? Loading DataFrames with using DataFrames takes a while [1], and it would be great to make load times faster when the rich features of DataFrames aren't being used.

[1] - 2.3 GHz Intel Core i7 MacOSX

(v1.3) pkg> add DataFrames
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/.julia/environments/v1.3/Project.toml`
  [a93c6f00] + DataFrames v0.19.4
  Updating `~/.julia/environments/v1.3/Manifest.toml`
  [324d7699] + CategoricalArrays v0.7.1
  [a93c6f00] + DataFrames v0.19.4
  [41ab1584] + InvertedIndices v1.0.0
  [e1d29d7a] + Missings v0.4.3
  [2dfb63ee] + PooledArrays v0.5.2
  [a2af1166] + SortingAlgorithms v0.3.1
  [9fa8497b] + Future

julia> @time using DataFrames
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
 20.760481 seconds (1.97 M allocations: 113.364 MiB, 0.06% gc time)

julia> @time using DataFrames
  1.675493 seconds (3.06 M allocations: 179.430 MiB, 2.88% gc time)

julia> exit()

julia> @time using DataFrames
  0.000357 seconds (289 allocations: 15.797 KiB)

julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.0-rc4.1 (2019-10-15)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time using DataFrames
  0.973533 seconds (1.24 M allocations: 79.475 MiB)

julia> @time using DataFrames
  1.610228 seconds (3.04 M allocations: 178.733 MiB, 2.96% gc time)

julia> @time using DataFrames
  0.000289 seconds (289 allocations: 15.797 KiB)

julia>
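For reference, the lazy-dependency idea could look roughly like the sketch below. It assumes Requires.jl is installed. The module name `LazyCSVSketch` and the helper `to_dataframe` are hypothetical and do not exist in CSV.jl. The UUID is the one shown in the precompile message in the log above.

```julia
module LazyCSVSketch  # hypothetical module name, for illustration only

using Requires

function __init__()
    # The body runs only once the user loads DataFrames in the same session;
    # until then DataFrames is never loaded by this module.
    @require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" begin
        # Hypothetical glue function, defined lazily inside this module.
        to_dataframe(rows) = DataFrames.DataFrame(rows)
    end
end

end # module
```

With this layout, loading the sketch module alone does not pay the DataFrames load time; the glue code becomes available only after the user runs using DataFrames themselves.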

Alternatively, would you consider making a separate package (e.g. CSVBase.jl) that holds the core parsing functionality that is currently here, with CSV.jl being the wrapper package that provides the interface and depends on DataFrames.jl and other packages?
