Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split up the package? #310

Open
ararslan opened this issue Jul 22, 2017 · 24 comments
Open

Split up the package? #310

ararslan opened this issue Jul 22, 2017 · 24 comments

Comments

@ararslan
Copy link
Member

This package has become quite large and is something of a heavyweight dependency for packages that just need a single type. An example of that is StatsBase, which only uses DataStructures for the arrays as heaps stuff. Indeed, just going through and trying to fix the deprecation warnings in this package for more recent Julia versions has proved quite cumbersome. It seems to me that for the sake of maintainability, the easily separable pieces of this package could be split off into separate packages, say like Heaps.jl, Deques.jl, etc. That way packages can just pull in the functionality they need. Thoughts?

@yuyichao
Copy link
Contributor

just going through and trying to fix the deprecation warnings in this package for more recent Julia versions has proved quite cumbersome

What about it is cumbersome? Splitting part that doesn't depend on anything else might be ok though I don't think depwarn fix is a good reason for it and I don't see how splitting will help there. The depwarn fix now needs to be down on multiple packages rather than one.

@ararslan
Copy link
Member Author

That wasn't really an argument for splitting up so much as just a comment on the fact that it's big enough that fixing depwarns is annoying to do all at once.

@TotalVerb
Copy link

I also think this package is unnecessarily large. It can reasonably be changed to simply re-export the exports of smaller packages which become dependencies, to maintain the current convenience while allowing more modular development

@ararslan
Copy link
Member Author

Exactly what I was thinking, @TotalVerb.

@oxinabox
Copy link
Member

oxinabox commented Oct 20, 2017

Yes, please lets split it up.
It is so large.
I keep wanting to fix things, or refactor them, etc.
But the weight of all the files gets me down.
There is a lot of cruft in the code.
It has good test coverage, so refactoring and breaking down should be safe.
(Can keep a copy of all the tests in DataStructures.jl and then run them using the reexported ones until happy)

Breaking it into separate repos, will make it easier to decruft it.
One thing that can be noted is that it really does break up quiet distinctly
Some parts were clearly written by different people,
who (reasonably enough) did not touch the other parts.

Like All the Sorted stuff (see breakdown in next post) was written by @StephenVavasis .
All the Heap stuff was extracted from Base.

There are basically 5 packages living in here, all with there own core author list.
(and then the good work of later people maintaining them all)

I spent an hour or so checking and breaking down a potential split of the package.
I'ma put the list in a separate post.

@oxinabox
Copy link
Member

oxinabox commented Oct 20, 2017

This is just the files in /src/

Heaps.jl

Anything to do with heaps.
Lightweight package for people who just want the old priorityqueue from base,
or other heap related functions.

  • heaps
  • heaps.jl
  • priorityqueue.jl -- it depends on heap functions, so either it goes here or in LinearDataStructures.jl, which then gains a Heaps.jl Dependancy

SortedDataStructures.jl

Its basically data structures that have optimised performance because they
built on top of search-trees.
They should be in there own package as they take their own set of expertise to maintain.
Even though they overlap with the other packages I propose that are interface driven.

  • balanced_tree.jl
  • sorted_dict.jl
  • sorted_multi_dict.jl
  • sorted_set.jl
  • tokens.jl
  • tokens2.jl
  • container_loops.jl

SetDataStructures.jl

They all act like sets. (Fast element of-checking)
Many of the packages in this say that they will be better once AbstractSet is in base.
JuliaLang/julia#5533
So they all probably need a fair bit of cleanup now

  • ordered_set.jl
  • disjoint_set.jl
  • int_set.jl

AssociativeDataStructures.jl

  • default_dict.jl
  • accumulator.jl: On one interpretation a DefaultDict{K,<:Integer}, on the other a Bag/Multiset
  • multi_dict.jl: DefaultDict{K, Vector{T}}
  • classified_collections.jl -- this is a weird one. It is kinda the generalisation of Multi-dicts. Does anyone even love it?
    - dict_sorting.jl: helpers for ordered_dict
  • ordered_dict.jl
  • trie.jl -- tries are associative data-structures. Though our current one is a bit limited in its functionality. If someone loved it, it could actually be its own package with all kinds of different tries. But no one really loves tries these days.
  • delegate.jl  is used by ordered_set, multi_dict, and default_dict. The use in ordered_set is just for two simple single argument functions, that can be replaced with hand coding them.

LinearDataStructures.jl

  • list.jl : a linked lists not really like the others here.
  • deque.jl
  • queue.jl : Dequeue wrapper
  • stack.jl : Dequeue wrapper
  • circ_deque.jl : implements same interface as dequeue
  • circular_buffer.jl : Like circ_deque but instread of getting full it just overwrites old elements. Also doesn't implement same interface as deque.

These 5 packages are basically not interdependent.

@ararslan
Copy link
Member Author

Good ideas. I'm not sold on the specific names but otherwise +1.

@kmsquire
Copy link
Member

kmsquire commented Oct 21, 2017

These 5 packages are basically not interdependent.

Mostly true, but note that an OrderedSet (listed in SetDataStructures.jl) uses the same underlying implementation as an OrderedDict (listed in AssociativeDataStructures.jl)

@oxinabox
Copy link
Member

Mostly true, but note that an OrderedSet (listed in SetDataStructures.jl) uses the same underlying implementation as an OrderedDict (listed in AssociativeDataStructures.jl)

Ah yes, I missed that one.
Maybe merge AssociativeDataStructures.jl and SetDataStructures.jl,
and call it "MembershipDataStructures.jl"

@ararslan and I had a discussion on specific names the other day.
and concluder better names were *Collections.jl

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Nov 9, 2017

Pro for splitting may be maintainablility, but since not in Base, maybe for discorverability, having all the main data-structures in one place is good.

Just one question on it. For using (e.g. precompiling) is the size slowing down? I mean does Julia need to compile more than you actually use of a package? If it does now, can it be deferred in a future version of Julia?

@oxinabox
Copy link
Member

oxinabox commented Nov 17, 2017

Just one question on it. For using (e.g. precompiling) is the size slowing down? I mean does Julia need to compile more than you actually use of a package? If it dos now, can it be deferred in a future version of Julia?

Yes, and probably?
I mean turing complete languages can do anything.
But likely/easy? Less clear.

@oxinabox
Copy link
Member

Ok,
here is my splitting plan:

Split according to the splits covered by #310 (comment)
except merge Associative and Set into a single Membership package.
using the names *Collections.jl

DataStructures.jl will remain as a MetaPackage that does all of them.

So the actual concrete plan:
Make 5 branches on this Remote, under split/subname (and split/combined for the new metapackage).
Use git history rewriting to delete all the files not (and prune the empty commits) related to the the subpackage in question.
That will speed up installs, even using Pkg2 of the new subpackages. Not doing that will lead the the suffering as per SciML/DifferentialEquations.jl#107.
It can't be done easily for the combined package under that current name (see aforementioned suffering thread)
The subpackage branchs are easy enough, they get there tests poked at and made sure they are passing CI here.

The branch with the combined meta package is a bit harder.
while setting stuff up (i.e. when ever-thing is just branches) it will need a custom build script to run the existing tests and by doing a Pkg.clone.("DataStructures", ["split/...", "split/...",...]).
Still shouldn't be too hard.

Then once that is all set up, everything can cruise along for a little while.
PRs can be merged both into current (real) master,
and also merged off into the branchs.

Once we are happy all is well,
we make one last tag of DataStructures.jl 0.x.y,

Then we push all the split branches of to their own remotes, and register them.
Once they are registered, we push the split/combined to DataStructures.jl master,
and we tag DataStructures.jl 1.0.0

We close all issues and PRs here.
and open links for the ones we care about in the new repos. ( @ararslan do you have a script for that? Something similar was done for SpecialFunctions.jl)

We then delete all the tests in DataStructures.jl, and maybe tag 1.0.1.
Which might make any shallow cloning Pkg3 based method able to install DataStructures.jl lightning fast, sine it will then be a <1kb repo.

Also somewhere in this plan needs to be sorting out docs.
I guess they should be split and DataStructures.jl's docs should basically (or even literally) just link to the subpackages docs.

So that is the plan.
Thoughts?

(@vanbujm this was the complex git game I was talking to you about a few weeks back)

@oxinabox
Copy link
Member

oxinabox commented Dec 6, 2017

This is waiting on #348 and #347
as those are the big PRs that are open at the moment that affect files that would be in more than one package after the split.

@scls19fr
Copy link
Member

scls19fr commented Jan 5, 2018

I wonder if splitting package shouldn't be done after ensuring that doctests are really run and are passing. See #345 and #361

@oxinabox
Copy link
Member

oxinabox commented Jan 5, 2018

No. I disagree.
There will always be "one last thing".
Right now there are no active PRs that are being worked on that hit files in multiple splits.

If it comes to it we can always move commits around later via the magic of git.
It is particular easy to do that if all the PRs are such that they actuallychan just be replayed onto one of the other split repos.

@oxinabox
Copy link
Member

oxinabox commented Jan 5, 2018

The first split package Heaps.jl is up at https://github.com/JuliaCollections/Heaps.jl
(locally for me it is a branch of DataStructures.jl)
I've not made a branch of DataStructures that actually depends on it anywhere yet,
but as a stand-alone package it is functional.

I am not porting anything deprecated across.

It is now clear to me we will either need a DataStructuresBase.jl
or to do manual method merging shenanigans in DataStructures.jl.
However either of those can happen later.

For my own reference the commands I need to remove things out of git history and compact it are:

git filter-branch --prune-empty --force --index-filter 'git rm --cached --ignore-unmatch  LISTFILESHERE' 

git filter-branch --prune-empty --parent-filter 'sed "s/-p //g" | xargs -r git show-branch --independent | sed "s/\</-p /g"'

@rofinn
Copy link

rofinn commented Feb 27, 2018

FWIW, I'd like to use a Trie for Memento.jl, but I don't really want to depend on DataStructures.jl. Would someone be willing to split that out into a separate package for me to work on (I'm guessing folks would like to keep it in the JuliaCollections org)?

@iamed2
Copy link
Contributor

iamed2 commented Feb 27, 2018

@rofinn I can create that and add you to it

@rofinn
Copy link

rofinn commented Feb 27, 2018

Thanks

@oxinabox
Copy link
Member

@rofinn sorry that I didn't get on with and finish the split.
Got back to uni and have been in solid deadlines ever since.

Do you think Tries are good separate from other MembershipCollections?
They might be, they do have other close relatives like suffex trees, and finite state acceptors.

@rofinn
Copy link

rofinn commented Feb 28, 2018

Would those implementations likely need to share code or define some common abstract parent type? IMHO, folks usually just want a specific data structure for their use case, so I'd only be in favour of that if it promoted some common API or avoided code duplication.

@milesfrain
Copy link
Contributor

Another motivating reason for splitting packages is to save time on benchmarking.
For example, SparseIntSet #533 should not need to re-run Heap benchmarks. These wasted cycles will compound as more packages adopt benchmarking.
@oxinabox I'm happy to help with this reorganization effort. Is a good next step to create the necessary PRs to bring Heaps.jl up to date?

@oxinabox
Copy link
Member

No Heaps.jl should just be recreated.
It is easier to just redo the history slicing.
I will delete that repo.

However, I have been thinking about this a bit more and I am no longer sure it is worth splitting up.
Pkg3 makes this much faster than insrallin used to be, which was part of the original goal.
Another part of the original goal was to have smaller chunks to try and document/update old cruft, but that is mostly done now.

@milesfrain
Copy link
Contributor

That sounds reasonable. Shall we close this issue?

Regarding the benchmarking concern, we can do the following:

  • Split up benchmarks and allow them to be run individually (similar to how existing tests are organized) during testing. I'll tackle this next.
  • Encourage folks to keep the CI benchmarking suites short. Thinking 30 seconds maximum per datastructure to keep Travis time under ~20 minutes. There can still be longer suites available for more extensive analysis when running locally.

It would be really cool if PkgBenchmark could select the benchmarks to run based on code changes, but I don't believe that capability is on the roadmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants