Split up the package? #310

ararslan · 2017-07-22T07:02:07Z

This package has become quite large and is something of a heavyweight dependency for packages that just need a single type. An example of that is StatsBase, which only uses DataStructures for the arrays as heaps stuff. Indeed, just going through and trying to fix the deprecation warnings in this package for more recent Julia versions has proved quite cumbersome. It seems to me that for the sake of maintainability, the easily separable pieces of this package could be split off into separate packages, say like Heaps.jl, Deques.jl, etc. That way packages can just pull in the functionality they need. Thoughts?

yuyichao · 2017-07-22T13:17:16Z

just going through and trying to fix the deprecation warnings in this package for more recent Julia versions has proved quite cumbersome

What about it is cumbersome? Splitting part that doesn't depend on anything else might be ok though I don't think depwarn fix is a good reason for it and I don't see how splitting will help there. The depwarn fix now needs to be down on multiple packages rather than one.

ararslan · 2017-07-22T18:34:16Z

That wasn't really an argument for splitting up so much as just a comment on the fact that it's big enough that fixing depwarns is annoying to do all at once.

TotalVerb · 2017-09-12T01:30:33Z

I also think this package is unnecessarily large. It can reasonably be changed to simply re-export the exports of smaller packages which become dependencies, to maintain the current convenience while allowing more modular development

ararslan · 2017-09-12T01:40:08Z

Exactly what I was thinking, @TotalVerb.

oxinabox · 2017-10-20T07:30:26Z

Yes, please lets split it up.
It is so large.
I keep wanting to fix things, or refactor them, etc.
But the weight of all the files gets me down.
There is a lot of cruft in the code.
It has good test coverage, so refactoring and breaking down should be safe.
(Can keep a copy of all the tests in DataStructures.jl and then run them using the reexported ones until happy)

Breaking it into separate repos, will make it easier to decruft it.
One thing that can be noted is that it really does break up quiet distinctly
Some parts were clearly written by different people,
who (reasonably enough) did not touch the other parts.

Like All the Sorted stuff (see breakdown in next post) was written by @StephenVavasis .
All the Heap stuff was extracted from Base.

There are basically 5 packages living in here, all with there own core author list.
(and then the good work of later people maintaining them all)

I spent an hour or so checking and breaking down a potential split of the package.
I'ma put the list in a separate post.

oxinabox · 2017-10-20T07:44:30Z

This is just the files in /src/

Heaps.jl

Anything to do with heaps.
Lightweight package for people who just want the old priorityqueue from base,
or other heap related functions.

heaps
heaps.jl
priorityqueue.jl -- it depends on heap functions, so either it goes here or in LinearDataStructures.jl, which then gains a Heaps.jl Dependancy

SortedDataStructures.jl

Its basically data structures that have optimised performance because they
built on top of search-trees.
They should be in there own package as they take their own set of expertise to maintain.
Even though they overlap with the other packages I propose that are interface driven.

balanced_tree.jl
sorted_dict.jl
sorted_multi_dict.jl
sorted_set.jl
tokens.jl
tokens2.jl
container_loops.jl

SetDataStructures.jl

They all act like sets. (Fast element of-checking)
Many of the packages in this say that they will be better once AbstractSet is in base.
JuliaLang/julia#5533
So they all probably need a fair bit of cleanup now

ordered_set.jl
disjoint_set.jl
int_set.jl

AssociativeDataStructures.jl

default_dict.jl
accumulator.jl: On one interpretation a DefaultDict{K,<:Integer}, on the other a Bag/Multiset
multi_dict.jl: DefaultDict{K, Vector{T}}
classified_collections.jl -- this is a weird one. It is kinda the generalisation of Multi-dicts. Does anyone even love it?
- dict_sorting.jl: helpers for ordered_dict
ordered_dict.jl
trie.jl -- tries are associative data-structures. Though our current one is a bit limited in its functionality. If someone loved it, it could actually be its own package with all kinds of different tries. But no one really loves tries these days.
delegate.jl is used by ordered_set, multi_dict, and default_dict. The use in ordered_set is just for two simple single argument functions, that can be replaced with hand coding them.

LinearDataStructures.jl

list.jl : a linked lists not really like the others here.
deque.jl
queue.jl : Dequeue wrapper
stack.jl : Dequeue wrapper
circ_deque.jl : implements same interface as dequeue
circular_buffer.jl : Like circ_deque but instread of getting full it just overwrites old elements. Also doesn't implement same interface as deque.

These 5 packages are basically not interdependent.

ararslan · 2017-10-20T19:00:51Z

Good ideas. I'm not sold on the specific names but otherwise +1.

kmsquire · 2017-10-21T07:45:37Z

These 5 packages are basically not interdependent.

Mostly true, but note that an OrderedSet (listed in SetDataStructures.jl) uses the same underlying implementation as an OrderedDict (listed in AssociativeDataStructures.jl)

oxinabox · 2017-10-23T01:57:51Z

Mostly true, but note that an OrderedSet (listed in SetDataStructures.jl) uses the same underlying implementation as an OrderedDict (listed in AssociativeDataStructures.jl)

Ah yes, I missed that one.
Maybe merge AssociativeDataStructures.jl and SetDataStructures.jl,
and call it "MembershipDataStructures.jl"

@ararslan and I had a discussion on specific names the other day.
and concluder better names were *Collections.jl

PallHaraldsson · 2017-11-09T14:29:40Z

Pro for splitting may be maintainablility, but since not in Base, maybe for discorverability, having all the main data-structures in one place is good.

Just one question on it. For using (e.g. precompiling) is the size slowing down? I mean does Julia need to compile more than you actually use of a package? If it does now, can it be deferred in a future version of Julia?

oxinabox · 2017-11-17T04:47:36Z

Just one question on it. For using (e.g. precompiling) is the size slowing down? I mean does Julia need to compile more than you actually use of a package? If it dos now, can it be deferred in a future version of Julia?

Yes, and probably?
I mean turing complete languages can do anything.
But likely/easy? Less clear.

oxinabox · 2017-11-17T05:20:26Z

Ok,
here is my splitting plan:

Split according to the splits covered by #310 (comment)
except merge Associative and Set into a single Membership package.
using the names *Collections.jl

DataStructures.jl will remain as a MetaPackage that does all of them.

So the actual concrete plan:
Make 5 branches on this Remote, under split/subname (and split/combined for the new metapackage).
Use git history rewriting to delete all the files not (and prune the empty commits) related to the the subpackage in question.
That will speed up installs, even using Pkg2 of the new subpackages. Not doing that will lead the the suffering as per SciML/DifferentialEquations.jl#107.
It can't be done easily for the combined package under that current name (see aforementioned suffering thread)
The subpackage branchs are easy enough, they get there tests poked at and made sure they are passing CI here.

The branch with the combined meta package is a bit harder.
while setting stuff up (i.e. when ever-thing is just branches) it will need a custom build script to run the existing tests and by doing a Pkg.clone.("DataStructures", ["split/...", "split/...",...]).
Still shouldn't be too hard.

Then once that is all set up, everything can cruise along for a little while.
PRs can be merged both into current (real) master,
and also merged off into the branchs.

Once we are happy all is well,
we make one last tag of DataStructures.jl 0.x.y,

Then we push all the split branches of to their own remotes, and register them.
Once they are registered, we push the split/combined to DataStructures.jl master,
and we tag DataStructures.jl 1.0.0

We close all issues and PRs here.
and open links for the ones we care about in the new repos. ( @ararslan do you have a script for that? Something similar was done for SpecialFunctions.jl)

We then delete all the tests in DataStructures.jl, and maybe tag 1.0.1.
Which might make any shallow cloning Pkg3 based method able to install DataStructures.jl lightning fast, sine it will then be a <1kb repo.

Also somewhere in this plan needs to be sorting out docs.
I guess they should be split and DataStructures.jl's docs should basically (or even literally) just link to the subpackages docs.

So that is the plan.
Thoughts?

(@vanbujm this was the complex git game I was talking to you about a few weeks back)

oxinabox · 2017-12-06T09:51:17Z

This is waiting on #348 and #347
as those are the big PRs that are open at the moment that affect files that would be in more than one package after the split.

scls19fr · 2018-01-05T10:36:41Z

I wonder if splitting package shouldn't be done after ensuring that doctests are really run and are passing. See #345 and #361

oxinabox · 2018-01-05T10:50:47Z

No. I disagree.
There will always be "one last thing".
Right now there are no active PRs that are being worked on that hit files in multiple splits.

If it comes to it we can always move commits around later via the magic of git.
It is particular easy to do that if all the PRs are such that they actuallychan just be replayed onto one of the other split repos.

oxinabox · 2018-01-05T13:40:00Z

The first split package Heaps.jl is up at https://github.com/JuliaCollections/Heaps.jl
(locally for me it is a branch of DataStructures.jl)
I've not made a branch of DataStructures that actually depends on it anywhere yet,
but as a stand-alone package it is functional.

I am not porting anything deprecated across.

It is now clear to me we will either need a DataStructuresBase.jl
or to do manual method merging shenanigans in DataStructures.jl.
However either of those can happen later.

For my own reference the commands I need to remove things out of git history and compact it are:

git filter-branch --prune-empty --force --index-filter 'git rm --cached --ignore-unmatch  LISTFILESHERE' 

git filter-branch --prune-empty --parent-filter 'sed "s/-p //g" | xargs -r git show-branch --independent | sed "s/\</-p /g"'

rofinn · 2018-02-27T18:41:24Z

FWIW, I'd like to use a Trie for Memento.jl, but I don't really want to depend on DataStructures.jl. Would someone be willing to split that out into a separate package for me to work on (I'm guessing folks would like to keep it in the JuliaCollections org)?

iamed2 · 2018-02-27T19:40:04Z

@rofinn I can create that and add you to it

rofinn · 2018-02-27T19:42:29Z

Thanks

oxinabox · 2018-02-28T01:00:33Z

@rofinn sorry that I didn't get on with and finish the split.
Got back to uni and have been in solid deadlines ever since.

Do you think Tries are good separate from other MembershipCollections?
They might be, they do have other close relatives like suffex trees, and finite state acceptors.

rofinn · 2018-02-28T06:59:28Z

Would those implementations likely need to share code or define some common abstract parent type? IMHO, folks usually just want a specific data structure for their use case, so I'd only be in favour of that if it promoted some common API or avoided code duplication.

milesfrain · 2019-09-29T20:22:07Z

Another motivating reason for splitting packages is to save time on benchmarking.
For example, SparseIntSet #533 should not need to re-run Heap benchmarks. These wasted cycles will compound as more packages adopt benchmarking.
@oxinabox I'm happy to help with this reorganization effort. Is a good next step to create the necessary PRs to bring Heaps.jl up to date?

oxinabox · 2019-09-29T21:02:55Z

No Heaps.jl should just be recreated.
It is easier to just redo the history slicing.
I will delete that repo.

However, I have been thinking about this a bit more and I am no longer sure it is worth splitting up.
Pkg3 makes this much faster than insrallin used to be, which was part of the original goal.
Another part of the original goal was to have smaller chunks to try and document/update old cruft, but that is mostly done now.

milesfrain · 2019-09-29T22:26:13Z

That sounds reasonable. Shall we close this issue?

Regarding the benchmarking concern, we can do the following:

Split up benchmarks and allow them to be run individually (similar to how existing tests are organized) during testing. I'll tackle this next.
Encourage folks to keep the CI benchmarking suites short. Thinking 30 seconds maximum per datastructure to keep Travis time under ~20 minutes. There can still be longer suites available for more extensive analysis when running locally.

It would be really cool if PkgBenchmark could select the benchmarks to run based on code changes, but I don't believe that capability is on the roadmap.

oxinabox mentioned this issue Jan 5, 2018

Improve docstrings and doc #360

Open

IlyaOrson mentioned this issue Jun 12, 2018

Merge with DataStructures.jl ChrisRackauckas/LinkedLists.jl#2

Open

oxinabox mentioned this issue Jun 26, 2018

Splitting out OrderedCollections #392

Closed

oxinabox mentioned this issue Sep 24, 2018

Add MutableLinkedList #450

Merged

oxinabox mentioned this issue Jan 9, 2019

Maturing out the package (Path to DataStructures.jl v1.0) #479

Open

3 tasks

milesfrain mentioned this issue Sep 25, 2019

Use PkgBenchmark to detect performance regressions for heaps #531

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split up the package? #310

Split up the package? #310

ararslan commented Jul 22, 2017

yuyichao commented Jul 22, 2017

ararslan commented Jul 22, 2017

TotalVerb commented Sep 12, 2017

ararslan commented Sep 12, 2017

oxinabox commented Oct 20, 2017 •

edited

Loading

oxinabox commented Oct 20, 2017 •

edited

Loading

ararslan commented Oct 20, 2017

kmsquire commented Oct 21, 2017 •

edited

Loading

oxinabox commented Oct 23, 2017

PallHaraldsson commented Nov 9, 2017 •

edited

Loading

oxinabox commented Nov 17, 2017 •

edited

Loading

oxinabox commented Nov 17, 2017

oxinabox commented Dec 6, 2017

scls19fr commented Jan 5, 2018 •

edited

Loading

oxinabox commented Jan 5, 2018 •

edited

Loading

oxinabox commented Jan 5, 2018

rofinn commented Feb 27, 2018

iamed2 commented Feb 27, 2018

rofinn commented Feb 27, 2018

oxinabox commented Feb 28, 2018

rofinn commented Feb 28, 2018 •

edited

Loading

milesfrain commented Sep 29, 2019

oxinabox commented Sep 29, 2019

milesfrain commented Sep 29, 2019

Split up the package? #310

Split up the package? #310

Comments

ararslan commented Jul 22, 2017

yuyichao commented Jul 22, 2017

ararslan commented Jul 22, 2017

TotalVerb commented Sep 12, 2017

ararslan commented Sep 12, 2017

oxinabox commented Oct 20, 2017 • edited Loading

oxinabox commented Oct 20, 2017 • edited Loading

Heaps.jl

SortedDataStructures.jl

SetDataStructures.jl

AssociativeDataStructures.jl

LinearDataStructures.jl

ararslan commented Oct 20, 2017

kmsquire commented Oct 21, 2017 • edited Loading

oxinabox commented Oct 23, 2017

PallHaraldsson commented Nov 9, 2017 • edited Loading

oxinabox commented Nov 17, 2017 • edited Loading

oxinabox commented Nov 17, 2017

oxinabox commented Dec 6, 2017

scls19fr commented Jan 5, 2018 • edited Loading

oxinabox commented Jan 5, 2018 • edited Loading

oxinabox commented Jan 5, 2018

rofinn commented Feb 27, 2018

iamed2 commented Feb 27, 2018

rofinn commented Feb 27, 2018

oxinabox commented Feb 28, 2018

rofinn commented Feb 28, 2018 • edited Loading

milesfrain commented Sep 29, 2019

oxinabox commented Sep 29, 2019

milesfrain commented Sep 29, 2019

oxinabox commented Oct 20, 2017 •

edited

Loading

oxinabox commented Oct 20, 2017 •

edited

Loading

kmsquire commented Oct 21, 2017 •

edited

Loading

PallHaraldsson commented Nov 9, 2017 •

edited

Loading

oxinabox commented Nov 17, 2017 •

edited

Loading

scls19fr commented Jan 5, 2018 •

edited

Loading

oxinabox commented Jan 5, 2018 •

edited

Loading

rofinn commented Feb 28, 2018 •

edited

Loading