[RFC] v1 Roadmap #51

emjun · 2020-04-28T00:46:45Z

Here is a proposal for the next batch of changes I'd like to tackle and merge/replace what we currently have. Some of these are logical extensions of changes we've been making to the code base already on the Master branch and other branches. The extent of these changes warrants a new version.

Key changes:
System code

Create a more object-oriented API (see example here)
- Started to sketch out on this branch: (See tests for more example Tea programs with this new API, but it looks like this example gist).
- Main changes include collapsing interval and ratio variable types into a unified "numeric" type; two ways to instantiate a Tea object (either through an object constructor or a series of function calls, similar to how it is now)
- This is consistent with recent changes/PRs for VarData, StudyDeterminer
Modularize hypothesis grammar so that it's easy to extend and add new operators
- Depending on hypothesis, should "navigate" to appropriate analysis/test selection module (Null Hypothesis Significance Testing, Linear modeling, etc.)
Modularize constraint specification to (eventually) allow for easier customization of constraints
Refactor testing infrastructure (@MaLiN2223 already started this!)
Add more tests to improve test coverage (@MaLiN2223 already started this!)
Make syntactic checking for Tea programs more systematic
Minimize/make consistent the usage of global values
Redefine and change strict/relaxed modes into (1) "strict" (if user assumptions are not met, halt the program); (2) "infer" (let Tea determine properties of the data and override the user's assumptions); (3) "relaxed" (assume user statements are true)
- May want to rename these modes
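
A minimal sketch of how the three proposed modes could differ when a single user assumption is checked against the data. The names `Mode` and `resolve_assumption` are hypothetical, chosen only for illustration; they are not part of Tea's current code.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical names for the three proposed modes."""
    STRICT = "strict"    # halt if a user assumption is not met
    INFER = "infer"      # let the data override the user's assumption
    RELAXED = "relaxed"  # take the user's assumption as true


def resolve_assumption(mode: Mode, user_says: bool, data_says: bool) -> bool:
    """Return the value of a property that the analysis should proceed with.

    user_says: what the user asserted about the data.
    data_says: what a check on the data found for the same property.
    """
    if mode is Mode.RELAXED:
        return user_says
    if mode is Mode.INFER:
        return data_says
    # STRICT: the assertion must agree with the data, otherwise stop.
    if user_says != data_says:
        raise ValueError("User assumption is not supported by the data")
    return user_says
```

Using an enum here rather than a string or a global flag means an unsupported mode name fails immediately instead of silently falling through.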

Documentation

Better/more complete documentation for how to contribute
All new features will need updated documentation

Proposed timeline: Finish by end of July

Points of feedback:
Code

Thoughts on the above?
Anything else to add? remove?
Should we create a new type of clause for data transformations, rather than allowing them to remain in assumptions? (Might want to hold off on this change until we have some more end-user feedback/research)

Social/Organization

Should we move Tea-lang to be an org on Github? Pros: Give contributors more ownership and agency. More organization to new related projects. Cons: Might not be necessary at this stage?

@MaLiN2223 @meli365 @rjust @jheer @emeryberger

meli365 · 2020-04-29T01:25:28Z

Regarding a more object-oriented API: could this be implemented under the hood? This would reduce cumbersome syntax for users who would only instantiate one Tea object per program anyway. We could potentially still support an object-oriented API for users who are more comfortable with this programming framework or who want to use multiple Tea objects in a single program.

MaLiN2223 · 2020-04-29T17:39:26Z

I have only a few points of feedback. I think your design sounds very good, thank you!
If I do not mention something below, it means that I agree with your proposal on it. I am going to number all my comments so it is easier to reference them if needed.

1. I would completely omit global variables/functions (except the one mentioned below) from the API. Global variables can only cause problems: they make testing harder, prevent running two tests in sequence without a reset, rule out parallel execution, etc. I mention this first because I feel very strongly about it.
2. I tend to agree with @meli365 here, but I think we can have both at the same time.

This is how I imagine it would work:

from tea import initialize_study
# initialize_study would be the only global function in our API; it returns an already-initialized Tea object
tea_obj = initialize_study(var, design, data_path)
hypo = tea_obj.hypothesize("x~y")
print(hypo)  # hypo is also an object that supports casting to a string for easier printing

This lets less advanced users start their experiments quickly and easily. In my opinion, the less code needed to get a simple thing done, the better. Also, I would not wrap hypothesize in the same function, so that we keep concerns at least somewhat separate.

Then, advanced users could simply do

from tea import Tea
tea_obj = Tea()
tea_obj.define_variables(var)
tea_obj.define_design(study)
tea_obj.load_data(file_path)
...

I like the builder pattern that you proposed.

That said, I would vote for starting with the object-oriented design; we can later introduce utility functions that encapsulate repeated operations (like initialization of the Tea object). In my experience, it is easier to go from an OOP API to a functional one than the other way around.
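
To make the two entry points concrete, here is a minimal sketch of how a utility function could sit on top of the object-oriented API. Everything in it is an assumption for illustration: the method bodies are placeholders, `Hypothesis` is a hypothetical class, and no data is actually read.

```python
class Hypothesis:
    """Hypothetical result object that can be printed directly."""

    def __init__(self, formula):
        self.formula = formula

    def __str__(self):
        return f"Hypothesis({self.formula})"


class Tea:
    """Placeholder Tea object; each setter returns self so calls can be chained."""

    def __init__(self):
        self.variables = None
        self.design = None
        self.data_path = None

    def define_variables(self, variables):
        self.variables = variables
        return self

    def define_design(self, design):
        self.design = design
        return self

    def load_data(self, data_path):
        # A real implementation would read the file; this sketch only records the path.
        self.data_path = data_path
        return self

    def hypothesize(self, formula):
        if self.variables is None or self.design is None or self.data_path is None:
            raise RuntimeError("Define variables, design, and data before hypothesizing")
        return Hypothesis(formula)


def initialize_study(variables, design, data_path):
    """Utility wrapper: the only module-level function, built on the OOP API."""
    return Tea().define_variables(variables).define_design(design).load_data(data_path)
```

Because all state lives on the instance, two `Tea` objects can coexist in one program or one test run without a global reset.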

3. From the provided examples, I got the feeling that all variables (vars) have the same structure. This suggests we can encapsulate them in objects or tuples. Using dicts, which do not support typing, might lead to bugs.
4. Same as above, but for design variables.
5. "Two ways to instantiate a Tea object": this is a very good idea.
6. Could you elaborate on "Make syntactic checking for Tea programs more systematic"?
7. "Should we create a new type of clause for data transformations, rather than allowing them to remain in assumptions?" I would vote for waiting until we have the whole API done, because only then can we see good and bad usages.
8. "Should we move Tea-lang to be an org on Github?" I think it might be a good idea. One additional benefit is that it would allow other contributors to assign themselves to an issue to indicate that they are working on it.
9. "Proposed timeline: Finish by end of July": is it a hard or soft deadline? If it is soft and we might have some more time, maybe we should plan on doing an MVP first?
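
A rough sketch of the typed-tuple idea for variables, assuming a dict-style declaration with the keys `name`, `data type`, and `categories`. Those keys, the `DataType` members, and the helper `variable_from_dict` are assumptions made for this example, not a description of the existing API.

```python
from enum import Enum
from typing import NamedTuple, Optional, Tuple


class DataType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    NUMERIC = "numeric"  # the unified type proposed in the RFC


class Variable(NamedTuple):
    name: str
    data_type: DataType
    categories: Optional[Tuple[str, ...]] = None  # only for categorical variables


def variable_from_dict(raw: dict) -> Variable:
    """Convert a dict-style declaration into a typed tuple, failing early on bad input."""
    unknown = set(raw) - {"name", "data type", "categories"}
    if unknown:
        raise KeyError(f"Unknown variable fields: {sorted(unknown)}")
    categories = raw.get("categories")
    return Variable(
        name=raw["name"],
        data_type=DataType(raw["data type"]),  # raises ValueError for unsupported types
        categories=tuple(categories) if categories is not None else None,
    )
```

With this shape, a misspelled key or an unsupported data type is rejected when the variable is declared rather than surfacing later during test selection.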

General points:

Do we want to build this on a separate branch or on master? If on master, I think it might be a good idea to build this alongside the old API; otherwise collaboration might be almost impossible, as we might be stepping on each other's toes.
For all the missing pieces, we should have some kind of board (we can use the Projects tab on GitHub) so we don't duplicate work. I would also propose splitting this into well-defined features so we can divide the work more easily.
Enforcing code style checkers on PRs would go a long way toward improving the quality of the code in the repo. I mean both using PEP 8 as the basis for code style and following good programming practices. The earlier we enforce this, the easier it will be.
(Something for a late stage of the project) Performance tests will be needed. We might want to make sure that if a user comes with a large dataset (size > 10s of GBs), we can process it in a reasonable time.
(Something for a late stage of the project) Utilize GitHub releases with release notes and assets. In my opinion, the TensorFlow project does this very nicely.
We might want to utilize labels for issues, it might make our life easier when we have more of them in the future.

emjun · 2020-05-05T00:40:35Z

Thanks @meli365 and @MaLiN2223 for the great feedback! In general, I think we're on the same page.

Responding to @MaLiN2223 's points of feedback (using the same numbers) in greater detail:

Original points:

1. Globals: I also don't love the number of globals we have currently. Many of these can be replaced with enums that are type-checked.
2. API: Let's plan to provide both APIs (OOP and more functional), but start with the OOP API. We can come up with more functional wrappers after we have the OOP API down.
3&4. Vars and Design: Yep, let's encapsulate in typed tuples.
5. :)
6. More systematic syntactic checking for Tea programs: Typing for Vars and Design will help with this. Another check would be to make sure that all variables declared in a hypothesis are defined/specified in Vars first. We do this, but we don't handle these errors gracefully. I'd like to improve checking and error messaging for these kinds of errors.
7. Data transformations: Sounds like a good plan to hold off on this for now.
8. Github org: Did it! Everyone tagged in the original thread should have an invite. The org will be good for keeping related projects grouped together and giving contributors greater agency. I don't see any downsides. I haven't moved the main tea repo to the org yet. I will once we have a plan for how to develop v1 alongside Master (see below).
9. Deadline: End of July is not a hard deadline. Focusing on an MVP first and then improving it makes sense to me.
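
The check that every variable in a hypothesis is declared first could be sketched roughly as follows. This assumes a deliberately simplified grammar in which names are separated by `~` or `+`; the function `check_hypothesis_variables` and the exception `UndefinedVariableError` are hypothetical names, not existing Tea code.

```python
import re


class UndefinedVariableError(ValueError):
    """Raised when a hypothesis mentions a variable that was never declared."""


def check_hypothesis_variables(hypothesis: str, declared: set) -> list:
    """Return the variable names used in `hypothesis`, or raise a readable error.

    Assumes a simplified grammar: names separated by '~' or '+'.
    """
    names = [n.strip() for n in re.split(r"[~+]", hypothesis) if n.strip()]
    missing = [n for n in names if n not in declared]
    if missing:
        raise UndefinedVariableError(
            f"Hypothesis '{hypothesis}' uses undeclared variable(s): "
            f"{', '.join(missing)}. Declared variables: {', '.join(sorted(declared))}."
        )
    return names
```

The point of the dedicated exception is the message: it names the offending variables and lists what was declared, so the user can fix the program without reading a stack trace.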

General points:

Do we want to build this on a separate branch or on master?...
Response: My plan was to build this new API alongside what we have on Master, with the plan of eventually replacing what we currently have.
I'd like to avoid stepping on each other's toes and duplicating effort as much as possible, so here are a couple of options I think would be viable:

Work on the branch I started after I've cleaned it up and updated it to include the latest features on Master
Fork Master and then start re-organizing code, specifying new features, etc.

For all the missing pieces, we should have some kind of board (we can use the Projects tab on GitHub) so we don't duplicate work. I would also propose splitting this into well-defined features so we can divide the work more easily.
Response: Agreed! (A) I have used Trello in the past, but let's use the Projects tab to keep everything centralized. (B) Splitting into well-defined features and making feature requests is a good way to move forward.
Enforcing code style checkers on PRs would go a long way toward improving the quality of the code in the repo. I mean both using PEP 8 as the basis for code style and following good programming practices. The earlier we enforce this, the easier it will be.
Response: I'd like to include PEP 8 as a default check when building and testing code. We could also augment with manual checks. Do you have any other packages you like to use?
(Something for a late stage of the project) Performance tests will be needed. We might want to make sure that if a user comes with a large dataset (size > 10s of GBs), we can process it in a reasonable time.
Response: Absolutely! We might need to come up with some clever workarounds for large data. I agree though that we should tackle this after the other changes/tests are addressed.
(Something for a late stage of the project) Utilize GitHub releases with release notes and assets. In my opinion, the TensorFlow project does this very nicely.
Response: Yep, agreed! Let's keep this in mind but not prioritize it for now.
We might want to utilize labels for issues, it might make our life easier when we have more of them in the future.
Response: Yes, please. I tried for a little while (and then quickly lost the habit), but let's make sure to label issues more regularly.

MaLiN2223 · 2020-05-09T12:27:28Z

Sounds good to me, what is our plan now then?
Do you mind creating some tasks for others to pick up so we know where to start? A detailed checklist would also be helpful.

As for style checkers on build, I will try to handle that soon. It might take some time because some parts of our code base are not compliant, but I think I should be able to get it sorted before the end of next week. Created #52 to track this; you can assign me to this issue.

I would also suggest merging #47 and #48 so our build stops failing and we can have more tests.

emjun · 2020-05-12T00:20:07Z

Yes! My plan is basically to get everything ready for the major re-factor: Re-org code I already have for api_v1, create checklist, and assign tasks.

Update, here are my todos in priority order:

1. Triage outstanding bugs/issues.
2. Address those bugs/issues -- as many as I can without a major refactor that we are already planning. (in progress)
3. Fast forward the api_v1 branch.
4. Create checklist for V1.
5. Assign tasks in checklist.

Planning pessimistically, I'd say these things will take me two weeks.

emjun · 2020-05-18T03:06:47Z

To add to V1 Roadmap:

Require explicit hypothesis in order to perform a statistical test. Might need to identify most natural ways of expressing hypotheses (from end-users) and expand hypothesis grammar to support these. See commit message about this.
Catch any errors/warnings raised by dependencies (e.g., scipy, statsmodels) and better report these. (Inspired from this issue.)
- If the above errors/warnings are about validity of tests, these concerns should be added as constraints to Tea.
Add tests for one-sided and two-sided tests and make sure that the computed p-values complement one another: https://github.com/emjun/tea-lang/issues/2
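
One possible shape for catching and reporting warnings and errors from dependencies, using only the standard library `warnings` module. The wrapper `run_with_reported_warnings` is a hypothetical helper, and `fake_test` is a stand-in for a call into scipy or statsmodels, not a real API of either library.

```python
import warnings


def run_with_reported_warnings(func, *args, **kwargs):
    """Call func and return (result, messages) instead of letting warnings scroll by.

    Exceptions are re-raised with a note about which call failed.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let the default filters hide repeats
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            raise RuntimeError(f"{func.__name__} failed: {exc}") from exc
    messages = [f"{w.category.__name__}: {w.message}" for w in caught]
    return result, messages


def fake_test(sample):
    """Stand-in for a dependency call; warns on small samples."""
    if len(sample) < 3:
        warnings.warn("sample too small for a reliable result", RuntimeWarning)
    return sum(sample) / len(sample)
```

Collected messages that concern the validity of a test are the ones that could later be turned into Tea constraints, as suggested in the sub-point above.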

MaLiN2223 · 2020-06-25T18:59:31Z

@emjun Do we have any updates on this?

emjun · 2020-07-13T02:35:11Z

Just set up a Project for V1 and assigned a few initial tasks.

The "Goals for V1" column summarizes the above discussion into specific goals. The "To do" column adds more specific/smaller grained tasks to achieve those goals. Feel free to add anything to either column, assign yourself new issues, etc.

V1 is currently under development in the api_v1 branch. New features should be implemented as branches from the api_v1 branch. Then, once we have everything merged into api_v1, we can merge and replace the master branch, upload to pip, and have an official V1 release!

Timeline: MVP as soon as we can. :) Taking a look through the todos and being a terrible project estimator, maybe by the beginning of October? ;)

MaLiN2223 mentioned this issue May 9, 2020

Enforce style checks #52

Closed

emjun added planning 📝 Planning new features, versions, etc. feedback wanted 💬 Feedback from the community is wanted labels May 12, 2020

emjun mentioned this issue May 21, 2020

test_paired_t_test throws exception #6

Closed

domoritz mentioned this issue Oct 12, 2020

numeric type does not work in released version #69

Open
