Generate sample dataset for testing with 2017 data schema #142
Comments
@psybers
AFAIK, only this repo: https://sourceforge.net/p/ptolemyj/code/HEAD/tree/
I don't want to mix Git and SVN.
Why not mix them? That would give users the ability to test on both kinds of SCM data.
Actually, SVN support was not carried over from Bitbucket to GitHub. I will sync that.
We use project IDs as the keys of the project sequence file, and those IDs might not be unique across forges.
Uniqueness isn't necessary in that file, only in the AST file. Keys can be duplicated here.
The new sample dataset now gives many errors when running the AST count query.
The sample dataset is way out of date.
I think a good course of action is to first identify which project(s) we want in the sample dataset. Then write a shell script that will git clone those project(s) (perhaps based on a specific tag/commit) and then build the dataset. This would allow easily rebuilding the datasets.
This is actually the way it works right now. Running BoaGenerator.java takes care of everything, all the way to the final dataset.
So I guess that file needs to be updated with the project(s) we want in the dataset (Boa, Candoia, etc). And what I meant was that we add a .sh script somewhere, so that the user won't have to worry about figuring out the arguments to the generator.
And actually what I was envisioning was even more - they wouldn't need to have the JSON file. The script could connect to GitHub and download the project metadata when it clones. So what I was thinking was just a simple shell script that did that for a small set of projects, then ran the generator and built the dataset.
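To make the idea above concrete, here is a rough sketch of the steps such a script would perform, written in Python rather than shell so the pieces are easy to test. Everything in it is an assumption for illustration: the `PROJECTS` table, the helper names, and the `repos/` and `metadata/` directory layout are hypothetical, and the generator command is passed in by the caller because the thread does not spell out BoaGenerator's arguments. The metadata lookup uses the GitHub REST endpoint `https://api.github.com/repos/{owner}/{repo}`.

```python
import json
import subprocess
import urllib.request
from pathlib import Path

# Projects named in this issue. The second element is an optional tag or
# branch to pin; None means "default branch" (no real tags are assumed here).
PROJECTS = {
    "boa": ("https://github.com/boalang/compiler", None),
    "candoia": ("https://github.com/candoia/candoia", None),
}


def clone_command(url, dest, ref=None):
    """Build a shallow `git clone` command, optionally pinned to a tag/branch."""
    cmd = ["git", "clone", "--depth", "1"]
    if ref:
        # --branch accepts tags as well as branches, but not raw commit hashes.
        cmd += ["--branch", ref]
    return cmd + [url, str(dest)]


def metadata_url(repo_url):
    """Map a GitHub repository URL to its REST API metadata endpoint."""
    path = repo_url.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    owner, name = path.split("/")[-2:]
    return f"https://api.github.com/repos/{owner}/{name}"


def fetch_json(url):
    """Download and parse a JSON document."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def build_dataset(workdir, generator_cmd, run=subprocess.run, fetch=fetch_json):
    """Clone each project, save its metadata, then run the generator command.

    `generator_cmd` is supplied by the caller; `run` and `fetch` can be
    replaced with stubs for a dry run.
    """
    workdir = Path(workdir)
    meta_dir = workdir / "metadata"  # hypothetical layout
    meta_dir.mkdir(parents=True, exist_ok=True)
    for name, (url, ref) in PROJECTS.items():
        run(clone_command(url, workdir / "repos" / name, ref), check=True)
        meta = fetch(metadata_url(url))
        (meta_dir / f"{name}.json").write_text(json.dumps(meta, indent=2))
    run(list(generator_cmd), check=True)
```

Pinning to a specific commit rather than a tag would need a full clone followed by `git checkout <commit>`, since `--branch` does not take commit hashes. How the generator would consume the saved metadata files is left open here.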
FYI the readme for the project states: "The sample dataset contain only three projects to keep the download size small: Boa, PaniniJ, and Panini."
Should we update the readme or update the sample dataset? I prefer the former since I would also like to see projects from outside of the lab.
Those 3 specific projects were picked because we, as owners of those projects, can consent to them being redistributed in any form. If we pick any other project, we have to worry about various issues like licensing, GitHub TOS, etc.
We do not own any web applications for testing JS, PHP, HTML, or CSS.
Then it might make sense to contact the maintainers of a couple of such projects and explicitly ask permission to make their projects available as part of the Boa sample dataset.
Use the data generation code on master to generate a sample dataset from a few projects:
boa https://github.com/boalang/compiler
ptolemy ???
candoia https://github.com/candoia/candoia
@psybers please help with the list of projects