Generate sample dataset for testing with 2017 data schema #142

Open
nguyenhoan opened this issue Jun 28, 2017 · 19 comments

Comments

@nguyenhoan
Contributor

nguyenhoan commented Jun 28, 2017

Use the data generation code on master to generate a sample dataset from a few projects:
boa https://github.com/boalang/compiler
ptolemy ???
candoia https://github.com/candoia/candoia

@psybers please help with the list of projects

@nguyenhoan
Contributor Author

@psybers
Is there a GitHub repo for ptolemy?

@psybers
Member

psybers commented Jul 6, 2017

Afaik, only this repo: https://sourceforge.net/p/ptolemyj/code/HEAD/tree/

@nguyenhoan
Contributor Author

I don't want to mix Git and SVN.
So let's start with Boa and Candoia.

@psybers
Member

psybers commented Jul 6, 2017 via email

@nguyenhoan
Contributor Author

Actually, SVN support was not brought from Bitbucket to GitHub.

I will sync that.

@nguyenhoan
Contributor Author

We are using project IDs as the keys of the project sequence file, and those IDs might not be unique across forges.

@psybers
Member

psybers commented Jul 6, 2017

Uniqueness isn't necessary in that file, only in the AST file. Keys can duplicate here.
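
The distinction can be illustrated with a plain-Java analogy. This is not Boa or Hadoop code, and the records and the id 42 are made up: a file that is only scanned record by record tolerates repeated keys, while a file that is looked up by key can hand back only one record per key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateKeyDemo {
    /** Sequential scan: every record is visited, so duplicate keys are harmless. */
    static List<String> scan(List<Map.Entry<String, String>> records) {
        List<String> seen = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            seen.add(r.getValue());
        }
        return seen;
    }

    /** Keyed index: one value per key, so a duplicate key hides an earlier record. */
    static Map<String, String> index(List<Map.Entry<String, String>> records) {
        Map<String, String> byKey = new TreeMap<>();
        for (Map.Entry<String, String> r : records) {
            byKey.put(r.getKey(), r.getValue());
        }
        return byKey;
    }

    public static void main(String[] args) {
        // Hypothetical records: the same numeric id issued by two different forges.
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("42", "project from forge A"));
        records.add(new SimpleEntry<>("42", "project from forge B"));

        System.out.println(scan(records).size());   // both records are visited
        System.out.println(index(records).size());  // only one survives in the index
    }
}
```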

@nguyenhoan
Contributor Author

5df9b51

@psybers psybers reopened this Aug 11, 2017
@psybers
Member

psybers commented Aug 11, 2017

The new sample dataset now gives many errors when running the ast count query:

error with ast: -1!!src/test/java/org/junit/tests/assertion/AssertionTest.java
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.LongWritable
	at org.apache.hadoop.io.LongWritable.compareTo(LongWritable.java:60)
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115)
	at org.apache.hadoop.io.MapFile$Reader.binarySearch(MapFile.java:500)
	at org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:444)
	at org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:417)
	at org.apache.hadoop.io.MapFile$Reader.seek(MapFile.java:404)
	at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:523)
	at boa.functions.BoaAstIntrinsics.getast(BoaAstIntrinsics.java:101)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:207)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:194)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:184)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:164)
	at boa.AstCount$AstCountBoaMapper$Job0.map(AstCount.java:173)
	at boa.AstCount$AstCountBoaMapper.runJob(AstCount.java:198)
	at boa.AstCount$AstCountBoaMapper.map(AstCount.java:189)
	at boa.AstCount$AstCountBoaMapper.map(AstCount.java:126)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
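
Reading the trace: the failure happens inside the MapFile binary search, where LongWritable.compareTo receives a Text. That suggests the AST file is indexed by LongWritable keys while the lookup in getast passes a Text key. The sketch below is a stdlib-only analogy of that mechanism, not Hadoop code. Long stands in for LongWritable and String for Text, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class KeyTypeMismatchDemo {
    /** Binary search over sorted index keys, comparing each probe with the lookup key. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static int find(List<? extends Comparable> sortedKeys, Object lookupKey) {
        int lo = 0;
        int hi = sortedKeys.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            // Long.compareTo casts its argument to Long, much like
            // LongWritable.compareTo casts to LongWritable in the trace.
            int cmp = sortedKeys.get(mid).compareTo(lookupKey);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Long> index = Arrays.asList(10L, 20L, 30L);
        System.out.println(find(index, 20L)); // matching key type: found at position 1
        try {
            find(index, "20"); // wrong key type for this index
        } catch (ClassCastException e) {
            System.out.println("key type mismatch: " + e.getMessage());
        }
    }
}
```

If this reading is right, the fix is to make the key type used when writing the dataset agree with the key type the runtime uses for lookups.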

@nguyenhoan
Contributor Author

The sample dataset is way out of date.
Because it keeps getting bigger, we tend to delay updating it until a major "milestone".

@psybers
Member

psybers commented Aug 11, 2017

I think a good course of action is to first identify which project(s) we want in the sample dataset. Then write a shell script that will git clone those project(s) (perhaps based on a specific tag/commit) and then build the dataset. This would allow easily rebuilding the datasets.
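
A minimal sketch of the clone step, written in Java to match the project rather than as the shell script itself; a shell version would run the same two git commands. The class name, target directory, and default ref are placeholders, and the generator invocation is left out because its arguments are not settled in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PinnedCloneSketch {
    /** Git commands that clone a repository and check out one fixed tag or commit. */
    static List<List<String>> cloneCommands(String url, String dir, String ref) {
        List<List<String>> cmds = new ArrayList<>();
        cmds.add(Arrays.asList("git", "clone", url, dir));
        cmds.add(Arrays.asList("git", "-C", dir, "checkout", ref));
        return cmds;
    }

    /** Runs one command and fails loudly on a non-zero exit code. */
    static void run(List<String> cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("command failed: " + cmd);
        }
    }

    public static void main(String[] args) throws Exception {
        // The URL comes from this thread; the directory and default ref are placeholders.
        String url = "https://github.com/boalang/compiler";
        String ref = args.length > 0 ? args[0] : "master";
        for (List<String> cmd : cloneCommands(url, "repos/boalang/compiler", ref)) {
            run(cmd);
        }
        // The dataset generator would be invoked here.
    }
}
```

Pinning the ref is what makes the rebuild repeatable: the same tag or commit yields the same input for the generator each time.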

@nguyenhoan
Contributor Author

This is actually the way it works right now.
One just needs to update the JSON file containing the metadata of the chosen projects
https://github.com/boalang/compiler/blob/datagen2017/dataset/repos/repos.json

Running BoaGenerator.java takes care of everything else, all the way to the final dataset.

@psybers
Member

psybers commented Aug 11, 2017

So I guess that file needs to be updated with the project(s) we want in the dataset (Boa, Candoia, etc).

And what I meant was we add a .sh script somewhere, so that the user won't have to worry about figuring out the arguments to the generator:

A140801:compiler-datagen rdyer$ ./boa.sh -g
User must specify the path of the repository. Please see --remote and --local options
usage: boa
The most commonly used Boa options are:
 -cache,--json              enable if you want to delete the cloned code
                            for user.
 -debug,--json              enable for debug mode.
 -debugparse,--json         enable for debug mode when parsing source
                            files.
 -help,--help <arg>         help
 -inputJson,--json <arg>    .json files for metadata
 -inputRepo,--json <arg>    cloned repo path
 -output,--json <arg>       directory where output is desired
 -password,--json <arg>     github password to authenticate.
 -targetRepo,--json <arg>   name of the target repository
 -targetUser,--json <arg>   username of target repository
 -user,--json <arg>         github username to authenticate

Please report issues at http://www.github.com/boalang/
Exception in thread "main" java.lang.NullPointerException
	at boa.datagen.CacheGithubJSON.main(CacheGithubJSON.java:22)
	at boa.datagen.BoaGenerator.main(BoaGenerator.java:59)
	at boa.BoaMain.main(BoaMain.java:60)
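
One way to hide the arguments is a small wrapper that fills them in from fixed defaults. The sketch below only assembles the command line. The option names -inputJson, -inputRepo, and -output are copied from the usage text above, but whether this combination is sufficient for the generator is an assumption, and the three default paths and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GeneratorWrapperSketch {
    /** Builds a boa.sh generator command line from three paths. */
    static List<String> generatorCommand(String jsonDir, String repoDir, String outDir) {
        List<String> cmd = new ArrayList<>(Arrays.asList("./boa.sh", "-g"));
        cmd.addAll(Arrays.asList("-inputJson", jsonDir)); // .json files for metadata
        cmd.addAll(Arrays.asList("-inputRepo", repoDir)); // cloned repo path
        cmd.addAll(Arrays.asList("-output", outDir));     // directory for the output
        return cmd;
    }

    public static void main(String[] args) {
        // All three paths are placeholder defaults, not values confirmed in this thread.
        List<String> cmd = generatorCommand("dataset/json", "dataset/repos", "dataset/out");
        System.out.println(String.join(" ", cmd));
    }
}
```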

@psybers
Member

psybers commented Aug 11, 2017

And actually what I was envisioning was even more - they wouldn't need to have the JSON file. The script could connect to GitHub and download the project metadata when it clones. So what I was thinking was just a simple shell script that did that for a small set of projects, then ran the generator and built the dataset.
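
A rough sketch of the metadata download, assuming the GitHub REST endpoint https://api.github.com/repos/{owner}/{repo} and unauthenticated access, which is rate limited. The class and method names are made up for illustration, and the step that would write the JSON where the generator expects it is left out.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class GithubMetadataSketch {
    /** Maps a GitHub clone URL to the REST API URL for that repository's metadata. */
    static String apiUrl(String cloneUrl) {
        String prefix = "https://github.com/";
        if (!cloneUrl.startsWith(prefix)) {
            throw new IllegalArgumentException("not a GitHub URL: " + cloneUrl);
        }
        String path = cloneUrl.substring(prefix.length());
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        if (path.endsWith(".git")) {
            path = path.substring(0, path.length() - ".git".length());
        }
        return "https://api.github.com/repos/" + path;
    }

    /** Downloads the metadata JSON as a string. */
    static String fetch(String cloneUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(apiUrl(cloneUrl)).openConnection();
        conn.setRequestProperty("Accept", "application/vnd.github+json");
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints the API URL only; call fetch(...) to download the JSON itself.
        System.out.println(apiUrl("https://github.com/boalang/compiler"));
    }
}
```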

@psybers
Member

psybers commented Aug 23, 2017

FYI the readme for the project states: "The sample dataset contain only three projects to keep the download size small: Boa, PaniniJ, and Panini."

@nguyenhoan
Contributor Author

Should we update the readme or update the sample dataset?

I prefer the former, since I would also like to see projects from outside the lab.

@psybers
Member

psybers commented Aug 23, 2017

The reason for picking those 3 specific projects was that we, as owners of those projects, can consent to them being redistributed in any form. If we pick any other project, we have to worry about issues such as licensing and the GitHub TOS.

@nguyenhoan
Contributor Author

We don't own any web applications for testing JS, PHP, HTML, and CSS.
In fact, I would say we don't own any non-Java projects.

@hridesh
Member

hridesh commented Sep 1, 2017

Then it might make sense to contact the maintainers of a couple of such projects and explicitly ask permission to make their projects available as part of the Boa sample dataset.
