Generate sample dataset for testing with 2017 data schema #142

nguyenhoan · 2017-06-28T17:49:10Z

Use data generation code on master to generate sample dataset from a few projects
boa https://github.com/boalang/compiler
ptolemy ???
candoia https://github.com/candoia/candoia

@psybers please help with the list of projects

nguyenhoan · 2017-07-06T19:17:42Z

@psybers
Is there a GitHub repo for ptolemy?

psybers · 2017-07-06T19:31:56Z

Afaik, only this repo: https://sourceforge.net/p/ptolemyj/code/HEAD/tree/

nguyenhoan · 2017-07-06T19:34:31Z

I don't want to mix Git and SVN.
So let's start with Boa and Candoia.

psybers · 2017-07-06T19:35:46Z

Why not mix? That gives the users the ability to test on both kinds of SCM data.

nguyenhoan · 2017-07-06T19:47:20Z

Actually, SVN support was not brought from Bitbucket to GitHub.

I will sync that.

nguyenhoan · 2017-07-06T21:29:11Z

We are using project IDs as the keys of the project sequence file, and those might not be unique across forges.

psybers · 2017-07-06T22:03:46Z

Uniqueness isn't necessary in that file, only in the AST file. Keys can duplicate here.

nguyenhoan · 2017-07-10T22:27:39Z

5df9b51

psybers · 2017-08-11T15:57:58Z

The new sample dataset now gives many errors when running the ast count query:

error with ast: -1!!src/test/java/org/junit/tests/assertion/AssertionTest.java
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.LongWritable
	at org.apache.hadoop.io.LongWritable.compareTo(LongWritable.java:60)
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115)
	at org.apache.hadoop.io.MapFile$Reader.binarySearch(MapFile.java:500)
	at org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:444)
	at org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:417)
	at org.apache.hadoop.io.MapFile$Reader.seek(MapFile.java:404)
	at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:523)
	at boa.functions.BoaAstIntrinsics.getast(BoaAstIntrinsics.java:101)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:207)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:194)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:184)
	at boa.runtime.BoaAbstractVisitor.visit(BoaAbstractVisitor.java:164)
	at boa.AstCount$AstCountBoaMapper$Job0.map(AstCount.java:173)
	at boa.AstCount$AstCountBoaMapper.runJob(AstCount.java:198)
	at boa.AstCount$AstCountBoaMapper.map(AstCount.java:189)
	at boa.AstCount$AstCountBoaMapper.map(AstCount.java:126)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

nguyenhoan · 2017-08-11T16:15:00Z

The sample dataset is way out of date.
Because it keeps getting bigger, we tend to delay updating it until a major "milestone".

psybers · 2017-08-11T16:25:12Z

I think a good course of action is to first identify which project(s) we want in the sample dataset. Then write a shell script that will git clone those project(s) (perhaps based on a specific tag/commit) and then build the dataset. This would allow easily rebuilding the datasets.

nguyenhoan · 2017-08-11T16:40:47Z

This is actually the way it works right now.
One just needs to update the JSON file containing the metadata of the chosen projects
https://github.com/boalang/compiler/blob/datagen2017/dataset/repos/repos.json

Running BoaGenerator.java then takes care of everything, all the way to the final dataset.

psybers · 2017-08-11T17:00:20Z

So I guess that file needs to be updated with the project(s) we want in the dataset (Boa, Candoia, etc).

And what I meant was we add a .sh script somewhere, so that the user won't have to worry about figuring out the arguments to the generator:

A140801:compiler-datagen rdyer$ ./boa.sh -g
User must specify the path of the repository. Please see --remote and --local options
usage: boa
The most commonly used Boa options are:
 -cache,--json              enable if you want to delete the cloned code
                            for user.
 -debug,--json              enable for debug mode.
 -debugparse,--json         enable for debug mode when parsing source
                            files.
 -help,--help <arg>         help
 -inputJson,--json <arg>    .json files for metadata
 -inputRepo,--json <arg>    cloned repo path
 -output,--json <arg>       directory where output is desired
 -password,--json <arg>     github password to authenticate.
 -targetRepo,--json <arg>   name of the target repository
 -targetUser,--json <arg>   username of target repository
 -user,--json <arg>         github username to authenticate

Please report issues at http://www.github.com/boalang/
Exception in thread "main" java.lang.NullPointerException
	at boa.datagen.CacheGithubJSON.main(CacheGithubJSON.java:22)
	at boa.datagen.BoaGenerator.main(BoaGenerator.java:59)
	at boa.BoaMain.main(BoaMain.java:60)

psybers · 2017-08-11T17:02:20Z

And actually what I was envisioning was even more - they wouldn't need to have the JSON file. The script could connect to GitHub and download the project metadata when it clones. So what I was thinking was just a simple shell script that did that for a small set of projects, then ran the generator and built the dataset.
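To make the idea concrete, here is a rough sketch of that flow. It is written in Python rather than as a .sh script only to keep it short and easy to check, and it is not an existing script in the repo. The project URLs are the ones mentioned earlier in this thread; the `master` refs are placeholders to be replaced by a specific tag or commit, and the helper names (`metadata_url`, `clone_commands`, `build_plan`, `run_plan`) are made up for illustration. It assumes the metadata the generator needs can be taken from GitHub's public `repos/{owner}/{repo}` endpoint, which has not been verified here. The generator call itself is left as a comment because its arguments are not settled above.

```python
import subprocess
from urllib.parse import urlparse

# Hypothetical project list: (name, clone URL, ref).
# URLs come from this thread; the refs are placeholders for a pinned tag/commit.
PROJECTS = [
    ("boa", "https://github.com/boalang/compiler", "master"),
    ("candoia", "https://github.com/candoia/candoia", "master"),
]


def metadata_url(clone_url):
    """Map a GitHub clone URL to the REST endpoint that describes the repo."""
    owner, repo = urlparse(clone_url).path.strip("/").split("/")[:2]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"https://api.github.com/repos/{owner}/{repo}"


def clone_commands(name, url, ref, workdir):
    """Return the git commands that clone `url` and check out `ref`."""
    dest = f"{workdir}/{name}"
    return [
        ["git", "clone", "--quiet", url, dest],
        ["git", "-C", dest, "checkout", "--quiet", ref],
    ]


def build_plan(projects, workdir):
    """Flatten the per-project commands into one ordered list."""
    plan = []
    for name, url, ref in projects:
        plan.extend(clone_commands(name, url, ref, workdir))
    return plan


def run_plan(plan, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on failure."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for _, url, _ in PROJECTS:
        print("metadata:", metadata_url(url))
    run_plan(build_plan(PROJECTS, "repos"), dry_run=True)
    # Next step (not shown): run the data generator over the cloned repos.
```

Pinning each project to a fixed ref is what would make the dataset reproducible; with `dry_run=True` the script only prints what it would do, so the project list can be reviewed before anything is cloned.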

psybers · 2017-08-23T13:56:54Z

FYI the readme for the project states: "The sample dataset contain only three projects to keep the download size small: Boa, PaniniJ, and Panini."

nguyenhoan · 2017-08-23T14:15:14Z

Should we update the readme or update the sample dataset?

I prefer the former, since I would also want to see projects from outside the lab.

psybers · 2017-08-23T14:28:22Z

The design choice of picking those 3 specific projects was that we, as owners of those projects, can give consent to them being redistributed in any form. If we pick any other project, we have to worry about various issues like licensing, GitHub TOS, etc.

nguyenhoan · 2017-09-01T14:46:24Z

We do not own any web applications for testing JS, PHP, HTML, CSS.
In fact, I would say we do not own any non-Java projects.

hridesh · 2017-09-01T14:48:12Z

Then it might make sense to contact the maintainers of a couple of such projects and explicitly ask permission to make their projects available as part of the Boa sample dataset.

nguyenhoan added the data generation label Jun 28, 2017

nguyenhoan assigned RobertHSchmidt Jun 28, 2017

nguyenhoan closed this as completed Jul 17, 2017

psybers reopened this Aug 11, 2017

psybers added the enhancement label Jul 23, 2021
