Adjust the Popularity measure in the risk index #21
Comments
I like this idea of emphasizing the top 5% of ALL Debian packages. Is 5% the right value, or should it be 1% or 2%? Perhaps we could give a package 2 points if it's in the top 1%, and 1 point if it's in the 2-5% range of all packages by popularity; that would provide a little gradation.
Sam and I looked at the Debian popularity values in more detail. We think that giving additional scores at the 5% and 1% levels would be justifiable (2 points if within the top 1% of popularity, 1 point if within the top 5% of popularity but not the top 1%). Here's why. Looking at the popularity graph, the "knee" in the curve, which we'll define as the place where the absolute value of the slope of the curve is one, is at about package 5000 (out of 146754 packages). That means that the curve switches to a slope of less than one at about 3.4% into the set. Since this is only a sample set, it makes sense to use a slightly broader definition, so I suggest that we cut off popularity at about 5% (since that would clearly include the 3.4% transition location), which would cut it off at package number 7338. We then re-examined these top 5% values, and there's another transition within that set at about 1% of the total number of packages. That is, the top 1% of all packages are ESPECIALLY popular. Obviously the number of packages and their popularity change over time; we want to use reasonable cutoffs that are a little less sensitive to exact values. Cutoffs of 5% and 1% are fairly common, and seem justified by the data set.
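The arithmetic behind these cutoffs can be sketched in a few lines of Python. This is only an illustration, not the project's actual scoring code: the function names `cutoff_rank` and `popularity_points` are hypothetical, ranks are assumed to be 1-based positions in the popcon by-installation ordering, and cutoffs are assumed to round up to a whole package.

```python
TOTAL_PACKAGES = 146754  # package count quoted in the comment above


def cutoff_rank(total: int, percent: int) -> int:
    """Last 1-based rank inside the top `percent` percent, rounded up.

    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    return (total * percent + 99) // 100


def popularity_points(rank: int, total: int = TOTAL_PACKAGES) -> int:
    """Hypothetical scoring: 2 points in the top 1%, 1 point in the rest of the top 5%."""
    if rank < 1 or rank > total:
        raise ValueError("rank must be between 1 and total")
    if rank <= cutoff_rank(total, 1):
        return 2
    if rank <= cutoff_rank(total, 5):
        return 1
    return 0


# The slope-one "knee" near package 5000 sits about 3.4% into the set.
knee_percent = round(5000 / TOTAL_PACKAGES * 100, 1)
```

With the numbers from the comment, `cutoff_rank(146754, 5)` gives package 7338, and a package at the knee (rank 5000) would receive 1 point.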
BTW, there seems to be no universal definition of a "knee" in a graph. More complex systems for defining and finding knees in curves (compared to what we used) can be found here:
These involve finding the maximum of the curvature, which for a continuous function $f$ is $\kappa(x) = \frac{|f''(x)|}{(1 + f'(x)^2)^{3/2}}$. I don't think we need to dig into these more complex systems for our purposes.
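For comparison, a maximum-curvature knee can be approximated on sampled data with finite differences. The sketch below is an assumption-laden illustration, not what we used: `curvature` and `knee_index` are hypothetical names, the samples are assumed to be evenly spaced with unit spacing in x, and no smoothing is applied (real popcon data would likely need it). Note that the result depends on how the axes are scaled.

```python
def curvature(ys):
    """Approximate |f''| / (1 + f'^2)^(3/2) at each interior sample.

    Uses central differences and assumes unit spacing between samples.
    """
    ks = []
    for i in range(1, len(ys) - 1):
        d1 = (ys[i + 1] - ys[i - 1]) / 2.0          # first derivative
        d2 = ys[i + 1] - 2.0 * ys[i] + ys[i - 1]    # second derivative
        ks.append(abs(d2) / (1.0 + d1 * d1) ** 1.5)
    return ks


def knee_index(ys):
    """Index into `ys` of the interior sample with the largest curvature."""
    ks = curvature(ys)
    if not ks:
        raise ValueError("need at least three samples")
    # ks[0] corresponds to ys[1], hence the offset of 1.
    return 1 + max(range(len(ks)), key=ks.__getitem__)
```

For example, `knee_index([0, 0, 0, 1, 2, 3])` picks index 2, the point where the flat segment bends into the rising one.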
[Figure: Popularity chart by installations for all Debian packages]
[Figure: Popularity chart of the top 5% of Debian packages]
Data was obtained from: http://popcon.debian.org/by_inst
Currently a package receives a point if it is in the top 90% of packages analyzed, which makes this a relative measure. Consider making it absolute by adjusting this measure to the top 5% of ALL Debian packages based on [1]. With more than 140K packages being tracked by the popularity contest, it is more sensible to reduce this measure to a much smaller percentage. Even 1% (~1400 packages) can be a reasonable threshold. Thanks.
[1] http://popcon.debian.org/