Improve the dataset #2

Open
aviks opened this issue Jun 10, 2018 · 2 comments


@aviks
Member

aviks commented Jun 10, 2018

This is more feedback than an issue; I'm not sure it's actionable.

I tried this on a real-world list of almost 18K names and got a hit rate of around 34%.
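
For reference, a hit rate like this can be measured as the fraction of input names that are found in the lookup table. A minimal Python sketch follows; `hit_rate` and the toy `known` set are hypothetical names for illustration, not part of the package, and the case-insensitive matching is an assumption:

```python
def hit_rate(names, known):
    """Return the fraction of `names` present in `known` (case-insensitive)."""
    known_lower = {k.lower() for k in known}
    names = list(names)
    if not names:
        return 0.0
    hits = sum(1 for n in names if n.strip().lower() in known_lower)
    return hits / len(names)


# Toy data for illustration only.
known = {"Alice", "Bob", "Carol"}
sample = ["alice", "Dmitri", "BOB", "Xiulan"]
print(hit_rate(sample, known))  # 0.5
```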

@oxinabox
Member

oxinabox commented Jun 11, 2018

With some effort, a new database could be constructed.
Governments tend to release statistics on how popular each name is, by year and sex.

USA:
https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data

Australia:
https://data.gov.au/dataset/popular-baby-names

England and Wales:
https://data.gov.uk/dataset/afe1871f-dede-41bf-a6ba-0a1d32217cdb/baby-names-england-and-wales

Northern Ireland:
https://data.gov.uk/dataset/9ebaf276-f4d5-41e9-bf22-b7ccab8cf85e/full-list-of-first-forenames-given-to-babies-registered-in-northern-ireland

I wouldn't be surprised if name usage were Zipfian,
so there would be truly vast numbers of very rare names.
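
To illustrate what a Zipfian distribution would imply: if the frequency of the name at rank r is proportional to 1/r^s, then the share of usage covered by the k most common of n names is the ratio of the partial sum to the full sum of those weights. The sketch below computes that ratio. The exponent `s = 1` and the vocabulary size of one million distinct names are assumptions chosen for illustration, not measured values.

```python
def zipf_coverage(k, n, s=1.0):
    """Share of total usage covered by the top-k of n names when
    frequency at rank r is proportional to 1 / r**s."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    weights = [1.0 / r**s for r in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)


n = 1_000_000  # assumed number of distinct names
for k in (1_000, 10_000, 100_000):
    print(k, round(zipf_coverage(k, n), 2))
```

Under these assumed parameters the top 1,000 names cover only about half of all usage, and each tenfold increase in dataset size adds a roughly constant slice of coverage, which is the long-tail effect described above.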

@oxinabox
Member

oxinabox commented Jun 12, 2018

Julio Raffo, 2016. "Worldwide Gender-Name Dictionary," WIPO Economics & Statistics Related Resources 10, World Intellectual Property Organization - Economics and Statistics Division.

This work created a dataset from several sources, including various government statistics, Facebook, and Wikipedia:
https://ideas.repec.org/c/wip/eccode/10.html

It has 6.2 million names for 182 different countries.
It only resolves to Male, Female, or Androgynous, and gives no count information.
It is also case-insensitive only.
But that is all fine.

Making that work would mean adding DataDeps.jl as a dependency, because the dataset is nontrivial in size,
and writing an alternate loading function using CSVFiles.jl.
It would also mean adding the set of accepted country codes to the Detector type,
since these vary between the datasets.
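
A rough sketch of that design, written in Python only to keep it self-contained (in the package itself this would be Julia, with the download and parsing handled by DataDeps.jl and CSVFiles.jl). Everything here is hypothetical: the `Detector` class and `load_rows` helper are illustrative names, and the `name,country,gender` row layout with `M`/`F`/`?` codes is an assumption about the file, not its documented format. The idea is that the detector records which country codes its dataset accepts and lower-cases names so lookups are case-insensitive.

```python
import csv
import io


class Detector:
    """Hypothetical detector: case-insensitive lookup keyed by (name, country)."""

    def __init__(self, rows):
        self.table = {}
        self.countries = set()  # country codes this dataset accepts
        for name, country, gender in rows:
            country = country.upper()
            self.countries.add(country)
            self.table[(name.lower(), country)] = gender

    def lookup(self, name, country):
        country = country.upper()
        if country not in self.countries:
            raise ValueError(f"unsupported country code: {country}")
        return self.table.get((name.lower(), country), "unknown")


def load_rows(text):
    """Parse rows of `name,country,gender` from CSV text (assumed layout)."""
    return [tuple(r) for r in csv.reader(io.StringIO(text)) if r]


# Made-up sample rows, not taken from the real dataset.
sample = "alex,US,?\nmaria,ES,F\njohn,US,M\n"
detector = Detector(load_rows(sample))
print(detector.lookup("JOHN", "us"))   # M
print(detector.lookup("Maria", "US"))  # unknown
```

Storing the accepted codes on the detector lets an unsupported country fail loudly instead of silently returning "unknown", which matters when different datasets cover different countries.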
