[Interests] Include other categorization sources besides Alexa #170

Open
marcosmenendez opened this issue Jun 7, 2015 · 2 comments

@marcosmenendez

Alexa offers limited coverage of websites, so we need to complement it with other sources to improve and extend that coverage.

For that, we should use a combination of four other sources, as explained in this paper: http://arxiv.org/pdf/1411.5281v1.pdf

These are Cyren, Google AdWords, McAfee and WebPulse.

Pointers to those sources are given in the paper. If we have any doubts, we can write to or call the paper's authors; I know them.

@marcosmenendez
Author

Besides considering Cyren or WebPulse, we could also test the outcome of running the following APIs on the unclassified domains:
http://www.alchemyapi.com/products/alchemylanguage/concept-tagging
http://www.alchemyapi.com/products/alchemylanguage/keyword-extraction

Login credentials can be found in the shared LastPass folder.

@JorgelieHD
Contributor

To add new categorization sources, two sites were investigated: Cyren (http://www.cyren.com/url-category-check.html) and WebPulse (https://sitereview.bluecoat.com/sitereview.jsp), also known as Blue Coat.

From what we could see, these sites have categorized more domains than Alexa, which makes them well suited to complement our domain categorization. However, they use their own category sets, which differ from the one used by Alexa (DMOZ).

Therefore we need to find a mapping between Alexa's categories and those of the other sites. See this paper for reference (http://arxiv.org/pdf/1411.5281v1.pdf), and more specifically the Leacock-Chodorow similarity, which yields a similarity coefficient between two words based on how close they are in a word taxonomy.

Here is an application that computes this coefficient (along with other measures):
http://ws4jdemo.appspot.com/
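
To make the idea concrete, here is a minimal sketch of how an external category could be mapped to its closest Alexa category with the Leacock-Chodorow coefficient, `-log(path_length / (2 * max_depth))`. This is only an illustration under assumptions: the taxonomy in `PARENT` is a made-up toy tree (a real setup would use WordNet, as the demo above does), and the names `lch_similarity` and `map_category` are hypothetical, not part of our codebase or of the paper.

```python
import math

# Toy "is-a" taxonomy (child -> parent). Hypothetical stand-in for WordNet.
PARENT = {
    "entity": None,
    "activity": "entity",
    "commerce": "activity",
    "shopping": "commerce",
    "auction": "commerce",
    "recreation": "activity",
    "sport": "recreation",
    "game": "recreation",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


# Maximum depth of the taxonomy, counted in nodes from the root.
MAX_DEPTH = max(len(path_to_root(n)) for n in PARENT)


def shortest_path_nodes(a, b):
    """Number of nodes on the shortest path between a and b, inclusive."""
    steps_from_b = {n: i for i, n in enumerate(path_to_root(b))}
    for steps_from_a, n in enumerate(path_to_root(a)):
        if n in steps_from_b:  # first shared ancestor is the lowest one
            return steps_from_a + steps_from_b[n] + 1
    raise ValueError("no common ancestor")


def lch_similarity(a, b):
    """Leacock-Chodorow similarity: higher means closer in the taxonomy."""
    return -math.log(shortest_path_nodes(a, b) / (2.0 * MAX_DEPTH))


def map_category(external, alexa_categories):
    """Pick the Alexa category most similar to an external category."""
    return max(alexa_categories, key=lambda c: lch_similarity(external, c))


# Example: an external "auction" label lands on the "shopping" category.
print(map_category("auction", ["shopping", "sport", "game"]))
```

With real category names, the same `map_category` step would be run once per Cyren or WebPulse category, and low-scoring matches would probably need a manual review.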
