You'll never find what you're not looking for, so these wordlists are:

- Comprehensive. These wordlists are constructed by analyzing terabytes of data from the biggest data sources around.
- Based on RECENT data. Tech evolves fast, so why shouldn't wordlists? The wordlists are based exclusively on data no more than a year old, and will keep changing as tech changes.
- Created with SCIENCE. By using some data science to remove outliers, generally crappy results, and more, we strip out much of the human element and its biases to give you more relevant, language-agnostic results.
- Magical. The construction of these wordlists is automagic, meaning that a year from now this GitHub repo will still have up-to-date, high-quality wordlists.
- Sorted. Rows are sorted from most likely to occur to least likely, so your chances of finding juicy stuff fast are much better, making the wordlists uniquely suitable when speed AND comprehensiveness are required.
- Explainable. Many items in popular wordlists have no basis in real life beyond what the author thinks will work. Every row in these wordlists is derived from real data.
The top 3,000 rows in the parameters wordlist map quite beautifully to their likelihood of being found on websites.
Use the mixed case wordlist unless you are sure your target is case insensitive. Or don't. I'm a README, not a cop.
Wordlist name | Size(s) | Description |
---|---|---|
sam-cc-parameters-(mixedcase\|lowercase)-all.txt | ~50,000 | HTTP parameter names. Use this to find hidden functionality! Basically, what would go in {here} for the URL http://example.com?{here}=value . |
sam-gh-directories-(mixedcase\|lowercase)-top(size).txt | 1,000 / 10,000 / 100,000 | Directory names as found in all open-source GitHub repos. Useful for brute-forcing host directories. |
sam-gh-files-(mixedcase\|lowercase)-top(size).txt | 1,000 / 10,000 / 100,000 / 1,000,000 | File names as found in all open-source GitHub repos. Useful for brute-forcing files, especially blind. |
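Because each wordlist is frequency-sorted, a prefix slice of a big list is itself a smaller, still-optimal wordlist. The sketch below uses a stand-in file (the real file names come from the table above) and shows, as a comment, how the slice might feed a fuzzer such as ffuf; the URL and slice size are illustrative assumptions:

```shell
# Stand-in for a real list such as sam-cc-parameters-lowercase-all.txt;
# the real files are already sorted from most to least frequent.
printf 'id\npage\ncallback\ndebug\n' > params-all.txt

# A head-slice keeps the most likely names, trading coverage for speed.
head -n 2 params-all.txt > params-top2.txt
cat params-top2.txt

# Fuzz hidden parameter names with ffuf (FUZZ marks the injection point):
#   ffuf -w params-top2.txt -u 'https://example.com/?FUZZ=value'
```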
The wordlists are created by trawling through huge public datasets. The methods employed are a bit different based on the noisiness of the data source, but in general:
- Deleting duplicate items from the same source (e.g. repo or domain) to allow the final frequency to represent their global frequency as opposed to letting small but repetitive sources dominate.
- Pruning items that are too rare to be of general interest, based on their rate of occurrence (generally requiring at least 10-100 occurrences).
- Using Shannon entropy to remove random values, tokens, and UUIDs.
- Removing items that are broken due to incorrect encoding and/or decoding.
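The entropy step above can be sketched in a few lines of Python. The cutoff value and sample strings are illustrative assumptions, not the repo's actual parameters:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character of s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Natural-language parameter names score low; random tokens, base64 blobs,
# and UUIDs score higher, so a cutoff separates them.
ENTROPY_CUTOFF = 3.0  # hypothetical threshold; the repo does not document its value

candidates = [
    "sessionid",                             # real parameter name -> low entropy
    "callback",                              # real parameter name -> low entropy
    "dGhpcyBpcyBhIHRva2Vu",                  # base64 blob -> high entropy
    "550e8400-e29b-41d4-a716-446655440000",  # UUID -> high entropy
]
kept = [w for w in candidates if shannon_entropy(w) < ENTROPY_CUTOFF]
print(kept)  # -> ['sessionid', 'callback']
```

In practice the threshold also has to account for string length, since very short random strings can still score low.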
The data source is given in the name of the file, to make them easy to tell apart:

- cc = CommonCrawl
- gh = GitHub