The Wall Street Journal has done an excellent service by downloading the individual email files from the State Department's listing and making them accessible via their own search app. Even better, they've packaged the emails in convenient zip files.
Note: The WSJ has also published the code they use to do the scrape in their own GitHub repo:
https://github.com/wsjdata/clinton-email-cruncher
Why do you need your own copies of the emails when the WSJ's application works so well? It's a great tool, but it's still not as flexible as using regular expressions, which allow us to search by pattern, including:

- Look for all instances in which a "$" is followed by numbers (to quickly locate places where money is mentioned)
- Look for all instances of...anything...in which the email does not include an article from the New York Times/Washington Post/WSJ/etc.
- Extract the "From:" field of each email to quickly find out who the most frequent senders were.
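Each of those patterns maps directly onto Python's built-in re module. A minimal sketch, using a made-up snippet of email text for illustration:

```python
import re

# A made-up snippet standing in for one email's extracted text
email_text = """From: Mills, Cheryl D <[email protected]>
Subject: budget question
The request covers $553 million for additional security guards.
"""

# Dollar sign followed by digits: quickly locate mentions of money
money_mentions = re.findall(r'\$\d+', email_text)

# Does the email quote a news organization? (useful for filtering out clippings)
is_clipping = bool(re.search(
    r'Associated Press|New York Times|Washington Post|Reuters', email_text))

# Pull out the "From:" line to tally senders
sender = re.search(r'^From: *(.+)$', email_text, flags=re.MULTILINE).group(1)

print(money_mentions)   # ['$553']
print(is_clipping)      # False
print(sender)           # Mills, Cheryl D <[email protected]>
```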
The algorithms in this repo are pretty general and can be replicated with any number of tools or languages. But to get my example code going, you should have:
- Python 3.5 (I use Anaconda)
- Requests for downloading the files (comes with Anaconda)
- lxml for simple HTML parsing
- The PDF Poppler library so that we can use pdftotext to extract the raw text from each PDF.
To search by regular expression, you can obviously use Python's built-in re library. But I like doing things via the command-line if possible, and ag (the Silver Searcher) is my favorite of the grep-like tools.
To get the data yourself, run getdata.py via the command-line interpreter:
$ python getdata.py
Here's what happens:

- The landing page for WSJ's document search is downloaded via the Requests library.
- The HTML is parsed via the lxml library, and the footnote links are extracted.
- The URL from each link is extracted, e.g. http://graphics.wsj.com/hillary-clinton-email-documents/zips/Clinton_Email_August_Release.zip
- A local filename is derived from the URL, e.g. ./data/docs/zips/Clinton_Email_August_Release.zip
- If the zip file doesn't exist locally, it is downloaded.
- Every zip file is unpacked into its own separate directory of PDFs.
- Each PDF is processed via pdftotext (from the Poppler library), and the raw text (which is embedded in each PDF by the State Department's optical-character-recognition software) is extracted into its own text file.
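The actual code lives in getdata.py; as a rough sketch of the derive-a-local-filename step, here's how it might look in Python (the helper name local_zip_path is mine, not the repo's):

```python
import os
from urllib.parse import urlparse

def local_zip_path(url, base_dir='./data/docs/zips'):
    # Derive a local filename from the URL's final path segment
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(base_dir, filename)

url = 'http://graphics.wsj.com/hillary-clinton-email-documents/zips/Clinton_Email_August_Release.zip'
print(local_zip_path(url))
# ./data/docs/zips/Clinton_Email_August_Release.zip
```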
I recommend searching from the command-line to swiftly look for patterns (I use ag, the silver searcher, for its extra regex power over standard grep). Then, when you've gotten the lay of the data (including how messy the optical-character-recognition translation was), you can write Python code to do more specific things.
$ ag -i 'benghazi' data/docs/text
Sample result:
data/docs/text/HRCEmail_SeptemberWeb/C05785381.txt
24:BENGHAZI (AP) - At first, the responses to the questionnaire about the trauma of the war in Libya were
49: The women said they had been raped by Gadhafi's militias in numerous cities and towns: Benghazi, Tobruk,
70: Doctors at hospitals in Benghazi, the rebel bastion, said they had heard of women being raped but had not
We want emails that mention "benghazi" because the sender/recipient is actually discussing it, not just because they're sending each other news clippings.
Here's how to get a list of such filenames; note that ag's -L flag lists the files that do not contain a match:
$ ag 'Associated Press|New York Times|Washington Post|Reuters|\( *AP *\)' \
-L data/docs/text
data/docs/text/Clinton_Email_August_Release/C05765907.txt
data/docs/text/Clinton_Email_August_Release/C05765911.txt
data/docs/text/Clinton_Email_August_Release/C05765917.txt
data/docs/text/Clinton_Email_August_Release/C05765915.txt
data/docs/text/Clinton_Email_August_Release/C05765918.txt
data/docs/text/Clinton_Email_August_Release/C05765922.txt
data/docs/text/Clinton_Email_August_Release/C05765931.txt
data/docs/text/Clinton_Email_August_Release/C05765923.txt
data/docs/text/Clinton_Email_August_Release/C05765928.txt
data/docs/text/Clinton_Email_August_Release/C05765933.txt
Pipe those filenames back into ag:
$ ag 'Associated Press|New York Times|Washington Post|Reuters|\( *AP *\)' \
-L data/docs/text \
| xargs ag -i 'benghazi'
Sample output:
data/docs/text/HRCEmail_SeptemberWeb/C05784017.txt
13:Subject: Fw: Benghazi Sitrep - August 14, 2011
14:Attachments: Benghazi Sitrep - August 14 2011.docx; Benghazi Sitrep - August 14 2011.docx
26:Subject: Fw: Benghazi Sitrep - August 14, 2011
40:(SBU) Jibril to Return to Benghazi to Discuss Forming a New Government: A senior foreign ministry official told us that
41:PM Mahmoud Jibril is planning to return to Benghazi on/about August 17 to resume discussions on reconstituting the
57:Subject: Benghazi Sitrep - August 14, 2011
70:Attached for your information is the latest Benghazi sitrep from the Office of the U.S. Envoy to the Libyan Transitional
A little better, but there's still a bit too much noise that comes from forwarded emails...some of these forwards will contain useful content, but let's see what we get when we cut them out:
$ ag 'Associated Press|New York Times|Washington Post|Reuters|\( *AP *\)' \
-L data/docs/text \
| xargs ag -iL 'fwd?:' \
| xargs ag -i 'benghazi'
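The same three-stage filter can be sketched with Python's re module; here the corpus is a handful of made-up in-memory strings rather than text files on disk:

```python
import re

NEWS = re.compile(r'Associated Press|New York Times|Washington Post|Reuters|\( *AP *\)')
FWD = re.compile(r'fwd?:', re.IGNORECASE)
TOPIC = re.compile(r'benghazi', re.IGNORECASE)

# Made-up stand-ins for the extracted email text files
emails = {
    'a.txt': 'BENGHAZI (AP) - At first, the responses...',           # a news clipping
    'b.txt': 'Subject: Fw: Benghazi Sitrep - August 14, 2011',       # a forward
    'c.txt': 'once flew w/ him on his jet from Benghazi to London',  # a keeper
}

hits = [name for name, text in emails.items()
        if not NEWS.search(text)      # drop files quoting news orgs (like ag -L)
        and not FWD.search(text)      # drop forwards (like ag -iL)
        and TOPIC.search(text)]       # keep files mentioning Benghazi (like ag -i)

print(hits)  # ['c.txt']
```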
Here's some sample output that I noticed immediately:
data/docs/text/HRCEmail_SeptemberWeb/C05779634.txt
19:For what it is worth. I know Self al-Islam — once flew w/ him on his jet from Benghazi to London. I have obviously been
data/docs/text/HRCEmail_SeptemberWeb/C05780083.txt
37:Benghazi to coordinate opposition military activities, made contact with the newly formed
38:National Libyan Council (NLC) stating that the Benghazi military council would join the NLC
122:• Benghazi, Opposition forces are currently in possession of 14 fighter aircraft at the
123: Benghazi Airport, but they have no pilots or maintenance crews to support them.
Let's just use good ol' head to see what the tops of these emails are all about -- note that I pipe into grep to remove all blank lines for easier reading:
$ head -n 20 data/docs/text/HRCEmail_SeptemberWeb/C05779634.txt \
| grep -v '^$'
Output (junk characters removed):
UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05779634 Date: 09/30/2015
From: Anne-Marie Slaughter
Sent: Sunday, April 3, 2011 11:31 AM
To:
Cc: Mills, Cheryl D; Abedin, Huma; Sullivan, Jacob J
Subject: just a thought
For what it is worth. I know Self al-Islam — once flew w/ him on his jet from Benghazi to London. I have obviously been
very vocal in support of intervention and against Gaddafi himself, so have some credibility w/ both sides. lilt would be
Seems benign, but at least it's not just a news clipping.
A more generally useful pattern is looking for the dollar sign, plus any number of digits that immediately follow:
$ ag '\$\d+' data/docs/text
data/docs/text/HRC_Email_296/C05739846.txt
43: Department is asking permission from Congress to transfer $1.3 billion from funds that had been allocated for
44: spending in Iraq. This includes $553 million for additional Marine security guards; $130 million for diplomatic
45: security personnel; and $691 million for improving security at installations abroad.
data/docs/text/HRC_Email_296/C05739864.txt
107: families by providing payments of 2,000 Dinars (approximately $1,500) per month to
243: militiamen and their families by providing payments of 2,000 Dinars (approximately $1,500) per
This would obviously benefit from filtering the emails that include news clippings, but you get the idea.
A common angle in searching an email corpus is to find the most common senders or recipients. We don't need any fancy scripting, just a recognition of what makes an email address an email address.
Here's a typical header:
From: Mills, Cheryl D <[email protected]>
So basically, a line that begins with From:, has a bunch of white space, and has the email inside angle brackets: <[email protected]>
Here's a quick tryout:
$ ag -o --noheading --nofilename '^From: {3,}.+' data/docs/text \
| head -n 10
Output:
From: Sullivan, Jacob J <[email protected]>
From: Mills, Cheryl D <[email protected]>
From: Abedin, Huma <[email protected]>
From: Abedin, Huma <[email protected]>
From: sbwhoeop
Pretty good. But some emails have the address redacted, i.e. the bare sbwhoeop. For now, let's just try to capture the actual email addresses:
$ ag -o --noheading --nofilename '^From: {3,}.+' data/docs/text \
| ag -o '(?<=<).+?(?=>)' \
| tr [:upper:] [:lower:] \
| tr -d ' ' \
| LC_ALL=C sort \
| LC_ALL=C uniq -c \
| LC_ALL=C sort -rn \
| head -n 10
Result:
7270 [email protected]
3923 [email protected]
3017 [email protected]
2782 [email protected]
626 [email protected]
498 [email protected]
396 [email protected]
291 [email protected]
257 [email protected]
242 millscd?state.gov
What's with the LC_ALL=C in the previous command? Apparently the text encoding isn't all quite right. Without the flag, you will get this error:
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `abedinh\342\251state.gov' and `[email protected]'.
Of course, it's not just the most frequent senders who are interesting...but also the ones who show up only a few times...
$ ag -o --noheading --nofilename '^From: {3,}.+' data/docs/text \
| ag -o '(?<=<).+?(?=>)' \
| tr [:upper:] [:lower:] \
| tr -d ' ' \
| LC_ALL=C sort \
| LC_ALL=C uniq -c \
| LC_ALL=C sort -rn \
| tail -n 20
As you can see, the OCR quality wasn't perfect...but there are ways, with a little more regex skill, to filter out the junk:
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 [email protected]
1 abedinh?state,gov
1 abedinhostate.gov
1 abedinhdstate.gov
1 [email protected]
1 [email protected]
1 abedinh@state,gov
1 [email protected]
1 1i
1 1
You know how when shooting a quick email you're sometimes too lazy to use proper capitalization or punctuation? Sometimes that happens even to the Secretary of State and her circle:
$ ag '^Subject: [a-z0-9 ]+.?$' data/docs/text/
Of course, there are a ton of false negatives, but it's just a quickie search:
data/docs/text/HRCEmail_NovWeb/C05794317.txt
25:Subject: salsa excursion
data/docs/text/HRCEmail_NovWeb/C05794358.txt
9:Subject: help
data/docs/text/HRCEmail_NovWeb/C05794365.txt
24:Subject: hey
data/docs/text/HRCEmail_NovWeb/C05794372.txt
442:Subject: help
data/docs/text/HRCEmail_NovWeb/C05794378.txt
39:Subject: more on libya
data/docs/text/HRCEmail_NovWeb/C05794388.txt
24:Subject: are you in dc or ny?
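To see exactly what that subject pattern does and doesn't match, here's a quick check with Python's re module, using lines borrowed from the output above:

```python
import re

# The same pattern: a Subject line of only lowercase letters, digits,
# and spaces, plus at most one trailing character (e.g. a "?")
pattern = re.compile(r'^Subject: [a-z0-9 ]+.?$', re.MULTILINE)

sample = """Subject: salsa excursion
Subject: are you in dc or ny?
Subject: Fw: Benghazi Sitrep - August 14, 2011
"""

# The capitalized "Fw:" line fails the lowercase-only character class
print(pattern.findall(sample))
# ['Subject: salsa excursion', 'Subject: are you in dc or ny?']
```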
Another rough filter: emails sent at odd hours, late in the evening or before dawn. This is getting pretty ugly without programming, but still, it kind of works:

$ ag '^Sent: (.+?)(((0| )8|11):\d\d *PM|(0| )[0-6]:\d\d *AM)' data/docs/text/
Sometimes people like to use multiple consecutive exclamation or question marks, or both, to indicate strong emotions. Also, such punctuation is not usually part of a formal article, so finding these occurrences may help filter for "real" messages:

$ ag -i '[a-z]+[?!]{2,}' data/docs/text
Some of the sample matches:
data/docs/text/HRCEmail_NovWeb/C05794489.txt
38:Friday's schedule and try to remind me???
data/docs/text/HRCEmail_NovWeb/C05794646.txt
17:Wonder why they didn't hand me their letter??
data/docs/text/HRCEmail_NovWeb/C05794734.txt
17:What were you doing at the Kennedy Center w Ambos, Philippe and Rosemarie??? I'm very confused!
data/docs/text/HRCEmail_NovWeb/C05795016.txt
49:Is nothing sacred??
data/docs/text/HRCEmail_NovWeb/C05795120.txt
16:WHAT??? Or, more to the point, WTF??
data/docs/text/HRCEmail_NovWeb/C05795739.txt
17:Van I see the pashminas??
A little regular expression knowledge can help refine the vaguest of searches. For example, people these days like to use abbreviations for normal words, such as "r u" instead of "are you". Or sometimes, "u?" and/or "y?". Looking for such informal phrases of communication can be a great filter when cutting through mostly boring emails.
With normal text search, it's very difficult to disambiguate "u?" (i.e. find all occurrences of the letter "u" followed by a question mark) from "Do you prefer Ubuntu?"
The following grep looks for the solitary letter "r" followed by one-or-more white spaces, and then the solitary letter "u", case insensitive:
$ ag -i '\br +u\b' data/docs/text
Just a few messages, but they seem fun:
data/docs/text/Clinton_Email_August_Release/C05767696.txt
35:about that. But regardless, means ur email must be back! R u getting other messages?
data/docs/text/Clinton_Email_August_Release/C05770301.txt
1726: R Under Secretary of State for Public Diplomacy and Public Affairs
data/docs/text/Clinton_Email_August_Release/C05773787.txt
11:Subject: R u up? He's done and wondering u r up.
data/docs/text/Clinton_Email_August_Release/C05774251.txt
27:about that. But regardless, means ur email must be back! R u getting other messages?
data/docs/text/HRCEmail_Feb13thWeb/C05791277.txt
21:Cc: Pittman, H Dean; Holt, Victoria K (USUN); Fine Tressa R USUN); Ried, Curtis R (USUN); 'Hajjar
data/docs/text/HRCEmail_NovWeb/C05798177.txt
16:What a terrific time. Thanks so much for including me. Loved the. Toast- the downton abbey schtick( am a fan. R U
data/docs/text/HRCEmail_OctWeb/C05791376.txt
26:that R understands that some wounded and children are getting out. I told him bluntly that's not what we understood.
The following regex looks for the letter "u" followed by a literal question mark which is not followed by another alphabetical character -- a rare real-life chance to practice the negative-lookahead syntax, e.g. (?!\w):
$ ag -i '\bu\?(?!\w)' data/docs/text/
These all look fun:
data/docs/text/HRCEmail_DecWeb/C05785917.txt
75:I haven't heard anything more from dod today...have u?
data/docs/text/HRCEmail_DecWeb/C05785919.txt
92:I haven't heard anything more from dod today...have u?
data/docs/text/HRCEmail_Jan29thWeb/C05781152.txt
39:U never sleep do u?
data/docs/text/HRCEmail_JulyWeb/C05764209.txt
13:Subject: Can I call u?
data/docs/text/HRCEmail_JuneWeb/C05760041.txt
12:Subject: Re: Can I call u?
22:Subject: Can I call u?
What happens when you leave out the negative lookahead?
$ ag -i '\bu\?' data/docs/text/
A minor inconvenience in the form of a few extra non-useful matches, as several URL query strings are captured without a negative lookahead:
data/docs/text/HRCEmail_NovWeb/C05774838.txt
430:maeci.gc.ca/u?id=1001455.35479660e507911ad0ede9335212fa35&n=T&1=001 foreign affairs eng&o=32902
data/docs/text/HRCEmail_SeptemberWeb/C05778574.txt
112: <http://whatcounts.com/u?id=COD97B57DCF68E4CFC5E233FC1C79406A60E1DDBDO1ECB9D> .
data/docs/text/HRCEmail_SeptemberWeb/C05782700.txt
87: http://email.foxnews.com/u?id=63BA7452E6F05FA312628179BBC2EB01
91: http://email.foxnews.com/u?id=63BA7452E6F05FA312628179BBC2EBO1&global=1
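Python's re module uses the same lookahead syntax, so the difference is easy to demonstrate on two lines borrowed from the results above:

```python
import re

chatty = 'Subject: Can I call u?'
urlish = 'http://email.foxnews.com/u?id=63BA7452E6F05FA312628179BBC2EB01'

with_lookahead = re.compile(r'\bu\?(?!\w)', re.IGNORECASE)
without = re.compile(r'\bu\?', re.IGNORECASE)

print(bool(with_lookahead.search(chatty)))   # True
print(bool(with_lookahead.search(urlish)))   # False: "?" is followed by the "i" of "id"
print(bool(without.search(urlish)))          # True: the URL query string matches
```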
Maybe you're interested in the types of websites Secretary Clinton and her email friends forward each other? Here's a probably not totally accurate regex pattern for that:
$ ag -io 'https?://.+?\s' data/docs/text/
Or maybe you're curious about the domains...because you want to find out which websites, in general, are the most frequented by Secretary Clinton's network. Again, a bit sloppy, but one that we can refine if needed:
$ ag -io 'https?://.+?(?=/)' data/docs/text/
And here's how to get a tally of top 20 most mentioned web domains:
$ ag --noheading --nofilename \
-io 'https?://.+?(?=/)' data/docs/text/ \
| sort | uniq -c | sort -rn | head -n 20
Here's the output:
371 http://www.guardian.co.uk
142 http://www.amazon.com
125 http://www.nytimes.com
121 http://en.wikipedia.org
99 http://www.washingtonpost.com
86 http://topics.nytimes.com
82 http://www.facebook.com
79 http://www.huffingtonpost.com
70 http://www.messagelabs.com
66 http://www.ft.com
65 http://www.state.gov
54 http://www.thedailybeast.com
53 http://twitter.com
45 http://www.youtube.com
41 http://www.newyorkercom
34 http://www.newyorker.com
34 http://www.evite.com
34 http://maxblumenthal.com
34 http://coloradoindependent.com
Note that some of these are repeats within the same email or in forward-chains, such as coloradoindependent.com.
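The tally step translates directly to Python's collections.Counter; here's a sketch using the same domain regex on a made-up snippet of text:

```python
import re
from collections import Counter

# Made-up snippet standing in for the extracted email text
text = """Read this: http://www.guardian.co.uk/world/2011 and also
http://www.nytimes.com/2011/08/17/world/africa/something.html plus
http://www.guardian.co.uk/commentisfree/2011 for context.
"""

# Same pattern as the ag command: scheme and host, stopping before the first /
domains = re.findall(r'https?://.+?(?=/)', text, flags=re.IGNORECASE)
print(Counter(domains).most_common())
# [('http://www.guardian.co.uk', 2), ('http://www.nytimes.com', 1)]
```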
And of course, nothing wrong with plucking out a few results from a previous query to check out some tangents, such as, what exactly were they using Wikipedia or New York Times' topics pages as references?
$ ag --noheading --nofilename \
-i 'wikipedia.org|topics.nytimes.com' data/docs/text/