Text corpus of articles published by Apple Daily between 2002/01/01 and 2021/06/20.
Articles published on the same day are stored in the same csv file. The csv file name is the publication date in yyyymmdd format. Each article is stored as one row in the csv.
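As a small illustration of the naming scheme, the sketch below maps a date to the file that should hold its articles. The helper name corpus_path is hypothetical, and both the corpus folder as the default root and the .csv extension are assumptions about the generated layout.

```python
from datetime import date
from pathlib import Path


def corpus_path(day: date, root: str = "corpus") -> Path:
    """Return the csv path for `day`, assuming files are named yyyymmdd.csv."""
    # date.__format__ accepts strftime codes, so this yields e.g. 20210620.
    return Path(root) / f"{day:%Y%m%d}.csv"


# Example: the last day covered by the corpus.
print(corpus_path(date(2021, 6, 20)).name)
```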
Each csv has the following columns:
- A unique key that identifies the row/article.
- The publication date in yyyymmdd format.
- The id of the article within each day. For example, the article with id 0 is the headline of that day.
- The title of the article. The title is also included at the beginning of the text column, so users who read the text column do not need to scan this column.
- The text of the article. Line endings are not preserved.
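A minimal sketch of reading one day's articles with the Python standard library follows. It assumes each csv starts with a header row and is UTF-8 encoded; neither is stated above, so adjust as needed. The function name read_articles is hypothetical.

```python
import csv
from typing import Dict, Iterator


def read_articles(csv_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per article row, keyed by the csv header names."""
    # newline="" lets the csv module handle quoting and line endings itself.
    # Assumption: the file is UTF-8 encoded and has a header row.
    with open(csv_path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)
```

Each yielded row is a plain dict, so if the text column is named text (an assumption based on the description above), an article body would be available as row["text"].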
A Makefile is included to build the corpus from the original backup apple-articles-plaintext-20020101-20210620.zip.
To build the corpus:
- Download apple-articles-plaintext-20020101-20210620.zip from the internet. Unzip it to the root of the repository, keep the data folder structure.
- Run make all to build the corpus. It requires xargs, python3 and BeautifulSoup.
- csv files are generated under the corpus folder.
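Before running the build, it may help to confirm that the listed requirements are available. The sketch below is not part of the repository; the function name missing_prerequisites is hypothetical. It relies on BeautifulSoup being importable as bs4, which is how the beautifulsoup4 package is normally exposed.

```python
import importlib.util
import shutil
from typing import List


def missing_prerequisites() -> List[str]:
    """Return the names of build prerequisites that cannot be found."""
    missing = []
    # Command-line tools used by the build must be on PATH.
    for tool in ("make", "xargs", "python3"):
        if shutil.which(tool) is None:
            missing.append(tool)
    # BeautifulSoup is checked by looking up its import name, bs4.
    if importlib.util.find_spec("bs4") is None:
        missing.append("BeautifulSoup")
    return missing


if __name__ == "__main__":
    print(missing_prerequisites() or "all prerequisites found")
```

Note that this checks the Python interpreter running the script, which may differ from the python3 that make invokes.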
Two Spark notebooks are included in the sample folder.
- Lists all frequently occurring word combinations in the corpus.
- Parses and scans for all the sentences in the corpus.
Some articles are missing from the corpus. They are listed in error.log.
The license status of Apple Daily articles is unknown.
Derived works under sample are released under the CC BY 4.0 license.