Apple Daily Corpus - 蘋果日報語料庫

Text corpus of articles published by Apple Daily between 2002/01/01 and 2021/06/20.

Format

Articles published on the same day are stored in the same csv file. The csv file name is the published date in yyyymmdd format. Each article is stored as one row in the csv.
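The naming rule above can be sketched in Python as a pair of helpers that convert between a date and a file name. This is only an illustration: the function names are hypothetical, and the `corpus` folder and the `.csv` extension are assumptions based on the Build section rather than something this README specifies in detail.

```python
from datetime import date, datetime
from pathlib import Path

# Assumption: the csv files sit directly under the corpus folder.
CORPUS_DIR = Path("corpus")


def csv_path_for(day: date, corpus_dir: Path = CORPUS_DIR) -> Path:
    """Return the expected csv path for a published date (yyyymmdd.csv)."""
    return corpus_dir / f"{day:%Y%m%d}.csv"


def date_from_csv_name(path: Path) -> date:
    """Recover the published date from a file name such as 20210620.csv."""
    return datetime.strptime(Path(path).stem, "%Y%m%d").date()
```

For example, `csv_path_for(date(2021, 6, 20))` gives a path whose file name is `20210620.csv`, and `date_from_csv_name` reverses the mapping.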

Each csv has the following columns:

key

A unique key to identify the row/article.

date

The published date in yyyymmdd format.

article_daily_id

The id of the article within each day. For example, the article with id 0 is the headline of that day.

title

The title of the article. The title is also included at the beginning of the text column, so users who read the text column do not need to scan this column separately.

text

The text of the article. Line endings are not preserved.
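As a rough illustration of how the columns fit together, the sketch below reads rows with Python's standard `csv` module. It assumes each csv starts with a header row naming the five columns and is UTF-8 encoded; neither is confirmed here. The helper names are hypothetical, and `SAMPLE_CSV` is a made-up placeholder row, not real corpus data.

```python
import csv
import io
from typing import Dict, Iterator, Optional, TextIO

# Made-up placeholder content for demonstration only, not real corpus data.
SAMPLE_CSV = (
    "key,date,article_daily_id,title,text\n"
    "example-key,20020101,0,Example title,Example title Body text\n"
)


def read_articles(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per article row, keyed by column name.

    Assumption: the first row of the csv is a header with the column names.
    """
    yield from csv.DictReader(handle)


def body_without_title(row: Dict[str, str]) -> str:
    """Return the text column with the leading title removed, if present."""
    title, text = row["title"], row["text"]
    if title and text.startswith(title):
        return text[len(title):].lstrip()
    return text


def headline(rows) -> Optional[Dict[str, str]]:
    """Return the row whose article_daily_id is 0, or None if there is none."""
    for row in rows:
        if row["article_daily_id"].strip() == "0":
            return row
    return None
```

To read a real file, something like `open("corpus/20020101.csv", encoding="utf-8", newline="")` could be passed to `read_articles`; the path is an example that follows the naming rule above.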

Build

A Makefile is included to build the corpus from the original backup apple-articles-plaintext-20020101-20210620.zip.

To build the corpus:

1. Download apple-articles-plaintext-20020101-20210620.zip from the internet. Unzip it to the root of the repository, keeping the data folder structure.
2. Run make all to build the corpus. It requires xargs, python3 and BeautifulSoup.
3. The csv files are generated under the corpus folder.
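Before running the build, it may help to check that the required tools are available. The sketch below is a hypothetical pre-flight check and not part of the repository. It assumes that "BeautifulSoup" means the BeautifulSoup 4 package, whose import name is `bs4`, and it also checks for `make` itself.

```python
import importlib.util
import shutil

# Commands named in the build steps, plus make itself.
REQUIRED_COMMANDS = ["make", "xargs", "python3"]
# Assumption: BeautifulSoup 4, which is imported as bs4.
REQUIRED_MODULES = ["bs4"]


def missing_prerequisites():
    """Return the names of required commands and modules that cannot be found."""
    missing = [cmd for cmd in REQUIRED_COMMANDS if shutil.which(cmd) is None]
    missing += [
        mod for mod in REQUIRED_MODULES
        if importlib.util.find_spec(mod) is None
    ]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All build prerequisites found.")
```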

Sample using the corpus

Two Spark notebooks are included in the sample folder.

ngram.ipynb

Lists the frequently occurring word combinations (n-grams) in the corpus.
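The notebook itself uses Spark and its exact method is not described here. As a simplified stand-in, the sketch below counts character-level n-grams in plain Python. Working at the character level, skipping whitespace, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def char_ngrams(text: str, n: int) -> Iterator[str]:
    """Yield every run of n consecutive characters that contains no whitespace."""
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if not any(ch.isspace() for ch in gram):
            yield gram


def top_ngrams(texts: Iterable[str], n: int = 2, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent character n-grams across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return counts.most_common(k)
```

Passing the text column of each row to `top_ngrams` would give a small-scale version of the same idea. A word-level variant would first need a Chinese word segmenter, which is outside this sketch.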

sentences.ipynb

Parses the corpus and scans for all the sentences in it.
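How the notebook splits sentences is not documented here. One simple approach, shown below as an assumption-laden sketch, is to split on sentence-ending punctuation. The chosen punctuation set (full-width 。！？ plus ASCII ! and ?) and the function name are illustrative choices. The ASCII period is deliberately left out so that decimals are not split.

```python
import re
from typing import List

# A run of non-terminator characters followed by any trailing terminators.
# Assumption: these five marks are treated as sentence terminators.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the trailing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
```

For a made-up input such as `"今日天晴。你好嗎？好"`, this yields three pieces: `"今日天晴。"`, `"你好嗎？"` and `"好"`.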

Missing articles

Some of the articles are missing from the corpus. They are listed in error.log.

License

The license status of Apple Daily articles is unknown.

Derived works under sample are released under the CC BY 4.0 license.
