Rename to AcrosticSleuth
Dargones committed Jul 29, 2024
1 parent b89b837 commit 7b04914
Showing 2 changed files with 24 additions and 24 deletions.
46 changes: 23 additions & 23 deletions README.md
@@ -1,47 +1,47 @@
# AcrosticScout
# AcrosticSleuth

AcrosticScout is a program for identifying and ranking acrostics.
AcrosticSleuth is a program for identifying and ranking acrostics.
At a high level, the tool works by comparing the probability of random occurrence with the probability that a sequence of characters forms a meaningful word or phrase in the target language.
AcrosticScout is optimized to quickly process gigabytes of text.
With the help of AcrosticScout, we have been able to discover multiple previously unknown acrostics, including the signature of the English philosopher Thomas Hobbes in *The Elements of Law* (THOMAS[OF]HOBBES).
AcrosticSleuth is optimized to quickly process gigabytes of text.
With the help of AcrosticSleuth, we have been able to discover multiple previously unknown acrostics, including the signature of the English philosopher Thomas Hobbes in *The Elements of Law* (THOMAS[OF]HOBBES).
You can read more about the methodology in our upcoming paper ([preprint]()).
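
As a rough illustration of the comparison described above, the sketch below divides a language-model probability by the probability that the same letters occur by chance. Everything in it is an assumption made for illustration: the class and method names, the toy letter frequencies, the hard-coded language-model probabilities, and the use of a simple ratio as the score. It is not the scoring formula the tool itself uses.

```java
import java.util.Map;

public class AcrosticScoreSketch {

    // Probability that independently drawn initial letters spell `s`,
    // given (hypothetical) frequencies of each letter in initial position.
    static double randomProbability(String s, Map<Character, Double> initialFreq) {
        double p = 1.0;
        for (char c : s.toCharArray()) {
            // Unseen letters get a small floor value (an arbitrary choice).
            p *= initialFreq.getOrDefault(c, 1e-6);
        }
        return p;
    }

    // Illustrative score: how much more likely the string is under a
    // language model than under random occurrence. Higher means "more word-like".
    static double score(double languageModelProbability, String s,
                        Map<Character, Double> initialFreq) {
        return languageModelProbability / randomProbability(s, initialFreq);
    }

    public static void main(String[] args) {
        // Toy numbers, invented for this example.
        Map<Character, Double> freq = Map.of('a', 0.1, 'c', 0.05, 't', 0.2, 'x', 0.01);

        double word = score(1e-4, "cat", freq);      // plausible word
        double gibberish = score(1e-9, "xtc", freq); // implausible string

        System.out.println("cat ranks above xtc: " + (word > gibberish));
    }
}
```

In this toy setup a real word outranks a random string of the same length because its language-model probability is far higher relative to its chance probability.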

### Table of contents
- [What languages does AcrosticScout support?](#what-languages-does-acrosticscout-support)
- [How to install and use AcrosticScout?](#how-to-install-and-use-acrosticscout)
- [What languages does AcrosticSleuth support?](#what-languages-does-acrosticsleuth-support)
- [How to install and use AcrosticSleuth?](#how-to-install-and-use-acrosticsleuth)
- [Hello World example](#hello-world-example)
- [How was AcrosticScout evaluated?](#how-was-acrosticscout-evaluated)
- [How was AcrosticSleuth evaluated?](#how-was-acrosticsleuth-evaluated)
- [How to reproduce our results?](#how-to-reproduce-our-results)
- [How to cite this?](#how-to-cite-this)

## What languages does AcrosticScout support?
AcrosticScout currently supports **English, French, Russian, and Latin**.
The only language-specific component of AcrosticScout is the unigram language model produced by [sentencepiece](https://github.com/google/sentencepiece).
Support for new languages can, therefore, be easily added -- please [make an issue](https://github.com/acrostics/acrostic-scout/issues/new) here on GitHub if you wish to use AcrosticScout with another language.
## What languages does AcrosticSleuth support?
AcrosticSleuth currently supports **English, French, Russian, and Latin**.
The only language-specific component of AcrosticSleuth is the unigram language model produced by [sentencepiece](https://github.com/google/sentencepiece).
Support for new languages can, therefore, be easily added -- please [make an issue](https://github.com/acrostics/acrostic-sleuth/issues/new) here on GitHub if you wish to use AcrosticSleuth with another language.

## How to install and use AcrosticScout?
## How to install and use AcrosticSleuth?

To run AcrosticScout, you need a Java Development Kit (JDK) installed on your machine.
We have tested AcrosticScout on Mac OS and Linux.
To run AcrosticSleuth, you need a Java Development Kit (JDK) installed on your machine.
We have tested AcrosticSleuth on Mac, Mac-Arm, Ubuntu, and Windows [as part of our CI](.github/workflows/main.yml).

First, compile the code from the base directory using:

```bash
javac -cp src -encoding UTF-8 src/acrostics/*.java
```

Then run AcrosticScout using the command below, replacing `INPUT` and `LANG` with the name of the directory that contains the dataset you wish AcrosticScout to analyze and the language of that dataset, respectively:
Then run AcrosticSleuth using the command below, replacing `INPUT` and `LANG` with the name of the directory that contains the dataset you wish AcrosticSleuth to analyze and the language of that dataset, respectively:

```bash
java -cp src acrostics.Main -input INPUT -language LANG
```

AcrosticScout accepts multiple optional command line arguments (thank you, [picocli](https://github.com/remkop/picocli/tree/v4.7.6)) -- run the tool with the `--help` flag to get the up-to-date list of all available options.
AcrosticSleuth accepts multiple optional command line arguments (thank you, [picocli](https://github.com/remkop/picocli/tree/v4.7.6)) -- run the tool with the `--help` flag to get the up-to-date list of all available options.

## Hello World example

This repository includes an example dataset comprising a subset of pages with acrostics from the English subdomain of the WikiSource database (see [How was AcrosticScout evaluated?](#how-was-acrosticscout-evaluated)).
You can test AcrosticScout on this small dataset using:
This repository includes an example dataset comprising a subset of pages with acrostics from the English subdomain of the WikiSource database (see [How was AcrosticSleuth evaluated?](#how-was-acrosticsleuth-evaluated)).
You can test AcrosticSleuth on this small dataset using:

```bash
java -cp src acrostics.Main -input data/example -language EN -mode LINE -charset utf-8 -outputSize 4000 --concise
@@ -52,7 +52,7 @@ Here is the meaning behind each of the options used:
- `-language EN`: use the default English language model
- `-mode LINE`: search for line acrostics (where an acrostic is formed by the initial letters of each line)
- `-charset utf-8`: use the utf-8 encoding when opening the files
- `-outputSize 4000`: return top 4000 instances (AcrosticScout clusters collocated instances, so the actual number of results it returns is much smaller -- 46)
- `-outputSize 4000`: return top 4000 instances (AcrosticSleuth clusters collocated instances, so the actual number of results it returns is much smaller -- 46)
- `--concise`: only report key information (file,acrostic,rank).

Specifically, you should be getting the following output (highest ranked acrostics appear at the bottom of the list):
@@ -108,10 +108,10 @@ data/example/The PearlVolume 18Acrostic.txt cunt_is_sweet_when_young_and_ten
data/example/The Confessions of William-Henry Ireland.txt warwick_at_dudley_at_southampton_at_rivers_at_shakspeare 7.6181055E+27
```
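
To make the `-mode LINE` option more concrete, here is a minimal sketch of how the candidate string for a line acrostic could be collected: take the first letter of every line. The class name `LineAcrosticSketch` and the helper `initials` are hypothetical and are not part of the tool's source; the real implementation may tokenize and filter lines differently.

```java
import java.util.List;

public class LineAcrosticSketch {

    // Returns the lowercase first letter of every line that contains a letter.
    // Lines without any letter (blank lines, pure punctuation) are skipped.
    static String initials(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            for (int i = 0; i < line.length(); ) {
                int cp = line.codePointAt(i);
                if (Character.isLetter(cp)) {
                    sb.appendCodePoint(Character.toLowerCase(cp));
                    break; // only the first letter of the line matters
                }
                i += Character.charCount(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A made-up five-line poem whose initials spell a word.
        List<String> poem = List.of(
            "Hear the quiet morning,",
            "Even the birds wait;",
            "Light comes slowly,",
            "Lifting the fog,",
            "Over the hills.");
        System.out.println(initials(poem)); // prints "hello"
    }
}
```

Working on code points rather than `char` values keeps the sketch usable for non-Latin scripts such as Russian.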

## How was AcrosticScout evaluated?
## How was AcrosticSleuth evaluated?

We have created the [Acrostic Identification Task Dataset](https://github.com/acrostics/acrostic-identification-task-dataset) by manually identifying all poems explicitly referred to or formatted as acrostics on English, Russian, and French subdomains of [WikiSource](https://en.wikisource.org/wiki/Main_Page), an online library of source texts in the public domain.
AcrosticScout reaches recall of over 50% within the first 100 results it returns for English and Russian, and recall rises to as much as 80% when considering more results.
AcrosticSleuth reaches recall of over 50% within the first 100 results it returns for English and Russian, and recall rises to as much as 80% when considering more results.
Read more in our [paper]():

![](RecallFigure.svg)
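
For readers unfamiliar with the metric, recall within the first *k* results is the fraction of known acrostics that appear among the top *k* ranked outputs. The sketch below computes it for a toy ranking. The names `RecallAtKSketch` and `recallAtK` are hypothetical, the identifiers are invented, and the list is assumed to be ordered best-first (the reverse of the tool's printed output); the evaluation script in this repository may compute recall differently.

```java
import java.util.List;
import java.util.Set;

public class RecallAtKSketch {

    // Fraction of known acrostics found among the first k ranked results.
    // `ranked` is assumed to be ordered from best to worst.
    static double recallAtK(List<String> ranked, Set<String> known, int k) {
        if (known.isEmpty()) {
            return 0.0;
        }
        long hits = ranked.stream()
                .limit(k)
                .distinct()              // count each result once
                .filter(known::contains)
                .count();
        return (double) hits / known.size();
    }

    public static void main(String[] args) {
        // Toy identifiers standing in for (file, acrostic) pairs.
        List<String> ranked = List.of("a", "x", "b", "y", "c");
        Set<String> known = Set.of("a", "b", "c", "d");

        System.out.println(recallAtK(ranked, known, 2)); // prints 0.25
        System.out.println(recallAtK(ranked, known, 5)); // prints 0.75
    }
}
```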
@@ -131,9 +131,9 @@ First, clone this directory with the `--recursive` flag, so that it also include
Next, follow the directions for [downloading and setting up the Acrostic Identification Task Dataset](https://github.com/acrostics/acrostic-identification-task-dataset/blob/main/README.md), which is cloned as a submodule for this repository in the `data` directory.
Make sure to run the [get_data.sh](https://github.com/acrostics/acrostic-identification-task-dataset/blob/main/get_data.sh) script as discussed in the README linked above.

Finally, to run AcrosticScout on the dataset and measure its recall, run [data/evaluate_on_acrostics-identification-task-dataset.sh](data/evaluate_on_acrostics-identification-task-dataset.sh).
Finally, to run AcrosticSleuth on the dataset and measure its recall, run [data/evaluate_on_acrostics-identification-task-dataset.sh](data/evaluate_on_acrostics-identification-task-dataset.sh).
The script will save the output files in the `output` directory and produce a `recall.png` figure that plots the recall graph you see above and in the paper.

## How to cite this?

Fedchin, A., Cooperman, I., Chaudhuri, P., Dexter, J.P. 2024 "AcrosticScout: Differentiating True Acrostics from Random Noise in Multilingual Corpora Using Probabilistic Ranking". Forthcoming
Fedchin, A., Cooperman, I., Chaudhuri, P., Dexter, J.P. 2024 "AcrosticSleuth: Differentiating True Acrostics from Random Noise in Multilingual Corpora Using Probabilistic Ranking". Forthcoming
2 changes: 1 addition & 1 deletion data/acrostic-identification-task-dataset