This software compares Wikipedia articles in different languages to determine what information is missing from one article but present in another. The goal is for everyone to have the same access to information no matter what language they speak. This is just one small step in eliminating the digital divide.
The software is intended for comparing Wikipedia articles; however, in its current state, users must copy and paste text into text boxes. Because of this, it can be used to compare any two texts for similarities.
For more information, visit: https://www.grey-box.ca
Python 3+:
https://www.python.org/downloads/
Git:
https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
# go to your operating system's command line interface
# clone repo
git clone https://github.com/aidhayes/project-symmetry
# change directory to project-symmetry
cd project-symmetry
# install required libraries
python3 -m pip install -r requirements.txt
# run program
python3 main.py
Paste the articles you wish to compare into these text boxes. Please note that, at this time, both texts must be in the same language, so use a translation tool such as DeepL to translate one of them first.
We currently offer support for 45 different languages. Select one from the drop-down menu, then click "Select" to change the on-screen display.
First select a comparison tool (currently, only 2 are supported):
Then select a similarity percentage:
The program searches for sentences whose similarity score is greater than or equal to this number. (Note: The program is unlikely to return results if you select a high percentage, due to the nature of the comparison tools. A percentage of ~10% for BLEU Score and ~30% for Sentence-BERT has returned the best results, though feel free to test different values.) Click "Select" to apply the Comparison Tool and Similarity Percentage.
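The threshold logic described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `similarity` function uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer (the real tool uses BLEU or Sentence-BERT), and the names `find_matches`, `article_a`, and `article_b` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer in [0, 1]; the real tool uses BLEU or Sentence-BERT."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def find_matches(source, target, threshold):
    """Return (i, j, score) for every sentence pair scoring >= threshold."""
    matches = []
    for i, s in enumerate(source):
        for j, t in enumerate(target):
            score = similarity(s, t)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Hypothetical sample sentences, already translated into the same language.
article_a = ["Barack Obama was born in Hawaii.", "He served two terms as president."]
article_b = ["Obama was born in Hawaii in 1961.", "He enjoys basketball."]

# A threshold of 0.30 corresponds to selecting 30% in the interface.
print(find_matches(article_a, article_b, threshold=0.30))
```

Raising the threshold shrinks the list of matches, which is why very high percentages tend to return nothing.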
Finally, click "Compare", and the program should highlight sentences in both articles that are similar to each other. Matching colors denote matching sentences.
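One way to give matching sentences a shared color is sketched below. It is an assumption about how such highlighting could work, not a description of the program's internals; `PALETTE`, `assign_colors`, and the color names are hypothetical.

```python
from itertools import cycle

# Hypothetical highlight colors; the real program may use a different set.
PALETTE = ["yellow", "lightgreen", "lightblue", "pink"]


def assign_colors(matches):
    """Map each matched (i, j) pair to a color shared by both sentences."""
    colors_a, colors_b = {}, {}
    palette = cycle(PALETTE)
    for i, j in matches:
        # Reuse a color if either sentence is already highlighted.
        color = colors_a.get(i) or colors_b.get(j) or next(palette)
        colors_a.setdefault(i, color)
        colors_b.setdefault(j, color)
    return colors_a, colors_b


# Sentence 0 of article A matches sentence 0 of article B, and so on.
colors_a, colors_b = assign_colors([(0, 0), (1, 2), (1, 3)])
print(colors_a)
print(colors_b)
```

Because a sentence keeps the first color it receives, one sentence that matches several others pulls them all into the same color group.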
Ex. English v. French Article on Barack Obama:
Note: Some highlighted sections may not be very similar at all; please refer to the disclaimer below. We will try to improve the results so that less human review is needed.
Testing for comparison speed will be done as follows:
- Clean up the formatting of the articles. If this step is skipped, the comparison takes much longer than needed (e.g. 1.5 minutes with cleanup vs. 10 minutes without)
  - Paste without formatting into a Word document
  - Remove the infobox
    - Infoboxes in both articles likely contain roughly the same information, and since the text is unformatted, it is hard to tell where this information comes from
    - Mostly done to cut down on comparison time
  - Preferably remove image captions
  - Remove references
- Keep as few apps open as possible so that more RAM is available for the comparison (tests were run with only VSCode, the comparison app, and Excel open).
- The number of iterations is based on the number of sentences in each article (O(m * n))
  - NLTK splits the articles into sentences using `sent_tokenize`; the sentence count is then obtained with `len()`
- Estimates are based on previous results
- Comparison speed is calculated using the time library
  - Time per comparison = (total time) / (# iterations)
  - Initial estimate of 0.0005 seconds per iteration
- Articles of varying length are used
  - Barack Obama (random selection)
  - Elvis Presley (one of the longest articles according to this site)
  - Boris Johnson (one of the largest articles according to Wikipedia)
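The timing procedure above can be sketched roughly as follows. This is an illustration under stated assumptions: `split_sentences` is a naive regex splitter standing in for NLTK's sentence tokenizer so the example runs without extra downloads, and the function names are hypothetical. Only the 0.0005 seconds-per-iteration figure and the two formulas come from the list above.

```python
import re
import time


def split_sentences(text):
    """Naive splitter standing in for NLTK's sentence tokenizer."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def estimate_seconds(text_a, text_b, seconds_per_iteration=0.0005):
    """Estimate run time as m * n iterations times the per-iteration cost."""
    m, n = len(split_sentences(text_a)), len(split_sentences(text_b))
    return m * n * seconds_per_iteration


def time_per_comparison(compare, sentences_a, sentences_b):
    """Measure (total time) / (# iterations) for a given compare function."""
    start = time.perf_counter()
    for a in sentences_a:
        for b in sentences_b:
            compare(a, b)
    total = time.perf_counter() - start
    iterations = len(sentences_a) * len(sentences_b)
    return total / iterations if iterations else 0.0


# 3 sentences against 2 sentences gives 6 iterations.
print(estimate_seconds("One. Two. Three.", "A. B."))
```

At 0.0005 seconds per iteration, two articles of 400 and 300 sentences would need 120,000 iterations, or about 60 seconds, which is in line with the 1-2 minute range observed for cleaned-up articles.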
If you do not wish to wait roughly 1-2 minutes to compare entire articles, it is recommended that you compare single sections at a time. Doing so provides much faster results per comparison, though more testing is needed to determine whether working section by section saves much time overall.
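A back-of-the-envelope count of sentence pairs shows why section-by-section comparison can be faster. The section lengths below are hypothetical, and the sketch assumes each section is compared only against its counterpart in the other article. It counts comparisons only and ignores fixed overhead such as loading the comparison tool or copying text, so real savings still need to be measured.

```python
def pair_count(lengths_a, lengths_b, by_section):
    """Count sentence pairs for whole-article vs. per-section comparison."""
    if by_section:
        # Each section is compared only with its counterpart.
        return sum(m * n for m, n in zip(lengths_a, lengths_b))
    # Every sentence in A is compared with every sentence in B.
    return sum(lengths_a) * sum(lengths_b)


# Hypothetical articles: four sections of 50 sentences each.
sections_a = [50, 50, 50, 50]
sections_b = [50, 50, 50, 50]

whole = pair_count(sections_a, sections_b, by_section=False)
split = pair_count(sections_a, sections_b, by_section=True)
print(whole, split)
```

Under these assumptions the whole-article run makes 40,000 comparisons and the four section runs make 10,000 in total, so at 0.0005 seconds per iteration the estimate drops from about 20 seconds to about 5 seconds.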
This project uses several NLP libraries to compare text. It is important to note that the results may not always be accurate. Most of these libraries do not take sentence structure and grammar into consideration, so users are advised to double-check that highlighted sections are close enough to each other. The best translations and comparisons will always be made by a real person; however, having someone do this manually would be extremely time-consuming, which is one of the problems this project aims to solve.