pelitk is a Python package that contains implementations of various lexical analysis tools useful for Second Language Acquisition (SLA) work. These modules can be imported and used in Python. At present, there are two modules available:

- conc.py - functions for creating concordances to show selected key words in context
- lex.py - functions for measuring lexical sophistication and diversity using a range of indices
File | File type | Description |
---|---|---|
docs | folder | contains CONC.MD and LEX.MD |
CONC.MD | markdown | describes the conc.py module |
LEX.MD | markdown | describes the lex.py module |
LICENSE.txt | text | GNU General Public License under which pelitk is distributed |
pelitk | folder | contains the data/wordlists folder and the Python modules conc.py and lex.py |
data/wordlists | folder | contains the wordlists required by lex.py |
conc.py | Python script | Python module containing the concordancing functions |
lex.py | Python script | Python module containing the lexical measurement functions |
README.md | markdown | describes pelitk |
requirements.txt | text | lists the Python modules that must be installed for pelitk to function |
setup.py | Python script | contains pelitk information and the code required for installation |
To install pelitk, enter the following at the command line:

```
pip install git+https://github.com/ELI-Data-Mining-Group/pelitk.git@master
```
In addition, lex.py requires the Python modules listed in requirements.txt.
Essentially, a concordance is a list of words or phrases from a text, presented with their immediate contexts. Concordancing has long been an integral part of corpus investigations; as John Sinclair describes:

> "The normal starting point for a corpus investigation is the concordance, which from early days in computing has used the [Key Word In Context (KWIC)] format, where instances of a chosen word or phrase (the NODE) are presented in a layout that aligns occurrences of the node vertically, but otherwise keeps them in the order in which they appear in the corpus."
>
> Sinclair (2003, xiii)
conc.py creates a concordance list based on key words in a text, with options that allow for greater user flexibility. In the example usage below, a short text of two sentences has been tokenized (split into a list of strings) to analyze the key word platypus. The output (presented in two formats) demonstrates how concordance lines provide a useful format for quickly seeing how a word (or phrase) is used in different contexts.
```python
>>> from pelitk import conc
>>> tok_text = ['The', 'key', 'word', 'in', 'this', 'text', 'is', 'the', 'noun', 'platypus', '.',
...             'I', 'want', 'to', 'see', 'the', 'cotext', 'every', 'time', 'the', 'word', 'platypus', 'occurs', '.']
>>> print(conc.concordance(tok_text, 'platypus', 5))
[('this text is the noun', 'platypus', '. I want to see'),
 ('cotext every time the word', 'platypus', 'occurs . ')]
>>> print(conc.concordance(tok_text, 'platypus', 5, pretty=True))
[' this text is the noun platypus . I want to see ',
 ' cotext every time the word platypus occurs . ']
```
Looking at the function more closely, we see that there are three required arguments and two optional arguments:
Argument | Description |
---|---|
tok_text | a list of tokenized text or list of tuples, e.g. ['the','word'] or [('the', 'DT'), ('word', 'NN')] |
node | the node word or tuple that will be the focus of the concordance lines |
num | the size of the collocation span, i.e. how many words on either side of the node |
pos | optional True/False argument (default is False). Set to True if tok_text is a list of (token, POS) tuples (see the tok_text example above) |
pretty | optional True/False argument (default is False). If True, the output is formatted so that all the node words are aligned in each row and joined in a single string. |
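For instance, with a POS-tagged text, the node is given as a (token, tag) tuple and pos is set to True. The snippet below is only a sketch based on the argument table above: the tagged sentence is invented for illustration, and the exact output format is described in CONC.md.

```python
>>> from pelitk import conc
>>> tagged = [('The', 'DT'), ('noun', 'NN'), ('here', 'RB'), ('is', 'VBZ'),
...           ('platypus', 'NN'), ('.', '.')]
>>> conc.concordance(tagged, ('platypus', 'NN'), 2, pos=True)
```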
Returning to the example, we see that a span of 5 words on either side of the key word (or node) has been selected; this is a common span size, but it could be increased to allow for greater context. The second output shows the difference when the pretty argument is set to True: the 'pretty' format is easier to scan visually, but more difficult to process further.
It is also possible to use conc.py with a list of key words rather than a single key word. For a demonstration of how to do so, see the PELIC_concordancing_tutorial, which compiles a concordance list from a list of nine different verbs.
For more example code and a full description of the functions (including their arguments and sub-functions), see CONC.md and conc.py.
There are a number of quantitative measures used for understanding and describing lexical proficiency and development. In particular, many researchers have focused on lexical sophistication (the variation in ‘basic’ and ‘advanced’ words used in a text) and lexical diversity (the percentage of unique words in a text). For a complete discussion of lexical proficiency, see Leńko-Szymańska (2019). lex.py provides functions to calculate a number of the more commonly used metrics of sophistication and diversity, summarized briefly below.
For example code and a full description of the functions (including their arguments and sub-functions), see LEX.md and lex.py.
adv_guiraud
Calculates Advanced Guiraud (AG):
- measure of lexical sophistication
- formula = advanced types / sqrt(number of tokens)
- by default, the function uses the NGSL top 2,000 words as the list of common types to ignore; other lists can optionally be used instead
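To illustrate the formula (this is a sketch, not lex.py's actual implementation), AG can be computed in a few lines; common_words here is a hypothetical stand-in for the NGSL top 2,000 list:

```python
import math

def advanced_guiraud(tokens, common_words):
    # 'Advanced' types are those not found in the common-word list
    advanced_types = {t.lower() for t in tokens} - set(common_words)
    return len(advanced_types) / math.sqrt(len(tokens))
```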
vocd
Calculates vocD:
- measure of lexical diversity
- formula = TTR is calculated for a number of random sub-samples, a curve is fitted to the results, and the curve's parameter value (D) is reported
- by default, a minimum text length of 35 words (the default sub-sample size) is required, though this can optionally be adjusted
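A rough sketch of this procedure, assuming the TTR curve commonly used for vocD, TTR = (D/N)(sqrt(1 + 2N/D) - 1); the sample sizes, trial count, and fitting details here are assumptions and may differ from lex.py's implementation:

```python
import random
import numpy as np
from scipy.optimize import curve_fit

def _ttr_curve(n, d):
    # Expected TTR for a sample of n tokens, given diversity parameter D
    return (d / n) * (np.sqrt(1 + 2 * n / d) - 1)

def vocd(tokens, sizes=range(35, 51), trials=100):
    # The text must be at least as long as the largest sample size used
    ns, ttrs = [], []
    for n in sizes:
        for _ in range(trials):
            sample = random.sample(tokens, n)  # random sub-sample without replacement
            ns.append(n)
            ttrs.append(len(set(sample)) / n)
    # Fit the curve to the observed TTRs and report the parameter value (D)
    (d,), _ = curve_fit(_ttr_curve, np.array(ns), np.array(ttrs), p0=[50.0])
    return d
```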
ttr
Calculates Type-Token Ratio (TTR):
- simple measure of lexical diversity
- formula = number of types / number of tokens in a text
- practical to calculate but sensitive to text length (shorter texts have higher TTR)
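The formula is simple enough to state directly (a sketch, not lex.py's code):

```python
def ttr(tokens):
    # Unique word forms (types) divided by total words (tokens)
    return len(set(tokens)) / len(tokens)
```

For example, ttr(['the', 'cat', 'sat', 'on', 'the', 'mat']) returns 5/6 ≈ 0.83, since the text has five types but six tokens.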
mtld
Calculates Measure of Textual Lexical Diversity (MTLD):
- measure of lexical diversity
- formula = the text is analyzed sequentially, and the score reflects the mean length of word sequences ('factors') that maintain a TTR above a set threshold
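The usual MTLD procedure can be sketched as a single forward pass through the text; the threshold value and the handling of the final partial factor below follow the common description of the measure and are assumptions about lex.py's actual code (the full measure typically averages a forward and a backward pass):

```python
def mtld_forward(tokens, threshold=0.72):
    # Count 'factors': stretches of text whose running TTR stays above the threshold
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok.lower())
        if len(types) / count <= threshold:
            factors += 1
            types, count = set(), 0
    if count:
        # Credit a partial factor for the leftover stretch at the end of the text
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float('nan')
```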
maas
Calculates Maas (log 2):
- measure of lexical diversity
- formula = TTR with log correction
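Assuming the standard formulation of the Maas index, a^2 = (log N - log V) / (log N)^2, where N is the number of tokens and V the number of types (lower scores indicate greater diversity), a minimal sketch:

```python
import math

def maas(tokens):
    n, v = len(tokens), len(set(tokens))
    # (log N - log V) / (log N)^2; lower values = more diverse vocabulary
    return (math.log(n) - math.log(v)) / math.log(n) ** 2
```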