Skip to content

interscript/geonames-transliteration-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GeoNames data for transliteration testing

Build Status

Purpose

Extract transliteration pairs (entries that are coded with transliteration systems) for testing of those transliteration systems.

LAST UPDATED: See releases page.

Usage

Basic

make all

Achieves all the steps below.

Check latest release date of GNDB

Due to recent instabilities of pages that serve GNDB releases, the make checkdate target is created in order to fetch the latest release date (once or multiple times) to ensure that a consistent version is served across all GNDB load balancers.

# Check once
make checkdate

# Check 10 times
CHECKDATE_TIMES=10 make checkdate

All dates displayed should be identical. If interleaving entries are shown, there is an issue with endpoint consistency.

Create the GeoNames database for filtering

Create a SQLite3 database using the “all countries” GeoNames data set.

make geonames.db

Internally it will run this:

$ curl -sL http://geonames.nga.mil/gns/html/cntyfile/geonames_20200106.zip -o geonames.zip
$ unzip geonames.zip -d data
$ sqlite
sqlite> .mode tabs
sqlite> .import geonames_20200106/Countries.txt countries
sqlite> .backup geonames.db
sqlite> exit

Extract all transliteration pairs first

make geonames_pairs.db

You will get geonames_pairs.csv which contains all transliteration pairs.

TRANSL_CD|SRC_UNI|SRC_NT|SRC_FULL_NAME_RO|SRC_FULL_NAME_RG|DEST_UNI|DEST_NT|DEST_FULL_NAME_RO|DEST_FULL_NAME_RG|LC
zho_Hani2Latn_WDG_1979|475320|D|Qiantangjiang Zhan|Qiantangjiang Zhan|12277331|DS|钱塘江站|钱塘江站|zho
zho_Hani2Latn_WDG_1979|14673196|V|Tung-men Chiao|Tung-men Chiao|14633249|VS|东门礁|东门礁|zho
zho_Hani2Latn_WDG_1979|14673197|V|Hsi-men Chiao|Hsi-men Chiao|14633249|VS|东门礁|东门礁|zho
...

Extracting for every transliteration system

make all

Your output files will be stored in pairs/${translit_system}.csv and look like this:

TRANSL_CD,SRC_UNI,SRC_NT,SRC_FULL_NAME_RO,SRC_FULL_NAME_RG,DEST_UNI,DEST_NT,DEST_FULL_NAME_RO,DEST_FULL_NAME_RG,LC
ara_Arab2Latn_BUC_2002,17550392,N,"Wādī al Khuḑrah","Wādī al Khuḑrah",17550393,NS,"وادي الخضرة","وادي الخضرة",ara
ara_Arab2Latn_BUC_2002,-3489887,D,"Wādī ad Dafīn","Wādī ad Dafīn",14461283,DS,"وادي الدفين","وادي الدفين",ara
ara_Arab2Latn_BUC_2002,18101655,N,"Shi‘b Ḩāzirin","Ḩāzirin, Shi‘b",18101656,NS,"شعب حازرن","شعب حازرن",ara
ara_Arab2Latn_BUC_2002,14363956,N,"Būghāz Raqm Ithnān","Raqm Ithnān, Būghāz",14305791,NS,"بوغاز رقم ٢","بوغاز رقم ٢",ara
...

Column description

TRANSL_CD

The transliteration system used to generate the DEST_* name.

LC

Language code.

{SRC, DEST}_UNI

ID of the source/destination name.

{SRC, DEST}_NT

Type of source/destination name.

SRC_FULL_NAME_{RO,RG}

Source name. RO ⇒ full name. RG ⇒ name with grouped feature, e.g. "Lake Nemo" in RO will be formatted as "Nemo, Lake" in RG.

NT column values

In the *_NT columns, rows with values DS, NS, VS are always the name source, the rest are generated.

NT_CD DESCRIPTION DEFINITION

C

Conventional

A commonly used English-language name approved by the U.S. Board on Geographic Names (BGN) for use in addition to, or in lieu of, a BGN-approved local official name or names, e.g., Rome, Alps, Danube River.

D

Unverified

A name from a source whose official status can not be verified by the BGN.

DS

Unverified Non-Roman Script

The non-Roman script form of a name from a source whose official status can not be verified by the BGN.

N

Approved

The BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country.

NS

Non-Roman Script

The non-Roman script form of the BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country.

P

Provisional

A geographic name of an area for which the territorial status is not finally determined or not recognized by the United States.

V

Variant

A former name, name in local usage, or other spelling found on various sources.

VA

Anglicized Variant

An English-language name that is derived by modifying the local official name to render it more accessible or meaningful to an English-language user.

VS

Variant Non-Roman Script

The non-Roman script form of a former name, name in local usage, or other spelling found on various sources.

Credits

Copyright Ribose.