Commit

Merge branch 'main' into master
oskar700 authored Jan 8, 2024
2 parents e24f8a0 + 36933e2 commit aa993a9
Showing 12 changed files with 53 additions and 57 deletions.
3 changes: 3 additions & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Default code owner

* @Senzing/senzing-gdev
6 changes: 6 additions & 0 deletions .github/pull.yml
@@ -0,0 +1,6 @@
version: "1"
rules: # Array of rules
  - base: main # Required. Target branch
    upstream: openvenues:master # Required. Must be in the same fork network.
    mergeMethod: squash # Optional, one of [none, merge, squash, rebase, hardreset], Default: none.
    label: ":arrow_heading_down: pull" # Optional
24 changes: 24 additions & 0 deletions .github/workflows/issue-automation.yaml
@@ -0,0 +1,24 @@
---

name: 'issue automation'

on:
  issues:
    types:
      - reopened
      - opened

jobs:
  add-issue-to-community:
    uses: Senzing/build-resources/.github/workflows/add-to-project.yaml@main
    with:
      project-number: "9"
      classic: true
    secrets:
      SENZING_GITHUB_ACCESS_TOKEN: ${{ secrets.SENZING_GITHUB_ACCESS_TOKEN }}

  add-issue-labels:
    uses: Senzing/build-resources/.github/workflows/add-labels-to-issue.yaml@main
    secrets:
      ORG_MEMBERSHIP_TOKEN: ${{ secrets.ORG_MEMBERSHIP_TOKEN }}
      SENZING_MEMBERS: ${{ secrets.SENZING_MEMBERS }}
6 changes: 3 additions & 3 deletions .github/workflows/test.yml
@@ -2,9 +2,9 @@ name: Test

 on:
   push:
-    branches: [master]
+    branches: [main]
   pull_request:
-    branches: [master]
+    branches: [main]
   workflow_dispatch:

 jobs:
@@ -14,7 +14,7 @@ jobs:
         os: [ubuntu-latest, macos-latest]
     runs-on: ${{ matrix.os }}
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
       - name: Install Dependencies Linux
         if: matrix.os == 'ubuntu-latest'
         run: |
2 changes: 2 additions & 0 deletions .gitignore
@@ -82,3 +82,5 @@ docs/_build/

 # PyBuilder
 target/
+
+.history
13 changes: 13 additions & 0 deletions PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,13 @@
# Pull request questions

## Which issue does this address

Issue number: #nnn

## Why was change needed

???

## What does change improve

???
18 changes: 0 additions & 18 deletions README.md
@@ -177,24 +177,6 @@ If you require a .lib import library to link this to your application. You can g
 lib.exe /def:libpostal.def /out:libpostal.lib /machine:x64
 ```

-Installation with an alternative data model
--------------------------------------------
-
-An alternative data model is available for libpostal. It is created by Senzing Inc. for improved parsing on US, UK and Singapore addresses and improved US rural route address handling.
-To enable this add `MODEL=senzing` to the conigure line during installation:
-```
-./configure --datadir=[...some dir with a few GB of space...] MODEL=senzing
-```
-
-The data for this model is gotten from [OpenAddress](https://openaddresses.io/), [OpenStreetMap](https://www.openstreetmap.org/) and data generated by Senzing based on customer feedback (a few hundred records), a total of about 1.2 billion records of data from over 230 countries, in 100+ languages. The data from OpenStreetMap and OpenAddress is good but not perfect so the data set was modified by filtering out badly formed addresses, correcting misclassified address tokens and removing tokens that didn't belong in the addresses, whenever these conditions were encountered.
-
-Senzing created a data set of 12950 addresses from 89 countries that it uses to test and verify the quality of its models. The data set was generated using random addresses from OSM, minimally 50 per country. Hard-to-parse addresses were gotten from Senzing support team and customers and from the libpostal github page and added to this set. The Senzing model got 4.3% better parsing results than the default model, using this test set.
-
-The size of this model is about 2.2GB compared to 1.8GB for the default model so keep that in mind if storages space is important.
-
-Further information about this data model can be found at: https://github.com/Senzing/libpostal-data
-If you run into any issues with this model, whether they have to do with parses, installation or any other problems, then please report them at https://github.com/Senzing/libpostal-data
-
 Examples of parsing
 -------------------

15 changes: 0 additions & 15 deletions configure.ac
@@ -60,17 +60,6 @@ AC_SUBST([LIBPOSTAL_DATA_FILE_LATEST_VERSION], [$DATA_FILE_LATEST_VERSION])
 AC_SUBST([LIBPOSTAL_PARSER_MODEL_LATEST_VERSION], [$PARSER_MODEL_LATEST_VERSION])
 AC_SUBST([LIBPOSTAL_LANG_CLASS_MODEL_LATEST_VERSION], [$LANG_CLASS_MODEL_LATEST_VERSION])

-# Senzing data
-AC_SUBST([LIBPOSTAL_SENZING_DATA_DIR_VERSION_STRING], [v1])
-
-SENZING_DATA_FILE_LATEST_VERSION=$(cat $srcdir/versions/senzing/base_data)
-SENZING_PARSER_MODEL_LATEST_VERSION=$(cat $srcdir/versions/senzing/parser)
-SENZING_LANG_CLASS_MODEL_LATEST_VERSION=$(cat $srcdir/versions/senzing/language_classifier)
-
-AC_SUBST([LIBPOSTAL_SENZING_DATA_FILE_LATEST_VERSION], [$SENZING_DATA_FILE_LATEST_VERSION])
-AC_SUBST([LIBPOSTAL_SENZING_PARSER_MODEL_LATEST_VERSION], [$SENZING_PARSER_MODEL_LATEST_VERSION])
-AC_SUBST([LIBPOSTAL_SENZING_LANG_CLASS_MODEL_LATEST_VERSION], [$SENZING_LANG_CLASS_MODEL_LATEST_VERSION])
-
 AC_CONFIG_FILES([Makefile
 libpostal.pc
 src/Makefile
@@ -108,10 +97,6 @@ AC_ARG_ENABLE([data-download],
 *) AC_MSG_ERROR([bad value ${enableval} for --disable-data-download]) ;;
 esac], [DOWNLOAD_DATA=true])

-AC_ARG_VAR(MODEL, [Option to use alternative data models. Currently available is "senzing" (MODEL=senzing). If this option is not set the default libpostal data model is used.])
-AS_VAR_IF([MODEL], [], [],
-[AS_VAR_IF([MODEL], [senzing], [], [AC_MSG_FAILURE([Invalid MODEL value set])])])
-
 AM_CONDITIONAL([DOWNLOAD_DATA], [test "x$DOWNLOAD_DATA" = "xtrue"])

 AC_ARG_WITH(cflags-scanner-extra, [AS_HELP_STRING([--with-cflags-scanner-extra@<:@=VALUE@:>@], [Extra compilation options for scanner.c])],
20 changes: 2 additions & 18 deletions src/libpostal_data.in
@@ -14,12 +14,10 @@ LIBPOSTAL_DATA_DIR=$3
 MB=$((1024*1024))
 CHUNK_SIZE=$((64*$MB))

-DATAMODEL="@MODEL@"
-
 # Not loving this approach but there appears to be no way to query the size
 # of a release asset without using the Github API
 LIBPOSTAL_DATA_FILE_CHUNKS=1
-LIBPOSTAL_PARSER_MODEL_CHUNKS=12
+LIBPOSTAL_PARSER_MODEL_CHUNKS=1
 LIBPOSTAL_LANG_CLASS_MODEL_CHUNKS=1

 LIBPOSTAL_DATA_DIR_VERSION_STRING="@LIBPOSTAL_DATA_DIR_VERSION_STRING@"
@@ -34,21 +32,7 @@ LIBPOSTAL_DATA_FILE="libpostal_data.tar.gz"
 LIBPOSTAL_PARSER_FILE="parser.tar.gz"
 LIBPOSTAL_LANG_CLASS_FILE="language_classifier.tar.gz"

-LIBPOSTAL_BASE_URL="https://github.com/$LIBPOSTAL_REPO_NAME/releases/download"
-
-if [ "$DATAMODEL" = "senzing" ]; then
-LIBPOSTAL_DATA_FILE_CHUNKS=1
-LIBPOSTAL_PARSER_MODEL_CHUNKS=1
-LIBPOSTAL_LANG_CLASS_MODEL_CHUNKS=1
-
-LIBPOSTAL_DATA_DIR_VERSION_STRING="@LIBPOSTAL_SENZING_DATA_DIR_VERSION_STRING@"
-
-LIBPOSTAL_DATA_FILE_LATEST_VERSION="@LIBPOSTAL_SENZING_DATA_FILE_LATEST_VERSION@"
-LIBPOSTAL_PARSER_MODEL_LATEST_VERSION="@LIBPOSTAL_SENZING_PARSER_MODEL_LATEST_VERSION@"
-LIBPOSTAL_LANG_CLASS_MODEL_LATEST_VERSION="@LIBPOSTAL_SENZING_LANG_CLASS_MODEL_LATEST_VERSION@"
-
-LIBPOSTAL_BASE_URL="https://public-read-libpostal-data.s3.amazonaws.com"
-fi
+LIBPOSTAL_BASE_URL="https://public-read-libpostal-data.s3.amazonaws.com"

 LIBPOSTAL_DATA_VERSION_FILE=$LIBPOSTAL_DATA_DIR/data_version
 LIBPOSTAL_DATA_DIR_VERSION=
1 change: 0 additions & 1 deletion versions/senzing/base_data

This file was deleted.

1 change: 0 additions & 1 deletion versions/senzing/language_classifier

This file was deleted.

1 change: 0 additions & 1 deletion versions/senzing/parser

This file was deleted.
