Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for importing parent property. #97

Merged

Conversation

mansoor-sajjad
Copy link
Contributor

@mansoor-sajjad mansoor-sajjad commented Jun 21, 2022

👋

Here's the reason for this change 🚀

We have data from multiple sources with the custom data formats like NeTEx, SOSI, GeoJson, CSV etc.
In order to import that data to Pelias elasticsearch, we have decided to convert the data to single format and store it in our cloud storage which we will be importing to the Pelias elasticsearch later on.
Instead of writing aur own importer / importers, we have decided to convert the data from all the sources to CSV and then use the pelias-csv-importer to import it in pelias elasticsearch.
The pelias-csv-importer is somewhat limited in supporting the fields that can be imported through it, although the pelias-model support alot of fields.
We want to contribute to the pelias-csv-importer to support more fields.


Here's what actually got changed 👏

In this PR we are adding the support to import the parent field. We will be adding support for more fields later on.
Added unit test for testing the import of new parent field.
We have named the new field parent_json, the structure of json object will be

{
	"country": {
		"id": "NOR",
		"name": "NO".
		"abbr": "nor",
		"source": "myDataSource"
	},
	"locality": {
		"id": "KVE:TopographicPlace:0301",
		"name": "Oslo"
	},
        .....
}

In the CSV file it will look like the other json fields like addendum_json.

"{""country"":{""id"":""NOR"",""name"":""NO""},""locality"":{""id"":""0301"",""name"":""Oslo""},""county"":{""id"":"":03"",""name"":""Oslo""}}"

@mansoor-sajjad
Copy link
Contributor Author

Hi @missinglink,

Do you have the opportunity to review this pull-request?
It been there for a while.

@mansoor-sajjad
Copy link
Contributor Author

Hi @orangejulius,

Do you have the opportunity to review this pull-request?
It been there for a while.

@@ -135,6 +135,15 @@ function processRecord(record, next_uid, stats) {

const pelias_document = new peliasModel.Document( source, layer, model_id );

const parent = getCaseInsensitiveAsJSON( `parent_json`, record );
if (parent) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make the code a bit more robust and readable with lodash, such as this:

if (_.isPlainObject(parent)) {
  _.each(parent, (value, field) => {
    pelias_document.addParent(field, value.name, value.id, value.abbr, value.source)
  })
}

It has a couple advantages:

  • explicitly check the type of parent is an object which prevents a class of errors when the parent_json object is incorrectly specified.
  • use the _.each function to simplify iteration over the properties and make it a bit more succinct and readable

Copy link
Member

@missinglink missinglink Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where parent_json is incorrectly formatted there will be no error returned to the user, this may cause frustration if the import is long-running as it will take time to detect the error and require re-running the import.

Should we detect a situation where the parent_json column is provided but contains either malformed or invalid values? In this situation should we panic or emit an error/warning message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the validation:

Right now we are only showing the warnings in the logs and ignore the records which are malformed, this is the similar behaviour as with the other fields.
Following warning is shown for the malformed records:

Record import failed on line 26 due to invalid document type, expecting: string got: undefined.

In addition to that we have implemented the additional validation in the model to check that the parent fields are valid, the PR is released in [email protected].
I have upgraded the version of pelias-model in csv-importer as part of this PR, so that if we use the non-existing parent name the record will be ignored with following warning.

Record import failed on line 36 due to invalid property: land, should be one of: continent,country,dependency,macroregion,region,macrocounty,county,borough,locality,localadmin,neighbourhood,postalcode,ocean,marinearea,empire.

The way the csv-importer works is, it try to import all the records, give the warnings for the records it fails to import and give the following report at the end of import.

2022-08-04T12:38:39.408Z - info: [csv-importer] Finished import from /Users/me/1659614476347.csv, goodRecordCount=60371 badRecordCount=326

The report show that the warn and not fail was the intentional implementation. Don't know why it was done like that. 🤔
This is not specific to parent_json field, i think changing this behaviour may affect the other users of csv-importer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense 👍 this sounds sufficient

@missinglink
Copy link
Member

I've not run the code, so just wondering what the expected behaviour is here.

Is this intended to be run with adminLookup.enabled = false so that the PIP service doesn't run, or is it intended to be used in conjunction with the PIP service, and in this case, are the values merged or replaced?

@missinglink
Copy link
Member

missinglink commented Aug 3, 2022

The pelias/model method addParent permits multiple values, this is particularly useful for solving issues such as the 'postal cities' problem, where the postal locality differs from the official cadastral boundary.

In the case where multiple parent values are provided for the same field name, we store all copies in the elastic index, making them all searchable, but only the first entry is used for displaying the label.

Your proposed JSON data model only permits a singular entry, "country": {}, I could see someone questioning this design choice at a later date when they see the discrepancy, fixing it would either require some messy code or breaking your implementation, neither of which is very nice.

Can we please wrap the values in an array (such as "country": [{}]) to make it match the database data model?

@mansoor-sajjad
Copy link
Contributor Author

mansoor-sajjad commented Aug 4, 2022

In our case we don't do any adminLookup, hance we don't use the PIP service. We have own admin units which we set in the parent_json field.
Prerequisite in this PR for having the parent_json field is setting adminLookup.enabled = false. In the pelias.json

"imports": {
    "adminLookup": {
      "enabled": false
    },
  }

If we at later stage need both the parent field and PIP service then we will do it in separate PR.

I can add a little check in the code which will stop the import process in the start, if we have both the parent_json field provided and the adminLookup enabled.
What do you think? 🤔

@mansoor-sajjad
Copy link
Contributor Author

mansoor-sajjad commented Aug 4, 2022

Updating the parent_json structure to match the database model ✅

"{""county"":[{""id"":""34"",""name"":""Innlandet""}],""country"":[{""id"":""NOR"",""name"":""NO""}],""locality"":[{""id"":""3403"",""name"":""Hamar""}]}"

Updated the code to accept the new model ✅
Updated the unit test as per the new model ✅

@missinglink
Copy link
Member

I can add a little check in the code which will stop the import process in the start, if we have both the parent_json field provided and the adminLookup enabled.

Actually, I think it's fine that it can be used with or without adminLookup. I assume in this case that the values in the CSV will be added first, and therefore be the 'main' value used in the response, while the values from admin lookup will be secondary, and only used for search.

@missinglink
Copy link
Member

missinglink commented Aug 4, 2022

Great, looks good thanks, could you please add some documentation info in the README and it's good to merge 👍

@mansoor-sajjad
Copy link
Contributor Author

@missinglink Added the documentation.
Can you please have a quick look at it.

@missinglink missinglink merged commit d062fa5 into pelias:master Aug 9, 2022
@missinglink
Copy link
Member

Thanks! this was released on npm as [email protected]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants