By default, the NER Manager includes built-in entity extraction, with different bundles available for different languages. Entity extraction runs even if the utterance is not matched to an intent.
Builtin | English | French | Spanish | Portuguese | Chinese | Japanese | Other |
---|---|---|---|---|---|---|---|
Email | X | X | X | X | X | X | X |
Ip | X | X | X | X | X | X | X |
Hashtag | X | X | X | X | X | X | X |
Phone Number | X | X | X | X | X | X | X |
URL | X | X | X | X | X | X | X |
Number | X | X | X | X | X | X | see 1 |
Ordinal | X | X | X | X | X | X | |
Percentage | X | X | X | X | X | X | see 2 |
Dimension | X | X | X | X | X | X | see 3 |
Age | X | X | X | X | X | X | |
Currency | X | X | X | X | X | X | |
Date | X | X | X | X | see 4 | see 4 | see 4 |
Duration | X | | | | | | |
- 1: Only for numbers written as digits, not as text
- 2: Only for numbers written as digits followed by the % symbol
- 3: Only for numbers written as digits followed by a unit abbreviation (km, s, km/h...)
- 4: Only numeric formats such as dd/MM/yyyy, not dates written as text
- Email Extraction
- IP Extraction
- Hashtag Extraction
- Phone Number Extraction
- URL Extraction
- Number Extraction
- Ordinal Extraction
- Percentage Extraction
- Age Extraction
- Currency Extraction
- Date Extraction
- Duration Extraction
It can identify and extract valid email addresses. This works for any language.
"utterance": "My email is [email protected] please write me",
"entities": [
{
"start": 12,
"end": 33,
"len": 22,
"accuracy": 0.95,
"sourceText": "[email protected]",
"utteranceText": "[email protected]",
"entity": "email",
"resolution": {
"value": "[email protected]"
}
}
]
It can identify and extract valid IPv4 and IPv6 addresses. This works for any language.
"utterance": "My ip is 8.8.8.8",
"entities": [
{
"start": 9,
"end": 15,
"len": 7,
"accuracy": 0.95,
"sourceText": "8.8.8.8",
"utteranceText": "8.8.8.8",
"entity": "ip",
"resolution": {
"value": "8.8.8.8",
"type": "ipv4"
}
}
]
"utterance": "My ip is ABEF:452::FE10",
"entities": [
{
"start": 9,
"end": 22,
"len": 14,
"accuracy": 0.95,
"sourceText": "ABEF:452::FE10",
"utteranceText": "ABEF:452::FE10",
"entity": "ip",
"resolution": {
"value": "ABEF:452::FE10",
"type": "ipv6"
}
}
]
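In the examples in this document, `start` and `end` read as zero-based character offsets into the utterance, with `end` inclusive, so that `len` equals `end - start + 1`. The following is a minimal sketch that checks this reading against the IPv4 example; `entityText` is a hypothetical helper written for illustration and is not part of the NER Manager API.

```javascript
// Hypothetical helper (not part of the library): recover an entity's text
// from the utterance using the inclusive start/end offsets.
function entityText(utterance, entity) {
  return utterance.slice(entity.start, entity.end + 1);
}

// Values copied from the IPv4 example in this document.
const utterance = 'My ip is 8.8.8.8';
const entity = { start: 9, end: 15, len: 7, utteranceText: '8.8.8.8' };

console.log(entityText(utterance, entity) === entity.utteranceText); // true
console.log(entity.end - entity.start + 1 === entity.len); // true
```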
It can identify and extract hashtags from utterances. This works for any language.
"utterance": "Open source is great! #proudtobeaxa",
"entities": [
{
"start": 22,
"end": 34,
"len": 13,
"accuracy": 0.95,
"sourceText": "#proudtobeaxa",
"utteranceText": "#proudtobeaxa",
"entity": "hashtag",
"resolution": {
"value": "#proudtobeaxa"
}
}
]
It can identify and extract phone numbers from utterances. This works for any language.
"utterance": "So here is my number +1 541-754-3010 callme maybe",
"entities": [
{
"start": 21,
"end": 35,
"len": 15,
"accuracy": 0.95,
"sourceText": "+1 541-754-3010",
"utteranceText": "+1 541-754-3010",
"entity": "phonenumber",
"resolution": {
"value": "+1 541-754-3010"
}
}
]
It can identify and extract URLs from utterances. This works for any language.
"utterance": "The url is https://something.com",
"entities": [
{
"start": 11,
"end": 31,
"len": 21,
"accuracy": 0.95,
"sourceText": "https://something.com",
"utteranceText": "https://something.com",
"entity": "url",
"resolution": {
"value": "https://something.com"
}
}
]
It can identify and extract numbers. This works for any language, and the numbers can be integers or floats.
"utterance": "This is 12",
"entities": [
{
"start": 8,
"end": 9,
"len": 2,
"accuracy": 0.95,
"sourceText": "12",
"utteranceText": "12",
"entity": "number",
"resolution": {
"strValue": "12",
"value": 12,
"subtype": "integer"
}
}
]
Numbers can also be written as text, but this only works for English, French, Spanish and Portuguese.
"utterance": "This is twelve",
"entities": [
{
"start": 8,
"end": 13,
"len": 6,
"accuracy": 0.95,
"sourceText": "twelve",
"utteranceText": "twelve",
"entity": "number",
"resolution": {
"strValue": "12",
"value": 12,
"subtype": "integer"
}
}
]
The text feature also works for fractions.
"utterance": "one over 3",
"entities": [
{
"start": 0,
"end": 9,
"len": 10,
"accuracy": 0.95,
"sourceText": "one over 3",
"utteranceText": "one over 3",
"entity": "number",
"resolution": {
"strValue": "0.333333333333333",
"value": 0.333333333333333,
"subtype": "float"
}
}
]
It can identify and extract ordinal numbers. This works only for English, Spanish, French and Portuguese.
"utterance": "He was 2nd",
"entities": [
{
"start": 7,
"end": 9,
"len": 3,
"accuracy": 0.95,
"sourceText": "2nd",
"utteranceText": "2nd",
"entity": "ordinal",
"resolution": {
"strValue": "2",
"value": 2,
"subtype": "integer"
}
}
]
Ordinals can also be written as text.
"utterance": "one hundred twenty fifth",
"entities": [
{
"start": 0,
"end": 23,
"len": 24,
"accuracy": 0.95,
"sourceText": "one hundred twenty fifth",
"utteranceText": "one hundred twenty fifth",
"entity": "ordinal",
"resolution": {
"strValue": "125",
"value": 125,
"subtype": "integer"
}
}
]
It can identify and extract percentages. If the percentage is indicated with the % symbol, it works for any language.
"utterance": "68.2%",
"entities": [
{
"start": 0,
"end": 4,
"len": 5,
"accuracy": 0.95,
"sourceText": "68.2%",
"utteranceText": "68.2%",
"entity": "percentage",
"resolution": {
"strValue": "68.2%",
"value": 68.2,
"subtype": "float"
}
}
]
The percentage can also be indicated with text, but this only works for English, French, Spanish and Portuguese.
"utterance": "68.2 percent",
"entities": [
{
"start": 0,
"end": 11,
"len": 12,
"accuracy": 0.95,
"sourceText": "68.2 percent",
"utteranceText": "68.2 percent",
"entity": "percentage",
"resolution": {
"strValue": "68.2%",
"value": 68.2,
"subtype": "float"
}
}
]
It can also understand percentages whose number is written as text, but this only works for English, French, Spanish and Portuguese.
"utterance": "thirty five percentage",
"entities": [
{
"start": 0,
"end": 21,
"len": 22,
"accuracy": 0.95,
"sourceText": "thirty five percentage",
"utteranceText": "thirty five percentage",
"entity": "percentage",
"resolution": {
"strValue": "35%",
"value": 35,
"subtype": "integer"
}
}
]
It can identify and extract different dimensions, such as length, distance, speed, volume and area. If the international abbreviation of the unit is used, it works in any language.
"utterance": "120km",
"entities": [
{
"start": 0,
"end": 4,
"len": 5,
"accuracy": 0.95,
"sourceText": "120km",
"utteranceText": "120km",
"entity": "dimension",
"resolution": {
"strValue": "120",
"value": 120,
"unit": "Kilometer",
"localeUnit": "Kilometer"
}
}
]
If the name of the unit is written out in a language instead of the abbreviation, it works in English, French, Spanish and Portuguese.
"utterance": "Está a 325 kilómetros de Bucarest",
"entities": [
{
"start": 7,
"end": 20,
"len": 14,
"accuracy": 0.95,
"sourceText": "325 kilómetros",
"utteranceText": "325 kilómetros",
"entity": "dimension",
"resolution": {
"strValue": "325",
"value": 325,
"unit": "Kilometer",
"localeUnit": "Kilómetro"
}
}
]
It can identify and extract ages. It works in English, French, Spanish and Portuguese. Note that several ways of expressing an age can also be read as a duration ("It will be 10 years" can be an age or a duration), so two overlapping entities, one age and one duration, can be returned.
"utterance": "This saga is ten years old",
"entities": [
{
"start": 13,
"end": 25,
"len": 13,
"accuracy": 0.95,
"sourceText": "ten years old",
"utteranceText": "ten years old",
"entity": "age",
"resolution": {
"strValue": "10",
"value": 10,
"unit": "Year",
"localeUnit": "Year"
}
},
{
"start": 13,
"end": 21,
"len": 9,
"accuracy": 0.95,
"sourceText": "ten years",
"utteranceText": "ten years",
"entity": "duration",
"resolution": {
"values": [
{
"timex": "P10Y",
"type": "duration",
"value": "315360000"
}
]
}
}
]
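When overlapping entities such as the age and duration above are returned, the caller decides which one to keep. The sketch below shows one possible policy, keeping the longest span among overlapping entities. The helpers `overlaps` and `keepLongest` are hypothetical names introduced here for illustration, not library functions, and other policies (for example, preferring a given entity type) may suit an application better.

```javascript
// Hypothetical helpers (not part of the library).
// Two entities overlap when their inclusive [start, end] ranges intersect.
function overlaps(a, b) {
  return a.start <= b.end && b.start <= a.end;
}

// One possible policy: among overlapping entities, keep the longest span.
function keepLongest(entities) {
  const sorted = [...entities].sort((a, b) => b.len - a.len);
  const kept = [];
  for (const candidate of sorted) {
    if (!kept.some((k) => overlaps(k, candidate))) kept.push(candidate);
  }
  // Return the survivors in utterance order.
  return kept.sort((a, b) => a.start - b.start);
}

// Offsets copied from the "This saga is ten years old" example.
const entities = [
  { start: 13, end: 25, len: 13, entity: 'age' },
  { start: 13, end: 21, len: 9, entity: 'duration' },
];
console.log(keepLongest(entities).map((e) => e.entity)); // [ 'age' ]
```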
It can identify and extract currency values. It works in English, French, Spanish and Portuguese.
"utterance": "420 million finnish markka",
"entities": [
{
"start": 0,
"end": 25,
"len": 26,
"accuracy": 0.95,
"sourceText": "420 million finnish markka",
"utteranceText": "420 million finnish markka",
"entity": "currency",
"resolution": {
"strValue": "420000000",
"value": 420000000,
"unit": "Finnish markka",
"localeUnit": "Finnish markka"
}
}
]
If the language used is not English, localeUnit contains the localized name of the currency.
"utterance": "420 millones de marcos finlandeses",
"entities": [
{
"start": 0,
"end": 33,
"len": 34,
"accuracy": 0.95,
"sourceText": "420 millones de marcos finlandeses",
"utteranceText": "420 millones de marcos finlandeses",
"entity": "currency",
"resolution": {
"strValue": "420000000",
"value": 420000000,
"unit": "Finnish markka",
"localeUnit": "Marco finlandés"
}
}
]
It can identify and extract dates. Dates in numeric format work in any language, but note that the locale also affects the expected date format.
"utterance": "28/10/2018",
"entities": [
{
"start": 0,
"end": 9,
"len": 10,
"accuracy": 0.95,
"sourceText": "28/10/2018",
"utteranceText": "28/10/2018",
"entity": "date",
"resolution": {
"type": "date",
"timex": "2018-10-28",
"strValue": "2018-10-28",
"date": "2018-10-28T00:00:00.000Z"
}
}
]
It can understand written date formats in English, French, Spanish and Portuguese.
"utterance": "Volveré el 12 de enero del 2019",
"entities": [
{
"start": 11,
"end": 30,
"len": 20,
"accuracy": 0.95,
"sourceText": "12 de enero del 2019",
"utteranceText": "12 de enero del 2019",
"entity": "date",
"resolution": {
"type": "date",
"timex": "2019-01-12",
"strValue": "2019-01-12",
"date": "2019-01-12T00:00:00.000Z"
}
}
]
It can understand partial dates. In that case the timex contains the part that was resolved: for example, if the day is provided but not the month or the year, both the month and the year are filled with X. Two possible dates are also returned, one in the past and one in the future. Note that in cases like this the result can also include a number entity for the same text, as in this example:
"utterance": "I'll go back on 15",
"entities": [
{
"start": 16,
"end": 17,
"len": 2,
"accuracy": 0.95,
"sourceText": "15",
"utteranceText": "15",
"entity": "number",
"resolution": {
"strValue": "15",
"value": 15,
"subtype": "integer"
}
},
{
"start": 16,
"end": 17,
"len": 2,
"accuracy": 0.95,
"sourceText": "15",
"utteranceText": "15",
"entity": "date",
"resolution": {
"type": "interval",
"timex": "XXXX-XX-15",
"strPastValue": "2018-08-15",
"pastDate": "2018-08-15T00:00:00.000Z",
"strFutureValue": "2018-09-15",
"futureDate": "2018-09-15T00:00:00.000Z"
}
}
]
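A caller usually has to pick one of the two candidate dates. The sketch below is one way to do that, assuming the resolution fields shown in the examples in this document (`date`, `pastDate`, `futureDate`); `resolvePartialDate` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper (not part of the library): pick a single Date from a
// date resolution. Fully resolved dates carry `date`; partial dates carry
// `pastDate` and `futureDate`, and the caller chooses which one to prefer.
function resolvePartialDate(resolution, preferFuture = true) {
  if (resolution.date) return new Date(resolution.date);
  const iso = preferFuture ? resolution.futureDate : resolution.pastDate;
  return iso ? new Date(iso) : null;
}

// Resolution copied from the "I'll go back on 15" example.
const resolution = {
  type: 'interval',
  timex: 'XXXX-XX-15',
  strPastValue: '2018-08-15',
  pastDate: '2018-08-15T00:00:00.000Z',
  strFutureValue: '2018-09-15',
  futureDate: '2018-09-15T00:00:00.000Z',
};
console.log(resolvePartialDate(resolution).toISOString()); // 2018-09-15T00:00:00.000Z
console.log(resolvePartialDate(resolution, false).toISOString()); // 2018-08-15T00:00:00.000Z
```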
When the granularity of the date is coarser than a day, each of the past and future values is itself an interval. For example, a month such as January resolves to the past and future Januaries, and each of them is an interval from January 1 to February 1, as in this example:
"utterance": "I'll be out in Jan",
"entities": [
{
"start": 15,
"end": 17,
"len": 3,
"accuracy": 0.95,
"sourceText": "Jan",
"utteranceText": "Jan",
"entity": "daterange",
"resolution": {
"type": "interval",
"timex": "XXXX-01",
"strPastStartValue": "2018-01-01",
"pastStartDate": "2018-01-01T00:00:00.000Z",
"strPastEndValue": "2018-02-01",
"pastEndDate": "2018-02-01T00:00:00.000Z",
"strFutureStartValue": "2019-01-01",
"futureStartDate": "2019-01-01T00:00:00.000Z",
"strFutureEndValue": "2019-02-01",
"futureEndDate": "2019-02-01T00:00:00.000Z"
}
}
]
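A range like this can be used to test whether a given date falls inside it. The sketch below assumes the end date is exclusive (February 1 marks the end of January rather than belonging to it), which is an interpretation of the example and not something this document states. `inFutureRange` is a hypothetical helper.

```javascript
// Hypothetical helper (not part of the library): does `date` fall inside the
// future interval of a daterange resolution?
// Assumption: the end date is exclusive.
function inFutureRange(resolution, date) {
  const start = new Date(resolution.futureStartDate);
  const end = new Date(resolution.futureEndDate);
  return date >= start && date < end;
}

// Future interval copied from the "I'll be out in Jan" example.
const january = {
  futureStartDate: '2019-01-01T00:00:00.000Z',
  futureEndDate: '2019-02-01T00:00:00.000Z',
};
console.log(inFutureRange(january, new Date('2019-01-15T12:00:00.000Z'))); // true
console.log(inFutureRange(january, new Date('2019-02-01T00:00:00.000Z'))); // false
```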
It also identifies special dates, like Christmas:
"utterance": "I will return in Christmas",
"entities": [
{
"start": 17,
"end": 25,
"len": 9,
"accuracy": 0.95,
"sourceText": "Christmas",
"utteranceText": "Christmas",
"entity": "date",
"resolution": {
"type": "interval",
"timex": "XXXX-12-25",
"strPastValue": "2017-12-25",
"pastDate": "2017-12-25T00:00:00.000Z",
"strFutureValue": "2018-12-25",
"futureDate": "2018-12-25T00:00:00.000Z"
}
}
]
It can identify and extract durations. It currently works in English only. The resolution is given in seconds, together with a timex indicator. For example, for "It will take me 5 minutes" the timex is "PT5M": in ISO 8601 duration notation, "P" marks a period, "T" introduces the time components, and "5M" means 5 minutes.
"utterance": "It will take me 5 minutes",
"entities": [
{
"start": 16,
"end": 24,
"len": 9,
"accuracy": 0.95,
"sourceText": "5 minutes",
"utteranceText": "5 minutes",
"entity": "duration",
"resolution": {
"values": [
{
"timex": "PT5M",
"type": "duration",
"value": "300"
}
]
}
}
]
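The relationship between the timex and the value in seconds can be sketched as follows. `timexToSeconds` is a hypothetical helper for simple ISO 8601 durations, not a library function. It assumes a year counts as 365 days, which matches the "P10Y" and "315360000" pair in the age example of this document, and it deliberately rejects month components because their length varies.

```javascript
// Hypothetical helper (not part of the library): convert a simple ISO 8601
// duration timex such as "PT5M" or "P10Y" to seconds.
// Assumptions: 1 year = 365 days, 1 week = 7 days; months are not supported
// and make the function return null.
function timexToSeconds(timex) {
  const pattern = /^P(?:(\d+)Y)?(?:(\d+)W)?(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/;
  const match = pattern.exec(timex);
  if (!match) return null;
  // Unmatched groups are undefined, so the defaults of 0 apply.
  const [, y = 0, w = 0, d = 0, h = 0, m = 0, s = 0] = match;
  const days = Number(y) * 365 + Number(w) * 7 + Number(d);
  return (days * 24 + Number(h)) * 3600 + Number(m) * 60 + Number(s);
}

console.log(timexToSeconds('PT5M')); // 300
console.log(timexToSeconds('P10Y')); // 315360000
```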