
[API] Add CSV bulk indexing support to Kibana API #6844

Merged
merged 18 commits into elastic:feature/ingest on May 11, 2016

Conversation

@Bargs commented Apr 9, 2016

Creates the API necessary for #6541
Required by: #6845

This PR creates a streaming API endpoint that parses CSV files and indexes each row in the file as a document in Elasticsearch.

Endpoint

/api/kibana/{index}/_data

The request payload can be either raw CSV data, or multipart/form-data with the CSV file attached under a csv key.

Query String Parameters

  • delimiter - (optional) String - a custom delimiter. Defaults to ,
  • pipeline - (optional) Boolean - If true, documents are sent through the index's corresponding pipeline (based on the Kibana ingest convention of prefixing the index name with kibana-), if one exists
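
For example, a raw-CSV request might look like this (index name and file are placeholders; the kbn-version header is required for all Kibana API calls, as discussed further down in this thread):

```sh
# Index each row of fake_names.csv into the "names" index, sending documents
# through the "kibana-names" pipeline if one has been configured.
curl -XPOST 'http://localhost:5601/api/kibana/names/_data?pipeline=true' \
  -H 'kbn-version: 5.0.0-snapshot' \
  --data-binary @fake_names.csv
```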

Response

To support large files without implementing a polling mechanism or websockets, the API streams a chunked response back to the browser. This makes real-time progress updates possible and helps prevent timeouts between the backend and the browser. The response payload is a JSON array of "result objects". Each result object represents the results of a portion (or, for a small file, potentially the entirety) of the parsing and indexing the API has completed so far. The result object schema looks like the following:

{
  created: Number - the number of successfully indexed documents for this chunk 
  errors: Object (optional) - Contains arrays of errors, keyed by error type
}

The errors object may have the following keys:
  • index - Array of objects - contains any indexing errors returned by Elasticsearch, per document
  • other - Array of strings - contains CSV parse errors, non-indexing ES errors, and any other miscellaneous errors

To make this more concrete, here are a couple sample responses:

With an indexing error:

[
  {
    "created": 3,
    "errors": {
      "index": [
        {
          "_id": "L3 - smalltest.csv",
          "error": {
            "type": "mapper_parsing_exception",
            "reason": "failed to parse [CZ_FIPS]",
            "caused_by": {
              "type": "number_format_exception",
              "reason": "For input string: \"foo\""
            }
          }
        }
      ]
    }
  }
]

With a parsing error:

[
  {
    "created": 0,
    "errors": {
      "other": [
        "Invalid opening quote at line 2"
      ]
    }
  }
]

@Bargs added the review and Feature:Add Data (Add Data and sample data feature on Home) labels, Apr 9, 2016
@Bargs commented Apr 9, 2016

@epixa I thought you might be interested in reviewing this since it seems like the inverse of the CSV export API you're building. But let me know if you already have too much on your plate and I can find someone else to look at it.

@Bargs commented Apr 12, 2016

jenkins, test it

responseStream.write('[');
csv.pipe(parser);

hi(parser)
Contributor:

As far as I can tell, none of this is really specific to the csv import feature, so I think almost all of this logic should be pulled out of this handler and into a standalone library. The library should basically have one purpose: given a stream of docs, send those docs off to the bulk api and return a stream response. The route handler would then simply be responsible for parsing the csv payload, sending it through the import function, and piping the response back to the client.
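
Roughly, I'm picturing something like this (names, signatures, and the doc shape are illustrative only, not a prescription):

```js
// Hypothetical standalone importer: given a highland stream of docs, bulk-index
// them in batches and emit one { created, errors } result object per batch.
const hi = require('highland');

function createDocImporter(client, index, { batchSize = 1000 } = {}) {
  return (docStream) =>
    docStream
      .batch(batchSize)
      .flatMap((docs) =>
        hi(
          client.bulk({
            body: docs.reduce((body, doc) => {
              // the "default" type and the { _id, _source } doc shape are assumptions
              body.push({ index: { _index: index, _type: 'default', _id: doc._id } });
              body.push(doc._source);
              return body;
            }, [])
          })
        )
      )
      .map((response) => {
        const indexErrors = response.items
          .filter((item) => item.index.error)
          .map((item) => ({ _id: item.index._id, error: item.index.error }));
        return {
          created: response.items.length - indexErrors.length,
          errors: indexErrors.length ? { index: indexErrors } : undefined
        };
      });
}

module.exports = createDocImporter;
```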

Contributor:

This would also have the added benefit of allowing more thorough unit testing of the individual parts of the stream handling process, which will help clarity and reliability of the streaming process.

Contributor Author (@Bargs):

I had the exact same thought actually. I'd also like the _bulk input to support JSON files (or even JSON request bodies) in addition to CSVs, which the library approach would make easier.

However, my tendency is to way over engineer things (see: the first 3 implementations of the ingest API), so this is my attempt to keep to a simple MVP that's good enough to power the CSV upload UI.

Contributor:

I feel you on the whole over-engineering thing, but I don't think this falls into that category. Over-engineering is making things more complicated than they need to be, but I actually think having all of this code strung across the request handler makes this code more complicated than if it's pulled out into a library with clear boundaries. It also makes it impossible to unit test, which I think is a bug even in an MVP.

Contributor:

To elaborate on that last point: MVP should describe the minimum capabilities of the product rather than the quality of code that powers it, and it's generally hard to make the case that code is quality if it can't be unit tested.

Contributor:

👍 to not trying to build a third-party API in this initial implementation, but that shouldn't mean we don't write tests to cover the code paths we create. Again, I go back to my product-vs-code point from before: the vast majority of our code isn't publicly consumable (or at least we don't want people to use it directly), but that doesn't mean we don't benefit from testing it. Those ~24 code paths aren't accidental or anything; we're not talking about accounting for every possible input type, we're just talking about testing the code paths that are written.

Things like batching size, parallelization, and back pressure are handled by highland because of the way you've coded this, but any developer at any time could make even trivial changes to this file that could break any or all of those things in subtle ways that could have effects we couldn't possibly anticipate through automated testing. That's why we should assert that these things are being set up and configured the way they need to be, rather than hoping our tests will catch the unexpected effects of breaking them.

That doesn't mean we need to re-test all of the capabilities of highland, though. If we are comfortable assuming that highland will handle those things for us so long as we set them or configure them properly (which I am, and it sounds like you are), then we would only need to assert that we're configuring them properly. Those assertions also help codify important decisions that are made at implementation time, like the choice to do a lower batch size.
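
For example, something along these lines (written against the hypothetical createDocImporter sketched above, in the mocha/expect.js style used elsewhere in Kibana, just to illustrate the kind of assertion I mean):

```js
const expect = require('expect.js');
const hi = require('highland');
const createDocImporter = require('./create_doc_importer'); // hypothetical module path

describe('createDocImporter', () => {
  it('sends documents to the bulk API in batches of the configured size', (done) => {
    const batchSizes = [];
    const stubClient = {
      bulk: ({ body }) => {
        batchSizes.push(body.length / 2); // two bulk body entries (action + source) per doc
        const items = body.filter((entry, i) => i % 2 === 0).map(() => ({ index: {} }));
        return Promise.resolve({ items });
      }
    };

    const docs = Array.from({ length: 5 }, (v, i) => ({ _id: `doc${i}`, _source: { n: i } }));

    createDocImporter(stubClient, 'names', { batchSize: 2 })(hi(docs)).toArray((results) => {
      expect(batchSizes).to.eql([2, 2, 1]); // the batching decision is pinned down here
      expect(results.length).to.be(3);
      done();
    });
  });
});
```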

As for the rest of those things, I didn't mean to imply that you weren't testing for them, just that my 24 code paths number wasn't even considering those things. Many of those things are definitely best verified through functional tests just as many others are best verified through unit tests. You probably are verifying a bunch of them already.

To be honest, I think a boatload of context is being lost in this comment chain. Want to zoom tomorrow? Maybe we could bring in some other people to get some fresh perspectives as well?

Contributor Author (@Bargs):

Would be happy to Zoom. However, I would like you to provide some concrete examples of the types of tests you'd like to see, as there's a lot of hand waving going on here.

Contributor Author (@Bargs):

> any developer at any time could make even trivial changes to this file that could break any or all of those things in subtle ways that could have effects we couldn't possibly anticipate through automated testing. That's why we should assert that these things are being set up and configured the way they need to be, rather than hoping our tests will catch the unexpected effects of breaking them.

And I'd also like concrete examples of ways in which a developer could make trivial changes that break everything, where a unit test would save them, and they wouldn't just change the unit test in addition to the code if the subtle breakages are in fact impossible to anticipate and the developer is incapable of detecting them when testing their changes.

Contributor:

Sure, I can come up with some examples using this code before we chat about it.

I'm not sure what you meant by the follow-up comment, though. Developers will always be able to modify tests, and doing so will always mean that the code isn't working the way it was before.

Contributor Author (@Bargs):

It sounds like you want tests to verify the value of certain configuration options, like batch size, to prevent bugs that "we couldn't possibly anticipate" caused by a developer changing that value. What I'm saying is that if the bug is in fact impossible to anticipate, a unit test isn't going to prevent the developer from changing that value.

But I could be totally wrong about what you mean, so let's table that discussion until we have some concrete examples of unit tests that we can dissect.

"type": "double",
"doc_values": true
},
"CountryFull": {
Contributor:

You can drop this and all of the other "type": "text" mappings since the _default_ dynamic template will catch them

Contributor Author (@Bargs):

Ah that's right, ES creates text and keyword fields out of strings by default now, doesn't it?

Contributor:

You can still drop all of these "text" fields. Even outside of Elasticsearch's defaults, you have a _default_ mapping that will catch all of the text fields anyway.

Contributor Author (@Bargs):

See my other comment about this.

@epixa commented Apr 13, 2016

I'm not sure why github collapsed the one longer thread we have going, so I'm linking it here for other reviewers to add their thoughts: #6844 (diff)

If we add anything more to it, we should probably do so on this main PR comment thread so they don't continue to disappear into the void.

@rashidkpc:
It's because it was a comment on code that has changed, so I guess GitHub decides it's resolved. But yeah, in general, anything large in scope should go in the main thread or it gets lost.

@rashidkpc:
jenkins, test it

@rashidkpc:
@Bargs can you merge master on this? That should fix the tests

@Bargs commented Apr 26, 2016

Hmmm I rebased but it's still busted, I'll have to look into it when I get some time.

@Bargs commented Apr 28, 2016

Tests are passing again. @epixa and I are going to discuss the unit testing more on Friday. @rashidkpc in the meantime could you give this a review as well?

@Bargs assigned rashidkpc and unassigned Bargs, Apr 28, 2016
"_default_": {
"dynamic_templates": [
{
"string_fields": {
Contributor:

This will catch all of your text/string fields, no reason to define them explicitly.

Contributor Author (@Bargs):

I need the explicit mappings, otherwise I'll get mapping exceptions when indexing the data. For instance, some of the ZipCode values are actually strings, but Elasticsearch will map it as a long if the first value it sees looks like a number. So ZipCode needs to be explicitly mapped as a string.

Some of the fields might not strictly require explicit mappings, but to make the tests as robust and repeatable as possible, I went ahead and mapped everything.
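
To make that concrete, the idea is a catch-all dynamic template plus a handful of explicit overrides, along these lines (a simplified sketch, not the exact fixture mapping):

```json
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": { "keyword": { "type": "keyword" } }
            }
          }
        }
      ],
      "properties": {
        "ZipCode": { "type": "text" }
      }
    }
  }
}
```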

@LeeDr commented May 10, 2016

LGTM

Lee's discovery test (with @Bargs' help):

In this case I started elasticsearch and kibana with npm run test:ui:server so my kibana port is 5620.
Kibana version header is required.
I loaded one of the same test data csv files from the API tests.

$ curl -s -XPOST 'localhost:5620/api/kibana/names/_bulk' -H "kbn-version: 5.0.0-snapshot" --data-binary @test/unit/data/fake_names.csv
[
{"created":97,"errors":{"index":[{"_id":"AVScm05upO2BM3CqF7Ec","error":{"type":"illegal_argument_exception","reason":"mapper [ZipCode] of different type, current_type [text], merged_type [long]"}},{"_id":"AVScm05upO2BM3CqF7Ed","error":{"type":"illegal_argument_exception","reason":"mapper [Pounds] cannot be changed from type [float] to [long]"}},{"_id":"AVScm05upO2BM3CqF7Ek","error":{"type":"illegal_argument_exception","reason":"mapper [ZipCode] of different type, current_type [text], merged_type [long]"}}]}}
]


@Bargs assigned ycombinator and unassigned Bargs, May 10, 2016
});
});

});

bdd.describe('optional parameters', function () {
bdd.it('should accept a custom delimiter query string param for parsing the CSV', function () {
- return request.post('/kibana/names/_bulk?delimiter=|')
+ return request.post('/kibana/names/_data?delimiter=|')
Contributor:

Just noticed this. The | could be passed in as %7C because of URL encoding. Mind adding a test to make sure that works?
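
Something like this, following the existing test's style (how the CSV payload gets attached here is just a sketch):

```js
bdd.it('should also accept the delimiter when it is URL-encoded as %7C', function () {
  return request.post('/kibana/names/_data?delimiter=%7C')
    .send('Name|Age\nJane|30\n') // hypothetical pipe-delimited payload
    .expect(200);
});
```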

@ycombinator commented May 11, 2016

I made the following request:

$ curl -kv -H'kbn-version: 5.0.0-snapshot' -XPOST https://localhost:5601/dun/api/kibana/foo/_data -d '
foo,bar
1,2'

And got the following response:

HTTP/1.1 200 OK
[
{"created":0,"errors":{"other":["Number of columns on line 2 does not match header"]}}
]

I have a few questions/comments:

  1. Why is the kbn-version header required? Is this so we can effectively version the API, should its interface change in the future?
  2. The error above is being caused because I have an extra newline at the very start of my input. Either we should strip whitespace at the start and end of the input (as a whole, not from every individual row, to be clear) OR we should detect that there is actually no header specified at all (as the first line of input is empty) and make that the error message.
  3. I strongly recommend not responding with a JSON array. This makes it impossible to extend in the future. Instead I suggest responding with a JSON object with a property for the array. This way, if you need to add properties to the response in the future, you have that extensibility available to you without needing to break backwards compatibility.
  4. I'm guessing the reason you are returning the error as an element of the array is because you might run into errors on individual lines of the input CSV (but please correct me if there's another reason). However, there is a class of errors that apply to the entire input payload (such as the error above) or the entire request in general (such as the authentication error example shown below). For this class of "global-level" errors, I suggest that this API not return a 200 OK but the appropriate status code > 399. Also, if you take the suggestion in point 3, you can then attach these "global-level" error messages as properties to the root JSON object in the response.

Authentication error example, request + response:

$ curl -k -H'kbn-version: 5.0.0-snapshot' -XPOST https://localhost:5601/dun/api/kibana/foo/_data -d 'foo,bar
1,2'
[
{"created":0,"errors":{"other":["Unauthorized"]}}
]

@ycombinator commented May 11, 2016

Currently the API (really Elasticsearch) auto-assigns IDs for each row in the CSV. Would it be useful for the API (and by extension, the UI) to expose a parameter that will allow the user to specify an _id column in the CSV? The API would then use the value of this column in each row as the _id of the corresponding document in Elasticsearch. This is something that can be added later too, if you don't know of a use case for this at the moment.

@ycombinator:
General note about query string parameters on this API: This API is named _data because, in the future, it could take data formats other than CSV as input. However, some of the parameters on this API (such as delimiter) might only make sense for certain formats (such as CSV). Also the same parameter might end up meaning different things, depending on the format of the data.

A couple of thoughts:

  • Perhaps it's not such a bad idea to make this API specific to CSV input, by naming it _csv, or
  • Perhaps the parameters that are specific to certain formats should be namespaced like csv_delimiter

@epixa commented May 11, 2016

kbn-version is required for all requests to the Kibana server for CSRF protection.

@Bargs commented May 11, 2016

> Why is the kbn-version header required? Is this so we can effectively version the API, should its interface change in the future?

kbn-version is actually unrelated to the API, it's a header we added to protect against XSRF.

> The error above is being caused because I have an extra newline at the very start of my input. Either we should strip whitespace at the start and end of the input (as a whole, not from every individual row, to be clear) OR we should detect that there is actually no header specified at all (as the first line of input is empty) and make that the error message.

Setting the skip_empty_lines config value to true might solve this, let me give that a try.
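
Something like this in the parser setup (csv-parse option names; the surrounding handler wiring is simplified):

```js
const parse = require('csv-parse');

// Tolerate blank lines, including a leading one before the header row.
const parser = parse({
  columns: true,          // use the first non-empty row as the header
  skip_empty_lines: true,
  delimiter: delimiter || ','  // delimiter from the query string, if provided
});
```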

> I strongly recommend not responding with a JSON array. This makes it impossible to extend in the future. Instead I suggest responding with a JSON object with a property for the array. This way, if you need to add properties to the response in the future, you have that extensibility available to you without needing to break backwards compatibility.

I think that's a good point but it's a little tricky since the response is streamed to the client. Making the response an object will make it a bit harder to parse in a streaming manner. The way I think about it, the array is a container for 1 or more response objects. It could just as easily be 1 or more json objects separated by newlines, but I wanted to make it valid json so clients could consume it easily if they don't care about reading the streaming results as they arrive. So if we were to add a property in the future, it would be to the response objects, not the array.
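
For illustration, a client that doesn't care about incremental parsing can simply buffer the chunks and parse the whole array once the response ends, while still getting progress from the raw chunks as they arrive (a rough sketch, not part of this PR):

```js
const http = require('http');
const fs = require('fs');

const req = http.request(
  {
    host: 'localhost',
    port: 5601,
    path: '/api/kibana/names/_data',
    method: 'POST',
    headers: { 'kbn-version': '5.0.0-snapshot' }
  },
  (res) => {
    let body = '';
    res.on('data', (chunk) => {
      body += chunk;                               // each chunk is roughly one result object
      console.log('progress chunk:', chunk.toString());
    });
    res.on('end', () => {
      const results = JSON.parse(body);            // the complete payload is a valid JSON array
      const created = results.reduce((sum, r) => sum + r.created, 0);
      console.log(`created ${created} documents in total`);
    });
  }
);

fs.createReadStream('fake_names.csv').pipe(req);
```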

> I'm guessing the reason you are returning the error as an element of the array is because you might run into errors on individual lines of the input CSV (but please correct me if there's another reason).

Exactly.

> However, there is a class of errors that apply to the entire input payload (such as the error above) or the entire request in general (such as the authentication error example shown below).

So, a couple of things here:

  • Unfortunately the CSV parsing lib does nothing to distinguish between line level errors and global errors. So we'd have to rely on matching the error text itself which would be a bad idea.
  • Is that unauthorized error coming from elasticsearch? Are you getting it because you're unauthorized to add data to a specified index or because you're unauthorized to use the bulk API? What if Shield's field level security is being used, and some indexing requests are unauthorized but some are ok?

For the sake of streaming, I really like the simplicity of treating each response object in the array as its own independent chunk, and I think most errors that can occur really are line specific. But it does seem to make sense to return a 403 if, for instance, the user has no write permissions on the index they're trying to add data to. That would require pre-checking the user's permissions, though, which we don't really do anywhere else, and it seems like more of a cross-cutting concern that should be handled in the security plugin rather than adding security-specific code to the API route handler itself. I dunno, I'm pretty torn on this one.

@Bargs commented May 11, 2016

> Currently the API (really Elasticsearch) auto-assigns IDs for each row in the CSV.

Actually the API creates an ID from the filename and line number: https://github.com/elastic/kibana/pull/6844/files#diff-acac6333b1d8bc4acac4ec269419e41aR51

I do think the ability to specify an ID column would be useful, but that's something else I think I'd like to take a wait and see approach on. Having the line number is really useful in the CSV upload UI so that I can point users to the exact line that caused an error.
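
Roughly (the exact code is in the linked file; the format matches the "_id": "L3 - smalltest.csv" example in the PR description):

```js
// Derive a human-readable _id from the uploaded filename and the CSV line
// number, so indexing errors can point back at the offending line.
function docId(fileName, lineNumber) {
  return `L${lineNumber} - ${fileName}`; // e.g. "L3 - smalltest.csv"
}
```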

@Bargs commented May 11, 2016

> Perhaps the parameters that are specific to certain formats should be namespaced like csv_delimiter

I like that idea, since there are other params like pipeline that could apply to any input type.

@ycombinator:
@Bargs and I chatted out-of-issue about the structure of the API response. I'll try to summarize:

Given that this is an internal-only API at the moment, we go with whatever is simplest to generate (from the server side) and consume (from the client side). So we keep the response structure as-is: a top-level array containing response objects. If/when we release this API publicly, we'll need to think harder about the implications of that structure per points 3 and 4 in my comment above.

@ycombinator:
> Actually the API creates an ID from the filename and line number: https://github.com/elastic/kibana/pull/6844/files#diff-acac6333b1d8bc4acac4ec269419e41aR51

Ah, I missed that. Thanks for clarifying.

> I do think the ability to specify an ID column would be useful, but that's something else I think I'd like to take a wait and see approach on.

The wait-and-see approach about letting users specify an ID column sounds good!

@Bargs commented May 11, 2016

@ycombinator just pushed some updates.

  • Data dir is now fixtures
  • Empty lines at the beginning of a file will no longer cause errors
  • Changed delimiter to csv_delimiter

I think that covers everything we agreed to update. Let me know how we're looking.

@Bargs commented May 11, 2016

jenkins, test it

@ycombinator:
LGTM.

@Bargs assigned Bargs and unassigned ycombinator, May 11, 2016
@Bargs merged commit 5918737 into elastic:feature/ingest on May 11, 2016