
[API] Add CSV bulk indexing support to Kibana API #6844

Merged
merged 18 commits into elastic:feature/ingest on May 11, 2016

Conversation

@Bargs commented Apr 9, 2016

Creates the API necessary for #6541
Required by: #6845

This PR creates a streaming API endpoint that parses CSV files and indexes each row in the file as a document in Elasticsearch.

Endpoint

/api/kibana/{index}/_data

The request payload can be either raw CSV data, or multipart/form-data with the CSV file attached under a csv key.

Query String Parameters

  • delimiter - (optional) String - a custom delimiter. Defaults to ,
  • pipeline - (optional) Boolean - If true, documents are sent through the index's corresponding pipeline (based on the Kibana ingest convention of prefixing the index name with kibana-), if one exists
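
For example, a raw-CSV request might look like this (index name and file are placeholders; the kbn-version header is required for all Kibana API calls, as discussed further down in this thread):

```sh
# Index each row of fake_names.csv into the "names" index, sending documents
# through the "kibana-names" pipeline if one has been configured.
curl -XPOST 'http://localhost:5601/api/kibana/names/_data?pipeline=true' \
  -H 'kbn-version: 5.0.0-snapshot' \
  --data-binary @fake_names.csv
```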

Response

To support large files without implementing a polling mechanism or websockets, the API streams a chunked response back to the browser. This makes real-time progress updates possible and helps prevent timeouts between the backend and the browser. The response payload is a JSON array of "result objects". Each result object represents the results of a portion (or, for a small file, potentially the entirety) of the parsing and indexing the API has completed so far. The result object schema looks like the following:

{
  created: Number - the number of successfully indexed documents for this chunk 
  errors: Object (optional) - Contains arrays of errors, keyed by error type
}

The errors object may have the following keys:
  • index - Array of objects - contains any indexing errors returned by Elasticsearch, per document
  • other - Array of strings - contains CSV parse errors, non-indexing ES errors, and any other miscellaneous errors

To make this more concrete, here are a couple sample responses:

With an indexing error:

[
  {
    "created": 3,
    "errors": {
      "index": [
        {
          "_id": "L3 - smalltest.csv",
          "error": {
            "type": "mapper_parsing_exception",
            "reason": "failed to parse [CZ_FIPS]",
            "caused_by": {
              "type": "number_format_exception",
              "reason": "For input string: \"foo\""
            }
          }
        }
      ]
    }
  }
]

With a parsing error:

[
  {
    "created": 0,
    "errors": {
      "other": [
        "Invalid opening quote at line 2"
      ]
    }
  }
]

@Bargs added the review and Feature:Add Data (Add Data and sample data feature on Home) labels, Apr 9, 2016
@Bargs commented Apr 9, 2016

@epixa I thought you might be interested in reviewing this since it seems like the inverse of the CSV export API you're building. But let me know if you already have too much on your plate and I can find someone else to look at it.

@Bargs commented Apr 12, 2016

jenkins, test it

responseStream.write('[');
csv.pipe(parser);

hi(parser)
Contributor:

As far as I can tell, none of this is really specific to the csv import feature, so I think almost all of this logic should be pulled out of this handler and into a standalone library. The library should basically have one purpose: given a stream of docs, send those docs off to the bulk api and return a stream response. The route handler would then simply be responsible for parsing the csv payload, sending it through the import function, and piping the response back to the client.
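
Roughly, I'm picturing something like this (names, signatures, and the doc shape are illustrative only, not a prescription):

```js
// Hypothetical standalone importer: given a highland stream of docs, bulk-index
// them in batches and emit one { created, errors } result object per batch.
const hi = require('highland');

function createDocImporter(client, index, { batchSize = 1000 } = {}) {
  return (docStream) =>
    docStream
      .batch(batchSize)
      .flatMap((docs) =>
        hi(
          client.bulk({
            body: docs.reduce((body, doc) => {
              // the "default" type and the { _id, _source } doc shape are assumptions
              body.push({ index: { _index: index, _type: 'default', _id: doc._id } });
              body.push(doc._source);
              return body;
            }, [])
          })
        )
      )
      .map((response) => {
        const indexErrors = response.items
          .filter((item) => item.index.error)
          .map((item) => ({ _id: item.index._id, error: item.index.error }));
        return {
          created: response.items.length - indexErrors.length,
          errors: indexErrors.length ? { index: indexErrors } : undefined
        };
      });
}

module.exports = createDocImporter;
```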

Contributor:

This would also have the added benefit of allowing more thorough unit testing of the individual parts of the stream handling process, which will help clarity and reliability of the streaming process.

Contributor Author (@Bargs):

I had the exact same thought actually. I'd also like the _bulk input to support JSON files (or even JSON request bodies) in addition to CSVs, which the library approach would make easier.

However, my tendency is to way over engineer things (see: the first 3 implementations of the ingest API), so this is my attempt to keep to a simple MVP that's good enough to power the CSV upload UI.

Contributor:

I feel you on the whole over-engineering thing, but I don't think this falls into that category. Over-engineering is making things more complicated than they need to be, but I actually think having all of this code strung across the request handler makes this code more complicated than if it's pulled out into a library with clear boundaries. It also makes it impossible to unit test, which I think is a bug even in an MVP.

Contributor:

To elaborate on that last point: MVP should describe the minimum capabilities of the product rather than the quality of code that powers it, and it's generally hard to make the case that code is quality if it can't be unit tested.

Contributor:

👍 to not trying to build a third-party API in this initial implementation, but that shouldn't mean we don't write tests to cover the code paths we create. Again, I go back to my product-vs-code point from before: the vast majority of our code isn't publicly consumable (or at least we don't want people to use it directly), but that doesn't mean we don't benefit from testing it. Those ~24 code paths aren't accidental or anything; we're not talking about accounting for every possible input type, we're just talking about testing the code paths that are written.

Things like batching size, parallelization, and back pressure are handled by highland because of the way you've coded this, but any developer at any time could make even trivial changes to this file that could break any or all of those things in subtle ways that could have effects we couldn't possibly anticipate through automated testing. That's why we should assert that these things are being set up and configured the way they need to be, rather than hoping our tests will catch the unexpected effects of breaking them.

That doesn't mean we need to re-test all of the capabilities of highland, though. If we are comfortable assuming that highland will handle those things for us so long as we set them or configure them properly (which I am, and it sounds like you are), then we would only need to assert that we're configuring them properly. Those assertions also help codify important decisions that are made at implementation time, like the choice to do a lower batch size.
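
For example, something along these lines (written against the hypothetical createDocImporter sketched above, in the mocha/expect.js style used elsewhere in Kibana, just to illustrate the kind of assertion I mean):

```js
const expect = require('expect.js');
const hi = require('highland');
const createDocImporter = require('./create_doc_importer'); // hypothetical module path

describe('createDocImporter', () => {
  it('sends documents to the bulk API in batches of the configured size', (done) => {
    const batchSizes = [];
    const stubClient = {
      bulk: ({ body }) => {
        batchSizes.push(body.length / 2); // two bulk body entries (action + source) per doc
        const items = body.filter((entry, i) => i % 2 === 0).map(() => ({ index: {} }));
        return Promise.resolve({ items });
      }
    };

    const docs = Array.from({ length: 5 }, (v, i) => ({ _id: `doc${i}`, _source: { n: i } }));

    createDocImporter(stubClient, 'names', { batchSize: 2 })(hi(docs)).toArray((results) => {
      expect(batchSizes).to.eql([2, 2, 1]); // the batching decision is pinned down here
      expect(results.length).to.be(3);
      done();
    });
  });
});
```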

As for the rest of those things, I didn't mean to imply that you weren't testing for them, just that my 24 code paths number wasn't even considering those things. Many of those things are definitely best verified through functional tests just as many others are best verified through unit tests. You probably are verifying a bunch of them already.

To be honest, I think a boatload of context is being lost in this comment chain. Want to zoom tomorrow? Maybe we could bring in some other people to get some fresh perspectives as well?

Contributor Author (@Bargs):

Would be happy to Zoom. However, I would like you to provide some concrete examples of the types of tests you'd like to see, as there's a lot of hand waving going on here.

Contributor Author (@Bargs):

> any developer at any time could make even trivial changes to this file that could break any or all of those things in subtle ways that could have effects we couldn't possibly anticipate through automated testing. That's why we should assert that these things are being set up and configured the way they need to be, rather than hoping our tests will catch the unexpected effects of breaking them.

And I'd also like concrete examples of ways in which a developer could make trivial changes that break everything, where a unit test would save them, and they wouldn't just change the unit test in addition to the code if the subtle breakages are in fact impossible to anticipate and the developer is incapable of detecting them when testing their changes.

Contributor:

Sure, I can come up with some examples using this code before we chat about it.

I'm not sure what you meant by the follow-up comment, though. Developers will always be able to modify tests, and doing so will always mean that the code isn't working the way it was before.

Contributor Author (@Bargs):

It sounds like you want tests to verify the value of certain configuration options, like batch size, to prevent bugs that "we couldn't possibly anticipate" caused by a developer changing that value. What I'm saying is that if the bug is in fact impossible to anticipate, a unit test isn't going to prevent the developer from changing that value.

But I could be totally wrong about what you mean, so let's table that discussion until we have some concrete examples of unit tests that we can dissect.

"type": "double",
"doc_values": true
},
"CountryFull": {
Contributor:

You can drop this and all of the other "type": "text" mappings since the _default_ dynamic template will catch them

Contributor Author (@Bargs):

Ah that's right, ES creates text and keyword fields out of strings by default now, doesn't it?

Contributor:

You can still drop all of these "text" fields. Even outside of Elasticsearch's defaults, you have a _default_ mapping that will catch all of the text fields anyway.

Contributor Author (@Bargs):

See my other comment about this.

@epixa commented Apr 13, 2016

I'm not sure why github collapsed the one longer thread we have going, so I'm linking it here for other reviewers to add their thoughts: #6844 (diff)

If we add anything more to it, we should probably do so on this main PR comment thread so they don't continue to disappear into the void.

@rashidkpc:
It's because it was a comment on code that has changed, so I guess GitHub decides it's resolved. But yeah, in general, anything large in scope should go in the main thread or it gets lost.

@rashidkpc:
jenkins, test it

@rashidkpc:
@Bargs can you merge master on this? That should fix the tests

@Bargs commented Apr 26, 2016

Hmmm I rebased but it's still busted, I'll have to look into it when I get some time.

@Bargs commented Apr 28, 2016

Tests are passing again. @epixa and I are going to discuss the unit testing more on Friday. @rashidkpc in the meantime could you give this a review as well?

@Bargs assigned rashidkpc and unassigned Bargs, Apr 28, 2016
"_default_": {
"dynamic_templates": [
{
"string_fields": {
Contributor:

This will catch all of your text/string fields, no reason to define them explicitly.

Contributor Author (@Bargs):

I need the explicit mappings, otherwise I'll get mapping exceptions when indexing the data. For instance, some of the ZipCode values are actually strings, but Elasticsearch will map it as a long if the first value it sees looks like a number. So ZipCode needs to be explicitly mapped as a string.

Some of the fields might not strictly require explicit mappings, but to make the tests as robust and repeatable as possible, I went ahead and mapped everything.
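
To make that concrete, the idea is a catch-all dynamic template plus a handful of explicit overrides, along these lines (a simplified sketch, not the exact fixture mapping):

```json
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": { "keyword": { "type": "keyword" } }
            }
          }
        }
      ],
      "properties": {
        "ZipCode": { "type": "text" }
      }
    }
  }
}
```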

@LeeDr commented May 10, 2016

LGTM

Lee's discovery test (with @Bargs' help):

In this case I started elasticsearch and kibana with npm run test:ui:server so my kibana port is 5620.
Kibana version header is required.
I loaded one of the same test data csv files from the API tests.

$ curl -s -XPOST 'localhost:5620/api/kibana/names/_bulk' -H "kbn-version: 5.0.0-snapshot" --data-binary @test/unit/data/fake_names.csv
[
{"created":97,"errors":{"index":[{"_id":"AVScm05upO2BM3CqF7Ec","error":{"type":"illegal_argument_exception","reason":"mapper [ZipCode] of different type, current_type [text], merged_type [long]"}},{"_id":"AVScm05upO2BM3CqF7Ed","error":{"type":"illegal_argument_exception","reason":"mapper [Pounds] cannot be changed from type [float] to [long]"}},{"_id":"AVScm05upO2BM3CqF7Ek","error":{"type":"illegal_argument_exception","reason":"mapper [ZipCode] of different type, current_type [text], merged_type [long]"}}]}}
]


@Bargs assigned ycombinator and unassigned Bargs, May 10, 2016
});
});

});

bdd.describe('optional parameters', function () {
bdd.it('should accept a custom delimiter query string param for parsing the CSV', function () {
- return request.post('/kibana/names/_bulk?delimiter=|')
+ return request.post('/kibana/names/_data?delimiter=|')
Contributor:

Just noticed this. The | could be passed in as %7C because of URL encoding. Mind adding a test to make sure that works?
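
Something like this, following the existing test's style (how the CSV payload gets attached here is just a sketch):

```js
bdd.it('should also accept the delimiter when it is URL-encoded as %7C', function () {
  return request.post('/kibana/names/_data?delimiter=%7C')
    .send('Name|Age\nJane|30\n') // hypothetical pipe-delimited payload
    .expect(200);
});
```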

@ycombinator commented May 11, 2016

I made the following request:

$ curl -kv -H'kbn-version: 5.0.0-snapshot' -XPOST https://localhost:5601/dun/api/kibana/foo/_data -d '
foo,bar
1,2'

And got the following response:

HTTP/1.1 200 OK
[
{"created":0,"errors":{"other":["Number of columns on line 2 does not match header"]}}
]

I have a few questions/comments:

  1. Why is the kbn-version header required? Is this so we can effectively version the API, should its interface change in the future?
  2. The error above is being caused because I have an extra newline at the very start of my input. Either we should strip whitespace at the start and end of the input (as a whole, not from every individual row, to be clear) OR we should detect that there is actually no header specified at all (as the first line of input is empty) and make that the error message.
  3. I strongly recommend not responding with a JSON array. This makes it impossible to extend in the future. Instead I suggest responding with a JSON object with a property for the array. This way, if you need to add properties to the response in the future, you have that extensibility available to you without needing to break backwards compatibility.
  4. I'm guessing the reason you are returning the error as an element of the array is because you might run into errors on individual lines of the input CSV (but please correct me if there's another reason). However, there is a class of errors that apply to the entire input payload (such as the error above) or the entire request in general (such as the authentication error example shown below). For this class of "global-level" errors, I suggest that this API not return a 200 OK but the appropriate status code > 399. Also, if you take the suggestion in point 3, you can then attach these "global-level" error messages as properties to the root JSON object in the response.

Authentication error example, request + response:

$ curl -k -H'kbn-version: 5.0.0-snapshot' -XPOST https://localhost:5601/dun/api/kibana/foo/_data -d 'foo,bar
1,2'
[
{"created":0,"errors":{"other":["Unauthorized"]}}
]

@ycombinator commented May 11, 2016

Currently the API (really Elasticsearch) auto-assigns IDs for each row in the CSV. Would it be useful for the API (and by extension, the UI) to expose a parameter that will allow the user to specify an _id column in the CSV? The API would then use the value of this column in each row as the _id of the corresponding document in Elasticsearch. This is something that can be added later too, if you don't know of a use case for this at the moment.

@ycombinator:
General note about query string parameters on this API: This API is named _data because, in the future, it could take data formats other than CSV as input. However, some of the parameters on this API (such as delimiter) might only make sense for certain formats (such as CSV). Also the same parameter might end up meaning different things, depending on the format of the data.

A couple of thoughts:

  • Perhaps it's not such a bad idea to make this API specific to CSV input, by naming it _csv, or
  • Perhaps the parameters that are specific to certain formats should be namespaced like csv_delimiter

@epixa commented May 11, 2016

kbn-version is required for all requests to the Kibana server for CSRF protection.

@Bargs commented May 11, 2016

> Why is the kbn-version header required? Is this so we can effectively version the API, should its interface change in the future?

kbn-version is actually unrelated to the API, it's a header we added to protect against XSRF.

> The error above is being caused because I have an extra newline at the very start of my input. Either we should strip whitespace at the start and end of the input (as a whole, not from every individual row, to be clear) OR we should detect that there is actually no header specified at all (as the first line of input is empty) and make that the error message.

Setting the skip_empty_lines config value to true might solve this, let me give that a try.
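
Something like this in the parser setup (csv-parse option names; the surrounding handler wiring is simplified):

```js
const parse = require('csv-parse');

// Tolerate blank lines, including a leading one before the header row.
const parser = parse({
  columns: true,          // use the first non-empty row as the header
  skip_empty_lines: true,
  delimiter: delimiter || ','  // delimiter from the query string, if provided
});
```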

> I strongly recommend not responding with a JSON array. This makes it impossible to extend in the future. Instead I suggest responding with a JSON object with a property for the array. This way, if you need to add properties to the response in the future, you have that extensibility available to you without needing to break backwards compatibility.

I think that's a good point but it's a little tricky since the response is streamed to the client. Making the response an object will make it a bit harder to parse in a streaming manner. The way I think about it, the array is a container for 1 or more response objects. It could just as easily be 1 or more json objects separated by newlines, but I wanted to make it valid json so clients could consume it easily if they don't care about reading the streaming results as they arrive. So if we were to add a property in the future, it would be to the response objects, not the array.
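
For illustration, a client that doesn't care about incremental parsing can simply buffer the chunks and parse the whole array once the response ends, while still getting progress from the raw chunks as they arrive (a rough sketch, not part of this PR):

```js
const http = require('http');
const fs = require('fs');

const req = http.request(
  {
    host: 'localhost',
    port: 5601,
    path: '/api/kibana/names/_data',
    method: 'POST',
    headers: { 'kbn-version': '5.0.0-snapshot' }
  },
  (res) => {
    let body = '';
    res.on('data', (chunk) => {
      body += chunk;                               // each chunk is roughly one result object
      console.log('progress chunk:', chunk.toString());
    });
    res.on('end', () => {
      const results = JSON.parse(body);            // the complete payload is a valid JSON array
      const created = results.reduce((sum, r) => sum + r.created, 0);
      console.log(`created ${created} documents in total`);
    });
  }
);

fs.createReadStream('fake_names.csv').pipe(req);
```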

> I'm guessing the reason you are returning the error as an element of the array is because you might run into errors on individual lines of the input CSV (but please correct me if there's another reason).

Exactly.

> However, there is a class of errors that apply to the entire input payload (such as the error above) or the entire request in general (such as the authentication error example shown below).

So, a couple of things here:

  • Unfortunately the CSV parsing lib does nothing to distinguish between line level errors and global errors. So we'd have to rely on matching the error text itself which would be a bad idea.
  • Is that unauthorized error coming from elasticsearch? Are you getting it because you're unauthorized to add data to a specified index or because you're unauthorized to use the bulk API? What if Shield's field level security is being used, and some indexing requests are unauthorized but some are ok?

For the sake of streaming, I really like the simplicity of treating each response object in the array as its own independent chunk, and I think most errors that can occur really are line specific. But it does seem to make sense to return a 403 if, for instance, the user has no write permissions on the index they're trying to add data to. That would require pre-checking the user's permissions, though, which we don't really do anywhere else, and it seems like more of a cross-cutting concern that should be handled in the security plugin rather than adding security-specific code to the API route handler itself. I dunno, I'm pretty torn on this one.

@Bargs commented May 11, 2016

> Currently the API (really Elasticsearch) auto-assigns IDs for each row in the CSV.

Actually the API creates an ID from the filename and line number: https://github.com/elastic/kibana/pull/6844/files#diff-acac6333b1d8bc4acac4ec269419e41aR51

I do think the ability to specify an ID column would be useful, but that's something else I think I'd like to take a wait and see approach on. Having the line number is really useful in the CSV upload UI so that I can point users to the exact line that caused an error.
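
Roughly (the exact code is in the linked file; the format matches the "_id": "L3 - smalltest.csv" example in the PR description):

```js
// Derive a human-readable _id from the uploaded filename and the CSV line
// number, so indexing errors can point back at the offending line.
function docId(fileName, lineNumber) {
  return `L${lineNumber} - ${fileName}`; // e.g. "L3 - smalltest.csv"
}
```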

@Bargs commented May 11, 2016

> Perhaps the parameters that are specific to certain formats should be namespaced like csv_delimiter

I like that idea, since there are other params like pipeline that could apply to any input type.

@ycombinator:
@Bargs and I chatted out-of-issue about the structure of the API response. I'll try to summarize:

Given that this is an internal-only API at the moment, we go with whatever is simplest to generate (from the server side) and consume (from the client side). So we keep the response structure as-is: a top-level array containing response objects. If/when we release this API publicly, we'll need to think harder about the implications of that structure per points 3 and 4 in my comment above.

@ycombinator:
> Actually the API creates an ID from the filename and line number: https://github.com/elastic/kibana/pull/6844/files#diff-acac6333b1d8bc4acac4ec269419e41aR51

Ah, I missed that. Thanks for clarifying.

> I do think the ability to specify an ID column would be useful, but that's something else I think I'd like to take a wait and see approach on.

The wait-and-see approach about letting users specify an ID column sounds good!

@Bargs commented May 11, 2016

@ycombinator just pushed some updates.

  • Data dir is now fixtures
  • Empty lines at the beginning of a file will no longer cause errors
  • Changed delimiter to csv_delimiter

I think that covers everything we agreed to update. Let me know how we're looking.

@Bargs commented May 11, 2016

jenkins, test it

@ycombinator:
LGTM.

@Bargs assigned Bargs and unassigned ycombinator, May 11, 2016
@Bargs merged commit 5918737 into elastic:feature/ingest on May 11, 2016