Add a track to test nested / parent child performance #8

danielmitterdorfer · 2016-08-17T05:57:25Z

Index + search
We should also force a high update rate (to see the cost of updating nested docs)

danielmitterdorfer · 2016-09-22T08:17:28Z

We should implement elastic/rally#155 first before adding such a track.

jpountz · 2017-02-14T08:45:49Z

We discussed this in the search meeting as we had a significant regression with nested queries in 2.0 which is only being addressed now. We would like to catch such regressions earlier in the future and were thinking about writing a track that would use the StackOverflow dataset and index comments and answers as nested documents of the questions, which would be the top-level documents.

I agree the update rate would be an interesting thing to benchmark, but for now I think pure indexing speed + simple queries (both nested and non-nested, since the use of nested mappings forces ES to apply filters internally to exclude nested documents from e.g. match_all queries) would be a great start?

cc @markharwood

markharwood · 2017-02-14T10:11:29Z

The question I have is: do we want to test an artificial scenario where all Q&As are pre-fused as nested docs to be queried, or a more real-world scenario where new answers continually revise existing question docs while searches are also being serviced? I'm not sure whether Rally would support these search-while-reindexing scenarios.

jpountz · 2017-02-14T10:16:01Z

I think we should start with the pre-fused scenario for now, which should be simpler to implement and would have caught the 2.0 regression. I'm all for making benchmarking as realistic as possible but let's get there step by step?

danielmitterdorfer · 2017-02-14T10:19:12Z

> do we want to test an artificial scenario where all Q&As are pre-fused as nested docs to be queried

We usually implement our tracks this way to see effects in isolation.

The latter case could be implemented in a second step as a separate challenge. Just for reference, here's an implementation hint for this. You can index and search concurrently by defining the schedule as follows:

"schedule": [
  {
    "parallel": {
      "tasks": [
        {
          "operation": "bulk",
          "warmup-time-period": 240,
          "clients": 8,
          "target-throughput": 50
        },
        {
          "operation": "some-simple-query",
          "clients": 2,
          "warmup-iterations": 500,
          "iterations": 1000,
          "target-throughput": 50
        },
        {
          "operation": "some-complex-query",
          "clients": 2,
          "warmup-iterations": 500,
          "iterations": 1000,
          "target-throughput": 2
        }
      ]
    }
  }
]

markharwood · 2017-02-14T10:27:25Z

Cool.
I can pre-fuse some data on my laptop or we might want to benchmark that one-off fusion process.
There are typically two ways that process can be done using Elasticsearch:

1. Bulk load using scripted updates to append answers to question docs.
2. Index questions, index answers, then use the scroll API on the two indices sorted on a common key while a Python client assembles new docs and sends them to the bulk index API.

Do you want to benchmark either of these?
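
For reference, option 2 boils down to a merge join on the shared question id. Below is a minimal sketch of the client-side assembly step only. It assumes both scroll results arrive as iterables sorted by `qid` in an order that agrees with Python string comparison, and that each answer carries a `qid` field pointing at its question. The function name `fuse` and that answer-side `qid` field are assumptions for illustration; the scroll and bulk calls are left out.

```python
from itertools import groupby
from operator import itemgetter


def fuse(questions, answers):
    """Merge-join questions with their answers into nested documents.

    Both inputs must be sorted by "qid" in the same order, as a sorted
    scroll would return them. Answers without a matching question are skipped.
    """
    answer_groups = groupby(answers, key=itemgetter("qid"))
    current = next(answer_groups, None)
    for question in questions:
        qid = question["qid"]
        # Skip answer groups that sort before this question (orphaned answers).
        while current is not None and current[0] < qid:
            current = next(answer_groups, None)
        nested = []
        if current is not None and current[0] == qid:
            # Drop the join key; it is redundant inside the nested object.
            nested = [{k: v for k, v in a.items() if k != "qid"} for a in current[1]]
            current = next(answer_groups, None)
        yield {**question, "answers": nested}


questions = [
    {"qid": "1", "title": "first"},
    {"qid": "2", "title": "second"},
    {"qid": "4", "title": "fourth"},
]
answers = [
    {"qid": "1", "user": "u1"},
    {"qid": "1", "user": "u2"},
    {"qid": "3", "user": "orphan"},
    {"qid": "4", "user": "u3"},
]
fused = list(fuse(questions, answers))
```

Because both streams are consumed once and in order, memory use stays proportional to the answers of a single question rather than to the whole corpus.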

danielmitterdorfer · 2017-02-14T10:33:35Z

I'd lean toward option 1. If you want me to run this benchmark for our comparison charts with older releases (2.x, 1.7(?)), then we just need to make sure it's implemented in a way that gives an apples-to-apples comparison with older releases (i.e. I guess it's Groovy before 5.0 and Painless afterwards, but I think that's fine).
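
To make option 1 concrete, here is a rough sketch of how one answer could be turned into a bulk `update` action with a Painless script. Several things here are assumptions rather than anything decided in this thread: the index name `questions`, the helper name `answer_update_lines`, that each question is indexed with `_id` equal to its `qid`, and that it already has an `answers` array (possibly empty). The key that holds the script text differs between versions (`inline` on older 5.x releases, `source` later), so it is a parameter. Versions with mapping types would also need the type in the action line or the URL, and a pre-5.0 run would need an equivalent Groovy script.

```python
import json


def answer_update_lines(index, qid, answer, script_key="source", lang="painless"):
    """Return the two NDJSON lines of one bulk `update` action that appends
    `answer` to the `answers` array of the question with id `qid`.

    Assumes the question already exists and has an `answers` array.
    """
    action = {"update": {"_index": index, "_id": qid}}
    body = {
        "script": {
            script_key: "ctx._source.answers.add(params.answer)",
            "lang": lang,
            "params": {"answer": answer},
        }
    }
    return [json.dumps(action), json.dumps(body)]


lines = answer_update_lines(
    "questions",  # hypothetical index name
    "1000000",
    {"date": "2009-06-16T09:55:57.320", "user": "Michał Niklas (22595)"},
)
# The bulk API expects newline-delimited JSON with a trailing newline.
bulk_body = "\n".join(lines) + "\n"
```

Passing the answer through `params` rather than interpolating it into the script text keeps the script string constant across requests.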

markharwood · 2017-02-15T11:28:10Z

@danielmitterdorfer @jpountz Can you review the data/mapping below before I kick off an upload of the JSON data?

I'm proposing we have this basic data for each StackOverflow question:

// Example doc
{
   "title": "Display Progress Bar at the Time of Processing",
   "qid": "1000000",
   "answers": [
      {
         "date": "2009-06-16T09:55:57.320",
         "user": "Michał Niklas (22595)"
      },
      {
         "date": "2009-06-17T12:34:22.643",
         "user": "Jack Njiri (77153)"
      }
   ],
   "tag": [
      "vb6",
      "progress-bar"
   ],
   "user": "Jash",
   "creationDate": "2009-06-16T07:28:42.770"
}

That gives us a little free text and structured data in the root doc and just who/when data in the nested answer objects. I have a full StackOverflow dump as of Jun 2016; converted to the above format, the JSON is 3.64 GB unzipped and 700 MB zipped. The mapping I suggest is pretty basic:

   {
     "question": {
        "properties": {
           "answers": {
              "type": "nested",
              "properties": {
                 "date": {
                    "type": "date"
                 },
                 "user": {
                    "type": "keyword"
                 }
              }
           },
           "creationDate": {
              "type": "date"
           },
           "date": {
              "type": "date"
           },
           "qid": {
              "type": "keyword"
           },
           "tag": {
              "type": "keyword"
           },
           "title": {
              "type": "text"
           },
           "user": {
              "type": "keyword"
           }
        }
     }
   }

If it looks OK to you, I'll kick off an upload to the S3 benchmarks corpora store.
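
As an illustration of what this mapping allows, a nested query against `answers` could be built roughly as below. This is only a sketch: the helper name `answered_by` is hypothetical, and the queries in the eventual track may look different. The point is that both conditions must hold within the same nested answer object, which is what separates a `nested` field from a plain object field.

```python
def answered_by(user, since=None):
    """Build a search body matching questions that have at least one answer
    by `user`, optionally restricted to answers dated on or after `since`
    (an ISO-8601 date string)."""
    filters = [{"term": {"answers.user": user}}]
    if since is not None:
        filters.append({"range": {"answers.date": {"gte": since}}})
    return {
        "query": {
            "nested": {
                "path": "answers",
                "query": {"bool": {"filter": filters}},
            }
        }
    }


body = answered_by("Jack Njiri (77153)", since="2009-06-17")
```

The resulting dict can be serialized as the body of a search request or dropped into a track operation.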

jpountz · 2017-02-15T13:11:50Z

I think it is a good idea to have only minimal metadata, otherwise the indexing time that is specific to nested docs might be drowned out by full-text analysis + indexing. One minor suggestion would be to make the user field consistent between the question and answer objects (both in terms of mapping and format). Otherwise +1!

danielmitterdorfer · 2017-02-15T13:42:23Z

Looks great! I guess you're showing the master version of the track. For 5.x, you should also turn off _all. Thanks for tackling this!
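
For the 5.x variant, turning off `_all` amounts to adding `"_all": {"enabled": false}` next to `properties` in each type mapping. A small sketch that derives such a variant from a type-keyed mapping dict follows; the helper name `disable_all` is an assumption for illustration, and the mapping is abbreviated to two fields.

```python
import copy


def disable_all(mappings):
    """Return a copy of a {type_name: mapping} dict with `_all` disabled
    for every type. The input is left untouched."""
    result = copy.deepcopy(mappings)
    for type_mapping in result.values():
        type_mapping["_all"] = {"enabled": False}
    return result


master_mappings = {
    "question": {
        "properties": {
            "title": {"type": "text"},
            "qid": {"type": "keyword"},
        }
    }
}
mappings_5x = disable_all(master_mappings)
```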

markharwood · 2017-02-15T13:45:15Z

Thanks, both.

> Maybe one minor suggestion would be to make the user field consistent between the question and answer objects

Good spot - that's a quirk of that particular example doc. In some questions the ownerID is missing and we only have display name instead.

markharwood · 2017-02-21T10:31:35Z

@danielmitterdorfer I was getting into tangles with git/rally because I'd initially followed my usual practice of creating a local dev branch ("fix/8") for my PR, but then realised Rally manages branch switching, so to test master I have to develop on a local master branch.

But... I just tried moving my changes to a local master branch and it tested OK, so I pushed to master on my public repo here, but I cannot create a PR from master.

What's the best way forward here?

danielmitterdorfer · 2017-02-21T10:39:07Z

I suggest you do:

git checkout master
git checkout -b fix/8
git push name_of_your_clones_remote_here fix/8

This should work?

markharwood mentioned this issue Feb 20, 2017

Nested docs querying benchmark. #14

Merged

markharwood added a commit to markharwood/rally-tracks that referenced this issue Feb 21, 2017

Nested docs querying benchmark.

410ea05

Uses StackOverflow questions+answers nested docs with just title text, tags, authors and dates for fields. Closes elastic#8

markharwood closed this as completed in #14 Feb 21, 2017

markharwood added a commit that referenced this issue Feb 21, 2017

Nested docs querying benchmark. (#14)

8acdc3a

* Nested docs querying benchmark. Uses StackOverflow questions+answers nested docs with just title text, tags, authors and dates for fields. Closes #8

markharwood added a commit that referenced this issue Feb 21, 2017

Nested docs querying benchmark. (#14)

dc78eeb

* Nested docs querying benchmark. Uses StackOverflow questions+answers nested docs with just title text, tags, authors and dates for fields. Closes #8

markharwood added a commit that referenced this issue Feb 21, 2017

Nested docs querying benchmark. (#14)

240afa9

* Nested docs querying benchmark. Uses StackOverflow questions+answers nested docs with just title text, tags, authors and dates for fields. Closes #8

markharwood added a commit that referenced this issue Feb 21, 2017

Nested docs querying benchmark. (#14)

6793b33

* Nested docs querying benchmark. Uses StackOverflow questions+answers nested docs with just title text, tags, authors and dates for fields. Closes #8
