Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Select, Scan and Search queries. #83

Closed
anskarl opened this issue Dec 31, 2019 · 1 comment
Closed

Support for Select, Scan and Search queries. #83

anskarl opened this issue Dec 31, 2019 · 1 comment

Comments

@anskarl
Copy link
Contributor

anskarl commented Dec 31, 2019

Scruid at the moment supports aggregation queries (timeseries, group-by and top-n). It would be also useful to extend the functionality of the library to support Select, Scan and Search queries.

While it is straightforward to implement such queries in Scruid, the format of the resulting data is different and cannot be handled by the current implementation.

Specifically, the format of the resulting data for timeseries and group-by queries is like below:

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { ... }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { ... }
  }
]

It is an array of JSON structures, each one is composed of a timestamp and a result which is a JSON structure.

The format of top-n queries is slight different, each time-stamped row contains a result which is an array of JSON structures:

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [{ ... }, { ... } ... ]
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": [{ ... }, { ... } ... ]
  }
]

The resulting data (array of JSON structures) of any aggregation query (timeseries, group-by and top-n) is handled by the class ing.wbaa.druid.DruidResponse and each result (array or not) is represented by the class ing.wbaa.druid.DruidResult.

Select queries return raw Druid rows and support pagination. The format of the resulting data is close to the aggregation queries, an array of JSON objects with timestamp and a result which is a JSON structure:

[{
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
    },
    "events" : [ {
      "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 0,
      "event" : { ... }
    }, ...
		
		]
	}, ...
]

The only difference is that the result structure contains an array of events, therefore it requires a different implementation of ing.wbaa.druid.DruidResponse.

Scan queries do not support pagination like Select queries, but are more efficient and return rows in streaming mode. Regarding the format of the result, compared to aggregation queries, it does not contain a timestamp but the segmentId. The timestamp, however, can be retrieved by the inner event structures. Below is an example fragment of the resulting data of a scan query:

[ {
    "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
    "columns" : [ "timestamp", "dim1", "dim2", ... ],
    "events" : [ { "timestamp" : "2013-01-01T00:00:00.000Z", "dim1": "some_value", "dim2": "some_other_value", ... }, { ... }, ... ]
	}, ...
]

Furthermore, scan queries can return data in different format (compacted list) and also have a legacy mode for the timestamp dimension, in which timestamp is being replaced by the __time dimension --- for details see official documentation.

Search queries return dimension values that match the search specification. The format is close to top-n queries, timestamp field and result is an array of JSON structures.

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "some_value",
        "count": 3
      },
      {
        "dimension": "dim2",
        "value": "some_value",
        "count": 1
      }, ...
    ]
  }, ...
]

The main difference here is that the format of the JSON structures in result is always composed of the same fields, that is dimension, its value and the corresponding count. So the issue here is that list[T] and series[T] functions of ing.wbaa.druid.DruidResponse can only be applied to any class having those three particular fields. I think, however, that for practical reasons it is better to have list and series functions without type parameters and return some predefined class with those fields.

With respect to the aforementioned issues, in order to support Select, Scan and Search queries, ing.wbaa.druid.DruidResponse and ing.wbaa.druid.DruidResult have to be adapted, as well as apply minor changes to ing.wbaa.druid.client.DruidClient and ing.wbaa.druid.client.DruidResponseHandler.

anskarl added a commit to anskarl/scruid that referenced this issue Dec 31, 2019
  - Changes interval query and response API in order to support the three new queries
  - Legacy mode for Scan queries is configurable from application.conf
  - Updates documentation and examples for new queries
  - Updates unit tests to examine new queries
  - All Druid Queries have toDebugString utility function, in order to get the corresponding native JSON representation of the query as a string (useful for debugging purposes)

ing-bankGH-83
@anskarl
Copy link
Contributor Author

anskarl commented Dec 31, 2019

Commit 7c0f737 implements the aforementioned changes. An outline of the changes is given below:

  • DruidResponse is now a sealed trait and has two implementations DruidResponseTimeseriesImpl (for timeseries, group-by, top-n and select queries) and DruidResponseScanImpl (only for scan queries)
    • DruidResponseTimeseriesImpl contains a list of DruidResult which can be mapped to user-defined case classes
    • DruidResponseScanImpl contains a list of DruidScanResults, each one having a list of individual DruidScanResult which can be mapped to user-defined case classes
  • In contrast to other queries, search queries have a separate DruidResponseSearch which does not extent DruidResponse, but provides similar API. The reason behind this is that the results of search queries have a specific format that does not depends on the schema of the datasource and therefore there is no reason to be mapped to user-defined case classes. Furthermore, for that reason the query functions of DruidQuery trait have been moved to a separate sealed trait DruidQueryFunctions and it is not extended by SearchQuery.

anskarl added a commit to anskarl/scruid that referenced this issue Jan 12, 2020
anskarl added a commit to anskarl/scruid that referenced this issue Jan 17, 2020
  - Simplify decoding functions by using `.toTry.get` instead of pattern matching.
  - Rename DruidResponseScanImpl to DruidScanResponse

ing-bankGH-83
@anskarl anskarl mentioned this issue May 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants