Support for Select, Scan and Search queries. #83

anskarl · 2019-12-31T17:13:02Z

Scruid currently supports aggregation queries (timeseries, group-by and top-n). It would also be useful to extend the library to support Select, Scan and Search queries.

While it is straightforward to implement such queries in Scruid, the format of the resulting data is different and cannot be handled by the current implementation.

Specifically, the resulting data for timeseries and group-by queries looks like this:

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { ... }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { ... }
  }
]

It is an array of JSON objects, each composed of a timestamp and a result, where the result is itself a JSON object.

The format of top-n queries is slightly different: each time-stamped row contains a result that is an array of JSON objects:

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [{ ... }, { ... } ... ]
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": [{ ... }, { ... } ... ]
  }
]

The resulting data (array of JSON structures) of any aggregation query (timeseries, group-by and top-n) is handled by the class ing.wbaa.druid.DruidResponse and each result (array or not) is represented by the class ing.wbaa.druid.DruidResult.

Select queries return raw Druid rows and support pagination. The format of the resulting data is close to that of the aggregation queries: an array of JSON objects, each with a timestamp and a result that is a JSON object:

[{
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
    },
    "events" : [ {
      "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 0,
      "event" : { ... }
    }, ...
    ]
  }
}, ...
]

The only difference is that the result structure contains an array of events, so it requires a different implementation of ing.wbaa.druid.DruidResponse.
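
As a rough illustration of that shape, the Select response could be modelled along these lines. This is only a sketch: all names below (SelectSketch, SelectEntry, SelectResult, SelectEvent, rows) are hypothetical and not part of Scruid, and event values are simplified to strings.

```scala
// Hypothetical model of the Select response shape; not Scruid's actual classes.
object SelectSketch {
  // One raw row: the segment it came from, its offset and its column values.
  // Assumption: column values are simplified to String for this sketch.
  final case class SelectEvent(segmentId: String, offset: Int, event: Map[String, String])

  // The "result" object of one time-stamped entry.
  final case class SelectResult(pagingIdentifiers: Map[String, Int], events: List[SelectEvent])

  // One element of the top-level JSON array.
  final case class SelectEntry(timestamp: String, result: SelectResult)

  // Flattens all entries into the raw rows carried by their events.
  def rows(response: List[SelectEntry]): List[Map[String, String]] =
    response.flatMap(_.result.events.map(_.event))
}
```

The point of the sketch is that the rows sit one level deeper than in aggregation results, inside result.events, which is why the existing result handling does not fit as-is.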

Scan queries do not support pagination like Select queries, but they are more efficient and return rows in streaming mode. Compared to aggregation queries, each entry of the result does not contain a timestamp but a segmentId. The timestamp, however, can be retrieved from the inner event structures. Below is an example fragment of the resulting data of a scan query:

[ {
    "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
    "columns" : [ "timestamp", "dim1", "dim2", ... ],
    "events" : [ { "timestamp" : "2013-01-01T00:00:00.000Z", "dim1": "some_value", "dim2": "some_other_value", ... }, { ... }, ... ]
  }, ...
]

Furthermore, scan queries can return data in a different format (compacted list) and also have a legacy mode that controls whether the time column is returned as timestamp or as __time. For details, see the official documentation.
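
To make the two variations concrete, here is a minimal sketch of how they might be handled. The names (ScanSketch, ScanEntry, eventTime, fromCompacted) are hypothetical and not Scruid's API, and values are simplified to strings. It assumes that in the compacted list format each row is an array of values in the same order as columns.

```scala
// Hypothetical helpers for the Scan response shape; not Scruid's actual classes.
object ScanSketch {
  // Assumption: column values are simplified to String for this sketch.
  type Row = Map[String, String]

  // One element of the top-level JSON array (list result format).
  final case class ScanEntry(segmentId: String, columns: List[String], events: List[Row])

  // The time column may appear under either key, depending on legacy mode.
  def eventTime(event: Row): Option[String] =
    event.get("__time").orElse(event.get("timestamp"))

  // Compacted list format: pair each value with the column at the same position.
  def fromCompacted(columns: List[String], row: List[String]): Row =
    columns.zip(row).toMap
}
```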

Search queries return dimension values that match the search specification. The format is close to that of top-n queries: each entry has a timestamp field and a result that is an array of JSON objects.

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "some_value",
        "count": 3
      },
      {
        "dimension": "dim2",
        "value": "some_value",
        "count": 1
      }, ...
    ]
  }, ...
]

The main difference here is that the JSON objects in result always have the same fields: the dimension, its value and the corresponding count. The issue is therefore that the list[T] and series[T] functions of ing.wbaa.druid.DruidResponse could only be applied with a class T having exactly those three fields. I think, however, that for practical reasons it is better to have list and series functions without type parameters that return a predefined class with those fields.
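
A minimal sketch of that idea follows. The names (SearchSketch, DimensionValueCount, SearchEntry, SearchResponse) are hypothetical, and the sketch assumes that series groups results by timestamp; timestamps are kept as plain strings to stay self-contained.

```scala
// Hypothetical sketch of a search response with a fixed result class.
object SearchSketch {
  // The three fields every search result carries.
  final case class DimensionValueCount(dimension: String, value: String, count: Long)

  // One time-stamped element of the top-level JSON array.
  final case class SearchEntry(timestamp: String, result: List[DimensionValueCount])

  final case class SearchResponse(entries: List[SearchEntry]) {
    // No type parameter: the element type is always DimensionValueCount.
    def list: List[DimensionValueCount] = entries.flatMap(_.result)

    // Assumption: series groups the results by their timestamp.
    def series: Map[String, List[DimensionValueCount]] =
      entries.groupBy(_.timestamp).map { case (ts, es) => ts -> es.flatMap(_.result) }
  }
}
```

Because the result class is fixed, callers would not need to define their own case class or supply a decoder for it.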

Given these issues, supporting Select, Scan and Search queries requires adapting ing.wbaa.druid.DruidResponse and ing.wbaa.druid.DruidResult, as well as applying minor changes to ing.wbaa.druid.client.DruidClient and ing.wbaa.druid.client.DruidResponseHandler.

- Changes interval query and response API in order to support the three new queries
- Legacy mode for Scan queries is configurable from application.conf
- Updates documentation and examples for new queries
- Updates unit tests to examine new queries
- All Druid Queries have toDebugString utility function, in order to get the corresponding native JSON representation of the query as a string (useful for debugging purposes)

ing-bankGH-83

anskarl · 2019-12-31T17:16:36Z

Commit 7c0f737 implements the aforementioned changes. An outline of the changes is given below:

DruidResponse is now a sealed trait with two implementations: DruidResponseTimeseriesImpl (for timeseries, group-by, top-n and select queries) and DruidResponseScanImpl (only for scan queries):
- DruidResponseTimeseriesImpl contains a list of DruidResult, which can be mapped to user-defined case classes
- DruidResponseScanImpl contains a list of DruidScanResults, each one having a list of individual DruidScanResult, which can be mapped to user-defined case classes
In contrast to other queries, search queries have a separate DruidResponseSearch, which does not extend DruidResponse but provides a similar API. The reason is that the results of search queries have a fixed format that does not depend on the schema of the datasource, so there is no reason to map them to user-defined case classes. For the same reason, the query functions of the DruidQuery trait have been moved to a separate sealed trait, DruidQueryFunctions, which is not extended by SearchQuery.
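
The general shape of such a design can be sketched as follows. This is a simplified illustration, not the code from the commit: the names (ResponseSketch, Response, TimeseriesLike, ScanLike, describe) are hypothetical, and decoding is reduced to a plain function from a row to T rather than whatever decoder mechanism the library uses.

```scala
// Hypothetical, simplified sketch of a sealed response trait with two implementations.
object ResponseSketch {
  // Assumption: a row is simplified to a map of string values.
  type Row = Map[String, String]

  sealed trait Response {
    // Maps every row to a user-defined type via the supplied decode function.
    def list[T](decode: Row => T): List[T]
  }

  // Time-stamped results, as returned by timeseries-style queries.
  final case class TimeseriesLike(results: List[(String, Row)]) extends Response {
    def list[T](decode: Row => T): List[T] = results.map { case (_, row) => decode(row) }
  }

  // Scan results: one list of rows per segment.
  final case class ScanLike(segments: List[List[Row]]) extends Response {
    def list[T](decode: Row => T): List[T] = segments.flatten.map(decode)
  }

  // Because the trait is sealed, the compiler can check this match for exhaustiveness.
  def describe(response: Response): String = response match {
    case TimeseriesLike(results) => s"${results.size} time-stamped results"
    case ScanLike(segments)      => s"${segments.size} scan segments"
  }
}
```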

ing-bankGH-83

- Simplify decoding functions by using `.toTry.get` instead of pattern matching.
- Rename DruidResponseScanImpl to DruidScanResponse

ing-bankGH-83
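
For readers unfamiliar with the idiom, the following sketch shows the kind of simplification `.toTry.get` allows, using only the Scala standard library. The names (DecodeSketch, parseCount, viaMatch, viaToTry) are made up for illustration and are not the decoding functions from the commit.

```scala
import scala.util.Try

// Hypothetical illustration of replacing a pattern match with .toTry.get.
object DecodeSketch {
  // Stand-in for a decoding step that yields either an error or a value.
  def parseCount(raw: String): Either[Throwable, Long] = Try(raw.toLong).toEither

  // Explicit pattern matching: return the value or rethrow the error.
  def viaMatch(raw: String): Long = parseCount(raw) match {
    case Right(value) => value
    case Left(error)  => throw error
  }

  // Same behaviour in one expression: Left becomes Failure, and get rethrows it.
  def viaToTry(raw: String): Long = parseCount(raw).toTry.get
}
```

Both variants throw the original exception when decoding fails, so the shorter form only makes sense where throwing is the intended failure mode.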

anskarl mentioned this issue Jan 1, 2020

Adds support for Select, Scan and Search queries #85

Merged

anskarl added a commit to anskarl/scruid that referenced this issue Jan 12, 2020

changes version to 2.3.1-SNAPSHOT

d9d307a

ing-bankGH-83

krisgeus closed this as completed Jan 18, 2020

anskarl mentioned this issue May 8, 2020

Release v2.4.0 #97

Closed
