FAQ
Q: Which data sources are supported?
A: Please refer to: Available Connectors or look at the module list in the root of our github repository.
Q: Which data types are supported?
A: Please refer to: Supported Data Types or consult the README.md for the connector you are interested in for a more precise list.
Q: How do I view available databases, tables, columns, etc...?
A: You can run Hive compliant DDL statements against federated sources. We've also extended the Hive DDL grammar so you can run DDLs without registering your datasource. Some examples are below.
```sql
show databases in `lambda:<function_name>`;
show tables in `lambda:<function_name>`.<database>;
describe `lambda:<function_name>`.<database>.<table>;
```
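Once you know the schema, you can query the federated source directly by referencing the Lambda function as a catalog. The function, database, and table names below are hypothetical placeholders:

```sql
-- Hypothetical function and table names; substitute your own.
SELECT *
FROM "lambda:my_connector".my_database.my_table
WHERE year = '2023'
LIMIT 10;
```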
Q: What limitations come with using Lambda for BigData?
A: The most relevant restrictions are the 15-minute max runtime and 3GB max memory available to any given Lambda invocation. In practice we've not found these to be a limiting factor for most data sources. If the source system that you are federating to supports partitioning or parallel scans, Athena will use multiple Lambda invocations to extend both the max runtime and the total working memory available to the connector. Since Athena's execution engine only delegates Table Scan operations to your connector, your query can actually run much longer than 15 minutes and use considerably more memory than 3GB; these restrictions apply only to the fragments of the SQL Table Scan operations that Athena delegates to Lambda.
A less obvious limit imposed by Lambda relates to how your Lambda function sends data to Athena. Lambda does not support streaming responses and also limits response (and request) sizes to ~6MB. The Athena Query Federation SDK will automatically encrypt and spill large responses to S3 in batches that allow Athena's engine to pipeline reads and improve performance. You own and configure the S3 bucket that is used for spill, as well as the encryption key source (KMS or /dev/random) that should be used for any spilled data.
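The batching behavior described above can be sketched as follows. This is an illustrative stand-in, not the actual SDK implementation: the size thresholds and the `build_response` helper are hypothetical, and encryption is omitted for brevity.

```python
# Illustrative sketch (NOT the real SDK API): split a connector response
# into batches once it exceeds Lambda's ~6MB response limit, so each
# batch can be "spilled" and read independently by the engine.

RESPONSE_LIMIT_BYTES = 6 * 1024 * 1024  # approximate Lambda response cap
BATCH_BYTES = 1 * 1024 * 1024           # hypothetical spill batch size

def build_response(rows, encode=lambda r: repr(r).encode()):
    """Return rows inline if small enough, else spill them in batches."""
    encoded = [encode(r) for r in rows]
    total = sum(len(e) for e in encoded)
    if total <= RESPONSE_LIMIT_BYTES:
        return {"inline": encoded, "spilled": []}
    # Spill path: group rows into fixed-size batches; each batch would be
    # encrypted and written to S3 separately so reads can be pipelined.
    spilled, batch, size = [], [], 0
    for e in encoded:
        if size + len(e) > BATCH_BYTES and batch:
            spilled.append(batch)
            batch, size = [], 0
        batch.append(e)
        size += len(e)
    if batch:
        spilled.append(batch)
    return {"inline": [], "spilled": spilled}
```

The key design point is that spill is transparent to the connector author: the SDK decides between the inline and spill paths based on response size.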
Q: What is a 'Connector'?
A: A 'Connector' is a piece of code that can translate between your target data source and Athena. Today this code is expected to run in an AWS Lambda function but in the future we hope to offer more options. You can think of a connector as an extension of Athena's query engine. Athena will delegate portions of the federated query plan to your connector.
For a more detailed explanation of the functionality that comprises a connector, see: MetadataHandler, RecordHandler, UserDefinedFunctionHandler
Q: How does Athena Query Federation make use of Apache Arrow?
A: Any time Athena and your connector/UDF need to exchange Schema or typed data (e.g. row data), Apache Arrow is used. More precisely, Apache Arrow is used in the following situations:
- Table Schema uses Apache Arrow Schema
- Query Projection uses Apache Arrow Schema
- Predicate Constraints use Apache Arrow Record Batches to hold the typed values of the constraints.
- Partition List uses Apache Arrow Record Batches since Partitions almost always correspond to specific values of certain columns (e.g. partition columns).
- Row Data is communicated using Apache Arrow Record Batches for both UDFs and Connectors.
An important thing to understand is that the Athena Query Federation SDK offers a set of utilities (e.g. BlockUtils, S3BlockSpiller, Projectors) that make working with Apache Arrow easier for beginners but add overhead from type boxing and coercion. For non-columnar sources the overhead is often negligible, but for sources that can take better advantage of Apache Arrow's native interfaces we plan to offer a more appropriate abstraction. The SDK also allows you to bypass these abstractions and make direct use of Apache Arrow; we try not to be prescriptive.
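The schema/record-batch model above can be illustrated with plain-Python stand-ins. This is a conceptual sketch only, not the real Apache Arrow API; the `Field`, `Schema`, and `RecordBatch` classes here are simplified analogues:

```python
# Conceptual stand-ins for Arrow's Schema and RecordBatch (NOT the real
# Apache Arrow API): columnar storage plus a row-oriented view on top,
# mirroring how utilities like BlockUtils trade boxing overhead for ease.
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    type: str  # e.g. "int64", "utf8"

@dataclass(frozen=True)
class Schema:
    fields: tuple

class RecordBatch:
    """Columnar storage: one list per field, all the same length."""
    def __init__(self, schema, columns):
        assert len(schema.fields) == len(columns)
        self.schema = schema
        self.columns = columns
        self.num_rows = len(columns[0]) if columns else 0

    def rows(self):
        """Row-oriented view: convenient, but each row allocates a dict
        (the 'type boxing' overhead mentioned above)."""
        for i in range(self.num_rows):
            yield {f.name: col[i]
                   for f, col in zip(self.schema.fields, self.columns)}

schema = Schema((Field("id", "int64"), Field("name", "utf8")))
batch = RecordBatch(schema, [[1, 2, 3], ["a", "b", "c"]])
```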
Q: What data gets spilled to S3?
A: If the response from your Connector's Lambda function exceeds the Lambda response size limit of 6MB, the Amazon Athena Query SDK will automatically encrypt, batch, and spill the response to an S3 bucket that you configure. The response to Athena will include the location(s) in S3 where the query data has been spilled and also the query specific encryption key to use when reading the data. Athena never gets direct access to your source system and the entity running the Athena query must have access to the spill location in order for Athena to read the spilled data. The use of a unique (per-query per-split) encryption key along with a bucket you control helps ensure federation does not become an avenue for unintended data exfiltration or privilege escalation. We also recommend that customers set an S3 lifecycle policy to delete objects from the spill location since the data is not needed once the query completes.
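A lifecycle rule like the following, applied to the spill bucket, expires spilled objects automatically. The prefix is a hypothetical example; use whatever spill prefix you configured for your connector:

```json
{
  "Rules": [
    {
      "ID": "expire-athena-spill",
      "Status": "Enabled",
      "Filter": { "Prefix": "athena-spill/" },
      "Expiration": { "Days": 1 }
    }
  ]
}
```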
It is also worth noting that we are working on a mechanism that will eliminate the need for any spill in the future.
Q: What kinds of predicates does Athena support when applying Partition Pruning or Predicate Pushdown?
A: Only associative predicates are supported. Where relevant, Athena will supply you with the associative portion of the query predicate so that you can perform filtering or push the predicate into your source system for even better performance. It is important to note that the predicate is not always the query's full predicate. For example, if the query's predicate was "where (col0 < 1 or col1 < 10) and col2 + 10 < 100 and function(col3) > 19", only "col0 < 1 or col1 < 10" will be supplied to you at this time. We are still considering the best form for supplying connectors with a more complete view of the query and its predicate. We expect a future release to provide full predicates to connectors and let the connector decide which parts of the predicate it is capable of applying.
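The split described above can be sketched as follows. This is a hypothetical illustration, not SDK code: the connector applies only the pushed-down associative portion during its scan, while the engine evaluates the rest afterward.

```python
# Illustrative sketch: a table scan that applies only the pushed-down
# (associative) portion of the predicate. Column names are hypothetical.

def scan(rows, pushed_down):
    """Emit only rows satisfying the pushed-down portion of the predicate;
    Athena's engine applies the remaining, non-associative terms later."""
    return [r for r in rows if pushed_down(r)]

# Full predicate: (col0 < 1 OR col1 < 10) AND col2 + 10 < 100 AND f(col3) > 19
# Supplied to the connector today: only (col0 < 1 OR col1 < 10).
pushed = lambda r: r["col0"] < 1 or r["col1"] < 10
```

Pushing this portion into the source system (e.g. as a WHERE clause) instead of filtering in the connector yields the best performance, since fewer rows cross the wire.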
Q: Does Athena send query limits (e.g. select * from table limit 100) to connectors?
A: No, but this is something we are considering. It is not always possible to extract a semantically correct limit to send to your connector. For example, if the limit needs to be applied after a join or a function/aggregation which is not available in your connector, then Athena will not know in advance how many rows your connector should read. Additionally, if you cancel your query or the query fails, you will want to stop your connector's scan. Today, Lambda does not offer the ability to cancel an invocation. To reduce overscan we've instead built into the Athena Query Federation SDK a way for you to know when your connector can stop scanning. Athena will already avoid calling your Lambda for unnecessary splits once the query has scanned enough data, but you can use QueryStatusChecker to exit your Lambda early. You can see examples of how to use this facility in any of the connectors in this repository.
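The early-exit pattern that QueryStatusChecker enables looks roughly like the following. This is a pure-Python stand-in, not the actual SDK class; the function name and the `check_every` interval are hypothetical:

```python
# Sketch of the early-exit pattern (NOT the real QueryStatusChecker
# class): periodically ask whether the query is still running and stop
# scanning as soon as it is not, to avoid overscanning the source.

def read_with_early_exit(source_rows, is_query_running, check_every=1000):
    """Read rows, polling query status every `check_every` rows."""
    emitted = []
    for i, row in enumerate(source_rows):
        if i % check_every == 0 and not is_query_running():
            break  # query cancelled/completed; stop the scan early
        emitted.append(row)
    return emitted
```

Polling on an interval, rather than before every row, keeps the status-check overhead negligible relative to the scan itself.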
Q: I'd like to connect to XYZ, but XYZ is not in the list of available connectors. What should I do?
A: The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own code. This enables you to integrate with new data sources, proprietary data formats, or build in new user defined functions. If you are comfortable developing your own connector, we recommend going through our Getting Started Guide.
Alternatively, you can raise an issue and let us know what data source you feel would be valuable for us to support.
Q: What connectivity / networking requirements are there for Athena Federation?
A: From the outset we expected that customers wanting to run federated queries would have a diverse technology landscape, potentially comprised of many micro-services and application-specific VPCs that create islands of data. As such, Athena has no specific networking requirements in order to federate queries across your sources. Instead, Athena gives you the freedom to deploy connectors as Lambda functions in the appropriate VPC(s) that will enable the connector to communicate with your source. For example, you can deploy three different copies of the JDBC connector in order to have Athena join across MySQL in VPC1, Redshift in VPC2, and Postgres in VPC3. At no point would Athena be able to speak directly to these VPCs, nor would these VPCs require any peering or connectivity to each other.