This document details the support of reading and writing parquet format from cascading.
- Read and Write ==============
In parquet-cascading sub-module, we provide support for reading/writing records of various data structures including Thrift(TBase), Scrooge and Tuples. Please refer to following sections for each data structures.
ParquetTbaseScheme is the interface for reading thrift records in Parquet format. Providing a ParquetTbaseScheme as a parameter to the constructor of a source enables the program to read Thrift object(TBase), eg.
Scheme sourceScheme = new ParquetTBaseScheme(Name.class) Tap source = new Hfs(sourceScheme, parquetInputPath);
In the above example Name is a thrift class that extends TBase. Under the hood parquet will generate a schema from the thrift class to decode the data.
The thrift class is actually optional to initialize a ParquetTBaseScheme when the data is written as Thrift records in Parquet. When writing thrift records to parquet format, the Thrift class of the records is stored as meta-data in the footer of the parquet file. Therefore when reading the file, if a thrift class is not explicitly provided, Parquet will use the class name stored in the footer as the thrift class.
ParquetTbaseScheme can also be used by a sink. When used as a sink, the Thrift class of the records being written must be explicitly provided.
Scheme sinkScheme = new ParquetTBaseScheme(Name.class); Tap sink = new Hfs(sinkScheme, parquetOutputPath);
For more concrete examples please refer to TestParquetTBaseScheme
Scrooge support is defined in a separate module called parquet-scrooge. With ParquetScroogeScheme, data can be read in the form of Scrooge objects which are more scala friendly.
Scheme sinkScheme = new ParquetScroogeScheme(Name.class); Tap sink = new Hfs(sinkScheme, parquetOutputPath);
Currently, the support for reading tuples is mainly(but not limited) for data written from pig scripts as pig tuples. More comprehensive support will be added, but in the mean time, there are some limitations to notice: Nested structures are not supported. If the data is written as thrift objects which have nested structure, it can not be read at current time. Data to read must be in flat structure. To read data as tuples, simply use ParquetTupleScheme:
Scheme sourceScheme = new ParquetTupleScheme(new Fields("last_name")); Tap source = new Hfs(sourceScheme, parquetInputPath);
For more examples please refer to TestParquetTupleScheme
- Projection Pushdown ====================== One of the big benefit of using columnar format is to be able to read only a subset of columns when the full schema is huge. It saves times by not reading unused columns.
Parquet support projection pushdown for Thrift records and tuples.
To read only a subset of attributes in a thrift class, the columns of interest should be specified using glob syntax. For example, for a thrift class as follows:
struct Address{
1: string street
2: string zip
}
struct Person{
1: string name
2: int16 age
3: Address addr
}
In the above example, when reading records of type Person, we can use following glob expression to specify the attributes we want to read:
-
Exact match: "
name
" will only fetch the name attribute. -
Alternative match: "
address/{street,zip}
" will fetch both street and zip in the Address -
Wildcard match: "
*
" will fetch name and age, but not address, since address is a nested structure -
Recursive match: "
**
" will recursively match all attributes defined in Person. -
Joined match: Multiple glob expression can be joined together separated by ";". eg. "name;address/street" will match only name and street in Address.
To specify the glob filter, simply set the conf with "parquet.thrift.column.filter" set to the glob expression string.
Map<Object, Object> props=new HashMap<Object, Object>();
props.put("parquet.thrift.column.filter","name;address/street");
HadoopFlowConnector hadoopFlowConnector = new HadoopFlowConnector(props);
When using ParquetTupleScheme, specifying projection pushdown is as simple as specifying fields as the parameter of the constructor of ParquetTupleScheme:
Scheme sourceScheme = new ParquetTupleScheme(new Fields("age"));