Deriving Spark DataFrame schemas from case classes.
struct-type-encoder is available on maven central with the following coordinates:
"com.github.benfradet" %% "struct-type-encoder" % "0.1.0"
When reading a DataFrame/Dataset from a data source the schema of the data has to be inferred. In practice, this translates into looking at every record of all the files and coming up with a schema that can satisfy every one of these records, as shown here for JSON.
As anyone can guess, this can be a very time-consuming task, especially if you know in advance the schema of your data. A common pattern is to do the following:
case class MyCaseClass(a: Int, b: String, c: Double)
val inferred = spark
.read
.json("/some/dir/*.json")
.as[MyCaseClass]
In this case, there is no need to spend time inferring the schema as the DataFrame is directly
converted to a Dataset of MyCaseClass
. However, it can be a lot of boilerplate to bypass the
inference by specifying your own schema.
import org.apache.spark.sql.types._
val schema = SructType(
StructField("a", IntegerType) ::
StructField("b", StringType) ::
StructField("c", DoubleType) :: Nil
)
val specified = spark
.read
.schema(schema)
.json("/some/dir/*.json")
.as[MyCaseClass]
struct-type-encoder derives instances of StructType
(how
Spark represents a schema) from your case class automatically:
import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val derived = spark
.read
.schema(StructTypeEncoder[MyCaseClass].encode)
.json("/some/dir/*.json")
.as[MyCaseClass]
No inference, no boilerplate!
This project includes JMH benchmarks to prove that inferring schemas and coming up with the schema satisfying all records is expensive. The benchmarks compare the average time spent parsing a thousand files each containing a hundred rows when the schema is inferred (by Spark, not user-specified) and derived (thanks to struct-type-encoder).
derived | inferred | |
---|---|---|
CSV | 5.936 ± 0.035 s | 6.494 ± 0.209 s |
JSON | 5.092 ± 0.048 s | 6.019 ± 0.049 s |
We see that when deriving the schemas we spend 16.7% less time reading JSON data and a 8.98% for CSV.