How to read excel files in a streaming fashion in spark 2.x #572

Open
tmljob opened this issue Apr 1, 2022 · 3 comments

tmljob commented Apr 1, 2022

I have a set of Excel files that need to be read by Spark (2.4.0) as each file arrives in a local directory. The Scala version is 2.11.8.

I've tried the readStream method of SparkSession, but I can't get it to read the files in a streaming fashion. The code is:

    val spark = SparkSession.builder().master("local[*]").appName("Spark SQL Example").getOrCreate()
    spark.sqlContext.setConf("spark.sql.streaming.schemaInference","true")
    import spark.implicits._
    val df = spark.readStream.format("com.crealytics.spark.excel").option("header", true).load("file:///filepath/*.xlsx")
    df.writeStream.format("memory").queryName("tab").start().awaitTermination()
    val res = spark.sql("select * from tab")
    res.show()

The error log is:

22/04/01 16:14:16 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.lang.UnsupportedOperationException: Data source com.crealytics.spark.excel does not support streamed reading
	at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:246)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
	at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:215)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
	at com.chinafusiongroup.dcp.ExcelStreamApp$.main(ExcelStreamApp.scala:12)
	at com.chinafusiongroup.dcp.ExcelStreamApp.main(ExcelStreamApp.scala)
22/04/01 16:14:16 INFO SparkContext: Invoking stop() from shutdown hook

Any answers would be helpful.

nightscape (Owner) commented

I fear you are out of luck here...
Streaming reads probably work in v2 (I haven't tested this myself yet), but we stopped supporting Scala 2.11 quite a while ago...
If you're proficient with Scala, you could try building it yourself for Scala 2.11, but I fear many dependencies have also stopped publishing packages for 2.11...


tmljob commented Apr 7, 2022

Thanks for your answer. I hope streaming will be supported in a new version as soon as possible. After some research and verification, we are currently using hadoopoffice (spark-hadoopoffice-ds) to meet our needs.

tonicava commented

Convert your xls files to CSV first; then you can read the data as a stream.
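
A minimal sketch of the streaming half of this suggestion, assuming the Excel files have already been converted to CSV by some separate step that is not shown here. The object name `CsvStreamSketch`, the directory `file:///filepath/csv/`, and the two-column schema are hypothetical placeholders, not anything from the original post. The sketch uses Spark's built-in `csv` streaming source, which needs an explicit schema unless `spark.sql.streaming.schemaInference` is enabled.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvStreamSketch {
  // Hypothetical schema: replace with the columns of your converted sheets.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("name", StringType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CsvStreamSketch")
      .getOrCreate()

    // Hypothetical directory where the converted CSV files are dropped.
    val csvDir = "file:///filepath/csv/"

    // The built-in csv source supports streamed reading of a directory;
    // each new file that appears there is picked up as new input.
    val df = spark.readStream
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .load(csvDir)

    // Print each micro-batch to the console as files arrive.
    val query = df.writeStream
      .format("console")
      .outputMode("append")
      .start()

    // Blocks until the query is stopped, so this is the last statement.
    query.awaitTermination()
  }
}
```

Note that `awaitTermination()` blocks the calling thread, so any code placed after it only runs once the query stops. To query a `memory` sink with `spark.sql` while the stream is running, that code would need to run before the blocking call.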
