Dude, you really need to chill. Have you ever worked with a large open source project before? It seems not. Insinuating that there are tons of bugs that went undiscovered until you came along (despite the fact that the project is used by millions across many different organizations) is ludicrous. Learn a little humility.
If you're new to something, assume you have made a mistake rather than that you have found a bug. Lurk a bit more, or even do a simple Google search, and you will realize Sean is a very senior committer (i.e., an expert) on Spark, and has been for many years. He, and everyone else participating in these lists, does so voluntarily, on their own time. They're not being paid to hand-hold you and answer your every whim.
This is probably more of a question for the user support list, but I believe I understand the issue.
Schema inside of Spark refers to the structure of the output rows. For example, the schema for a particular DataFrame could be (User: Int, Password: String): two columns, the first named User of type Int and the second named Password of type String.
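For illustration, a schema like that can be built explicitly with Spark's StructType; the column names here are just the ones from the example above:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Two columns: User (Int) and Password (String)
val staticSchema = StructType(Seq(
  StructField("User", IntegerType),
  StructField("Password", StringType)
))

// Compact rendering of the structure
println(staticSchema.simpleString)  // struct<User:int,Password:string>
```

This object describes only column names, types, and nullability; it carries no information about where the data comes from or how it is parsed.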
When you pass the schema from one reader to another, you are only copying this structure, not any of the other options associated with the DataFrame. This is usually useful when you are reading from sources with different options but whose data needs to be read into the same structure.
Other properties such as "format" and "options" exist independently of the schema. This is helpful if, for example, I were reading from both MySQL and a comma-separated file. While the schema is the same, an option like "inferSchema" does not apply to both MySQL and CSV, and "format" actually picks whether to use "jdbc" or "csv", so copying those wouldn't be helpful either.
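A minimal sketch of that situation: one schema shared by a CSV reader and a JDBC reader, each with its own format and options. The temp-file setup just stands in for a real CSV source, and the JDBC URL and table name mentioned in the comment are hypothetical:

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").appName("schema-reuse").getOrCreate()

// One structure meant to be shared by both sources
val userSchema = StructType(Seq(
  StructField("User", IntegerType),
  StructField("Password", StringType)))

// Stand-in for a real CSV source: write one row to a temp directory
val dir = Files.createTempDirectory("users").toFile
val pw = new PrintWriter(new File(dir, "part-0.csv"))
pw.write("1,hunter2\n"); pw.close()

// CSV reader: the schema is copied in; format and options stay per-reader
val csvDF = spark.read
  .schema(userSchema)
  .format("csv")
  .load(dir.getAbsolutePath)

// A JDBC reader would reuse the same userSchema but with entirely different
// properties, e.g. .format("jdbc").option("url", "jdbc:mysql://...").option("dbtable", "users")

println(csvDF.schema.simpleString)  // struct<User:int,Password:string>
```

Passing `userSchema` to both readers guarantees identical output structure while leaving each reader's source-specific configuration untouched.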
val streamingDataFrame = spark.readStream
  .schema(staticSchema)
  .format("csv")
  .option("maxFilesPerTrigger", 1)
  .option("header", "true")
  .load("/data/retail-data/by-day/*.csv")

// lazy operation, so we will need to call a streaming action to start the action
val purchaseByCustomerPerHour = streamingDataFrame
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")
  .groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day"))
  .sum("total_cost")

// stream action to write to console
purchaseByCustomerPerHour.writeStream
  .format("console")
  .queryName("customer_purchases")
  .outputMode("complete")
  .start()