Folks, I've been building out a large machine learning repository using Spark as the compute platform, running on YARN and Hadoop.
I was wondering if folks have any best-practice thoughts on unit testing and integration testing this application. I am using
spark-submit and a configuration file to enable a dynamic workflow, so that we can build a different ML repo for each of our models. The
ML repos consist of Parquet files and, eventually, Hive tables.
I want to be able to unit test this application using ScalaTest or some other recommended utility. I also want to integration test
the application in our int environment. Specifically, we have dev, int, and eventually prod environments, each consisting of Spark running on Hadoop using YARN.
The ideal workflow in my mind would be:
1) Unit tests run upon every check-in in our dev environment
2) Application gets propagated to our int environment
3) Integration tests run successfully in our int environment
4) Application gets propagated to our prod environment
5) Hive table/Parquet file gets generated and consumed by Scala notebooks running on top of the Spark cluster
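
To make step 1 of the workflow above concrete, here is a rough sketch of the kind of unit test I have in mind. It assumes Spark 2.x or later (for SparkSession) and ScalaTest 3.1 or later (for AnyFunSuite), and it runs Spark in local mode so no YARN or Hadoop cluster is needed. FeatureBuilder and dropNullLabels are hypothetical names used only for illustration; they stand in for one transformation step of the real job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation standing in for one step of the ML repo build.
object FeatureBuilder {
  // Keeps only the rows whose "label" column is not null.
  def dropNullLabels(df: DataFrame): DataFrame =
    df.filter(col("label").isNotNull)
}

class FeatureBuilderSuite extends AnyFunSuite {

  // Local-mode session: runs inside the test JVM, no cluster required.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("feature-builder-unit-test")
    .getOrCreate()

  test("dropNullLabels removes rows without a label") {
    import spark.implicits._

    // Small in-memory input instead of reading Parquet or Hive.
    val input = Seq[(Int, Option[Double])]((1, Some(0.5)), (2, None))
      .toDF("id", "label")

    val result = FeatureBuilder.dropNullLabels(input)

    assert(result.count() == 1)
    assert(result.select("id").as[Int].collect().toSeq == Seq(1))
  }
}
```

The idea is to keep each transformation as a plain function from DataFrame to DataFrame, so it can be tested against tiny in-memory data on every check-in, and to leave anything that actually touches YARN, HDFS, or Hive to the integration tests in int.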
**Caveat: I wasn't sure whether this was more appropriate for the dev or the user mailing list, but since I only follow dev, I sent it here.