[Events] Events not fired for SaveAsTextFile (?)


Bolke de Bruin
Hi,

Apologies upfront if this should have gone to user@ but it seems a developer question so here goes.

We are trying to improve a listener that tracks lineage across our platform. This requires knowing where data comes from and where it goes, e.g.:

sc.setLogLevel("INFO")
val data = sc.textFile("hdfs://migration/staffingsec/Mydata.gz")
data.saveAsTextFile("hdfs://datalab/user/xxx")

In this case we would like to know that Spark picked up “Mydata.gz” and wrote it to “xxx”. Of course more complex examples are possible.

In the particular case above, Spark (2.3.2) does not seem to trigger any events, or at least none that we know of that carry the relevant information.

Is that a correct assessment? What can we do to get that information without knowing the code upfront? Should we provide a patch?
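For illustration, here is a minimal sketch of the kind of listener we have in mind (the class name and println calls are just placeholders for our lineage service). The job-level events do arrive, but none of them seem to name the files that saveAsTextFile read or wrote:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Registered via sc.addSparkListener(...) or spark.extraListeners.
// Job and stage events fire, but nothing in them exposes the input
// ("Mydata.gz") or output ("xxx") paths of the RDD action.
class LineageListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
  }
}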

Thanks
Bolke

Sent from my iPad


Re: [Events] Events not fired for SaveAsTextFile (?)

Driesprong, Fokko
Hi Bolke,

I would argue that Spark is not the right level of abstraction for doing this. I would create a wrapper around the particular filesystem: http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html
That way you can wrap LocalFileSystem when data is written to local disk, DistributedFileSystem when it is written to HDFS, and many of the object stores implement this interface as well. My 2¢
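Something along these lines (just a sketch; the class name and the println calls stand in for whatever your lineage service expects). Using Hadoop's FilterFileSystem keeps the wrapper small, since everything else is delegated:

import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, FileSystem, FilterFileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.hadoop.util.Progressable

// Delegates everything to the wrapped FileSystem and only records
// which paths are opened for reading and writing.
class LineageFileSystem(fs: FileSystem) extends FilterFileSystem(fs) {

  override def open(f: Path, bufferSize: Int): FSDataInputStream = {
    println(s"[lineage] read  ${f.toUri}")  // placeholder: report to your lineage service
    super.open(f, bufferSize)
  }

  override def create(f: Path, permission: FsPermission, overwrite: Boolean,
                      bufferSize: Int, replication: Short, blockSize: Long,
                      progress: Progressable): FSDataOutputStream = {
    println(s"[lineage] write ${f.toUri}")  // placeholder: report to your lineage service
    super.create(f, permission, overwrite, bufferSize, replication, blockSize, progress)
  }
}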

Cheers, Fokko



Re: [Events] Events not fired for SaveAsTextFile (?)

Bolke de Bruin
Hi Fokko

Spark fires events for many other things. It does so for ML pipelines, and it does make this information available for DataFrames.

We use S3 in this case; I just simplified the example. It is important to know which process took which action. Only Spark knows this, and it does supply this information on other occasions.

So I don't think your comment makes sense?
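For DataFrames, for example, a QueryExecutionListener (registered through spark.sql.queryExecutionListeners or spark.listenerManager.register) already sees the analyzed plan, from which inputs and outputs can be derived. A minimal sketch (the class name is just illustrative):

import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// The analyzed plan names the relations that were read, and write commands
// show up in the plan as well; nothing comparable fires for saveAsTextFile.
class DataFrameLineageListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    println(s"[lineage] $funcName:\n${qe.analyzed.treeString}")
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}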

Cheers
Bolke
