Supporting Row and DataFrame level metadata?

Jeff Evans
Hi,

I'm wondering if there has been any past discussion on supporting metadata attributes as a first-class concept, both at the row level and at the DataFrame level.  I did a Jira search, but most of the items I found were either unrelated to this concept or pertained to column-level metadata, which is of course already supported.

Row-level metadata would be useful in scenarios like the following:
  • Lineage and provenance attributes, which need to eventually be propagated to some other system, but which shouldn't be written out with the "regular" DataFrame.
  • Other custom attributes, such as the input_file_name for data read from HDFS, or message keys from Kafka.
So why not just store the attributes as regular columns (possibly with some special prefix to help us filter them out if needed)?
  • When passing the DataFrame to another piece of library code, we might need to remove those columns, depending on what it does (ex: if it operates on every column).  Or we might need to perform an extra join in order to "retain" the attributes from the rows processed by the library function.
  • We might need to union an existing DataFrame (with metadata) and another one that we read from another source (which has different metadata, or none).  If metadata attributes are represented as normal columns, we have to do some finagling to get the union to work properly.
  • If we want to simply write the DataFrame somewhere, we probably don't want to mix metadata attributes with the actual data.
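To make the prefix workaround concrete, here is a minimal sketch in Python. The `_meta_` prefix and both helper names are hypothetical conventions, not anything Spark provides. The helpers only rely on the frame exposing `.columns` and `.select(*names)`, which a PySpark DataFrame does.

```python
META_PREFIX = "_meta_"  # hypothetical naming convention, not a Spark feature


def split_columns(columns, prefix=META_PREFIX):
    """Partition column names into (data, metadata) lists by prefix."""
    data = [c for c in columns if not c.startswith(prefix)]
    meta = [c for c in columns if c.startswith(prefix)]
    return data, meta


def drop_metadata(df, prefix=META_PREFIX):
    """Return df restricted to its data columns, e.g. before writing it out
    or handing it to library code that operates on every column.

    Assumes df exposes `.columns` and `.select(*names)`.
    """
    data, _ = split_columns(df.columns, prefix)
    return df.select(*data)
```

Every call site that writes the DataFrame or passes it to library code has to remember to call something like `drop_metadata`, and getting the attributes back afterwards still needs the extra join mentioned above.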
For DataFrame-level metadata:
  • Attributes such as the table/schema/DB name, or primary key information, for DataFrames read from JDBC (ex: downstream processing might want to always partitionBy these key columns, whatever they happen to be)
  • Adding tracking information about what app-specific processing steps have been applied so far, their timings, etc.
  • For SQL sources, capturing the full query that produced the DataFrame
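As a rough illustration of what DataFrame-level attributes could look like in user code today, here is a small wrapper sketch. The class `FrameWithMeta`, its methods, and the attribute names are all hypothetical; none of this is a Spark API, and the wrapped object can be any DataFrame.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class FrameWithMeta:
    """Hypothetical wrapper pairing a DataFrame with frame-level attributes."""
    df: Any
    meta: Dict[str, Any] = field(default_factory=dict)

    def with_meta(self, **attrs: Any) -> "FrameWithMeta":
        # Return a new wrapper with the extra attributes merged in.
        return FrameWithMeta(self.df, {**self.meta, **attrs})

    def transform(self, step_name: str, fn: Callable[[Any], Any]) -> "FrameWithMeta":
        # Apply fn to the wrapped frame, carry the attributes forward,
        # and record the name of the processing step that was applied.
        steps = list(self.meta.get("steps", [])) + [step_name]
        return FrameWithMeta(fn(self.df), {**self.meta, "steps": steps})
```

For example, `FrameWithMeta(df).with_meta(source_table="orders", primary_key=["id"])` would let downstream code look up the key columns from `meta`. The weakness is that any code which unwraps `.df` and works on it directly silently loses the attributes.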
Some of these scenarios can be made easier by custom code with implicit conversions, as outlined here.  But that has its own drawbacks and shortcomings (as outlined in the comments).

How are people currently managing this?  Does it make sense, conceptually, as something that Spark should directly support?