Re: Access to live data of cached dataFrame


When you cache a dataframe, you actually cache a logical plan. That's why re-creating the dataframe doesn't work: Spark finds out the logical plan is cached and picks the cached data.

You need to uncache the dataframe, or go back to the SQL way:
spark.table("abc").cache()   // cache the table's logical plan
spark.table("abc").show      // returns cached data."delta").load("/data").show  // re-reads the source: returns latest data.
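To illustrate the point above about caching a logical plan: a second DataFrame built from the same source resolves to the same plan and therefore hits the same cache entry. A minimal sketch, assuming a running Spark session with Delta support and the /data path from this thread:

```scala
import org.apache.spark.sql.functions.col

// Spark keys the cache on the analyzed logical plan, not on the
// DataFrame object, so re-creating the DataFrame does not bypass it.
val df1 ="delta").load("/data")
df1.cache()   // registers df1's logical plan with the cache manager
df1.count()   // materializes the cache

// df2 has an equivalent logical plan, so Spark serves it from the cache:
val df2 ="delta").load("/data")
df2.groupBy(col("event_hour")).count().show()

// Dropping the cache entry is what forces a fresh scan:
df1.unpersist()
```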

On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos <[hidden email]> wrote:
I'm trying to re-read, however I'm getting cached data (which is a bit confusing). For the re-read I'm issuing:"delta").load("/data").groupBy(col("event_hour")).count

The cache seems to be global, influencing newly created dataframes as well.

So the question is: how should I re-read without losing the cached data (without using unpersist)?

As I mentioned, with SQL it's possible - I can create a cached view, so when I access the original table I get live data, and when I access the view I get cached data.
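The cached-view workaround described above could be sketched like this (the view name events_cached is illustrative; CACHE TABLE ... AS SELECT caches a query result under a temporary view, while reads that bypass the view go straight to the source):

```scala
import org.apache.spark.sql.functions.col

// Cache the aggregation under a named temporary view.
spark.sql("""
  CACHE TABLE events_cached AS
  SELECT event_hour, count(*) AS cnt
  FROM delta.`/data`
  GROUP BY event_hour
""")

// Reads through the view hit the cache:
spark.table("events_cached").show()

// Reads that go back to the source see live data:"delta").load("/data")
  .groupBy(col("event_hour")).count()
  .show()
```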


On Fri, 17 May 2019, 8:57 pm Sean Owen, <[hidden email]> wrote:
A cached DataFrame isn't supposed to change, by definition.
You can re-read each time or consider setting up a streaming source on
the table which provides a result that updates as new data comes in.
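The streaming approach suggested above might look roughly like the following with a Delta source. This is only a sketch; the checkpoint path is a made-up placeholder, and it assumes the Delta Lake streaming source is available:

```scala
import org.apache.spark.sql.functions.col

// A streaming read over the same Delta path: the aggregation keeps
// updating as new data arrives, with no cache invalidation needed.
val counts = spark.readStream

// For illustration, emit the full updated result to the console.
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/checkpoint") // hypothetical path
```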

On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos <[hidden email]> wrote:
> Hello,
> I have a cached dataframe:
> I would like to access the "live" data for this data frame without deleting the cache (using unpersist()). Whatever I do, I always get the cached data on subsequent queries. Even adding a new column to the query doesn't help:
>"delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
> I'm able to workaround this using cached sql view, but I couldn't find a pure dataFrame solution.
> Thank you,
> Tomas