May I report a special comparison of executions leading to issues on Spark JIRA?

May I report a special comparison of executions leading to issues on Spark JIRA?

Marc Le Bihan
Hello,

I currently run a Spark project dealing with cities, local authorities,
enterprises, local communities, etc.
Ten Datasets written in Java perform operations ranging from simple joins to
elaborate ones.
The language used is Java. Twenty integration tests over the whole data set
(20 GB) take seven hours.

*Everything works perfectly under Spark 2.4.6 - Scala 2.12 - Java 11 or 8*.
I remember it also worked well on Spark 2.4.5,
but I had many troubles in the past with Spark 2.4.3 (if I remember
correctly, often from the LZ4 algorithms).

I attempted to run my integration tests on Spark 3.0.1. Many of them
failed, with strange messages:
something about a lambda, or about a Map that was no longer taken into
account when in a Java Dataset, object or schema?

I then went back, but to Spark 2.4.7, to give it a try. And Spark 2.4.7
also encounters troubles that 2.4.6 didn't have.

My question:

May I create an issue on JIRA based on a comparison of the executions of
my project with different versions of Spark, reporting the error messages
received and the call stacks, and showing the lines around the one that
encountered a problem when available,
even if I can't provide you a test case for each trouble?
Would this give you hints about what is going wrong?

I could then give a development version a try if needed (when asked for),
to see whether my project returns to stability.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: May I report a special comparison of executions leading to issues on Spark JIRA?

RussS
You are always welcome to create a JIRA issue or issues, but you may find you get a faster response by asking about your problems on the mailing list first.

That may help in identifying whether your issues are already logged, or whether there is a solution that can be applied right away.


On Thu, Oct 1, 2020, 3:27 AM Marc Le Bihan <[hidden email]> wrote:


Re: May I report a special comparison of executions leading to issues on Spark JIRA?

Sean Owen-2
Yes indeed; in fact you seem to be describing Spark 2-to-3 changes that are already documented in the Spark 3 migration guide.

On Thu, Oct 1, 2020 at 7:08 AM Russell Spitzer <[hidden email]> wrote:


Re: May I report a special comparison of executions leading to issues on Spark JIRA?

Marc Le Bihan
A few tests (which work on 2.4.6 and 2.4.7) are failing on 3.0.1.

Some with this message: *java.lang.ClassNotFoundException:
com/fasterxml/jackson/module/scala/ScalaObjectMapper*

Coming from:
        at org.apache.spark.sql.catalyst.util.RebaseDateTime.lastSwitchJulianDay(RebaseDateTime.scala)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseDays(VectorizedColumnReader.java:182)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:336)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:239)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)

or
        at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:130)
        at org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(DateTimeUtils.scala)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)


The other ones with this one:
org.apache.spark.sql.AnalysisException: *Can't extract value from
lambdavariable(MapObject, StringType, true, 376)*: need struct type but got
string;

These errors might be hitting a Dataset having this schema?

/**
 * Return the Dataset's schema.
 * @return Schema.
 */
public StructType schemaEntreprise() {
   StructType schema = new StructType()
      .add("siren", StringType, false)
      .add("statutDiffusionUniteLegale", StringType, true)
      .add("unitePurgeeUniteLegale", StringType, true)
      .add("dateCreationEntreprise", StringType, true)
      .add("sigle", StringType, true)

      .add("sexe", StringType, true)
      .add("prenom1", StringType, true)
      .add("prenom2", StringType, true)
      .add("prenom3", StringType, true)
      .add("prenom4", StringType, true)

      .add("prenomUsuel", StringType, true)
      .add("pseudonyme", StringType, true)
      .add("rna", StringType, true)
      .add("trancheEffectifsUniteLegale", StringType, true)
      .add("anneeEffectifsUniteLegale", StringType, true)

      .add("dateDernierTraitement", StringType, true)
      .add("nombrePeriodesUniteLegale", StringType, true)
      .add("categorieEntreprise", StringType, true)
      .add("anneeCategorieEntreprise", StringType, true)
      .add("dateDebutHistorisation", StringType, true)

      .add("etatAdministratifUniteLegale", StringType, true)
      .add("nomNaissance", StringType, true)
      .add("nomUsage", StringType, true)
      .add("denominationEntreprise", StringType, true)
      .add("denominationUsuelle1", StringType, true)

      .add("denominationUsuelle2", StringType, true)
      .add("denominationUsuelle3", StringType, true)
      .add("categorieJuridique", StringType, true)
      .add("activitePrincipale", StringType, true)
      .add("nomenclatureActivitePrincipale", StringType, true)

      .add("nicSiege", StringType, true)
      .add("economieSocialeSolidaireUniteLegale", StringType, true)
      .add("caractereEmployeurUniteLegale", StringType, true)

      // Fields created by withColumn
      .add("purgee", BooleanType, true)
      .add("anneeValiditeEffectifSalarie", IntegerType, true)
      .add("active", BooleanType, true)
      .add("nombrePeriodes", IntegerType, true)
      .add("anneeCategorie", IntegerType, true)

      .add("economieSocialeSolidaire", BooleanType, true)
      .add("caractereEmployeur", BooleanType, true);

   // Link the enterprises Dataset to the establishments.
   MapType mapEtablissements = new MapType(StringType, this.datasetEtablissement.schemaEtablissement(), true);
   StructField etablissements = new StructField("etablissements", mapEtablissements, true, Metadata.empty());

   // StructType.add returns a new StructType: reassign, or these fields are lost.
   schema = schema.add(etablissements);
   schema = schema.add("libelleCategorieJuridique", StringType, true);
   schema = schema.add("partition", StringType, true);

   return schema;
}
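One detail worth double-checking in schema-building code like this: Spark's StructType.add returns a new StructType rather than mutating the receiver, so its result must always be chained or reassigned. The same pitfall can be sketched without Spark, using a tiny immutable structure (MiniSchema is purely illustrative, not a Spark class):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// An immutable "schema" whose add() returns a NEW instance,
// mimicking the behavior of Spark's StructType.add.
final class MiniSchema {
    private final List<String> fields;

    MiniSchema(List<String> fields) {
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
    }

    /** Returns a new MiniSchema; the receiver is left untouched. */
    MiniSchema add(String field) {
        List<String> next = new ArrayList<>(fields);
        next.add(field);
        return new MiniSchema(next);
    }

    int size() {
        return fields.size();
    }
}

public class ImmutableAddDemo {
    public static void main(String[] args) {
        MiniSchema schema = new MiniSchema(new ArrayList<>());

        schema.add("etablissements");          // result discarded: no effect
        System.out.println(schema.size());     // still 0 fields

        schema = schema.add("etablissements"); // reassigned: field is kept
        System.out.println(schema.size());     // now 1 field
    }
}
```

A field silently missing from the schema this way would not explain the AnalysisException by itself, but it is a cheap thing to rule out before filing an issue.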

Are these worth mentioning in an issue (or worth adding to the description of an
existing issue)?
Do you need me to pursue some analysis, and if so, how?





Re: May I report a special comparison of executions leading to issues on Spark JIRA?

Sean Owen-2
I am not sure what tests you are referring to. Your own? They may indeed have to be changed to work with Spark 3. All Spark tests pass in Spark 3 though. 

No, until you can clarify I do not see something to report in JIRA.

On Fri, Oct 2, 2020, 3:07 PM Marc Le Bihan <[hidden email]> wrote:


Re: May I report a special comparison of executions leading to issues on Spark JIRA?

Marc Le Bihan
Yes. As I explained at the beginning of the message.

For the missing com/fasterxml/jackson/module/scala/ScalaObjectMapper,
I will check myself why spark-core and spark-sql have become unable to load
this dependency.
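As a first check, independent of Spark itself, a small pure-JDK helper can report whether a class is visible on the classpath at all, and from which jar it is loaded when it is. (The helper below is only an illustrative sketch; the class name is the one from the stack trace.)

```java
import java.security.CodeSource;

// Classpath diagnostic (illustrative, not Spark code): check whether a class
// can be loaded in this JVM and, if so, report where it was loaded from.
public class ClasspathCheck {

    /** Returns the location the class was loaded from, or null if absent. */
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return (src != null && src.getLocation() != null)
                    ? src.getLocation().toString()
                    : "(platform/bootstrap class)";
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the ClassNotFoundException above:
        String missing = "com.fasterxml.jackson.module.scala.ScalaObjectMapper";
        String where = locate(missing);
        System.out.println(missing + " -> "
                + (where == null ? "NOT FOUND" : where));
    }
}
```

Run inside the same JVM as the failing tests so the answer reflects the actual runtime classpath; a common cause of this kind of error is a Jackson version on the test classpath that differs from the one Spark 3 was built against.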

But I see nothing in the Spark 2.4.6-to-3.0 migration guide explaining the
appearance of this message:
org.apache.spark.sql.AnalysisException: *Can't extract value from
lambdavariable(MapObject, StringType, true, 376)*: need struct type but got
string;

Can you give me a hint?





Re: May I report a special comparison of executions leading to issues on Spark JIRA?

cloud0fan
It will speed up the process a lot if a simple code snippet to reproduce the error is provided.

On Sat, Oct 3, 2020 at 4:40 AM Marc Le Bihan <[hidden email]> wrote: