Programmatic: parquet file corruption error

Programmatic: parquet file corruption error

Zahid Rahman
Hi,

When I run the code for a user-defined data type Dataset (a Scala case class) in the interactive spark-shell against a parquet file, the results are as expected.
However, when I run the same code programmatically from the IntelliJ IDE, Spark reports a file corruption error.

Steps I have taken to locate the source of the error:
I checked the file permissions and ran chmod 777 on the file, just in case.
I tried a fresh copy of the same parquet file.
I ran both programs before and after taking the fresh copy.
I also rebooted, then ran the program against a fresh parquet file.
The corruption error was consistent in all cases.
Below I have pasted the error message, the spark-shell session, the IDE code, the pom.xml dependencies, and the IntelliJ java classpath command line.

Perhaps the libraries used when running programmatically differ from those used by spark-shell.
I don't believe it is an error on my part.
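
One quick way to test that library hypothesis is to print where a parquet class is loaded from, in both spark-shell and the IDE, and compare the paths (a minimal sketch):

// shows which jar the parquet VersionParser class was loaded from
println(classOf[org.apache.parquet.VersionParser].getProtectionDomain.getCodeSource.getLocation)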
<------------------------------------------------------------------------------------------------------

07:28:45 WARN  CorruptStatistics:117 - Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) using format: (.*?)\s+version\s*(?:([^(]*?)\s*(?:\(\s*build\s*([^)]*?)\s*\))?)?
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:72)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:435)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:454)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:914)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:885)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:105)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
at org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildReaderBase(ParquetPartitionReaderFactory.scala:174)
at org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.createVectorizedReader(ParquetPartitionReaderFactory.scala:205)
at org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildColumnarReader(ParquetPartitionReaderFactory.scala:103)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createColumnarReader$1(FilePartitionReaderFactory.scala:38)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory$$Lambda$2018/0000000000000000.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:109)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:42)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:62)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:321)
at org.apache.spark.sql.execution.SparkPlan$$Lambda$1879/0000000000000000.apply(Unknown Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.RDD$$Lambda$1875/0000000000000000.apply(Unknown Source)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$855/0000000000000000.apply(Unknown Source)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:821)


<--------------------------------------------------------------------------------------------------------------
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>       val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> case class flight (DEST_COUNTRY_NAME: String,
     |                      ORIGIN_COUNTRY_NAME:String,
     |                      count: BigInt)
defined class flight

scala>      val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala>     val flights = flightDf.as[flight]
flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala>   flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)
res0: Array[flight] = Array(flight(United States,Romania,1), flight(United States,Ireland,264), flight(United States,India,69))

< ---------------------------------------------------------------------------------------------------------------------
import org.apache.spark.sql.SparkSession

object chapter2 {

  // define a specific data type (case class), then manipulate it using filter and map
  case class flight(DEST_COUNTRY_NAME: String,
                    ORIGIN_COUNTRY_NAME: String,
                    count: BigInt)

  def main(args: Array[String]): Unit = {

    // unlike the interactive shell, the SparkSession must be created explicitly here
    val spark = SparkSession.builder.master("local[*]").appName("chapter2").getOrCreate()

    import spark.implicits._
    val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
    val flights = flightDf.as[flight]
    flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
      .map(flight_row => flight_row)
      .take(3)
      .foreach(println) // print the rows; the REPL did this automatically via res0
    spark.stop()
  }
}
<---------------------------------------------------------------------------------------------------------------------------------
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.0.0-preview2</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.0.0-preview2</version>
</dependency>
<!--------------------- IntelliJ command line ------------------------------------->
/home/kub19/jdk8u242-b08/bin/java -javaagent:/snap/intellij-idea-community/216/lib/idea_rt.jar=42207:/snap/intellij-idea-community/216/bin -Dfile.encoding=UTF-8 -classpath /home/kub19/jdk8u242-b08/jre/lib/charsets.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/cldrdata.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/dnsns.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/dtfj.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/dtfjview.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/jaccess.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/localedata.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/nashorn.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/sunec.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/sunjce_provider.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/sunpkcs11.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/traceformat.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/zipfs.jar:/home/kub19/jdk8u242-b08/jre/lib/jce.jar:/home/kub19/jdk8u242-b08/jre/lib/jsse.jar:/home/kub19/jdk8u242-b08/jre/lib/management-agent.jar:/home/kub19/jdk8u242-b08/jre/lib/resources.jar:/home/kub19/jdk8u242-b08/jre/lib/rt.jar:/home/kub19/spark-2.4.5-bin-hadoop2.7/projects/SparkDefinitiveGuide/target/classes:/home/kub19/.m2/repository/org/apache/spark/spark-core_2.12/3.0.0-preview2/spark-core_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar:/home/kub19/.m2/repository/org/apache/avro/avro/1.8.2/avro-1.8.2.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/home/kub19/.m2/repository/org/apache/commons/commons-compress/1.8.1/commons-compress-1.8.1.jar:/home/kub19/.m2/repository/org/tukaani/xz/1.5/xz-1.5.jar:/home/kub19/.m2/repository/org/apache/avro/avro-mapred/1.8.2/avro-mapred-1.8.2-hadoop2.jar:/home/kub19/.m2/repository/org/apache/avro/avro-ipc/1.8.2/avro-ipc-1.8.2.jar:/home/kub19/.m2/repository/commons-codec/commons-codec/1.9/commons-codec-1.9.jar:/home/kub19/.m2/repository/com/twitter/chill_2.12/0.9.3/chill_2.12-0.9.3.jar:/home/kub19/.m2/repository/com/esotericsoftware/kryo-shaded/4.0.2/kryo-shaded-4.0.2.jar:/home/kub19/.m2/repository/com/esotericsoftware/minlog/1.3.0/minlog-1.3.0.jar:/home/kub19/.m2/repository/org/objenesis/objenesis/2.5.1/objenesis-2.5.1.jar:/home/kub19/.m2/repository/com/twitter/chill-java/0.9.3/chill-java-0.9.3.jar:/home/kub19/.m2/repository/org/apache/xbean/xbean-asm7-shaded/4.15/xbean-asm7-shaded-4.15.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-client/2.7.4/hadoop-client-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-common/2.7.4/hadoop-common-2.7.4.jar:/home/kub19/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/kub19/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/kub19/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/home/kub19/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/home/kub19/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.jar:/home/kub19/.m2/repository/org/mortbay/jetty/jetty-sslengine/6.1.26/jetty-sslengine-6.1.26.jar:/home/kub19/.m2/repository/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar:/home/kub19/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/kub19/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/kub19/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/kub19/.m2/reposito
ry/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-auth/2.7.4/hadoop-auth-2.7.4.jar:/home/kub19/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar:/home/kub19/.m2/repository/org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4.jar:/home/kub19/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar:/home/kub19/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar:/home/kub19/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar:/home/kub19/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar:/home/kub19/.m2/repository/org/apache/curator/curator-client/2.7.1/curator-client-2.7.1.jar:/home/kub19/.m2/repository/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.7.4/hadoop-hdfs-2.7.4.jar:/home/kub19/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/home/kub19/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar:/home/kub19/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.7.4/hadoop-mapreduce-client-app-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.7.4/hadoop-mapreduce-client-common-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.7.4/hadoop-yarn-client-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.7.4/hadoop-yarn-server-common-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.7.4/hadoop-mapreduce-client-shuffle-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.7.4/hadoop-yarn-api-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.7.4/hadoop-mapreduce-client-core-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.7.4/hadoop-yarn-common-2.7.4.jar:/home/kub19/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/home/kub19/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.7.4/hadoop-mapreduce-client-jobclient-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-annotations/2.7.4/hadoop-annotations-2.7.4.jar:/home/kub19/.m2/repository/org/apache/spark/spark-launcher_2.12/3.0.0-preview2/spark-launcher_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-kvstore_2.12/3.0.0-preview2/spark-kvstore_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.10.0/jackson-core-2.10.0.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.10.0/jackson-annotations-2.10.0.jar:/home/kub19/.m2/repository/org/apache/spark/spark-network-common_2.12/3.0.0-preview2/spark-network-common_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-network-shuffle_2.12/3.0.0-preview2/spark-network-shuffle_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/sp
ark-unsafe_2.12/3.0.0-preview2/spark-unsafe_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/javax/activation/activation/1.1.1/activation-1.1.1.jar:/home/kub19/.m2/repository/org/apache/curator/curator-recipes/2.7.1/curator-recipes-2.7.1.jar:/home/kub19/.m2/repository/org/apache/curator/curator-framework/2.7.1/curator-framework-2.7.1.jar:/home/kub19/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar:/home/kub19/.m2/repository/org/apache/zookeeper/zookeeper/3.4.14/zookeeper-3.4.14.jar:/home/kub19/.m2/repository/org/apache/yetus/audience-annotations/0.5.0/audience-annotations-0.5.0.jar:/home/kub19/.m2/repository/javax/servlet/javax.servlet-api/3.1.0/javax.servlet-api-3.1.0.jar:/home/kub19/.m2/repository/org/apache/commons/commons-lang3/3.9/commons-lang3-3.9.jar:/home/kub19/.m2/repository/org/apache/commons/commons-math3/3.4.1/commons-math3-3.4.1.jar:/home/kub19/.m2/repository/org/apache/commons/commons-text/1.6/commons-text-1.6.jar:/home/kub19/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.jar:/home/kub19/.m2/repository/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.jar:/home/kub19/.m2/repository/org/slf4j/jul-to-slf4j/1.7.16/jul-to-slf4j-1.7.16.jar:/home/kub19/.m2/repository/org/slf4j/jcl-over-slf4j/1.7.16/jcl-over-slf4j-1.7.16.jar:/home/kub19/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/home/kub19/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar:/home/kub19/.m2/repository/com/ning/compress-lzf/1.0.3/compress-lzf-1.0.3.jar:/home/kub19/.m2/repository/org/xerial/snappy/snappy-java/1.1.7.3/snappy-java-1.1.7.3.jar:/home/kub19/.m2/repository/org/lz4/lz4-java/1.7.0/lz4-java-1.7.0.jar:/home/kub19/.m2/repository/com/github/luben/zstd-jni/1.4.4-3/zstd-jni-1.4.4-3.jar:/home/kub19/.m2/repository/org/roaringbitmap/RoaringBitmap/0.7.45/RoaringBitmap-0.7.45.jar:/home/kub19/.m2/repository/org/roaringbitmap/shims/0.7.45/shims-0.7.45.jar:/home/kub19/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/home/kub19/.m2/repository/org/scala-lang/modules/scala-xml_2.12/1.2.0/scala-xml_2.12-1.2.0.jar:/home/kub19/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar:/home/kub19/.m2/repository/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jar:/home/kub19/.m2/repository/org/json4s/json4s-jackson_2.12/3.6.6/json4s-jackson_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/json4s/json4s-core_2.12/3.6.6/json4s-core_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/json4s/json4s-ast_2.12/3.6.6/json4s-ast_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/json4s/json4s-scalap_2.12/3.6.6/json4s-scalap_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/glassfish/jersey/core/jersey-client/2.29.1/jersey-client-2.29.1.jar:/home/kub19/.m2/repository/jakarta/ws/rs/jakarta.ws.rs-api/2.1.6/jakarta.ws.rs-api-2.1.6.jar:/home/kub19/.m2/repository/org/glassfish/hk2/external/jakarta.inject/2.6.1/jakarta.inject-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/core/jersey-common/2.29.1/jersey-common-2.29.1.jar:/home/kub19/.m2/repository/jakarta/annotation/jakarta.annotation-api/1.3.5/jakarta.annotation-api-1.3.5.jar:/home/kub19/.m2/repository/org/glassfish/hk2/osgi-resource-locator/1.0.3/osgi-resource-locator-1.0.3.jar:/home/kub19/.m2/repository/org/glassfish/jersey/core/jersey-server/2.29.1/jersey-server-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/media/jersey-media-jaxb/2.29.1/jersey-media-jaxb-2.29.1.jar:/home/kub19/.m2/repository/jakarta/validation/jakarta.validation-api/2.0.2/jakarta.validation-api-2.0.2.jar:/home/kub19/.m2
/repository/org/glassfish/jersey/containers/jersey-container-servlet/2.29.1/jersey-container-servlet-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/containers/jersey-container-servlet-core/2.29.1/jersey-container-servlet-core-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/inject/jersey-hk2/2.29.1/jersey-hk2-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/hk2-locator/2.6.1/hk2-locator-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/external/aopalliance-repackaged/2.6.1/aopalliance-repackaged-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/hk2-api/2.6.1/hk2-api-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/hk2-utils/2.6.1/hk2-utils-2.6.1.jar:/home/kub19/.m2/repository/org/javassist/javassist/3.22.0-CR2/javassist-3.22.0-CR2.jar:/home/kub19/.m2/repository/io/netty/netty-all/4.1.42.Final/netty-all-4.1.42.Final.jar:/home/kub19/.m2/repository/com/clearspring/analytics/stream/2.9.6/stream-2.9.6.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-core/4.1.1/metrics-core-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-jvm/4.1.1/metrics-jvm-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-json/4.1.1/metrics-json-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-graphite/4.1.1/metrics-graphite-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-jmx/4.1.1/metrics-jmx-4.1.1.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.10.0/jackson-databind-2.10.0.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/module/jackson-module-scala_2.12/2.10.0/jackson-module-scala_2.12-2.10.0.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/module/jackson-module-paranamer/2.10.0/jackson-module-paranamer-2.10.0.jar:/home/kub19/.m2/repository/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar:/home/kub19/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar:/home/kub19/.m2/repository/net/razorvine/pyrolite/4.30/pyrolite-4.30.jar:/home/kub19/.m2/repository/net/sf/py4j/py4j/0.10.8.1/py4j-0.10.8.1.jar:/home/kub19/.m2/repository/org/apache/spark/spark-tags_2.12/3.0.0-preview2/spark-tags_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/commons/commons-crypto/1.0.0/commons-crypto-1.0.0.jar:/home/kub19/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:/home/kub19/.m2/repository/org/apache/spark/spark-sql_2.12/3.0.0-preview2/spark-sql_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/com/univocity/univocity-parsers/2.8.3/univocity-parsers-2.8.3.jar:/home/kub19/.m2/repository/org/apache/spark/spark-sketch_2.12/3.0.0-preview2/spark-sketch_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-catalyst_2.12/3.0.0-preview2/spark-catalyst_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/scala-lang/modules/scala-parser-combinators_2.12/1.1.2/scala-parser-combinators_2.12-1.1.2.jar:/home/kub19/.m2/repository/org/codehaus/janino/janino/3.0.15/janino-3.0.15.jar:/home/kub19/.m2/repository/org/codehaus/janino/commons-compiler/3.0.15/commons-compiler-3.0.15.jar:/home/kub19/.m2/repository/org/antlr/antlr4-runtime/4.7.1/antlr4-runtime-4.7.1.jar:/home/kub19/.m2/repository/org/apache/arrow/arrow-vector/0.15.1/arrow-vector-0.15.1.jar:/home/kub19/.m2/repository/org/apache/arrow/arrow-format/0.15.1/arrow-format-0.15.1.jar:/home/kub19/.m2/repository/org/apache/arrow/arrow-memory/0.15.1/arrow-memory-0.15.1.jar:/home/kub19/.m2/repository/com/google/flatbuffers/flatbuffers-java/1.9.0/flatbuffers-java-1.9.0.jar:/home/kub19/.m2/repository/org/apache/orc/orc-cor
e/1.5.8/orc-core-1.5.8.jar:/home/kub19/.m2/repository/org/apache/orc/orc-shims/1.5.8/orc-shims-1.5.8.jar:/home/kub19/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/home/kub19/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/home/kub19/.m2/repository/io/airlift/aircompressor/0.10/aircompressor-0.10.jar:/home/kub19/.m2/repository/org/apache/orc/orc-mapreduce/1.5.8/orc-mapreduce-1.5.8.jar:/home/kub19/.m2/repository/org/apache/hive/hive-storage-api/2.6.0/hive-storage-api-2.6.0.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-column/1.10.1/parquet-column-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-common/1.10.1/parquet-common-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-encoding/1.10.1/parquet-encoding-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-hadoop/1.10.1/parquet-hadoop-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-format/2.4.0/parquet-format-2.4.0.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-jackson/1.10.1/parquet-jackson-1.10.1.jar:/home/kub19/.m2/repository/org/scala-lang/scala-library/2.12.8/scala-library-2.12.8.jar:/home/kub19/.m2/repository/org/scala-lang/scala-reflect/2.12.8/scala-reflect-2.12.8.jar chapter2

¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}

Re: Programmatic: parquet file corruption error

cloud0fan
Running a Spark application from an IDE is not officially supported. It may work in some cases, but there is no guarantee at all. The official way is to run interactive queries with spark-shell, or to package your application into a jar and use spark-submit.
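
For reference, the packaged route with the pom.xml shown earlier in the thread looks roughly like this (a sketch; the jar name is illustrative and depends on the project's artifactId and version):

mvn package
# jar name below is illustrative, taken from the project directory in the classpath dump
$SPARK_HOME/bin/spark-submit --class chapter2 --master "local[*]" target/SparkDefinitiveGuide-1.0-SNAPSHOT.jar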


Re: Programmatic: parquet file corruption error

Zahid Rahman
Thanks Wenchen. SOLVED! KINDA!

I removed all the dependencies from the pom.xml in my IDE so I wouldn't be picking up any libraries from the Maven repository.
Instead I included the jars from the spark-3.0.0-preview2-bin-hadoop2.7 download.
This way I am using the same libraries that are used when running the spark-submit scripts.

I believe I managed to trace the issue.
I copied log4j.properties.template into IntelliJ's resources directory in my project, renaming it to log4j.properties.
So now I am also using the same log4j.properties as when running the spark-submit script.

I noticed the settings log4j.logger.org.apache.parquet=ERROR and log4j.logger.parquet=ERROR.
It appears that this parquet corruption warning is an outstanding bug, and the workaround is to quieten the warning.
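
The same thing can be done programmatically with the log4j 1.x API that this Spark build ships with (a minimal sketch; the logger names match the properties file below):

import org.apache.log4j.{Level, Logger}

// quieten the parquet-mr CorruptStatistics warning (PARQUET-251)
Logger.getLogger("org.apache.parquet").setLevel(Level.ERROR)
Logger.getLogger("parquet").setLevel(Level.ERROR)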
 
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR



¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}


On Fri, 27 Mar 2020 at 07:44, Wenchen Fan <[hidden email]> wrote:
Running Spark application with an IDE is not officially supported. It may work under some cases but there is no guarantee at all. The official way is to run interactive queries with spark-shell or package your application to a jar and use spark-submit.

On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman <[hidden email]> wrote:
Hi,

When I run the code for a user defined data type dataset using case class in scala  and run the code in the interactive spark-shell against parquet file. The results are as expected.
However I then the same code programmatically in IntelliJ IDE then spark is give a file corruption error.

Steps I have taken to determine the source of error are :
I have tested for file permission and made sure to chmod 777 , just in case.
I tried a fresh copy of same parquet file.
I ran both programme before and after the fresh copy.
I also rebooted then ran programmatically against a fresh parquet file.
The corruption error was consistent in all cases.
I have copy and pasted the spark-shell , the error message and the code in the IDE and the pom.xml, IntelliJ java  classpath command line.

Perhaps the code in the libraries are different than the ones  used by spark-shell from that when run programmatically.
I don't believe it is an error on my part.
<------------------------------------------------------------------------------------------------------

07:28:45 WARN  CorruptStatistics:117 - Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) using format: (.*?)\s+version\s*(?:([^(]*?)\s*(?:\(\s*build\s*([^)]*?)\s*\))?)?
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:72)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:435)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:454)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:914)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:885)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:105)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
at org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildReaderBase(ParquetPartitionReaderFactory.scala:174)
at org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.createVectorizedReader(ParquetPartitionReaderFactory.scala:205)
at org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildColumnarReader(ParquetPartitionReaderFactory.scala:103)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createColumnarReader$1(FilePartitionReaderFactory.scala:38)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory$$Lambda$2018/0000000000000000.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:109)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:42)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:62)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:321)
at org.apache.spark.sql.execution.SparkPlan$$Lambda$1879/0000000000000000.apply(Unknown Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.RDD$$Lambda$1875/0000000000000000.apply(Unknown Source)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$855/0000000000000000.apply(Unknown Source)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:821)
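
For what it is worth, the trace above is attached to a WARN from org.apache.parquet.CorruptStatistics ("Ignoring statistics because created_by could not be parsed (see PARQUET-251)"), so the Parquet reader skips the unparsable created_by statistics rather than aborting on them. If the noise itself is the only issue, it can be raised away at runtime. A minimal sketch, assuming the log4j 1.2 backend that appears on the classpath further down, placed before the first read:

import org.apache.log4j.{Level, Logger}

// silence the PARQUET-251 created_by warning by raising the threshold
// of the org.apache.parquet loggers (log4j 1.x API)
Logger.getLogger("org.apache.parquet").setLevel(Level.ERROR)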


<---------------------spark-shell session --------------------------------------------------------------->
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>       val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> case class flight (DEST_COUNTRY_NAME: String,
     |                      ORIGIN_COUNTRY_NAME:String,
     |                      count: BigInt)
defined class flight

scala>      val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala>     val flights = flightDf.as[flight]
flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala>   flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)
res0: Array[flight] = Array(flight(United States,Romania,1), flight(United States,Ireland,264), flight(United States,India,69))

<---------------------code in the IntelliJ IDE ---------------------------------------------------------->
import org.apache.spark.sql.SparkSession

object chapter2 {

  // define a specific data type with a case class, then manipulate it
  // using the filter and map functions
  case class flight(DEST_COUNTRY_NAME: String,
                    ORIGIN_COUNTRY_NAME: String,
                    count: BigInt)

  def main(args: Array[String]): Unit = {

    // unlike the interactive shell, the SparkSession has to be built
    // explicitly here, otherwise IntelliJ reports errors
    val spark = SparkSession.builder.master("local[*]").appName("chapter2").getOrCreate

    import spark.implicits._
    val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
    val flights = flightDf.as[flight]
    flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)
    spark.stop()
  }
}
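
Note that when run from the IDE the result of take(3) is discarded, so nothing is printed even where the read succeeds. To compare against the shell output it would help to print the rows, and also the Spark version the programme actually picked up. A minimal sketch of the extra lines (standard Spark API; these would go inside main, before spark.stop()):

    // print the Spark version this run actually uses, for comparison
    // with the spark-shell banner above
    println(s"Spark version: ${spark.version}")

    // materialise the first three rows and print them instead of
    // discarding the result of take(3)
    flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
      .take(3)
      .foreach(println)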
<---------------------pom.xml dependencies -------------------------------------------------------------->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.0-preview2</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.0.0-preview2</version>
</dependency>
<!---------------------Intellij command line ------------------------------------->
/home/kub19/jdk8u242-b08/bin/java -javaagent:/snap/intellij-idea-community/216/lib/idea_rt.jar=42207:/snap/intellij-idea-community/216/bin -Dfile.encoding=UTF-8 -classpath /home/kub19/jdk8u242-b08/jre/lib/charsets.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/cldrdata.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/dnsns.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/dtfj.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/dtfjview.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/jaccess.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/localedata.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/nashorn.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/sunec.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/sunjce_provider.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/sunpkcs11.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/traceformat.jar:/home/kub19/jdk8u242-b08/jre/lib/ext/zipfs.jar:/home/kub19/jdk8u242-b08/jre/lib/jce.jar:/home/kub19/jdk8u242-b08/jre/lib/jsse.jar:/home/kub19/jdk8u242-b08/jre/lib/management-agent.jar:/home/kub19/jdk8u242-b08/jre/lib/resources.jar:/home/kub19/jdk8u242-b08/jre/lib/rt.jar:/home/kub19/spark-2.4.5-bin-hadoop2.7/projects/SparkDefinitiveGuide/target/classes:/home/kub19/.m2/repository/org/apache/spark/spark-core_2.12/3.0.0-preview2/spark-core_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/com/thoughtworks/paranamer/paranamer/2.8/paranamer-2.8.jar:/home/kub19/.m2/repository/org/apache/avro/avro/1.8.2/avro-1.8.2.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/home/kub19/.m2/repository/org/apache/commons/commons-compress/1.8.1/commons-compress-1.8.1.jar:/home/kub19/.m2/repository/org/tukaani/xz/1.5/xz-1.5.jar:/home/kub19/.m2/repository/org/apache/avro/avro-mapred/1.8.2/avro-mapred-1.8.2-hadoop2.jar:/home/kub19/.m2/repository/org/apache/avro/avro-ipc/1.8.2/avro-ipc-1.8.2.jar:/home/kub19/.m2/repository/commons-codec/commons-codec/1.9/commons-codec-1.9.jar:/home/kub19/.m2/repository/com/twitter/chill_2.12/0.9.3/chill_2.12-0.9.3.jar:/home/kub19/.m2/repository/com/esotericsoftware/kryo-shaded/4.0.2/kryo-shaded-4.0.2.jar:/home/kub19/.m2/repository/com/esotericsoftware/minlog/1.3.0/minlog-1.3.0.jar:/home/kub19/.m2/repository/org/objenesis/objenesis/2.5.1/objenesis-2.5.1.jar:/home/kub19/.m2/repository/com/twitter/chill-java/0.9.3/chill-java-0.9.3.jar:/home/kub19/.m2/repository/org/apache/xbean/xbean-asm7-shaded/4.15/xbean-asm7-shaded-4.15.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-client/2.7.4/hadoop-client-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-common/2.7.4/hadoop-common-2.7.4.jar:/home/kub19/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/kub19/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/kub19/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/home/kub19/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/home/kub19/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.jar:/home/kub19/.m2/repository/org/mortbay/jetty/jetty-sslengine/6.1.26/jetty-sslengine-6.1.26.jar:/home/kub19/.m2/repository/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar:/home/kub19/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/kub19/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/kub19/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/kub19/.m2/repository/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-auth/2.7.4/hadoop-auth-2.7.4.jar:/home/kub19/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar:/home/kub19/.m2/repository/org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4.jar:/home/kub19/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar:/home/kub19/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar:/home/kub19/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar:/home/kub19/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar:/home/kub19/.m2/repository/org/apache/curator/curator-client/2.7.1/curator-client-2.7.1.jar:/home/kub19/.m2/repository/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.7.4/hadoop-hdfs-2.7.4.jar:/home/kub19/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/home/kub19/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar:/home/kub19/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.7.4/hadoop-mapreduce-client-app-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.7.4/hadoop-mapreduce-client-common-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.7.4/hadoop-yarn-client-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.7.4/hadoop-yarn-server-common-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.7.4/hadoop-mapreduce-client-shuffle-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.7.4/hadoop-yarn-api-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.7.4/hadoop-mapreduce-client-core-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.7.4/hadoop-yarn-common-2.7.4.jar:/home/kub19/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/home/kub19/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar:/home/kub19/.m2/repository/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.7.4/hadoop-mapreduce-client-jobclient-2.7.4.jar:/home/kub19/.m2/repository/org/apache/hadoop/hadoop-annotations/2.7.4/hadoop-annotations-2.7.4.jar:/home/kub19/.m2/repository/org/apache/spark/spark-launcher_2.12/3.0.0-preview2/spark-launcher_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-kvstore_2.12/3.0.0-preview2/spark-kvstore_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.10.0/jackson-core-2.10.0.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.10.0/jackson-annotations-2.10.0.jar:/home/kub19/.m2/repository/org/apache/spark/spark-network-common_2.12/3.0.0-preview2/spark-network-common_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-network-shuffle_2.12/3.0.0-preview2/spark-network-shuffle_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-unsafe_2.12/3.0.0-preview2/spark-unsafe_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/javax/activation/activation/1.1.1/activation-1.1.1.jar:/home/kub19/.m2/repository/org/apache/curator/curator-recipes/2.7.1/curator-recipes-2.7.1.jar:/home/kub19/.m2/repository/org/apache/curator/curator-framework/2.7.1/curator-framework-2.7.1.jar:/home/kub19/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar:/home/kub19/.m2/repository/org/apache/zookeeper/zookeeper/3.4.14/zookeeper-3.4.14.jar:/home/kub19/.m2/repository/org/apache/yetus/audience-annotations/0.5.0/audience-annotations-0.5.0.jar:/home/kub19/.m2/repository/javax/servlet/javax.servlet-api/3.1.0/javax.servlet-api-3.1.0.jar:/home/kub19/.m2/repository/org/apache/commons/commons-lang3/3.9/commons-lang3-3.9.jar:/home/kub19/.m2/repository/org/apache/commons/commons-math3/3.4.1/commons-math3-3.4.1.jar:/home/kub19/.m2/repository/org/apache/commons/commons-text/1.6/commons-text-1.6.jar:/home/kub19/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.jar:/home/kub19/.m2/repository/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.jar:/home/kub19/.m2/repository/org/slf4j/jul-to-slf4j/1.7.16/jul-to-slf4j-1.7.16.jar:/home/kub19/.m2/repository/org/slf4j/jcl-over-slf4j/1.7.16/jcl-over-slf4j-1.7.16.jar:/home/kub19/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/home/kub19/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar:/home/kub19/.m2/repository/com/ning/compress-lzf/1.0.3/compress-lzf-1.0.3.jar:/home/kub19/.m2/repository/org/xerial/snappy/snappy-java/1.1.7.3/snappy-java-1.1.7.3.jar:/home/kub19/.m2/repository/org/lz4/lz4-java/1.7.0/lz4-java-1.7.0.jar:/home/kub19/.m2/repository/com/github/luben/zstd-jni/1.4.4-3/zstd-jni-1.4.4-3.jar:/home/kub19/.m2/repository/org/roaringbitmap/RoaringBitmap/0.7.45/RoaringBitmap-0.7.45.jar:/home/kub19/.m2/repository/org/roaringbitmap/shims/0.7.45/shims-0.7.45.jar:/home/kub19/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/home/kub19/.m2/repository/org/scala-lang/modules/scala-xml_2.12/1.2.0/scala-xml_2.12-1.2.0.jar:/home/kub19/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar:/home/kub19/.m2/repository/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jar:/home/kub19/.m2/repository/org/json4s/json4s-jackson_2.12/3.6.6/json4s-jackson_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/json4s/json4s-core_2.12/3.6.6/json4s-core_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/json4s/json4s-ast_2.12/3.6.6/json4s-ast_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/json4s/json4s-scalap_2.12/3.6.6/json4s-scalap_2.12-3.6.6.jar:/home/kub19/.m2/repository/org/glassfish/jersey/core/jersey-client/2.29.1/jersey-client-2.29.1.jar:/home/kub19/.m2/repository/jakarta/ws/rs/jakarta.ws.rs-api/2.1.6/jakarta.ws.rs-api-2.1.6.jar:/home/kub19/.m2/repository/org/glassfish/hk2/external/jakarta.inject/2.6.1/jakarta.inject-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/core/jersey-common/2.29.1/jersey-common-2.29.1.jar:/home/kub19/.m2/repository/jakarta/annotation/jakarta.annotation-api/1.3.5/jakarta.annotation-api-1.3.5.jar:/home/kub19/.m2/repository/org/glassfish/hk2/osgi-resource-locator/1.0.3/osgi-resource-locator-1.0.3.jar:/home/kub19/.m2/repository/org/glassfish/jersey/core/jersey-server/2.29.1/jersey-server-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/media/jersey-media-jaxb/2.29.1/jersey-media-jaxb-2.29.1.jar:/home/kub19/.m2/repository/jakarta/validation/jakarta.validation-api/2.0.2/jakarta.validation-api-2.0.2.jar:/home/kub19/.m2/repository/org/glassfish/jersey/containers/jersey-container-servlet/2.29.1/jersey-container-servlet-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/containers/jersey-container-servlet-core/2.29.1/jersey-container-servlet-core-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/jersey/inject/jersey-hk2/2.29.1/jersey-hk2-2.29.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/hk2-locator/2.6.1/hk2-locator-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/external/aopalliance-repackaged/2.6.1/aopalliance-repackaged-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/hk2-api/2.6.1/hk2-api-2.6.1.jar:/home/kub19/.m2/repository/org/glassfish/hk2/hk2-utils/2.6.1/hk2-utils-2.6.1.jar:/home/kub19/.m2/repository/org/javassist/javassist/3.22.0-CR2/javassist-3.22.0-CR2.jar:/home/kub19/.m2/repository/io/netty/netty-all/4.1.42.Final/netty-all-4.1.42.Final.jar:/home/kub19/.m2/repository/com/clearspring/analytics/stream/2.9.6/stream-2.9.6.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-core/4.1.1/metrics-core-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-jvm/4.1.1/metrics-jvm-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-json/4.1.1/metrics-json-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-graphite/4.1.1/metrics-graphite-4.1.1.jar:/home/kub19/.m2/repository/io/dropwizard/metrics/metrics-jmx/4.1.1/metrics-jmx-4.1.1.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.10.0/jackson-databind-2.10.0.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/module/jackson-module-scala_2.12/2.10.0/jackson-module-scala_2.12-2.10.0.jar:/home/kub19/.m2/repository/com/fasterxml/jackson/module/jackson-module-paranamer/2.10.0/jackson-module-paranamer-2.10.0.jar:/home/kub19/.m2/repository/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar:/home/kub19/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar:/home/kub19/.m2/repository/net/razorvine/pyrolite/4.30/pyrolite-4.30.jar:/home/kub19/.m2/repository/net/sf/py4j/py4j/0.10.8.1/py4j-0.10.8.1.jar:/home/kub19/.m2/repository/org/apache/spark/spark-tags_2.12/3.0.0-preview2/spark-tags_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/commons/commons-crypto/1.0.0/commons-crypto-1.0.0.jar:/home/kub19/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:/home/kub19/.m2/repository/org/apache/spark/spark-sql_2.12/3.0.0-preview2/spark-sql_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/com/univocity/univocity-parsers/2.8.3/univocity-parsers-2.8.3.jar:/home/kub19/.m2/repository/org/apache/spark/spark-sketch_2.12/3.0.0-preview2/spark-sketch_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/apache/spark/spark-catalyst_2.12/3.0.0-preview2/spark-catalyst_2.12-3.0.0-preview2.jar:/home/kub19/.m2/repository/org/scala-lang/modules/scala-parser-combinators_2.12/1.1.2/scala-parser-combinators_2.12-1.1.2.jar:/home/kub19/.m2/repository/org/codehaus/janino/janino/3.0.15/janino-3.0.15.jar:/home/kub19/.m2/repository/org/codehaus/janino/commons-compiler/3.0.15/commons-compiler-3.0.15.jar:/home/kub19/.m2/repository/org/antlr/antlr4-runtime/4.7.1/antlr4-runtime-4.7.1.jar:/home/kub19/.m2/repository/org/apache/arrow/arrow-vector/0.15.1/arrow-vector-0.15.1.jar:/home/kub19/.m2/repository/org/apache/arrow/arrow-format/0.15.1/arrow-format-0.15.1.jar:/home/kub19/.m2/repository/org/apache/arrow/arrow-memory/0.15.1/arrow-memory-0.15.1.jar:/home/kub19/.m2/repository/com/google/flatbuffers/flatbuffers-java/1.9.0/flatbuffers-java-1.9.0.jar:/home/kub19/.m2/repository/org/apache/orc/orc-core/1.5.8/orc-core-1.5.8.jar:/home/kub19/.m2/repository/org/apache/orc/orc-shims/1.5.8/orc-shims-1.5.8.jar:/home/kub19/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/home/kub19/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/home/kub19/.m2/repository/io/airlift/aircompressor/0.10/aircompressor-0.10.jar:/home/kub19/.m2/repository/org/apache/orc/orc-mapreduce/1.5.8/orc-mapreduce-1.5.8.jar:/home/kub19/.m2/repository/org/apache/hive/hive-storage-api/2.6.0/hive-storage-api-2.6.0.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-column/1.10.1/parquet-column-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-common/1.10.1/parquet-common-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-encoding/1.10.1/parquet-encoding-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-hadoop/1.10.1/parquet-hadoop-1.10.1.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-format/2.4.0/parquet-format-2.4.0.jar:/home/kub19/.m2/repository/org/apache/parquet/parquet-jackson/1.10.1/parquet-jackson-1.10.1.jar:/home/kub19/.m2/repository/org/scala-lang/scala-library/2.12.8/scala-library-2.12.8.jar:/home/kub19/.m2/repository/org/scala-lang/scala-reflect/2.12.8/scala-reflect-2.12.8.jar chapter2
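
One detail in the classpath above: it pulls in scala-library-2.12.10.jar via the Spark artifacts, yet it also ends with scala-library-2.12.8.jar and scala-reflect-2.12.8.jar, so the IDE run mixes two Scala library versions. That is exactly the kind of library difference from spark-shell suspected above. To check which jar a class is actually served from at runtime, here is a minimal sketch using the standard JVM reflection API (org.apache.parquet.VersionParser is the class named in the warning):

// print the jar that provided the Parquet version parser; running this
// both in spark-shell and in the IDE shows whether the two environments
// load the same parquet jar (getCodeSource can be null for bootstrap
// classes, hence the Option wrapper)
println(Option(classOf[org.apache.parquet.VersionParser]
  .getProtectionDomain.getCodeSource).map(_.getLocation))
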
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}