results of take(3) not appearing in console window


results of take(3) not appearing in console window

Zahid Rahman
I am running the same code with the same libraries, but I am not getting the same output. In the shell the rows appear; in the standalone program they don't.
scala>  case class flight (DEST_COUNTRY_NAME: String,
     |                      ORIGIN_COUNTRY_NAME:String,
     |                      count: BigInt)
defined class flight

scala>     val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> val flights = flightDf.as[flight]
flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)
res0: Array[flight] = Array(flight(United States,Romania,1), flight(United States,Ireland,264), flight(United States,India,69))

---------------------------------- console output from the standalone (IntelliJ) run ----------------------------------
20/03/26 19:09:00 INFO SparkContext: Running Spark version 3.0.0-preview2
20/03/26 19:09:00 INFO ResourceUtils: ==============================================================
20/03/26 19:09:00 INFO ResourceUtils: Resources for spark.driver:

20/03/26 19:09:00 INFO ResourceUtils: ==============================================================
20/03/26 19:09:00 INFO SparkContext: Submitted application:  chapter2
20/03/26 19:09:00 INFO SecurityManager: Changing view acls to: kub19
20/03/26 19:09:00 INFO SecurityManager: Changing modify acls to: kub19
20/03/26 19:09:00 INFO SecurityManager: Changing view acls groups to:
20/03/26 19:09:00 INFO SecurityManager: Changing modify acls groups to:
20/03/26 19:09:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(kub19); groups with view permissions: Set(); users  with modify permissions: Set(kub19); groups with modify permissions: Set()
20/03/26 19:09:00 INFO Utils: Successfully started service 'sparkDriver' on port 46817.
20/03/26 19:09:00 INFO SparkEnv: Registering MapOutputTracker
20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMaster
20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/03/26 19:09:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-0baf5097-2595-4542-99e2-a192d7baf37c
20/03/26 19:09:00 INFO MemoryStore: MemoryStore started with capacity 127.2 MiB
20/03/26 19:09:01 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/26 19:09:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/03/26 19:09:01 INFO Utils: Successfully started service 'SparkUI' on port 4041.
20/03/26 19:09:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4041
20/03/26 19:09:01 INFO Executor: Starting executor ID driver on host localhost
20/03/26 19:09:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38135.
20/03/26 19:09:01 INFO NettyBlockTransferService: Server created on localhost:38135
20/03/26 19:09:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/26 19:09:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManagerMasterEndpoint: Registering block manager localhost:38135 with 127.2 MiB RAM, BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse').
20/03/26 19:09:01 INFO SharedState: Warehouse path is 'file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse'.
20/03/26 19:09:02 INFO SparkContext: Starting job: parquet at chapter2.scala:18
20/03/26 19:09:02 INFO DAGScheduler: Got job 0 (parquet at chapter2.scala:18) with 1 output partitions
20/03/26 19:09:02 INFO DAGScheduler: Final stage: ResultStage 0 (parquet at chapter2.scala:18)
20/03/26 19:09:02 INFO DAGScheduler: Parents of final stage: List()
20/03/26 19:09:02 INFO DAGScheduler: Missing parents: List()
20/03/26 19:09:02 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at chapter2.scala:18), which has no missing parents
20/03/26 19:09:02 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 72.8 KiB, free 127.1 MiB)
20/03/26 19:09:02 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 25.9 KiB, free 127.1 MiB)
20/03/26 19:09:02 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38135 (size: 25.9 KiB, free: 127.2 MiB)
20/03/26 19:09:02 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1206
20/03/26 19:09:02 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at chapter2.scala:18) (first 15 tasks are for partitions Vector(0))
20/03/26 19:09:02 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/03/26 19:09:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7560 bytes)
20/03/26 19:09:02 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/26 19:09:02 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1840 bytes result sent to driver
20/03/26 19:09:02 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 204 ms on localhost (executor driver) (1/1)
20/03/26 19:09:02 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/03/26 19:09:02 INFO DAGScheduler: ResultStage 0 (parquet at chapter2.scala:18) finished in 0.304 s
20/03/26 19:09:02 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
20/03/26 19:09:02 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
20/03/26 19:09:02 INFO DAGScheduler: Job 0 finished: parquet at chapter2.scala:18, took 0.332643 s
20/03/26 19:09:03 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:38135 in memory (size: 25.9 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO V2ScanRelationPushDown:
Pushing operators to parquet file:/data/flight-data/parquet/2010-summary.parquet
Pushed Filters:
Post-Scan Filters:
Output: DEST_COUNTRY_NAME#0, ORIGIN_COUNTRY_NAME#1, count#2L
         
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 290.0 KiB, free 126.9 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.3 KiB, free 126.9 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:38135 (size: 24.3 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 1 from take at chapter2.scala:20
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 290.1 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.3 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:38135 (size: 24.3 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 2 from take at chapter2.scala:20
20/03/26 19:09:04 INFO CodeGenerator: Code generated in 159.155401 ms
20/03/26 19:09:04 INFO SparkContext: Starting job: take at chapter2.scala:20
20/03/26 19:09:04 INFO DAGScheduler: Got job 1 (take at chapter2.scala:20) with 1 output partitions
20/03/26 19:09:04 INFO DAGScheduler: Final stage: ResultStage 1 (take at chapter2.scala:20)
20/03/26 19:09:04 INFO DAGScheduler: Parents of final stage: List()
20/03/26 19:09:04 INFO DAGScheduler: Missing parents: List()
20/03/26 19:09:04 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at take at chapter2.scala:20), which has no missing parents
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 22.7 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 8.1 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:38135 (size: 8.1 KiB, free: 127.1 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1206
20/03/26 19:09:04 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at take at chapter2.scala:20) (first 15 tasks are for partitions Vector(0))
20/03/26 19:09:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
20/03/26 19:09:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 7980 bytes)
20/03/26 19:09:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
20/03/26 19:09:05 INFO FilePartitionReader: Reading file path: file:///data/flight-data/parquet/2010-summary.parquet/part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet, range: 0-3921, partition values: [empty row]
20/03/26 19:09:05 INFO ZlibFactory: Successfully loaded & initialized native-zlib library
20/03/26 19:09:05 INFO CodecPool: Got brand-new decompressor [.gz]
20/03/26 19:09:05 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1762 bytes result sent to driver
20/03/26 19:09:05 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 219 ms on localhost (executor driver) (1/1)
20/03/26 19:09:05 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/03/26 19:09:05 INFO DAGScheduler: ResultStage 1 (take at chapter2.scala:20) finished in 0.235 s
20/03/26 19:09:05 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
20/03/26 19:09:05 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
20/03/26 19:09:05 INFO DAGScheduler: Job 1 finished: take at chapter2.scala:20, took 0.238010 s
20/03/26 19:09:05 INFO CodeGenerator: Code generated in 17.77886 ms
20/03/26 19:09:05 INFO SparkUI: Stopped Spark web UI at http://localhost:4041
20/03/26 19:09:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/26 19:09:05 INFO MemoryStore: MemoryStore cleared
20/03/26 19:09:05 INFO BlockManager: BlockManager stopped
20/03/26 19:09:05 INFO BlockManagerMaster: BlockManagerMaster stopped
20/03/26 19:09:05 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/26 19:09:05 INFO SparkContext: Successfully stopped SparkContext
20/03/26 19:09:05 INFO ShutdownHookManager: Shutdown hook called
20/03/26 19:09:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-6d99677e-ae1b-4894-aa32-3a79fb0b4307

Process finished with exit code 0
---------------------------------- chapter2.scala ----------------------------------
import org.apache.spark.sql.SparkSession

object chapter2 {

  // Define a case class matching the file's schema; spark.implicits._ below
  // supplies the Encoder Spark needs to treat the DataFrame as a Dataset[flight].
  case class flight(DEST_COUNTRY_NAME: String,
                    ORIGIN_COUNTRY_NAME: String,
                    count: BigInt)

  def main(args: Array[String]): Unit = {

    // In the interactive shell a SparkSession already exists; in a standalone
    // application (run from IntelliJ here) we have to build one ourselves.
    val spark = SparkSession.builder.master("local[*]").appName("chapter2").getOrCreate

    // Bring the implicit Encoders into scope for the .as[flight] conversion below.
    import spark.implicits._
    val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
    val flights = flightDf.as[flight]
    flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)

    spark.stop()
  }
}

¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}

Re: results of take(3) not appearing in console window

rxin
bcc dev, +user

You need to print out the result. Take itself doesn't print. You only got the results printed to the console because the Scala REPL automatically prints the returned value from take.
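
In the standalone chapter2 program that means printing the returned value yourself, for example (a minimal sketch reusing the flights Dataset from the code above; iterating over the Array returned by take, or calling Dataset.show, both work):

// take(3) collects an Array[flight] to the driver but prints nothing itself,
// so print each element of the returned Array explicitly:
flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
  .take(3)
  .foreach(println)

// alternatively, let Spark render the first three rows as a table:
flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").show(3)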

