I would like to start a discussion about adding support for images in Spark. We will follow up with a formal vote in two weeks. Please feel free to comment on the JIRA ticket too.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industry standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer facilities for transforming images or doing more complex operations, and each makes design tradeoffs that suit it better as a standalone solution.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It uses significantly more memory than compressed image representations such as JPEG or PNG, but it allows easy communication with popular image processing libraries and has no decoding overhead.
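To give a sense of the memory cost of a decompressed representation, here is a minimal sketch that computes the size of a dense pixel buffer. The function name and its parameters are hypothetical and are not part of the proposal; the sketch only assumes that every pixel stores one fixed-size value per channel.

```python
def decompressed_size_bytes(height, width, n_channels, bytes_per_element):
    """Return the size in bytes of a dense, uncompressed image buffer.

    Hypothetical helper for illustration only: it assumes each pixel holds
    n_channels values, each occupying bytes_per_element bytes.
    """
    dims = {
        "height": height,
        "width": width,
        "n_channels": n_channels,
        "bytes_per_element": bytes_per_element,
    }
    for name, value in dims.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return height * width * n_channels * bytes_per_element


# A 1920x1080 image with 3 channels of 8-bit values.
size = decompressed_size_bytes(1080, 1920, 3, 1)
print(size)  # 6220800
```

That is roughly 6.2 MB held in memory for a single image, which is the cost accepted in exchange for skipping the decoding step.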
Target users and personas: data scientists, data engineers, and library developers.
Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in the industry currently: multi-channel matrices of binary, int32, int64, float or double data that can fit comfortably in the heap of the JVM:
We propose to add a new package in the package structure, under the MLlib project:
We propose to add the following structure:
imageSchema = StructType([
For more information about image types, here is an OpenCV guide on types: http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates
The reference implementation provides some functions to convert popular formats (JPEG, PNG, etc.) to the image specification above, and some functions to verify if an image is valid.
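As an illustration of what checking an image for validity could involve, here is a minimal sketch in plain Python. It is not the reference implementation's function: the `ImageRecord` class, its field names, and `is_valid` are all hypothetical, and the only check shown is that the pixel buffer length matches the declared dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical in-memory image record (field names are assumptions)."""
    file_name: str          # where the image was loaded from
    height: int             # number of rows
    width: int              # number of columns
    n_channels: int         # values per pixel
    bytes_per_element: int  # size of one channel value in bytes
    data: bytes             # decompressed pixel buffer


def is_valid(image: ImageRecord) -> bool:
    """Check that dimensions are positive and the buffer has the right length."""
    dims = (image.height, image.width, image.n_channels, image.bytes_per_element)
    if any(d <= 0 for d in dims):
        return False
    expected = image.height * image.width * image.n_channels * image.bytes_per_element
    return len(image.data) == expected


good = ImageRecord("file:/tmp/a.png", 2, 2, 3, 1, bytes(12))
bad = ImageRecord("file:/tmp/b.png", 2, 2, 3, 1, bytes(5))
print(is_valid(good), is_valid(bad))
```

A real validator would likely also check that the declared image type is one of the supported ones; that part is left out here because it depends on the schema details.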
We propose the following function to load images from a remote distributed source as a DataFrame. Here is the signature in Scala; the Python interface is similar. For compatibility with Java, this function should be made available through a builder pattern or through the DataSource API. The exact mechanics can be discussed during implementation; the goal here is to specify the behavior.
The type of the returned DataFrame should be the structure type above, with the expectation that all the file names be filled.
The following parameters are experimental and may be quickly deprecated. They would be useful to have but are not critical for a first cut:
The implementation is expected to be in Scala for performance, with a wrapper for Python.
The reference implementation also has some experimental options (undocumented here).
A reference implementation is available as an open-source Spark package in this repository (Apache 2.0 license):
This Spark package will also be published in binary form on spark-packages.org.
Comments about the API should be addressed in this ticket.
The use of User-Defined Types was considered. It adds some burden to the implementation in each supported language and does not provide significant advantages.