Spark Release 1.3.0 | Apache Spark (original) (raw)

Spark 1.3.0 is the fourth release on the 1.X line. This release brings a new DataFrame API alongside the graduation of Spark SQL from an alpha project. It also brings usability improvements in Spark’s core engine and expansion of MLlib and Spark Streaming. Spark 1.3 represents the work of 174 contributors from more than 60 institutions in more than 1000 individual patches.

To download Spark 1.3 visit the downloads page.

Spark Core

Spark 1.3 sees a handful of usability improvements in the core engine. The core API now supports multi level aggregation trees to help speed up expensive reduce operations. Improved error reporting has been added for certain gotcha operations. Spark’s Jetty dependency is now shaded to help avoid conflicts with user programs. Spark now supports SSL encryption for some communication endpoints. Finally, realtime GC metrics and record counts have been added to the UI.

DataFrame API

Spark 1.3 adds a new DataFrames API that provides powerful and convenient operators when working with structured datasets. The DataFrame is an evolution of the base RDD API that includes named fields along with schema information. It’s easy to construct a DataFrame from sources such as Hive tables, JSON data, a JDBC database, or any implementation of Spark’s new data source API. Data frames will become a common interchange format between Spark components and when importing and exporting data to other systems. Data frames are supported in Python, Scala, and Java.

Spark SQL

In this release Spark SQL graduates from an alpha project, providing backwards compatibility guarantees for the HiveQL dialect and stable programmatic API’s. Spark SQL adds support for writing tables in the data sources API. A new JDBC data source allows importing and exporting from MySQL, Postgres, and other RDBMS systems. A variety of small changes have expanded the coverage of HiveQL in Spark SQL. Spark SQL also adds support schema evolution with the ability to merging compatible schemas in Parquet.

Spark ML/MLlib

In this release Spark MLlib introduces several new algorithms: latent Dirichlet allocation (LDA) for topic modeling, multinomial logistic regression for multiclass classification, Gaussian mixture model (GMM) and power iteration clustering for clustering, FP-growth for frequent pattern mining, and block matrix abstraction for distributed linear algebra. Initial support has been added for model import/export in exchangeable format, which will be expanded in future versions to cover more model types in Java/Python/Scala. The implementations of k-means and ALS receive updates that lead to significant performance gain. PySpark now supports the ML pipeline API added in Spark 1.2, and gradient boosted trees and Gaussian mixture model. Finally, the ML pipeline API has been ported to support the new DataFrames abstraction.

Spark Streaming

Spark 1.3 introduces a new direct Kafka API (docs) which enables exactly-once delivery without the use of write ahead logs. It also adds a Python Kafka API along with infrastructure for additional Python API’s in future releases. An online version of logistic regression and the ability to read binary records have also been added. For stateful operations, support has been added for loading of an initial state RDD. Finally, the streaming programming guide has been updated to include information about SQL and DataFrame operations within streaming applications, and important clarifications to the fault-tolerance semantics.

GraphX

GraphX adds a handful of utility functions in this release, including conversion into a canonical edge graph.

Upgrading to Spark 1.3

Spark 1.3 is binary compatible with Spark 1.X releases, so no code changes are necessary. This excludes API’s marked explicitly as unstable.

As part of stabilizing the Spark SQL API, the SchemaRDD class has been renamed to DataFrame. Spark SQL’s migration guide describes the upgrade process in detail. Spark SQL also now requires that column identifiers which use reserved words (such as “string” or “table”) be escaped using backticks.

Known Issues

This release has few known issues which will be addressed in Spark 1.3.1:

Credits

Thanks to everyone who contributed!

Spark News Archive