PySpark 2.4.1 provides two packages for machine learning: pyspark.ml and pyspark.mllib.
When two packages exist for similar purposes, one of them often has a lower-level implementation that offers finer control but is harder to use, while the other wraps it for convenience. That is not the case here, however: it seems Spark is simply in the middle of a transition from pyspark.mllib to pyspark.ml.
RDD → DataFrame / Dataset
Spark internally distributes a data structure called an RDD (Resilient Distributed Dataset) across the cluster during processing. Since an RDD does not need to be structured, it can hold any kind of data, such as media streams.
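As a minimal sketch, an RDD is just a distributed collection of arbitrary Python objects that you transform with functions (the values below are arbitrary):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# An RDD imposes no schema on its elements; Spark only sees opaque objects.
rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

# Transformations are plain Python functions applied element by element.
print(rdd.map(lambda kv: kv[1] * 10).collect())  # [10, 20, 30]
```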
On the other hand, most data targeted by distributed processing has some kind of structure. For example, in a two-dimensional matrix operation, each element can be represented by a triple of row index, column index, and value. This structure leaves room for optimization, and a new data structure called DataFrame / Dataset has been built on top of RDD.
A Dataset handles structured data and comes in two flavors: a strongly typed API and an untyped API, the latter being called a DataFrame. As the name suggests, it is structured much like a Pandas DataFrame, and since Python has no compile-time type safety, only the DataFrame API is available from PySpark.
Using a DataFrame not only reduces memory usage and improves performance, it also boosts productivity through higher-level functions and lets you display your data neatly as a table in a Jupyter Notebook.
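For comparison, here is the same kind of data expressed as a DataFrame; the column names below are just illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row now has named, typed columns, so Spark can optimize the query plan
# and render the result as a table.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# Column-based operations instead of arbitrary Python lambdas.
df.filter(df.value > 1).show()
```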
pyspark.mllib → pyspark.ml
pyspark.mllib and pyspark.ml provide APIs based on RDD and DataFrame, respectively.
Since machine learning algorithms are often formulated as matrix operations, they are a natural fit for the DataFrame-based API.
It seems pyspark.mllib is already in maintenance mode and will be deprecated in Spark 3.0, so you should use pyspark.ml.
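As a rough illustration of the pyspark.ml style, here is a minimal sketch that trains a logistic regression model directly on a DataFrame; the column names and toy values are made up for this example:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy training data: a binary label and two hypothetical feature columns.
train = spark.createDataFrame(
    [(0.0, 1.1, 0.2), (1.0, 3.5, 2.1), (0.0, 0.9, 0.1), (1.0, 4.2, 1.8)],
    ["label", "x1", "x2"],
)

# pyspark.ml estimators expect the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled = assembler.transform(train)

# Fit the model and inspect predictions, all as DataFrame operations.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembled)
model.transform(assembled).select("label", "prediction").show()
```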