yt coffee

Study hard, play harder.

Getting Started With Pyspark Using Docker

  • 2019-01-16 11:50
  • #pyspark

Apache Spark is an open-source distributed computing framework implemented on top of the JVM that has seen a rapid rise in popularity in recent years. I will be using Spark through pyspark at my next job, so I want to start playing with pyspark, but installing Spark from scratch is hard for me because I'm not familiar with the JVM ecosystem.

So in this article, I'll use Docker to set up a Spark and pyspark environment.

Set Up an Environment

There is a Docker image named jupyter/pyspark-notebook published as part of the Jupyter Docker Stacks. For now, let's pull the tag that is the latest at the time of writing:

docker pull jupyter/pyspark-notebook:87210526f381

Run it, bind-mounting the current directory into the container as /app and publishing Jupyter's port 8888:

docker run --rm -w /app -p 8888:8888 \
    --mount type=bind,src=$(pwd),dst=/app \
    jupyter/pyspark-notebook:87210526f381

The container then prints several log messages, including a URL with an access token. Open that URL in a browser and you will get a Jupyter Notebook from which pyspark can be used:

import pyspark
pyspark.version.__version__
'2.4.0'

Incidentally, this article itself was written in a Jupyter Notebook launched exactly this way.

Launching a Spark Cluster

Spark normally runs as a cluster in a distributed environment, but spinning up a distributed cluster just for development is overkill, so Spark also provides a local mode.

To start Spark in local mode via pyspark, call pyspark.SparkContext:

sc = pyspark.SparkContext('local[*]')

The string is the master URL; in local mode it determines the number of worker threads: 'local' uses a single thread, 'local[N]' uses N threads, and 'local[*]' uses as many threads as there are CPU cores.

It seems that local[*] is the most commonly used form.
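
As a quick check, the defaultParallelism property of the context shows the default number of partitions Spark will use when you don't specify one, which with local[*] typically matches the number of CPU cores. Here is a minimal peek at it using the sc created above:

# With 'local[*]' this is usually the number of CPU cores on the machine
sc.defaultParallelism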

Let's try calculating the sum of the integers from 0 to 9 (range(10) stops just before 10):

rdd = sc.parallelize(range(10))
rdd.sum()
45

Stop the cluster when you are done using it:

sc.stop()
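
As an aside, SparkContext can also be used as a context manager in recent versions of pyspark, which should call stop() automatically when the block exits; a minimal sketch:

import pyspark

# stop() is called automatically when the with block exits,
# even if the computation inside raises an exception
with pyspark.SparkContext('local[*]') as sc:
    print(sc.parallelize(range(10)).sum())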

Conclusion

In this article, I set up a pyspark environment using Docker and launched a Spark cluster in local mode. I don't know much about Spark yet, but I'll keep trying things out little by little.
