Apache Spark is an open-source distributed computing framework implemented on top of the JVM that has seen a rapid rise in popularity in recent years. I will be using Spark through pyspark at my next job, so I want to start practicing with it, but installing Spark from scratch is difficult for me because I'm not familiar with the JVM.
So in this article, I'll use Docker to set up a Spark and pyspark environment.
Setting Up an Environment
There is a Docker image named jupyter/pyspark-notebook, published as part of the Jupyter Docker Stacks. For now, let's pull the version tagged 87210526f381, which was the latest at the time of writing:
docker pull jupyter/pyspark-notebook:87210526f381
Run it:
docker run --rm -w /app -p 8888:8888 \
--mount type=bind,src=$(pwd),dst=/app \
jupyter/pyspark-notebook:87210526f381
Then you will see several messages, including a URL. Open that URL in a browser and you will get a Jupyter Notebook in which you can use pyspark:
import pyspark
pyspark.version.__version__
'2.4.0'
Note that this article is written using Jupyter Notebook, which was launched exactly in this way.
Launching a Spark Cluster
Spark normally runs as a cluster in a distributed environment, but building a distributed cluster just for development is overkill, so Spark also provides a local mode.
To start Spark in local mode from pyspark, call pyspark.SparkContext:
sc = pyspark.SparkContext('local[*]')
The string passed to SparkContext specifies the number of available threads:

- local - 1 thread
- local[n] - n threads (n is a number)
- local[*] - as many threads as there are cores available to the JVM (Runtime.getRuntime.availableProcessors() is used internally)

It seems that local[*] is the most commonly used; you can check what it resolves to on your machine, as shown below.
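As a quick sanity check (my own addition, not part of the original setup), you can inspect the sc created above to see which master string is in effect and how many threads local[*] resolved to; the exact number depends on the cores your container sees.

# Inspect the running context: the master string and the default number of
# partitions Spark will use for operations such as parallelize().
print(sc.master)               # local[*]
print(sc.defaultParallelism)   # e.g. 4, depending on available cores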
Let's calculate the sum of the numbers from 0 to 9:
rdd = sc.parallelize(range(10))
rdd.sum()
45
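The same RDD supports other transformations and actions besides sum(); the snippet below is a small sketch of my own using map, filter, and collect.

# Square each number, keep the even squares, and bring them back to the driver.
squares = rdd.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
even_squares.collect()
[0, 4, 16, 36, 64]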
Stop the cluster when you are done using it.
sc.stop()
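If you prefer not to call stop() by hand, SparkContext can also be used as a context manager, which stops the context automatically when the block exits. This is a small sketch of that pattern, assuming the previous context has already been stopped (only one can be active at a time):

# The context is stopped automatically when the with-block exits.
with pyspark.SparkContext('local[*]') as sc:
    print(sc.parallelize(range(10)).sum())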
Conclusion
In this article, I created a pyspark environment using Docker and launched a Spark cluster in local mode. I don't know much about Spark yet, but I'll keep trying things out little by little.