IPython Notebook and Spark’s Python API are a powerful combination for data science.
The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL.
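To give a flavor of that API, here is a minimal sketch of working with an RDD from Python; it assumes a SparkContext is already bound to the name sc, as it is in the pyspark shell or in the notebook setup described below.

# assumes an existing SparkContext bound to `sc`, as in the pyspark shell
data = sc.parallelize(xrange(1000))                  # distribute a local collection as an RDD
squares = data.map(lambda x: x * x)                  # transformations are lazy
print squares.filter(lambda x: x % 2 == 0).count()   # actions trigger computation; prints 500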
The developers of IPython have invested considerable effort in building the IPython Notebook, a system inspired by Mathematica that allows you to create “executable documents.” IPython Notebooks can integrate formatted text (Markdown), executable code (Python), mathematical formulae (LaTeX), and graphics/visualizations (matplotlib) into a single document that captures the flow of an exploration and can be exported as a formatted report or an executable script. Below are a few pieces on why IPython Notebooks can improve your productivity:
Why IPython Notebook will improve your workflow
A scientific paper by Titus Brown demonstrating data/figure reproducibility by sharing an IPython Notebook
A gallery of interesting IPython Notebooks
Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.
IPython: I used IPython 1.x, since I’m running Python 2.6 on CentOS 6. This required me to install a few extra dependencies, like Jinja2, ZeroMQ, pyzmq, and Tornado, to allow the notebook functionality, as detailed in the IPython docs. These requirements are only for the node on which IPython Notebook (and therefore the PySpark driver) will be running.
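A quick way to confirm these dependencies are in place on that node is to import them and print their versions (purely a convenience check):

# convenience check that the notebook dependencies are importable
import IPython, jinja2, zmq, tornado
print 'IPython %s, Jinja2 %s, pyzmq %s, Tornado %s' % (
    IPython.__version__, jinja2.__version__, zmq.__version__, tornado.version)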
PySpark: I used the CDH-installed PySpark (1.x) running through YARN-client mode, which is our recommended method on CDH 5.1. It’s easy to use a custom Spark (or any commit from the repo) through YARN as well. Finally, this will also work with Spark standalone mode.
This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.
First create an IPython profile for use with PySpark.
ipython profile create pyspark
This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:
c = get_config()

c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880 # or whatever you want; be aware of conflicts with CDH
If you want a password prompt as well, first generate a password for the notebook app:
python -c 'from IPython.lib import passwd; print passwd()' > ~/.ipython/profile_pyspark/nbpasswd.txt
and set the following in the same .../ipython_notebook_config.py file you just edited:
import os
PWDFILE = os.path.expanduser('~/.ipython/profile_pyspark/nbpasswd.txt')
c.NotebookApp.password = open(PWDFILE).read().strip()
Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
with the following contents:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
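With this startup file in place, any IPython session launched with the pyspark profile, including notebook kernels, runs shell.py and therefore starts with a SparkContext already bound to the name sc, just like the regular pyspark shell. Once the notebook server described below is running, a minimal first cell along these lines (the numbers are arbitrary) confirms that jobs actually reach the cluster:

# `sc` is created by shell.py when the kernel starts under the pyspark profile
print sc.master                                              # e.g. 'yarn-client'
rdd = sc.parallelize(xrange(100))
print rdd.map(lambda x: 2 * x).reduce(lambda a, b: a + b)    # prints 9900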
IPython Notebook should be run on a machine from which PySpark would normally be run, typically one of the Hadoop nodes.
First, make sure the following environment variables are set:
# for the CDH-installed Spark
export SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark'

# this is where you specify all the options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --num-executors 24 --executor-memory 10g --executor-cores 5'
Note that you must set whatever other environment variables you need to get Spark running the way you desire. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR on HDFS, and set the SPARK_JAR environment variable, along with any other necessary parameters. For example, see here for running a custom Spark on YARN.
Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:
ipython notebook --profile=pyspark
Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.
At this point, the IPython Notebook server should be running. Point your browser to http://my.host.com:8880/, which should open up the main access point to the available notebooks.
This will show the list of available .ipynb files to serve. If it is empty (because this is the first time you’re running it), you can create a new notebook, which will also create a new .ipynb file. As an example, I used PySpark from one such notebook session to analyze the GDELT event data set.
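The session itself is not reproduced here, but it might begin with something like the following sketch; the HDFS path, the tab-delimited layout, and the column index for the year field are hypothetical stand-ins rather than the actual contents of the notebook:

# hypothetical sketch of a PySpark session over GDELT-style data;
# the HDFS path and column positions below are illustrative only
lines = sc.textFile('hdfs:///path/to/gdelt/events.tsv')
events = lines.map(lambda line: line.split('\t'))
# count records per year, assuming the year lives in the second column
counts = (events.map(lambda fields: (fields[1], 1))
                .reduceByKey(lambda a, b: a + b))
for year, n in sorted(counts.collect()):
    print year, n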
The full .ipynb file can be obtained as a GitHub gist.
The notebook itself can be viewed (but not executed) using the public IPython Notebook Viewer.
Learn more about Spark’s role in an enterprise data hub (EDH) here.
Uri Laserson (@laserson) is a data scientist at Cloudera.