
A Reproducible Data Science Environment in Ubuntu 16.04 with Python, R, Spark, and Docker


Data Science is heavily reliant on fairly complex software environments in which a minor version bump in one package can change the final result. For reproducibility, it's very useful to have an environment with well-defined versions of common packages that is easy to replicate. Since data science typically involves a pipeline, it is not uncommon to have different pipeline stages written in different languages, such as R and Python, with an interoperability layer (such as rpy2) linking them. This dual-language environment is especially difficult to version control and reproduce. Here is my solution: a Python/Spark/R environment, mostly based on Anaconda.

I have tested the entire tutorial on a local workstation, and on Amazon EC2. I post this mostly for my own documentation, but it may be useful to others as well.

Step 0 : System Setup

Install Ubuntu 16.04 LTS, and execute the following.

Set up SSH. You can skip this part if you are running on EC2.

sudo apt-get install openssh-server

# setup easy login to localhost
ssh-keygen -t rsa -b 4096
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
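
To confirm that key-based login works, the following should drop you into a shell on localhost without a password prompt:

ssh localhost   # should log in without asking for a password
exit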

Update the system and reboot

sudo apt-get update && sudo apt-get upgrade
sudo reboot

Step 1 : Java

Note: If you have no interest in Spark, you can skip this step.

Install Java. This step comes first because some packages can pull in OpenJDK as a dependency, and OpenJDK does not work as well with Spark as Oracle Java.

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer
sudo apt-get install -y oracle-java8-set-default
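
Verify that Oracle Java 8 is now the default:

java -version   # should report a 1.8.0 Oracle build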

Step 2 : Anaconda

Install Anaconda3 to get Python. We will later install R through Anaconda as well.

wget 'http://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh'
bash Anaconda3*.sh -b
echo 'export PATH=${HOME}/anaconda3/bin:$PATH' >> ~/.bashrc
source ~/.bashrc   # pick up the new PATH in the current shell

I strongly recommend using the version of Anaconda indicated above, and not a later one. The Spark / PySpark conda packages require Python 3.5, whereas more recent Anaconda releases ship with Python 3.6.
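
Confirm that the Anaconda interpreter is now the first one on your PATH:

which python      # should point at ${HOME}/anaconda3/bin/python
python --version  # should report Python 3.5.x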

Here is the set of packages that I use:

conda install -c conda-forge fastparquet geopandas fiona snappy python-snappy\
    bokeh dask distributed numba scikit-learn pyarrow matplotlib palettable\
    seaborn bottleneck

pip install git+https://github.com/pymc-devs/pymc3
pip install brewer2mpl
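
As an optional smoke test, the imports below should succeed without errors (this is just a sketch; adjust the module list to whatever subset you installed):

python -c "import geopandas, fastparquet, dask, sklearn, pymc3; print('imports OK')"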

Step 3 : Spark

Install Spark 2.0.2 through conda, from the quasiben channel:

conda install -c quasiben spark=2.0.2 -y

The following environment variables set Jupyter as the PySpark driver. Add them to your .bashrc.

# JAVA / SPARK VARIABLES START
# resolve the real java binary and strip /bin/java to get JAVA_HOME
JAVA_HOME=`which java`
JAVA_HOME=`readlink -f ${JAVA_HOME}`
export JAVA_HOME=`echo ${JAVA_HOME} | sed -e 's/\/bin\/java//'`

export JRE_HOME=${JAVA_HOME}
export PYSPARK_SUBMIT_ARGS="--master 'local[4]' --executor-memory 3G --driver-memory 3G"
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
# PYSPARK_PYTHON (not SPARK_DRIVER_PYTHON) is the variable Spark reads for the
# worker interpreter; point it at the Anaconda Python
export PYSPARK_PYTHON=${HOME}/anaconda3/bin/python

Adjust local[N] to the number of cores on your machine, and set the executor and driver memory to suit. On a 16 GB machine with 4 cores, 3 GB each for driver and executor memory is a good rule of thumb.
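
As a quick sanity check (assuming the variables above have been added to your .bashrc and sourced), Spark should report its version, and launching pyspark should start a Jupyter notebook server:

source ~/.bashrc
spark-submit --version   # should report version 2.0.2
pyspark                  # starts a Jupyter notebook server; in a new notebook,
                         # the SparkContext should be available as sc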

Step 4 : R

Continuum, the company behind the Anaconda Python distribution, maintains an R channel for Anaconda where many (perhaps most) common R packages are available.

conda install -c r r 
conda install -c r rpy2 r-tidyverse r-shiny r-essentials r-sparklyr \
    r-feather r-markdown r-knitr r-spatial r-rstan r-rbokeh r-maps \
    r-hexbin r-ggvis
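
To check that R is on the PATH and that the rpy2 bridge works from Python, the following should print the installed R version:

R --version | head -n 1
python -c "import rpy2.robjects as ro; print(ro.r('R.version.string')[0])"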

[Optional: RStudio]

Following the instructions from R-bloggers:

sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list
sudo gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
sudo gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update

sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-1.0.143-amd64.deb
sudo gdebi -n rstudio-1.0.143-amd64.deb
rm rstudio-1.0.143-amd64.deb

If you want RStudio Server, it can be installed by following the guide here.

Step 5 : Archive the Anaconda Environment

The following command will produce an explicit list of the packages installed in the Anaconda environment. You can find more documentation from Continuum.

conda list --explicit > spec-file.txt

Save this spec file that was just created. It can be used to create a new environment on another machine with the same operating system.

conda create --name MyEnvironment --file spec-file.txt

I have made my spec-file available on GitHub.

Step 6 : [Optional] Docker

Install Docker

sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo 'deb https://apt.dockerproject.org/repo ubuntu-xenial main' | sudo tee -a /etc/apt/sources.list.d/docker.list
sudo apt-get update && sudo apt-get install -y docker-engine
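
Docker's hello-world image gives a quick check that the daemon is installed and running:

sudo docker run hello-world   # should print "Hello from Docker!"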

Step 7 : [Optional, EC2] Create a Custom AMI

A custom AMI (Amazon Machine Image) allows creation of a bootable image from a running EBS-backed instance. If all previous steps in this tutorial have been run on EC2, Amazon's documentation shows how to create an image from a running instance, which can be used to spin up additional identical instances. Their documentation is excellent, so I do not rehash it here.

Conclusion

Setting up an environment that works is a very time-consuming and inconvenient part of data science. This post represents the results of a lot of trial and error in obtaining an environment that is stable and reproducible. Using the directions here, it should be possible to create an environment identical to the one I use for the data analyses I publish on this blog.