Package Installation in Python and R

What are the packages pre-installed for every hub?

The procedure for package installation varies across programming languages. Basic Python packages such as NumPy, pandas, scikit-learn, and matplotlib are installed on the main DataHub. R hubs support packages such as shiny, dplyr, tidyr, and RSQLite. You can also customize the packages for a hub by requesting them using this template. To check the list of installed packages:

  • For Python, use the command below:

!pip list
  • For R, use the command below:

installed.packages()
  • For Julia, check the installed packages by accessing the Julia Hub.
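As a complement to `!pip list`, the installed Python packages can also be read from inside Python itself. The following is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8+); the helper name `installed_packages` is hypothetical and not part of DataHub.

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of (name, version) pairs for installed distributions."""
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is incomplete
            pairs.add((name, dist.version))
    return sorted(pairs, key=lambda pair: pair[0].lower())


# Print the result in the familiar "name==version" form
for name, version in installed_packages():
    print(f"{name}=={version}")
```

The output depends entirely on the environment in which the cell runs, so it reflects your own hub instance rather than any published list.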

Note

Here is the list of Python packages installed in datahub.berkeley.edu. Here is the list of R packages installed in datahub.berkeley.edu. Here is the list of Julia packages installed in datahub.berkeley.edu.

What should I do if I want to install more packages?

  • Use your DataHub instance to install the required version of the package. Installing packages yourself in your instance of the hub is a temporary measure to identify dependencies. If you require a permanent solution, ask us to install the required package(s) in your hub.

  • To install Python packages in your instance, use the following syntax:

pip install <package-name>
# For example:
pip install numpy
  • To install R packages in your instance, use the following syntax:

install.packages("<package-name>")
# For example:
install.packages("ggplot2")
  • Check whether the installed package has specific dependencies. Include the package name, its version, and its dependencies in your request.

  • Raise a request using this template!
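To gather the version and dependency details mentioned in the steps above, one option is to query the metadata of a package after installing it in your own instance. This is a rough sketch using the standard-library `importlib.metadata` module; the helper name `describe_package` is hypothetical. Note that `requires` reports only the dependencies a package declares directly, including optional ones guarded by environment markers.

```python
from importlib import metadata


def describe_package(name):
    """Return (version, declared requirements) for an installed package, or None."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    # requires() returns None when the package declares no dependencies
    requirements = metadata.requires(name) or []
    return version, requirements


info = describe_package("numpy")
if info is None:
    print("numpy is not installed in this environment")
else:
    version, requirements = info
    print(f"numpy=={version}")
    for requirement in requirements:
        print("  requires:", requirement)
```

The printed lines can be pasted into the request so that the exact version you tested is on record.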

Pre-installed Packages

Many Python packages have been pre-installed on JupyterHub and are available by default. This file on GitHub contains a list of all the packages (and corresponding versions) that are available by default. To use a pre-installed package such as numpy, you can simply type the line below into a code cell.

import numpy

Installing New Packages

You can also install other packages that are not on this list. There are two methods for installation. If you will be using a package regularly in your course, we recommend using the long-term installation method.

Temporary Installation

Notebooks support shell commands in code cells when the command is prefixed with !. The Python lines below, when run in a notebook code cell, temporarily install numpy into a user’s personal account; the R lines do the same for ggplot2 in an R notebook. The package is then available for use while the server is running, and the cell must be run again every time the user’s server is restarted. Note that this is not a system-wide installation: the packages are installed only into the user’s personal account.

# Python notebook
!pip install numpy
import numpy
# R notebook
install.packages("ggplot2")
library("ggplot2")
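A common pitfall with temporary installs is that a bare `pip` on the shell path may belong to a different Python than the one running the notebook kernel. The sketch below shows one way to avoid that by invoking pip through `sys.executable`; the helper names `pip_install_command` and `install_temporarily` are hypothetical, and the version number is only an illustration. Whether this matters on a given hub depends on how its environment is set up, which the text above does not specify.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build a pip command that targets the interpreter running this notebook."""
    requirement = f"{package}=={version}" if version else package
    return [sys.executable, "-m", "pip", "install", requirement]


def install_temporarily(package, version=None):
    """Run the install; raises subprocess.CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package, version))


# Show the command without running it
print(pip_install_command("numpy", "1.26.4"))

# Uncomment to actually install (requires network access):
# install_temporarily("numpy", "1.26.4")
```

In recent versions of IPython, the `%pip install <package-name>` magic serves the same purpose from within a notebook cell.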

Long-term installation

This is the recommended method for packages that will be used frequently.

Our JupyterHub is deployed from the berkeley-dsep-infra/datahub GitHub repository. The staging branch reflects the state of the staging hub while the prod branch reflects the state of the production hub.

To request additional libraries in the user environment, create an issue on the GitHub repo for DataHub (linked above), and make sure to use the “DataHub Package Addition / Change Request” template.

On reproducibility

Make sure to specify a version for any library you install. If you do not, it is likely that the deployment process will break at some point during the semester. Omitting a version does not give the user environment the latest version at all times; it only gives it the latest version that existed on the date the CI process ran. If you want to use an unreleased version of a library, specify the corresponding git SHA of that library’s repository.
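To make the two pinning styles concrete, here is a small sketch that formats them as pip-style requirement strings. It assumes the request ends up in a pip-compatible requirements list, which this page does not state; the helper names, `example-lib`, `example-org`, and the `<git-sha>` placeholder are all hypothetical.

```python
def pinned_requirement(name, version):
    """Exact-version pin for a released package."""
    return f"{name}=={version}"


def git_requirement(name, repo_url, sha):
    """Direct reference to one commit of a git repository (for unreleased code)."""
    return f"{name} @ git+{repo_url}@{sha}"


print(pinned_requirement("numpy", "1.26.4"))
print(git_requirement(
    "example-lib",
    "https://github.com/example-org/example-lib",
    "<git-sha>",  # replace with the full commit SHA you tested
))
```

Pinning to a commit SHA rather than a branch name keeps the build stable, since a branch can move between CI runs while a SHA cannot.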