Jupyter Hub on Kubernetes Part II: NFS
September 10, 2016
This is the second part of my JupyterHub on Kubernetes deployment experiment; be sure to read Part I first.
Last time we got JupyterHub authenticating against LDAP and spawning the single-user notebook servers as Kubernetes containers. As I mentioned in that post, one big problem with that deployment was that the notebook files were gone once a pod was deleted, so this time I add an NFS volume to the JupyterHub single-user containers to persist the notebook data.
I also improved the deployment and code a bit, so it is no longer necessary to build a custom image: you can just pull two images from my Docker Hub registry and configure them using a Kubernetes ConfigMap.
All the code in this post is at danielfrg/jupyterhub-kubernetes_spawner, specifically in the example directory.
NFS
There are multiple options for persistent data in Kubernetes containers. I chose NFS because it is one of the few volume types that allows multiple containers to have read and write access (ReadWriteMany). Most persistent volume types only support read and write access from a single container (ReadWriteOnce) and/or read-only access from multiple containers (ReadOnlyMany).
I wanted all the JupyterHub single-user containers to write their notebooks to the same location so it is easier to back up the data. Since the notebook servers are single user, each of those containers could instead use its own disk, but in that case backup and maintenance become more complicated.
The following is heavily based on the Kubernetes NFS documentation.
In order to mount an NFS volume into the Kubernetes containers we need an NFS server, and even before that we need a volume the NFS server can use to store the data.
In GCE we can use a persistent volume claim to ask for a Disk resource:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyterhub-storage
  annotations:
    volume.alpha.kubernetes.io/storage-class: any
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
When you apply this with kubectl create -f file.yaml, GCE will create a new disk that you can see in the GCE cloud console.
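You can also confirm from the command line that the claim was created and bound to a new disk; standard kubectl commands (nothing specific to this setup) are enough:

# the STATUS column should show Bound once the disk is provisioned
$ kubectl get pvc jupyterhub-storage
$ kubectl describe pvc jupyterhub-storage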
Now that we have a disk the NFS server can use to store its data, we create a Deployment for the NFS server and a Service to expose it.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jupyterhub-nfs
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jupyterhub-nfs
    spec:
      containers:
      - name: nfs-server
        image: gcr.io/google-samples/nfs-server:1.1
        ports:
        - name: nfs
          containerPort: 2049
        - name: mountd
          containerPort: 20048
        - name: rpcbind
          containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
        - name: mypvc
          mountPath: /exports
      volumes:
      - name: mypvc
        persistentVolumeClaim:
          claimName: jupyterhub-storage
---
kind: Service
apiVersion: v1
metadata:
  name: jupyterhub-nfs
spec:
  ports:
  - name: nfs
    port: 2049
  - name: mountd
    port: 20048
  - name: rpcbind
    port: 111
  selector:
    app: jupyterhub-nfs
All of this can be found in the example/nfs.yml file; once it is executed there should be a jupyterhub-nfs service.
$ kubectl create -f example/nfs.yml
...
$ kubectl get services
NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
jupyterhub-nfs   10.103.253.185   <none>        2049/TCP,20048/TCP,111/TCP   15h
With the NFS service ready we now need to create a Persistent Volume and another Persistent Volume Claim for the containers to use. A small manual step is needed here, since at the moment it is not possible to reference a service IP from a Persistent Volume object: open example/nfs2.yml and replace {{ X.X.X.X }} with the jupyterhub-nfs service IP, in this case 10.103.253.185.
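If you prefer not to edit the file by hand, something like the following works as well (a small convenience sketch: it uses kubectl's jsonpath output and assumes GNU sed):

# grab the ClusterIP of the jupyterhub-nfs service
$ NFS_IP=$(kubectl get service jupyterhub-nfs -o jsonpath='{.spec.clusterIP}')
# substitute the placeholder in the volume definition
$ sed -i "s/{{ X.X.X.X }}/$NFS_IP/" example/nfs2.yml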
The nfs2.yml file also includes a small NGINX deployment with autoindex on; so you can browse the files on the server. It is the base nginx image with a ConfigMap that replaces the default NGINX conf file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyterhub-nfs-web-config
data:
  default.conf: |-
    server {
        listen 80;
        server_name localhost;
        root /usr/share/nginx/html;

        location / {
            index none;
            autoindex on;
            autoindex_exact_size off;
            autoindex_localtime on;
        }
    }
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jupyterhub-nfs-web
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jupyterhub-nfs-web
    spec:
      containers:
      - name: web
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: nfs
          mountPath: "/usr/share/nginx/html"
        - name: config-volume
          mountPath: "/etc/nginx/conf.d/"
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: jupyterhub-nfs
      - name: config-volume
        configMap:
          name: jupyterhub-nfs-web-config
Start the deployment and service and you should see a new jupyterhub-nfs-web service:
$ kubectl create -f example/nfs2.yml
...
$ kubectl get services
NAME                 CLUSTER-IP       EXTERNAL-IP      PORT(S)   AGE
jupyterhub-nfs-web   10.103.241.248   104.197.178.53   80/TCP    15h
Going to that External IP you should see an empty NGINX file listing.
That's it! You have an NFS server, and the kubernetes_spawner will take care of mounting it into the containers based on a few settings.
JupyterHub
Last time we had JupyterHub creating containers in Kubernetes. Now we have the same functionality, but it should be easier to get it running and configured thanks to a Kubernetes ConfigMap and two example Docker images I uploaded to my Docker Hub registry.
In the last post it was necessary to build a custom JupyterHub container image for each deployment; that image baked in the jupyterhub_config.py file with some values (credentials and IP addresses). I changed the deployment to use a Kubernetes ConfigMap that creates the config file, so the base container images can be reused and you only need to fill in the missing values in the ConfigMap (example/hub.yml).
- danielfrg/jupyterhub-kube-ldap: based on jupyterhub/jupyterhub, it includes the jupyterhub/ldapauthenticator and my jupyterhub-kubernetes_spawner
- danielfrg/jupyterhub-kube-ldap-singleuser: based on jupyterhub/singleuser, it just has a different startup script that creates the working notebook directory before starting the jupyterhub-singleuser server
There are a few missing values that need to be filled in before starting the deployment, but they are easy to find.
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyterhub-config-py
data:
  jupyterhub-config.py: |-
    c.JupyterHub.confirm_no_ssl = True
    c.JupyterHub.db_url = 'sqlite:////tmp/jupyterhub.sqlite'
    c.JupyterHub.cookie_secret_file = '/tmp/jupyterhub_cookie_secret'

    c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
    c.LDAPAuthenticator.bind_dn_template = 'cn={username},cn=jupyterhub,dc=example,dc=org'
    # c.LDAPAuthenticator.server_address = '{{ LDAP_SERVICE_IP }}'
    c.LDAPAuthenticator.use_ssl = False

    c.JupyterHub.spawner_class = 'kubernetes_spawner.KubernetesSpawner'
    # c.KubernetesSpawner.host = '{{ KUBE_HOST }}'
    # c.KubernetesSpawner.username = '{{ KUBE_USER }}'
    # c.KubernetesSpawner.password = '{{ KUBE_PASS }}'
    c.KubernetesSpawner.verify_ssl = False
    c.KubernetesSpawner.hub_ip_from_service = 'jupyterhub'
    c.KubernetesSpawner.container_image = "danielfrg/jupyterhub-kube-ldap-singleuser:0.1"

    c.Spawner.notebook_dir = '/mnt/notebooks/%U'
    c.KubernetesSpawner.persistent_volume_claim_name = 'jupyterhub-nfs'
    c.KubernetesSpawner.persistent_volume_claim_path = '/mnt'
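The commented-out placeholders are the values to fill in. The LDAP service IP comes from the LDAP service deployed in Part I and the Kubernetes API endpoint from your cluster info; for example (standard kubectl commands, and the LDAP service name may differ in your setup):

# ClusterIP of the LDAP service from Part I for LDAP_SERVICE_IP
$ kubectl get service ldap -o jsonpath='{.spec.clusterIP}'
# URL of the Kubernetes API server for KUBE_HOST
$ kubectl cluster-info

KUBE_USER and KUBE_PASS are the cluster credentials, which for a GKE cluster you can find in the cloud console (cluster details) or in your kubeconfig file.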
With these default settings the Kubernetes spawner will use a Kubernetes Persistent Volume Claim and mount it under /mnt; then I just have to tell the Spawner which directory under /mnt to use for each user's notebooks. The %U in notebook_dir is replaced with the username, for example '/mnt/notebooks/danielfrg'.
Start the JupyterHub service the same way as before.
$ kubectl create -f hub.yml
Wait for the service to get an external IP and log in as any LDAP user (see Part I).
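A quick way to wait for it (the hub Service is named jupyterhub in hub.yml, so assuming that name):

# watch the service until the EXTERNAL-IP column is populated
$ kubectl get service jupyterhub --watch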
After logging in, a new pod will be created that mounts the Persistent Volume Claim backed by the NFS server. Create a couple of notebooks and in the jupyterhub-nfs-web service you should see the new files listed.
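Since the NGINX autoindex serves the same share, you can also list the files over HTTP; for example, with the external IP from earlier and the danielfrg user from the config example (the exact paths depend on your usernames):

# per-user notebook directories exposed by the jupyterhub-nfs-web autoindex
$ curl http://104.197.178.53/notebooks/
$ curl http://104.197.178.53/notebooks/danielfrg/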
Now from the JupyterHub admin interface you should be able to terminate the server and start a new one, and the notebooks will persist even though the new server runs in a different pod.
Future
I think persisting the notebooks was the biggest issue with the previous deployment, and that is now fixed, but there are still some security issues.
All the JupyterHub single-user containers run as the same user (root in this example) and the whole NFS share is mounted into every container, which means every user has access to all the other users' notebooks. Even though by default only the user's own notebooks are shown, it is possible to access the other files under /mnt.
There are some possible solutions on the NFS side, but it could also be handled on the Kubernetes and Docker side, maybe with a separate Persistent Volume Claim per user (?). More work is needed here.
The second big (also security-related) issue right now is how JupyterHub accesses the Kubernetes API. Currently the credentials have to be placed in the ConfigMap, and I was planning on using Kubernetes Secrets to make that a little more secure, but it turns out that is not needed.
While I was writing this post I found (as I should have expected) that Kubernetes has already thought about this issue and provides the concept of Service Accounts.
With a Service Account it is possible to access the Kubernetes API without using user account credentials. The Service Account credentials can even be mounted into a container (and are by default), so it should be very easy to change the code to use them.
This removes the need to have a secondary User Account for the JupyterHub. Thanks Kubernetes!
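For reference, the Service Account credentials are mounted into every pod at a well-known path, so a process inside the hub container can reach the API roughly like this (a sketch of the standard in-cluster pattern, not code from the spawner):

# inside a pod: the default Service Account token and CA certificate are mounted here
$ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
$ curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
       -H "Authorization: Bearer $TOKEN" \
       https://kubernetes.default.svc/api/v1/namespaces/default/pods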
This is probably the last (big) post about this experiment, but I plan to keep working on the kubernetes_spawner to make it better. At minimum I am definitely going to test and use the Service Account support to make the deployment more secure.
Update: I have added support for Kubernetes Service Accounts to the kubernetes_spawner, so user account credentials are no longer needed. For updated docs and information take a look at the example folder here: examples/ldap_nfs.