I can't mount a volume in my container locally #3835
I just realized that my pipeline was running in a different namespace than the notebook server (kubeflow and not kubeflow-user). Maybe that's what caused the error? Indeed, the pvc doesn't exist in the kubeflow namespace... But I'm still a little bit lost on how to proceed. I'm just starting with Kubernetes/Kubeflow and the beginning is quite rough 😄
The pvc needs to be in the same namespace as the pipeline.
The easiest way is to just install KFP from Google Cloud Marketplace: https://console.cloud.google.com/marketplace/details/google-cloud-ai-platform/kubeflow-pipelines
If you're on Windows or Mac OS X, it should be pretty easy to just install Docker Desktop with Kubernetes and then use the Kubeflow Pipelines Standalone deployment. Installing KFP using Kubeflow tools (kfctl, MiniKF) is not the most up-to-date and supported deployment option...
Can you give us some feedback on how to make the initial experience better? TBH, you seem to have chosen a pretty hard and rough path to learn KFP: a very specialized installation option (MiniKF) for a different product (KF vs KFP (yes, I know this is confusing)), coupled with a very specialized community-contributed feature (VolumeOp) based on an advanced Kubernetes feature (PVCs). The road would be smoother if you started with the official KFP documentation and the core features before moving to some advanced and specialized options. Check the two linked tutorials - they should give you a good jump-start on your pipeline and component building.

I wonder whether you really need volumes. Volumes are advanced Kubernetes concepts and are specific to Kubernetes, which means that you might be reducing your pipeline's portability if you were to depend on them. KFP has great data passing support, so you do not need to care about storage methods. Please check the following two tutorials:
https://github.com/kubeflow/pipelines/blob/fd5778d/samples/tutorials/Data%20passing%20in%20python%20components.ipynb
https://github.com/Ark-kun/kfp_samples/blob/ae1a5b6/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb
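[Editor's note: a hedged sketch of the "components from command-line programs" style referenced in the second tutorial, assuming the KFP v1 SDK. The component name, image, and shell commands are illustrative, not taken from the tutorials.]

```python
# Sketch only: a component defined from a command-line program.
# The system resolves the inputPath/outputPath placeholders to local file paths
# and moves the data between steps, so no volume is needed.
from kfp import components

get_first_line_op = components.load_component_from_text('''
name: Get first line
inputs:
- {name: input_file}
outputs:
- {name: output_file}
implementation:
  container:
    image: alpine
    command:
    - sh
    - -ec
    - |
      mkdir -p "$(dirname "$1")"
      head -1 "$0" > "$1"
    - {inputPath: input_file}
    - {outputPath: output_file}
''')
```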
Hello @vlagache! Thank you for the detailed description in the issue and for using MiniKF. As you mentioned, the problem is indeed that pods cannot mount PVCs living in different namespaces. In the latest MiniKF (and KF) release, there is no support for multi-user pipelines, i.e., pipelines running in the namespace of the user (e.g., kubeflow-user). However, the question remains: what can you do now to solve this?
Here are the steps you can follow to take a snapshot and get the RokURL of your data volume:
Finally - and unrelated to your issue - in the latest MiniKF you can use the Volumes manager to populate PVCs in the user's namespace (instead of creating a Notebook Server, as you did).
Thank you for your answers @Ark-kun and @elikatsis, I will take the time to read and understand everything and get back to you if I have any more questions; thanks again for your quick answers. Indeed, I had not understood that there were both KF and KFP. Just to explain a little bit about the path I've taken while learning Kubeflow...
And regarding your comment about volumes, I was thinking of doing this to import data into my pipeline. To process data in a component, doesn't that data need to be in a volume mounted on that component? So I'm going to continue with the new information you gave me and I'll get back to you if I have new problems, thanks again.
Mounting a volume is not necessary. The preferred model for Argo and KFP is that the system moves the data for you, so you do not have to care about the storage. This has many benefits - portability, simpler component code, data immutability guarantees, caching, etc. Although you can use the data from volumes, doing this opts you out of some system guarantees and can interfere with the operation of some services. For example, caching either does not activate or activates when you do not want it. The data system does not know anything about the data you have in the volumes. Volumes also act as global variables in some sense, so they bring all the problems that global variables entail. The system-passed data is strictly scoped, in contrast.

Creating components that process data: I've linked the two tutorials (python and command-line) that should be enough to cover the topic. You use InputPath and OutputPath annotations in the python case and inputPath/outputPath placeholders in the command-line case. Your component only needs to work with the locally available files that the system gives you or takes from you.

Importing the data: You just need a component with an output that downloads or extracts the data, saves it as a file and lets the system store it in the artifact repository. Two examples:

You can also check this sample pipeline that imports data and processes it: https://github.com/kubeflow/pipelines/blob/2d26a4c/components/XGBoost/_samples/sample_pipeline.py
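[Editor's note: a minimal sketch of the InputPath/OutputPath pattern described above, assuming the KFP v1 SDK. The URL, component names, and file handling are purely illustrative.]

```python
# Hedged sketch of system-managed data passing with lightweight Python components.
from kfp.components import func_to_container_op, InputPath, OutputPath

def download_csv(url: str, csv_path: OutputPath('CSV')):
    """Importer: download a CSV and let the system store it as an artifact."""
    import urllib.request
    urllib.request.urlretrieve(url, csv_path)

def print_first_line(csv_path: InputPath('CSV')):
    """Processor: read the file that the system has made available locally."""
    with open(csv_path) as f:
        print(f.readline())

# Turn the functions into reusable pipeline components.
download_csv_op = func_to_container_op(download_csv)
print_first_line_op = func_to_container_op(print_first_line)
```

The system copies whatever `download_csv` writes to `csv_path` into the artifact store and hands `print_first_line` a local copy of it, so no volume or PVC is involved.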
Ok, thanks for your answer @Ark-kun, I'll read all about it. I already started by installing KFP Standalone on a Kubernetes cluster that runs with Docker Desktop, as you recommended (with this Readme: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize). It's actually much easier, and I now see the difference between KFP and KF 😄 I'm going to go ahead and try out all the code you provided me. Is it important for you that I close the issue, or can I leave it open if I have more questions?
Great to hear that =) There is also documentation here https://www.kubeflow.org/docs/pipelines/ although some parts may be a bit outdated - I'm working on updating it.
Not very important. You can leave it open for some time.
I deleted my last question about kfp_endpoint. I was asking what the value of kfp_endpoint should be for local use, i.e. what to pass as the host.
It seems that it works when I remove "host" and execute my python file from the console. I get an error in the console, but the run executes fine, so I will finally be able to move forward, thank you very much :)
The error is in French, but says "The process can't access the file because this file is used by another process:".
I'm really glad to hear that!
When KFP is deployed to GKE, it's the main URL of the UX. If you use port-forwarding, it's the IP address and port. There are some other options...
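[Editor's note: a hedged sketch of the port-forwarding case, not code from the thread; the port-forward command and port number are illustrative.]

```python
# Sketch: pointing the KFP SDK at a port-forwarded standalone deployment.
# Assumes: kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
import kfp

client = kfp.Client(host='http://localhost:8080')

# From a notebook server running inside the cluster, omitting `host` often
# works too, which matches what was observed above.
# client = kfp.Client()

print(client.list_experiments())
```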
Maybe you have something like an antivirus which locks the file. I'll try to create a fix...
@Ark-kun, what is the best practice for mounting a directory with a dataset for a computer vision training task? Download -> Preprocessing -> Training: how do I make the data flow between the components?
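[Editor's note: not a reply from the maintainers - just a hedged sketch of that Download -> Preprocess -> Train flow in the data-passing style recommended earlier in this thread, assuming the KFP v1 SDK; the component bodies are placeholders.]

```python
# Sketch only: Download -> Preprocess -> Train with system-managed data passing.
# The real download/preprocessing/training logic is elided; only the wiring matters.
from kfp import dsl
from kfp.components import func_to_container_op, InputPath, OutputPath

@func_to_container_op
def download_dataset(url: str, images_path: OutputPath('Directory')):
    import pathlib
    pathlib.Path(images_path).mkdir(parents=True, exist_ok=True)
    # ... fetch and unpack the dataset into images_path ...

@func_to_container_op
def preprocess_images(images_path: InputPath('Directory'),
                      processed_path: OutputPath('Directory')):
    import shutil
    shutil.copytree(images_path, processed_path)  # placeholder for real preprocessing

@func_to_container_op
def train_model(processed_path: InputPath('Directory')):
    print('training on data in', processed_path)  # placeholder for real training

@dsl.pipeline(name='cv-training-sketch')
def cv_pipeline(dataset_url: str):
    download_task = download_dataset(url=dataset_url)
    preprocess_task = preprocess_images(download_task.output)
    train_model(preprocess_task.output)
```

Each step's output directory is captured by the system and handed to the next step as a local path, so no shared volume or PVC is required.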
What steps did you take:
Hello everyone,
I'm starting with Kubeflow and I'm trying to build a simple pipeline. Right now I'm just trying to use a .csv file that's in a volume. I tried to reproduce what I saw here: #477, but I use MiniKF, where I created a notebook server with a data volume. My csv file is in this data volume.
According to this: #783, I've been looking for information about my volume (copass-vol).
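[Editor's note: a simplified sketch of the #477 pattern being reproduced, added for clarity; it is not the reporter's exact code, and the mount path and csv file name are illustrative.]

```python
# Sketch: mounting the existing notebook data volume (a PVC) into a pipeline step.
from kfp import dsl

@dsl.pipeline(name='read-csv-from-volume')
def read_csv_pipeline():
    # Reference the PVC created by the Notebook Server.
    vol = dsl.PipelineVolume(pvc='copass-vol')
    dsl.ContainerOp(
        name='print-first-line',
        image='alpine',
        command=['sh', '-c', 'head -n 1 /mnt/data/my_file.csv'],
        pvolumes={'/mnt/data': vol},
    )
```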
What happened:
I still get the same error message when I run the pipeline (I tried several values for pvc_name: copass-vol, copass-vol-t72cyk1yp, kubeflow-user/copass-vol-t72cyk1yp):
This step is in Pending state with this message: Unschedulable: persistentvolumeclaim "copass-vol" not found
What did you expect to happen:
See the first line of my csv in the run logs of my pipeline 😄. There's probably something I don't understand about the volumes.
Environment:
MiniKF, https://www.kubeflow.org/docs/started/workstation/getting-started-minikf/
KFP version: Build commit: ca58b22
KFP SDK version:
When I execute the following command in the terminal of my Notebook Server:
pip list | grep kfp
the answer is empty, yet I manage to import kfp into my python code. It also says that
Yet python --version returns 3.6.9:
jovyan@copass-0:~$ python --version
Python 3.6.9
Thank you for your answers. Have a nice evening.