This repo demonstrates how to set up CONDA environments for popular dataframe libraries and process large tabular data files.
It compares parallel and out-of-core reading and processing of large datasets on CPU and GPU; out-of-core means handling data that are too large to fit into the computer's memory.
Dataframe Library | Parallel | Out-of-core | CPU/GPU | Evaluation |
---|---|---|---|---|
Pandas | no | no [1] | CPU | eager |
Dask | yes | yes | CPU | lazy |
Spark | yes | yes | CPU | lazy |
cuDF | yes | no | GPU | eager |
Dask-cuDF | yes | yes | GPU | lazy |
[1] Pandas can read data in chunks, but each chunk has to be processed independently.
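To make the distinction concrete, here is a minimal sketch contrasting Pandas' eager, chunked reading (footnote [1]) with Dask's lazy, parallel evaluation. The file name gene_info.tsv and the column name type_of_gene are assumptions for illustration; the notebooks in this repo define the actual inputs and queries.

```python
import pandas as pd
import dask.dataframe as dd

# Eager Pandas: each chunk is processed independently, and partial
# results must be combined by hand (see footnote [1]).
counts = None
for chunk in pd.read_csv("gene_info.tsv", sep="\t", chunksize=1_000_000):
    partial = chunk["type_of_gene"].value_counts()
    counts = partial if counts is None else counts.add(partial, fill_value=0)

# Lazy Dask: read_csv only builds a task graph over partitions; the
# parallel, out-of-core execution happens when .compute() is called.
ddf = dd.read_csv("gene_info.tsv", sep="\t", blocksize="64MB", dtype=str)
dask_counts = ddf["type_of_gene"].value_counts().compute()
```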
Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3, and Mamba
- Install Miniconda3
- Install Mamba:
conda install mamba -n base -c conda-forge
- Clone this git repository
git clone https://github.com/sbl-sdsc/df-parallel.git
- Create CONDA environment
mamba env create -f df-parallel/environment.yml
- Activate the CONDA environment
conda activate df-parallel
- Launch Jupyter Lab
jupyter lab
- Deactivate the CONDA environment
conda deactivate
To remove the CONDA environment, run
conda env remove -n df-parallel
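Once the environment is active and Jupyter Lab is running, a quick sanity check is to import the CPU dataframe libraries and print their versions. This sketch assumes environment.yml installs pandas, dask, and pyspark, as used by the notebooks:

```python
# Verify that the CPU dataframe libraries from the df-parallel
# environment are importable.
import pandas
import dask
import pyspark

print("pandas :", pandas.__version__)
print("dask   :", dask.__version__)
print("pyspark:", pyspark.__version__)
```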
To launch Jupyter Lab on Expanse, use the galyleo script. Specify your ACCESS account number with the --account option. If you do not have an ACCESS account and an allocation on Expanse, you can apply through NSF's ACCESS program, or contact [email protected] for a trial allocation.
1. Clone this git repository
git clone https://github.com/sbl-sdsc/df-parallel.git
2a. Run on CPU (Pandas, Dask, and Spark dataframes):
galyleo launch --account <account_number> --partition shared --cpus 10 --memory 20 --time-limit 00:30:00 --conda-env df-parallel --conda-yml "${HOME}/df-parallel/environment.yml" --mamba
2b. Run on GPU (required for cuDF and Dask-cuDF dataframes):
galyleo launch --account <account_number> --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 00:30:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba
After Jupyter Lab has been launched, run the notebook DownloadData.ipynb to create a dataset. In this notebook, specify the number of copies (ncopies) to be made from the original dataset to increase its size. By default, a single copy is created. After the dataset has been created, run the dataframe-specific notebooks. Note that the cuDF and Dask-cuDF dataframe libraries require a GPU.
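The enlargement step lives in DownloadData.ipynb; purely as an illustration of the idea, the sketch below appends ncopies copies of an original TSV file to build a larger one. The file names and header handling are assumptions, not the notebook's actual code.

```python
import shutil

ncopies = 4  # e.g., roughly 4 x the original ~5.4 GB file

# Concatenate ncopies copies of the original file into a larger file,
# keeping only the first copy's header line.
with open("gene_info_large.tsv", "wb") as out:
    for i in range(ncopies):
        with open("gene_info.tsv", "rb") as src:
            if i > 0:
                src.readline()  # skip the header of subsequent copies
            shutil.copyfileobj(src, out)
```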
Results for running on an SDSC Expanse GPU node with 10 CPU cores (Intel Xeon Gold 6248, 2.5 GHz), 1 GPU (NVIDIA V100 SXM2, 32 GB), 92 GB of memory (DDR4 DRAM), and local storage (1.6 TB Samsung PM1745b NVMe PCIe SSD).
Data file sizes (gene_info.tsv as of June 2022):
- Dataset 1: 5.4 GB (18 GB in Pandas)
- Dataset 2: 21.4 GB (4 x Dataset 1; 62.4 GB in Pandas)
- Dataset 3: 43.7 GB (8 x Dataset 1)
Dataframe Library | time(5.4 GB) (s) | time(21.4 GB) (s) | time(43.7 GB) (s) | Parallel | Out-of-core | CPU/GPU |
---|---|---|---|---|---|---|
Pandas | 56.3 | 222.4 | -- [2] | no | no | CPU |
Dask | 15.7 | 42.1 | 121.8 | yes | yes | CPU |
Spark | 14.2 | 31.2 | 56.5 | yes | yes | CPU |
cuDF | 3.2 | -- [2] | -- [2] | yes | no | GPU |
Dask-cuDF | 7.3 | 11.9 | 19.0 | yes | yes | GPU |
[2] Out of memory.
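Footnote [2] reflects that cuDF is eager and keeps the whole dataset in GPU memory, while Dask-cuDF partitions the data and streams it through the GPU. A minimal sketch of that difference, again assuming a gene_info.tsv file with a type_of_gene column:

```python
import cudf
import dask_cudf

# cuDF reads the entire file into GPU memory, so datasets larger than
# the 32 GB V100 (Datasets 2 and 3 above) fail with out-of-memory.
gdf = cudf.read_csv("gene_info.tsv", sep="\t")
print(gdf["type_of_gene"].value_counts())

# Dask-cuDF splits the file into partitions and processes them lazily,
# so data larger than GPU memory can still be handled.
dgdf = dask_cudf.read_csv("gene_info.tsv", sep="\t")
print(dgdf["type_of_gene"].value_counts().compute())
```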