Update CmdStan read and logic #1565

Merged 22 commits into main from bugfixes/cmdstan on Feb 18, 2021

Conversation

@ahartikainen (Contributor) commented Feb 15, 2021

Description

Read CmdStan csv files manually. This enables us to parse large models (100k parameters) much faster than pandas.

This PR also adds a dtypes argument, which the user can use to set the dtype of specific parameters:

dtypes = {"theta": int}

Checklist

  • Follows official PR format
  • Includes a sample plot to visually illustrate the changes (only for plot-related functions)
  • New features are properly documented (with an example if appropriate)
  • Includes new or updated tests to cover the new feature
  • Code style correct (follows pylint and black guidelines)
  • Changes are listed in changelog

@OriolAbril (Member) commented

I thought the pandas reader was written in C to be fast. Do you have some advice on which cases make a custom reader worth it?

@ahartikainen (Author) commented

I will give you an example later today.

@ahartikainen (Author) commented Feb 16, 2021

import shutil
import tempfile
import time
from pathlib import Path
from uuid import uuid4

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def read_output_file_manual(path):
    comments = []
    data = []
    columns = None
    with open(path, "rb") as f_obj:
        # read header
        for line in f_obj:
            if line.startswith(b"#"):
                comments.append(line.decode("utf-8").strip())
                continue
            columns = {key: idx for idx, key in enumerate(line.strip().decode("utf-8").split(","))}
            break

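        # Read the draws; CmdStan also writes "#" comment blocks in the body
        # (e.g. adaptation and timing info), so keep collecting those too.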
        for line in f_obj:
            line = line.strip()
            if line.startswith(b"#"):
                comments.append(line.decode("utf-8"))
                continue
            if line:
                data.append(line.split(b","))

        data = np.array(data, dtype=np.float64)

    return columns, data, comments

def read_output_file_pandas(path):
    comments = []
    data = []
    columns = None
    with open(path, "rb") as f_obj:
        # read header
        for line in f_obj:
            if line.startswith(b"#"):
                comments.append(line.decode("utf-8").strip())
                continue
            columns = {key: idx for idx, key in enumerate(line.strip().decode("utf-8").split(","))}
            break

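        # Record where the data starts, scan the rest once to collect "#"
        # comment lines, then rewind so pandas only parses the numeric rows.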
        f_obj_loc = f_obj.tell()
        for line in f_obj:
            if line.startswith(b"#"):
                comments.append(line.strip().decode("utf-8"))
                continue
        f_obj.seek(f_obj_loc)
        data = pd.read_csv(f_obj, header=None, comment="#", float_precision="high", dtype=np.float64).values

    return columns, data, comments

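Both helpers return the same (columns, data, comments) triple, so they can be swapped freely. A quick sanity check, assuming "output.csv" is a hypothetical CmdStan output file:

columns_m, data_m, comments_m = read_output_file_manual("output.csv")
columns_p, data_p, comments_p = read_output_file_pandas("output.csv")
assert columns_m == columns_p
assert np.allclose(data_m, data_p)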

%%time
np.random.seed(10)
reference_files = {}
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir)
    for parameter_size in [10, 100, 1000, 10_000, 20_000, 50_000]:
        for draw_size in [100, 500, 1000, 10_000]:
            data = np.random.randn(draw_size, parameter_size)
            columns = [str(uuid4()) for _ in range(parameter_size)]
            df = pd.DataFrame(data=data, columns=columns)
            output_path = Path(tmpdir) / f"parameters_{parameter_size}_draws_{draw_size}.csv"
            output_path2 = Path(tmpdir) / f"parameters_{parameter_size}_draws_{draw_size}_2.csv"
            df.to_csv(str(output_path))
            shutil.copy(output_path, output_path2)
            reference_files[(parameter_size, draw_size)] = (output_path, output_path2)
    
    results = {"pandas": {}, "manual": {}}
    for key, (path, path2) in reference_files.items():
        res_manual = []
        res_pandas = []
        for n in range(1):
            st = time.time()
            val = read_output_file_manual(path)
            et = time.time()
            res_manual.append(et - st)
            
            st = time.time()
            val = read_output_file_pandas(path2)
            et = time.time()
            res_pandas.append(et - st)
            
        results["manual"][key] = res_manual
        results["pandas"][key] = res_pandas

res_df = pd.DataFrame(results).applymap(lambda x: np.mean(x)).reset_index().rename(columns={"level_0": "parameters", "level_1": "draws"})

fig, ax = plt.subplots(1, dpi=100, figsize=(7,7))
for i, (group, gdf) in enumerate(res_df.groupby(by="draws")):
    plt.plot(gdf["parameters"], gdf["pandas"], marker='.', label=f"pandas par:{group}", c=f"C{i%8}", lw=1)
    plt.plot(gdf["parameters"], gdf["manual"], marker='.', label=f"manual par:{group}", c=f"C{i%8}", ls="--", lw=1)
    
plt.xlabel("Parameters")
plt.ylabel("Timing (s)")
plt.yscale("log")
plt.xscale("log")
plt.legend()
plt.grid()
for key, spine in plt.gca().spines.items():
    if key in ("top", "right"):
        spine.set_visible(False)
plt.savefig("./csv_read_comparison.png", dpi=200, bbox_inches="tight")

The full benchmark run took approximately 40 minutes.

[Figure: csv_read_comparison.png, log-log plot of read time vs. number of parameters for the pandas and manual readers]

@codecov bot commented Feb 17, 2021

Codecov Report

Merging #1565 (3df02e7) into main (1e3356e) will decrease coverage by 0.03%.
The diff coverage is 83.45%.

@@            Coverage Diff             @@
##             main    #1565      +/-   ##
==========================================
- Coverage   90.28%   90.25%   -0.04%     
==========================================
  Files         105      105              
  Lines       11405    11419      +14     
==========================================
+ Hits        10297    10306       +9     
- Misses       1108     1113       +5     
Impacted Files Coverage Δ
arviz/data/io_cmdstan.py 91.03% <83.45%> (-0.87%) ⬇️


@ahartikainen (Author) commented Feb 17, 2021

Results (in seconds)

parameters   draws   pandas        manual        numpy
      1000   10000      2.103543      4.698918     11.429663
     50000   10000    420.375299    227.831740   1131.707759
    100000   10000   1750.873820   1399.882980   5438.172931

Difference against pandas (x - pandas) (in seconds)

parameters   draws   manual        numpy
      1000   10000      2.595374      9.326120
     50000   10000   -192.543559    711.332459
    100000   10000   -350.990840   3687.299111

[Figure: csv_read_comparison, timing comparison plot]

I don't know, maybe go with the manual handling? All the CSV files come from CmdStan, so they should be "good" (we don't need to consider ill-formed ones).
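
For context, a well-formed CmdStan CSV looks roughly like this: "#"-prefixed comment lines, one header row of parameter names, then comma-separated numeric draws. The sampler columns are real CmdStan output columns; the values here are illustrative.

# stan_version_major = 2
# model = bernoulli_model
lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,divergent__,energy__,theta
-7.27,0.97,0.88,1,3,0,7.35,0.25
-6.85,1.00,0.88,2,3,0,6.95,0.21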

@ahartikainen (Author) commented

@OriolAbril are you happy with the changes?

The code should now be a bit clearer than before.

@OriolAbril (Member) commented

Looks good, thanks!

@ahartikainen ahartikainen merged commit 3d788cc into main Feb 18, 2021
@ahartikainen ahartikainen deleted the bugfixes/cmdstan branch February 18, 2021 12:38
utkarsh-maheshwari pushed a commit to utkarsh-maheshwari/arviz that referenced this pull request May 27, 2021
* rewrite cmdstan logic

* clean sample_stats

* fix

* Handle empty lines

* use numbers

* fix typo

* temporarily downgrade pylint

* downgrade astroid

* fix

* fix errors

* change dict kw order

* fix handling

* update test

* combine pandas and manual

* remove requirement restrictions

* use numpy gentext for fileloading

* remove pandas import

* fix typo

* change test

* update csv reader

* clean file

* add info to changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants