Update age and weight charts for adults, refs #24

mitre · Apr 29, 2021 · bb90bad
1 parent f5ceef7
commit bb90bad

Showing 4 changed files with 301 additions and 370 deletions.
diff --git a/GrowthViz-adults.ipynb b/GrowthViz-adults.ipynb
diff --git a/GrowthViz-adults.py b/GrowthViz-adults.py
@@ -25,7 +25,7 @@
 # 
 # Jupyter Notebooks have documentation cells, such as this one, and code cells like the one below. The notebook server runs the code and provides results (if applicable) back in the notebook. The following code cell loads the libraries necessary for the tool to work. If you would like to incorporate other Python libraries to assist in data exploration, they can be added here. Removing libraries from this cell will very likely break the tool.
 
-# In[1]:
+# In[35]:
 
 
 import pandas as pd
@@ -41,29 +41,29 @@
 
 # The next two code cells tell the notebook server to automatically reload the externally defined Python functions created to assist in data analysis.
 
-# In[2]:
+# In[36]:
 
 
 get_ipython().run_line_magic('load_ext', 'autoreload')
 
 
-# In[3]:
+# In[37]:
 
 
 get_ipython().run_line_magic('autoreload', '2')
 
 
 # This code cell instructs the notebook to display plots automatically inline
 
-# In[4]:
+# In[38]:
 
 
 get_ipython().run_line_magic('matplotlib', 'inline')
 
 
 # This code cell tells the notebook to output plots for high DPI displays, such as 4K monitors, many smartphones or a retina display on Apple hardware. This cell does not need to be run and can be safely removed. If removed, charts will look more "blocky" or "pixelated" on high DPI displays.
 
-# In[5]:
+# In[39]:
 
 
 get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
@@ -73,7 +73,7 @@
 # 
 # The following cell imports functions created for the tool to assist in data analysis. Some of the functions generate charts used in this tool. The chart code may be modified to change the appearance of plots without too much risk of breaking things. Other functions transform DataFrames and changing those will very likely cause things to break. If you are unable to tell the difference in the functions by looking at the code, it is probably best to leave them unmodified.
 
-# In[6]:
+# In[40]:
 
 
 import processdata
@@ -100,23 +100,23 @@
 # 
 # This information will be loaded into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) called `cleaned_obs`
 
-# In[7]:
+# In[41]:
 
 
 cleaned_obs = pd.read_csv("growthviz-data/sample-adults-cleaned.csv")
 
 
 # The following cell shows what the first five rows look like in the CSV file
 
-# In[8]:
+# In[42]:
 
 
 cleaned_obs.head()
 
 
 # This next cell runs through a series of data checks on the original data file, such as making sure all values of `sex` are either 0 or 1, or no age values are negative.
 
-# In[9]:
+# In[43]:
 
 
 warnings = check_data.check_patient_data("growthviz-data/sample-adults-cleaned.csv", "adults")
@@ -129,43 +129,43 @@
 
 # Next, the `processdata.setup_individual_obs_df` function performs transformations on the `cleaned_obs` DataFrame. This will create an `age` column, which is a decimal column that represents the patient's age in years at the time of the observation. It changes the `clean_value` column into a [pandas categorical type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). It also creates an `include` column which contains a boolean value indicating whether growthcleanr states to include (true) or exclude (false) the observation. The resulting DataFrame is assigned to `obs_full`.
 
-# In[10]:
+# In[44]:
 
 
 obs_full = processdata.setup_individual_obs_df(cleaned_obs, 'adults')
 
 
-# In[11]:
+# In[45]:
 
 
 obs_full.head()
 
 
-# In the following cell, the `processdata.keep_age_range` function visually displays the range of ages in the dataset, with those to be excluded identified by the red bars. It then removes patients outside the intended target population of this notebook (adults 20 to 65).
+# In the following cell, the `processdata.keep_age_range` function visually displays the range of ages in the dataset, with those to be excluded identified by the red bars with the **/** pattern, and those that are outside the optimal range of the notebook identified by the orange bars with the **x** pattern. As noted above, if the population in the dataset is primarily pediatrics, you will want to switch to the pediatrics notebooks. The function then **removes** patients in the excluded categories (below 18 and above 80).
 
-# In[12]:
+# In[47]:
 
 
 obs = processdata.keep_age_range(obs_full, 'adults')
 
 
-# In[13]:
+# After that, `charts.weight_distr` creates two visualizations. The first shows a distribution of all of the included weights in the dataset. The second shows weights above a certain threshold to see whether there are spikes at certain *Included* weights that might indicate that a commonly used scale maxes out at a certain value. This chart is restricted to values of 135kg or higher (rounded to the nearest kg) to make patterns in higher weights easier to identify. This potential issue is important to keep in mind when conducting an analysis.
 
+# In[48]:
 
-obs.head()
 
+charts.weight_distr(obs, 'all')
 
-# After that, `charts.weight_distr` creates a visualization to see whether there are spikes at a certain *Included* weights that might indicate that a commonly used scale maxes out at a certain value. The chart is restricted to values of 120kg or higher (rounded to the nearest KG) to make patterns in higher weights easier to identify. This potential issue is important to keep in mind when conducting an analysis.
 
-# In[14]:
+# In[49]:
 
 
-charts.weight_distr(obs)
+charts.weight_distr(obs, 'high')
 
 
 # The following cell loads in the [CDC Anthropometric Reference Data for Adults](https://www.cdc.gov/nchs/data/series/sr_03/sr03-046-508.pdf). Rows, which represent decades (e.g., 20-29), are expanded so that there is one record per year. Standard deviation is calculated from the count of examined persons and the standard error. `Sex` is then transformed so that the values align with the values used in growthcleanr, 0 (male) or 1 (female). Finally, percentiles are smoothed across decade changes (e.g., any change happens gradually from 29 to 31). This data is used to plot percentile bands in visualizations in the tool. 
 
-# In[15]:
+# In[50]:
 
 
 # adult percentiles
@@ -185,7 +185,7 @@
 
 # In this cell, the percentiles data are reshaped to provide mean and standard deviation values for each parameter that will later be used for z-score calculations.
 
-# In[16]:
+# In[51]:
 
 
 percentiles_long = sumstats.setup_percentile_zscore_adults(percentiles_clean)
@@ -207,7 +207,7 @@
 # 
 # The result is stored in `merged_df`.
 
-# In[17]:
+# In[52]:
 
 
 merged_df = processdata.setup_merged_df(obs, 'adults')
@@ -216,7 +216,7 @@
 
 # In the following cell, `processdata.setup_bmi_adults` calculates BMI for each weight and height pairing to be used in later individual trajectory visualizations.
 
-# In[18]:
+# In[53]:
 
 
 # create BMI data to add below for individual trajectories
@@ -227,7 +227,7 @@
 # 
 # The following shows the counts of the values for inclusion/exclusion along with the percentages of 
 
-# In[19]:
+# In[54]:
 
 
 processdata.exclusion_information(obs)
@@ -237,7 +237,7 @@
 # 
 # This next cell creates an interactive tool that can be used to explore patients. The `sumstats.add_mzscored_to_merged_df` function will add modified Z Scores for height, weight and BMI to `merged_df`. The tool uses [Qgrid](https://github.com/quantopian/qgrid) to create the interactive table. Clicking on a row will create a plot for the individual below the table.
 
-# In[20]:
+# In[56]:
 
 
 mdf = sumstats.add_mzscored_to_merged_df_adults(merged_df, percentiles_long) 
@@ -286,7 +286,7 @@ def handle_selection_change(_event, _widget):
 # 
 # In this chart, the blue line represents all measurements for an individual. Any values marked for exclusion are represented with a red x. The yellow dashed line represents the trajectory with exclusions removed. Any carried forward values are represented by a blue triangle, unless `include_carry_forward` is set to False, when they will also be represented as a red x.
 
-# In[21]:
+# In[57]:
 
 
 all_ids = cleaned_obs['subjid'].unique()
@@ -299,13 +299,13 @@ def handle_selection_change(_event, _widget):
             wt_df=fixed(wt_percentiles), bmi_df=fixed(bmi_percentiles), ht_df=fixed(ht_percentiles))
 
 
-# In[22]:
+# In[58]:
 
 
 obs_wbmi[obs_wbmi['subjid'] == 'd88d3987-93ff-0820-286f-754cd971012d'] # b5a84a9d-dd7c-95cb-5fd9-3c581a72c812, 867a461b-7cb8-76aa-9891-42369a9899e8 is an example with the underweight line
 
 
-# In[23]:
+# In[59]:
 
 
 # display all charts at the same time
@@ -327,7 +327,7 @@ def handle_selection_change(_event, _widget):
 # 
 # Next, the tool creates a series that contains the unique set of `subjid`s that have more than one record per category (as determined by `charts.mult_obs`) and stores that in `uniq_ids`.
 
-# In[24]:
+# In[60]:
 
 
 obs_wbmi_mult = charts.mult_obs(obs_wbmi)
@@ -336,21 +336,21 @@ def handle_selection_change(_event, _widget):
 
 # From the series of unique ids, the following cell randomly selects 25 individuals and assigns them to `sample`.
 
-# In[25]:
+# In[61]:
 
 
 sample = np.random.choice(uniq_ids, size=25, replace=False)
 
 
-# In[26]:
+# In[62]:
 
 
 sample
 
 
 # The `sample` can be passed into the `charts.five_by_five_view` function which will create a [small multiple](https://en.wikipedia.org/wiki/Small_multiple) plot for each of the individuals. Exclusions, including carry forwards, will be represented by a red x.
 
-# In[27]:
+# In[63]:
 
 
 charts.five_by_five_view(obs_wbmi, sample, 'HEIGHTCM', wt_percentiles, ht_percentiles, bmi_percentiles, 'dotted')
@@ -362,7 +362,7 @@ def handle_selection_change(_event, _widget):
 # 
 # The cell below selects all observations with a weight exclusion of "Exclude-EWMA-Extreme". It then sorts by weight in descending order. The code then takes the top 50 values and selects 25 random, unique `subjids` from that set. Finally it plots the results.
 
-# In[28]:
+# In[64]:
 
 
 # TO DO WHEN WE HAVE MORE EXCLUSION CATEGORIES
@@ -375,7 +375,7 @@ def handle_selection_change(_event, _widget):
 # 
 # The following cell uses the same function as above to create a 5 x 5 set of small multiple charts, but selects the top/bottom 25 individuals by growthcleanr category. The results can be sorted by maximum parameter, minimum parameter, starting age, or size of age range.
 
-# In[29]:
+# In[65]:
 
 
 def edge25(cleaned_obs, category, group, sort_order, param):
@@ -397,7 +397,7 @@ def edge25(cleaned_obs, category, group, sort_order, param):
 # 
 # The `charts.param_with_percentiles` function displays a chart showing BMI, height, or weight for an individual over time. Black bands representing the 5th and 95th percentiles for age and sex are shown with the individual's BMI, height, or weight shown in blue. The plot on the left represents all values. The plot on the right is only included values.
 
-# In[30]:
+# In[66]:
 
 
 all_ids = obs_wbmi['subjid'].unique()
@@ -415,7 +415,7 @@ def edge25(cleaned_obs, category, group, sort_order, param):
 # The buttons can be used to add or remove columns from the table.
 # The checkbox includes "missing" values (note: this will impact the raw columns as missing values may cause BMI values of infinity since they divide by 0 when missing). Missing values are not included by default.
 
-# In[31]:
+# In[67]:
 
 
 min_toggle = widgets.ToggleButton(value=True, description='Minimum BMI', 
@@ -447,7 +447,7 @@ def edge25(cleaned_obs, category, group, sort_order, param):
 # 
 # The following code allows you to export a DataFrame as a CSV file. When the cell below is run, the drop down will contain all DataFrames stored in variables in this notebook. Select the desired DataFrame and click Generate CSV. This will create the CSV file and provide a link to download it.
 
-# In[32]:
+# In[68]:
 
 
 df_selector = widgets.Dropdown(options=processdata.data_frame_names(locals()), description='Data Frames')
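
Reviewer note on the `processdata.keep_age_range` description changed above: the three age categories it mentions can be sketched as below. This is only an illustration, not the code in `processdata.py` (which is not part of this diff). The exclusion cutoffs (below 18, above 80) come from the new notebook text; the 20 to 65 "optimal" range is an assumption taken from the wording this commit replaces, and `label_age` and `keep_age_range_sketch` are hypothetical names.

```python
# Cutoffs: 18 and 80 are stated in the notebook text of this commit.
# The 20-65 optimal range is an ASSUMPTION based on the replaced wording.
EXCLUDE_BELOW = 18
EXCLUDE_ABOVE = 80
OPTIMAL_LOW = 20
OPTIMAL_HIGH = 65


def label_age(age_years):
    """Hypothetical helper: classify one age (in years) into a category."""
    if age_years < EXCLUDE_BELOW or age_years > EXCLUDE_ABOVE:
        return "excluded"  # red bars with the / pattern
    if age_years < OPTIMAL_LOW or age_years > OPTIMAL_HIGH:
        return "outside optimal"  # orange bars with the x pattern
    return "optimal"


def keep_age_range_sketch(records):
    """Label every record, then drop the ones in the excluded category."""
    labelled = [dict(rec, age_category=label_age(rec["age"])) for rec in records]
    return [rec for rec in labelled if rec["age_category"] != "excluded"]


demo = [
    {"subjid": "a", "age": 12.5},
    {"subjid": "b", "age": 19.0},
    {"subjid": "c", "age": 40.0},
    {"subjid": "d", "age": 85.0},
]
kept = keep_age_range_sketch(demo)
for rec in kept:
    print(rec["subjid"], rec["age"], rec["age_category"])
```

Keeping the label on each retained record makes it possible to report how many observations fall outside the assumed optimal range without removing them.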

diff --git a/charts.py b/charts.py
@@ -2,15 +2,17 @@
 import math
 import matplotlib.pyplot as plt
 
-# should we add this to pediatrics?
-def weight_distr(df):
-    wgt_grp = df[
-        (df["param"] == "WEIGHTKG")
-        & (df["measurement"] >= 120)
-        & (df["include"] == True)
-    ]
+
+def weight_distr(df, mode):
+    wgt_grp = df[(df["param"] == "WEIGHTKG") & (df["include"] == True)]
+    if mode == "high":
+        wgt_grp = wgt_grp[wgt_grp["measurement"] >= 135]
+        plt.title("Weights At or Above 135kg")
+    else:
+        wgt_grp = wgt_grp
+        plt.title("All Weights")
     if len(wgt_grp.index) == 0:
-        print("No included observations with weight (kg) >= 120")
+        print("No included observations with weight (kg) >= 135")
     else:
         round_col = wgt_grp.apply(
             lambda row: np.around(row.measurement, decimals=0), axis=1
@@ -584,4 +586,3 @@ def cutoff_view(merged_df, subjid, cutoff, wt_df):
     # selected_param_plot.plot(percentile_window.age, percentile_window.P5, color='k')
     # selected_param_plot.plot(percentile_window.age, percentile_window.P95, color='k')
     return selected_param_plot
-
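
Reviewer note on the new `weight_distr(df, mode)` in `charts.py`: as written in this diff, the `else` branch reassigns `wgt_grp` to itself, and the empty-result message mentions `>= 135` even when `mode` is not `"high"`. The sketch below shows one possible way to keep the filter, title, and message together per mode. It assumes the same `param`, `measurement`, and `include` columns used in the diff; `select_weights` and `rounded_counts` are hypothetical helpers, not functions from the repository, and the plotting step is left out.

```python
import pandas as pd

HIGH_WEIGHT_KG = 135  # threshold used by the "high" mode in this commit


def select_weights(df, mode):
    """Hypothetical helper: return (rows, title, empty_message) for a mode."""
    included = df[(df["param"] == "WEIGHTKG") & df["include"]]
    if mode == "high":
        rows = included[included["measurement"] >= HIGH_WEIGHT_KG]
        title = f"Weights At or Above {HIGH_WEIGHT_KG}kg"
        empty_message = (
            f"No included observations with weight (kg) >= {HIGH_WEIGHT_KG}"
        )
    else:
        rows = included
        title = "All Weights"
        empty_message = "No included weight observations"
    return rows, title, empty_message


def rounded_counts(rows):
    """Count observations per weight rounded to the nearest kg (vectorized)."""
    return rows["measurement"].round(0).value_counts().sort_index()


demo = pd.DataFrame(
    {
        "param": ["WEIGHTKG", "WEIGHTKG", "WEIGHTKG", "HEIGHTCM", "WEIGHTKG"],
        "measurement": [136.2, 135.9, 80.0, 170.0, 150.0],
        "include": [True, True, True, True, False],
    }
)
high_rows, high_title, _ = select_weights(demo, "high")
all_rows, _, all_empty_message = select_weights(demo, "all")
high_counts = rounded_counts(high_rows)
print(high_title, high_counts.to_dict())
```

Rounding the whole column at once avoids the row-wise `apply` with `np.around` seen in the existing function, though whether that matters for the dataset sizes used here is untested.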