-
Notifications
You must be signed in to change notification settings - Fork 0
/
Cadavid_Nonis_Roghani_lab1_v1.Rmd
785 lines (531 loc) · 37.8 KB
/
Cadavid_Nonis_Roghani_lab1_v1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
---
title: An Exploratory Analysis of Cancer Incidence and Mortality to Identify High-Risk
Communities and Improve Survival
author: "Ramiro Cadavid, Pri Nonis, Payman Roghani"
date: "September 24, 2018"
output:
html_document:
df_print: paged
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Setup
```{r include=FALSE}
source("install.R")
source("plots.R")
source("outliers.R")
options(repr.plot.width=5, repr.plot.height=5)
```
## Introduction
In this project our efforts are focused on the analysis of data included in the csv file provided, to primarily understand the potential relationship between different parameters and the incidences of cancer across counties in the US. The main objectives are:
>1. To understand factors that predict cancer mortality rate, with the ultimate aim of identifying communities for social interventions.
>2. To determine which interventions are likely to have the most impact.
### Cancer Data
```{r}
Cancer <- read.csv('cancer.csv', header = T, as.is = T, row.names = 1) # load the cancer dataset
```
```{r fig.height=6, fig.width=6, collapse=TRUE, comment=NA, paged.print=TRUE}
#summary(Cancer) #summary statistics
```
---
```{r collapse = TRUE, comment=NA}
#str(Cancer, max.level = 1, strict.width = "wrap")
```
---
```{r collapse = TRUE, comment=NA}
colnames(Cancer)
cat(" \n")
print(paste0('Number of rows: ', nrow(Cancer)))
print(paste0('Number of columns: ', ncol(Cancer)))
```
The **cancer.csv** file contains 29 variables and 3,047 observations. Each observation (i.e. row) includes data for a county across the US. The variables are mostly numbers and integers, except for 2 that are factors (`binnedInc` and `Geography`).
Taking a first look at the data, we wanted to do some spasial data exploration. Cancer affects a significant population and the effect is felt mostly uniformly across the country.
Before we could do this we needed to augment our dataset with additional geocoordinate details. We were able to obtain that data from the website of the Census Beauru at census.gov/geo/maps-data. After some minor data manipulations, we were able to join the two datasets using the **Geography** variable.
```{r}
source("maps.R")
map_setup()
```
This allowed us to visualize and explore the Cancer dataset:
```{r}
map_us_incidenceRate()
```
```{r}
map_us_deathRate()
```
However, we did find some outliers, espcially in **Kentucky** and **West Virgiana** that is seeing a dispropotional effect, compared to other states such as California.
```{r}
map_wv_ky_deathRate()
```
```{r}
map_ca_deathRate()
```
We also discovered that the Cancer dataset was missing data from 93 counties belonging to the continental United States. These counties appear as blank polygons in our maps.
Below, we explain the variables in detail and provide our assessment of the quality of the data.
>data on smoking and obesity and other cancer risk factors could've been very helpful
#### Variables
* Cancer data:
+ `avgAnnCount`: The average number of new cancer cases per year per county for years 2009-2013
+ `popEst2015`: Estimated population by county 2015
* Economic status:
+ `medIncome`: Median income per county
+ `povertyPercent`: Percent of population below poverty line
+ `binnedInc`: ???
* Population age and gender:
+ `MedianAge`: Median age per county
+ `MedianAgeMale`: Median age among males per county
+ `MedianAgeFemale`: Median age among females per county
* Location:
+ `Geography`: County, State names
* Marital status:
+ `PercentMarried`: Percentage of married population
+ `PctMarriedHouseholds`: Percentage of married households per county
* Education:
+ `PctNoHS18_24`: Percentage of 18-24 year old population with no high school education
+ `PctHS18_24`: Percentage of 18-24 year old population with high school education
+ `PctSomeCol18_24`: Percentage of 18-24 year old population with some college education
+ `PctBachDeg18_24`: Percentage of 18-24 year old population with bachelor's degree
+ `PctHS25_Over`: Percentage of population above 25 years old with high school education
+ `PctBachDeg25_Over`: Percentage of population above 25 years old with bachelor's degree
* Household size:
+ `AvgHouseholdSize`: Average household size per county
* Employment status:
+ `PctEmployed16_Over`: Percentage of population above 16 years old who have jobs
+ `PctUnemployed16_Over`: Percentage of population above 16 years old with no jobs
* Health insurance coverage:
+ `PctPrivateCoverage`: Percentage of the population with private insurance coverage
+ `PctEmpPrivCoverage`: percentage of the population with employer-sponsored insurance coverage
+ `PctPublicCoverage`: Percentage of the population with public insurance coverage
* Race:
+ `PctWhite`: Percentage of white population by county
+ `PctBlack`: Percentage of African-American population by county
+ `PctAsian`: Percentage of Asian population by county
+ `PctOtherRace`: Percentage of other races by county
* Birth and death rates:
+ `BirthRate`: Birth rate per county
+ `deathRate`: Death rate per county
#### Evaluation of Dataset and Variables
Based on the outputs from diagnostic and summary statistics functions, as well as further univariate analysis, using relevant charts, below we describe our evaluation of the dataset and its variables. Since definition of most variables was not provided to us, our first step was to ensure understanding of what exactly such variables represent. We also evaluated the data to identify potentially erroneous values, extreme outliers and variables that might require transformation.
>>>from the assignment document: Evaluate the data quality. Are there any issues with the data? Explain how you handled these potential issues. Explain whether any data processing or preparation is required for your data set.
* **Data time frame:**: While `avgAnnCount` represents statistics for 2009-2013, the population by county is for 2015 and other variables do not have date stamps. Ideally all variables should have been from the same time period.
* **`avgAnnCount` definition:**: There is no clear definition for incidence rate per county for the `avgAnnCount` variable. Since the sum of all values is 1,847,514 and that based on [Cancer.gov](https://www.cancer.gov/)) data the average number of cases for 2009-2013 is 1,617,144, we assume this variable represents the actual count of new cases. Therefore, in our analysis we created a new variable called `incidenceRate` to represent the incidence rate of cancer per 100,000 people per county to be able to compare the spread of new cancer cases in different geographical regions regarldess of the actual population of such regions.
```{r collapse = TRUE, comment=NA}
#caclulating the total for avgAnnCount to compare with offical reports by Cancer.gov
sum(Cancer$avgAnnCount)
```
Official Cancer Statistics, 2009-2013
Source: [Cancer.gov](https://www.cancer.gov/)
|Year | New Cases | Deaths|
|-----|-----|-------|
|2009 | 1,660,290 | 562,340 |
|2010 | 1,529,560 | 569,490 |
|2011 | 1,596,670 | 571,950 |
2012 | 1,638,910 | 577,190 |
2013 | 1,660,290 | 580,350 |
```{r collapse = TRUE, comment=NA}
#calculating the mean of the number of new cancer cases for years 2009-2013, based on Cancer.gov data, in order to cofirm our assumption regarding avgAnnCount
incidence_cancer <- c(1660290, 1529560, 1596670, 1638910, 1660290)
mean(incidence_cancer)
```
* **Anomaly in `avgAnnCount`:** Through our assessment, we noticed that the number of new cancer cases (`avgAnnCount`) for 6 counties were greater than those counties' populations (`popEst2015`). Looking at the 6 observations, we realized that the value assigned to `popEst2015` for all these 6 counties is exactly the same number (1962.667684). In fact there are a total of 206 counties that have exactly the same average number of new cases, which is probably an error in the dataset. We decided to replace all of them with NA in our analysis.
```{r collapse = TRUE, comment=NA}
#cheking the number of observatios, where new case count is greater than the population
sum(Cancer$avgAnnCount > Cancer$popEst2015, na.rm = TRUE)
```
```{r collapse = TRUE, comment=NA}
Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA #removing the potentially erroneous number
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000 #creating a new variable: new cases per 100,00 people per county
```
* **`Geography`:** We checked this variable to identify potential duplicates. Since the number of unique values in this column (3,047) is equal to the total number of observations, there can not be any duplicates in this column.
```{r collapse = TRUE, comment=NA}
#checking for potential dubplicates in this variable
length(unique(Cancer[["Geography"]]))
```
* **`binnedInc`:** This variable has 10 levels that seem arbitrary and have different bin sizes. It is not clear why the income bins have been defined this way. As a result, we decided to ignore it in our analysis.
* **Anomaly in `MedianAge`:** The maximum `MedianAge` shows a value of 624, which is clearly a wrong number. We actually identified a total of 30 values in this column that are above 100; therefore, we will replace such values with NA in our analysis.
```{r collapse = TRUE, comment=NA}
age_error = subset(Cancer, MedianAge > 100) #checking the number of erroneous values
nrow(age_error)
```
```{r collapse = TRUE, comment=NA}
# we can recover these numbers #line 301
#Cancer$MedianAge[Cancer$MedianAge > 100] <- NA #replacing erroneous values with NA
```
* **Anomaly in `AvgHouseholdSize`:** The minimum for `AvgHouseholdSize` is 0.0221, which does not make sense, since we do not expect a household size below 1. There are 61 values in this column that are below 1, which we will replace with NA in our analysis.
```{r collapse = TRUE, comment=NA}
household_error = subset(Cancer, AvgHouseholdSize < 1) #checking the number of erroneous values
nrow(household_error)
```
```{r collapse = TRUE, comment=NA}
# we can recover these numbers #line 301
#Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize < 1] = NA #replacing erroneous values with NA
```
* **`PctSomeCol18_24`:** 75% of values within this variable are NAs (2285 out 3047). Therefore, we decided to ignore this variable in our analysis.
* **`BirthRate`:** It is not clear what exactly this represents. Often, the birth rate is defined as childbirths per 1,000 people per year, but applying that to this variable would not give us the right number. For example in Los Angeles County with the population of 10,170,292, there were 124,641 live births in 2015 based on official reports, which translates into a birth rate of 12.25 (BR = (b / p) X 1,000). However, the birth rate in our data shows a value of 4.7 for this county, which is probably the ratio of women aged 15-50 years old who gave birth in 2015 as reported by [TownCharts](http://www.towncharts.com/California/Demographics/Los-Angeles-County-CA-Demographics-data.html)). As a result, we decided to ignore this variable in our analysis.
```{r collapse = TRUE, comment=NA}
#checking the BirthRate value for Los Angeles County
Cancer[1000,'BirthRate']
```
```{r collapse = TRUE, comment=NA}
#Calculating LA County birth rate based on official figures. Formula: BR = (b ÷ p) X 1,000
124641/10170292*1000
```
* **`deathRate`:** Based on our assessment, we believe this variable represents the number of deaths due to cancer per 100,000 population per county. For instance, we looked at the figure for Kings County, NY (173.6) and the number in our data is closer to the officially reported cancer death rate (140.3), as opposed to overall death rate (603.1). We also calculated the actual number of deaths per county (`deathRate` * `popEst2015` / 100000) and the total for these values, which is eaual to 525,347. This number is pretty close to the figure reported by [Cancer.gov](https://www.cancer.gov/) (589,430), further confirming our assumption regarding `deathRate`.
```{r collapse = TRUE, comment=NA}
#checking the deathRate for Kings County, NY
Cancer[388, 'deathRate']
```
Kings County, NY statistics:
2015 population: 2,673,000
2015 death rate (per 100,000 population): 603.1
2015 Cancer death rate (per 100,000 population): 140.3
>>> Sources: [DATA USA](https://datausa.io/), [NY State Dpt of Health](https://www.health.ny.gov/)
```{r collapse = TRUE, comment=NA}
#comparing total death count in our dataset with official stats reported by health authorities
death_count <- Cancer$deathRate * Cancer$popEst2015/100000
sum(Cancer$death_count)
```
* **`PctEmpPrivCoverage`:** We assume that the values in this variable represent a subset of values in `PctPrivateCoverage`, since the sum of these two variables in some rows is above 100.
* **Overlap between `PctPrivateCoverage` and `PctPublicCoverage`:** We assume that there is an overlap between people that have public health insurance and those with private health insurance, since the sum of `PctPrivateCoverage` and `PctPublicCoverage` in some rows is above 100. In fact, this is not uncommon among some senior citizenz that have both Medicare and a supplementary private health plan (aka. Medigap).
```{r collapse = TRUE, comment=NA}
#adding up health insurance coverage variables, to makes sence of such variables and check for overlaps
Pct_insured <- Cancer$PctPrivateCoverage + Cancer$PctPublicCoverage
Pct_PersonalIsure <- Cancer$PctPrivateCoverage + Cancer$PctEmpPrivCoverage
print('Pct_insured')
summary(Pct_insured)
print('Pct_PersonalIsure')
summary(Pct_PersonalIsure)
```
#### Data Transformations
Based on the data evaluation mentiond before and additional analysis, we are transforming some of the valuables that have issues, as explained below.
```{r}
Cancer$color <- '#098154'
Cancer$color[Cancer$MedianAge>100] <- '#c72e29' # highlight anomalous values
Cancer$color[Cancer$AvgHouseholdSize<1] <- '#c72e29' # highlight anomalous values
Cancer$color[Cancer$avgAnnCount==1962.667684] <- '#c72e29' # highlight anomalous values
```
### Breakup Geography
We will use the State variable to do more aggregate analysis.
```{r}
Cancer <- separate(Cancer, col = Geography, into = c("County","State"), sep = ", ", remove = FALSE)
```
#### Median Age Corrections
The Median Age variable contained several observations that were larger than 100. As we have determined, based on the median, that this variable is given in terms of years these obeservations did not make sense. However, looking at the clustering pattern it was reasonable to assume that these specific values were given in terms of months and not years.
```{r}
Cancer$MedianAge[Cancer$MedianAge>100] <- Cancer$MedianAge[Cancer$MedianAge>100] / 12 # correct median ages over 100 years which were provided in months
```
### Average Household Size Corrections
The Average Household Size for several observation were below 1, i.e. more houses in that county that people. Looking closer to that group of outliers it looked like they were off by a factor of 100 from the mean and median of the variable. Multiplying these observations by 100 brought them into a acceptable range within the first SD.
```{r}
Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize<1] <- Cancer$AvgHouseholdSize[Cancer$AvgHouseholdSize<1] * 100 # correct average household sizes below 1 which were divided by 100
```
### Average Annual Count
The Average Annual Count variable which designate the number of new cancer cases within the county contained a significant number of values that seems to be some data inputing error, they were all 1962.667684. This value is too large to be resonably considered a valid incidence rate. These datapoints were from only 3 states. So we explored several approaches to correct these values.
* **Keep the Anomalous Values**
This would however change the sample size of variables we want to compare with each other.
* **Remove Entire Observation with the Anomalous Values**.
This would however remove a large percentage of our dataset and significantly weaken our analysis.
* **Remove Just the Anomalous Values**
This would however change the sample size of variables we want to compare with each other.
* **Replae the Anomalous Values by Imputation**
Based on the statistical relationship between the county population and the incidence count, we can replace the anomalous values.
First by finding the mean of the avgAnnCount without the 1962.667684 value. Then we can use this mean to multiply the popEst2015 to replace all values of 1962.667684.
#### Choice 1 - Not Evaluvated
```{r eval = FALSE}
incidenceMean <- mean(Cancer$avgAnnCount[Cancer$avgAnnCount!=1962.667684] / Cancer$popEst2015[Cancer$avgAnnCount!=1962.667684] * 100000) # mean incidence across the u.s.
Cancer$avgAnnCount[Cancer$avgAnnCount==1962.667684] <- incidenceMean * Cancer$popEst2015[Cancer$avgAnnCount==1962.667684] / 100000 # replace anomalous values with imputation
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000 # new cancer cases per 100,000 persons per year
```
### Choice 2 - Final Choice
```{r}
Cancer$avgAnnCount[Cancer$avgAnnCount==1962.667684] <- NA # remove just the anomalous values
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000 # new cancer cases per 100,000 persons per year
```
# Univariate Analysis of Key Variables
Even though the presentation of this section takes a linear form, the actual analysis of key variables was an iterative process. The key variables were chosen based on:
* Our initial hypotheses regarding variables were potentially related to `deathRate`.
* The possibility that a variable/factor could be changed through interventions implemented by government health agencies to improve cancer prevention and survival.
* Additional analysis that we performed to identify variables that actually had a correlation with the dependent variable.
After selecting the key variables, our approach was to focus on assessing the quality of the data (as partly explained in the Introduction) and detecting features, through univariate analysis, that are important to include when modelling the relationships of interest, such as particular features in the distributions, unusual concentrations of observations around certain values, the presence of outliers and extreme outliers, among others.
### Death Rate
Death rate's distribution is symmetric and bell-shaped, with a small amount of outliers at both sides of the mean (2.1% of outliers, with 0.03% of extreme outliers). However, these outliers are still within a reasonable range and do not seem to be errors in the data. Furthermore, the observation corresponding to the only extreme outlier does not look atypical based on the values of the other variables.
Finally, using both summary metrics and visualizations, we did not find any unusual concentration of observations around specific values.
```{r}
summary(Cancer$deathRate)
```
```{r}
boxHist(Cancer$deathRate, "The Cancer Death Rate", "Nummber of Deaths per 100K People")
```
```{r}
outliers.summ(Cancer, 'deathRate')
```
```{r}
Cancer[Cancer$deathRate > 300, ]
```
### Incidence (DEATILS MIGHT BE HIDDEN TO SAVE SPACE)
Looking at the frequency of unique values in 'AvgAnnCount', we found that 206 observations contain the value 1962.667684. This is very likely an error because the values in this variable should all be integers, and in some cases this value is higher than the county population.
```{r incidence_freq}
incidence_freq <- data.frame(table(Cancer$avgAnnCount))
incidence_freq[incidence_freq$Freq > 20, ]
```
```{r}
table(Cancer$avgAnnCount > Cancer$popEst2015)
```
Furthermore, these values are causing the incidence rate (that we will build to be able to compare death with incidence) to have extremely large values.
Incidence rate contains 188 extremely large values (higher than 1500 cases per 100,000 people). As can be seen below, all of these values are caused by the error in AvgAnnCount.
```{r}
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
table(Cancer$incidenceRate > 1500)
table(Cancer$incidenceRate[Cancer$avgAnnCount != 1962.667684] > 1500)
```
Therefore, we decided to remove these "1962.667684" values and replace them with NA.
```{r}
Cancer$avgAnnCount[Cancer$avgAnnCount == 1962.667684] <- NA
Cancer$incidenceRate <- Cancer$avgAnnCount / Cancer$popEst2015 * 100000
```
```{r}
outliers.summ(Cancer, 'avgAnnCount')
```
### Incidence Rate
The distribution of the incidence rate is unimodal and positively skewed, with 46 outliers and 1 extreme outlier. Since these values represent only 1.5% of observations and there is no furhter evidence that they are errors, they will be kept, but should be taken into account when modelling the relationship between incidence and death rates.
```{r}
summary(Cancer$incidenceRate)
```
```{r}
boxHist(Cancer$incidenceRate, "Cancer Incidence Rate", "New Diagnosed Cases per 100K People")
```
```{r}
outliers.summ(Cancer, 'incidenceRate')
```
### Median Income
There are two income variables available: binned income and median income. From these two, We chose median income as our key variable because it is more granular than binned income and, second, because the width of the binned income seem to have been defined to have a similar number of observations in each bin, which is not useful to observe its distribution, and the cutoffs chosen make the charts hard to read.
```{r}
summary(Cancer$binnedInc)
```
Below, we can see that the median income is inded a good candidate, since it doesn't vary as much as income typically does (in this case, the difference between the minimum and maximum values is less than one order of magnitude), representing better the "average" member of each county. However, it's distribution is positively sekewed, having 64 counties where the median income is higher than 80,000 USD.
```{r}
summary(Cancer$medIncome)
```
```{r}
sum(Cancer$medIncome > 80000)
```
```{r}
boxHist(Cancer$medIncome, "The Median Income")
```
Including the 64 observations above that contribute to the positive skewness of this variable, there are still 122 outliers (around 4% of the total observations) that need to be taken into account when building the statistical model that captures the relationship between this variable and the death rate.
```{r}
outliers.summ(Cancer, "medIncome")
```
Given the rather large number of outliers in this variable, we could transform it by taking its logarithm. However, we have decided to follow the rule provided by Fox (2011), where logarithmic transformation is only likely to make a difference if its values "cover two or more orders of magnitude" (Fox, p. 128).
### Education
To measure education, we have six possible candidates: 'PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24', 'PctHS25_Over' and 'PctBachDeg25_Over' that can be divided in two groups: 18-24 and '25 and above' years old. Our initial hypothesis is that the second group should have a stronger correlation with death rate. We validated this hypothesis with the correlations table shown below, that found that only PctBachDeg from the 18-24 group has a correlation with deathRate (although this correlation is very week, -0.31). Instead, as expected, the two'25 and above' education variables have a much higher correlation with deathRate.
Therefore, we will focus on these two variables for further analyses on education.
```{r}
cor(Cancer[, names(Cancer) %in%
c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24',
'PctHS25_Over', 'PctBachDeg25_Over', 'deathRate')], use = 'complete.obs')[7, ]
```
We also validated that education variables within each group are mutually exclusive, by making sure that they add up to 100%, for all observations that have complete data, where we find that these variables indeed seem to be mutually exclusive, given that their range is between 99.9 and 100.1, where the small variations around 100 are likely due to rounding.
We can only test this with the 18-24 group since the 25_over group is missing two variables that capture 'no high school' and 'some college'. However, it is reasonable to assume that the same definition is applied to our group of interest (25_over).
```{r}
educ.18.24 <- c('PctNoHS18_24', 'PctHS18_24', 'PctSomeCol18_24', 'PctBachDeg18_24')
educ.df <- subset(Cancer, select = educ.18.24)
educ.complete <- complete.cases(educ.df)
sum.pct.freq <- data.frame(table(rowSums(educ.df[educ.complete, ], na.rm = TRUE)))
names(sum.pct.freq) <- c("Sum", "Frequency")
sum.pct.freq
```
#### PctHS25_over
Values in `PctHS25` are within a reasonable range (7 to 55%) and there doesn't seem to be an unusual concentration of observations around certain values. Also, the disribution of this variable is unimodal and negatively skewed. However, it only contains 31 outliers (1% of observations) and there are no extreme outliers. Furthermore, there are no indications that these outliers are errors, so we decided to keep them.
```{r}
summary(Cancer$PctHS25_Over)
```
```{r}
boxHist(Cancer$PctHS25_Over, "Percentage age 25 or older with high school only")
```
```{r}
outliers.summ(Cancer, "PctHS25_Over")
```
#### PctBachDeg25_Over
Values in `PctHS25_Over` are within a reasonable range (7% to 55%) and there doesn't seem to be an unusual concentration of observations around certain values. The disribution of this variable is unimodal and positively skewed. It contains 82 outliers (2.7% of observations) all of which at are at the right side of the mean. Of these 82 outliers, only 5 are extreme outliers, that will be kept in the data set, since there are no indications that they are errors.
```{r}
summary(Cancer$PctBachDeg25_Over)
```
```{r}
boxHist(Cancer$PctBachDeg25_Over, "Percentage age 25 or older with bachelors degree only")
```
Extreme outliers
```{r}
outliers.summ(Cancer, "PctBachDeg25_Over")
```
### Poverty percent
The distribution of `povertyPercent` is unimodal and positively skewed. This is reflected by the fact that all outliers are at the right of the mean. Taking a deeper dive into the outliers, we found that only 3 are extreme while 66 are mild. For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable.
However, when modeling the relationship of interest, we should take into account that the distribution of this variable is not normal and it may be necessary to transform it if the model used requires it.
```{r}
summary(Cancer$povertyPercent)
```
```{r}
boxHist(Cancer$povertyPercent, "Percentage age 25 or older with up to bachelors degree")
```
Extreme outliers
```{r}
outliers.summ(Cancer, "povertyPercent")
```
### Percentage employed (16 or over)
The distribution of `PctEmployed16_Over` is unimodal and negatively skewed. This is reflected by the fact that all but one of the outliers are at the left of the mean. There are no extreme outliers and 20 mild outliers (0.7% of observations). For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable.
```{r}
summary(Cancer$PctEmployed16_Over)
```
```{r}
boxHist(Cancer$PctEmployed16_Over, "Percentage age 25 or older with up to bachelors degree")
```
Extreme outliers
```{r}
outliers.summ(Cancer, "PctEmployed16_Over")
```
### Percentage with public coverage
The distribution of `PctPublicCoverage` is unimodal and symmetric, with no extreme outliers and only 18 mild outliers (0.6% of observations). For this reason, and because we did not find other indication that the outliers or other values were errors, we will keep all data from this variable. There are also no other particular features from this variables that grant further warnings in modelling the relationship between it and `deathRate`.
```{r}
summary(Cancer$PctPublicCoverage)
```
```{r}
boxHist(Cancer$PctPublicCoverage, "Percentage age 25 or older with up to bachelors degree")
```
Extreme outliers
```{r}
outliers.summ(Cancer, "PctPublicCoverage")
```
## 3. Analysis of Key Relationships
We wanted to find the variables that have strong correlations to the Cancer Incidence (**incedenceRate**), and Cancer Deaths (**deathRate**).
```{r}
Cancer.Numerical <- Cancer[, !names(Cancer) %in% c('Geography','binnedInc','color','State', 'County')]
Cancer.Correlation <- cor(Cancer.Numerical, use = 'pairwise.complete.obs')
```
We choose all the original numerical variables for this purpose using. We chose to use pairwise complete observations as there were some variables with many missing values that would significantly alter the correlation of other variable pairs otherwise.
Using a matrix correlation plot and using a ordering algorithm to cluster correlated variables together allows us easily find strong $\color{blue}{\text{POSITIVELY}}$ and $\color{red}{\text{NEGATIVELY}}$ correlated variables.
```{r}
corrplot(Cancer.Correlation, method = 'circle', type = 'lower', order = 'FPC', diag = F)
```
### Network Diagram
A network diagram allows us to connect the variables (nodes) to each other based on their respective correlation (links).
For this we needed to first respahe the Correlation Matrix we generated in the previous step into a flat list of variable combinations along with their respective correlations.
```{r}
links <- subset(melt(Cancer.Correlation), value != 1.0 & abs(value) > 0.4)
links <- links[!duplicated(t(apply(links, 1, sort))),]
```
Afterwards we picked all absolute correlations greater than 0.4, being careful to ignore the diagonal. Also, as the correlation appears twice within the matrix, we removed all duplicate combinations to make the network graph easier to intrepret.
Stylize the links, mapping the inverse log of the correlation magnitudes to the width of the arrows, i.e. stronger the correlation the thicker the arrow. Also, we conditionally mappedchanged the color of the link based on if the correlations are $\color{blue}{\text{POSITIVE}}$ or $\color{red}{\text{NEGATIVE}}$.
```{r}
names(links)[1] = 'from'
names(links)[2] = 'to'
names(links)[3] = 'correlation'
links$magnitude <- abs(links$correlation)
links$width <- 10^links$magnitude
links$color <- ifelse(links$correlation < 0, 'red', 'blue')
```
The complete set of links were filterd into two subsets that connected to the deathRate or the incidenceRate variable.
```{r}
links.deathRate <- links[links$from == 'deathRate',]
links.incidenceRate <- links[links$from == 'incidenceRate',]
links.outcomeVariables <- rbind(links.deathRate,
links.incidenceRate)
```
Next we had to create a list of all nodes--Variables--that we want to see in the diagram. First we start with all the original numerial variables.
```{r}
nodes <- data.frame('id' = names(Cancer.Numerical))
```
Stylize the nodes, highlighting $\color{navy}{\text{deathRate}}$ in navy and $\color{purple}{\text{incidenceRate}}$ in purple.
```{r}
nodes$label <- nodes$id
nodes$shadow <- T
nodes$color.background <- 'tomato'
nodes$color.border <- 'black'
nodes$color.highlight.background <- 'orange'
nodes$color.highlight.border <- 'darkred'
nodes$color.background[nodes$id=='deathRate' ] = 'navy'
nodes$color.background[nodes$id=='incidenceRate'] = 'purple'
nodes.deathRate <- nodes[nodes$id %in% c(as.vector(links.deathRate$to),'deathRate'),]
nodes.incidenceRate <- nodes[nodes$id %in% c(as.vector(links.incidenceRate$to),'incidenceRate'),]
nodes.outcomeVariables <- nodes[nodes$id %in% unique(c(as.vector(links.incidenceRate$to),
as.vector(links.deathRate$to),
'incidenceRate',
'deathRate')),]
```
Network diagram showing strong correlations to deathRate.
```{r}
visNetwork(nodes.deathRate, links.deathRate)
```
Network diagram showing strong correlations to incidenceRate.
```{r}
visNetwork(nodes.incidenceRate, links.incidenceRate)
```
Network diagram showing strong correlations to both outcome variables.
```{r}
visNetwork(nodes.outcomeVariables, links.outcomeVariables)
```
The preceeding visual analysis helped us to pick the strongest candidates to do further data exploration.
### Education
As explained above, guided by our hypothesis that the education of the '25 and over' years old group should have a much stronger relationship with deathRate than the '18-24' years old group, which was supported by the correlations between these variables, we will be focusing on the former group.
#### PctHS25_over
A correlation of $0.4$ between `PctHS25_over` and `deathRate` indicates that there is indeed a relationship between these variables, which is further indicated by plotting them together in a scatterplot, that shows that higher values of percentage of population with only high school tend to be associated to higher death rates (this is also reflected in the regression line added to the scatterplot).
This is an intuitive result since it indicates that a higher concentration of people with low education levels may have poorer health habits and lower access to medical services. However, both of these variables could be affeccted by `MedianAge` in the same direction: older counties might have lower levels of higher education and higher rates of death.
```{r collapse=TRUE}
cor(Cancer$deathRate, Cancer$PctHS25_Over)
```
```{r}
#plot(Cancer$PctHS25_Over, Cancer$deathRate, main = "HS (>24)")
#abline(lm(Cancer$deathRate ~ Cancer$PctHS25_Over), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$deathRate, Cancer$PctHS25_Over, "HS Education (>24)")
```
#### PctBachDeg25_0ver
A correlation of $-0.48$ indicates that there is relationship between `PctBachDeg25_over` and `deathRate`, which is further supported by plotting these variables in a scatterplot, where it can be seen that higher values of percentage of people with bachelors degree are associated to lower levels of death rates. This relationship is also supported by the regression line included in the scatterplot.
This is also an intuitive result, since higher levels of education might be linked to better health habits and access to health services. However, and following the same reasoning than `PctHS25_over`, the relationship between these two variables may be confounded by `MedianAge`, although it is not clear in which direction this effect might go. Therefore, it will also be necessary to explore the effect of `MedianAge` in the following section.
```{r}
cor(Cancer$deathRate, Cancer$PctBachDeg25_Over)
```
```{r}
#plot(Cancer$PctBachDeg25_Over, Cancer$deathRate, main = "Bachelor (>24)")
#abline(lm(Cancer$deathRate ~ Cancer$PctBachDeg25_Over), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$PctBachDeg25_Over, Cancer$deathRate, "Bachelor's Degree (>24)")
```
## 4. Analysis of Secondary Effects
Throughout the analyses above, we began to identify that some of the relationships found between `deathRate` and other variables may not only be capturing the direct relationship betwee these variables but of additional variable(s) that may be impacting both. To further assess this systematically, the following network visualization shows the variables that have a correlation higher than 0.4, where each node represents a different variable and each vertex indicates the strength of the relationship between the variables connected.
```{r}
visNetwork(nodes,links)
```
![secondary_analysis](secondary_analysis.png)
### Age and family/householdd
`PercentMarried` has a (weak) relation and `AvgHouseholdSize` has a moderate relation with `MedianAge`. Based on these results, we explored this relationship further.
```{r}
cor(subset(Cancer,
select = c("MedianAge", "AvgHouseholdSize", "PercentMarried")),
use = "pairwise.complete.obs")[1, ]
```
Both the scatterplots below and the regression lines imposed on them provide further support that there is indeed a relationship between these two variabes and `MedianAge`, indicating that `MedianAge` may confound the relationship between these two and `deathRate`. Therefore, this should be taken into account when modelling the relation of interest, in order to isolate the effect of the family variables on the death rate.
```{r}
#plot(Cancer$MedianAge, Cancer$PercentMarried, main = "Age vs PercentMarried")
#abline(lm(PercentMarried ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$PercentMarried, Cancer$MedianAge, "Age vs PercentMarried")
```
```{r}
#plot(Cancer$MedianAge, Cancer$AvgHouseholdSize, main = "Age vs Average household size")
#abline(lm(AvgHouseholdSize ~ MedianAge, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$AvgHouseholdSize, Cancer$MedianAge, "Age vs Household Size")
```
### Black population and employment
Correlation analysis shows that there is a relationship between the percentage of black population and employment, which is further confirmed both by a visual inspection of the scatterplot and the linear regression line charted in this plot. Since employment es related to `deathRate`, its correlation with `PctBlack` may indicate that this variable may be confounding the relationship of interest and thus further modeling needs to take this into account, to isolate the effect of unemployment on death rate.
```{r}
cor(subset(Cancer, select = c("PctBlack", "PctUnemployed16_Over")),
use = "pairwise.complete.obs")[, 1]
```
```{r}
#plot(Cancer$PctBlack, Cancer$PctUnemployed16_Over, main = "Age vs PercentMarried")
#abline(lm(PctUnemployed16_Over ~ PctBlack, data = Cancer), lty = 'dashed', lwd = 2, col = 'red')
yxScatter(Cancer$PctUnemployed16_Over, Cancer$PctBlack, "Percent Unemployed vs Black")
```
### State
A boxplot containing different location measures of `deathRate` by `State` shows that these values vary measures vary significantly across state. Since `State` may be capturing several state-level characteristics that may in turn affect other variables that have a relation with `deathRate`, it is recommended to include state-level effects when modeling the relation of interest, to control for confounding these state-level factors.
```{r}
boxplot(Cancer$deathRate ~ Cancer$State)
```