Michael Davies (mld9s)
Akeem Wells (ajw3rg)
Predicting Heart Disease
Overview: “Heart disease has become a major health problem in both developed and developing countries, and it is cited as the number one cause of death throughout the world each year.” Given this burden, detecting cardiovascular disease and identifying its risk level in adults is a critical task.
Objectives: We will implement a model to classify whether a patient is healthy or has heart disease. More specifically, we will develop a binary classification model that predicts the posterior probability that an individual has heart disease, given our data and model.
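As an illustration of the quantity we are after (not a commitment to a particular likelihood), a logistic-regression-style model with coefficient vector $\beta$ would express this probability for an individual with features $x_i$ as

$$p(y_i = 1 \mid x_i, \beta) = \operatorname{logit}^{-1}\!\big(\beta_0 + x_i^\top \beta\big),$$

and the Bayesian posterior over $\beta$ then induces a posterior predictive probability of disease for each patient.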
In short, we obtained the data and have completed preliminary cleaning, which can be seen below.
Data
The data we selected comes from the UCI Machine Learning Repository Heart Disease dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/), which combines four collection sites: Cleveland, Hungary, Switzerland, and the VA Medical Center in Long Beach.
Variables: age, sex, cp (chest pain type), trestbps (resting blood pressure), chol (serum cholesterol), fbs (fasting blood sugar > 120 mg/dl), restecg (resting ECG results), thalach (maximum heart rate achieved), exang (exercise-induced angina), oldpeak (exercise-induced ST depression), slope (slope of the peak exercise ST segment), ca (number of major vessels colored by fluoroscopy), thal (thallium stress test result), and target (heart-disease diagnosis). We also add a country column to record the source site.
We plan to implement a hierarchical Bayes approach. This is appropriate because our data contain the same features but are drawn from distinct sites: Budapest (Hungary), Zurich and Basel (Switzerland), the VA Medical Center in Long Beach, and Cleveland. A rough sketch of what such a model could look like is given below.
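The following is only a sketch of the kind of partial-pooling model we have in mind, written against the PyMC3 imports used later in this notebook; the array names (X, y, country_idx), the placeholder data, and the choice of priors are illustrative, not the final model.
import numpy as np
import pymc3 as pm
n_countries = 4                                        # Cleveland, Hungary, Switzerland, VA Long Beach
n_features = 5                                         # placeholder: a small subset of predictors
X = np.random.randn(100, n_features)                   # placeholder standardized features
y = np.random.randint(0, 2, size=100)                  # placeholder binary target
country_idx = np.random.randint(0, n_countries, 100)   # site label for each row
with pm.Model() as hierarchical_logistic:
    # Global hyperpriors shared across sites
    mu_beta = pm.Normal("mu_beta", mu=0.0, sd=1.0, shape=n_features)
    sigma_beta = pm.HalfCauchy("sigma_beta", beta=1.0, shape=n_features)
    # Site-level coefficients partially pooled toward the global mean
    beta = pm.Normal("beta", mu=mu_beta, sd=sigma_beta, shape=(n_countries, n_features))
    intercept = pm.Normal("intercept", mu=0.0, sd=1.0, shape=n_countries)
    # Bernoulli likelihood: each row uses its own site's coefficients
    logits = intercept[country_idx] + (beta[country_idx] * X).sum(axis=1)
    pm.Bernoulli("obs", p=pm.math.sigmoid(logits), observed=y)
    trace = pm.sample(1000, tune=1000, target_accept=0.9)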
Questions we have at this point are:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from itertools import combinations
import warnings
warnings.filterwarnings('ignore')
from mpmath import mp
mp.dps = 50
# pymc3 / arviz / theano
import arviz as az
import theano.tensor as T
from pymc3 import Model, sample, Normal, HalfCauchy, Uniform, Deterministic
from pymc3 import forestplot, traceplot, plot_posterior
import pymc3 as pm
# scipy / statsmodels / sklearn
from sklearn.impute import KNNImputer
from scipy.stats import multivariate_normal
import statsmodels.api as sm
from statsmodels.tools import add_constant
#np.set_printoptions(threshold=np.inf)
#pd.set_option("display.max_rows", None, "display.max_columns", None)
#!nproc
!rm -rf processed.cleveland.data*
!rm -rf processed.hungarian.data*
!rm -rf processed.switzerland.data*
!rm -rf processed.va.data*
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data
--2021-08-11 23:11:25--  https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
‘processed.cleveland.data’ saved [18461/18461]
--2021-08-11 23:11:25--  https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data
‘processed.hungarian.data’ saved [10263/10263]
--2021-08-11 23:11:26--  https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data
‘processed.switzerland.data’ saved [4109/4109]
--2021-08-11 23:11:27--  https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data
‘processed.va.data’ saved [6737/6737]
ls
age_histplot.png hierarchical_trace_subset_mu.png
BMA_Logit_summary.tex hierarchical_trace_subset.png
ca_histplot.png logistic_model_trace_full.png
chol_histplot.png logistic_model_trace_subset.png
corr_heatmap.png logistic_model_trace_summary.tex
country_histplot.png oldpeak_histplot.png
cp_histplot.png pairplot_by_country.png
data_describe_scaled.tex pairplot_by_target.png
data_describe.tex processed.cleveland.data
exang_histplot.png processed.hungarian.data
fbs_histplot.png processed.switzerland.data
final_summary_BMA.tex processed.va.data
final_summary_hierarchical.tex restecg_histplot.png
final_summary_logistic.tex sample_data/
hierarchical_trace_full_marginals.png sex_histplot.png
hierarchical_trace_full_mu.png slope_histplot.png
hierarchical_trace_full.png sm_Logit_summary.tex
hierarchical_trace_full_sigma.png target_histplot.png
hierarchical_trace_marginals_summary.tex test.tex
hierarchical_trace_mu_summary.tex thalach_histplot.png
hierarchical_trace_offset_summary.tex thal_histplot.png
hierarchical_trace_sigma_summary.tex trestbps_histplot.png
def load_data(infile):
    # Read a comma-separated heart-disease file, keeping only complete 14-field rows
    matrix = []
    with open(infile, "r") as file:
        for line in file.read().split("\n"):
            row = line.split(',')
            if len(row) == 14:
                matrix.append(row)
    return matrix
columns = ["age","sex","cp","trestbps","chol","fbs", "restecg",
"thalach", "exang","oldpeak","slope", "ca","thal","target"]
nan_map = {'?':np.nan}
df_processed_hungarian = pd.DataFrame(load_data("processed.hungarian.data"))
df_processed_hungarian.columns = columns
df_processed_hungarian = df_processed_hungarian.replace(nan_map)
df_processed_hungarian.insert(0, "country", "hungary")
df_processed_va = pd.DataFrame(load_data("processed.va.data"))
df_processed_va.columns = columns
df_processed_va = df_processed_va.replace(nan_map)
df_processed_va.insert(0, "country", "va-longbeach")
df_processed_switzerland = pd.DataFrame(load_data("processed.switzerland.data"))
df_processed_switzerland.columns = columns
df_processed_switzerland = df_processed_switzerland.replace(nan_map)
df_processed_switzerland.insert(0, "country", "switzerland")
df_processed_cleveland = pd.DataFrame(load_data("processed.cleveland.data"))
df_processed_cleveland.columns = columns
df_processed_cleveland = df_processed_cleveland.replace(nan_map)
df_processed_cleveland.insert(0, "country", "cleveland")
print(df_processed_switzerland.shape)
df_processed_switzerland.head()
(123, 15)
 | country | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | switzerland | 32 | 1 | 1 | 95 | 0 | NaN | 0 | 127 | 0 | .7 | 1 | NaN | NaN | 1 |
1 | switzerland | 34 | 1 | 4 | 115 | 0 | NaN | NaN | 154 | 0 | .2 | 1 | NaN | NaN | 1 |
2 | switzerland | 35 | 1 | 4 | NaN | 0 | NaN | 0 | 130 | 1 | NaN | NaN | NaN | 7 | 3 |
3 | switzerland | 36 | 1 | 4 | 110 | 0 | NaN | 0 | 125 | 1 | 1 | 2 | NaN | 6 | 1 |
4 | switzerland | 38 | 0 | 4 | 105 | 0 | NaN | 0 | 166 | 0 | 2.8 | 1 | NaN | NaN | 2 |
print(df_processed_cleveland.shape)
df_processed_cleveland.head()
(303, 15)
 | country | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | cleveland | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
1 | cleveland | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
2 | cleveland | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
3 | cleveland | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
4 | cleveland | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
print(df_processed_va.shape)
df_processed_va.head()
(200, 15)
 | country | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | va-longbeach | 63 | 1 | 4 | 140 | 260 | 0 | 1 | 112 | 1 | 3 | 2 | NaN | NaN | 2 |
1 | va-longbeach | 44 | 1 | 4 | 130 | 209 | 0 | 1 | 127 | 0 | 0 | NaN | NaN | NaN | 0 |
2 | va-longbeach | 60 | 1 | 4 | 132 | 218 | 0 | 1 | 140 | 1 | 1.5 | 3 | NaN | NaN | 2 |
3 | va-longbeach | 55 | 1 | 4 | 142 | 228 | 0 | 1 | 149 | 1 | 2.5 | 1 | NaN | NaN | 1 |
4 | va-longbeach | 66 | 1 | 3 | 110 | 213 | 1 | 2 | 99 | 1 | 1.3 | 2 | NaN | NaN | 0 |
print(df_processed_hungarian.shape)
df_processed_hungarian.head()
(294, 15)
 | country | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | hungary | 28 | 1 | 2 | 130 | 132 | 0 | 2 | 185 | 0 | 0 | NaN | NaN | NaN | 0 |
1 | hungary | 29 | 1 | 2 | 120 | 243 | 0 | 0 | 160 | 0 | 0 | NaN | NaN | NaN | 0 |
2 | hungary | 29 | 1 | 2 | 140 | NaN | 0 | 0 | 170 | 0 | 0 | NaN | NaN | NaN | 0 |
3 | hungary | 30 | 0 | 1 | 170 | 237 | 0 | 1 | 170 | 0 | 0 | NaN | NaN | 6 | 0 |
4 | hungary | 31 | 0 | 2 | 100 | 219 | 0 | 1 | 150 | 0 | 0 | NaN | NaN | NaN | 0 |
def convert_float_to_int(df, variable):
    # Coerce string entries (e.g. "63.0") to numbers, preserving NaNs
    df[variable] = df[variable].apply(lambda x: int(float(x)) if pd.notnull(x) else x)
    # Cast the column to float (NaNs prevent a plain int dtype)
    df[variable] = df[variable].astype('float')
    ## To make the column a nullable integer instead:
    #df[variable] = df[variable].astype('Int64')
convert_columns = ["age","sex","cp","trestbps","chol","fbs",
                   "restecg","thalach", "exang","slope", "ca","thal","target"]
for variable in convert_columns:
    convert_float_to_int(df_processed_switzerland, variable)
    convert_float_to_int(df_processed_cleveland, variable)
    convert_float_to_int(df_processed_va, variable)
    convert_float_to_int(df_processed_hungarian, variable)
countries = (df_processed_switzerland, df_processed_cleveland, df_processed_va, df_processed_hungarian)
df = pd.concat(countries)  # ignore_index=True would reset the per-site row indices
df['target'] = df['target'].apply(lambda x: 0 if int(x) == 0 else 1)
print(df.shape)
(920, 15)
#df.dropna(inplace=True)
print(df.shape)
df.sample(10)
(920, 15)
 | country | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
63 | va-longbeach | 56.0 | 1.0 | 3.0 | 170.0 | 0.0 | 0.0 | 2.0 | 123.0 | 1.0 | 2.5 | NaN | NaN | NaN | 1 |
54 | switzerland | 55.0 | 1.0 | 4.0 | 140.0 | 0.0 | 0.0 | 0.0 | 83.0 | 0.0 | 0 | 2.0 | NaN | 7.0 | 1 |
164 | hungary | 55.0 | 1.0 | 4.0 | 120.0 | 270.0 | 0.0 | 0.0 | 140.0 | 0.0 | 0 | NaN | NaN | NaN | 0 |
166 | hungary | 56.0 | 0.0 | 3.0 | 130.0 | 219.0 | NaN | 1.0 | 164.0 | 0.0 | 0 | NaN | NaN | 7.0 | 0 |
253 | hungary | 44.0 | 1.0 | 2.0 | 150.0 | 288.0 | 0.0 | 0.0 | 150.0 | 1.0 | 3 | 2.0 | NaN | NaN | 1 |
219 | hungary | 57.0 | 1.0 | 2.0 | 140.0 | 265.0 | 0.0 | 1.0 | 145.0 | 1.0 | 1 | 2.0 | NaN | NaN | 1 |
263 | cleveland | 44.0 | 1.0 | 3.0 | 120.0 | 226.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0 |
211 | hungary | 49.0 | 1.0 | 4.0 | 130.0 | 206.0 | 0.0 | 0.0 | 170.0 | 0.0 | 0 | NaN | NaN | NaN | 1 |
4 | cleveland | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
6 | hungary | 32.0 | 1.0 | 2.0 | 110.0 | 225.0 | 0.0 | 0.0 | 184.0 | 0.0 | 0 | NaN | NaN | NaN | 0 |
# coercing categorical and discrete values to category type
categ_cols = ['country', 'sex', 'target', 'cp', 'fbs', 'exang', 'ca',
'thal', 'slope', 'restecg']
int_cols = [col for col in df.columns if col not in categ_cols]
df[categ_cols] = df[categ_cols].astype('category')
df[int_cols] = df[int_cols].apply(pd.to_numeric)
pd.DataFrame(df.dtypes)
 | 0 |
---|---
country | category |
age | float64 |
sex | category |
cp | category |
trestbps | float64 |
chol | float64 |
fbs | category |
restecg | category |
thalach | float64 |
exang | category |
oldpeak | float64 |
slope | category |
ca | category |
thal | category |
target | category |
with open("data_describe.tex", "w") as f:
    f.write(df.describe().to_latex())
df.describe()
 | age | trestbps | chol | thalach | oldpeak |
---|---|---|---|---|---
count | 920.000000 | 861.000000 | 890.000000 | 865.000000 | 858.000000 |
mean | 53.510870 | 132.132404 | 199.130337 | 137.545665 | 0.878788 |
std | 9.424685 | 19.066070 | 110.780810 | 25.926276 | 1.091226 |
min | 28.000000 | 0.000000 | 0.000000 | 60.000000 | -2.600000 |
25% | 47.000000 | 120.000000 | 175.000000 | 120.000000 | 0.000000 |
50% | 54.000000 | 130.000000 | 223.000000 | 140.000000 | 0.500000 |
75% | 60.000000 | 140.000000 | 268.000000 | 157.000000 | 1.500000 |
max | 77.000000 | 200.000000 | 603.000000 | 202.000000 | 6.200000 |
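The counts above (e.g. 861 for trestbps and 858 for oldpeak, out of 920 rows) confirm that missing values remain. One option, consistent with the KNNImputer import at the top of the notebook, would be to impute the continuous columns; the snippet below is only an illustrative sketch, not necessarily the imputation we will ultimately use.
from sklearn.impute import KNNImputer
# Sketch: impute NaNs in the continuous columns only (the categorical columns would need separate handling)
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
imputer = KNNImputer(n_neighbors=5)
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df_imputed[numeric_cols])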
!pwd
/content
df.target.value_counts()
1    509
0    411
Name: target, dtype: int64
Initial EDA
# sns.pairplot creates its own figure, so a preceding plt.figure() call only produces an empty figure
g = sns.pairplot(df, hue='country', kind="reg", plot_kws={'scatter_kws': {'alpha': 0.3}})
g.fig.set_size_inches(12, 10)
g.savefig("pairplot_by_country.png")
[Figure: pairwise relationships of the numeric features, colored by country; saved as pairplot_by_country.png]