Bivariate Analysis between GDP per capita, Sanitation and Life Expectancy across Nations in 2010
05 Dec 2022
Our Aim is to determine which features more directly affect the Life Expectation of the citizens.
\[ ~ \]
This presentation used various methods of Bivariate Analysis to infer relationship between various features of Nations.
log of GDP per capita: Logarithm (base \(e\)) of Gross Domestic Product (in $) per citizen. Adjusted for Inflation. [lngdp]
The Gross Domestic Product per capita of a country basically tells about the wealth of the citizens of the country.
Sanitation Access %: Percentage of people using at least basic Sanitation facilities, not shared with other households. [snt]
Sanitation of oneself is one of the basic tasks of a human being. We know poor sanitation is linked to transmission of diarrheal diseases such as cholera and dysentery, as well as typhoid, intestinal worm infections and polio which directly affect the health of an individual.
Life Expectancy: The average number of years a newly born child would live, provided current mortality patterns hold. [lfx]
Life expectancy is calculated based on the assumption that probability of death at a certain age stays constant in future. Hence we Life Expectation as a measurable proxy for Health.
script.dir <- getSrcDirectory(function(x) {x})
setwd(script.dir)
numerise = function(x){
x[grepl("k$", x)] <- as.numeric(sub("k$", "", x[grepl("k$", x)]))*10^3
x <- as.numeric(x)
return(x)
}
d1_raw = read.csv(file.path(".","Data","gdp.csv"), fileEncoding = 'UTF-8-BOM')
d2_raw = read.csv(file.path(".","Data","sanitation.csv"), fileEncoding = 'UTF-8-BOM')
d3_raw = read.csv(file.path(".","Data","life_expectancy.csv"), fileEncoding = 'UTF-8-BOM')
yearname = "X2010"
d1 = d1_raw[!is.na(numerise(d1_raw[, yearname])),][,c("country", yearname)]
colnames(d1)[2] = "lngdp"
d2 = d2_raw[!is.na(numerise(d2_raw[, yearname])),][,c("country", yearname)]
colnames(d2)[2] = "snt"
d3 = d3_raw[!is.na(numerise(d3_raw[, yearname])),][,c("country", yearname)]
colnames(d3)[2] = "lfx"
dtemp = merge(x = d1, y = d2, by = "country")
d = merge(x = dtemp, y = d3, by = "country")
d$lngdp = log(numerise(d$lngdp))
write.csv(d, "./Data/assembled.csv")
kable(head(d, 6L))
country | lngdp | snt | lfx |
---|---|---|---|
Afghanistan | 6.265301 | 34.9 | 60.5 |
Albania | 8.183118 | 95.2 | 78.1 |
Algeria | 8.273847 | 87.0 | 74.5 |
Andorra | 10.454495 | 100.0 | 81.8 |
Angola | 8.291547 | 41.1 | 60.2 |
Antigua and Barbuda | 9.546813 | 86.3 | 75.9 |
Mean or Arithmetic Mean \(\bar{x}\), Median \(\operatorname{median}(x)\) and Mode \(\operatorname{mode}(x)\) are some measures of central tendency in the sample.
\[ \begin{aligned} x = \{x_1, x_2, \ldots,x_{n-1}, x_n\} && \operatorname{mean}(x)=\bar{x} = \frac{1}{n} \sum _{i=1}^{n}(x_{i}) \end{aligned} \]\[ \begin{aligned} \operatorname{median}(x)= \frac{x_{\lfloor\frac{n+1}{2}\rfloor}+x_{\lfloor\frac{n+2}{2}\rfloor}}{2} && \operatorname{mode}(x) = x_i \text{ s.t. } \operatorname{Pr}(x_i) = \operatorname{sup}(\operatorname{Pr}(x)) \end{aligned} \]
Note: \(f_i\) is the frequency of the ith observation. \(x_{(i)}\) is the ith largest observation.
getmode <- function(v) {
uniqv <- unique(v)
freq = max(tabulate(match(v, uniqv)))
res = uniqv[which.max(tabulate(match(v, uniqv)))]
if (freq == 1) res = NULL
return(res)
}
d_central = data.frame(
row.names = "Variable",
Variable = c(
"*ln(GDP)*",
"*Sanitation*",
"*Life Exp.*"
),
Mean = c(
mean(d$lngdp),
mean(d$snt),
mean(d$lfx)
),
Median = c(
median(d$lngdp),
median(d$snt),
median(d$lfx)
),
Mode = c(
getmode(d$lngdp),
getmode(d$snt),
getmode(d$lfx)
)
)
kable(
d_central,
col.names = c(
"$\\quad \\quad \\bar{x}$",
"$\\operatorname{median}(x)$",
"$\\operatorname{mode}(x)$"
),
digits=5
)
\(\quad \quad \bar{x}\) | \(\operatorname{median}(x)\) | \(\operatorname{mode}(x)\) | |
---|---|---|---|
ln(GDP) | 8.54124 | 8.48673 | 9.23014 |
Sanitation | 72.43857 | 85.60000 | 100.00000 |
Life Exp. | 70.54603 | 72.40000 | 73.20000 |
\(~\)
Range \(\operatorname{range}(x)\), Semi-Interquartile Range \(\operatorname{SIR}(x)\), Mean Deviation about x’ \(\operatorname{MD}_{(x')}(x)\), Variance \(s_x^2\), Standard Deviation \(s_x\) are some measures of dispersion in the sample.
\[ \begin{aligned} \operatorname{range}(x)=|x_{(n)} - x_{(1)}| && \ Q_1 = \operatorname{median}(x_{(1)}, \ldots ,x_{(\lfloor \frac{n+1}{2} \rfloor)}) && \end{aligned} \] \[ \begin{aligned} \ Q_3 = \operatorname{median}(x_{(\lfloor \frac{n+2}{2} \rfloor)}, \ldots , x_{(n)}) && \operatorname{MD}_{(x')}(x) = \operatorname{mean}(|x_i-x'|) \end{aligned} \] \[ \begin{aligned} \operatorname{SIR}(x)=\frac{|Q_1-Q_3|}{2} && s_x = \sqrt{\operatorname{mean}([x_i - \bar{x}]^2)} && s^2_x= (s_x)^2 \end{aligned} \]
getmd = function(x, center = mean(x)){
md = mean(
abs(
x - rep(center, length(x))
)
)
return(md)
}
d_disp = data.frame(
row.names = "Variable",
Variable = c(
"*ln(GDP)*",
"*Sanitation*",
"*Life Exp.*"
),
Range = c(
max(d$lngdp) - min(d$lngdp),
max(d$snt) - min(d$snt),
max(d$lfx) - min(d$lfx)
),
SIR = c(
IQR(d$lngdp)/2,
IQR(d$snt)/2,
IQR(d$lfx)/2
),
MD = c(
getmd(d$lngdp),
getmd(d$snt),
getmd(d$lfx)
),
variance = c(
(sd(d$lngdp))^2,
(sd(d$snt))^2,
(sd(d$lfx))^2
),
SD = c(
sd(d$lngdp),
sd(d$snt),
sd(d$lfx)
)
)
kable(
d_disp,
col.names = c(
"$\\operatorname{range}(x)$",
"$\\operatorname{SIR}(x)$",
"$\\operatorname{MD}_{(\\bar{x})}(x)$",
"$\\quad \\quad \\quad \\quad s_x^2$",
"$\\quad \\quad \\quad \\quad s_x$"
),
digits=5
)
\(\operatorname{range}(x)\) | \(\operatorname{SIR}(x)\) | \(\operatorname{MD}_{(\bar{x})}(x)\) | \(\quad \quad \quad \quad s_x^2\) | \(\quad \quad \quad \quad s_x\) | |
---|---|---|---|---|---|
ln(GDP) | 6.04435 | 1.06914 | 1.17229 | 2.01791 | 1.42053 |
Sanitation | 94.03000 | 24.65000 | 25.50487 | 872.29346 | 29.53461 |
Life Exp. | 50.80000 | 6.00000 | 6.98712 | 75.33494 | 8.67957 |
\(~\)
Coefficients of Skewness \(g_1\) and Kurtosis \(g_2\) describe the symmetry and extremity of tails of the sample distribution.
\[ \begin{aligned} m_k = \operatorname{mean}([x-\bar{x}]^k) && g_1 = \frac{m_3}{{m_2}^{\frac{3}{2}}} && g_2 = \frac{m_4}{{m_2}^2} \end{aligned} \]
d_shape = data.frame(
row.names = "Variable",
Variable = c(
"*ln(GDP)*",
"*Sanitation*",
"*Life Exp.*"
),
Skewness = c(
skewness(d$lngdp),
skewness(d$snt),
skewness(d$lfx)
),
Kurtosis = c(
kurtosis(d$lngdp),
kurtosis(d$snt),
kurtosis(d$lfx)
)
)
kable(
d_shape,
col.names = c(
"Skewness $g_1$",
"Kurtosis $g_2$"
),
digits=5
)
Skewness \(g_1\) | Kurtosis \(g_2\) | |
---|---|---|
ln(GDP) | 0.09619 | 2.14435 |
Sanitation | -0.85989 | 2.27111 |
Life Exp. | -0.87903 | 4.02465 |
\[ ~ \]
ln(GDP) is nearly symmetrical, while Sanitation and Life Exp. are highly left-skewed.
This indicates majority of countries have good sanitation system and citizen health.
ln(GDP) and Sanitation are platykurtic, while Life Exp. is leptokurtic.
labelfunction = as_labeller(c(
`lngdp` = "log of GDP per capita",
`snt` = "Sanitation Access %",
`lfx` = "Life Expectancy"
)
)
ggplot(stack(d[2:4]), mapping = aes(x = values))+
geom_density(aes(color=ind), linewidth=rel(1.5))+
labs(
x=NULL,
y=NULL
)+
mytheme+
mycolor+
facet_wrap(~ind, scales="free", labeller = labelfunction, ncol=1)+
theme(legend.position="none",
strip.text.x = element_text(size = rel(1.5))
)
Box plots help us detect potential outliers. They also help us in estimating location and skewness of the distribution.
plot1 = ggplot(stack(d[2:4]), mapping = aes(y = values, x = ind))+
geom_boxplot(aes(fill=ind), alpha=0.6)+
labs(
x=NULL,
y=NULL
)+
mytheme+
facet_wrap(vars(ind), scales="free", labeller = labelfunction)+
theme(axis.text.x=element_blank(),
legend.position="none",
strip.text.x = element_text(size = rel(1.5)),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank()
)+
scale_color_manual(
values=c(
"#cc241d",
"#458588",
"#689d6a",
"#d65d0e"
),
aesthetics="fill"
)
ggplotly(plot1)
A Scatter plot helps us estimate the type of relationship between variables.
We observe a fairly strong positive Linear Correlation between the three features in the Scatter Plot.
We subsequently compute the Correlation Coefficients to quantify the Linear Correlation.
Correlation is any relationship, causal or spurious, between two random variables \(x\), \(y\).
Pearson’s \(r\), Spearman’s \(r_s\), and Kendall’s \(\tau\) are some correlation coefficients. These estimate the linear correlation between two variables.
d_cor = data.frame(
row.names = "Variable",
Variable = c(
"*Sanitation vs. ln(GDP)*",
"*Life Exp. vs. ln(GDP)*",
"*Life Exp. vs. Sanitation*"
),
Pearson = c(
cor(d$snt, d$lngdp, method="pearson"),
cor(d$lfx, d$lngdp, method="pearson"),
cor(d$lfx, d$snt, method="pearson")
),
Spearman = c(
cor(d$snt, d$lngdp, method="spearman"),
cor(d$lfx, d$lngdp, method="spearman"),
cor(d$lfx, d$snt, method="spearman")
),
Kendall = c(
cor(d$snt, d$lngdp, method="kendall"),
cor(d$lfx, d$lngdp, method="kendall"),
cor(d$lfx, d$snt, method="kendall")
)
)
avg_cor = round(mean(d_cor[, 1]), digits=2)
kable(
d_cor,
digit = 5,
col.names = c(
"*Pearson's* $r$",
"*Spearman's* $r_s$",
"*Kendall's* $\\tau$"
)
)
Pearson’s \(r\) | Spearman’s \(r_s\) | Kendall’s \(\tau\) | |
---|---|---|---|
Sanitation vs. ln(GDP) | 0.80659 | 0.85920 | 0.67458 |
Life Exp. vs. ln(GDP) | 0.77229 | 0.81639 | 0.62168 |
Life Exp. vs. Sanitation | 0.81351 | 0.83513 | 0.63744 |
\[ ~ \]
Covariance \(\operatorname{cov}(x, y)\) is a measure of the joint variability of two random variables \(x\), \(y\).
\[ \begin{aligned} \operatorname {cov} (x,y)={\frac {\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n}} && r_{x,y}= \frac{\operatorname{cov}(x,y)}{s_x s_x} \end{aligned} \]
lngdp | snt | lfx | |
---|---|---|---|
lngdp | 2.01791 | 33.84045 | 9.52202 |
snt | 33.84045 | 872.29346 | 208.54155 |
lfx | 9.52202 | 208.54155 | 75.33494 |
\[A_{i,j} = \operatorname{cov}(x_i, x_j)\]
We observe fairly good Linear Correlation of around 0.8 between all three variables. Thus a slight increase in ln(GDP), Sanitation or Life expectancy tends to accompany rise in the other two features increasing too.
Let us try to create a Linear Model to best estimate the relationship between the three variables.
Simple Univariate Linear Regression is a method for estimating the relationship \(y_i=f(x_i)\) of a response variable \(y\) with a predictor variable \(x\), as a line that closely fits the \(y\) vs. \(x\) scatter plot.
\[ y_i = \hat{a} + \hat{b} x_i + e_i \]
Where \(\hat{a}\) is the intercept, \(\hat{b}\) is the slope, and \(e_i\) is the ith residual error. We aim to minimize \(e_i\) for better fit.
Ordinary Least squares method reduces \(e_i\) by minimizing error sum of squares \(\sum{{e_i}^2}\).
Coefficient of Determination \(R^2\) is the proportion of the variation in \(y\) predictable by the model.
p-value of an estimated coefficient denotes the reliability of the estimate.
\[ \begin{aligned} \hat{b} = r\frac{s_y}{s_x} && \hat{a} = \bar{y} - \hat{b}\bar{x} && R^2 = 1 - \frac{\sum{{e_i}^2}}{\sum{(y-\bar{y}})^2} \end{aligned} \]
olssmry = function(
d, x_map, y_map,
x_lab=waiver(), y_lab=waiver(),
title=waiver()
){
model = lm(formula=y_map~x_map)
smry = summary(model, signif.stars=TRUE)
star = symnum(smry$coefficients[2, 4],
corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " ")
)
smryvec = c(
round(as.numeric(model$coefficients["(Intercept)"]), digits = 5),
round(as.numeric(model$coefficients["x_map"]), digits = 5),
round(smry$r.squared, digits = 5),
paste(signif(smry$coefficients[2, 4], digits = 5), star)
)
return(smryvec)
}
olstab = t(data.frame(
SvG = olssmry(d, d$lngdp, d$snt),
LvG = olssmry(d, d$lngdp, d$lfx),
LvS = olssmry(d, d$snt, d$lfx)
))
row.names(olstab) = c(
"*Sanitation vs. ln(GDP)*",
"*Life Exp. vs. ln(GDP)*",
"*Life Exp. vs. Sanitation*"
)
avg_r2 = round(mean(as.numeric(olstab[, 3])),digits = 1)
kable(
olstab,
digit = 5,
col.names=c(
"$\\hat{a}$",
"$\\hat{b}$",
"$R^2$",
"*p-value* of $\\hat{b}$"
)
)
\(\hat{a}\) | \(\hat{b}\) | \(R^2\) | p-value of \(\hat{b}\) | |
---|---|---|---|---|
Sanitation vs. ln(GDP) | -70.79844 | 16.77006 | 0.65059 | 1.4431e-44 *** |
Life Exp. vs. ln(GDP) | 30.24203 | 4.71876 | 0.59643 | 1.0702e-38 *** |
Life Exp. vs. Sanitation | 53.22795 | 0.23907 | 0.6618 | 6.7883e-46 *** |
Least absolute Deviation method reduces \(e_i\) by minimizing the sum of absolute deviations \(\sum{|e_i|}\).
ladsmry = function(
d, x_map, y_map,
x_lab=waiver(), y_lab=waiver(),
title=waiver()
){
model = rq(formula=y_map~x_map)
smry = summary(model)
smryvec = c(
as.numeric(model$coefficients[1]),
as.numeric(model$coefficients[2])
)
return(smryvec)
}
olstab = t(data.frame(
SvG = ladsmry(d, d$lngdp, d$snt),
LvG = ladsmry(d, d$lngdp, d$lfx),
LvS = ladsmry(d, d$snt, d$lfx)
))
row.names(olstab) = c(
"*Sanitation vs. ln(GDP)*",
"*Life Exp. vs. ln(GDP)*",
"*Life Exp. vs. Sanitation*"
)
kable(
olstab,
digit = 5,
col.names=c(
"$\\hat{a}$",
"$\\hat{b}$"
)
)
\(\hat{a}\) | \(\hat{b}\) | |
---|---|---|
Sanitation vs. ln(GDP) | -71.23153 | 16.80472 |
Life Exp. vs. ln(GDP) | 31.99047 | 4.61340 |
Life Exp. vs. Sanitation | 53.73041 | 0.23963 |
Plotting the estimated Linear Models on the Scatter Plots.
linearplot = function(
d, x_map, y_map,
x_lab=waiver(), y_lab=waiver(),
title=waiver()
){
olsvec = round(as.numeric(olssmry(d, x_map, y_map)[1:3]), digit=5)
ladvec = round(ladsmry(d, x_map, y_map), digit=5)
ols_str = paste("Ordinary Least Squares", "<br>",
"y =", olsvec[1], "+", olsvec[2], "x")
lad_str = paste("Least Absolute Deviation", "<br>",
"y =", ladvec[1], "+", ladvec[2], "x")
plot1 = ggplot(d, mapping = aes(x = x_map, y = y_map))+
geom_point(
alpha=0.6,
aes(text = country)
)+
mytheme+
labs(
x= x_lab,
y=y_lab,
title=title,
parse=TRUE
)+
geom_smooth(
method="lm",
formula=y~x,
se=FALSE,
aes(color = ols_str, text = ols_str)
)+
geom_smooth(
method="rq",
formula=y~x,
se=FALSE,
aes(color = lad_str, text = lad_str)
)+
labs(
color="Linear Model"
)+
scale_color_manual(
values=c(
"#cc241d80",
"#45858880",
"#689d6a80",
"#d65d0e80"
)
)
plot2 = ggplotly(plot1, tooltip = "text")
return(plot2)
}
Both OLS and LAD models fit the scatter plots very well. The OLS model is affected in the Life Expectancy plots due to the existence of Haiti as an outlier.
Our aim is to deduce which features more directly affect the Life Expectancy. Thus we should remove the effect of the other features when computing correlation.
Partial Correlation is the relationship between two variables \(x\), \(y\) of interest, after removing effect of some other related variable \(z\).
\[ \begin{aligned} &x_i = \hat{a_x} + \hat{b_x} z_i + e_{x,i} && y_i = \hat{a_y} + \hat{b_y} z_i + e_{y,i} \\ &\Rightarrow r_{x,y;z} = r_{e_{x}, e_{y}} \end{aligned} \]
partcor = pcor(d[, 2:4])$estimate
pcortab = data.frame(
row.names = "Variable",
Variable = c(
"*Sanitation vs. ln(GDP)*",
"*Life Exp.$\\quad$ vs. ln(GDP)*",
"*Life Exp. vs. Sanitation*"
),
PCor = c(
partcor[2, 1],
partcor[3, 1],
partcor[3, 2]
)
)
kable(pcortab,
col.names = c(
"Partial Correlation"
))
Partial Correlation | |
---|---|
Sanitation vs. ln(GDP) | 0.4826925 |
Life Exp.\(\quad\) vs. ln(GDP) | 0.3377892 |
Life Exp. vs. Sanitation | 0.5075384 |
Thus Life Expection is more directly improved by better Sanitaion access, than increase in wealth.
This indicates that wealthier countries tend to improve their sanitation infrastructure.
numerise = function(x){
x[grepl("k$", x)] <- as.numeric(sub("k$", "", x[grepl("k$", x)]))*10^3
x <- as.numeric(x)
return(x)
}
d1_raw = read.csv(file.path(".","Data","gdp.csv"), fileEncoding = 'UTF-8-BOM')
d2_raw = read.csv(file.path(".","Data","sanitation.csv"), fileEncoding = 'UTF-8-BOM')
d3_raw = read.csv(file.path(".","Data","life_expectancy.csv"), fileEncoding = 'UTF-8-BOM')
years = 2000:2019
yearnames = paste0("X", years)
makedata = function(yearname){
d1 = d1_raw[!is.na(numerise(d1_raw[, yearname])),][,c("country", yearname)]
colnames(d1)[2] = "lngdp"
d2 = d2_raw[!is.na(numerise(d2_raw[, yearname])),][,c("country", yearname)]
colnames(d2)[2] = "snt"
d3 = d3_raw[!is.na(numerise(d3_raw[, yearname])),][,c("country", yearname)]
colnames(d3)[2] = "lfx"
dtemp = merge(x = d1, y = d2, by = "country")
d = merge(x = dtemp, y = d3, by = "country")
d$lngdp = log(numerise(d$lngdp))
return(d)
}
Thus our analysis is fairly robust with respect to variation in Time. Thus we only perform the analysis on the year 2010.
Scatter Plots of Sanitation or Life Exp. with respect to raw GDP are not linearly correlated. As this was beyond our scope of knowledge, we instead considered the log (base \(e\)) of GDP.
Thus in reality, rise in GDP provides diminishing returns in citizen health the wealthier a country is.
Thus we have observed that Life expectancy increases with rise in log of GDP per capita and Sanitation facility access. Life Expectancy is also more directly affected by increase in basic Sanitation access than with rise in log of GDP per capita.
Governments should consider that:
Focusing on better sanitation services leads to visibly better lifespan of the citizens.
Improvements in GDP per capita make practical improvements in sanitation and life expectancy only for poorer countries, due diminishing returns in the log scale.
Improvements to life expectancies once made, are fairly stable.
Some countries had alarmingly low Life Expectancies:
The countries like the Central African Republic, Zimbabwe have very low life expectancies due to endemic poverty and weak governance, contributing to a dire health situation.
Haiti had a high child mortality rate in 2010 due to natural disasters and cholera outbreaks. This caused low Life expectancy for that year.