rm(list=ls())
seed <- 12345
set.seed(seed)
# the amount of ppl with disease X
pX = list()
# the amount of ppl with disease Y
pY = list()
# the amount of ppl that vaccinate against disease Y
pVY = list()
# the amount of ppl tested for X
pTX = list()
# the amount of ppl which appear as positive for X
pPX = list()
months = 60
mean_p = 10000
sd_p = 1000
# p=population, sampled and studied for a given amount of months
p = floor(rnorm(months, mean_p, sd_p))
# 1% of p have contracted X
pX = floor(0.01 * p + rnorm(months, 0, 5))
# as Y is a very common disease, 95% of p is vaccinated against it
pVY = pmin(floor(0.95 * p + rnorm(months, 0, 100)),p)
# only 0.1% of the vaccinated contract Y, while 20% of the non-vaccinated do
pY = pmax(floor(0.001 * pVY + 0.2 * (p-pVY) + rnorm(months, 0, 5)),0)
# 70% of the vaccinated against Y are also tested for X
# (out of the non-vaccinated, only 10% are tested)
pTX = floor(0.3 * pVY + 0.1 * (p-pVY) + rnorm(months, 0, 200))
# 99% of the times, somebody who has contracted X and tested for it will be positive
pPX = floor(0.99 * pTX * pX/p + rnorm(months, 0, 2))
################################# MAIN CODE ###########################################
cat("
-------------------------------------------------------------------------------------
Welcome to the second ML demo/homework (year 2017): correlation and causation
We want to study the relationship between some vaccines and the presence of some
diseases in people. To do this, we are going to generate a synthetic dataset
which follows a simple model, aimed at mimicking some the dynamics that might
take place in this field.
Disclaimer: some people were vaccinated and/or contracted diseases while
preparing this homework. However, not *because of* this homework. Another
example of correlation without causation, I guess.
")
invisible(readline(prompt = "Press [enter] to continue"))
# study the positives for X as a function of the vaccinated ones
fit = lm(pPX ~ pVY)
plot(pVY, pPX)
abline(fit, col='gray')
cat("
What you see in the figure is a plot of the number of people being tested
as positive to disease X (pPX) as a function of the number of people being
vaccinated against disease Y (pVY). Now answer the following questions:
- how large is the correlation between the two variables? Is it also visually
evident?
- the plotted line is the result of a fitted linear model: what are the
learned parameters? Is there an actual relationship between the two
variables? Should the null hypothesis be rejected or not?
- from the data you currently have (pVY and pPX), can you state there is
a causal relationship between the number of vaccinated people and the
number of those who caught disease X?
(NOTE: to take control of the interactive console and run commands now, just
hit the ESC key)
")
invisible(readline(prompt = "Press [enter] to continue"))
cat("
With the doubt that vaccinating for Y causes X and pressures, hospitals decide
to increase the number of tests for the X disease. However, as the cost of
these tests is not negligible, the don't do the same for the people who have
not been vaccinated. Check out the code below here and answer the following
questions:
- tests on pVY were increased from 30% (0.3) to 60% (0.6): what happens to pPX,
its correlation with pVY, and the null hypothesis on our fit?
- what happens when, alarmed by the results, the tests were extended to 90%
of the population of vaccinated patients pVY?
")
set.seed(seed)
# 30% of the vaccinated are tested (CHANGE), only 10% of the non-vaccinated ones are
pTX = floor(0.3 * pVY + 0.1 * (p-pVY) + rnorm(months, 0, 200))
# 99% of the times, somebody who has contracted X and tested for it will be positive
pPX = floor(0.99 * pTX * pX/p + rnorm(months, 0, 2))
fit = lm(pPX ~ pVY)
plot(pVY, pPX)
abline(fit, col='gray')
invisible(readline(prompt = "Press [enter] to continue"))
cat("
A a data scientist, you intuitively reject the possibility of a causal
relationship between pVY and pPX without any additional information, and
look for more data to see if there is any lurking variable which might
better explain what is happening. You finally get your hands on the total
numbers of people tested for X (pTX).
- what is the correlation of pPX with this new variable?
- what happens when you try to fit a multiple linear regression on both
pTX and pVY?
")
invisible(readline(prompt = "Press [enter] to continue"))
### COMPLETE THE CODE HERE
# fit = ...
# summary(fit)
cat("
In the meantime, a lot of people panic and stop vaccinating. The number
of people who get a vaccine for disease Y drops from 95% to 70%. Update
the code below and answer the following questions:
- did reducing the vaccine reduce the number of people *found as positive*
to disease X? (NOTE: you can calculate this as sum(pPX))
- did reducing the vaccine reduce the number of people *who have disease
X*?
- what happens to the number of people contracting Y?
")
invisible(readline(prompt = "Press [enter] to continue"))
set.seed(seed)
# p=population, sampled and studied for a given amount of months
p = floor(rnorm(months, mean_p, sd_p))
# 1% of p have contracted X
pX = floor(0.01 * p + rnorm(months, 0, 5))
# as Y is a very common disease, 95% of p is vaccinated against it (CHANGE)
pVY = pmin(floor(0.95 * p + rnorm(months, 0, 100)),p)
# only 0.1% of the vaccinated contract Y, while 20% of the non-vaccinated do
pY = pmax(floor(0.001 * pVY + 0.2 * (p-pVY) + rnorm(months, 0, 5)),0)
# 90% of the vaccinated against Y are also tested for X
# (out of the non-vaccinated, only 10% are tested)
pTX = floor(0.9 * pVY + 0.1 * (p-pVY) + rnorm(months, 0, 200))
# 99% of the times, somebody who has contracted X and tested for it will be positive
pPX = floor(0.99 * pTX * pX/p + rnorm(months, 0, 2))
cat("
Years later (not many), after humanity almost extinguishes due to an outbreak
of disease Y, you finally get access to all the data that was collected (but,
unfortunately, not made public) about X and Y. This is exactly the same type
of data that we simulated and you should know well by now. Try running a
multiple linear regression using all the variables and comment the results
(what should be eliminated by the model according to the null hypothesis?).
Then perform attribute selection and comment: which were the hidden variables
that should have been taken into account to describe the number of patients
found as affected by disease X?
")
invisible(readline(prompt = "Press [enter] to continue"))
# for attribute selection, you can run the following:
# fit = regsubsets(pPX~., data.frame(cbind(p,pX,pVY,pY,pTX,pPX)))