library(sf) 
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
wards <- sf::st_read("./DataRegression/LondonWards.shp") 
## Reading layer `LondonWards' from data source 
##   `C:\Users\rodri\OneDrive - stud.sbg.ac.at\CDE\2S_SpatialStatistics\Assignments\Assignment4\DataRegression\LondonWards.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 626 features and 73 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## Projected CRS: OSGB36 / British National Grid
ggplot(data = wards) + 
    geom_sf(aes(fill = AGc_2)) + 
    scale_fill_viridis_c(direction = -1) + 
    labs(fill='Average GCSE') 

ggplot(data = wards, aes(x = MdA_2013, y = AGc_2)) + 
  geom_point() + 
  xlab("Median Age - 2013") + 
  ylab("Average GCSE - 2014") + 
  theme_minimal() 

ggplot(data = wards, aes(x = MdA_2013, y = AGc_2)) + 
  geom_point() + 
  xlab("Median Age - 2013") + 
  ylab("Average GCSE - 2014") + 
  geom_smooth(method = "lm", color = "red", se = FALSE) + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

model1 <- lm(AGc_2 ~ MdA_2013, data=wards) 

summary(model1)
## 
## Call:
## lm(formula = AGc_2 ~ MdA_2013, data = wards)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.834 -12.915  -1.517  11.265  65.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 234.6639     6.6078   35.51   <2e-16 ***
## MdA_2013      2.6867     0.1904   14.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.8 on 624 degrees of freedom
## Multiple R-squared:  0.2419, Adjusted R-squared:  0.2407 
## F-statistic: 199.2 on 1 and 624 DF,  p-value: < 2.2e-16

The linear regression of the average GCSE score (2014) on median age (2013) has a positive slope: each additional year of median age is associated with an average GCSE score about 2.69 points higher. The intercept, 234.66, is the predicted score at a median age of 0, an extrapolation with no practical meaning. The model explains 24.19% of the variance in GCSE scores. The two variables are therefore positively correlated; however, this does not imply causality.
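
In a regression with a single predictor, R-squared equals the squared Pearson correlation, so an R-squared of 0.2419 corresponds to a correlation of roughly 0.49. The sketch below checks that identity on simulated data; the data frame `toy` and its values are hypothetical stand-ins that only reuse the column names of `wards`, not the real ward data.

```r
set.seed(1)

# Hypothetical stand-in for the ward data (simulated values, not the real ones)
toy <- data.frame(MdA_2013 = runif(200, min = 25, max = 45))
toy$AGc_2 <- 235 + 2.7 * toy$MdA_2013 + rnorm(200, sd = 19)

# Same model form as model1, fitted on the simulated data
fit <- lm(AGc_2 ~ MdA_2013, data = toy)

slope <- unname(coef(fit)["MdA_2013"])   # change in score per year of median age
r2    <- summary(fit)$r.squared          # share of variance explained
r     <- cor(toy$MdA_2013, toy$AGc_2)    # Pearson correlation

# With one predictor, R-squared and the squared correlation coincide
print(c(slope = slope, r_squared = r2, cor_squared = r^2))
```

The same calls (`coef()`, `summary()$r.squared`, `cor()`) should work on `model1` and the `wards` columns, assuming there are no missing values in either variable.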

Other factors are likely to have a more direct impact on GCSE scores, such as income, absence from class, and hours spent studying. At the individual level, a student who is one or two years younger can still score better than an older one, especially if he or she was more dedicated, studied more, or had better educational or financial support.
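
One way to see why such omitted factors matter is a small simulation. In the sketch below, a hypothetical standardised `income` variable drives both median age and GCSE score, while age has no direct effect on the score at all. All names and numbers are invented for illustration and are not taken from the ward data.

```r
set.seed(42)
n <- 626

# Hypothetical confounder (standardised); it drives both variables below
income     <- rnorm(n)
median_age <- 35 + 3 * income + rnorm(n, sd = 2)
gcse       <- 330 + 10 * income + rnorm(n, sd = 15)  # no direct age effect

sim <- data.frame(gcse, median_age, income)

# Naive model: age appears to matter because it proxies for income
naive <- lm(gcse ~ median_age, data = sim)

# Adjusted model: once income is included, the age slope shrinks towards zero
adjusted <- lm(gcse ~ median_age + income, data = sim)

naive_slope    <- unname(coef(naive)["median_age"])
adjusted_slope <- unname(coef(adjusted)["median_age"])
print(c(naive = naive_slope, adjusted = adjusted_slope))
```

The naive slope is clearly positive even though age has no effect by construction, which is the pattern a significant single-predictor model cannot rule out on its own.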

In short, the model is highly significant (p-value < 2.2e-16), but statistical significance alone does not establish causality.