| Title: | Machine Learning and Mapping for Spatial Epidemiology |
|---|---|
| Description: | Provides tools for the integration, visualisation, and modelling of spatial epidemiological data using the method described in Azeez, A., & Noel, C. (2025). 'Predictive Modelling and Spatial Distribution of Pancreatic Cancer in Africa Using Machine Learning-Based Spatial Model' <doi:10.5281/zenodo.16529986> and <doi:10.5281/zenodo.16529016>. It facilitates the analysis of geographic health data by combining modern spatial mapping tools with advanced machine learning (ML) algorithms. 'mlspatial' enables users to import and pre-process shapefile and associated demographic or disease incidence data, generate richly annotated thematic maps, and apply predictive models, including Random Forest, 'XGBoost', and Support Vector Regression, to identify spatial patterns and risk factors. It is suited for spatial epidemiologists, public health researchers, and GIS analysts aiming to uncover hidden geographic patterns in health-related outcomes and inform evidence-based interventions. |
| Authors: | Adeboye Azeez [aut, cre], Colin Noel [aut] |
| Maintainer: | Adeboye Azeez <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-06-03 08:25:15 UTC |
| Source: | https://github.com/azizadeboye/mlspatial |
A dataset containing spatial polygons of Africa.
africa_shp
An sf object with spatial features.
Your data source
A dataset containing spatial polygons of Africa.
africa_shps
An sf object with spatial features.
Your data source
Computes global and local Moran’s I to assess spatial autocorrelation and classifies observations into spatial cluster types (e.g., High-High).
compute_spatial_autocorr(sf_data, values, signif = 0.05)
| sf_data | An sf object. |
|---|---|
| values | A numeric vector or column name with the variable to test. |
| signif | Numeric significance level threshold for clusters (default 0.05). |
A named list with elements:
data: An sf object with added columns for standardized values,
spatial lag, local Moran's I values, z-scores, p-values, and cluster classification.
moran: An object of class htest with global Moran's I test results.
    library(sf)
    library(spdep)
    library(dplyr)
    # Load and prepare spatial data
    mapdata <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
    mapdata <- st_make_valid(mapdata)
    # Variable to analyze
    values <- rnorm(nrow(mapdata))
    # Run function
    result <- compute_spatial_autocorr(mapdata, values, signif = 0.05)
    # Inspect results
    head(result$data)
    result$moran
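To make the quantities returned by this function more concrete, here is a minimal base-R sketch of global Moran's I and the quadrant labels used for cluster types. It is an illustration under stated assumptions, not the package's implementation: the five-region toy data and the chain-shaped neighbourhood are made up, weights are row-standardised, and the significance test controlled by `signif` is left out entirely.

```r
# Toy example: 5 regions in a row, each neighbouring the adjacent ones.
# (Made-up values for illustration; not taken from the package.)
x <- c(10, 12, 11, 30, 32)
n <- length(x)

# Binary adjacency matrix, then row-standardised weights.
adj <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}
W <- adj / rowSums(adj)

# Standardise the variable and compute its spatial lag
# (the weighted mean of the neighbours' standardised values).
z <- (x - mean(x)) / sd(x)
lag_z <- as.vector(W %*% z)

# Global Moran's I: (n / S0) * (z' W z) / (z' z), with S0 the sum of all weights.
S0 <- sum(W)
global_I <- (n / S0) * sum(z * lag_z) / sum(z^2)

# Quadrant labels from the signs of z and its spatial lag.
# No significance filtering is applied in this sketch.
quadrant <- ifelse(z >= 0 & lag_z >= 0, "High-High",
            ifelse(z < 0 & lag_z < 0, "Low-Low",
            ifelse(z >= 0 & lag_z < 0, "High-Low", "Low-High")))

data.frame(x, z = round(z, 2), lag_z = round(lag_z, 2), quadrant)
global_I
```

In a real analysis the neighbourhood structure would come from the polygon geometry and only observations whose local p-value falls below the chosen threshold would normally be reported as clusters.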
Evaluates model performance by calculating RMSE, MAE, and R² metrics.
eval_model(model, data, formula, model_type = c("rf", "xgb", "svr"))
| model | A trained model |
|---|---|
| data | A data frame |
| formula | A formula object |
| model_type | Character string: one of "rf", "xgb", or "svr" |
A numeric value representing the model's accuracy
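For readers who want to see how the three metrics named above are defined, the following base-R sketch computes them from a pair of observed and predicted vectors. The helper `regression_metrics()` is hypothetical and not part of 'mlspatial'. R² is computed here as one minus the ratio of residual to total sum of squares, which is one common definition; the package may use a different one, such as the squared correlation.

```r
# Hypothetical helper for illustration; not a function exported by 'mlspatial'.
regression_metrics <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  err <- observed - predicted
  c(
    RMSE = sqrt(mean(err^2)),   # root mean squared error
    MAE  = mean(abs(err)),      # mean absolute error
    R2   = 1 - sum(err^2) / sum((observed - mean(observed))^2)
  )
}

observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)
metrics <- regression_metrics(observed, predicted)
metrics
```

Metrics computed on the same rows used for training tend to be optimistic, so a held-out set usually gives a more realistic picture.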
This is to suppress R CMD check notes about undefined global variables.
Join spatial and incidence datasets
join_data(sf_data, tbl_data, by)
| sf_data | sf object |
|---|---|
| tbl_data | tibble of incidence |
| by | Column name to join on |
sf object with joined attributes
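Joins on a name column fail silently when the keys are spelled differently in the two sources, so it can help to compare the keys first. The sketch below uses plain data frames and base R `merge()` as stand-ins. How `join_data()` performs the join internally is not documented here, and the region names and values are made up.

```r
# Made-up stand-ins for a shapefile's attribute table and an incidence table.
shape_attrs <- data.frame(NAME = c("Alpha", "Beta", "Gamma"),
                          stringsAsFactors = FALSE)
incidence_tbl <- data.frame(NAME = c("Alpha", "Beta", "Gama"),  # note the misspelling
                            incidence = c(1.5, 2.0, 0.8),
                            stringsAsFactors = FALSE)

# Keys present on one side only: these rows end up with NA after a left join.
missing_in_tbl   <- setdiff(shape_attrs$NAME, incidence_tbl$NAME)
missing_in_shape <- setdiff(incidence_tbl$NAME, shape_attrs$NAME)

# Left join that keeps every row of the spatial attribute table.
joined <- merge(shape_attrs, incidence_tbl, by = "NAME", all.x = TRUE)

missing_in_tbl
missing_in_shape
joined
```

Fixing the mismatched spellings before calling the join avoids polygons with missing values on the final map.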
Load incidence data from Excel
load_incidence_data(xlsx_path)
| xlsx_path | Path to Excel file |
|---|---|
tibble of data
Loads a shapefile as an sf object and optionally converts it to an sp (Spatial) object.
load_shapefile(shp_path, to_sp = FALSE)
| shp_path | Path to shapefile (.shp) |
|---|---|
| to_sp | logical: also return Spatial object? |
list with sf and optionally sp object
Examples for model evaluation functions
    library(randomForest)
    library(caret)
    data(panc_incidence)
    mapdata <- join_data(africa_shp, panc_incidence, by = "NAME")
    rf_model <- randomForest(
      incidence ~ female + male + agea + ageb + agec + fagea + fageb + fagec +
        magea + mageb + magec + yrb + yrc + yrd + yre,
      data = mapdata, ntree = 500, importance = TRUE
    )
    rf_preds <- predict(rf_model, newdata = mapdata)
    rf_metrics <- postResample(pred = rf_preds, obs = mapdata$incidence)
    print(rf_metrics)
This dataset contains pancreatic cancer incidence rates across African countries.
data(panc_incidence)
A data frame with the following variables:
Character. Name of the country.
Double. Incidence rate per 100,000 population.
Double. Female pancreatic cancer patients.
Double. Male pancreatic cancer patients.
Double. Patients aged 20-54 years.
Double. Patients aged above 55 years.
Double. Patients aged below 20 years.
Double. Female patients aged 20-54 years.
Double. Female patients aged above 55 years.
Double. Female patients aged below 20 years.
Double. Male patients aged 20-54 years.
Double. Male patients aged above 55 years.
Double. Male patients aged below 20 years.
Double. Incidence rate in year 2017.
Double. Incidence rate in year 2018.
Double. Incidence rate in year 2019.
Double. Incidence rate in year 2020.
Double. Incidence rate in year 2021.
Global Burden of Disease (GBD) 2021 estimates, Seattle, United States https://vizhub.healthdata.org/gbd-results/
This dataset contains pancreatic cancer prevalence rates across African countries.
data(panc_prevalence)
A data frame with the following variables:
Character. Name of the country.
Numeric. Prevalence rate per 100,000 population.
Numeric. Female pancreatic cancer patients.
Numeric. Male pancreatic cancer patients.
Numeric. Patients aged 20-54 years.
Numeric. Patients aged above 55 years.
Numeric. Patients aged below 20 years.
Numeric. Female patients aged 20-54 years.
Numeric. Female patients aged above 55 years.
Numeric. Female patients aged below 20 years.
Numeric. Male patients aged 20-54 years.
Numeric. Male patients aged above 55 years.
Numeric. Male patients aged below 20 years.
Numeric. Incidence rate in year 2017.
Numeric. Incidence rate in year 2018.
Numeric. Incidence rate in year 2019.
Numeric. Incidence rate in year 2020.
Numeric. Incidence rate in year 2021.
Global Burden of Disease (GBD) 2021 estimates, Seattle, United States https://vizhub.healthdata.org/gbd-results/
This dataset contains pancreatic cancer mortality rates across African countries.
data(pancre_mort)
A data frame with the following variables:
Character. Name of the country.
Numeric. Mortality rate per 100,000 population.
Numeric. Female pancreatic cancer patients.
Numeric. Male pancreatic cancer patients.
Numeric. Patients aged 20-54 years.
Numeric. Patients aged above 55 years.
Numeric. Patients aged below 20 years.
Numeric. Female patients aged 20-54 years.
Numeric. Female patients aged above 55 years.
Numeric. Female patients aged below 20 years.
Numeric. Male patients aged 20-54 years.
Numeric. Male patients aged above 55 years.
Numeric. Male patients aged below 20 years.
Numeric. Incidence rate in year 2017.
Numeric. Incidence rate in year 2018.
Numeric. Incidence rate in year 2019.
Numeric. Incidence rate in year 2020.
Numeric. Incidence rate in year 2021.
Global Burden of Disease (GBD) 2021 estimates, https://vizhub.healthdata.org/gbd-results/
Arrange a list of tmap objects into a grid layout.
plot_map_grid(maps, ncol = 2)
| maps | A list of tmap objects. |
|---|---|
| ncol | Number of columns in the grid (default is 2). |
A tmap object representing arranged maps.
    library(sf)
    library(tmap)
    # Load sample spatial data
    nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
    # Add mock variables to map
    nc$var1 <- runif(nrow(nc), 0, 100)
    nc$var2 <- runif(nrow(nc), 10, 200)
    # Create individual maps
    map1 <- tm_shape(nc) + tm_fill("var1", title = "Variable 1")
    map2 <- tm_shape(nc) + tm_fill("var2", title = "Variable 2")
    # Arrange the maps in a grid using your function
    plot_map_grid(list(map1, map2), ncol = 2)
Creates a scatterplot of observed vs predicted values, with a 1:1 reference line and Pearson's R².
plot_obs_vs_pred(observed, predicted, title = "")
| observed | Numeric vector of observed values. |
|---|---|
| predicted | Numeric vector of predicted values. |
| title | String for the plot title (default: ""). |
No return value; called for side effect of displaying a plot.
    observed <- c(10, 20, 30, 40)
    predicted <- c(12, 18, 33, 39)
    plot_obs_vs_pred(observed, predicted, title = "Observed vs Predicted")
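As a rough picture of what such a plot contains, the base-graphics sketch below draws the same example vectors with a 1:1 reference line and reports the squared Pearson correlation. It is an approximation for illustration only: the plotting system, labels, and styling that `plot_obs_vs_pred()` actually uses are not documented here.

```r
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

# Squared Pearson correlation between observed and predicted values.
r2 <- cor(observed, predicted)^2

# Use one shared axis range so the 1:1 line runs corner to corner.
lims <- range(c(observed, predicted))
plot(observed, predicted, xlim = lims, ylim = lims,
     xlab = "Observed", ylab = "Predicted",
     main = sprintf("Observed vs Predicted (R2 = %.3f)", r2))
abline(a = 0, b = 1, lty = 2)  # 1:1 reference line: perfect predictions fall on it
```

Points above the dashed line are over-predictions and points below it are under-predictions.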
Creates a thematic map using the tmap package for a single variable in an sf object.
plot_single_map(sf_data, var, title, palette = "reds")
| sf_data | An sf object containing spatial data. |
|---|---|
| var | Variable name as a string to map. |
| title | Legend title for the fill legend. |
| palette | Color palette for the map (default is "reds"). |
A tmap object representing the thematic map.
    library(sf)
    # Create example sf object
    nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
    nc$incidence <- runif(nrow(nc), 0, 100)
    # Plot
    p1 <- plot_single_map(nc, "incidence", "Incidence")
Trains a Random Forest regression model.
train_rf(data, formula, ntree = 500, seed = 123)
| data | A data frame containing the training data. |
|---|---|
| formula | A formula describing the model structure. |
| ntree | Number of trees to grow (default 500). |
| seed | Random seed for reproducibility (default 123). |
A trained randomForest model object.
    library(randomForest)
    data(mtcars)
    rf_model <- train_rf(mtcars, mpg ~ cyl + hp + wt, ntree = 100)
    print(rf_model)
Train Support Vector Regression (SVR) model
train_svr(data, formula)
| data | A data frame containing the training data. |
|---|---|
| formula | A formula specifying the model. |
Trains an SVR model using the radial kernel.
A trained svm model object from the e1071 package.
    # Load required package
    library(e1071)
    # Use built-in dataset
    data(mtcars)
    # Define regression formula
    svr_formula <- mpg ~ cyl + disp + hp + wt
    # Train SVR model
    svr_model <- train_svr(data = mtcars, formula = svr_formula)
    # Print model summary
    print(svr_model)
    # Predict on the same data (for illustration)
    preds <- predict(svr_model, newdata = mtcars)
    head(preds)
Train XGBoost model
train_xgb(data, formula, nrounds = 100, max_depth = 4, learning_rate = 0.1)
| data | A data frame with the training data. |
|---|---|
| formula | A formula defining the model structure. |
| nrounds | Number of boosting iterations. |
| max_depth | Maximum tree depth. |
| learning_rate | Learning rate for boosting. |
Trains an XGBoost regression model.
A trained xgboost model object.
    # Load required package
    library(xgboost)
    # Use built-in dataset
    data(mtcars)
    # Define regression formula
    xgb_formula <- mpg ~ cyl + disp + hp + wt
    # Train XGBoost model
    xgb_model <- train_xgb(data = mtcars, formula = xgb_formula, nrounds = 50)
    # Print model summary
    print(xgb_model)
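XGBoost works on numeric matrices rather than on formulas, so a formula interface has to translate the formula and data frame into a predictor matrix and a response vector at some point. The base-R sketch below shows one common way to do that translation for the example formula. Whether `train_xgb()` does exactly this internally is an assumption.

```r
data(mtcars)
xgb_formula <- mpg ~ cyl + disp + hp + wt

# Build the model frame once so predictors and response stay row-aligned.
mf <- model.frame(xgb_formula, data = mtcars)

# Numeric design matrix; the first column is the intercept, which is dropped.
X <- model.matrix(xgb_formula, data = mf)[, -1, drop = FALSE]

# Response vector taken from the same model frame.
y <- model.response(mf)

dim(X)
colnames(X)
length(y)
```

Factor predictors, if present, would be expanded into dummy columns by `model.matrix()`, which is worth keeping in mind when reading feature importances.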