TidyTuesday Exercise

Introduction

Data Source

This analysis uses data from the TidyTuesday project, a weekly data science initiative that provides real-world datasets for practice.

The dataset for this week was obtained from the TidyTuesday GitHub repository.

This dataset focuses on bird sightings at sea, recorded from ship-based observations between 1969 and 1990. The data includes information on bird species, counts, behavior, location, and environmental conditions such as weather and sea state.

Data Cleaning and Wrangling

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

birds <- read_csv("data/birds.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 49019 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): species_common_name, species_scientific_name, species_abbreviation...
dbl  (9): bird_observation_id, record_id, count, n_feeding, n_sitting_on_wat...
lgl (11): sex, feeding, sitting_on_water, sitting_on_ice, sitting_on_ship, i...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ships <- read_csv("data/ships.csv")

Rows: 12310 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (7): hemisphere, activity, cloud_cover, precipitation, observer, censu...
dbl  (12): record_id, latitude, longitude, speed, direction, wind_speed_clas...
date  (1): date
time  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sea_states <- read_csv("data/sea_states.csv")

Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sea_state_description
dbl (3): sea_state_class, wave_meters_min, wave_meters_max

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

beaufort_scale <- read_csv("data/beaufort_scale.csv")

Rows: 13 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): wind_description
dbl (3): wind_speed_class, wind_speed_knots_min, wind_speed_knots_max

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(birds)

# A tibble: 6 × 26
  bird_observation_id record_id species_common_name       species_scientific_n…¹
                <dbl>     <dbl> <chr>                     <chr>                 
1                   1   1083001 Royal / Wandering albatr… Diomedea epomophora /…
2                   2   1083001 Black-browed albatross s… Diomedea impavida / m…
3                   3   1083001 Cape petrel               Daption capense       
4                   4   1083001 Fairy prion               Pachyptila turtur     
5                   5   1083001 Sooty shearwater          Puffinus griseus      
6                   6   1084001 Royal albatross sensu la… Diomedea epomophora /…
# ℹ abbreviated name: ¹species_scientific_name
# ℹ 22 more variables: species_abbreviation <chr>, age <chr>,
#   wan_plumage_phase <chr>, plumage_phase <chr>, sex <lgl>, count <dbl>,
#   n_feeding <dbl>, feeding <lgl>, n_sitting_on_water <dbl>,
#   sitting_on_water <lgl>, n_sitting_on_ice <dbl>, sitting_on_ice <lgl>,
#   sitting_on_ship <lgl>, in_hand <lgl>, n_flying_past <dbl>,
#   flying_past <lgl>, n_accompanying <dbl>, accompanying <lgl>, …

glimpse(birds)

Rows: 49,019
Columns: 26
$ bird_observation_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ record_id               <dbl> 1083001, 1083001, 1083001, 1083001, 1083001, 1…
$ species_common_name     <chr> "Royal / Wandering albatross", "Black-browed a…
$ species_scientific_name <chr> "Diomedea epomophora / sanfordi / antipodensis…
$ species_abbreviation    <chr> "DIOEPOSANANTEXU", "DIOIMPMEL", "DAPCAP", "PAC…
$ age                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ wan_plumage_phase       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ plumage_phase           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sex                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ count                   <dbl> 6, 2, 8, 2, 4, 10, 2, 18, 10, 2, 2, 2, 8, 1, 1…
$ n_feeding               <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, NA, NA, 0, …
$ feeding                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ n_sitting_on_water      <dbl> 0, 0, 0, 0, 0, 2, 0, NA, 0, 0, NA, NA, NA, 0, …
$ sitting_on_water        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ n_sitting_on_ice        <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, NA, NA, 0, …
$ sitting_on_ice          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ sitting_on_ship         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ in_hand                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ n_flying_past           <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, NA, NA, 0, …
$ flying_past             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ n_accompanying          <dbl> 6, 2, 8, 2, 4, 0, 0, NA, 0, 0, NA, NA, NA, 0, …
$ accompanying            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TR…
$ n_following_ship        <dbl> 0, 0, 0, 0, 0, 8, 2, NA, 10, 2, NA, NA, NA, 1,…
$ following_ship          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,…
$ moulting                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ naturally_feeding       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

summary(birds)

 bird_observation_id   record_id        species_common_name
 Min.   :    1       Min.   : 1083001   Length:49019       
 1st Qu.:12256       1st Qu.: 6059010   Class :character   
 Median :24510       Median :13004002   Mode  :character   
 Mean   :24510       Mean   :39568084                      
 3rd Qu.:36765       3rd Qu.:84016021                      
 Max.   :49019       Max.   :88007036                      
                                                           
 species_scientific_name species_abbreviation     age           
 Length:49019            Length:49019         Length:49019      
 Class :character        Class :character     Class :character  
 Mode  :character        Mode  :character     Mode  :character  
                                                                
                                                                
                                                                
                                                                
 wan_plumage_phase  plumage_phase        sex              count         
 Length:49019       Length:49019       Mode:logical   Min.   :    1.00  
 Class :character   Class :character   NA's:49019     1st Qu.:    1.00  
 Mode  :character   Mode  :character                  Median :    2.00  
                                                      Mean   :   41.93  
                                                      3rd Qu.:    4.00  
                                                      Max.   :99999.00  
                                                      NA's   :2699      
   n_feeding         feeding        n_sitting_on_water sitting_on_water
 Min.   :    0.00   Mode :logical   Min.   :    0.00   Mode :logical   
 1st Qu.:    0.00   FALSE:24320     1st Qu.:    0.00   FALSE:24427     
 Median :    0.00   TRUE :3604      Median :    0.00   TRUE :3470      
 Mean   :   11.31   NA's :21095     Mean   :    4.13   NA's :21122     
 3rd Qu.:    0.00                   3rd Qu.:    0.00                   
 Max.   :99999.00                   Max.   :20000.00                   
 NA's   :26448                      NA's   :26448                      
 n_sitting_on_ice   sitting_on_ice  sitting_on_ship  in_hand       
 Min.   :  0.0000   Mode :logical   Mode :logical   Mode :logical  
 1st Qu.:  0.0000   FALSE:27847     FALSE:27801     FALSE:27883    
 Median :  0.0000   TRUE :41        TRUE :86        TRUE :3        
 Mean   :  0.0249   NA's :21131     NA's :21132     NA's :21133    
 3rd Qu.:  0.0000                                                  
 Max.   :300.0000                                                  
 NA's   :26448                                                     
 n_flying_past      flying_past     n_accompanying      accompanying   
 Min.   :    0.00   Mode :logical   Min.   :   0.0000   Mode :logical  
 1st Qu.:    0.00   FALSE:16163     1st Qu.:   0.0000   FALSE:21227    
 Median :    0.00   TRUE :11728     Median :   0.0000   TRUE :6659     
 Mean   :   23.61   NA's :21128     Mean   :   0.3396   NA's :21133    
 3rd Qu.:    1.00                   3rd Qu.:   0.0000                  
 Max.   :99999.00                   Max.   :1000.0000                  
 NA's   :26448                      NA's   :26448                      
 n_following_ship  following_ship   moulting       naturally_feeding
 Min.   : 0.0000   Mode :logical   Mode :logical   Mode :logical    
 1st Qu.: 0.0000   FALSE:16068     FALSE:30        FALSE:26867      
 Median : 0.0000   TRUE :11819     TRUE :83        TRUE :982        
 Mean   : 0.9958   NA's :21132     NA's :48906     NA's :21170      
 3rd Qu.: 1.0000                                                    
 Max.   :50.0000                                                    
 NA's   :26448

head(ships)

# A tibble: 6 × 21
  record_id date       time   latitude longitude hemisphere activity       speed
      <dbl> <date>     <time>    <dbl>     <dbl> <chr>      <chr>          <dbl>
1   1083001 1975-10-15 14:00     -45.9      165. E          steaming, sai…  15  
2   1084001 1975-11-03 13:10     -35.5      125  E          steaming, sai…  14  
3   1084002 1975-11-04 14:20     -37.7      132. E          steaming, sai…  14.5
4   1084003 1975-11-08 16:15     -40        162  E          steaming, sai…  14.6
5   1086001 1975-11-16 12:30     -36.2      175. E          steaming, sai…  15  
6   1086002 1975-11-16 15:30     -35.4      175. E          steaming, sai…  15  
# ℹ 13 more variables: direction <dbl>, cloud_cover <chr>, precipitation <chr>,
#   wind_speed_class <dbl>, wind_direction <dbl>, air_temperature <dbl>,
#   pressure <dbl>, sea_state_class <dbl>, sea_surface_temperature <dbl>,
#   depth <dbl>, observer <chr>, census_method <chr>, season <chr>

glimpse(ships)

Rows: 12,310
Columns: 21
$ record_id               <dbl> 1083001, 1084001, 1084002, 1084003, 1086001, 1…
$ date                    <date> 1975-10-15, 1975-11-03, 1975-11-04, 1975-11-0…
$ time                    <time> 14:00:00, 13:10:00, 14:20:00, 16:15:00, 12:30…
$ latitude                <dbl> -45.917, -35.533, -37.667, -40.000, -36.167, -…
$ longitude               <dbl> 165.400, 125.000, 132.250, 162.000, 174.917, 1…
$ hemisphere              <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "…
$ activity                <chr> "steaming, sailing", "steaming, sailing", "ste…
$ speed                   <dbl> 15.0, 14.0, 14.5, 14.6, 15.0, 15.0, 15.0, 15.0…
$ direction               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cloud_cover             <chr> "overcast", "partially cloudy", "overcast", "p…
$ precipitation           <chr> "showers", "none", "none", "squalls", "none", …
$ wind_speed_class        <dbl> 5, 4, 4, 4, 0, 3, 3, 2, 3, 2, 7, 4, 6, 6, 6, 6…
$ wind_direction          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ air_temperature         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pressure                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sea_state_class         <dbl> 5, 4, 4, 4, 1, 3, 3, 3, 4, 3, 5, 4, 5, 5, 5, 5…
$ sea_surface_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ depth                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ observer                <chr> "D. Jeffcock", "D. Jeffcock", "D. Jeffcock", "…
$ census_method           <chr> "full", "full", "full", "full", "full", "full"…
$ season                  <chr> "spring", "spring", "spring", "spring", "sprin…

Exploratory Data Analysis (EDA)

ggplot(ships, aes(x = sea_state_class)) +
  geom_bar() +
  labs(title = "Distribution of Sea State",
       x = "Sea State Class",
       y = "Count")

Warning: Removed 4751 rows containing non-finite outside the scale range
(`stat_count()`).

Sea state is measured on a scale from 0 (calm) to 6 (very rough). The distribution shows that most observations occur at moderate sea conditions (classes 3–4), while very calm and very rough conditions are less frequent.

ggplot(ships, aes(x = wind_speed_class)) +
  geom_bar() +
  labs(title = "Distribution of Wind Speed (Beaufort Scale)",
       x = "Wind Speed Class",
       y = "Count")

Warning: Removed 4688 rows containing non-finite outside the scale range
(`stat_count()`).

ggplot(ships, aes(x = sea_state_class, fill = factor(wind_speed_class))) +
  geom_bar(position = "dodge") +
  labs(title = "Sea State vs Wind Speed",
       x = "Sea State",
       fill = "Wind Class")

Warning: Removed 4751 rows containing non-finite outside the scale range
(`stat_count()`).

The relationship between sea state and wind speed shows that higher sea states tend to occur with higher wind speeds. Calm sea conditions are generally associated with low wind speeds, while rougher sea conditions correspond to moderate to high wind levels.
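
The visual impression above can also be summarized with a single number. The following is a minimal sketch, assuming a Spearman rank correlation is an acceptable summary for two ordinal scales. It runs on a small made-up data frame (`toy_ships` and all of its values are hypothetical) rather than on `ships`, so it implies nothing about the real data.

```r
# Hypothetical stand-in for ships; values are invented for illustration only
toy_ships <- data.frame(
  sea_state_class  = c(0, 1, 2, 3, 3, 4, 5, NA, 6),
  wind_speed_class = c(0, 2, 2, 3, 4, 5, 6, 4, NA)
)

# Cross-tabulate the two ordinal scales; table() drops rows with NA by default
sea_by_wind <- table(toy_ships$sea_state_class, toy_ships$wind_speed_class)

# Spearman correlation works on ranks, which suits ordered classes;
# use = "complete.obs" drops rows where either class is missing
rho <- cor(
  toy_ships$sea_state_class,
  toy_ships$wind_speed_class,
  method = "spearman",
  use = "complete.obs"
)

print(sea_by_wind)
print(rho)
```

Applied to the real data, the same two calls would use `ships$sea_state_class` and `ships$wind_speed_class`.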

Research Question

This analysis investigates whether environmental conditions are associated with the presence of albatross during ship-based observations.

The outcome of interest is albatross presence (present vs absent) for each ship record. The main predictor is sea state class, and wind speed class (Beaufort scale) is included as a secondary predictor.

Specifically, we ask: Is sea state associated with the presence of albatross, after accounting for wind speed?

Data Preparation for Modeling

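# create binary outcome: 1 = albatross present, 0 = absent
# note: sum(count) is NA if any count in a record is missing; those records
# are recoded to 0 by the is.na() step after the join below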
birds_albatross <- birds %>%
  filter(str_detect(species_common_name, regex("albatross", ignore_case = TRUE))) %>%
  group_by(record_id) %>%
  summarise(albatross_present = ifelse(sum(count) > 0, 1, 0))

head(birds_albatross)

# A tibble: 6 × 2
  record_id albatross_present
      <dbl>             <dbl>
1   1083001                 1
2   1084001                 1
3   1084002                 1
4   1084003                 1
5   1086003                 1
6   1086004                 1
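
One detail of the summarise step is worth checking: `summary(birds)` reports 2,699 missing values in `count`, and `sum()` returns `NA` as soon as one value is missing. The minimal sketch below uses invented data (the `toy_birds` rows are hypothetical) to show that behaviour in base R, next to one possible alternative that flags presence whenever a record has at least one albatross row. Whether a row with a missing count should count as a sighting is an assumption, and this is an illustration, not the code that produced the results in this document.

```r
# Hypothetical albatross rows: record 1 has one known and one missing count,
# record 2 has only a missing count, record 3 has a known count
toy_birds <- data.frame(
  record_id = c(1, 1, 2, 3),
  count     = c(2, NA, NA, 5)
)

# sum() propagates NA, so ifelse() returns NA for records 1 and 2
flag_sum <- tapply(
  toy_birds$count, toy_birds$record_id,
  function(x) ifelse(sum(x) > 0, 1, 0)
)

# Alternative (assumption): any albatross row at all counts as a sighting
flag_any <- tapply(
  toy_birds$count, toy_birds$record_id,
  function(x) as.numeric(length(x) > 0)
)

print(flag_sum)
print(flag_any)
```

Under the first rule, records 1 and 2 carry an `NA` flag, which a later `is.na()` recode would turn into 0. Under the second, all three records are flagged as present.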

data <- ships %>%
  left_join(birds_albatross, by = "record_id") %>%
  mutate(albatross_present = ifelse(is.na(albatross_present), 0, albatross_present))

data <- data %>%
  select(sea_state_class, wind_speed_class, albatross_present) %>%
  drop_na()

data <- data %>%
  mutate(albatross_present = factor(albatross_present))

data %>%
  count(albatross_present)

# A tibble: 2 × 2
  albatross_present     n
  <fct>             <int>
1 0                  3024
2 1                  4521

glimpse(data)

Rows: 7,545
Columns: 3
$ sea_state_class   <dbl> 5, 4, 4, 4, 1, 3, 3, 3, 4, 3, 5, 4, 5, 5, 5, 5, 5, 4…
$ wind_speed_class  <dbl> 5, 4, 4, 4, 0, 3, 3, 2, 3, 2, 7, 4, 6, 6, 6, 6, 6, 4…
$ albatross_present <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0…

ggplot(data, aes(x = sea_state_class, fill = albatross_present)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Albatross Presence by Sea State",
    x = "Sea State Class",
    y = "Proportion",
    fill = "Albatross Presence"
  )

The plot suggests that albatross presence increases with higher sea-state classes. In calmer conditions, albatross are less frequently observed, while in rougher sea conditions, the proportion of observations with albatross is higher. This pattern indicates that sea state may be an important predictor of albatross presence and supports further modeling.

Wind speed is also related to sea state and may influence observations, so both variables are carried forward as predictors.

I next build predictive models to evaluate whether sea state and wind speed can reliably predict albatross presence.

Modeling

Modeling Setup

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.2      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.0.1 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
✔ recipes      1.3.1      ✔ yardstick    1.3.2

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

set.seed(123)
# split data into training (75%) and testing (25%)

data_split <- initial_split(data, prop = 0.75, strata = albatross_present)

train_data <- training(data_split)
test_data  <- testing(data_split)

tibble(
  set = c("training", "test"),
  rows = c(nrow(train_data), nrow(test_data))
)

# A tibble: 2 × 2
  set       rows
  <chr>    <int>
1 training  5658
2 test      1887

The dataset was split into training (75%) and test (25%) sets using stratified sampling to preserve the proportion of albatross presence in both sets.
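
To illustrate what stratification does, here is a minimal base R sketch that samples 75% of the rows within each outcome class separately. The `outcome` vector and its 40/60 class balance are hypothetical, and this is a simplified illustration of the idea, not the internals of `initial_split()`.

```r
set.seed(123)

# Hypothetical binary outcome with a 40/60 class balance
outcome <- factor(rep(c("0", "1"), times = c(400, 600)))

# Sample 75% of the row indices within each class separately
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- unlist(lapply(
  idx_by_class,
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

train_outcome <- outcome[train_idx]
test_outcome  <- outcome[-train_idx]

# Both sets keep the original 40/60 balance
print(prop.table(table(train_outcome)))
print(prop.table(table(test_outcome)))
```

Sampling within each class is what keeps the class proportions the same in both sets; a purely random split would only match them approximately.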

To compare models fairly, model development was carried out on the training data using cross-validation. Since the outcome is binary (albatross present or absent), classification models are appropriate. Sea state and wind speed were used as predictors.

# create 5-fold cross-validation for model evaluation
cv_folds <- vfold_cv(train_data, v = 5, strata = albatross_present)

albatross_recipe <- recipe(
  albatross_present ~ sea_state_class + wind_speed_class,
  data = train_data
)

class_metrics <- metric_set(roc_auc, accuracy, sens, spec)

Three different classification models were considered:

- Logistic regression
- LASSO
- Random forest

Model 1: Logistic Regression

Logistic regression was used as the first model because it is a simple and interpretable method for binary classification. It provides a useful baseline for comparison with more flexible models.

log_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_workflow <- workflow() %>%
  add_model(log_model) %>%
  add_recipe(albatross_recipe)

log_results <- fit_resamples(
  log_workflow,
  resamples = cv_folds,
  metrics = class_metrics
)

collect_metrics(log_results)

# A tibble: 4 × 6
  .metric  .estimator  mean     n std_err .config        
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy binary     0.607     5 0.00268 pre0_mod0_post0
2 roc_auc  binary     0.605     5 0.00302 pre0_mod0_post0
3 sens     binary     0.108     5 0.00502 pre0_mod0_post0
4 spec     binary     0.941     5 0.00496 pre0_mod0_post0

Model 2: LASSO

LASSO was used as a second model because it extends logistic regression by applying regularization, which can help control model complexity while maintaining interpretability.
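
To make "regularization" concrete: with an L1 penalty (`mixture = 1`), coordinate-descent solvers shrink coefficients through a soft-thresholding step, which pulls each value toward zero by the penalty amount and sets small values exactly to zero. The sketch below implements only that operator in base R. The function name `soft_threshold` and the example values are illustrative, and this is not glmnet's actual code.

```r
# Soft-thresholding: shrink z toward zero by lambda, clipping at zero
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalized coefficient values
coefs <- c(-3, -0.5, 0, 0.5, 3)

# A penalty of 1 zeroes the small coefficients and shrinks the large ones
shrunk_strong <- soft_threshold(coefs, lambda = 1)
print(shrunk_strong)  # → -2 0 0 0 2

# A tiny penalty leaves the coefficients essentially unchanged
shrunk_weak <- soft_threshold(coefs, lambda = 1e-10)
print(shrunk_weak)
```

With a penalty as small as 1e-10, the operator changes the coefficients by a negligible amount, so the fit behaves much like an unpenalized logistic regression.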

lasso_spec <- logistic_reg(
  penalty = tune(),
  mixture = 1
) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

lasso_workflow <- workflow() %>%
  add_recipe(albatross_recipe) %>%
  add_model(lasso_spec)

lasso_grid <- grid_regular(
  penalty(),
  levels = 20
)

lasso_results <- tune_grid(
  lasso_workflow,
  resamples = cv_folds,
  grid = lasso_grid,
  metrics = class_metrics
)

show_best(lasso_results, metric = "roc_auc")

# A tibble: 5 × 7
   penalty .metric .estimator  mean     n std_err .config         
     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
1 1   e-10 roc_auc binary     0.602     5 0.00540 pre0_mod01_post0
2 3.36e-10 roc_auc binary     0.602     5 0.00540 pre0_mod02_post0
3 1.13e- 9 roc_auc binary     0.602     5 0.00540 pre0_mod03_post0
4 3.79e- 9 roc_auc binary     0.602     5 0.00540 pre0_mod04_post0
5 1.27e- 8 roc_auc binary     0.602     5 0.00540 pre0_mod05_post0

Model 3: Random Forest

Random forest was used as a third model because it can capture more flexible and potentially non-linear relationships between predictors and albatross presence.

rf_spec <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_recipe(albatross_recipe) %>%
  add_model(rf_spec)

rf_grid <- grid_regular(
  mtry(range = c(1L, 2L)),
  min_n(range = c(2L, 20L)),
  levels = 5
)

rf_results <- tune_grid(
  rf_workflow,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = class_metrics
)

show_best(rf_results, metric = "roc_auc")

# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config         
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
1     1    11 roc_auc binary     0.609     5 0.00526 pre0_mod03_post0
2     1     2 roc_auc binary     0.609     5 0.00536 pre0_mod01_post0
3     1    20 roc_auc binary     0.609     5 0.00547 pre0_mod05_post0
4     1    15 roc_auc binary     0.609     5 0.00533 pre0_mod04_post0
5     1     6 roc_auc binary     0.608     5 0.00517 pre0_mod02_post0

Compare Models

For a fair comparison, the table below summarizes the cross-validated ROC AUC of each model, using the best tuning configuration for LASSO and random forest.

log_summary <- collect_metrics(log_results) %>%
  filter(.metric == "roc_auc") %>%
  transmute(model = "Logistic regression", mean, std_err)

lasso_summary <- show_best(lasso_results, metric = "roc_auc", n = 1) %>%
  transmute(model = "LASSO", mean, std_err)

rf_summary <- show_best(rf_results, metric = "roc_auc", n = 1) %>%
  transmute(model = "Random forest", mean, std_err)

bind_rows(log_summary, lasso_summary, rf_summary) %>%
  arrange(desc(mean))

# A tibble: 3 × 3
  model                mean std_err
  <chr>               <dbl>   <dbl>
1 Random forest       0.609 0.00526
2 Logistic regression 0.605 0.00302
3 LASSO               0.602 0.00540

The comparison shows that all models perform similarly, with ROC AUC values around 0.60. The random forest achieved the highest mean ROC AUC (0.609), but its margin over logistic regression (0.605) is 0.004, about the same size as the fold-to-fold standard errors.

This suggests that the relationship between predictors and outcome is relatively weak and likely does not require complex models. The small improvement from more flexible models indicates limited additional predictive signal in the data.
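
A rough way to judge how small the gap is: compare the difference in mean ROC AUC with its standard error. The sketch below uses the values from the comparison table and, as a simplifying assumption, treats the two estimates as independent. Because all models were evaluated on the same five folds, this is a back-of-the-envelope check, not a formal test.

```r
# Values copied from the cross-validation comparison table
cv_summary <- data.frame(
  model   = c("Random forest", "Logistic regression", "LASSO"),
  mean    = c(0.609, 0.605, 0.602),
  std_err = c(0.00526, 0.00302, 0.00540)
)

rf_row  <- cv_summary[cv_summary$model == "Random forest", ]
glm_row <- cv_summary[cv_summary$model == "Logistic regression", ]

# Gap in mean ROC AUC between the two models
auc_gap <- rf_row$mean - glm_row$mean

# Standard error of the gap, assuming independent estimates (a simplification)
gap_se <- sqrt(rf_row$std_err^2 + glm_row$std_err^2)

# Gap expressed in standard-error units
gap_in_se_units <- auc_gap / gap_se
print(round(c(gap = auc_gap, se = gap_se, ratio = gap_in_se_units), 4))
```

The gap is smaller than one standard error, which is consistent with treating the models as practically tied.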

Model Selection

Although the random forest model achieved the highest ROC AUC during cross-validation, the difference compared with logistic regression was very small. Logistic regression provided nearly the same performance while remaining simpler and easier to interpret.

Logistic regression was therefore selected as the final model because it offers a better balance between predictive performance and interpretability. In addition, the more complex models did not provide a meaningful improvement, suggesting that the relationship between the available predictors and albatross presence is relatively simple.

Final Model Evaluation

final_log_fit <- fit(log_workflow, data = train_data)

# generate class predictions and probabilities on test data
log_test_results <- test_data %>%
  bind_cols(predict(final_log_fit, test_data)) %>%
  bind_cols(predict(final_log_fit, test_data, type = "prob"))

class_metrics(log_test_results, truth = albatross_present, estimate = .pred_class, .pred_1)

# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.614
2 sens     binary         0.128
3 spec     binary         0.939
4 roc_auc  binary         0.379

log_test_results %>%
  roc_curve(truth = albatross_present, .pred_1) %>%
  autoplot()

conf_mat(log_test_results, truth = albatross_present, estimate = .pred_class)

          Truth
Prediction    0    1
         0   97   69
         1  659 1062

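The performance of the final model on the test data shows limited predictive ability. Accuracy is 0.614, only slightly above the roughly 0.60 that would result from always predicting presence.

The sensitivity and specificity need careful reading. yardstick treats the first factor level as the event by default, which here is "0" (albatross absent). The reported sensitivity of 0.128 is therefore the share of true absences that the model identifies, and the specificity of 0.939 is the share of true presences that it identifies. In other words, the model predicts presence for most records and struggles to identify cases where albatross are absent.

The same default explains the ROC AUC of 0.379. The metric was given .pred_1 while "0" was the event, so the ROC curve falls below the diagonal. For a binary outcome, the AUC with the event level matched to .pred_1 is 1 - 0.379 = 0.621, in line with the cross-validated value of about 0.60. Discrimination between presence and absence is still weak.

The confusion matrix supports this reading: 659 of the 756 true absences are misclassified as present (false positives), while only 69 of the 1,131 true presences are missed.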
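
Because the meaning of sensitivity and specificity depends on which class counts as the event, it helps to recompute the per-class rates directly from the confusion matrix. The sketch below does this in base R with the four counts printed above. In yardstick the choice is controlled by the `event_level` argument, whose default is `"first"`, which for this factor is the level `"0"`.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = truth)
conf_counts <- matrix(
  c(97, 659, 69, 1062),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1"))
)

# Share of true absences ("0") that were predicted as absent
recall_absent <- conf_counts["0", "0"] / sum(conf_counts[, "0"])

# Share of true presences ("1") that were predicted as present
recall_present <- conf_counts["1", "1"] / sum(conf_counts[, "1"])

# Overall share of correct predictions
accuracy <- sum(diag(conf_counts)) / sum(conf_counts)

print(round(c(absent = recall_absent, present = recall_present,
              accuracy = accuracy), 3))
# → absent 0.128, present 0.939, accuracy 0.614
```

The value 0.128 matches the reported `sens`, which confirms that "0" was treated as the event. Passing `event_level = "second"` to the metric call, together with `.pred_1`, would make presence the event instead.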

Interpretation of the Final Logistic Regression Model

library(broom)

tidy(final_log_fit) %>%
  mutate(odds_ratio = exp(estimate))

# A tibble: 3 × 6
  term             estimate std.error statistic  p.value odds_ratio
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
1 (Intercept)       -0.822     0.0985     -8.34 7.22e-17      0.440
2 sea_state_class    0.401     0.0409      9.80 1.17e-22      1.49 
3 wind_speed_class  -0.0516    0.0240     -2.15 3.16e- 2      0.950
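
The coefficients are on the log-odds scale, so a worked calculation helps translate them into probabilities. The sketch below plugs the estimates from the table into the logistic function using base R's `plogis()`, over the 0 to 6 sea state scale described earlier. Wind speed is held at class 4, a hypothetical value chosen for illustration, since the median used elsewhere in this document is not shown in the output.

```r
# Estimates copied from the coefficient table
intercept <- -0.822
b_sea     <-  0.401
b_wind    <- -0.0516

sea_state  <- 0:6
wind_fixed <- 4  # hypothetical value, held constant for illustration

# Linear predictor on the log-odds scale
log_odds <- intercept + b_sea * sea_state + b_wind * wind_fixed

# Inverse logit turns log-odds into probabilities
prob_present <- plogis(log_odds)

# Odds ratio for a one-class increase in sea state
odds_ratio_sea <- exp(b_sea)

print(data.frame(sea_state, prob_present = round(prob_present, 3)))
print(round(odds_ratio_sea, 2))
```

Each one-class increase in sea state multiplies the odds of albatross presence by about 1.49 with wind speed held fixed, while each one-class increase in wind speed multiplies them by about 0.95.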

test_grid <- crossing(
  sea_state_class = seq(min(test_data$sea_state_class), max(test_data$sea_state_class), by = 1),
  wind_speed_class = median(test_data$wind_speed_class)
)

test_curve <- test_grid %>%
  bind_cols(
    predict(final_log_fit, new_data = test_grid, type = "prob")
  )

# plot
ggplot() +
  geom_point(
    data = test_data,
    aes(
      x = sea_state_class,
      y = as.numeric(albatross_present == 1)
    ),
    alpha = 0.2
  ) +
  geom_line(
    data = test_curve,
    aes(x = sea_state_class, y = .pred_1),
    linewidth = 1
  ) +
  labs(
    title = "Predicted probability curve on test data",
    x = "Sea state",
    y = "Probability of albatross presence"
  )

The predicted probability curve shows a slight increase in the probability of albatross presence as sea state increases. However, the observed data points remain highly scattered, with both presence and absence occurring across all sea state values. This indicates that while the model captures a general trend, it does not clearly separate the two classes.

This observation is consistent with the performance metrics on the test data, where discrimination was weak. Because the predicted probability of presence stays above 0.5 for most records, the model rarely predicts absence and therefore misses most of the true absences.

Model Limitations

The model predictions remain uncertain, as probabilities do not strongly approach 0 or 1 and observed outcomes overlap across predictor values. This indicates that the model lacks confidence in distinguishing between presence and absence.

One possible reason for this is that important predictors may be missing from the analysis. Environmental factors beyond sea state and wind speed, such as location or time, may influence albatross presence.

Additionally, the variability observed in the data suggests that the relationship between predictors and outcome may be inherently noisy, limiting the achievable predictive performance.

Discussion

Summary of Findings

This analysis investigated whether sea state and wind speed can predict albatross presence. While exploratory analysis suggested a potential relationship, predictive modeling showed that this relationship is weak.

All models achieved similar performance, with only slight improvements from more complex approaches. Logistic regression was selected as the final model due to its simplicity and comparable performance.

Interpretation

The results indicate that while sea state may influence albatross presence, it is not sufficient on its own to accurately predict observations. The high variability in the data suggests that other environmental or biological factors may play an important role.

Limitations

One limitation of this analysis is the limited set of predictors. Additional variables such as location, time of year, or food availability may improve model performance.

Another limitation is the relatively low predictive power of all models, indicating that the problem may be inherently noisy.

Conclusion

Overall, this analysis demonstrates that while environmental conditions show some association with albatross presence, they are not strong predictors. More comprehensive data would be needed to build a reliable predictive model.