This project stems from my love of running! While I’ve never done an ultramarathon, I’m fascinated with the reasons that would drive somebody to run that far (50+ miles).
I mainly wanted to investigate the relationship between age, gender, and speed with this project. The dataset is from an old Tidy Tuesday project that I thought looked interesting. It’s a large dataset with over 100,000 observations from many ultra races around the world. Some of the most popular and well-known are the UTMB, Western States, and the Leadville 100.
Follow along below as I explore the data and predict speed from various factors within the dataset! I’ve collapsed my code to make things more readable, but click the little black arrows on the left hand side to check it out.
Data Cleaning and Exploratory Data Analysis
Click for R code!
Below is an initial plot of age and finishing time to get an idea of the shape of the data.
There seem to be many 0's present in the data, which I eventually converted to NA's, since there isn't any value in keeping them as zeros.
I also wanted to join the two datasets with race info on the variable race_year_id, which is essentially a unique identifier for each race. You'll see that I also changed a few of the variables from numeric to factors.
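A minimal sketch of that cleaning step, assuming the two raw data frames use the Tidy Tuesday names ultra_rankings and race (the exact columns converted to factors are my guesses):

```r
library(tidyverse)

# Join runner results to race info on the shared race identifier
ultra <- ultra_rankings %>%
  left_join(race, by = "race_year_id") %>%
  mutate(
    gender = as.factor(gender),   # assumed: stored as character/numeric originally
    rank   = as.factor(rank)
  )
```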
After taking a look at the data a little more, I realized that one of my favorite races (the Leadville 100) didn’t have any distance data. I double checked the exact distance and added it in so the race could be a part of my analysis.
Finally, I went back to my initial observation above and changed the zeros in the age and distance variables to NA values. This way I could simply omit NA's and drop those observations without losing too much data (only about 17,000 rows out of 137,000). After that, you'll see me creating a new column called AgeAtRace, as the original dataset contained each participant's current age, not their age when they competed. Further explanation of this can be found in the original Github Repo for this project.
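These two steps might look like the following sketch. The snapshot year of 2021 and the date column name are assumptions on my part, based on when the dataset was compiled:

```r
library(lubridate)

ultra <- ultra %>%
  mutate(
    age      = na_if(age, 0),        # zeros carry no information, so treat as missing
    distance = na_if(distance, 0),
    # back out age at race time from current age and race year
    AgeAtRace = age - (2021 - year(date))   # 2021 = assumed snapshot year
  ) %>%
  drop_na(age, distance)
```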
Now that we have AgeAtRace, I wanted to take a look at the distribution of ages across all participants.
As we can see, the age distribution is relatively normal for both men and women. It looks like I should filter the data to ages under 80, as the few points past that age would skew future plots.
Let’s also take a look at how time to finish a race increases with age.
There’s not a clear pattern except for the presence of many almost-zeros in the time variable. I think I’ll need to address those to be able to get some more predictive results.
Before addressing the zeros, there is one last manipulation I wanted to make: creating a speed variable. It gives us a standardized way of comparing finishers across all races, rather than using time alone, which wouldn't be as useful since the races are different lengths and finish times vary greatly across those lengths. You'll see below that I had to create a few other new variables before creating the mph (or speed) variable.
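A sketch of that derivation, assuming finishing time is stored in a time_in_seconds column (my assumption about the column name):

```r
# Speed in miles per hour: distance (miles) divided by finishing time (hours)
ultra <- ultra %>%
  mutate(
    time_in_mins = time_in_seconds / 60,
    hours        = time_in_mins / 60,
    mph          = distance / hours
  )
```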
After some further investigating, I decided to omit all observations with AgeAtRace greater than 80, as there were very few of those to begin with. I also omitted finish times under 150 minutes. I decided on this number because the shortest distance in this dataset is 20 miles, and covering that in 150 minutes is a pace of 7.5 minutes per mile (8 mph). That is very fast for an ultra, so any time under 150 minutes was likely an error. Finally, along the same lines, I omitted speeds greater than 20 mph, as no human can run that fast.
Now let’s look at the distribution of speed across all races.
It seems like most speeds fall between 2 and 4 mph, with a gradual right-hand tail. Speed is the variable I'm hoping to predict, but given this lognormal distribution I think I'll have to log-transform the speed variable in order to build a reliable model with it.
Next I wanted to dig a little deeper into how speed varies by course. First I created a new data frame with average speed for each course, broken down by gender. I wanted to visualize this, so I changed the event variable into a factor.
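That summary data frame could be built like this (a sketch; the name course_speed is mine):

```r
# Mean speed per course, broken down by gender
course_speed <- ultra %>%
  mutate(event = as.factor(event)) %>%
  group_by(event, gender) %>%
  summarize(avg_speed = mean(mph, na.rm = TRUE), .groups = "drop")
```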
Below is the visualization of this data, although I did combine both genders' averages into this plot.
Although the names of the races probably don’t mean much to you (or even me!), it’s clear that some race courses have a significantly faster speed than other courses. This could depend on many factors, such as overall elevation gain, whether more men or women ran the race, and the mean age of participants.
Next I dove into individual runner speeds and whether any runners have run more than one race. Many have run over 20 of the races included in this dataset! The fastest runners average over 8 mph, which is remarkably fast to sustain over these distances.
# A tibble: 58,402 × 2
runner n
<chr> <int>
1 FANCETT Kenneth 36
2 SMITH Mike 36
3 DONNELLY Susan 32
4 CATMUR Edward 28
5 THOMPSON David 26
6 JOHNSON Peter 25
7 NAKAMURA Sean 24
8 THOMPSON Mark 24
9 CARTER Williams 23
10 JONES Chris 23
# … with 58,392 more rows
# A tibble: 58,402 × 2
runner runner_speed
<chr> <dbl>
1 PICARD Romain 8.27
2 RAZET Vivien 8.25
3 BAHLA Sadek 8.24
4 PILLAT PIERRE ETIENNE 8.21
5 PEGEOT Pierre Alain 8.17
6 MATHIEU Damien 8.15
7 LACROIX Olivier 8.14
8 DOS SANTOS Matthieu 8.14
9 VANETTEN Taggart 8.13
10 GUILLOT Elise 8.11
# … with 58,392 more rows
# A tibble: 86,216 × 3
# Groups: event [372]
event runner runner_speed
<chr> <chr> <dbl>
1 #Småland100 BELCASTRO Luigi 3.64
2 #Småland100 EKBLAD Jonathan 3.50
3 #Småland100 FERNEMAR Thomas 3.73
4 #Småland100 GUSTAVSSON Jimmy 3.64
5 #Småland100 Jonsson Daniel 3.73
6 #Småland100 STROEM Martin 4.43
7 #Småland100 WILSON Torbjorn 4.43
8 100 Mile Spa BREMMERS Paul 4.47
9 100 Mile Spa CHEVALIER Michael 4.03
10 100 Mile Spa DE CREMER Fabian 3.81
# … with 86,206 more rows
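Summaries like the ones printed above could be produced with a couple of grouped pipelines (a sketch using the column names from earlier):

```r
# How many races has each runner finished?
ultra %>%
  count(runner, sort = TRUE)

# Average speed per runner, fastest first
ultra %>%
  group_by(runner) %>%
  summarize(runner_speed = mean(mph, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(runner_speed))
```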
Visualized below is mean speed for men and women by age. It looks like men's speed declines faster with age and then spikes after 60, while women stay more consistent throughout their lives and actually outperform men from ages 25 to 65.
I was curious about the ratio of men to women among finishers. Had it changed over the 10 years this dataset covers? Let's take a look below and see. First I created another new data frame with the percentage of men and women finishers by year. I then visualized this to see if the percentages have changed significantly over time.
# A tibble: 20 × 5
Year gender count per per.round
<dbl> <fct> <int> <dbl> <dbl>
1 2012 M 5037 85.9 85.9
2 2012 W 830 14.1 14.1
3 2013 M 7356 87.4 87.4
4 2013 W 1060 12.6 12.6
5 2014 M 8504 85.6 85.6
6 2014 W 1428 14.4 14.4
7 2015 M 8224 86.2 86.2
8 2015 W 1315 13.8 13.8
9 2016 M 8729 86.9 86.9
10 2016 W 1313 13.1 13.1
11 2017 M 10574 85.6 85.6
12 2017 W 1775 14.4 14.4
13 2018 M 13489 84.6 84.6
14 2018 W 2449 15.4 15.4
15 2019 M 14054 84.3 84.3
16 2019 W 2617 15.7 15.7
17 2020 M 4129 83.5 83.5
18 2020 W 816 16.5 16.5
19 2021 M 5859 86.4 86.4
20 2021 W 920 13.6 13.6
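A table like the one above could be built with something like the following sketch (the date column and object name are assumptions):

```r
library(lubridate)

# Percentage of finishers by gender within each year
gender_by_year <- ultra %>%
  mutate(Year = year(date)) %>%
  count(Year, gender, name = "count") %>%
  group_by(Year) %>%
  mutate(
    per       = 100 * count / sum(count),
    per.round = round(per, 1)
  ) %>%
  ungroup()
```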
Looks like 2020 was the year with the highest percentage of women finishers.
Modeling the Data
Now that I've explored the data a good amount, I'm most curious about digging deeper into how well certain factors (age, gender, course length, elevation gain) predict speed. Since speed is a single dependent variable and the various factors are independent variables, a regression analysis seems like the best option here. I'm going to train both a linear regression model and a random forest regression model to see which one fits the data better. All of this is done with the help of the amazing tidymodels package!
First, I created my modeling dataset containing only the variables I want in my model. Next I split the data into a training and a testing set. There is a large class imbalance in the gender variable, so I stratified the split on gender to make sure neither set ended up with a disproportionate share of men or women.
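A stratified split with tidymodels might look like this (the object names, seed, and 75/25 proportion are my assumptions):

```r
library(tidymodels)
set.seed(123)  # arbitrary seed for reproducibility

# Stratify on gender so the class imbalance is preserved in both sets
ultra_split <- initial_split(ultra_model, prop = 0.75, strata = gender)
ultra_train <- training(ultra_split)
ultra_test  <- testing(ultra_split)
```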
After that, I specified my models, both linear and random forest.
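The two model specifications could be written as (a sketch; the engine choices are assumptions, with ranger being a common random forest engine):

```r
library(tidymodels)

lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")
```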
Then I fit a basic linear regression model and a random forest regression model to my training data. I'm first doing this without any resampling, which I address a little later on. You'll notice I log-transform the speed (mph) variable in both models, as its original distribution is lognormal.
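Fitting both models with the log transform applied in the formula might look like this sketch (predictor and spec names are assumptions based on the factors discussed above):

```r
# Fit each spec on the training data, predicting log(speed)
lm_fit <- lm_spec %>%
  fit(log(mph) ~ AgeAtRace + gender + distance + elevation_gain,
      data = ultra_train)

rf_fit <- rf_spec %>%
  fit(log(mph) ~ AgeAtRace + gender + distance + elevation_gain,
      data = ultra_train)
```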
Now that the models are fit on my training data, I wanted to evaluate their performance using my testing dataset that I created above.
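The results.b data frame used below could be assembled roughly like this, assuming fitted models named lm_fit and rf_fit:

```r
# Attach each model's test-set predictions, keeping truth on the log scale
results.b <- ultra_test %>%
  mutate(mph = log(mph)) %>%
  bind_cols(
    predict(lm_fit, ultra_test) %>% rename(.pred_lm = .pred),
    predict(rf_fit, ultra_test) %>% rename(.pred_rf = .pred)
  )
```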
> metrics(results.b, truth = mph, estimate = .pred_lm)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.192
2 rsq standard 0.530
3 mae standard 0.148
> metrics(results.b, truth = mph, estimate = .pred_rf)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.179
2 rsq standard 0.632
3 mae standard 0.138
From the printed metrics, you'll see that the random forest model performs slightly better, particularly in its R-squared value of 0.632. While this is not a spectacular value, it tells us the model's predictions fit the data reasonably well.
I was curious how the models would perform on resampled data, to better evaluate both of them. I used the bootstrap method, which builds simulated datasets by drawing with replacement from the original data and fitting the model on each one. The bootstraps() function defaults to 25 resamples, which was fine with me, though it took a significant amount of time to run for my random forest model!
I first created my new resampled dataset, and then fit my models on that bootstrapped dataset.
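A sketch of those two steps with tidymodels, assuming the spec and object names from earlier:

```r
set.seed(123)  # arbitrary seed for reproducibility
ultra_boot <- bootstraps(ultra_train)   # 25 resamples by default

# Refit each spec across all bootstrap resamples
lm_res <- fit_resamples(
  lm_spec,
  log(mph) ~ AgeAtRace + gender + distance + elevation_gain,
  resamples = ultra_boot
)

rf_res <- fit_resamples(
  rf_spec,
  log(mph) ~ AgeAtRace + gender + distance + elevation_gain,
  resamples = ultra_boot
)
```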
> collect_metrics(lm_res)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.192 25 0.000175 Preprocessor1_Model1
2 rsq standard 0.544 25 0.000633 Preprocessor1_Model1
> collect_metrics(rf_res)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.180 25 0.000246 Preprocessor1_Model1
2 rsq standard 0.641 25 0.000696 Preprocessor1_Model1
I then collected the metrics from both sets of resampled fits to see how each model performed. The random forest is again the better predictor, this time with an R-squared of 0.641 versus 0.544 for the linear regression. This means that approximately 64% of the variance in the speed variable is explained by the predictor variables in the random forest model I specified.
Then I plotted the results to visualize each model. The blue line shows the model prediction compared to the ideal grey dotted line. While it's a little hard to tell in the plot, the random forest model provides slightly better predictions than the linear model. This may be because more of its points are clumped in a tighter area in the middle, which likely increases the R-squared; there is a larger spread and variance within the linear model.
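A plot along those lines could be sketched like this, assuming the results.b columns used with metrics() above:

```r
library(tidyverse)

results.b %>%
  pivot_longer(c(.pred_lm, .pred_rf),
               names_to = "model", values_to = ".pred") %>%
  ggplot(aes(mph, .pred)) +
  geom_point(alpha = 0.1) +
  geom_abline(linetype = "dotted", color = "grey") +          # ideal 1:1 line
  geom_smooth(method = "lm", se = FALSE, color = "blue") +    # fitted trend
  facet_wrap(~ model)
```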
Takeaways
All in all, the predictor variables I thought would have the most influence on speed (age, gender, course length, elevation gain) explained about 64% (within the random forest model) of the variance in the speed variable. This means those variables do a pretty good job of predicting speed during an ultramarathon, but there are likely other factors that also contribute. These could be things like starting elevation, wind on race day, how somebody slept, what they ate, etc. And of course, speed isn't everything in these long races! These are tests of mental and physical endurance, and oftentimes speed is the last thing on most runners' minds.
Thanks for joining me on this exploration! Next time you see somebody running, just think of all the factors that must be contributing to how fast they're going (good and bad)!
Full code for this project available on my Github.