
Gavin Simpson

@gsimpson.bsky.social

(Palaeo)[ecologist | limnologist] & #fakeStatistican, #rstats user, wielder of #GAMs. He/him/his. Opinions mine…

2,631 Followers  |  696 Following  |  109 Posts  |  Joined: 26.08.2023  |  2.1246

Latest posts by gsimpson.bsky.social on Bluesky


Huge thanks to @gsimpson.bsky.social and everyone who joined our GAMs in R course! 🚀

Happy modeling, everyone! 🎉

11.12.2025 16:08 - 👍 5    🔁 1    💬 0    📌 1

A palaeoecological plant find from a Breckland ghost ice age pond core (East Harling, E. England). Puzzled over by @hayleymcmechan.bsky.social & myself. Help with ID please team!! Seems obvious but stuck. @jo-the-botanist.bsky.social @bramblebotanist.bsky.social @timholtwilson.bsky.social

07.12.2025 18:58 - 👍 16    🔁 4    💬 6    📌 1
Figure with two panels. Left panel: visualisation of a 3D movement track. Right panel: visualisation of the 3D direction of movement as two angles (one horizontal angle and one vertical angle).


We have a preprint about modelling three-dimensional movement tracks, led by @njklappstein.bsky.social.

The model takes the form of a step selection function and, just like in 2D, it can include directional persistence, attraction to targets, and habitat selection.

doi.org/10.1101/2025...

02.12.2025 15:07 - 👍 15    🔁 6    💬 0    📌 1
Next Public Health in Action Seminar - Dr Darren Dahly-20251113_130641-Meeting Recording

If you want to see me get flustered, but recover to give a banger talk about RSV prophylaxis, the magic of GAMs, and the vital importance of surveillance data, then this is the talk for you. With shout-outs for @gsimpson.bsky.social, @vincentab.bsky.social, and @hpscireland.bsky.social

03.12.2025 08:31 - 👍 31    🔁 4    💬 1    📌 0

EFS is the Extended Fellner Schall smoothness selection method that Simon & Matteo Fasiolo developed, & which was initially used for the twlss() family

EFS is neat bc it doesn't require all the fancy higher order derivatives of model quantities to fit the model, even for location scale families

12.11.2025 11:40 - 👍 5    🔁 0    💬 0    📌 0

Why is scasm() such a big deal? Natalya Pya & Simon did some work leading to Natalya's *scam* 📦 with a load of different shape constraints. But the algorithm was GCV-based and only worked for the standard families.

scasm() works for *any* family in mgcv thanks to EFS & just slots into models

12.11.2025 11:40 - 👍 8    🔁 0    💬 1    📌 0

The full change log with all the changes is here: cran.r-project.org/web/packages...

I've already started the process of getting these new features supported in the gratia 📦. Shape constraint smooths mostly just work; small fix was needed bc structure of the smooths lacked something gam() produces

12.11.2025 11:28 - 👍 11    🔁 1    💬 1    📌 0
Example plot showing the partial effects of the estimated smooths in a GAM with terms y ~ x0 + s(x1) + s(x2) + s(x3). Three panels are shown, one per smooth. The first panel is for s(x1) and shows an increasing, slightly curved function that increases slightly more as x1 increases. The second panel shows the partial effects for s(x2), which shows much more variation, being initially negative at x2=0, rising to a peak around x2=0.35, then falling to ~-1 at x2=0.5, then remaining roughly flat followed by a gradual decline thereafter. The final panel shows the partial effect of s(x3), which is ~0 for all x3, with uncertainty band covering 0 too. This predictor is known to have no effect in the simulated data.

The uncertainty in the partial effects is shown by two credible interval bands; a dark blue central band is a 68% Bayesian credible interval, while the lighter blue outer interval is a 95% Bayesian credible interval.

The background of each panel is light grey with white grid lines, in a similar style to ggplot2's default theme.


🌟 plot.gam() also has a new theme: scheme=2

This draws plots with 68 & 95% intervals (by default) and has a ggplot2-like grey plot background

12.11.2025 11:28 - 👍 3    🔁 0    💬 1    📌 0
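
As a rough illustration of the two bands described in the post above, the sketch below computes 68% and 95% intervals for one partial effect by hand. It does not use the new scheme = 2 option. It assumes data simulated with mgcv's gamSim() and approximates each band as the estimate plus or minus a normal quantile times the standard error, which may differ in detail from what plot.gam() draws. The object names (dat, m, bands) are illustrative only.

```r
library(mgcv)

set.seed(2)
# Simulated example data (assumption: Gu & Wahba four-term test data via gamSim)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)
m <- gam(y ~ x0 + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

# Per-term contributions on the link scale, with standard errors
pt  <- predict(m, type = "terms", se.fit = TRUE)
fit <- pt$fit[, "s(x2)"]
se  <- pt$se.fit[, "s(x2)"]

# Normal quantiles for central 68% and 95% intervals
crit68 <- qnorm(0.5 + 0.68 / 2)
crit95 <- qnorm(0.5 + 0.95 / 2)

bands <- data.frame(
  x2      = dat$x2,
  fit     = fit,
  lower68 = fit - crit68 * se, upper68 = fit + crit68 * se,
  lower95 = fit - crit95 * se, upper95 = fit + crit95 * se
)
bands <- bands[order(bands$x2), ]  # sort by covariate, ready for plotting
```

The 68% band is roughly the estimate plus or minus one standard error, which is why it sits inside the 95% band everywhere.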
Example plot showing the derivative of the estimated smooths in a GAM with terms y ~ x0 + s(x1) + s(x2) + s(x3). Three panels are shown, one per smooth. The first panel is for s(x1) and shows a positive derivative that increases slightly as x1 increases. The second panel shows the derivative for s(x2), which shows much more variation, being initially strongly positive from x2=0, moving gradually to strongly negative by x2=0.35, then back to 0 at x2=0.5, then remaining roughly flat thereafter. This reflects the strongly peaked nature of the fitted smooth s(x2). The final panel shows the derivative of s(x3), which is ~0 for all x3, with uncertainty band covering 0 too. This predictor is known to have no effect in the simulated data.

The uncertainty in the derivatives is shown by two credible interval bands; a dark blue central band is a 68% Bayesian credible interval, while the lighter blue outer interval is a 95% Bayesian credible interval.

The background of each panel is light grey with white grid lines, in a similar style to ggplot2's default theme.


🌟 plot.gam() gains a deriv argument, which if TRUE plots derivatives of univariate smooths instead of the usual partial effect plots

Partial effect plots can be confusing. With deriv=TRUE you see the change in Y (η) for a small change in X, which is comparable with usual interpretations of model 𝛽

12.11.2025 11:28 - 👍 6    🔁 0    💬 1    📌 0
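
For readers who want to see what "the derivative of a smooth" means numerically, here is a minimal sketch that estimates it by finite differences on the linear predictor matrix. It is one common way to do this by hand and is not necessarily how the new deriv argument is implemented. The simulated data, the step size eps, and the object names are assumptions made for illustration.

```r
library(mgcv)

set.seed(1)
# Simulated example data (assumption: not the data behind the plots above)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

# Evaluate s(x2) on a grid, holding the other covariates at their means
eps  <- 1e-5
grid <- data.frame(x0 = mean(dat$x0), x1 = mean(dat$x1),
                   x2 = seq(0.01, 0.99, length.out = 100),
                   x3 = mean(dat$x3))
grid_eps <- transform(grid, x2 = x2 + eps)

# Linear predictor matrices at x2 and x2 + eps
X0 <- predict(m, newdata = grid,     type = "lpmatrix")
X1 <- predict(m, newdata = grid_eps, type = "lpmatrix")
Xd <- (X1 - X0) / eps  # finite-difference approximation to dX/dx2

# Keep only the columns belonging to s(x2)
keep <- grepl("s(x2)", colnames(Xd), fixed = TRUE)
Xd[, !keep] <- 0

deriv_x2 <- drop(Xd %*% coef(m))                 # change in eta per unit x2
se_x2    <- sqrt(rowSums((Xd %*% vcov(m)) * Xd)) # standard error of derivative
```

Each value in deriv_x2 reads like a local slope: the change in the linear predictor for a small change in x2 at that point. The gratia package also provides a derivatives() function for the same purpose.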

🌟 New family, bcg(), for (censored) Box-Cox Gaussian responses (basically anything that is conditionally Gaussian *after* a Box-Cox transform of Y_i)

12.11.2025 11:28 - 👍 4    🔁 0    💬 1    📌 0
Model is:

b3 <- scasm(
  y ~ s(x0, bs = "bs", k= k) + s(x1, bs = "sc", xt = "m+", k = k) +
         s(x2, bs = "bs", k = k) + s(x3, bs = "bs", k = k),
  family=poisson, bs=200
)

The second smooth `s(x1)` is a shape constrained smooth with a positive monotonicity constraint (xt = "m+").

The `bs = 200` argument uses 200 bootstrap samples, which generate bootstrap distributions for each coefficient in the model. These bootstrap samples respect the shape constraints, while the usual +/- 2 SE credible intervals may not.

The uncertainty in the partial effects is shown by two credible interval bands; a dark blue central band is a 68% Bayesian credible interval, while the lighter blue outer interval is a 95% Bayesian credible interval.

The background of each panel is light grey with white grid lines, in a similar style to ggplot2's default theme.


A new release of the mgcv #RStats 📦 is out on CRAN and Simon Wood (U Edinburgh) has added some significant new features despite the small bump in version number:

🌟 scasm() for estimating GAMs with shape constrained smooths. Can be used with any family & smoothness selection is via the EFS method

12.11.2025 11:28 - 👍 95    🔁 24    💬 3    📌 5
Chaos is coming for scholarly publishing - Research Professional News Buckling of commercial models alongside maturing of community-led efforts promises major shifts, says Caroline Edwards

Opinion: Chaos is coming for scholarly publishing.

Buckling of commercial models alongside maturing of community-led efforts promises major shifts, says Caroline Edwards (@theblochian.bsky.social).

www.researchprofessionalnews.com/rr-news-uk-v...

12.11.2025 10:17 - 👍 35    🔁 31    💬 0    📌 5

Feel free to tag me on any questions if you post them here. Lots of answers on CrossValidated in the generalized additive model tag cover HGAMs. If you have a stats question you could also ask it there

12.11.2025 06:40 - 👍 2    🔁 0    💬 1    📌 0

filter_out() = yeet()

07.11.2025 16:34 - 👍 54    🔁 3    💬 3    📌 0
Results of model fitting to the average daily fat content data from @Henderson1990-bd. a) observed average daily fat content (points) and estimated lactation curves from Wood's [-@Wood1967-re] model, a Tweedie GLM, and a Tweedie GAM (lines) with associated 95% confidence (Wood's model) or 95% credible intervals (GLM and GAM). Response residuals for Wood's model (b), Tweedie GLM (c), and Tweedie GAM (d), plus scatter plot smoothers (lines) and 95% credible intervals (shaded ribbons).

The fitted lactation curves are like an inverted U, with an extended longer tail to the right (later in lactation). The GAM curve fits the data well, but the fitted curves from Wood's model and the GLM equivalent do not provide good fits to the data, and over predict the amount of fat produced at the peak of lactation, and only grossly capture the decline in fat production later in lactation. The remaining panels show the raw response residuals for the three models, drawing attention to the poor fit; for Wood's model and the GLM there is significant pattern in the residuals, while for the GAM no residual pattern is observed.


Quantities of interest derived from Wood's model, a Tweedie GLM, and a Tweedie GAM fitted to the lactation data example: a) the estimated week of peak average daily fat content, b) the estimated average daily fat content at the peak, and c) the rate of change (first derivative) of the lactation curve estimated at a point that is midway between the peak fat content and the end of lactation. The points are the estimated values and the lines are a 95% uncertainty interval. The uncertainty interval is based on the 0.025 and 0.975 percentiles of the bootstrap distribution of model coefficient estimates (Wood's model) or of the posterior distribution (GLM and GAM).

Each panel shows three point estimates and an uncertainty range. The three points are the estimates from a GAM, a GLM, and Wood's lactation model. The first panel shows the estimated timing of the peak of lactation, with the GAM capturing the fact that the peak in the data occurs much later in lactation (~ week 11) while the other two models confidently estimate that the peak is in week ~8-9. The GAM estimate has a much wider credible interval, which does include the estimates of Wood's model & the GLM at the extreme end. This reflects the uncertainty in the estimation of the peak timing arising from the data having a wide flat peak.

The other panels show the estimates of fat content at the peak, which are broadly similar at ~ 0.7 kg fat per day. The final panel showing the persistency estimate shows the GAM estimate diverging from those of the GLM and Wood's model. Again, the latter two models are overly confident in their estimation of this biologically relevant parameter, despite the fitted lactation curve not really following the lactation data.


a) Estimated daily growth rate on November 15^th^, 2021 and 95% Bayesian credible interval for the 18 pigs in the pig growth example. b) Posterior distribution of daily growth rate on November 15^th^, 2021, for three pigs (numbers 2, 13, and 17), for whom weight observations ceased before November 1^st^, 2021. In b), the shaded region is the posterior distribution, the point, and thick and thin bars are the posterior median, and 66% and 95% posterior intervals respectively.

With the fitted growth curves, we can estimate for any day what the growth rate of each pig was. In this figure I'm showing the estimated growth rate of each pig in the example on November 21st. This growth rate is the first derivative of the fitted growth curve (smooth function). I used posterior sampling to produce the posterior distribution of the growth rate for each pig. These are summarised as a point estimate (median) and credible interval in the first panel with most pigs growing at ~1-1.5 kg per day by November 21st, with uncertainties on the order of +/- 0.5 kg per day.

The second panel shows the entire posterior distribution of the estimated growth rate for three pigs (2, 13, and 17) for whom there were no weight estimates after November 1st. Here, the model is drawing power from the other pigs to help extrapolate the growth curves for these three pigs, but pig-specific details remain, with the posterior distribution for pig 17 being much more diffuse (wider) than for either pigs 2 or 13, reflecting greater uncertainty for the former animal.


Just updated my manuscript on using #GAMs in #AnimalScience, now on arXiv: doi.org/10.48550/arX...
🐄🐖🪶

Extended examples now show how GAMs go beyond prediction, helping estimate biologically meaningful traits from data.

Code: github.com/gavinsimpson...

🧪 #RStats #mgcv #Statistics #OpenScience

29.10.2025 10:44 - 👍 55    🔁 8    💬 0    📌 0
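
The figure descriptions above mention using posterior sampling to get uncertainty for derived quantities such as the week of peak fat content. The sketch below shows one way that idea can work with mgcv. It is not the manuscript's code. The data are simulated from a Wood-type curve with made-up parameters, the Gamma family stands in for the Tweedie used in the manuscript, and the posterior is approximated by a multivariate normal centred on the fitted coefficients.

```r
library(mgcv)

set.seed(42)
# Hypothetical data: a Wood-type lactation curve with multiplicative noise
week <- 1:35
mu   <- 0.5 * week^0.25 * exp(-0.03 * week)
dat  <- data.frame(week = week,
                   fat  = mu * exp(rnorm(length(week), sd = 0.05)))

m <- gam(fat ~ s(week, k = 10), data = dat,
         family = Gamma(link = "log"), method = "REML")

# Fine grid over lactation and the matching linear predictor matrix
grid <- data.frame(week = seq(1, 35, by = 0.1))
Xp   <- predict(m, newdata = grid, type = "lpmatrix")

# Draw coefficient vectors from the approximate (Gaussian) posterior
betas <- rmvn(2000, coef(m), vcov(m))  # 2000 draws, one per row
eta   <- Xp %*% t(betas)               # rows: grid points, columns: draws

# For each draw, locate the peak and the fitted value at the peak
peak_week <- grid$week[apply(eta, 2, which.max)]
peak_fat  <- exp(apply(eta, 2, max))   # back-transform from the log link

# Posterior median and 95% interval for the timing of the peak
quantile(peak_week, probs = c(0.025, 0.5, 0.975))
```

Because every draw is a complete curve, any summary of the curve (peak timing, peak height, slope at a chosen point) gets a posterior distribution with no extra derivation.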
Suede - Black Or Blue (Audio Only)
YouTube video by Suede HQ

"I don't care for the UK tonight
So stay
Stay"

Fuck you, Katie Lam, absolutely fuck off.

www.youtube.com/watch?v=W38v...

23.10.2025 18:44 - 👍 3    🔁 1    💬 0    📌 0

I’m seeing some misinformation about pseudo-random number generator best practices going around the internets. Let’s talk about why the pseudo-random number generator seed you use shouldn’t actually have any impact on your results and, consequently, you can choose whatever seed you damn well please.

22.10.2025 19:06 - 👍 37    🔁 11    💬 4    📌 4

So Framework is supporting projects in the Linux ecosphere that are led by vile, racist, homophobic, transphobic people and sees nothing wrong with that. And Shopify supports and platforms the racist idiot that built Ruby on Rails, all because their money stream is entirely dependent on Rails

23.10.2025 06:28 - 👍 0    🔁 0    💬 1    📌 0

Fuck Kramnik

22.10.2025 16:43 - 👍 1    🔁 0    💬 0    📌 0

I have very fond memories of feeding the giraffe there; have fun enjoying Nairobi (taking breakfast at the national park at dawn, in one of the picnic / viewpoints on the high ground overlooking the rest of the park was a real treat, as was the elephant orphanage)

06.10.2025 17:39 - 👍 1    🔁 0    💬 0    📌 0
GitHub - gavinsimpson/physalia-gam-course: Generalized Additive Models; a data-driven approach to estimating regression models

If you want to check out the examples from the Physalia course, the materials from the last running are available on GitHub

github.com/gavinsimpson...

29.09.2025 08:26 - 👍 1    🔁 0    💬 1    📌 0

I’d be happy to add/use a psych example or two if you have suggestions for papers or analyses where the data are open?

29.09.2025 08:23 - 👍 2    🔁 0    💬 1    📌 0

As I’m a geographer by PhD working in ecology, environmental science, and now animal science, the examples tend towards the natural and life sciences, but they are quite varied so attendees from a wide array of backgrounds usually find them relatable.

29.09.2025 08:23 - 👍 3    🔁 0    💬 1    📌 0

I’ll be running a 3-day one at AU Viborg (about an hour from Aarhus, Denmark) June 9th through 11th that is in person. Dates are TBC but registration should be open in the next couple of weeks

I’m also running an online one with @physaliacourses.bsky.social in December this year.

27.09.2025 08:01 - 👍 3    🔁 1    💬 1    📌 0

The Pink Book of #MarginalEffects (aka Model to Meaning) ships next week and I've got a backlog of Zoolander memes.

Hope you're hungry for some spam in your timeline.

#RStats #PyData

22.09.2025 16:52 - 👍 89    🔁 18    💬 1    📌 3
37 Performance – Model to Meaning

The new {marginaleffects} release for #RStats (0.30.0) comes with two new vignettes:

1. Speed up computation with automatic differentiation (often 10x gains) marginaleffects.com/bonus/perfor...

2. Power analyses with {marginaleffects} and {DeclareDesign}. marginaleffects.com/bonus/power....

13.09.2025 18:37 - 👍 146    🔁 34    💬 3    📌 3

We're glad to finally bring you this update!

11.09.2025 12:02 - 👍 28    🔁 3    💬 1    📌 0

Congratulations Thomas; I know this release hasn’t been an easy one

11.09.2025 14:36 - 👍 5    🔁 0    💬 0    📌 0
ggplot2 4.0.0 A new major version of ggplot2 has been released on CRAN. Find out what is new here.

I am beyond excited to announce that ggplot2 4.0.0 has just landed on CRAN.

It's not every day we have a new major #ggplot2 release but it is a fitting 18 year birthday present for the package.

Get an overview of the release in this blog post and be on the lookout for more in-depth posts #rstats

11.09.2025 11:20 - 👍 850    🔁 282    💬 9    📌 51

😱

10.09.2025 18:17 - 👍 1    🔁 0    💬 0    📌 0

@gsimpson is following 20 prominent accounts