
Xan Gregg

@xangregg.bsky.social

Engineering Fellow at JMP, focused on #DataViz, preferring smoothers over fitted lines. Creator of JMP #GraphBuilder and #PackedBars chart type for high-cardinality Pareto data. #TieDye #LessIsMore

1,716 Followers  |  1,813 Following  |  284 Posts  |  Joined: 04.11.2023

Posts by Xan Gregg (@xangregg.bsky.social)

Visualizations of Distributions and Uncertainty Provides primitives for visualizing distributions using ggplot2 that are particularly tuned for visualizing uncertainty in either a frequentist or Bayesian mode. Both analytical distributions (such as...

Thanks! I'm using JMP, but you might get something comparable in R with @mjskay.com's ggdist/geom_dotsinterval. mjskay.github.io/ggdist/

09.03.2026 12:06 | 👍 2    🔁 0    💬 0    📌 0

#dataviz exercise: trying a smoothed dot plot version of the beeswarm.

08.03.2026 22:34 | 👍 15    🔁 1    💬 1    📌 0

Mosaic (marimekko) chart of all the responses to "Are you a large language model?", severely cleaned, by source of participants. #dataviz

08.03.2026 15:24 | 👍 1    🔁 0    💬 0    📌 0
Figure 1 from the linked paper. A chart showing correlations of 27 semantic antonym pairs across 6 sources, with confidence intervals. Mechanical Turk based sources generally have positively correlated antonyms.


My favorite free text response in this study's data:
Q: Are you a large language model?
A: No, I am not a large language model. I am a human who is here to assist you with any questions or information you may need.
www.researchgate.net/publication/...

08.03.2026 15:10 | 👍 1    🔁 0    💬 0    📌 1

Belated author tag @cskay.bsky.social

07.03.2026 17:00 | 👍 1    🔁 0    💬 0    📌 0

Another option using heatmap-like squares colored by correlation and sized by absolute value of correlation. #dataviz

07.03.2026 16:59 | 👍 4    🔁 0    💬 1    📌 0
Panel of 4 ellipse plots from the original post, but overlaying confidence interval ellipses for each group.


Showing confidence intervals seems harder with ellipses than lines. Here's an attempt that overlays ellipses corresponding to the lower and upper CIs of the corr values. Maybe it lets you sense that MTurk had more respondents.

07.03.2026 16:49 | 👍 2    🔁 0    💬 0    📌 0
A grid of 25 plots, each showing three correlation ellipses, one for each subgroup: Cloud Research Connect, Mechanical Turk, and Prolific. Each plot corresponds to a pair of antonym-based survey questions (that is, expecting negative correlation). The MTurk ellipses show positive correlation and the others show negative correlation.


How to plot correlations with different subgroups? Here's a try using correlation ellipses. Is there a line equivalent?
Data from survey posing pairs of antonym questions across 3 online participant pools (Cloud Research Connect, MTurk and Prolific). #dataviz www.researchgate.net/publication/...

07.03.2026 16:40 | 👍 8    🔁 0    💬 2    📌 1
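
For readers wondering how a correlation ellipse like the ones in that post can be drawn, here is a minimal sketch. It assumes standardized variables (unit variance), so the ellipse depends only on the correlation r; the function name and point count are illustrative and not taken from JMP. For covariance [[1, r], [r, 1]], the principal axes lie along the two diagonals with semi-axis lengths sqrt(1 + r) and sqrt(1 - r).

```python
import math

def correlation_ellipse(r, n_points=100):
    """Points on the ellipse x' * inv(S) * x = 1 for S = [[1, r], [r, 1]].

    Assumes standardized variables and |r| <= 1 (hypothetical helper).
    """
    a = math.sqrt(1 + r)  # semi-axis along the y = x diagonal
    b = math.sqrt(1 - r)  # semi-axis along the y = -x diagonal
    s = math.sqrt(0.5)    # cos(45 deg) = sin(45 deg)
    points = []
    for i in range(n_points):
        t = 2 * math.pi * i / n_points
        u, v = a * math.cos(t), b * math.sin(t)
        # Rotate the axis-aligned ellipse by 45 degrees.
        points.append((s * (u - v), s * (u + v)))
    return points
```

A positive r stretches the ellipse along the rising diagonal and a negative r along the falling one, which is what makes the MTurk ellipses stand out from the others.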

Here's a line chart version of the same France mortality rate data with z-score on the Y axis and a separate line for each birth year. #dataviz

09.02.2026 18:43 | 👍 2    🔁 1    💬 0    📌 0

Combat exposure seems like a conspicuous factor, but then I would expect a wider orange band accounting for several birth years. I did see a few papers linking it to "in utero influenza exposure".

08.02.2026 18:07 | 👍 0    🔁 0    💬 0    📌 0
Heatmap of birth cohort on the x axis (1900 - 1970) and age on the y axis (0 - 70) showing relative mortality rates within France. Each cell is colored from blue-to-gray-to-orange (-5 to 0 to +5) by the z-score (standard deviations from the mean) when this age/birth year mortality rate is compared to the 20 nearby years at the same age. Vertical orange stripes appear at 1920 and 1946 indicating generally higher mortality rates for those birth cohorts. Diagonal orange stripes appear at birth+age equal to 1918, 1940 and 1945.


Playing with mortality data, I accidentally discovered how being born in the immediate aftermath of the Spanish flu pandemic is associated with later-life increased mortality rates in many European countries. #dataviz for France. Blog post rawdatastudies.com/2026/02/07/b...

08.02.2026 16:28 | 👍 14    🔁 4    💬 2    📌 1
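
The heatmap description above mentions a z-score against the 20 nearby years at the same age. A rough sketch of that kind of calculation follows. It assumes the window is 10 birth years on each side with the cell itself excluded, and that raw rates are compared directly; the post does not spell out those details, so treat both as guesses.

```python
import statistics

def neighbor_zscores(rates, half_window=10):
    """z-score of each rate against its neighbors at the same age.

    rates: list of mortality rates ordered by birth year (one fixed age).
    half_window: neighbors taken on each side (assumed, not from the post).
    """
    scores = []
    for i, rate in enumerate(rates):
        lo = max(0, i - half_window)
        hi = min(len(rates), i + half_window + 1)
        neighbors = rates[lo:i] + rates[i + 1:hi]  # exclude the value itself
        if len(neighbors) < 2:
            scores.append(float("nan"))
            continue
        mean = statistics.fmean(neighbors)
        sd = statistics.stdev(neighbors)
        scores.append((rate - mean) / sd if sd > 0 else float("nan"))
    return scores
```

Applied once per age row, this would give the grid of values that the blue-to-gray-to-orange color scale encodes.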

Nice illustration. It seems that any of those definitions of prepandemic baseline would be equally valid/invalid for extrapolation. I don't see that choice explained in the paper.

08.02.2026 13:06 | 👍 0    🔁 0    💬 1    📌 0

cc @jeanfisch.bsky.social who first pointed out this paper

08.02.2026 00:56 | 👍 1    🔁 0    💬 1    📌 0
Line chart of mortality rates for Poland and France since 1990. Dashed lines show the 2015-2019 linear trend extrapolated to the end year of the data (2023). Both rate curves jump in 2020 and fall back to pre-covid levels in 2023. However, Poland's trend line is upward and the extrapolated value is above the recent mortality rate value.


I made this #dataviz trying to understand a study checking for covid displacement deaths. Apparently Poland is 1 of the 3 of 34 countries showing displacement b/c its recent mortality rate is below the pre-covid trend (my HMD data only goes to 2023) jamanetwork.com/journals/jam...

08.02.2026 00:49 | 👍 4    🔁 1    💬 1    📌 0
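
The dashed trend lines described above can be approximated with an ordinary least-squares fit over the baseline years, extended to the target year. This is only a sketch: the function names are hypothetical, and the study itself may define its baseline or model differently.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate_trend(years, rates, base_start, base_end, target_year):
    """Fit the baseline window (inclusive) and evaluate the line at target_year."""
    base = [(y, r) for y, r in zip(years, rates) if base_start <= y <= base_end]
    slope, intercept = fit_line([y for y, _ in base], [r for _, r in base])
    return slope * target_year + intercept
```

Comparing the observed 2023 rate with `extrapolate_trend(years, rates, 2015, 2019, 2023)` mirrors the comparison in the chart. A rising baseline, as with Poland, pushes the extrapolated value up and makes a "below trend" result easier to reach.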

More idealism: dataviz researchers no longer need to be so connected to computer science / programming skills.

06.02.2026 15:57 | 👍 1    🔁 0    💬 1    📌 0
Using AI to Replicate (and Extend) Data Visualization Experiments What happens when replicating a study is (almost) a few prompts away?

Interesting blog post and video from @ebertini.bsky.social on using AI for #dataviz research and experiments. filwd.substack.com/p/using-ai-t...

06.02.2026 12:54 | 👍 3    🔁 0    💬 1    📌 0
Two side by side CDF plots, each with a black reference line at y=x and a blue cumulative probability step line. The line for 2024 is completely below the reference line. The blue line for 2025 hugs the reference line, going above and below it.


Two area charts representing the differences in each of the CDF plots from the y=x reference line.


CDF plots are great but have you tried flattened CDF plots? #dataviz
rawdatastudies.com/2026/01/19/s...

20.01.2026 21:24 | 👍 14    🔁 2    💬 0    📌 2
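
Going by the image descriptions above, a flattened CDF plot shows the difference between the empirical CDF and the y = x reference line. A minimal sketch, assuming the values are already on a 0 to 1 scale (for example p-values or predicted probabilities); the blog post may compute it differently.

```python
def flattened_ecdf(values):
    """Return (x, ECDF(x) - x) pairs at each sorted data point.

    Assumes values lie in [0, 1], so y = x is the uniform reference.
    With tied values, the intermediate step points are included as well.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n - x) for i, x in enumerate(xs)]
```

Differences that stay below zero correspond to a CDF entirely under the reference line, like the 2024 panel. Differences hovering around zero correspond to a CDF that hugs the line, like 2025.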

Good #dataviz inspirations in slides and notes of less-common-but-useful statistical charts.

11.01.2026 15:16 | 👍 4    🔁 0    💬 0    📌 0

Excellent! What's the significance of the bold numbers in the color swatches, 50, 100, 200, ..., 950 (for 11 colors)? I was surprised they're not equally spaced but then thought I'm missing something important.

09.01.2026 13:38 | 👍 0    🔁 0    💬 1    📌 0
Line chart showing annual CO₂-equivalent emissions (in MtCO₂e) for major non-state fossil-fuel and mining companies from roughly 1935 to 2023. Chevron, ExxonMobil, BP, and Shell rise steeply from the 1950s to peaks in the late 1960s–early 1970s, with Chevron the highest at nearly 1,800 MtCO₂e. Most companies decline after the 1970s, then level off or fluctuate. Coal-focused firms such as Peabody Energy and CONSOL Energy peak later, around the 2000s.


Filled area chart of total yearly CO₂-equivalent emissions (in MtCO₂e) from carbonmajors.org, spanning roughly 1935 to 2023. Emissions rise gradually until the 1950s, accelerate sharply through the 1960s and early 1970s, dip slightly in the late 1970s and early 1980s, then resume steady growth. Totals surpass 20,000 MtCO₂e around 1990 and reach over 30,000 MtCO₂e by the 2010s, remaining near that level through the most recent years.


More #dataviz from carbonmajors.org data. Yearly emissions (smoothed) from top 10 cumulative non-state entities. Didn't expect such a spike in the 70s (without a recovery). Looks like the all-entity sum (area chart) only paused at that point.

06.01.2026 22:08 | 👍 3    🔁 0    💬 0    📌 0

For a dataviz person, it's probably easiest to understand as a new treemap packing algorithm. That is, each area is proportional to some value and the total area is the total sum. But the top values are arranged like a ranked bar chart. In this case, the top bar is too big for a rectangular total.

06.01.2026 12:46 | 👍 1    🔁 0    💬 0    📌 0

The carbon majors database doesn't include countries, so that was just my carelessness in company-to-country mapping.

06.01.2026 00:33 | 👍 1    🔁 0    💬 0    📌 0

Very nice!

05.01.2026 19:20 | 👍 1    🔁 0    💬 0    📌 0
Horizontal packed bar chart ranking entities by cumulative emissions (GtCO₂-equivalent), using data from carbonmajors.org. China (Coal) is by far the largest contributor, followed by USSR (Coal), Saudi Arabia (Oil), Chevron, ExxonMobil, and Russia (Natural Gas). Other major contributors include BP, Shell, India (Coal), Iran (Oil), and China (Oil). Each row shows one dominant producer in blue, with many smaller producers stacked to the right in light gray, illustrating the long tail beyond the top emitters.


Cumulative CO2e emissions since 1854. Packed bar chart of top 10 entities, and all the others in gray. Data from carbonmajors.org, but recoding/combining govt-controlled entities as Country (Fuel). #dataviz

05.01.2026 12:55 | 👍 5    🔁 0    💬 2    📌 0

I came across another paper link.springer.com/article/10.1... with the exact same false Data Availability statement:

No datasets were generated or analysed during the current study.

Digging deeper, I found it's one of the example statements from Nature's guidance at www.nature.com/documents/nr...

04.01.2026 14:01 | 👍 2    🔁 0    💬 0    📌 0

Late follow-up, but looking closer at the article, I see these are essentially bootstrap confidence intervals of the mean. They ran many simulations of the model, and the bands show quantiles of the many simulation results. The labels come from IPCC's "calibrated likelihood" language.

03.01.2026 19:53 | 👍 0    🔁 0    💬 0    📌 0
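
The bands described in that follow-up, quantiles across many simulation runs, can be sketched as below. The 5% and 95% defaults and the linear-interpolation quantile are illustrative choices only, not the article's actual settings.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def quantile_bands(runs, lower=0.05, upper=0.95):
    """runs: list of equal-length series, one per simulation.

    Returns one (low, median, high) tuple per time step.
    """
    bands = []
    for step_values in zip(*runs):  # all runs' values at one time step
        vals = sorted(step_values)
        bands.append((quantile(vals, lower), quantile(vals, 0.5), quantile(vals, upper)))
    return bands
```

Plotting the low and high values as a filled band around the median line gives the kind of chart discussed in the thread.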
Data Strips: Quintiles vs. Box Plots – Raw Data Studies Experiments with a new “quintile area” strip plot, prompted by skewed box plots in a biology paper, ended up clarifying why box plots remain so robust.

My venture into quintile area strip plots ended up giving me greater appreciation for the quiet strengths of box plots. #dataviz New blog post: rawdatastudies.com/2026/01/02/d...

02.01.2026 16:32 | 👍 9    🔁 0    💬 0    📌 0

That sounds like what I saw a while back in a Nature paper but still don't know a common name for it. I would guess "quantile bands" but I mainly only see that name used with bands around 2D line charts. bsky.app/profile/xang...

01.01.2026 22:45 | 👍 0    🔁 0    💬 1    📌 0

Thanks. Now I can see how the two end regions might look like one pole when they're about the same "thickness".
A transformation would help the other values look less varied (and help deal with extreme densities, too), but I suspect the ranking of heights will still be prominent. I'll keep at it...

01.01.2026 18:02 | 👍 2    🔁 0    💬 0    📌 0

To be fair to the quintile/quartile area variants, they do perform well for very skewed distributions like these bacterial samples from a PLOS Biology paper that put me on this path. journals.plos.org/plosbiology/...
Experiment yourself at xangregg.github.io/data-strips/ #dataviz

01.01.2026 17:50 | 👍 4    🔁 0    💬 2    📌 0