New paper! The Linear Representation Hypothesis is a powerful intuition for how language models work, but lacks formalization. We give a mathematical framework in which we can ask and answer a basic question: how many features can be stored under the hypothesis? 🧵 arxiv.org/abs/2602.11246
I don’t think the methods would be hard to replicate—there is code on my GitHub repo! But I’d imagine we’d need some domain expertise about what are good sources of (noisy, biased) fine-grained migration records in these countries, or about the frequency of census data. Happy to chat more!
New research is offering new insight on how Americans move — all the way to the neighborhood level.
A new dataset, MIGRATE, maps annual moves with 4,600‑times more detail than standard public data, revealing patterns hidden in county‑level reporting:
https://bit.ly/49XSD6w
We are excited to see what you do with this data, and hope to build on this work in the future. You can read our full paper at nature.com/articles/s41467-025-68019-2
This is joint work with Rachel Young, Maria Fitzpatrick, @nkgarg.bsky.social, and @emmapierson.bsky.social! 9/9
There are over 100 academic, non-profit, governmental, and journalistic research teams from all over the world already using MIGRATE to study topics across the health, environmental, and social sciences! If you are a researcher interested in data access, visit migrate.tech.cornell.edu. 8/9
MIGRATE is also useful for studying local migration trends. For example, it reveals dramatic rates of out-migration after wildfires in California that are invisible in previous Census datasets. 7/9
We found, for example, racial disparities in upward mobility: the rate at which people move to higher-income areas varies with the racial composition of their current area of residence, even after controlling for income levels. 6/9
A great advantage of MIGRATE is that its Census Block Group (CBG) level data lets us discuss socioeconomic and demographic trends more accurately. We documented national migration patterns, such as how far people move and the characteristics of the areas they move to. 5/9
MIGRATE is over 4,000x more spatially granular than publicly available migration datasets, highly correlated with Census data, and less biased than proprietary address data. We’re making it available to researchers asking all those important migration questions!! 4/9
We created and released MIGRATE: annual flows between all 47 billion pairs of US Census Block Groups. To do it, we developed an iterative-proportional-fitting based algorithm that harmonizes granular but biased proprietary data with coarser but more reliable Census data. 3/9
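For readers new to iterative proportional fitting, here is a minimal sketch of the textbook two-dimensional version. It is not the algorithm from the paper: the function name `ipf`, the toy flow table, and the target totals are all made up for illustration. The idea is to start from a granular but biased table and alternately rescale rows and columns until the margins match more reliable totals.

```python
def ipf(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Rescale `seed` so its row and column sums match the target margins.

    Textbook 2-D iterative proportional fitting; an illustrative sketch only.
    """
    table = [row[:] for row in seed]  # copy so the caller's seed is untouched
    for _ in range(iters):
        # Row step: scale each row so it sums to its target.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                factor = row_targets[i] / total
                table[i] = [value * factor for value in row]
        # Column step: scale each column so it sums to its target.
        for j in range(len(col_targets)):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = col_targets[j] / total
                for row in table:
                    row[j] *= factor
        # Columns now match exactly, so stop once the rows also match.
        max_err = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if max_err < tol:
            break
    return table


# Hypothetical toy data: a biased 2x2 origin-by-destination flow table,
# plus "trusted" totals for each origin (rows) and destination (columns).
biased_flows = [[40.0, 10.0], [5.0, 45.0]]
fitted = ipf(biased_flows, row_targets=[60.0, 40.0], col_targets=[50.0, 50.0])
```

The fitted table keeps the relative structure of the seed while matching both sets of totals, which is the general sense in which this kind of method can reconcile fine but biased data with coarser, more reliable data.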
Migration data is key for studying responses to environmental disasters, policy impacts, etc. But public, county-level migration data is too coarse to capture many important dynamics. Proprietary, individual-level address histories are highly granular, but potentially biased. 2/9
Our paper “Inferring fine-grained migration patterns across the United States” is now out in @natcomms.nature.com! We released a new, highly granular migration dataset. 1/9
November is over, but we still have some #30DayMapChallenge entries to share! And for our transport-themed day 26 map, MBTA data analyst Joe Hilleary takes us on a ride back in time: he shows current bus routes in Greater Boston by the earliest known year in which a direct precursor route ran a bus.
We have a new paper in Science Advances proposing a simple test for bias:
Is the same person treated differently when their race is perceived differently?
Specifically, we study: is the same driver likelier to be searched by police when they are perceived as Hispanic rather than white?
1/
My best workflow improvement since starting to work with spatial libraries in Python was to always include a `crs` dictionary in a variables file listing the CRS for lat-long projections, equidistant projections, and "maybe not satisfying any desiderata but the prettiest out there" projections.
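A minimal sketch of what such a dictionary could look like. The key names, the helper `get_crs`, and the specific projection choices are illustrative assumptions, not the ones from the post; swap in whatever suits your study area.

```python
# Hypothetical shared variables module, imported by every analysis script.
# The codes below are example choices, not a recommendation.
crs = {
    "latlong": "EPSG:4326",        # WGS 84 geographic coordinates, in degrees
    "equidistant": "ESRI:102005",  # USA Contiguous Equidistant Conic, in meters
    "pretty": "ESRI:54030",        # World Robinson, a compromise projection for display
}


def get_crs(kind):
    """Look up a CRS string by purpose, failing loudly on an unknown key."""
    if kind not in crs:
        raise KeyError(f"unknown CRS kind {kind!r}; expected one of {sorted(crs)}")
    return crs[kind]


# With GeoPandas installed, a GeoDataFrame `gdf` could then be reprojected with:
#   gdf.to_crs(get_crs("equidistant"))
```

Keeping the codes in one place means every script measures distances and draws maps in the same projections.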
#30DayMapChallenge day 20: water
This map of Arsenic and Cadmium levels in Mexico's water shows non-trace concentrations of Total and Soluble Arsenic and Cadmium. Points are colored by the presence of high amounts of contaminants, and sized by their relative concentration.
tinyurl.com/map20wtr
#30DayMapChallenge 15: Fire
@sylviaimani.bsky.social visualized how Uganda’s transition toward electric cooking aligns with the reach of the national grid. Regions with denser grid networks show a strong correlation with higher household adoption of electric cooking technologies.
For #30DayMapChallenge Day 18: Out of this world, use the `fill_z_offset` param in mapgl to "elevate" your data.
Just be careful - if you choose a value too high, you might lose your data in the sky!
#rstats
It took me too long to accept that "Amsterdam is just 10th Ave"
#30DayMapChallenge day 10: Air
@jessiefin.bsky.social + Francisco Marmolejo-Cossío visualize the presence of ladrilleras, or brick kilns, which emit pollution across the state. Data cleaned by Jacqueline Calderón and Lizet Jarquin at UASLP.
Full interactive map: tinyurl.com/map10-air
#30DayMapChallenge day 9 asked us to get off our screens. @annaloganmc.bsky.social's "analog" map is a hand-painted postcard! 📫
"I chose to paint a postcard of a map of Ann Arbor where I currently live showing the Huron River!" she says
tho it's very possible that the dataset misnamed a Subway or two. I filtered for the DOHMH restaurants doing business as "Subway": data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
The LIC Subway between the Queens Plaza and Queensboro Plaza stations is there, but so close to both stations (and CBs are small there) that we only see a slit of purple! That's also the only Subway in LIC per Google Maps but I wonder if that one is also not up to date.
Of course! Glad you liked it
Proposing the Subway-subway (🥪-🚇) index
Great map(s) by @jennahgosciak.bsky.social ---can we count that for 6 days of mapping??---that show both the permanence and the vulnerability of ecological concepts in our urban landscapes!
#30DayMapChallenge
We are slowly catching up to the #30DayMapChallenge!
In our day 3: polygons submission, @zhixuanqi.bsky.social questioned the boundaries and fuzziness of polygons with an animated map that invites us to think about the (not-so-well-defined) idea of neighborhoods.
Another dog map, this is 1 dog = 1 dot. And hopefully 1 day = 1 map for the next 30 days in our working group page 🗺️
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
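For context, one common way to operationalize n-gram novelty is the share of a text's n-grams that never appear in a reference corpus. The sketch below assumes that definition plus whitespace tokenization; the paper may define or compute it differently, and the function names are made up.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_novelty(generated, reference, n=2):
    """Fraction of distinct n-grams in `generated` that are absent from `reference`."""
    gen = ngrams(generated.split(), n)
    if not gen:
        return 0.0  # text shorter than n tokens: no n-grams to score
    ref = ngrams(reference.split(), n)
    return len(gen - ref) / len(gen)


# Toy example: one of the two bigrams ("cat sat") is unseen in the reference.
score = ngram_novelty("the cat sat", "the cat ran", n=2)
```

A score like this only measures surface overlap: a text can score high simply by being garbled, which is the gap between novelty and creativity that the post raises.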
New #NeurIPS2025 paper: how should we evaluate machine learning models without a large, labeled dataset? We introduce Semi-Supervised Model Evaluation (SSME), which uses labeled and unlabeled data to estimate performance! We find SSME is far more accurate than standard methods.