
dan bateyko

@dbateyko.bsky.social

maybe the hard stuff's inside, hidden — like bones, as opposed to an exoskeleton. @CornellInfoSci https://dbateyko.info

108 Followers  |  192 Following  |  20 Posts  |  Joined: 22.11.2024

Latest posts by dbateyko.bsky.social on Bluesky

it finally happened (my 3090 overheated and emergency shut off)

01.08.2025 05:59 — 👍 1    🔁 0    💬 0    📌 0
The Rise of the Compliant Speech Platform
Content moderation is becoming a “compliance function,” with trust and safety operations run like factories and audited like investment banks.

Here’s the article. I’ve had more positive feedback on it than on things I spent a year on.

Apparently, describing a problem that thousands of Trust and Safety people are seeing, but also see the world ignoring, is a good way to win hearts and minds :)

www.lawfaremedia.org/article/the-...

01.08.2025 00:09 — 👍 30    🔁 10    💬 1    📌 0

“The possibilities of the pole” @hoctopi.bsky.social

18.07.2025 03:07 — 👍 0    🔁 0    💬 0    📌 0

Eliza asking after an anti-suffering future of reproductive technology (the piece looks incredible)

18.07.2025 03:04 — 👍 0    🔁 0    💬 1    📌 0

@kevinbaker.bsky.social “The rules appear inevitable, natural, reasonable. We forget they were drawn by human hands.”

18.07.2025 02:43 — 👍 10    🔁 0    💬 1    📌 0

At the Kernel 5 issue launch!

18.07.2025 02:43 — 👍 3    🔁 0    💬 1    📌 0

After having such a great time at #CHI2025 and #FAccT2025, I wanted to share some of my favorite recent papers here!

I'll aim to post new ones throughout the summer and will tag all the authors I can find on Bsky. Please feel welcome to chime in with thoughts / paper recs / etc.!!

🧵⬇️:

14.07.2025 17:02 — 👍 53    🔁 10    💬 2    📌 2
Kernel Magazine
Issue 5: Rules Launch Party
July 17
6:30-9pm
SF
Gray Area

The illustration behind this text is a mix of gold chess pieces and purple snakes on a chessboard pattern

lu.ma/k5-sf


i am launching a magazine with @kevinbaker.bsky.social and the rest of the reboot collective on thursday at gray area! you should be there!

open.substack.com/pub/reboothq...

15.07.2025 21:25 — 👍 14    🔁 4    💬 0    📌 0
A close-up view of the Golden Gate Bridge in the fog


I've arrived in the 🌁Bay Area🌁, where I'll be spending the summer as a research fellow at Stanford's RegLab! If you're also here, LMK and let's get a meal / go on a hike / etc!!

01.07.2025 18:58 — 👍 9    🔁 1    💬 0    📌 0
Grimmelmann, Sobel, & Stein on Generative AI and Legal Interpretation
James Grimmelmann (Cornell Law School; Cornell Tech), Benjamin Sobel (Cornell University - Cornell Tech NYC), & David Stein (Vanderbilt University - Vanderbilt Law School) have posted Generative Misin...

Well, this was a nice surprise. Download it while it’s hot!

lsolum.typepad.com/legaltheory/...

02.07.2025 17:28 — 👍 5    🔁 1    💬 0    📌 0
"Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments"

Conducting disparity assessments at regular time intervals is critical for surfacing potential biases in decision-making and improving outcomes across demographic groups. Because disparity assessments fundamentally depend on the availability of demographic information, their efficacy is limited by the availability and consistency of available demographic identifiers. While prior work has considered the impact of missing data on fairness, little attention has been paid to the role of delayed demographic data. Delayed data, while eventually observed, might be missing at the critical point of monitoring and action -- and delays may be unequally distributed across groups in ways that distort disparity assessments. We characterize such impacts in healthcare, using electronic health records of over 5M patients across primary care practices in all 50 states. Our contributions are threefold. First, we document the high rate of race and ethnicity reporting delays in a healthcare setting and demonstrate widespread variation in rates at which demographics are reported across different groups. Second, through a set of retrospective analyses using real data, we find that such delays impact disparity assessments and hence conclusions made across a range of consequential healthcare outcomes, particularly at more granular levels of state-level and practice-level assessments. Third, we find limited ability of conventional methods that impute missing race in mitigating the effects of reporting delays on the accuracy of timely disparity assessments. Our insights and methods generalize to many domains of algorithmic fairness where delays in the availability of sensitive information may confound audits, thus deserving closer attention within a pipeline-aware machine learning framework.
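The retrospective analysis the abstract describes can be illustrated with a small sketch: compute a disparity metric using only the demographic labels reported by the assessment date, then recompute it once late-arriving labels are in, and compare. This is a toy illustration with synthetic data and hypothetical column names, not the paper's code or dataset.

```python
# Minimal sketch (not the authors' code or data): how reporting delays can
# distort a timely disparity assessment. Columns and the synthetic generating
# process are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
race = rng.choice(["A", "B"], size=n, p=[0.7, 0.3])
# Group B has a lower rate of the favorable outcome.
outcome = rng.binomial(1, np.where(race == "A", 0.5, 0.4))
# Race labels for group B patients with the unfavorable outcome arrive latest.
delay_days = rng.exponential(np.where((race == "B") & (outcome == 0), 150, 30))
df = pd.DataFrame({"race": race, "outcome": outcome, "delay_days": delay_days})

def gap(frame: pd.DataFrame) -> float:
    """Outcome-rate gap (A minus B) among rows whose race label is available."""
    rates = frame.groupby("race")["outcome"].mean()
    return rates["A"] - rates["B"]

window = 60  # days between the data snapshot and the disparity assessment
print(f"timely assessment (labels reported within {window} days): {gap(df[df.delay_days <= window]):.3f}")
print(f"retrospective assessment (all labels eventually observed): {gap(df):.3f}")
```

In this toy setup the timely assessment understates the true gap, because the group whose labels arrive late is exactly the group whose unfavorable outcomes go uncounted at monitoring time.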

"Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments" Conducting disparity assessments at regular time intervals is critical for surfacing potential biases in decision-making and improving outcomes across demographic groups. Because disparity assessments fundamentally depend on the availability of demographic information, their efficacy is limited by the availability and consistency of available demographic identifiers. While prior work has considered the impact of missing data on fairness, little attention has been paid to the role of delayed demographic data. Delayed data, while eventually observed, might be missing at the critical point of monitoring and action -- and delays may be unequally distributed across groups in ways that distort disparity assessments. We characterize such impacts in healthcare, using electronic health records of over 5M patients across primary care practices in all 50 states. Our contributions are threefold. First, we document the high rate of race and ethnicity reporting delays in a healthcare setting and demonstrate widespread variation in rates at which demographics are reported across different groups. Second, through a set of retrospective analyses using real data, we find that such delays impact disparity assessments and hence conclusions made across a range of consequential healthcare outcomes, particularly at more granular levels of state-level and practice-level assessments. Third, we find limited ability of conventional methods that impute missing race in mitigating the effects of reporting delays on the accuracy of timely disparity assessments. Our insights and methods generalize to many domains of algorithmic fairness where delays in the availability of sensitive information may confound audits, thus deserving closer attention within a pipeline-aware machine learning framework.

Figure contrasting a conventional approach to conducting disparity assessments, which is static, to the analysis we conduct in this paper. Our analysis (1) uses comprehensive health data from over 1,000 primary care practices and 5 million patients across the U.S., (2) timestamped information on the reporting of race to measure delay, and (3) retrospective analyses of disparity assessments under varying levels of delay.


I am presenting a new 📝 “Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments” at @facct.bsky.social on Thursday, with @aparnabee.bsky.social, Derek Ouyang, @allisonkoe.bsky.social, @marzyehghassemi.bsky.social, and Dan Ho. 🔗: arxiv.org/abs/2506.13735
(1/n)

24.06.2025 14:51 — 👍 13    🔁 4    💬 1    📌 3
A screenshot of our paper's:

Title: A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms
Authors: Emma Harvey, Rene Kizilcec, Allison Koenecke
Abstract: Increasingly, individuals who engage in online activities are expected to interact with large language model (LLM)-based chatbots. Prior work has shown that LLMs can display dialect bias, which occurs when they produce harmful responses when prompted with text written in minoritized dialects. However, whether and how this bias propagates to systems built on top of LLMs, such as chatbots, is still unclear. We conduct a review of existing approaches for auditing LLMs for dialect bias and show that they cannot be straightforwardly adapted to audit LLM-based chatbots due to issues of substantive and ecological validity. To address this, we present a framework for auditing LLM-based chatbots for dialect bias by measuring the extent to which they produce quality-of-service harms, which occur when systems do not work equally well for different people. Our framework has three key characteristics that make it useful in practice. First, by leveraging dynamically generated instead of pre-existing text, our framework enables testing over any dialect, facilitates multi-turn conversations, and represents how users are likely to interact with chatbots in the real world. Second, by measuring quality-of-service harms, our framework aligns audit results with the real-world outcomes of chatbot use. Third, our framework requires only query access to an LLM-based chatbot, meaning that it can be leveraged equally effectively by internal auditors, external auditors, and even individual users in order to promote accountability. To demonstrate the efficacy of our framework, we conduct a case study audit of Amazon Rufus, a widely-used LLM-based chatbot in the customer service domain. Our results reveal that Rufus produces lower-quality responses to prompts written in minoritized English dialects.
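A rough sketch of the audit loop the abstract describes, under query-only access: generate prompts per dialect, query the chatbot, score response quality, and compare means across dialects. The `query_chatbot` and `score_quality` callables below are hypothetical placeholders, not the paper's implementation or any real chatbot API.

```python
# Minimal sketch of a query-access dialect audit (not the authors' code).
from statistics import mean
from typing import Callable

def audit_dialect_gap(
    prompts_by_dialect: dict[str, list[str]],
    query_chatbot: Callable[[str], str],
    score_quality: Callable[[str, str], float],
) -> dict[str, float]:
    """Return the mean response-quality score per dialect."""
    results = {}
    for dialect, prompts in prompts_by_dialect.items():
        scores = []
        for prompt in prompts:
            response = query_chatbot(prompt)           # query access only
            scores.append(score_quality(prompt, response))
        results[dialect] = mean(scores)
    return results

# Usage with toy stand-ins for the chatbot and the quality metric:
if __name__ == "__main__":
    prompts = {
        "SAE": ["Does this jacket run small?"],
        "AAE": ["This jacket be running small?"],
    }
    gap = audit_dialect_gap(
        prompts,
        query_chatbot=lambda p: "It fits true to size.",  # placeholder chatbot
        score_quality=lambda p, r: float(len(r) > 0),     # placeholder metric
    )
    print(gap)
```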


I am so excited to be in 🇬🇷Athens🇬🇷 to present "A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms" by me, @kizilcec.bsky.social, and @allisonkoe.bsky.social, at #FAccT2025!!

🔗: arxiv.org/pdf/2506.04419

23.06.2025 14:44 — 👍 30    🔁 10    💬 1    📌 2
"Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese" Abstract:

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (this https URL).
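The "regional term choice" task can be sketched as a small harness that prompts a model in each script variant and checks for the regionally expected term. The prompts, the item list, and `query_model` below are illustrative placeholders rather than the released benchmark; the single item mirrors the pineapple example in the figures (Mainland Simplified 菠萝 vs. Taiwan Traditional 鳳梨).

```python
# Illustrative harness for the regional term choice task (not the paper's benchmark).
from typing import Callable

ITEMS = [
    {
        "zh-Hans": {"prompt": "这种外皮有刺、果肉黄色的热带水果叫什么？", "expected": "菠萝"},
        "zh-Hant": {"prompt": "這種外皮有刺、果肉黃色的熱帶水果叫什麼？", "expected": "鳳梨"},
    },
]

def regional_term_accuracy(query_model: Callable[[str], str]) -> dict[str, float]:
    """Fraction of items answered with the regionally expected term, per script variant."""
    correct = {"zh-Hans": 0, "zh-Hant": 0}
    for item in ITEMS:
        for variant, spec in item.items():
            answer = query_model(spec["prompt"])
            correct[variant] += int(spec["expected"] in answer)
    return {variant: hits / len(ITEMS) for variant, hits in correct.items()}

# Usage with a placeholder model that always answers in Simplified Chinese:
print(regional_term_accuracy(lambda prompt: "这是菠萝。"))
```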

"Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese" Abstract: While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (this https URL).

Figure showing that three different LLMs (GPT-4o, Qwen-1.5, and Taiwan-LLM) may answer a prompt about pineapples differently when asked in Simplified Chinese vs. Traditional Chinese.


Figure showing that LLMs disproportionately answer questions about regional-specific terms (like the word for "pineapple," which differs in Simplified and Traditional Chinese) correctly when prompted in Simplified Chinese as opposed to Traditional Chinese.


Figure showing that LLMs vary widely in how well they adhere to prompt instructions, favoring Traditional Chinese names over Simplified Chinese names in a benchmark task regarding hiring.


🎉Excited to present our paper tomorrow at @facct.bsky.social, “Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese”, with @brucelyu17.bsky.social, Jiebo Luo and Jian Kang, revealing 🤖 LLM performance disparities. 📄 Link: arxiv.org/abs/2505.22645

22.06.2025 21:15 — 👍 17    🔁 4    💬 1    📌 3

I am at FAccT 2025 in Athens. Feel free to grab me if you want to chat.

23.06.2025 08:43 — 👍 8    🔁 1    💬 0    📌 1

Please come see us at the RC Trust Networking Event!
You can sign up with the QR Codes around the venues and get some free drinks! 🙂‍↕️

#FAccT2025

23.06.2025 13:23 — 👍 4    🔁 3    💬 1    📌 0

New paper available: "Bureaucratic Backchannel: How r/PatentExaminer Navigates #AI Governance," which investigates how examiners navigate their dual roles through a qualitative analysis of a Reddit community where U.S. Patent & Trademark Office employees discuss their work 🔗📜👇

06.06.2025 20:37 — 👍 2    🔁 1    💬 1    📌 0
Extracting memorized pieces of (copyrighted) books from open-weight language models
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expr...

Llama 3.1 70B contains copies of nearly the entirety of some books. Harry Potter is just one of them. I don’t know if this means it’s an infringing copy. But the first question to answer is whether it’s a copy in the first place. That’s what our new results suggest:

arxiv.org/abs/2505.12546
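A much simpler proxy than the paper's probabilistic extraction analysis, just to make the "is it a copy?" question concrete: feed the model a prefix from a text and check whether greedy decoding reproduces the true continuation verbatim. The model name and lengths below are placeholders (the paper studies Llama 3.1 70B on full books).

```python
# Rough proxy, not the paper's method: does greedy decoding of a prefix
# regenerate the true continuation token-for-token?
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in the open-weight model under study
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def reproduces_verbatim(text: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    """True if greedy decoding of the first `prefix_len` tokens regenerates the next `suffix_len`."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len].unsqueeze(0)
    target = ids[prefix_len:prefix_len + suffix_len].tolist()
    out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
    generated = out[0, prefix_len:prefix_len + suffix_len].tolist()
    return generated == target
```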

21.05.2025 11:20 — 👍 52    🔁 24    💬 4    📌 4

Another hallucinated citation in court. At this point, our tracker is up to ~70 such cases worldwide, including hallucinations from 2 adjudicators.

New Case: storage.courtlistener.com/recap/gov.us...

Tracker: www.polarislab.org/ai-law-track...

21.05.2025 22:41 — 👍 6    🔁 3    💬 0    📌 0
American Dragnet | Data-Driven Deportation in the 21st Century
One of two American adults is in a law enforcement face recognition database. An investigation.

Three years ago, we released “American Dragnet: Data-Driven Deportation in the 21st Century.” The report describes the surveillance apparatus that Trump is using to target immigrants, activists and anyone else who challenges his agenda. We’re re-releasing it today with a new foreword.

15.05.2025 14:49 — 👍 20    🔁 14    💬 1    📌 2

YES!!! Congrats

10.05.2025 02:52 — 👍 1    🔁 0    💬 0    📌 0

Had a great time presenting this paper, cowritten with Sireesh Gururaja and Lucy Suchman!

Paper draft here: arxiv.org/abs/2411.17840

01.05.2025 19:45 — 👍 4    🔁 2    💬 0    📌 0

OK this keeps getting better. It’s not just that the FTC is moderating content uploaded in user comments as part of its “platform censorship” inquiry. It’s re-moderating the same content the platforms moderated, as @corbinkbarthold.bsky pointed out.

x.com/corbinkbarth...

1/

25.04.2025 16:59 — 👍 32    🔁 18    💬 1    📌 0
Science and Causality in Technology Litigation | Journal of Online Trust and Safety

From concerns about social media addiction to urgent civil liberties issues, courts are asking scientists to be arbiters of alleged technology harms. How can scientists reliably inform courts and how can courts interpret our work?

New article with @penney.bsky.social
tsjournal.org/index.php/jo...

29.04.2025 15:23 — 👍 23    🔁 10    💬 2    📌 1
Fixing the science of digital technology harms
Technology development outpaces scientific assessment of impacts

Why do scientists still struggle to answer basic questions about the safety of digital tech, from AI to social media, even as families point with concern & grief to rising evidence of individual harm?

@orbenamy.bsky.social & I have a new article in @science.org:

www.science.org/doi/10.1126/...

10.04.2025 20:23 — 👍 79    🔁 37    💬 5    📌 3

"surveillance deputies"(Brayne, Lageson & Levy, 2023).

02.04.2025 19:14 — 👍 59    🔁 24    💬 2    📌 0

Trying something new:
A 🧵 on a topic I find many students struggle with: "why do their 📊 look more professional than my 📊?"

It's *lots* of tiny decisions that aren't the defaults in many libraries, so let's break down 1 simple graph by @jburnmurdoch.bsky.social

🔗 www.ft.com/content/73a1...
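For readers who want to try this in code: a few of the kinds of non-default choices the thread walks through, sketched in matplotlib. The specific choices below are illustrative, not the thread's exact checklist.

```python
# Illustrative only: a handful of non-default styling decisions on a toy line chart.
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(2000, 2021)
series = {"Group A": 2 + 0.08 * (x - 2000), "Group B": 2 + 0.03 * (x - 2000)}

fig, ax = plt.subplots(figsize=(7, 4))
for name, y in series.items():
    ax.plot(x, y, linewidth=2)
    ax.text(x[-1] + 0.3, y[-1], name, va="center")  # label lines directly, no legend box

for side in ("top", "right"):
    ax.spines[side].set_visible(False)              # drop chart junk
ax.grid(axis="y", linewidth=0.3, alpha=0.6)         # light horizontal gridlines only
ax.set_axisbelow(True)                              # keep gridlines behind the data
ax.set_title("Say what the chart shows, not the variable name", loc="left")
fig.tight_layout()
plt.show()
```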

20.11.2024 17:02 — 👍 1603    🔁 469    💬 97    📌 99

@kennypeng.bsky.social also built a website to explore results on Yelp, headlines, & Congress datasets: hypothesaes.org.

You can see every SAE neuron in UMAP space, colored by whether the neuron correlates positively or negatively with the target variable. 8/

18.03.2025 15:17 — 👍 4    🔁 1    💬 1    📌 0

💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets.

Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/
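The core idea can be sketched in a few lines of PyTorch (this is not the HypotheSAEs package API): fit a sparse autoencoder on text embeddings, then rank neurons by how strongly their activations correlate with the target variable; in the paper, an LLM then labels the top neurons in natural language. The embeddings and target below are random placeholders.

```python
# Sketch of the core idea, not the HypotheSAEs package itself.
import numpy as np
import torch
import torch.nn as nn

d_embed, d_hidden, n = 384, 256, 1_000
X = torch.randn(n, d_embed)   # stand-in for text embeddings
y = np.random.rand(n)         # stand-in for the target variable (e.g. engagement)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hid: int):
        super().__init__()
        self.enc, self.dec = nn.Linear(d_in, d_hid), nn.Linear(d_hid, d_in)
    def forward(self, x):
        acts = torch.relu(self.enc(x))   # nonnegative neuron activations
        return self.dec(acts), acts

sae = SparseAutoencoder(d_embed, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    recon, acts = sae(X)
    loss = ((recon - X) ** 2).mean() + 1e-3 * acts.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

# Rank neurons by |correlation| with the target; the top neurons' top-activating
# examples become candidate hypotheses to be described in natural language.
acts = sae(X)[1].detach().numpy()
corrs = np.nan_to_num([np.corrcoef(acts[:, j], y)[0, 1] for j in range(d_hidden)])
print("most predictive neurons:", np.argsort(-np.abs(corrs))[:5])
```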

18.03.2025 15:17 — 👍 40    🔁 13    💬 1    📌 3
Wired is dropping paywalls for FOIA-based reporting. Others should follow
As the administration does its best to hide public records from the public, Wired magazine is stepping up to help stem the secrecy

They're called public records for a reason. Starting today, WIRED will *stop paywalling* articles that are primarily based on public records obtained through the Freedom of Information Act, becoming the first publication to partner with @freedom.press to offer this for our new coverage.

18.03.2025 13:11 — 👍 92862    🔁 23820    💬 1668    📌 2116
A screenshot of our paper:

Title: “Don’t Forget the Teachers”: Towards an Educator-Centered Understanding of Harms from Large Language Models in Education

Authors: Emma Harvey, Allison Koenecke, Rene Kizilcec

Abstract: Education technologies (edtech) are increasingly incorporating new features built on LLMs, with the goals of enriching the processes of teaching and learning and ultimately improving learning outcomes. However, it is still too early to understand the potential downstream impacts of LLM-based edtech. Prior attempts to map the risks of LLMs have not been tailored to education specifically, even though it is a unique domain in many respects: from its population (students are often children, who can be especially impacted by technology) to its goals (providing the ‘correct’ answer may be less important than understanding how to arrive at an answer) to its implications for higher-order skills that generalize across contexts (e.g. critical thinking and collaboration). We conducted semi-structured interviews with six edtech providers representing leaders in the K-12 space, as well as a diverse group of 23 educators with varying levels of experience with LLM-based edtech. Through a thematic analysis, we explored how each group is anticipating, observing, and accounting for potential harms from LLMs in education. We find that, while edtech providers focus primarily on mitigating technical harms, i.e. those that can be measured based solely on LLM outputs themselves, educators are more concerned about harms that result from the broader impacts of LLMs, i.e. those that require observation of interactions between students, educators, school systems, and edtech to measure. Overall, we (1) develop an education-specific overview of potential harms from LLMs, (2) highlight gaps between conceptions of harm by edtech providers and those by educators, and (3) make recommendations to facilitate the centering of educators in the design and development of edtech tools.


✨New Work✨ by me, @allisonkoe.bsky.social, and @kizilcec.bsky.social forthcoming at #CHI2025:

"Don't Forget the Teachers": Towards an Educator-Centered Understanding of Harms from Large Language Models in Education

🔗: arxiv.org/pdf/2502.14592

13.03.2025 16:07 — 👍 54    🔁 8    💬 1    📌 6
