Mäx

@maxmynter.bsky.social

Numbers, data, computer, math, ML, and sociocultural meta-commentary. In previous lives: sociologist, physicist. I shitpost as @maxmynter on the other app, but here I want to experiment with being more serious. maxmynter.com

103 Followers  |  258 Following  |  59 Posts  |  Joined: 29.09.2023

Latest posts by maxmynter.bsky.social on Bluesky

I made my first OSS contribution in Rust and got a PR merged into ruff 🤩

02.04.2025 04:37 — 👍 0    🔁 0    💬 0    📌 0

I indeed churned for a while.

But I’m back.

31.03.2025 03:50 — 👍 2    🔁 0    💬 0    📌 0

Oh no, bluesky, i am churning

20.12.2024 12:26 — 👍 1    🔁 0    💬 0    📌 1

Is the concentration of money and power in AI (or generally tech) a problem? Yes.

But these datasets democratize development.
And we also don’t hate on conveyor belt workers in the auto industry just because the big cronies pocket the profits.

The librarian is the wrong target.

30.11.2024 00:06 — 👍 2    🔁 0    💬 1    📌 0

I demand deletion of this dataset as you have not obtained the consent of the post’s author, triple notarized in the presence of their legal guardian, a state lawyer, Ayn Rand, and god. Plus a declaration in lieu of oath that they will not revoke this consent.

I am shook to the core about your audacity

29.11.2024 10:42 — 👍 4    🔁 0    💬 1    📌 0

I mean things change once you talk about commercial use. But the collection as such is fine if you comply with GDPR stuff about PII in the EU.

29.11.2024 09:35 — 👍 1    🔁 0    💬 1    📌 0

The problem here is that PII means everything that can identify a person. So it would include the posts themselves if I can search for them on Bsky to find the author.

(Not my personal opinion, but that’s the law if you go by the letter, so it’s legally risky to use for research.)

28.11.2024 21:21 — 👍 2    🔁 0    💬 0    📌 0

Generally, complexity of concepts scales with depth and knowledge mass with width.

So it’s possible the smaller model is a bit worse (and they just advertise it bc it has the biggest margin).

Another reason could be reproducibility for research.

But idk, tbh.

28.11.2024 21:18 — 👍 2    🔁 0    💬 0    📌 0

Distillation probably.

They use the outputs of a big model as targets for a smaller one. That way you can make it behave much the same way but with fewer parameters and thus lower inference cost.

28.11.2024 21:02 — 👍 4    🔁 0    💬 1    📌 0
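
To make the distillation idea above concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any particular lab trains its models: the function names, the temperature value, and the toy logits are all assumptions made for this example. The student is scored by the cross-entropy between its softened output distribution and the teacher's, so the loss is lowest when the student reproduces the teacher's outputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's.

    The teacher's probabilities act as soft targets for the student.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Toy logits for a 3-class problem (made up for illustration).
teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
mismatched_student = [0.5, 1.0, 4.0]

loss_match = distillation_loss(matching_student, teacher)
loss_mismatch = distillation_loss(mismatched_student, teacher)
print(loss_match < loss_mismatch)
```

In a real setup the student's parameters would be updated by gradient descent on this loss, often mixed with the ordinary loss on ground-truth labels.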

Personally, I think people should stfu about data that they published to the world to be scraped.

But in GDPR everything that can be used to identify a person counts as PII. That includes pseudonyms and even the post itself if you can search for it to identify the author...

28.11.2024 20:58 — 👍 3    🔁 0    💬 1    📌 0

I don’t think that consent makes the data worse.

There are other use-cases in the direction of computational sociology where consent is trickier.

Yet I do believe in the right to privacy. But I find it peculiar to blame someone for compiling posts that were made publicly for the world to see.

28.11.2024 20:38 — 👍 2    🔁 0    💬 1    📌 0

Research should follow ethical guidelines. But collecting data people voluntarily published for the world to see is fine.

28.11.2024 20:06 — 👍 2    🔁 0    💬 0    📌 0

I think I mainly just disagree. But it's fine. I'll just take your ad hominems.

I don't think that the librarian did anything bad. I also don't think that is what many here took issue with.

They hate what they call "tech bros" and found someone they could torch.

28.11.2024 19:49 — 👍 0    🔁 0    💬 1    📌 0

Yeah, it's exhausting! Unlike you, I am here under my real name.

I don't know why you gatekeep "real" research.

Have I complied with the law and the ethics board in my research? Yes.

Am I happy to see grassroots / citizen science initiatives? Also yes.

Is the behaviour of the mob legitimate? Fuck no.

28.11.2024 19:32 — 👍 0    🔁 0    💬 1    📌 0

A librarian receiving death threats for doing librarian things from people with all the progressive icons in their bio. I still have to reconcile that with my worldview somehow.

28.11.2024 19:12 — 👍 1    🔁 0    💬 1    📌 0

You know, it's hard to study radicalisation on the internet with only opt-in data because racist trolls, believe it or not, usually don't consent.

I'm not accusing you of being one, just describing the research I did.

You can still stop with your condescending smugness.

28.11.2024 19:11 — 👍 1    🔁 0    💬 1    📌 0

Thank you for playing, professor.

28.11.2024 19:05 — 👍 0    🔁 0    💬 1    📌 0

Also, to cite Article 89: "Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards"

Therefore, it is in principle okay to collect this data for the above-mentioned purposes.

28.11.2024 19:04 — 👍 0    🔁 0    💬 1    📌 0

I was a computational social scientist and worked with datasets like these more than five years ago.

I unequivocally think they are a net good to society and important for research. It is important that these datasets are made available easily to other researchers. Yes that includes AI research.

28.11.2024 19:00 — 👍 2    🔁 0    💬 2    📌 0

The librarian was "just" harassed. The 2nd user (AlpinDale), who posted a link to a 2M-entry HF dataset, was banned.

28.11.2024 18:56 — 👍 2    🔁 0    💬 0    📌 0

GDPR explicitly makes exceptions for datasets compiled for research purposes in Article 89.

28.11.2024 18:51 — 👍 1    🔁 0    💬 1    📌 0

These data are important sources for (computational) social science. This scraping has been standard practice for more than a decade.

E.g. research on trajectories of online radicalisation and online polarisation via posts during the first Trump presidency.

These datasets are important!

28.11.2024 18:47 — 👍 0    🔁 0    💬 0    📌 0

I thought Bluesky was a nicer place, but what I experienced here today almost makes me believe in horseshoe theory

28.11.2024 18:44 — 👍 1    🔁 0    💬 1    📌 0

Especially since that is effectively what I did five years ago and what has been standard across computational social science for more than a decade.

I have never before encountered this amount of hate from people I did not outright classify as right wing trolls.

28.11.2024 18:41 — 👍 1    🔁 0    💬 1    📌 0

I mean I like that the EU has mechanisms for data protection in place. And it's definitely a conversation worth having if big companies and VCs earn a lot from tech that is built on uncompensated labor.

Despite that, I'm shocked at how the mob behaved toward that librarian.

28.11.2024 18:40 — 👍 3    🔁 0    💬 1    📌 0

My life is better with them. I use them every day, and many of my colleagues and friends do too. In fact, I would have to search pretty hard to find people who don’t.

Maybe it’s not just about the iotas in your life…

28.11.2024 18:26 — 👍 0    🔁 0    💬 0    📌 0

Good data for open science is a fundamentally good and legitimate thing. I’ll die on this hill.

28.11.2024 18:24 — 👍 1    🔁 0    💬 2    📌 0

That doesn’t mean I like sexist tech bros, but the guy is a librarian and does what librarians do: documenting culture, archiving it, and making it freely available.

28.11.2024 18:22 — 👍 0    🔁 0    💬 0    📌 0

Ofc I have heard arguments against it. But that doesn’t mean it’s all bad.

I was a computational social science researcher for a while, studying online radicalisation during the first Trump term.

I use LLMs daily. These datasets are important for science and a healthy civic society.

28.11.2024 18:20 — 👍 0    🔁 0    💬 1    📌 0

Enables them to do what?

Good things? Good.

Bad things? Bad.

But it's the people who do bad things who should be restricted, not the data layer.

Bluesky is pretty upfront about being a decentralized protocol, its main selling point, which means no one gets to centrally control who gets data.

28.11.2024 12:23 — 👍 7    🔁 0    💬 1    📌 0

@maxmynter is following 17 prominent accounts