I made my first OSS contribution in Rust and got a PR merged into ruff 🤩
02.04.2025 04:37 — 👍 0 🔁 0 💬 0 📌 0
@maxmynter.bsky.social
Numbers, data, computer, math, ML, and sociocultural meta-commentary. In previous lives: sociologist, physicist. I shitpost as @maxmynter on the other app, but here I want to experiment with being more serious.
maxmynter.com
I indeed churned for a while.
But I’m back.
Oh no, Bluesky, I am churning
20.12.2024 12:26 — 👍 1 🔁 0 💬 0 📌 1
Is the concentration of money and power in AI (or generally tech) a problem? Yes.
But these datasets democratize development.
And we also don’t hate on conveyor belt workers in the auto industries just because the big cronies pocket the profits.
The librarian is the wrong target.
I demand deletion of this dataset as you have not obtained the consent of the post's author, triple notarized in the presence of their legal guardian, a state lawyer, Ayn Rand, and god. Plus a declaration in lieu of oath that they will not revoke this consent.
I am shook to the core by your audacity
I mean things change once you talk about commercial use. But the collection as such is fine if you comply with GDPR stuff about PII in the EU.
29.11.2024 09:35 — 👍 1 🔁 0 💬 1 📌 0
The problem here is that PII means everything that can identify a person. So it would include the posts themselves if I can search for them on Bsky to find the author.
(Not my personal opinion, but the law if you go by scripture, so it’s legally risky to use for research.)
Generally, the complexity of the concepts a model can represent scales with depth, and the amount of knowledge it can store scales with width.
So it’s possible the smaller model is a bit worse (and they just advertise it bc. it has the biggest margin).
Another reason could be reproducibility for research.
But idk, tbh.
Distillation probably.
They use the outputs of a big model as targets for a smaller one. Thus you can make it behave the same way but with fewer parameters and thus lower inference cost.
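A minimal sketch of that idea, not anyone's actual training code. It assumes one common formulation of distillation: soften both models' logits with a temperature and measure the KL divergence from the big (teacher) model's distribution to the small (student) model's. The function names, the temperature value, and the example logits are all made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's soft targets to the student's output."""
    p = softmax(teacher_logits, temperature)  # targets from the big model
    q = softmax(student_logits, temperature)  # predictions of the small model
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over three classes for a single input.
teacher = [4.0, 1.0, 0.5]
close_student = [3.9, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

Training the small model to minimise this loss pushes its output distribution toward the big model's, which is the "behave the same way with fewer parameters" part.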
Personally, i think people should stfu about data that they published to the world to be scraped.
But in GDPR everything that can be used to identify a person counts as PII. That includes pseudonyms and even the post itself if you can search for it to identify the author...
I don’t think that consent makes the data worse.
There are other use-cases in the direction of computational sociology where consent is trickier.
Yet I do believe in the right to privacy. But I find it peculiar to blame someone for compiling posts that were made publicly for the world to see.
Research should follow ethical guidelines. But collecting data people voluntarily published for the world to see is fine.
28.11.2024 20:06 — 👍 2 🔁 0 💬 0 📌 0
I think I mainly just disagree. But it's fine. I'll just take your ad hominems.
I don't think that the Librarian did anything bad. I also don't think that is what many here took issue with.
They hate what they call "tech bros" and found someone they could torch.
Yeah, it's exhausting! Unlike you, I am here under my real name.
I don't know why you gatekeep "real" research.
Have I complied with law and ethics board in my research? Yes.
Am I happy to see grassroots / citizen science initiatives? Also yes.
Is the behaviour of the mob legitimate? Fuck no.
A librarian receiving death threats for doing librarian things from people with all the progressive icons in their bio. I still have to reconcile that with my worldview somehow.
28.11.2024 19:12 — 👍 1 🔁 0 💬 1 📌 0
You know, it's hard to study radicalisation on the internet with only opt-in data because racist trolls - believe it or not - usually don't consent.
I'm not accusing you of being one, just describing the research I did.
You can still stop with your condescending smugness.
Thank you for playing, professor.
28.11.2024 19:05 — 👍 0 🔁 0 💬 1 📌 0
Also, to cite Article 89: "Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards"
Therefore, it is in principle okay to collect this data for the above-mentioned purposes.
I was a computational social scientist and worked with datasets like these more than five years ago.
I unequivocally think they are a net good to society and important for research. It is important that these datasets are made available easily to other researchers. Yes that includes AI research.
The librarian was "just" harassed. The 2nd user (AlpinDale), who posted a link to a 2M-entry HF dataset, was banned.
28.11.2024 18:56 — 👍 2 🔁 0 💬 0 📌 0
GDPR explicitly makes exceptions for datasets compiled for research purposes in Article 89.
28.11.2024 18:51 — 👍 1 🔁 0 💬 1 📌 0
These data are important sources for (computational) social science. This scraping has been standard practice for more than a decade.
E.g. research on trajectories of online radicalisation and online polarisation via posts during the first Trump presidency.
These datasets are important!
I thought Bluesky was a nicer place, but what I experienced here today almost makes me believe in horseshoe theory
28.11.2024 18:44 — 👍 1 🔁 0 💬 1 📌 0
Especially since that is effectively what I did five years ago and what has been standard across computational social science for more than a decade.
I have never before encountered this amount of hate from people I did not outright classify as right wing trolls.
I mean I like that the EU has mechanisms for data protection in place. And it's definitely a conversation worth having if big companies and VCs earn a lot from tech that is built on uncompensated labor.
Despite that, I'm shocked at how the mob behaved toward that librarian.
My life is better with them. I use them every day, and many of my colleagues and friends do too. In fact, I would have to search pretty hard to find people who don’t.
Maybe it’s not just about the iotas in your life…
Good data for open science is a fundamentally good and legitimate thing. I’ll die on this hill.
28.11.2024 18:24 — 👍 1 🔁 0 💬 2 📌 0
That doesn’t mean I like sexist tech bros, but the guy is a librarian and does what librarians do: documenting culture, archiving it, and making it freely available.
28.11.2024 18:22 — 👍 0 🔁 0 💬 0 📌 0
Ofc I have heard arguments against it. But that doesn't mean it's all bad.
I was a computational social science researcher for a while, studying online radicalisation during the first Trump term.
I use LLMs daily. These datasets are important for science and a healthy civic society.
Enables them to do what?
Good things? Good.
Bad things? Bad.
But the restriction should target the people who do bad things, not the data layer.
Bluesky is pretty upfront about being a decentralized protocol. That is its main selling point, and it means no one gets to centrally control who gets data.