Collin Berke

@collinberke.bsky.social

Media Research Analyst | #rstats | data enthusiast | news, sports, and podcast aficionado Website: https://www.collinberke.com/ GitHub: https://github.com/collinberke LinkedIn: https://www.linkedin.com/in/collinberke/

108 Followers  |  218 Following  |  174 Posts  |  Joined: 02.12.2024  |  2.1587

Latest posts by collinberke.bsky.social on Bluesky

Post image

Note to self: It's been a while since I've needed to do back indexing to return a single row of data while using #RStats.

So, here's a Base R and dplyr refresher 👇
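
A minimal sketch of such a refresher, assuming "back indexing" here means selecting a single row by counting from the end of a data frame. The post's image is not reproduced, so the toy data frame `df` and its columns are illustrative assumptions, not the original example:

```r
# Hypothetical toy data: five rows to index from the back
df <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

# Base R: nrow() gives the position of the last row
last_base <- df[nrow(df), ]
second_last_base <- df[nrow(df) - 1, ]

# Base R alternative: tail() keeps the last n rows
last_tail <- tail(df, 1)

# dplyr equivalents (only run if dplyr is installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  last_dplyr <- dplyr::slice_tail(df, n = 1)
  second_last_dplyr <- dplyr::slice(df, dplyr::n() - 1)
}
```

Keep in mind that a negative index such as `df[-1, ]` drops the first row in R rather than counting from the end, which is why the sketch leans on `nrow()` and `n()`.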

#dataBS

01.10.2025 20:34 | 👍 5    🔁 1    💬 1    📌 0
Preview
Comprehension Debt: The Ticking Time Bomb of LLM-Generated Code An effect that's being more and more widely reported is the increase in time it's taking developers to modify or fix code that was generated by Large Language Models. If you've wo…

Interesting framing of the use of LLMs for coding in a post I recently bumped into: codemanship.wordpress.com/2025/09/30/c... (via news.ycombinator.com/item?id=4542...).

I'm certainly not AI-averse when performing coding tasks. But, 'Comprehension Debt' is something to be mindful of.

30.09.2025 17:24 | 👍 0    🔁 0    💬 0    📌 0
Preview
data.tree: General Purpose Hierarchical Data Structure Create tree structures from hierarchical data, and traverse the tree in various orders. Aggregate, cumulate, print, plot, convert to and from data.frame and more. Useful for decision trees, machine le...

I just came across the `data.tree` #RStats package:

📦: cran.r-project.org/web/packages...
🧑‍💻: github.com/gluc/data.tree
📚: cran.r-project.org/web/packages...

I was looking for a way to create simple tree diagrams of various outcomes and probabilities. Any #dataBS folks have other suggestions?

29.09.2025 17:07 | 👍 7    🔁 0    💬 1    📌 0

If you're still following this thread, I highly suggest checking out the full post: www.counting-stuff.com/data-cleanin....

It's full of other great takeaways.

9/9

28.09.2025 04:59 | 👍 3    🔁 0    💬 0    📌 0

"Unless you know what people are going to use the data for, you won't know. Even if you know, you'd still must guess at what is likely to help their analysis and not hurt it. By far the most important thing to do is to fully document every cleaning decision ..."

8/9

28.09.2025 04:59 | 👍 0    🔁 0    💬 1    📌 0

I'm a big fan of the practical advice at the end:
* Save a copy of the original data; avoid permanent changes.
* Leave a paper trail.
* Fix what's needed to get your analysis to work.
* Reduce unwanted variation.
* Eliminate bias where possible.

7/9

28.09.2025 04:59 | 👍 1    🔁 0    💬 1    📌 0

Data practitioners are not safe from change, which is the result of operating in environments utilizing modern distributed-systems infrastructure. Thus, cleaning operations will be constantly revisited.

6/9

28.09.2025 04:59 | 👍 1    🔁 0    💬 1    📌 0

"The problem with automation is that it can divorce understanding the underlying data from the analysis and interpretation of the data."

5/9

28.09.2025 04:59 | 👍 1    🔁 0    💬 1    📌 0

"If you delegate that responsibility someone else [data cleaning], whether it's another human or a machine, you put yourself at risk of doing something dangerous with your data."

4/9

28.09.2025 04:59 | 👍 2    🔁 0    💬 1    📌 0

"The real reason for all this cleaning work is the signal-to-noise ratio in raw data is too poor for purpose we intend. We need to improve data quality to amp the signal for our tools to find."

3/9

28.09.2025 04:59 | 👍 1    🔁 1    💬 1    📌 0

"We're doing cleaning because we want to extract the useful signal from the noise, and we decide certain bits of noise “correctable” at the data point level for that purpose."

2/9

28.09.2025 04:59 | 👍 1    🔁 0    💬 1    📌 0
Preview
Data Cleaning IS Analysis, Not Grunt Work Also, most data cleaning articles suck

I got introduced to @randyau.com's 'Data Cleaning IS Analysis, Not Grunt Work' post during the #dataBS Conf this week: www.counting-stuff.com/data-cleanin... . I just finished it, and it was a great read.

Here are some quotes and thoughts I'm walking away with 👇

1/9 #RStats

28.09.2025 04:59 | 👍 47    🔁 11    💬 3    📌 0

Probably should have mentioned the date and time:

October 16, 2025, 9am PT / 12pm ET

26.09.2025 15:14 | 👍 1    🔁 0    💬 0    📌 0
How to use pointblank to understand, validate, and document your data – R Consortium

Just registered for the @rconsortium.bsky.social's 'How to use `pointblank` to understand, validate, and document your data' online workshop. I'm looking forward to learning more from @richmeister.bsky.social.

Sign up here: r-consortium.org/webinars/how...

#RStats #DataBS

26.09.2025 15:13 | 👍 5    🔁 1    💬 2    📌 0

Perhaps to give the user the option to write their own custom handling behavior? 🤷‍♂️

26.09.2025 13:35 | 👍 0    🔁 0    💬 0    📌 0

The 'just work without error' design decisions are something I appreciate from time to time.

I'll also never stop loving finding those little things base R has already solved. You just have to explore a little bit.

25.09.2025 22:17 | 👍 3    🔁 0    💬 1    📌 0

TIL: #Rstats base::system.file() has a `mustWork` argument. When set to `TRUE`, it throws an error if no file is found. Useful.
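
A small sketch of that behavior, assuming the built-in `stats` package as the lookup target. The file name `no-such-file.txt` is a hypothetical placeholder chosen because it should not exist:

```r
# Default (mustWork = FALSE): a missing file yields an empty string
missing_path <- system.file("no-such-file.txt", package = "stats")

# mustWork = TRUE: the same lookup raises an error, caught here with tryCatch()
result <- tryCatch(
  system.file("no-such-file.txt", package = "stats", mustWork = TRUE),
  error = function(e) "lookup failed"
)

# An existing file resolves to its full path either way
desc_path <- system.file("DESCRIPTION", package = "stats", mustWork = TRUE)
```

Failing early with `mustWork = TRUE` avoids passing an empty path on to a later read call, where the resulting error tends to be harder to trace.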

25.09.2025 21:23 | 👍 5    🔁 0    💬 2    📌 0

I often go beyond 80 characters per line ...

The world continues to turn.

24.09.2025 16:30 | 👍 7    🔁 0    💬 1    📌 0

Great resource! 👀
Thanks, @emilhvitfeldt.bsky.social!

24.09.2025 16:26 | 👍 2    🔁 0    💬 0    📌 0
Preview
ggplot2 4.0.0 A new major version of ggplot2 has been released on CRAN. Find out what is new here.

I am beyond excited to announce that ggplot2 4.0.0 has just landed on CRAN.

It's not every day we have a new major #ggplot2 release, but it is a fitting 18th birthday present for the package.

Get an overview of the release in this blog post and be on the lookout for more in-depth posts #rstats

11.09.2025 11:20 | 👍 848    🔁 282    💬 9    📌 51

Clearly I don't know how link shortening works on Bluesky. So, if anyone has trouble opening the above links, just send me a DM and I'll point you in the right direction.

10.09.2025 16:49 | 👍 0    🔁 0    💬 0    📌 0

Seeing you have marketing listed in your description, @drsundar.bsky.social, you might find this dataset useful: developers.google.com/analytics/bi...

I also wrote a blog post on this topic. Check it out: www.collinberke.com/blog/posts/2....

10.09.2025 16:47 | 👍 2    🔁 0    💬 2    📌 0
Preview
BigQuery | AI data platform | Lakehouse | EDW BigQuery is the autonomous data and AI platform, automating the entire data lifecycle so you can go from data to AI to action faster.

Yes, of course! BigQuery is a cloud-based data warehouse: cloud.google.com/bigquery. bigrquery is an #rstats package that provides an interface between BigQuery and R (e.g., for downloading data stored in the warehouse). BigQuery also has public datasets to experiment with: cloud.google.com/bigquery/pub...

10.09.2025 16:42 | 👍 1    🔁 0    💬 1    📌 0
Preview
Bqs write by meztez · Pull Request #79 · meztez/bigrquerystorage beta test bq write

+1 for bigrquerystorage to support the BigQueryWrite interface. Keeping an eye on the package's development:

👀: github.com/meztez/bigrq...

10.09.2025 15:00 | 👍 0    🔁 0    💬 0    📌 0

I can confirm, data downloads for bigrquery are now blazingly fast 🔥.

#RStats

10.09.2025 14:50 | 👍 6    🔁 0    💬 4    📌 0

I've been excitedly waiting for this feature to be released for the #rstats bigrquery package. Now it's finally here! 🏎️💨

09.09.2025 20:44 | 👍 7    🔁 2    💬 1    📌 1

Just started seeing this today. Thanks for sharing the fix!

26.08.2025 18:31 | 👍 2    🔁 0    💬 0    📌 0
Notes: Using gganimate to animate plots – Collin K. Berke, Ph.D. Learning more about data visualization animation from the 'Getting Started' vignette

🔗 Links
📝 Notes: www.collinberke.com/til/posts/20...
📦 {gganimate}: gganimate.com
📦 {cfbfastR}: cfbfastr.sportsdataverse.org

26.08.2025 04:04 | 👍 0    🔁 0    💬 0    📌 0
Video thumbnail

I recently enjoyed using the {gganimate} #RStats package for #DataVis animations. So, I explored it further and drafted some notes. The result was some example animations using the Palmer `penguins` dataset and some B1G QB passing data from {cfbfastR}.

I was pleased with the outcome. Links 👇

26.08.2025 04:04 | 👍 4    🔁 4    💬 1    📌 0
Preview
Document Listings – Quarto

TIL: #quarto allows you to easily customize the fields used within listing pages: quarto.org/docs/website...

I've been wanting to simplify and make the listings pages on my personal site a little more lightweight: www.collinberke.com

This will be the focus of my weekend 🧑‍💻

#RStats #dataBS

22.08.2025 17:27 | 👍 7    🔁 3    💬 0    📌 0

@collinberke is following 20 prominent accounts