Reading: en.wikipedia.org/wiki/Feedfor...
TIL: MLP is but one kind of feedforward network, specifically the kind with fully connected layers. Other kinds of feedforward network include CNNs.
Reading: OpenAI Spinning Up Part 3
TIL: The policy gradient used to update policy takes the general form of an expected weighted sum over the trajectory. The main summation term is the gradient of log-likelihood of policy actions. The summation weights depend on the policy optimization approach.
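In symbols (my paraphrase of Spinning Up's notation), with the weight Phi_t varying by approach (total return, reward-to-go, advantage, ...):

```latex
\nabla_\theta J(\pi_\theta)
  = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Phi_t \right]
```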
[Image: flow chart comparing SARSA (top) and Q-learning (bottom). On-policy SARSA updates Q-values with the next action it actually takes; Q-learning's next action is instead sampled from a behavior policy, while a separate target policy picks the best action used for the Q-value update.]
Reading: RL materials (David Silver RL slides, Spinning Up)
TIL: In on-policy learning, the action used in the update is the action actually taken next (target policy = behavior policy). In off-policy learning, the action used in the update is not necessarily the next action taken, which is sampled from a separate behavior policy.
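Tabular sketch of the two update rules as I understand them (names and hyperparameters are mine, for illustration):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: uses a_next, the action the behavior policy actually takes next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: a greedy target policy picks the best next action,
    independent of what the behavior policy actually does next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```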
Reading: Build a Large Language Model from Scratch, Chapter 3
TIL: Attention mechanisms were initially developed to augment the RNN EncDec architecture, addressing the limitation that the RNN Decoder could not access the RNN Encoder's earlier hidden states over the input sequence (only its final one).
Reading: Build a Large Language Model from Scratch, Chapter 2
TIL: While the OG transformer model used a pre-defined (sinusoidal) positional encoding that remained fixed during training, early OpenAI GPT models used absolute positional embeddings that were optimized during training.
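A minimal PyTorch contrast of the two (sizes and names are my choices):

```python
import torch

d_model, max_len = 64, 128

# Fixed sinusoidal encoding (original transformer): computed once, never trained.
pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
i = torch.arange(0, d_model, 2, dtype=torch.float32)
angle = pos / (10000 ** (i / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(angle)
pe[:, 1::2] = torch.cos(angle)

# Learned absolute positional embedding (early GPT style): a trainable lookup table.
pos_emb = torch.nn.Embedding(max_len, d_model)   # weights updated by the optimizer

tok_emb = torch.randn(max_len, d_model)          # stand-in for token embeddings
x_fixed = tok_emb + pe
x_learned = tok_emb + pos_emb(torch.arange(max_len))
```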
Reading: Build a Large Language Model from Scratch, Chapter 2
TIL: Raw text is first split into word and special-character tokens. A tokenizer then uses a vocabulary to map tokens to integer IDs and back. Special context tokens (`<|endoftext|>`) are included in the vocabulary.
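Toy version of the idea (whitespace-only splitting; the book splits on punctuation too, and `SimpleTokenizer` is my name):

```python
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_id = vocab                             # token -> integer ID
        self.id_to_str = {i: s for s, i in vocab.items()}  # integer ID -> token

    def encode(self, text):
        return [self.str_to_id.get(tok, self.str_to_id["<|unk|>"])
                for tok in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)

# Special context tokens live in the vocabulary alongside ordinary tokens.
tokens = sorted({"hello", "world", "!"}) + ["<|endoftext|>", "<|unk|>"]
vocab = {tok: i for i, tok in enumerate(tokens)}

tok = SimpleTokenizer(vocab)
ids = tok.encode("hello world ! <|endoftext|>")
assert tok.decode(ids) == "hello world ! <|endoftext|>"
```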
Reading: Build a Large Language Model from Scratch, Chapter 1
TIL: "when we say language models "understand," we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension."
The universal approximation theorem states that a feedforward network with a single hidden layer (given enough units) can approximate any continuous function to arbitrary accuracy. While that means deeper neural nets are not needed in theory, later experiments showed practical benefits of increasing depth vs increasing hidden size.
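Stated a bit more formally (my paraphrase; the theorem has many variants), with K a compact subset of R^n and sigma a fixed non-polynomial activation:

```latex
\forall f \in C(K),\ \forall \varepsilon > 0:\ \exists N,\ \{w_i, b_i, c_i\} \ \text{s.t.}\quad
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} c_i\, \sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon
```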
Reading: FastAI Book github.com/fastai/fastb...
Section: 04_mnist_basics
TIL: Up until the 1990s, ML research usually involved neural nets with only one nonlinear layer, varying width rather than depth. This may have been caused by a misunderstanding of the universal approximation theorem.
Reading: FastAI Book github.com/fastai/fastb...
Section: 04_mnist_basics
TIL: In classification, using accuracy as a loss function is not a good idea: small weight updates rarely flip any predictions, so accuracy is flat almost everywhere, the gradient is zero, and there is "no learning."
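Quick way to see it in PyTorch (toy example, my numbers):

```python
import torch

logits = torch.tensor([0.8, -1.2, 2.0], requires_grad=True)
labels = torch.tensor([1.0, 0.0, 1.0])

# Accuracy thresholds the outputs, so it is piecewise constant in the weights;
# the hard comparison below carries no gradient at all.
preds = (torch.sigmoid(logits) > 0.5).float()
# accuracy = (preds == labels).float().mean()  # .backward() has no grad path here

# A smooth surrogate (binary cross-entropy) gives a usable gradient everywhere.
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
print(logits.grad)  # nonzero -> learning signal
```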
Reading: FastAI Book github.com/fastai/fastb...
Section: 04_mnist_basics
TIL: "Gradient" in ML usually refers to the **computed value** of the function's derivative given input values, rather than the function's derivative expression per math/physics convention.
Reading: FastAI Book github.com/fastai/fastb...
Section: 04_mnist_basics
TIL: Arthur Samuel describes machine learning as "a mechanism for altering the weight assignment so as to maximize the performance." Kinda cool it doesn't rely on very formal math language.
so this is what being nerd-sniped feels like...
"ups the contrast" - small absolute differences are even smaller after squaring, while differences close to 1 remain close to 1.
Reading: FastAI Book [https://github.com/fastai/fastbook/]
Section: 04_mnist_basics
TIL: After ensuring differences between two tensors are between 0 and 1, the squared error "ups the contrast" of those differences relative to the absolute error. This has implications for choosing between the L1 and L2 norms.
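Tiny demo of the effect (my numbers):

```python
import torch

diffs = torch.tensor([0.1, 0.5, 0.9])  # differences already in [0, 1]
print(diffs.abs())   # tensor([0.1000, 0.5000, 0.9000])  -- L1-style
print(diffs ** 2)    # tensor([0.0100, 0.2500, 0.8100])  -- L2-style
# Small differences shrink a lot after squaring while ones near 1 barely move,
# so squared error exaggerates the gap between small and large errors.
```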
Reading: Reality Check Slide Deck bsky.app/profile/kyun...
TIL: Best practices to consider for evaluating models: 1) test with unseen tasks rather than data instances (continual learning), 2) use metrics informed by domain, not just ML, 3) test on tasks downstream from basic predictions (esp LLMs).
This input data approach falls under the general body of data augmentation methods.
Reading: FastAI Book [https://github.com/fastai/fastbook/]
Section: 02_production.ipynb
TIL: For an object detection model, training images can be resized via cropping, squishing, or padding, none of which are ideal. So the "best" solution is to crop **randomly**, keeping at least a minimum fraction of each image.
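In fastai terms that's roughly `RandomResizedCrop` with a minimum scale (sketch; the folder path and parameter values are mine):

```python
from fastai.vision.all import *

path = Path('images')  # hypothetical dataset folder

# Each epoch, crop a random region covering at least 50% of the image,
# then resize it to 224x224, so the model sees a different crop every time.
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2,
    item_tfms=RandomResizedCrop(224, min_scale=0.5))
```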
Reading: FastAI Book github.com/fastai/fastb...
Section: 01_intro.ipynb
TIL: "Computers, as any programmer will tell you, are giant morons, not giant brains." - Arthur Samuel, "Artificial Intelligence: A Frontier of Automation" doi.org/10.1177/0002...
Reading: Deep Learning [https://www.deeplearningbook.org]
Section: Chapter 3 - Probability and Information Theory
TIL: While KL-Divergence is sometimes referred to as a "distance" between distributions P and Q, this is not the best mental model since KL-divergence is asymmetric.
KL-Divergence isolates the extra information needed to encode those P-based messages as a result of wrongly assuming distribution Q.
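Quick numeric check of the asymmetry (my toy distributions):

```python
import numpy as np

P = np.array([0.5, 0.4, 0.1])
Q = np.array([0.1, 0.1, 0.8])

def kl(p, q):
    # extra nats needed to encode samples from p with a code built for q
    return np.sum(p * np.log(p / q))

print(kl(P, Q))  # ~1.15
print(kl(Q, P))  # ~1.36 -- not equal, so not a true distance
```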
Reading: Deep Learning [https://www.deeplearningbook.org]
Section: Chapter 3 - Probability and Information Theory
TIL: In information theory, cross-entropy quantifies overall information needed to encode messages with symbols sampled from distribution P when wrongly assuming distribution Q.
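In symbols (discrete case, my notation): the first term is the cost of the optimal code for P, and the KL term is the penalty for using a code built for Q instead.

```latex
H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)
```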
Reading: Deep Learning [https://www.deeplearningbook.org]
Section: Chapter 3 - Probability and Information Theory
TIL: In mixture distribution models, the component identity variable c is a kind of latent variable!
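I.e. (discrete components, my notation): c is sampled first but only x is observed, which is exactly what makes c latent.

```latex
P(\mathbf{x}) = \sum_{c} P(c)\, P(\mathbf{x} \mid c)
```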
Reading: Deep Learning [https://www.deeplearningbook.org]
Section: Chapter 3 - Probability and Information Theory
TIL: Covariance measures a *linear* relationship between two variables. So Cov(x,y)=0 does not exclude the possibility of a non-linear relationship between x and y.
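Classic example (my code): y = x^2 with x symmetric around 0 is completely determined by x, yet the covariance is ~0:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2                 # perfectly (nonlinearly) dependent on x

print(np.cov(x, y)[0, 1])  # ~0: no *linear* relationship detected
```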
Common sense reasoning about uncertainty means treating Bayesian probabilities as behaving exactly like frequentist probabilities.
Reading: Deep Learning [https://www.deeplearningbook.org]
Section: Chapter 3 - Probability and Information Theory
TIL: Frequentist probability refers to the long-run frequency of an outcome over infinitely repeatable events. Bayesian probability refers to a degree of belief about an uncertain outcome.
hello world!