@tomkempton.bsky.social
Pure mathematician working in Ergodic Theory, Fractal Geometry, and (recently) Large Language Models. Senior Lecturer (= Associate Professor) at the University of Manchester.

Haven't seen this and it sounds interesting; could you post a link to something using it? Thanks!
27.03.2025 18:29

Thanks, I'll take a look!
12.02.2025 16:06

Thanks for the reply! What I meant by 'confidence' here (possibly the wrong word) isn't how concentrated the output probability vector is, but how close we think the output probabilities are to the true next-token distribution (if such a thing existed...).
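(As an aside on the distinction: a minimal numpy sketch with made-up numbers, where entropy measures how concentrated the output is and KL divergence to a hypothetical 'true' distribution measures the kind of closeness meant above.)

```python
import numpy as np

def entropy(p):
    """How concentrated the distribution is (low = peaked)."""
    return -np.sum(p * np.log(p))

def kl(p, q):
    """How far the model's output p is from a reference distribution q."""
    return np.sum(p * np.log(p / q))

# Hypothetical next-token distributions over a 4-word vocabulary.
true_dist = np.array([0.70, 0.20, 0.05, 0.05])  # the (unknowable) 'true' distribution
peaked    = np.array([0.97, 0.01, 0.01, 0.01])  # very concentrated, but far from true_dist
spread    = np.array([0.65, 0.25, 0.05, 0.05])  # less concentrated, but close to true_dist

print(entropy(peaked), kl(peaked, true_dist))  # low entropy, larger KL
print(entropy(spread), kl(spread, true_dist))  # higher entropy, small KL
```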
12.02.2025 11:48

I'm not sure I really believe that there's no information to be gleaned though. Maybe one needs to think more about training dynamics...
12.02.2025 11:44

So one answer to my question, which I'd not thought about until your answer, is that, while softmax is not injective on R^|V|, it is injective when you restrict it to the column space of the output embedding matrix, so there's nothing to think about here.
12.02.2025 11:44

I'd guess X shouldn't be in this column space, otherwise there's a wasted dimension which doesn't make it to the output (although it would be interesting to see whether, if you included it, it contained interesting info).
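(Since the direction softmax ignores is the all-ones vector, injectivity on the column space comes down to whether that vector lies in it. A quick numerical check one could run, with a random matrix standing in for the real output embedding:)

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                      # toy vocabulary size and hidden dimension
W_U = rng.normal(size=(V, d))     # stand-in for the output (un)embedding matrix
ones = np.ones(V)

# Least-squares projection of the all-ones vector onto the column space of W_U.
coeffs, *_ = np.linalg.lstsq(W_U, ones, rcond=None)
residual = ones - W_U @ coeffs

# Residual (numerically) zero: the all-ones vector lies in the column space,
# so softmax is not injective there. Large residual: it does not.
print(np.linalg.norm(residual))
```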
12.02.2025 11:44

Presumably this is well studied, could anyone point me in the direction of references?
12.02.2025 08:29

Let's call a logits vector 'large' if the denominator in the softmax (the sum of exponentiated logits) is large. Might we guess that large logits vectors correspond to confident situations, where the model is satisfied with the possible choices of next token (either many good options, or just one option that looks great)?
12.02.2025 08:29

Since softmax is not injective, many different logits vectors output the same probability distribution. (Precisely, v and w output the same distribution if and only if they differ by a constant multiple of the all-ones vector.) Can we infer anything from the logits vector beyond the probability distribution it outputs?
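(A minimal numerical illustration of both points in this thread: shifting the logits by a multiple of the all-ones vector leaves the softmax output unchanged, but it does change the 'size' of the logits, e.g. the softmax denominator.)

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtracting the max is the usual numerically stable implementation
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1, -0.5])
w = v + 3.0 * np.ones_like(v)   # differs from v by a constant multiple of the all-ones vector

print(np.allclose(softmax(v), softmax(w)))  # True: identical probability distributions
print(np.exp(v).sum(), np.exp(w).sum())     # but the raw denominators differ by a factor of e^3
```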
12.02.2025 08:29

Is it just that we initialise the network with small weights and so our prior is that this should persist?
Tips or links would be very welcome!
Theoretically, the later (skipped) layers could permute the hidden dimensions coming out of the earlier layers, or multiply all the activations by -1. So I don't see any reason to expect training a language model to result in a model where naively applying the output embedding to earlier layers is a sensible thing to do.
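(A linear toy of the permutation point: two parameterisations that compute the same end-to-end function, while naive early exit from the hidden layer gives different answers. Purely illustrative, with random matrices in place of trained weights.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 5, 10
x = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # two 'layers'
W_out = rng.normal(size=(V, d))                             # stand-in for the output embedding

P = np.eye(d)[rng.permutation(d)]   # a permutation of the hidden dimensions

# Original model versus an equivalent one that permutes the hidden space
# after layer 1 and undoes the permutation inside layer 2.
h = W1 @ x
h_perm = (P @ W1) @ x
y = W_out @ (W2 @ h)
y_perm = W_out @ ((W2 @ P.T) @ h_perm)

print(np.allclose(y, y_perm))                   # True: same end-to-end function
print(np.allclose(W_out @ h, W_out @ h_perm))   # False in general: naive early exit disagrees
```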
31.01.2025 08:43

Can anyone point me to a reference saying early exit from a neural network is a reasonable thing to do?
As I understand it, early exit (from, say, a language model) involves taking the output from some early layer and applying the output embedding.
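(For concreteness, a sketch of the naive version of this using GPT-2 through the Hugging Face transformers library; the prompt, the choice of layers, and applying the final layer norm before the output embedding are all illustrative assumptions.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[k] is the residual stream after k transformer blocks
# (hidden_states[0] is the embedding output).
for layer in (3, 6, 9):
    h = out.hidden_states[layer][0, -1]                 # last-token activation at this depth
    logits = model.lm_head(model.transformer.ln_f(h))   # 'early exit': apply the output embedding
    top = logits.topk(3).indices
    print(layer, [tokenizer.decode(t.item()) for t in top])
```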
I'm sure it's been asked a thousand times, but what's everyone's favourite method of making lists of articles they want to read?
27.11.2024 11:34

Today's question from the four year old: if all of the zookeepers in the world suddenly died, would the farmers look after the zoo animals or would that be the job of the vets? Had to admit I didn't know the answer...
23.11.2024 15:38