
Tom Kempton

@tomkempton.bsky.social

Pure mathematician working in Ergodic Theory, Fractal Geometry, and (recently) Large Language Models. Senior Lecturer (= Associate Professor) at the University of Manchester.

125 Followers  |  947 Following  |  14 Posts  |  Joined: 16.11.2024

Latest posts by tomkempton.bsky.social on Bluesky

Haven't seen this, and it sounds interesting. Could you post a link to something using it? Thanks!

27.03.2025 18:29  |  👍 1    🔁 0    💬 1    📌 0

Thanks, I'll take a look!

12.02.2025 16:06  |  👍 2    🔁 0    💬 1    📌 0

Thanks for the reply! What I meant by confidence here (possibly the wrong word) isn't how concentrated the output prob vector is, but how close we think the output prob is to the true next token distribution (if such a thing existed...).

12.02.2025 11:48  |  👍 1    🔁 0    💬 1    📌 0

I'm not sure I really believe that there's no information to be gleaned though. Maybe one needs to think more about training dynamics...

12.02.2025 11:44  |  👍 1    🔁 0    💬 1    📌 0

So one answer to my question, which I hadn't thought about until your reply, is that while softmax is not injective on R^|V|, it is injective when restricted to the column space of the output embedding matrix, so there's nothing to think about here.

12.02.2025 11:44  |  👍 0    🔁 0    💬 1    📌 0

I'd guess the all-ones vector shouldn't be in this column space; otherwise there's a wasted dimension which doesn't make it to the output (although it would be interesting to see whether, if you included it, it contained interesting info).

12.02.2025 11:44  |  👍 0    🔁 0    💬 1    📌 0
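To record the claim in the two posts above precisely, here is a short sketch; writing W_U for the output embedding matrix is notation assumed here, and 1 denotes the all-ones vector mentioned further down the thread.

```latex
% Softmax is invariant exactly under shifts along the all-ones direction:
\[
  \operatorname{softmax}(v) = \operatorname{softmax}(w)
  \iff v - w \in \operatorname{span}\{\mathbf{1}\}.
\]
% Hence, restricted to the subspace col(W_U) of R^{|V|}, softmax is
% injective precisely when no nontrivial shift of this kind stays inside it:
\[
  \operatorname{softmax}\big|_{\operatorname{col}(W_U)} \text{ is injective}
  \iff \operatorname{col}(W_U) \cap \operatorname{span}\{\mathbf{1}\} = \{0\}
  \iff \mathbf{1} \notin \operatorname{col}(W_U).
\]
```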

Presumably this is well studied. Could anyone point me in the direction of references?

12.02.2025 08:29  |  👍 0    🔁 0    💬 0    📌 0

Let's call a logits vector 'large' if the denominator in the softmax (the sum of exponentials) is large. Might we guess that large logits vectors correspond to confident situations, where the model is satisfied with the possible choices of next token (either many good options, or just one option that looks great)?

12.02.2025 08:29  |  👍 0    🔁 0    💬 2    📌 0
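One concrete way to quantify 'large' (a sketch; the helper names and example numbers are mine, not from the thread): the denominator is the sum of exp(logit_i) over the vocabulary, usually tracked on the log scale as the log-sum-exp of the logits.

```python
# Sketch: measure the "size" of a logits vector via its softmax denominator,
# sum_i exp(z_i), tracked on the log scale (log-sum-exp) for stability.
# The example vectors are made up purely for illustration.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # subtract the max for numerical stability
    return e / e.sum()

def log_denominator(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())   # log of sum_i exp(z_i)

examples = {
    "one great option":  np.array([10.0, 0.0, 0.0, 0.0]),
    "many good options": np.array([8.0, 8.0, 8.0, 8.0]),
    "no strong opinion": np.array([0.0, 0.0, 0.0, 0.0]),
}

for name, z in examples.items():
    print(f"{name:17s}  log-denominator = {log_denominator(z):5.2f}  "
          f"probs = {np.round(softmax(z), 3)}")
```

Note that the second and third vectors give exactly the same output distribution but very different denominators, which is the sense in which the logits might carry information beyond the probabilities.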

Since softmax is not injective, many different logits vectors output the same probability distribution. (Precisely, v and w output the same distribution if and only if they differ by a constant multiple of the all-ones vector.) Can we infer anything from the logits vector beyond the probability distribution it outputs?

12.02.2025 08:29  |  👍 0    🔁 0    💬 1    📌 0
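For concreteness, the non-injectivity in a few lines of numpy (a sketch; the numbers are made up):

```python
# Sketch: adding a constant multiple of the all-ones vector to the logits
# leaves the softmax output unchanged, so the output distribution only
# determines the logits up to such a shift.  Numbers are made up.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # stable softmax
    return e / e.sum()

v = np.array([2.0, -1.0, 0.5, 3.0])
w = v + 7.3 * np.ones_like(v)          # shift along the all-ones direction

print(np.allclose(softmax(v), softmax(w)))   # True: identical distributions
```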

Is it just that we initialise the network with small weights and so our prior is that this should persist?

Tips or links would be very welcome!

31.01.2025 08:43  |  👍 0    🔁 0    💬 1    📌 0

Theoretically, the later (skipped) layers could permute the coordinates coming out of the earlier layers, or multiply all the activations by -1. So I don't see any reason to expect training a language model to result in a model where naively applying the output embedding to earlier layers is a sensible thing to do.

31.01.2025 08:43  |  👍 0    🔁 0    💬 1    📌 0
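For reference, this is what the naive construction looks like in code: a minimal logit-lens style sketch, assuming the GPT-2 checkpoint from Hugging Face transformers (the attribute names transformer.ln_f and lm_head are specific to that implementation, and the prompt is made up):

```python
# Sketch of early exit ("logit lens") on GPT-2: take the hidden state after an
# intermediate layer, apply the final layer norm, and re-use the output
# embedding (lm_head) to get logits.  Specific to the Hugging Face GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer = 6                                 # exit after layer 6 of 12
h = out.hidden_states[layer][0, -1]       # hidden state at the final position
h = model.transformer.ln_f(h)             # final layer norm before decoding
early_logits = model.lm_head(h)           # naively apply the output embedding

print(tok.decode(early_logits.argmax().item()))
```

This construction is usually referred to as the 'logit lens'; whether the early logits are meaningful at a given layer is exactly the question raised in these posts.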

Can anyone point me to a reference saying early exit from a neural network is a reasonable thing to do?

As I understand it, early exit (from, say, a language model) involves taking the output from some early layer and applying the output embedding.

31.01.2025 08:43  |  👍 0    🔁 0    💬 1    📌 0
Post image

I'm sure it's been asked a thousand times, but what's everyone's favourite method of making lists of articles they want to read?

27.11.2024 11:34  |  👍 4    🔁 0    💬 3    📌 1

Today's question from the four-year-old: if all of the zookeepers in the world suddenly died, would the farmers look after the zoo animals, or would that be the job of the vets? Had to admit I didn't know the answer...

23.11.2024 15:38  |  👍 0    🔁 0    💬 0    📌 0
