This was excellent and actually something I was looking for. Thank you so much <3
In the language of classical probability theory: the models learn the probability distribution of words in language from their training data, and then approximate this distribution using their parameters and network structure.
When given a prompt, they then calculate the conditional probabilities of the next word, given the words they have already seen, and sample from that space.
It is a rather simple idea, all of the complexity comes from trying to give the high-dimensional vector operations (that it is doing to calculate conditional probabilities) a human meaning.
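The "calculate conditional probabilities, then sample" idea can be sketched in a few lines. This is a toy illustration only, not how a real model is implemented: the candidate words and scores are made up, and a real LLM would produce scores over a vocabulary of tens of thousands of tokens using its learned parameters.

```python
import math
import random

def softmax(logits):
    """Turn raw model scores into a probability distribution (sums to 1)."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(candidates, logits, rng=random.Random(0)):
    """Sample the next word from the conditional distribution
    implied by the scores the model assigns to each candidate."""
    probs = softmax(logits)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical scores after a prompt like "The cat sat on the"
candidates = ["mat", "moon", "keyboard"]
logits = [2.5, 0.1, 1.0]
print(sample_next_word(candidates, logits))
```

Because it samples rather than always picking the top word, the same prompt can yield different continuations on different runs.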
I'd like to add one more layer to this great explanation.
Usually, this kind of prediction should be made in two steps:

1. calculate the conditional probability of the next word (given the data), for all possible candidate words;
2. choose one word among these candidates.
The choice in step 2. should be determined, in principle, by two factors: (a) the probability of a candidate, and (b) also a cost or gain for making the wrong or right choice if that candidate is chosen. There's a trade-off between these two factors. For example, a candidate might have low probability, but also be a safe choice, in the sense that if it's the wrong choice no big problems arise – so it's the best choice. Or a candidate might have high probability, but terrible consequences if it were the wrong choice – so it's better to discard it in favour of something less likely but also less risky.
This is all common sense! But it's at the foundation of the theory behind this (Decision Theory).
The proper calculation of steps 1. and 2. together, according to fundamental rules (probability calculus & decision theory), would be enormously expensive. So expensive that something like ChatGPT would be impossible: we'd have to wait for centuries (just a guess: could be decades or millennia) to train it, and then to get an answer. This is why Large Language Models make several approximations, which obviously can have serious drawbacks:
- they use extremely simplified cost/gain figures – in fact, from what I gather, the researchers don't have any clear idea of what they are;
- they directly combine the simplified cost/gain figures with probabilities;
- they search for the candidate with the highest gain+probability combination, but stop as soon as they find a relatively high one – at the risk of missing the one that was actually the real maximum.
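The probability-vs-consequences trade-off described above can be made concrete with expected utility, the standard decision-theory rule. The words and all the numbers below are made up purely to illustrate the idea; real systems don't expose cost/gain figures like this.

```python
def expected_utility_choice(candidates):
    """Pick the candidate maximizing expected utility:
    p * gain_if_right + (1 - p) * loss_if_wrong."""
    best_word, best_score = None, float("-inf")
    for word, p, gain, loss in candidates:
        score = p * gain + (1 - p) * loss
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical numbers: "likely" is more probable but carries a harsh
# penalty if it turns out to be wrong; "safe" is less probable but low-risk.
candidates = [
    # (word, probability, gain if right, loss if wrong)
    ("likely", 0.6, 1.0, -5.0),   # 0.6*1.0 + 0.4*(-5.0) = -1.4
    ("safe",   0.3, 1.0, -0.5),   # 0.3*1.0 + 0.7*(-0.5) = -0.05
]
print(expected_utility_choice(candidates))  # the "safe" word wins
```

Here the less probable word wins because its downside is much smaller, which is exactly the trade-off the comment describes.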
(Sorry if this comment has a lecturing tone – it's not meant to. But I think that the theory behind these algorithms can actually be explained in very common-sense terms, without too much technobabble, as @TheChurn's comment showed.)
Superb summary!
Holy Flurking Schnitt. So . . . no one understands exactly how it works or even how it works past the first couple of abstractions.
That explains so much.
This is a really terrific explanation. The author puts some very technical concepts into accessible terms, but not so far from reality as to cloud the original concepts. Most other attempts I’ve seen at explaining LLMs or any other NN-based pop tech are either waaaay oversimplified, heavily abstracted, or are meant for a technical audience and are dry and opaque. I’m saving this for sure. Great read.
Very interesting. But also very complex.
To put it as briefly (and roughly) as possible: they are very complex probability machines, which compare the probability of each word in your sentence to choose the words to say to you.
There's also a video by Kyle Hill explaining how ChatGPT works: https://youtu.be/-4Oso9-9KTQ
However, from what I remember, the article has more info on how the tool manages to reduce confusion, and on the history of the evolution through GPT-1, 2, and 3.
But what helped me understand more easily was the video, even if it doesn't describe everything down to the tiniest detail.
interesting yet complex
Makes sense that an explainer on the technology would be an appropriate match for it lol
Very interesting read.
Awesome! I always wondered how Skynets got made.