Probably not: Spa is always great for the overtaker due to the long straights into and after Eau Rouge. McLaren is also running very high-downforce wings, which hurts their performance on the straights (but helps during, e.g., rainy qualifying sessions).
Realistically, if there is any chance of overtaking at all, Verstappen can overtake simply due to having less drag on the straights.
I hope heads roll at Haas for that disaster. It's one thing to make wrong choices in high-stakes scenarios, but they sent out their drivers too late twice within 24 hours. That's just an unforced, inexplicable blunder. If I were Gene Haas I'd be furious: you spend 100 MILLION dollars to develop a car and they don't even manage to get it around the track once.
This is doubly bad considering that sprints are one of their most reliable opportunities to score points, since their tire wear doesn't hurt them as much over the shorter distance.
They might as well pack up and go home now to conserve their parts, since at this point they're not going to achieve anything anyway.
It's $\mathbb{X}$, or Unicode 𝕏 (U+1D54F).
Maybe he really likes metric spaces??
I mean, it's very easy to side with SAG-AFTRA and the WGA considering he literally has no other option (all the major studios are unionized).
They will keep it open source, just tremendously complicated and expensive to comply with.
In general, if you see a group proposing regulations, it's usually to cement their own position: e.g. OpenAI is a front-runner in ML for the masses, but doesn't really have a technical edge over anyone else, so they run to Congress with "please regulate us".
Regulatory compliance is always expensive and difficult, which means it favors those who already have money and working systems in place.
There are so many ways this can be broken, intentionally or unintentionally. It's also a great way to identify, say, government critics in order to shut them down (if you are Chinese and everything you produce is uniquely tagged to you, would you write about Tiananmen Square?), or to build monopolies on (dis)information.
This is not literally forcing everyone to get a license to produce creative or factual work, but it's very close, since you can easily discriminate against any creative or factual sources you find unwanted.
In short, even if this is an absolutely flawless, perfect implementation of what they want to do, it will have catastrophic consequences.
The "adequate covering" of our distribution p
is also pretty self-explanatory: We don't need to see the statement "elephants are big" a thousand times to learn it, but we do need to see it at least once:
Think of the p
distribution as e.g. defining a function on the real numbers. We want to learn that function using a finite amount of samples. It now makes sense to place our samples at interesting points (e.g. where the function changes direction), rather than just randomly throwing billions of points against the problem.
That means that even if our estimator is bad (i.e. it can barely distinguish real from fake data), it is still better than purely random sampling (e.g. you can say "let's generate 100 samples of law, 100 samples of math, 100 samples of XYZ, ..." rather than having one big mush where you just hope that everything appears).
That makes a few assumptions: the estimator is better than random guessing, the estimator has no statistical bias (e.g. it didn't learn rules like "add all sentences that start with an A", since that would shift our distribution), and some other things that are too intricate to explain here.
Importantly: even if your estimator is bad, it is better than not having one. You can also manually tune it towards being slightly biased, either to reduce variance (e.g. filter out all HTML code), or to reduce the impact of certain real-world effects (like the fact that most content on the internet is English: you may want to downweight it to get a more multilingual model).
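To make that concrete, here is a minimal sketch of what such estimator-guided filtering could look like (everything here is made up for illustration: the scores, the heuristic, and the downweighting factor are hypothetical, not from any real pipeline):

import random

def quality_score(doc):
    # Hypothetical stand-in for a learned quality estimator;
    # a real one would be a trained classifier.
    return 0.1 if "<html" in doc.lower() else 0.9

def keep(doc, lang, english_downweight=0.5):
    p = quality_score(doc)
    if lang == "en":
        # Manual bias: downweight the over-represented language.
        p *= english_downweight
    return random.random() < p

corpus = [("<html>spam</html>", "en"),
          ("Elephants are big.", "en"),
          ("Elefanten sind gross.", "de")]
filtered = [doc for doc, lang in corpus if keep(doc, lang)]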
However, you have to note here that these are LANGUAGE MODELS. They are not everything-models.
These models don't aim for factual accuracy, nor do they have any way of verifying it: That's simply not the purview of these systems.
People use them as everything-models because, empirically, there's a lot more true content than nonsense in those scrapes, and because language models have to know something about the world to e.g. resolve ambiguity, but these are side effects of the model's training as a language model.
If you have a model that produces completely realistic (but semantically wrong) language, that's still good data for a language model.
"Good data" for a language model does not have to be "true data", since these models don't care about truth: that's not their objective!
They just complete sentences by predicting the next token, which is independent of factuality.
There are people working on making these models more factual (same idea: you bias your estimator towards things that are more likely to be true, e.g. by boosting reliable sources such as Wikipedia rather than training on uniformly weighted web scrapes), but to do that you need much more oversight over your data, for which you need more efficient models, for which you need better distributions, for which you need better estimators (though in that case they would be "factuality estimators").
In general, though, the same "better than nothing" sentiment applies: if you have a sampling strategy that is not completely wrong, you can still beat completely random sampling. If your estimator is good, you can beat it substantially (and LLMs are pretty good at almost everything, which means you will get pretty good samples if you just sample according to the probability with which the LLM says "this data is good").
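As a toy sketch of that last idea (the documents and scores below are invented for illustration), sampling training data proportional to an LLM-assigned quality score is essentially a one-liner:

import random

docs = ["legal text ...", "math proof ...", "random mush ..."]
p_good = [0.9, 0.8, 0.2]  # hypothetical "this data is good" scores from an LLM

# Draw a training batch proportional to estimated quality, not uniformly.
batch = random.choices(docs, weights=p_good, k=2)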
To actually make sure that the output of these models is true, you need very different systems that actually model facts rather than just modelling language. Another way is to remove the accuracy bottleneck of machine learning models altogether (i.e. you build a model that may be bad, but can never give you a wrong answer):
One example would be vector-search engines that, like classic search engines, retrieve information from a corpus based on the similarity predicted by a machine learning model. Since you retrieve from a fixed corpus (like Wikipedia), the model will never give you wrong information (assuming the corpus is not wrong)! A bad model may just fail to find the correct Wikipedia entry to present to you.
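A minimal sketch of such a retrieval setup (the embedding function below is a toy stand-in; a real system would use a learned text encoder):

import numpy as np

def embed(text):
    # Toy stand-in: pseudo-random unit vectors derived from the text hash.
    # A real system would use a learned embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

corpus = ["Elephants are big.", "The Eiffel Tower is in Paris."]
vectors = np.stack([embed(d) for d in corpus])

def retrieve(query):
    scores = vectors @ embed(query)  # cosine similarity (unit vectors)
    return corpus[int(np.argmax(scores))]

Note that retrieve() can only ever return an entry that exists in the corpus: a bad embedding model returns the wrong entry, never a fabricated one.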
Yes: keep in mind that "good" here says nothing about the content of the data, but rather about how statistically interesting it is for the model.
Really, what machine learning is doing is trying to deduce a probability distribution $q$ from a sampled distribution $x \sim p(x)$.
The problem with statistical learning is that we only ever see an infinitesimally small part of the true distribution (we only have finite samples from an infinite sample space of images/language/etc.).
So what we really need to do is pick samples that adequately cover the entire distribution without being redundant, since redundancy both produces more work (you simply have more things to fit against) and can obscure the true distribution:
Let's say we have a uniform probability distribution over {1, 2, 3} (uniform means everything has the same probability, here 1/3). If we faithfully sample from this, we can learn a distribution that will also return 1, 2, and 3 with equal probability.
But let's say we have some redundancy in there (either direct duplicates or, in the case of language, near-duplicates): the empirical distribution may look like {1,1,1,2,2,3}, which makes ones seem a lot more likely than they actually are.
One way to deal with this is to just sample a lot more points: if we sample 6000 points, we will naturally get closer to the true distribution (similar to how flipping a coin twice can give you a 100% tails estimate even if the coin is actually fair; flip it more often and the estimate returns to the true probability).
Another way is to correct our observations towards what we already know to be true about our distribution (e.g. a direct 1:1 duplicate in language is presumably a copy-paste rather than a true increase in the probability of that subsequence).
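Here is the {1,1,1,2,2,3} example worked through in code (with a crude correction that simply drops exact duplicates; real text pipelines need fuzzier dedup):

from collections import Counter

samples = [1, 1, 1, 2, 2, 3]  # empirical draw with duplicates
empirical = {k: v / len(samples) for k, v in Counter(samples).items()}
# {1: 0.5, 2: 0.33, 3: 0.17}: ones look far more likely than the true 1/3

unique = sorted(set(samples))  # crude correction: drop exact duplicates
corrected = {k: 1 / len(unique) for k in unique}
# {1: 0.33, 2: 0.33, 3: 0.33}: recovers the true uniform distribution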
<continued in next comment>
That paper makes a bunch of (implicit) assumptions that make it pretty unrealistic: basically, they assume that even once we have decently working models, we would still continue to do normal "brain-off" web scraping.
In practice you can use even relatively simple models to start filtering and creating more training data:
Think of the original LLM as a huge trashcan in which you try to compress terabytes of mostly-garbage web data.
Then you use fine-tuning (like the instruction tuning used for the assistant models) to increase the likelihood of getting non-trash out of the model (or to accurately classify trash vs. non-trash).
In general this will produce a dataset of significantly higher quality, simply because you got rid of all the low-quality stuff.
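The classification half of that loop can be as simple as this sketch (the classifier here is a hypothetical placeholder; in reality it would be the fine-tuned model scoring documents):

def trash_score(doc):
    # Hypothetical stand-in for a fine-tuned trash/non-trash classifier.
    return 0.9 if "CLICK HERE" in doc else 0.1

raw_scrape = ["A proof of the triangle inequality ...",
              "CLICK HERE for one weird trick"]
clean_data = [d for d in raw_scrape if trash_score(d) < 0.5]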
This is not even a theoretical construction: Phi-1 (https://arxiv.org/abs/2306.11644) does exactly that to train a state-of-the-art language model on a tiny amount of high-quality data (the model is also tiny: only about half a percent the size of GPT-3).
Previously, TinyStories (https://arxiv.org/abs/2305.07759) showed something similar: you can build high-quality models with very little data, if you have good data (in the case of TinyStories, they generate simple stories to train small language models).
In general, LLM people seem to be rediscovering that good data is actually good, and that you don't really need these "shotgun approach" web-scrape datasets.
I think you also have to keep in mind the position that de Vries and Red Bull are in:
- Red Bull is looking for a second Verstappen-level driver. That has always been the case, not only for Red Bull but for all tier-1 teams: their aspirations are championships, not points or even podiums.
- De Vries is a 28-year-old rookie. That's usually the age at which drivers retire or start leaning on their superior experience to make up for the loss in reaction speed and overall pace. The problem is that De Vries has no experience, while being older than Verstappen by close to three years. The fact that he got to race at all is a miracle: he would have to beat Tsunoda every week by quite a margin to become relevant for Red Bull. And if he doesn't become relevant for Red Bull, then why have him at AlphaTauri?
Meanwhile they have a young driver in Tsunoda who exists in a limbo because he has nothing to be compared against: he could be the fastest driver on the planet in a trash car, or he could be underdelivering without anyone noticing, due to that lack of comparison.
This is bad for two reasons:
- you don't know whether Tsunoda is an option for Red Bull
- you have no idea how good AlphaTauri is overall, which is doubly bad considering that they want to make major changes to how AlphaTauri operates.
On the other hand, you have a perfectly good Ricciardo sitting on his hands who performed really well at Silverstone. Realistically, you aren't going to lose anything by having Ricciardo drive the rest of the season instead of de Vries, but you gain the potential upside of more context on the quality of Tsunoda and the team, which you wouldn't get otherwise.
In general I'm more surprised that they ever gave De Vries a chance, considering his age and the context of his big achievements:
In Formula 2 his stiffest competitor was Nicholas Latifi (he won with 266 points to Latifi's 214) in what can be described as a dud year, after the majority of the current F1 mainstays had already graduated (he also needed three years to win F2, which is never a good sign).
If you have ever seen a Formula E race, you will notice that it is quite a chaotic crash-fest with very weird rules and other nonsense. Just not crashing and not driving too quickly can get you really far, by surviving the carbon-fiber mayhem and fuel-conservation issues.
To put it into perspective, here are De Vries's race results in the year he won Formula E: [1st, 9th, retired, retired, 1st, 16th, retired, 9th, retired, 13th, 18th, 2nd, 2nd, 22nd, 8th]. Even if we ignore all the DNFs, that is a mean finishing position of about 9th!
In short, there's a reason why Mercedes never even tried to get him an F1 seat: he's not a bad driver, but being "not a bad driver" is insufficient for a top team like Mercedes or Red Bull. There's little incentive to put him in any car, even less so nowadays considering his age.
Everything using the ActivityPub standard has open likes (see https://www.w3.org/TR/2018/REC-activitypub-20180123/ for the standard), and logically it makes sense to do this to allow for verification of "likes":
If you did not do that, a malicious instance could much more easily shove a bunch of fake likes onto another instance's post, whereas if every like has an author, it is much easier to moderate likes.
Effectively, ActivityPub treats all interactions like comments, with a "from" and a "to" field just like email has (imagine if you could send messages without an originator: email would have unusable levels of spam and harassment).
Specifically, here is an example of a simple activity:
POST /outbox/ HTTP/1.1
Host: dustycloud.org
Authorization: Bearer XXXXXXXXXXX
Content-Type: application/ld+json; profile="https://www.w3.org/ns/activitystreams"
{
  "@context": ["https://www.w3.org/ns/activitystreams",
               {"@language": "en"}],
  "type": "Like",
  "actor": "https://dustycloud.org/chris/",
  "name": "Chris liked 'Minimal ActivityPub update client'",
  "object": "https://rhiaro.co.uk/2016/05/minimal-activitypub",
  "to": ["https://rhiaro.co.uk/#amy",
         "https://dustycloud.org/followers",
         "https://rhiaro.co.uk/followers/"],
  "cc": "https://e14n.com/evan"
}
As you can see, this has a very email-like structure with a sender, a receiver, and content. The difference is mostly that you also publish a "type" that allows for more complex interactions (e.g. if the type is a comment, lemmy knows to put it with the comments; if the type is "Like", it knows to add it to the likes, etc.).
The actual protocol is a little more complex, but if you replace "ActivityPub" with "typed email" you will be correct 99% of the time.
The different services, like lemmy, kbin, mastodon, or peertube, are then just specific instantiations of this standard. E.g. a "like" might have slightly different effects on different services (hence also the confusion around "boosting" vs. "liking" on kbin).
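As a rough sketch (this is not any service's actual code, just an illustration of the dispatch-on-"type" idea):

import json

def handle_activity(raw):
    activity = json.loads(raw)
    # Dispatch on the ActivityPub "type" field; a real service implements
    # many more types and performs actual side effects.
    if activity["type"] == "Like":
        return "add a like from " + activity["actor"]
    if activity["type"] == "Create":
        return "store new content from " + activity["actor"]
    return "ignore unsupported type"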
At this point it's on you if you believe this grifter.
It's a different paper (e.g. https://www.nature.com/articles/s41586-022-05294-9) from a different researcher (specifically Ranga Dias). It is not connected to the recent non-peer-reviewed https://arxiv.org/abs/2307.12008.