I agree with your points 2-4, but I've observed on my own website that the crawlers that don't respect robots.txt won't, and the crawlers that do respect it will.
Of course it's voluntary, but if entities like OpenAI say they will respect it then presumably they really will.
Truly. But this was just something I downloaded off the net and wanted to repurpose for my own needs, so rearchitecting it to use Grid or Flex was way more effort than I wished to put in.
I've used a self-hosted Llama 3 to answer some questions about CSS and centering a div that I was having trouble with (I'm not a web dev by profession, nor am I aspiring to be one). You have to prod at it a few times to get it to tell you something useful, which it ultimately did.
That's about as far as I can work with it: asking and re-asking it very common questions that have been discussed and answered 700 times over (but the answer to which is unknown to me, specifically) in the hopes of getting something actually useful. So to that end, of course it can give me an example implementation of common leetcode questions in C, but it cannot reliably do something that requires a bit more originality.
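For anyone fighting the same centering battle, the answer to that sort of question is usually the standard Flexbox pattern. This is just a generic sketch (the class names are placeholders, not from any real site):

```css
/* Center a child horizontally and vertically inside its container.
   Class names are placeholders; adapt to your own markup. */
.parent {
  display: flex;
  justify-content: center; /* horizontal centering */
  align-items: center;     /* vertical centering */
  min-height: 100vh;       /* the container needs a height to center within */
}

/* Grid alternative: .parent { display: grid; place-items: center; } */
```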
breaking my back to let people know how much i'll publicly defend nazis, specifically, which is suddenly very important to me.
A lot of dbzer0 users are cool people
...while babies and young children are rising up like never before!
support the state-sanctioned genocide or be unemployed. is that deal being made right now?
a lot of people are dying b/c of how much he sucks, though. that is true :(
Whoa, I remember this site and article, both
I used to sit and monitor my server access logs. You can tell by the access patterns. Many of the well-behaved bots announce themselves in their user agents, so you can see when they're on your site. I could see them crawl the main body of my website but not go to a subdomain, which is clearly linked from the homepage but disallowed in my robots.txt.
On the other hand, spammy bots that are trying to attack you will often have access patterns that probe your website for common configurations of CMSes like WordPress. They don't tend to crawl.
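For anyone who hasn't poked at it, robots.txt is just a plain-text file at the site root that compliant crawlers fetch before crawling; nothing in it is enforced. A minimal sketch (the paths and the blocked user agent are only for illustration):

```
# Served at https://example.com/robots.txt (placeholder hostname).
# Compliant crawlers fetch this before crawling; it is purely voluntary.

User-agent: *
Disallow: /private/

# Block one specific crawler by its announced user agent, e.g. OpenAI's GPTBot.
User-agent: GPTBot
Disallow: /
```

One wrinkle worth knowing: robots.txt applies only to the host it is served from, so a subdomain needs its own copy at its own root.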
Google also provides a tool to test robots.txt, for example.
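You can also check rules locally; Python's standard library ships a robots.txt parser. A quick sketch (the URL and user agents are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse a site's robots.txt (placeholder hostname).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a given URL under the parsed rules,
# e.g. False for a crawler that is fully disallowed, True otherwise.
print(rp.can_fetch("GPTBot", "https://example.com/private/page.html"))
print(rp.can_fetch("SomeCrawler", "https://example.com/index.html"))
```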