Almost certainly this isn’t anything to do with scraping. Like with Reddit, those with a stake in Twitter stand to benefit from AI and, as far as I know, there’s no mass reposting (retweeting?) effort to something like Mastodon.
That would be trivial to block anyway, since it would be easy to identify the service accounts and source IPs of the requests. No need to impact average users.
What’s more likely is he hasn’t paid the bill for his cloud infrastructure and no longer has the capacity to serve so many users.
IMO, that’s what you get when you fire half of your staff.
I’m not so sure. A lot of businesses and people are training AI models right now, and sites like Reddit or Twitter are hugely attractive collections of user-generated content. It’s not an outrageous assumption that some of them will try to get that data for free by scraping instead of paying for API access.
I don’t think, however, that it’s that hard to differentiate an AI scraper from an actual user, since scrapers pull huge amounts of data, which the average user doesn’t. Correct me if I’m wrong. wdyt?
No, you’re correct. Service accounts can consume data far faster than a human user ever could. A smart business always implements rate limits; otherwise a simple curl loop could bankrupt them, and they could even do it to themselves in testing.
This can be fixed in many ways: not just by putting limits on credentials, but also on source addresses. If a certain address or range of addresses seems to be running multiple service accounts and pulling huge amounts of data, you can deny requests from those IPs.
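To make that concrete, here’s a minimal sketch of the kind of rate limiting being described: an in-memory token bucket keyed by whatever you want to limit (an API key or a source IP). The class name, parameters, and the example IP are all illustrative, not any real platform’s implementation.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Token-bucket rate limiter: each key (e.g. an API key or a
    source IP) may make `rate` requests per second on average,
    with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)   # tokens left per key
        self.last = defaultdict(time.monotonic)       # last-seen time per key

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[key]
        self.last[key] = now
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens[key] = min(self.capacity,
                               self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False

# Limit each client to 5 requests/second, bursting to 10.
limiter = TokenBucket(rate=5, capacity=10)
results = [limiter.allow("203.0.113.7") for _ in range(15)]
# Roughly the first 10 burst requests pass; the rest are throttled.
```

In production you’d back this with something shared like Redis rather than process memory, but the principle is the same: a scraper hammering the API burns through its bucket almost immediately, while a human browsing normally never notices the limit.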
In short, this AI angle smells like BS to save face. Musk effectively fired the SRE team who looked after critical infrastructure. It was their job to ensure service reliability, so it should not be a surprise that Twitter now has issues with service reliability.
But also, hasn’t that ship sailed already for several AI companies? They’ve already trained their models, so there’s no need to scrape again. They can use what they grabbed last time for their core training; it’s only the last couple of years/months they’re missing.
I’m a bit confused. The new season is 8, but people keep referring to it as 11. Even news articles will put 11 in the title, but will refer to it as 8 in the body of the article. Is this some kind of in-joke I missed? There definitely aren’t seasons 9 or 10…
Seasons 6 and 7 both have two parts, which IMDb counts as seasons 6/7 and 8/9, making the new season 10 on IMDb. Wikipedia and TVDB instead treat seasons 6 and 7 as single two-part seasons, so for them the new season is 8.
This is the bane of my Plex-organizing existence. Luckily, in this case Plex metadata and torrents agree on two-part seasons 6 and 7.
That’s why sports channels are much more expensive than everything else. It’s harder to pirate live sports events. FKN corpos made it impossible for my parents to watch Wimbledon.
Assuming said data scraping is a real concern for both Twitter and Reddit, are Fediverse servers at similar risk from scrapers and various automated API hits? I don’t really know enough about networks to answer.
I think the data scraping problem is more of an opportunity cost (they think AIs should pay them more to use their content) than a concern for the traffic they account for. If traffic, and not profit, was a problem, Wikipedia would start saying they can’t support AIs either.
You make a great point about Wikipedia - it’s laughable to me that scraping is actually why Twitter is doing this. They’re just trying to find a convenient reason for why they’re failing that doesn’t stem from their own incompetence.
The idea that “AI scraping” is any more expensive than search engine indexing is flatly nonsense, only credible to people who have never run any network service at scale.