Will Meta scrape and crawl through all our data now?

So from my understanding, I can block an instance which prevents it from showing in my feed.

However, if the instance I post to (.world) is not blocked on the receiving instance’s end (Meta), they will still get my post (unless defederated)?

If so, doesn’t that open up the idea that Meta will be able to scrape and take ALL the data from ALL the (still federated) instances’ posts that are not blocked by the Meta instance(s)? How can I protect my information from Meta while still being federated, or is that not possible?

RightHandOfIkaros , 11 months ago (edited 11 months ago)

Always have been.

Push your local legislators to change the law in favor of consumer data protection rather than infinitely growing company profits.

escapedgoat , 11 months ago

...now?

sweet summer child.

SkyNTP , 11 months ago

If you post something online, it is public and you have to assume someone or even everyone has already scraped and harvested the data.

It has always been like this.

If you come from an incumbent social media platform, perhaps you never got to experience this lesson for yourself. But that data has also been harvested. They just gave you a bit of an illusion of privacy.

The only thing online that is private is E2E encryption directly with a party you trust, and only if you are the only ones with a copy of the keys.

treadful , 11 months ago

The benefit now is that one company can’t get exclusive access to your public data. It’s open to anyone that wants it.

DLSchichtl , 11 months ago

Cuts out the middle man!

KarsicKarl , 11 months ago

Scraping can only get a limited amount of information: what is already public.

There's a lot of information that servers keep contained, such as the IP address you posted from. Other info, such as your email address, also stays within your own instance. Meta cannot get at that information. No other Fediverse server can get at it either.

This blog post is from Gargron (Eugen Rochko), who created Mastodon, one of the Fediverse systems built on ActivityPub alongside Calckey, Pixelfed, kbin, Lemmy etc.

https://blog.joinmastodon.org/2023/07/what-to-know-about-threads/

Ab_intra , 11 months ago (edited 11 months ago)

What I wonder is how Lemmy handles this. He is writing about how Mastodon does things, not Lemmy.

Jeze3D , 11 months ago

They always could? These are public-facing platforms. You’re being scraped by far more than just Meta.

Anomander , 11 months ago

Yeah, absolutely nothing was preventing them from doing so already, without launching Threads.

Blocking Meta / Threads instances isn't going to stop them, either.

UziBobuzi , 11 months ago

If you post something public, people can access it. Corporations can access it. It's one of the reasons I ditched all my social media that identified me directly. They can scrape my stuff, sure; but they won't be able to link it to my actual name, face or existence in real life.

Rottcodd , 11 months ago

Meta can already scrape everything from every fediverse instance. Hell - if you wanted to and were willing to invest the time and effort, you could scrape everything from every fediverse instance.

By definition, everything that you post online is accessible to other people. The devices through which those other people view the content you posted are able to make copies of whatever they view. So literally anyone with a computer and an internet connection can already “take” whatever you post.

There’s only one way in all the world to protect your information, and that’s to not post it in the first place. The instant you post it, anyone who cares enough to do it can “take” it, and there’s NOTHING you can do to stop them.

If Meta cares about your information, they’ll go ahead and collect it anyway, with or without Threads or federation with Threads.

red , 11 months ago

If you post publicly on the internet, your post is publicly available on the internet. Anyone can read it. Everything that can be read can also be scraped.

So they don’t need Threads to scrape all Lemmy data. Or to be more precise, if they wanted to scrape Lemmy data, they would most likely do it separately from Threads.

marsara9 , 11 months ago

Using Threads / ActivityPub does make it easier though.

If they’re using a traditional crawler you could in theory block them at the user agent level (e.g. with Cloudflare). If they’re using the public APIs, they’d have to write an interface for each distinct piece of software (Lemmy, Kbin, Mastodon, etc…) (How my search engine works)
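To make the user-agent idea concrete, here is a minimal sketch of the kind of check a server or proxy could apply. The token list and the function name are assumptions for illustration only, and a scraper can send any User-Agent it likes, so this only stops crawlers that identify themselves honestly.

```python
from typing import Optional

# Illustrative tokens only (an assumption, not an authoritative list).
BLOCKED_AGENT_TOKENS = ("facebookexternalhit", "meta-externalagent")


def is_blocked(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    if not user_agent:
        # No header at all: this sketch lets the request through.
        return False
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


print(is_blocked("facebookexternalhit/1.1"))        # self-identified crawler
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64)"))  # ordinary browser string
```

A crawler that simply copies a browser's User-Agent string passes this check, which is why it only works "in theory".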

But with ActivityPub we’re essentially just sending them the data in near real-time, all using the same rough structure. Individual instances may block them, but it wouldn’t be hard to set up proxies/relays that the community as a whole just isn’t aware of (i.e. a new “Lemmy” instance comes online that just looks like a single-user server, but it’s actually just a relay to Meta). The only real gotcha with ActivityPub is that there’s no real way to get historical data (nothing from the past).
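A rough sketch of what that push looks like, assuming a simplified server: the function names, the example domains, and the use of a `Note` object are hypothetical, and real software (Lemmy, Mastodon, etc.) differs in object types and adds HTTP signatures. It only shows that a public post is wrapped in a common `Create` shape and delivered to every inbox whose domain is not blocked.

```python
from urllib.parse import urlparse

PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def build_create_activity(actor: str, note_id: str, content: str) -> dict:
    """Wrap a public post in an ActivityStreams 'Create' activity."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Create",
        "actor": actor,
        "to": [PUBLIC],
        "object": {
            "id": note_id,
            "type": "Note",
            "attributedTo": actor,
            "content": content,
            "to": [PUBLIC],
        },
    }


def delivery_targets(inboxes: list, blocked_domains: set) -> list:
    """Keep only inboxes whose host is not on the instance blocklist."""
    return [url for url in inboxes if urlparse(url).hostname not in blocked_domains]


activity = build_create_activity(
    "https://lemmy.example/u/alice",        # hypothetical actor
    "https://lemmy.example/post/1",         # hypothetical post id
    "Hello, fediverse",
)
inboxes = ["https://threads.example/inbox", "https://relay.example/inbox"]
targets = delivery_targets(inboxes, {"threads.example"})
print(targets)
```

Here the blocklist stops direct delivery to `threads.example`, but the post still goes to `relay.example`, and nothing in the protocol tells the sender what that server does with it afterwards.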

Now I still have mixed feelings about Meta joining the fediverse, but if we’re just talking about blocking them from getting the content we have here, then things get difficult.

mr47 , 11 months ago

Hear me out. Why don't we create a paywalled API for 3rd party apps (like Threads) that will end up costing, say, $20 million annually to use at Meta's rate... And then use the proceeds to keep all the instances running, and for other shenanigans.

Alexmitter , 11 months ago

No more or less than they can already do by just using web bots.

UziBobuzi , 11 months ago

They can do it anyway, without Threads being in the mix at all. Unfortunately the only way to be sure no corporation can scrape your data is to not be on the internet at all.

LilDumpy OP , 11 months ago

Ahh, very true, but aren’t there legal obligations regarding privacy if data is collected via a site vs the public web?

Bishma , 11 months ago (edited 11 months ago)

If there are, they are never enforced in the US. Court case after court case has sided with the scraper rather than the site. Though usually Facebook is the scrapee in these cases, not the scraper.

I used to work in real estate tech, so I know of a lot of efforts the US’s National Association of Realtors has made to stop scrapers of RE data. Some were via the legal system, some tried to push the onus onto us as paying consumers of their data. Not a single thing worked. If anything, they may have invoked the Streisand effect once or twice and gotten more of their data scraped.

TimeIncarnate , 11 months ago

Short answer is “no.”

Slightly longer answer is: “all of your public posts on Lemmy or Mastodon or any other federated platform are the public web. So no, it’s not different.”

awderon , 11 months ago

OpenAI is currently being sued because they used everything they could find to train their AI models. We will see how that works out.

edition.cnn.com/2023/06/28/tech/…/index.html
