Microsoft and Reddit Are Fighting About Why Bing’s Crawler Is Blocked on Reddit

theangriestbird OP , 5 hours ago

The beef between Microsoft and Reddit came to light after I published a story revealing that Reddit is currently blocking every crawler from every search engine except Google, which earlier this year agreed to pay Reddit $60 million a year to scrap the site for its generative AI products.

I know the author meant “scrape”, but sometimes it really does feel like AI is just scrapping the old internet for parts.

cybermass , 4 hours ago

Yeah, aren’t like over half of reddit comments/posts by bots these days?

originalucifer , 3 hours ago

yep, and the longer that happens the less value the dataset has. it’s becoming aged.

RiikkaTheIcePrincess , 3 hours ago

[Joke] See, Reddit’s doing a nice thing here! They’re making sure nobody ends up toxifying their own dataset by using Reddit’s garbage heap of bot posts!

originalucifer , 3 hours ago

google needs an 'ignore reddit' checkbox. i’m sick of having to manually add -reddit

Cube6392 , 3 minutes ago

Hey good news. Turns out you can use bing and not get back Reddit results

originalucifer , 1 minute ago

yeah but then i get back bing results. no one needs that

Moonrise2473 , 5 hours ago (edited 4 hours ago)

A search engine can’t pay a website for the honor of bringing it visits and ad views.

Fuck reddit, get delisted, no problem.

Weird that google is ignoring their robots.txt though.

Even if they pay them for being able to say that glue is perfect on pizza, having
User-agent: *
Disallow: /

should block googlebot too. That means google programmed an exception on googlebot to ignore robots.txt on that domain and that shouldn’t be done. What’s the purpose of that file then?

Because robots.txt is entirely honor-based (a bot has no need to pretend to be another bot, it could just ignore the file), the file should instead be
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
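
For anyone who wants to check how a spec-following parser reads the two files above, here is a minimal sketch using Python's standard-library urllib.robotparser. It only shows what the rules say on paper. It assumes nothing about how Googlebot or Reddit actually behave, and the URL is purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# First file from the comment: every crawler is disallowed everywhere.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Second file: an empty Disallow for Googlebot means "nothing is disallowed",
# while every other crawler is still blocked.
ALLOW_GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
"""

URL = "https://www.reddit.com/r/all"  # illustrative URL, not taken from the thread


def allowed(robots_txt: str, agent: str, url: str = URL) -> bool:
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, rules in [("block all", BLOCK_ALL),
                        ("allow Googlebot only", ALLOW_GOOGLEBOT_ONLY)]:
        for agent in ("Googlebot", "bingbot"):
            print(f"{name}: {agent} allowed = {allowed(rules, agent)}")
```

The parser only reports what the file asks for. Nothing in it can force a crawler to comply, which is the honor-system point made above.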

MrSoup , 5 hours ago

I doubt Google respects any robots.txt

Moonrise2473 , 4 hours ago

for ordinary sites they respect it, and they even warn a webmaster who submits a sitemap containing paths that robots.txt disallows

Reply

Report

Activity

Open original URL

Copy original URL

Copy Mbin URL

Loading...

DaGeek247 , 4 hours ago

My robots.txt has been respected by every bot that visited it in the past three months. I know this because i wrote a page that IP bans anything that visits it, and I also put it as a not allowed spot in the robots.txt file.

I've only gotten like, 20 visits in the past three months though, so, very small sample size.
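
A rough sketch of that trap idea in Python, for illustration only. This is not the actual setup described above: the trap path, the class name and the in-memory ban list are all assumptions, and a real deployment would more likely ban at the web server or firewall level.

```python
# Hypothetical trap path. It must also be listed under Disallow in robots.txt,
# so that only bots ignoring the file ever request it.
TRAP_PATH = "/juicy-content"

ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class Honeypot:
    """Ban any client IP that requests the disallowed trap path."""

    def __init__(self, trap_path: str = TRAP_PATH) -> None:
        self.trap_path = trap_path
        self.banned: set[str] = set()

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if ip in self.banned:
            return 403  # already banned: refuse everything
        if path == self.trap_path:
            self.banned.add(ip)  # well-behaved bots never get here
            return 403
        return 200  # normal pages, including /robots.txt itself


if __name__ == "__main__":
    pot = Honeypot()
    print(pot.handle("203.0.113.7", "/robots.txt"))
    print(pot.handle("203.0.113.7", TRAP_PATH))
    print(pot.handle("203.0.113.7", "/"))
```

Requests for /robots.txt itself are never punished, only requests for the path it disallows, so search engines that follow the file are unaffected.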

thingsiplay , 4 hours ago (edited 3 hours ago)

Interesting way of testing this. Another would be to query the search engines with site:your.domain added (Edit: Typo corrected. Of course without the - in -site:, otherwise you exclude the site instead of limiting results to it.) to show results from your site only. Not an exhaustive check, but another tool to test this behavior.

mozz , 4 hours ago

I know this because i wrote a page that IP bans anything that visits it, and I also put it as a not allowed spot in the robots.txt file.

This is fuckin GENIUS

Moonrise2473 , 3 hours ago

only if you don’t want any visits except from yourself, because this removes your site from any search engine

should write a “disallow: /juicy-content” and then block anything that tries to access that page (only bad bots would follow that path)

Miaou , 3 hours ago

That’s exactly what was described…?

Moonrise2473 , 1 hour ago

Oops. As a non-native English speaker I misunderstood what he meant. I understood wrongly that he set the server to ban everything that asked for robots.txt

mozz , 3 hours ago

You need to read again the thing that was described, more carefully. Imagine for example that by “a page,” the person means a page called /juicy-content or something.

MrSoup , 3 hours ago

Thank you for sharing

skullgiver , 4 hours ago

I think Reddit serves Googlebot a different robots.txt to prevent issues. For instance, check Google’s cached version of robots.txt: it only blocks stuff that you’d expect to be blocked.

tal , 2 hours ago (edited 2 hours ago)

I guessed in a previous comment that, given their new partnership, Reddit is probably feeding its comment database to Google directly, which reduces load for both of them and lets Google have real-time updates of the whole kit and caboodle rather than polling individual pages. Both Google and Reddit are better off doing that, and for Google it’d make sense to special-case any site that’s large and valuable enough to warrant the effort.

I know that Reddit built functionality for that before, used it for pushshift.io and I believe bots.

I doubt that Google is actually using Googlebot on Reddit at all today.

I would bet against either Google violating robots.txt or Reddit serving different robots.txt files to different clients (why? It’s just unnecessary complication).

jarfil , 2 hours ago

Google is paying for the use of Reddit’s API, not for scraping the site.

That’s Reddit’s new business model: if you want “their” (users’) content, pay for API access.

ssm , 3 hours ago

I hope all big corporate SEO trash follows suit. Once they’ve all filtered themselves out for profit, we can hopefully get some semblance of an unshittified search experience.

CanadaPlus , 2 hours ago

Man, wouldn’t that be nice. There’s too much money in appearing on searches for me to ever expect that to happen, though.

tal , 1 hour ago

The reason that robots.txt generally worked was that nobody was really trying to leverage it against bot operators. I’m not sure this won’t just kill robots.txt. Historically, search engines wanted to index stuff and websites wanted to be indexed. Their interests were aligned, so the convention worked. This no longer holds if things like the Google-Reddit partnership become common.

Reddit can also try to detect and block crawlers; robots.txt isn’t the only tool in their toolbox.

Microsoft, unlike most companies, does actually have a technical counter that Reddit probably cannot stop, if it comes to that and Microsoft wants to do a “hostile index” of Reddit.

Microsoft’s browser, Edge, is used by a bunch of people, and Microsoft can probably rig it up to send content of Reddit pages requested by their browser’s users sufficient to build their index. Reddit can’t stop that without blocking Edge users. I expect that that’d probably be exploring a lot of unexplored legal territory under the laws of many countries. It also wouldn’t be as good as Google’s (I assume real-time) access to the comments, but they’d get to them.

Browsers do report the host-referrer, which would permit Reddit to detect that a given user has arrived from Bing and block them:

en.wikipedia.org/wiki/HTTP_referer

In HTTP, “Referer” (a misspelling of “Referrer”[1]) is an optional HTTP header field that identifies the address of the web page (i.e., the URI or IRI), from which the resource has been requested. By checking the referrer, the server providing the new web page can see where the request originated.

In the most common situation, this means that when a user clicks a hyperlink in a web browser, causing the browser to send a request to the server holding the destination web page, the request may include the Referer field, which indicates the last page the user was on (the one where they clicked the link).

Web sites and web servers log the content of the received Referer field to identify the web page from which the user followed a link, for promotional or statistical purposes.[2] This entails a loss of privacy for the user and may introduce a security risk.[3] To mitigate security risks, browsers have been steadily reducing the amount of information sent in Referer. As of March 2021, by default Chrome,[4] Chromium-based Edge, Firefox,[5] Safari[6] default to sending only the origin in cross-origin requests, stripping out everything but the domain name.

Reddit could block browsers with a host-referrer off bing.com, killing the ability of Bing to link to them. I don’t know if there’s a way for a linking site to ask a browser to not give or forge the host-referrer. For Edge users – not all Bing users – Microsoft could modify the browser to do so, forcing Reddit to decide whether to block all Edge users or not.
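
As a rough illustration of the server-side check described above, here is a minimal Python sketch. It is an assumption about how such a filter could look, not anything Reddit is known to run, and the function names and the 403 response are hypothetical.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit

BLOCKED_DOMAIN = "bing.com"


def referred_from(referer: Optional[str], domain: str = BLOCKED_DOMAIN) -> bool:
    """True if the Referer value points at `domain` or one of its subdomains."""
    if not referer:
        return False  # header absent or stripped by the browser
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False  # malformed header value
    if host is None:
        return False
    return host == domain or host.endswith("." + domain)


def status_for(headers: Mapping[str, str]) -> int:
    """Refuse requests that arrived via a link on the blocked domain."""
    return 403 if referred_from(headers.get("Referer")) else 200


if __name__ == "__main__":
    print(status_for({"Referer": "https://www.bing.com/"}))
    print(status_for({"Referer": "https://duckduckgo.com/"}))
    print(status_for({}))
```

Because browsers now send only the origin on cross-origin requests, the check compares host names and ignores the path. A request with no Referer at all passes, so the filter only works as long as the header is actually sent.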

i_am_not_a_robot , 2 minutes ago

It is possible to remove the referer header:

developer.mozilla.org/en-US/docs/…/noreferrer

developer.mozilla.org/en-US/…/Referrer-Policy
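
A small sketch of the two mechanisms those pages document, written as Python that a linking site might use to build its output. The helper name and the example link are hypothetical. Only the rel="noreferrer" attribute and the Referrer-Policy: no-referrer header value come from the linked documentation.

```python
from html import escape

# Response header a linking site can send so that browsers omit the Referer
# header on requests made from its pages.
NO_REFERRER_HEADERS = {"Referrer-Policy": "no-referrer"}


def outbound_link(url: str, text: str) -> str:
    """Build an anchor tag asking the browser not to send Referer for this link."""
    href = escape(url, quote=True)
    return f'<a href="{href}" rel="noreferrer">{escape(text)}</a>'


if __name__ == "__main__":
    print(outbound_link("https://www.reddit.com/r/all", "a Reddit thread"))
    print(NO_REFERRER_HEADERS)
```

Either approach leaves the destination site with no Referer to inspect, provided the browser honors the attribute or the policy.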

TehPers , 1 hour ago

Joke’s on Reddit. I’ve been blocking their results in the search engine I use for months!

I wonder if this will end up being pursued as an antitrust case. If anything, it’ll reduce traffic to Reddit from non-Google users, so hopefully that kills them off just a little faster.

AVincentInSpace , 32 minutes ago

Come on. Be realistic. Chrome has 70% browser market share and people are already used to tacking “Reddit” onto the end of their search queries to find useful information. If anything this will have no effect besides steering people towards Google.

TehPers , 10 minutes ago

People on Chrome adding Reddit to their Google searches already use Google. People not using Google who don’t search “Reddit” are going to see fewer Reddit results.

No, this won’t kill Reddit, but it certainly isn’t helping them get more traffic.

lemmyvore , 2 minutes ago

…I thought that was the whole point of Spez blocking other spiders.

Cube6392 , 47 seconds ago

They don’t care about traffic. They care about the existing barrel of data for the data models

doctortofu , 1 hour ago

I can see why spez is upset about scrapers and search engines - imagine a company profiting from people creating lots of data, just hoarding it and using it for free, and not paying those people a cent. Preposterous, right? :)
