SearXNG should be a federated search engine

All the posts about Reddit blocking everyone except Google and Brave got me thinking: what if SearXNG was federated? I.e., when data is retrieved via a provider's API, that data is then federated to all other instances.

It would spread the API load out amongst instances, removing the API bottlenecks that come from search providers.

It would allow for more anonymous search, since users could cycle between instances and get the same results.

Geographic bias would be a thing of the past.

Other than ActivityPub overhead and storage, which could be reduced by federating text-only content, I fail to see any downside.

Thoughts?

kbal ,

I think you are not a computer programmer. Trying to build an index of the web by querying other search engines is not an efficient or sensible way to do things. Using ActivityPub for it is insane. Sharing query results in the obvious way might help a little during events where everyone searches for the same thing all at once, but in a relatively small pool of relatively sophisticated Internet users I don't think that happens often enough to justify the enormous amount of work and complexity.

On the other hand a distributed web crawler that puts its results in a free and decentralized database (one appropriate to the task; not blockchain) might be interesting. If the load on each node could be made light enough and the software simple enough that millions of people could run it at home, maybe it could be one way to build a new search engine. If that needs doing and someone has several hundred hours of free time to get it started.

hendrik ,

If you're looking for a distributed crawler and index:

https://en.wikipedia.org/wiki/YaCy

YaCy already exists and has been around for two decades.

fmstrat OP ,

This is close to what I was thinking, but rather than crawling independently, leverage the API results from queries to build a list of sites (and then perhaps crawl). Potentially a tag index of sorts. I’m not solid on any idea as I haven’t investigated SearXNG enough to see how it works under the hood, but yes, on the same plane of thought.

Max_P ,

I ran a YaCy instance for a while like a decade ago. It does federate index requests, and when you search it propagates the search request across a bunch of nodes. When my node came online it almost immediately started crawling stuff and it did get a bunch of search queries. But the network was still pretty small back then and the search results were… not great. That’s the price of independence from Google’s and Microsoft’s giant server farms, it’s hard to compete with that size.

But at the rate Google and Bing are enshittifying, I think it’s worth revisiting.

Using ActivityPub for this would be immensely wasteful. It’s just not feasible for every instance to hold the whole index because it’s so large. Back when I tried it, the network still had several TBs worth of indexed pages. This is firmly in the realm of distributed P2P systems. One could, however, have an ActivityPub plugin to receive updates from social media near instantly and index those immediately with less overhead. But you still want to index Wikipedia, forums, blogs, whatever the crawlers can find.
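
To illustrate the point that no single node needs the whole index, here is a minimal sketch of how a P2P network could decide which nodes store the postings for a given search term. It uses rendezvous (highest-random-weight) hashing, which is an assumption chosen for brevity and not a description of how YaCy actually partitions its index. The node names, `score`, and `nodes_for_term` are hypothetical.

```python
import hashlib
from typing import List


def score(node: str, term: str) -> int:
    """Deterministic pseudo-random weight for a (node, term) pair."""
    digest = hashlib.sha256(f"{node}|{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def nodes_for_term(term: str, nodes: List[str], replicas: int = 2) -> List[str]:
    """Return the `replicas` nodes responsible for storing a term's postings.

    Every peer can compute this locally from the node list alone, so each
    node only stores (and answers queries for) its own slice of the index.
    """
    ranked = sorted(nodes, key=lambda n: score(n, term), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    peers = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical peers
    for word in ("fediverse", "searxng", "yacy"):
        print(word, "->", nodes_for_term(word, peers))
```

A useful property of this scheme is that when a node that does not hold a term leaves the network, the assignment for that term does not change, so little data has to move as peers come and go.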

hendrik ,

Sure. SearX is a meta-search engine. It (only) sends queries to other search engines to get results. YaCy on the other hand is itself a search engine. It has the data available and doesn't query other engines. In theory you could combine the two concepts and have one piece of software that does both. But that requires some clever thinking. The returned (Google) ranking only applies to the exact search term, and it's questionable whether you can store it and do anything useful with it except when some other user searches for the exact same thing. The returned teaser texts are also very short and tailored to the search query, so they may be useless too. It'd be hard.

One thing you could do is crawl the results that users actually click on. And I think YaCy already does that. AFAIK they had a browser add-on or a proxy or something to intercept visited pages (and hence search results).
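
The limitation described above (a stored ranking is only reusable when someone searches for the exact same thing) can be made concrete with a small sketch. This is an illustrative cache, not anything SearX or YaCy ships: `QueryCache`, `normalize`, and the time-to-live value are all hypothetical, and the only "fuzziness" it allows is case and whitespace.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries match."""
    return " ".join(query.lower().split())


class QueryCache:
    """Stores result lists keyed by the normalized query, with an expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def put(self, query: str, results: List[str]) -> None:
        self._entries[normalize(query)] = (self.clock(), list(results))

    def get(self, query: str) -> Optional[List[str]]:
        key = normalize(query)
        entry = self._entries.get(key)
        if entry is None:
            return None  # a different query, even a closely related one, misses
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # stale rankings are dropped
            return None
        return results
```

Note that a query with one extra word is a miss, which is why a cache like this only pays off when many users search for the same thing at roughly the same time.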

fmstrat OP ,

Well, I am, including products in the Fediverse. And I never said federate the search queries.

“Trying to build an index of the web by querying other search engines is not an efficient or sensible way to do things.”

Never made this suggestion.

“On the other hand a distributed web crawler that puts its results in a free and decentralized database”

Now you’re getting there.

kbal ,

Okay, sorry! Still a long way to go before the idea becomes sufficiently well-specified to make much sense to me though. Perhaps an examination of yacy could provide you a concrete example of the ways in which such things are complicated. One would need to do much better to end up with a suitable replacement for the ways many of us use searx.

It was wanting to use ActivityPub and the "I fail to see any downside" which led me to read the rest of your post in a way that might've been overly pessimistic about its merits.

mesamunefire ,

I recall there is a federated search engine… somewhere. Anyone know what that was called?

toothbrush ,

Are you thinking of YaCy?

kbal , (edited )

Ah, I wondered if something like that had been tried before. Looks like it is maybe still running: https://yacy.net/

The demo isn't giving me useful search results.

Wxnzxn ,

I ran an instance for a while out of curiosity a few years back. Building the database seemed to work fine and appeared like a good idea, and it was a lot of fun to see the connections with other servers and my crawler filling holes of unknown spaces. But I think the search algorithm itself was (most likely is) not sophisticated enough: it just did not give relevant results often enough, and it was extremely vulnerable to very simple SEO tactics to push trash to the top.

Buelldozer ,

There have only been about 700 YaCy peers online in the last 30 days, which is pretty low for a “crowd sourced” search engine, especially when many of those are, I think, temporary peers that come and go. It looks like it has only maybe 200 “master” servers, which wouldn’t be nearly enough to keep up with the Internet these days.

The good news is that if there are websites or URLs that you care about, you can point your own YaCy instance at them and schedule the crawls to keep up with content changes.

I remember reading about YaCy some years ago, and now that I’ve bumped into it again it’s sparked my interest. I may stand up a Docker instance and play with it for a while. If nothing else it could make a very useful “arrrrr” search engine.

aldalire ,

One of the things that can get annoying about searxng is that often search engines will rate limit if a lot of people are using one searxng instance. Maybe a “federated” approach would be, if results are rate limited -> send query to another trusted searx instance -> receive the results and send back to user. That way, people can stick to their favorite searxng instance without having to manually change their instance if the search engines were rate limiting.
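
A rough sketch of the fallback flow described above, under stated assumptions: each instance is modeled as a plain function that either returns results or raises a hypothetical `RateLimited` error. Nothing here is an existing SearXNG feature, and how one instance would actually query another (and decide which peers are trusted) is left open.

```python
from typing import Callable, Iterable, List


class RateLimited(Exception):
    """Hypothetical signal that an instance is being throttled upstream."""


# A search backend takes a query string and returns a list of result URLs.
Search = Callable[[str], List[str]]


def search_with_fallback(query: str, local: Search,
                         peers: Iterable[Search]) -> List[str]:
    """Try the user's own instance first, then trusted peers in order."""
    try:
        return local(query)
    except RateLimited:
        pass  # local instance is throttled, fall through to peers
    for peer in peers:
        try:
            return peer(query)
        except RateLimited:
            continue  # this peer is throttled too, try the next one
    raise RateLimited("local instance and all trusted peers are rate limited")
```

One design consequence worth noting: the peer that answers sees the query, so the list of "trusted" instances matters for privacy as much as for availability.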

mesamunefire ,

I self-host with YunoHost; it’s a good way to not bog down the system.

catloaf ,

So everyone stores a part of the search index? I think you’ve invented a machine-readable website index with extra steps.

fmstrat OP ,

Hah, could be.
