Search Lemmy!!

MalReynolds@lemmy.world · 1 year ago

Search Lemmy!!

orionstein@lemmy.world · 1 year ago

This is going to be even worse than Reddit search, unfortunately. There’s not an easy way to make a search like this scale even for the small number of instances we know about. Considering there are tons of instances out there and there will probably be more in the future, these problems are going to crop up a lot more. It’s actually much easier to search in one centralized location, however Reddit’s search actually ended up being implemented.

thegoblin@digitalgoblin.uk · 1 year ago

It does become very fragmented. A post on my single-user server is going to be low down the rankings compared to the same post on a subreddit with the weight of the reddit domain name behind it. I’m also not entirely sure if/how content here gets indexed, especially when it appears under different federated domains. Content discovery is very different in a distributed world.

MalReynolds@lemmy.world · 1 year ago

Perhaps not, Google et al. will likely grab it all anyway, perhaps we can be forward facing. Actually, while they’re privacy invading, the benefit of keeping freely given info still stands, and may, in the long term, prevail. One may hope…

MalReynolds@lemmy.world · 1 year ago

Discord, for example, means all useful information is captured by Discord, never to be searched by plebs. IRC is usually ephemeral. Most web search has been diluted by SEO and content farms to the point of uselessness. Perhaps we can think about next-gen search right now. A point of hope is things like gigabrain, which, it would seem, use LLMs to ‘cut through the noise’ but also summarize and collate; that seems like a useful way forward if distributed. Happy to look into it myself, but would like to hear others’ input. (Pleasantly, people were commenting before I finished.)

MalReynolds@lemmy.world · edit-2 1 year ago

Not sure how to deal with this, but I believe I am a competent coder with ideas. Perhaps this is an inappropriate community for this; happy to move the question.

Barbarian@sh.itjust.works · 1 year ago

Eventually I hope lemmy.directory will be great for this purpose. It’s a Lemmy instance configured to pick up every Lemmy community it can find.

lenninscjay@lemmy.world · 1 year ago

Don’t know how to help but agree on how important search is. Which might be even harder to do given federation.

Also, upvote for the Firefly username

MalReynolds@lemmy.world · 1 year ago

join! https://lemmy.world/c/firefly

NeverDaunted@kbin.social · 1 year ago

I work for a small company that runs a website with lots of information and our search has always sucked. We tried several tweaks and free solutions - the final decision was to pay for search which is what we did and it is awesome now, but expensive. A major company like Reddit should be able to figure it out, but search is harder than most people realize. Google just makes it look easy.

anaximander@feddit.uk · 1 year ago

Simplest implementation is that an instance searches its own content while sending requests to federated instances and merging their results in with its own based on whatever method the instance admins want (whether it puts its own results at the top, or treats them as one set, or whatever). That could cause a lot of traffic and has a load of latency while your search spreads out hop by hop, to the instances that yours is federated with, to the ones they’re federated with, etc. Plus you’d need a mechanism to stop instances from sending a search to an instance that’s already got it, to avoid hammering instances that have multiple federation paths to yours. Not an easy problem.
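
A rough Python sketch of that fan-out, purely as illustration: the `Instance` class, the `query_id` tag and the hop limit are all made-up names and assumptions, not anything from Lemmy's actual code or protocol. Each search carries a unique ID, and an instance that has already seen that ID answers with nothing, which is one way to avoid hammering instances reachable by several federation paths. Merging here just puts local results first.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    """A hypothetical Lemmy-like instance: local posts plus federated peers."""
    name: str
    posts: list
    peers: list = field(default_factory=list)
    seen_queries: set = field(default_factory=set)

    def search(self, term: str, query_id: Optional[str] = None, hops: int = 2):
        # The originating instance mints the ID; forwarded calls reuse it.
        query_id = query_id or uuid.uuid4().hex
        # Already handled this search via another federation path: stay quiet.
        if query_id in self.seen_queries:
            return []
        self.seen_queries.add(query_id)

        # Local results go first; an admin could choose another merge policy.
        results = [(self.name, p) for p in self.posts if term.lower() in p.lower()]

        # Spread hop by hop, stopping once the hop budget runs out.
        if hops > 0:
            for peer in self.peers:
                results.extend(peer.search(term, query_id, hops - 1))
        return results
```

In a real deployment the recursive call would be a network request, so the latency and traffic concerns above still apply; the sketch only shows the de-duplication idea.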

You might be able to do some kind of index publication where an instance publishes the most notable posts for other instances to include in their indexes, so that when you search it could show you results from among hot posts elsewhere in the fediverse - not an exhaustive list, but a search within posts that are getting attention.

There’s also other stuff I’d be tempted to experiment with, like using some kind of TF-IDF ranking to choose what counts as “most notable”, rather than just activity or view count, so that posts that are particularly relevant to certain topics could be publicised. An instance could even choose to filter that, so for example an instance that chooses to focus on tech topics could publicise highly-relevant tech posts but filter out politics keywords even when a post gets high relevance scores, so that political discussion on that instance is less visible, even when searched for.
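
To make the TF-IDF idea a bit more concrete, here is a minimal sketch in plain Python. The function names, the `topic_terms` / `blocked_terms` parameters and the simple `tf * log(N / df)` weighting are my own assumptions for illustration; a real instance would more likely lean on an existing search library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(posts):
    """Return one {term: tf-idf weight} dict per post."""
    docs = [Counter(tokenize(p)) for p in posts]
    n = len(docs)
    # Document frequency: in how many posts each term appears.
    df = Counter(term for doc in docs for term in doc)
    scores = []
    for doc in docs:
        total = sum(doc.values()) or 1
        scores.append({t: (c / total) * math.log(n / df[t]) for t, c in doc.items()})
    return scores


def notable_posts(posts, topic_terms, blocked_terms=(), top_n=2):
    """Pick the posts most relevant to topic_terms, skipping blocked keywords."""
    ranked = []
    for post, weights in zip(posts, tfidf_scores(posts)):
        # Instance-level filter: never publicise posts containing blocked terms.
        if any(term in weights for term in blocked_terms):
            continue
        relevance = sum(weights.get(term, 0.0) for term in topic_terms)
        if relevance > 0:
            ranked.append((relevance, post))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in ranked[:top_n]]
```

A tech-focused instance might call `notable_posts(posts, ["compiler", "kernel"], blocked_terms=["election"])` and publish only what comes back.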

MalReynolds@lemmy.world · 1 year ago

Thank you for applying solid thought. What there would you consider actionable? As in, what could likely be coded (for free)?

anaximander@feddit.uk · 1 year ago

Any of that could be done; there’s some parts that are more challenging but there are certainly harder things that have been solved by open-source software. I know almost nothing about how Lemmy’s innards are built though, so I couldn’t hazard a guess as to how much effort any of it would take. Some of it could possibly be achieved through separate services that you could host alongside a Lemmy instance, or entirely on their own, while other parts would really work best as features within Lemmy’s own codebase.

digitallyfree@kbin.social · 1 year ago

In the past I normally used Pushshift to search Reddit due to how poor the search engine was. I think it was only very recently that they finally added comment searching.

headie_sage@fanaticus.social · edit-2 1 year ago

I’ve posted about this before as it relates to mod tools.¹

The search part isn’t all that difficult; there are open source search engines that are easy enough for admins to configure into a decent search feature. The more difficult issue is aggregating the data from all our instances into a single source where we can make queries with those existing search engine tools.

I am going to spend some time this weekend working on a proof of concept for a search engine for mod tools. Big picture solution is:

1. Instance admins regularly dump anonymized (i.e. no PII) post and comment data to a public source (possibly torrent, possibly sftp)
2. Other instance admins download each other’s data and feed it into their search db (e.g. Elasticsearch)
3. Mods & users create tools using this data
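
A very rough sketch of that pipeline in Python, just to show the shape of it. The field names in `PUBLIC_FIELDS` are hypothetical (a real Lemmy export may look quite different), the dump is plain JSON Lines, and a small in-memory inverted index stands in for a real search db like Elasticsearch. It only drops identifying fields; it does not scrub PII that users typed into the post body.

```python
import json
import re
from collections import defaultdict

# Hypothetical field names; a real Lemmy export may look different.
PUBLIC_FIELDS = ("id", "community", "title", "body")


def anonymize(record):
    """Keep only whitelisted fields, dropping author, IP, email and the like."""
    return {k: record[k] for k in PUBLIC_FIELDS if k in record}


def dump(records):
    """Serialize records as JSON Lines, one anonymized post per line."""
    return "\n".join(json.dumps(anonymize(r), sort_keys=True) for r in records)


def build_index(instance, dump_text, index=None):
    """Feed a downloaded dump into an inverted index: token -> {(instance, id)}."""
    index = index if index is not None else defaultdict(set)
    for line in dump_text.splitlines():
        post = json.loads(line)
        text = f"{post.get('title', '')} {post.get('body', '')}"
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add((instance, post["id"]))
    return index
```

Calling `build_index` once per downloaded dump, passing the same `index` each time, gives one combined lookup across instances that mod tools could query.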

BTW: this isn’t a novel idea:

- This is what Pushshift is for Reddit (check out their FAQ/wiki). We’re missing mod tools big time, and searching/aggregating is a huge part of mod tools.
- Up until recently, like last week, Stack Exchange provided a regular dump of their data to the Internet Archive for posterity’s sake

EDIT: Linked my OG post on the subject: [1](https://fanaticus.social/post/1955)