DuckDuckGo, Bing, Mojeek, and other search engines are not returning full Reddit results any more.

  • Hot Potato@lemmy.world (OP)
    177 up · 1 down · 4 months ago

    tbh I’ve never seen a Lemmy link when searching for stuff. Is it too small to show up? Or do search engines not index Lemmy instances?

    • TimeSquirrel@kbin.melroy.org
      145 up · 4 months ago

      A lot of Fediverse admins are just normal people like you and me with a budget, and disallowing bots and spiders helps save bandwidth, and the budget.
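
      For context, one common way to ask crawlers to stay away is a blanket robots.txt. This is only a sketch, assuming Nginx sits in front of the instance (the comment above doesn’t say which method is meant). robots.txt is purely advisory, so badly behaved bots can simply ignore it.

      ```nginx
      # Goes inside a server block. Serves a "disallow everything" robots.txt
      # straight from the config, without touching the backend.
      location = /robots.txt {
          default_type text/plain;
          return 200 "User-agent: *\nDisallow: /\n";
      }
      ```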

      • Admiral Patrick@dubvee.org
        91 up · 4 months ago

        Yep. I block all bots to my instance.

        Most are parasitic (GPTBot, ImageSift bot, Yandex, etc.), but I’ve even blocked Google’s crawler (and its ActivityPub crawler bot) since it now feeds their LLMs. Most of my content can be found anyway because instances it federated to don’t block those, but the bandwidth and processing savings are what I’m in it for.

          • Admiral Patrick@dubvee.org
            3 up · 4 months ago

            Kinda long, so I’m putting it in spoilers. This applies to Nginx, but you can probably adapt it to other reverse proxies.

            1. Create a file to hold the mappings and store it somewhere you can include it from your other configs. I named mine map-bot-user-agents.conf

            Here, I’m doing a regex comparison against the user agent ($http_user_agent) and mapping it to either a 0 (default/false) or 1 (true) and storing that value in the variable $ua_disallowed. The run-on string at the bottom was inherited from another admin I work with, and I never bothered to split it out.

            'map-bot-user-agents.conf'
            # Map bot user agents
            map $http_user_agent $ua_disallowed {
                default                   0;
                "~CCBot"                  1;
                "~ClaudeBot"              1;
                "~VelenPublicWebCrawler"  1;
                "~WellKnownBot"           1;
                # Parentheses, "+" and "." are escaped so they match literally
                "~Synapse \(bot; \+https://github\.com/matrix-org/synapse\)" 1;
                "~python-requests"        1;
                "~bitdiscovery"           1;
                "~bingbot"                1;
                "~SemrushBot"             1;
                "~Bytespider"             1;
                "~AhrefsBot"              1;
                "~AwarioBot"              1;
                "~GPTBot"                 1;
                "~DotBot"                 1;
                "~ImagesiftBot"           1;
                "~Amazonbot"              1;
                "~GuzzleHttp"             1;
                "~DataForSeoBot"          1;
                "~StractBot"              1;
                "~Googlebot"              1;
                "~Barkrowler"             1;
                "~SeznamBot"              1;
                "~FriendlyCrawler"        1;
                "~facebookexternalhit"    1;
                # Inherited catch-all list; must stay on a single line
                "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
            
            }
            

            Once you have a mapping file set up, you’ll need to do something with it. This applies at the virtual host level and should go inside the server block of your configs (except the include for the mapping config).

            This assumes your configs are in conf.d/ and are included from nginx.conf.

            The map-bot-user-agents.conf is included above the server block (since it’s an http-level config item), and inside server, we look at the $ua_disallowed value, where 0=false and 1=true (the values are set in the map).

            You could also do the mapping in the base nginx.conf since it doesn’t do anything on its own.

            If the $ua_disallowed value is 1 (true), we immediately return an HTTP 444. The 444 status code is an Nginx thing, but it basically closes the connection immediately and wastes no further time/energy processing the request. You could, optionally, redirect somewhere, return a different status code, or return some pre-rendered LLM-generated gibberish if your bot list is configured just for AI crawlers (because I’m a jerk like that lol).

            Example site1.conf
            
            include conf.d/includes/map-bot-user-agents.conf;
            
            server {
              server_name  example.com;
              ...
              # Deny disallowed user agents
              if ($ua_disallowed) { 
                return 444;
              }
             
              location / {
                ...
              }
            
            }
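
            If you’d rather send a normal response than drop the connection, the same check can return a standard status code. A minimal sketch, where picking 403 is an assumption rather than something taken from the config above:

            ```nginx
            # Drop-in alternative to the "return 444" check inside the server block.
            # 403 sends a regular "Forbidden" response instead of closing the connection.
            if ($ua_disallowed) {
                return 403;
            }
            ```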
            
            
            
            • Mac@federation.red
              2 up · 4 months ago (edited)

              So I would need to add this to every subdomain conf file I have? Preciate you!

              • Admiral Patrick@dubvee.org
                2 up · 4 months ago (edited)

                I just include the map-bot-user-agents.conf in my base nginx.conf so it’s available to all of my virtual hosts.

                When I want to enforce the bot blocking on one or more virtual hosts (some I want to leave open to bots, others I don’t), I just include a deny-disallowed.conf in the server block of those.

                deny-disallowed.conf
                  # Deny disallowed user agents
                  if ($ua_disallowed) { 
                    return 444;
                  }
                
                site.conf
                server {
                  server_name example.com;
                   ...
                  include conf.d/includes/deny-disallowed.conf;
                
                  location / {
                    ...
                  }
                }
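
                For reference, the include in the base nginx.conf could look roughly like this. It’s a sketch that assumes the conf.d/includes/ layout used earlier in the thread, so adjust the paths to your setup.

                ```nginx
                http {
                    ...
                    # Load the map once at the http level so $ua_disallowed
                    # exists for every virtual host.
                    include conf.d/includes/map-bot-user-agents.conf;

                    # Per-site server blocks (site.conf and friends).
                    include conf.d/*.conf;
                }
                ```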
                
              • Admiral Patrick@dubvee.org
                0 up · 4 months ago

                Yeah, if’s are weird in Nginx. The rule of thumb I’ve always gone by is that you shouldn’t try to if on variables directly unless they’re basically pre-processed to a boolean via a map (which is what the user agent map does).
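
                To illustrate the rule of thumb: Nginx treats an if condition on a bare variable as false only when the value is an empty string or "0", so a map that always yields 0 or 1 keeps the check predictable. A small sketch, where $ua_empty is a made-up variable name for flagging requests that send no user agent at all:

                ```nginx
                # http level: reduce the raw header to a clean 0/1 flag first.
                map $http_user_agent $ua_empty {
                    default 0;
                    ""      1;   # no User-Agent header was sent
                }

                server {
                    ...
                    # The if only ever sees "0" or "1", never the raw header.
                    if ($ua_empty) {
                        return 444;
                    }
                }
                ```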

        • MCasq_qsaCJ_234@lemmy.zip
          1 up · 4 months ago

          I have two questions. How much do those bots consume your bandwidth? And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?

          I ask these questions because I don’t know much about the topic when managing a website or an instance of the fediverse.

          • Admiral Patrick@dubvee.org
            2 up · 4 months ago (edited)

            > How much do those bots consume your bandwidth?

            Pretty negligible per bot per request, but I’m not here to feed them. They also travel in packs, so the bandwidth does multiply. It also costs me money when I exceed my monthly bandwidth quota. I’ve blocked them for so long, I no longer have data I can tally to get an aggregate total (I only keep 90 days). SemrushBot alone, before I blocked it, was averaging about 15 GB a month. That one is fairly aggressive, though. Imagesift Bot, which pulls down any images it can find, would also use quite a bit, I imagine, if it were allowed.

            With Lemmy, especially earlier versions, the queries were a lot more expensive, and bots hitting endpoints that triggered a heavy query (such as a post with a lot of comments) would put unwanted load on my DB server. That’s when I started blocking bot crawlers much more aggressively.

            Static sites are a lot less impactful, and I usually allow those. I’ve got a different rule set for them which blocks the known AI scrapers but allows search indexers (though that distinction is slowly disappearing).
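
            A separate rule set like that could be a second, shorter map. This is a sketch only: $ua_ai_scraper is a hypothetical variable name, and which bots count as AI scrapers here is an assumption, picked from names in the list earlier in the thread.

            ```nginx
            # Flags AI scrapers only. Search indexers such as Googlebot and bingbot
            # fall through to the default and are allowed.
            map $http_user_agent $ua_ai_scraper {
                default         0;
                "~GPTBot"       1;
                "~ClaudeBot"    1;
                "~Bytespider"   1;
                "~ImagesiftBot" 1;
            }
            ```

            A static site’s server block would then test $ua_ai_scraper instead of $ua_disallowed.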

            > And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?

            I block bots by default, and that prevents my sites from being indexed since they can’t be crawled at all. Searching “dubvee” (my instance name / url) in Google returns no relevant results. I’m okay with that, lol, but some people would be appalled.

            However, I can search for things I’ve posted from my instance if they’ve federated to another instance that is crawled; the link will just be to the copy on that instance.

            For the few static sites I run (mostly local business sites since they’d be on Facebook otherwise), I don’t enforce the bot blocking, and Google, etc. are able to index them normally.

      • Cyborganism@lemmy.ca
        23 up · 2 down · 4 months ago

        Could it be possible to have one major global instance that aggregates everything so it can be indexed by search engines? Would that work? Or do I not fully understand how federation works?

        • wholookshere@lemmy.blahaj.zone
          32 up · 2 down · 4 months ago

          That would defeat the purpose of federation.

          It becomes a central choke point of moderation. Who gets to decide which instances are part of global and which ones aren’t? Because a free-for-all isn’t going to end well. And then you’re back at Reddit.

          • WanderingVentra@lemm.ee
            12 up · 4 months ago (edited)

            I wonder if you could have an instance federated to every other instance just for archival purposes, to save the data on every other instance’s posts and comments. Because copies of posts and comments are saved to federated instances, too, right? Or do I understand the tech wrong?

            It could have an admin team but no users, so nobody has to worry about spammers and bots joining that instance to get around defederation rules. Maybe it just has a bot that crawls Lemmy, looking for instances to federate to. Could that work?

          • rbits@lemm.ee
            2 up · 4 months ago

            Right, but having a centralised search index thingy is better than none at all. Maybe there could be something where it’s a joint effort from admins from many of the biggest servers, idk if that would work.

        • barsoap@lemm.ee
          5 up · 4 months ago (edited)

          Lemmy search already is quite excellent… at least here on lemm.ee, we don’t have many communities but tons of users subscribed to probably about everything on the lemmyverse so the servers have it all.

          It might be interesting to team up with something like YaCy: Instances could operate as YaCy peers for everything they have. That is, integrate a p2p search protocol into ActivityPub itself so that also smaller instances can find everything. Ordinary YaCy instances, doing mostly web crawling, can in turn use posts here as interesting starting points.

      • Amanda@aggregatet.org
        4 up · 4 months ago

        I was worrying about precisely this. I’d be ok with blocking search engines if there was a better way of searching but AFAICT there isn’t federated search of any kind?

        • thejml@lemm.ee
          20 up · 4 months ago

          Any data transit costs money. Both in the data transit itself and in the increased server resources to respond to the web queries in the first place.

          • chrischryse@lemmy.world
            2 up · 4 months ago

            Ah, that makes sense. I’m not really familiar with this stuff, so I didn’t think it was that intensive lol

    • u/lukmly013 💾 (lemmy.sdf.org)@lemmy.sdf.org
      24 up · 4 months ago

      I’ve seen some when I appended “Lemmy” just like “Reddit”. But it relies on lemmy being in the domain name.

      Also I assume even when people click on those results, they don’t get ranked much higher because it’s so many different domains while reddit is just one.

      • Ibuthyr@discuss.tchncs.de
        10 up · 4 months ago (edited)

        Kagi has a button that lets you search fediverse forums. I haven’t tested it yet though.

        Edit: yup, works like a charm!

    • infeeeee@lemm.ee
      18 up · 2 down · 4 months ago

      Most of the originalish content on Lemmy is Linux-related stuff, memes, and porn. The latter two are mostly image/video based, so you don’t search for them very frequently or easily. I can see that in the future it will become a very relevant source of info in Linux admin and user circles.

      I go back to r*ddit sometimes for some local content which is nonexistent on Lemmy. I see that the tech-related subs are mostly dead there, or at least only shadows of their former selves. E.g. go to r/linux, sort by top all time. In the first 100 results you will barely find anything posted after the exodus.

      • MudMan@fedia.io
        12 up · 4 months ago

        Yeah, the notion that Lemmy is a Reddit replacement is misguided. It definitely doesn’t have the same Q&A balance Reddit does. It feels a lot more like 90s and early 2000s forums than the large-scale self-service link and customer service churn Reddit encourages.

        Which I’m all for. I was never a Reddit guy and I do like it here. But in terms of how bad it is now that Reddit is not happy to host most of the actually useful online content for free… well, that’s a different conversation.

    • BombOmOm@lemmy.world
      15 up · 3 down · 4 months ago (edited)

      You can always add “site:lemmy.world” to your search (remove the quotes). I commonly do that, as well as the same for reddit or stack overflow.

    • NotAnotherLemmyUser@lemmy.world
      9 up · 4 months ago

      One of the major problems with Lemmy is that many posts get deleted and that nukes the comment section (which is where most of the answers will be).

      I wish Lemmy deleted posts closer to how Reddit deletes posts - the post content should be deleted, but leave the comments alone.

    • Gerudo@lemm.ee
      5 up · 4 months ago

      Twice I have come across links to lemmy, definitely not the norm though.

    • chiisana@lemmy.chiisana.net
      4 up · 4 months ago

      I’m inclined to think that, due to the nature of the platform, content is constantly duplicated in the eyes of search engines, which hurts the authoritativeness of each instance and thereby its ranking.