While clicking through some random Lemmy instances, I found one that’s due to be shut down in about a week — https://dmv.social. I’m trying to archive what I can onto the Wayback Machine, but I’m not sure what the most efficient way to go about it is.

At the moment, I’ve been going through each community and archiving each sort type (except the ones under a month, since the instance was locked a month ago) with “capture outlinks” enabled. Is there a more efficient way to do it? I know of the Internet Archive’s save-from-spreadsheet tool, which would probably work well, but I don’t know how I’d go about crawling all the links into a sitemap, a CSV, or something similar. I don’t have the know-how to set up a web crawler/spider.

Any suggestions?

  • person@lemm.ee
    10 upvotes · 9 months ago (edited)

    Well, since posts are numbered sequentially, you could archive all of them by generating the links. The one issue is that this would include every post that was federated with the server, which seems to be almost 2 million. A bit overkill for a relatively small instance.

    I think if you filter by local on the main page and click next until you get to the end, there aren’t that many pages. You could save those with outlinks.

    Also, I believe, the posts will live on on other instances regardless.
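For the "generating the links" idea above, here is a minimal sketch in Python. It assumes post pages live at `<instance>/post/<id>`, as in the standard Lemmy web UI, and that a one-column CSV of URLs is an acceptable input for the spreadsheet tool. The function names `post_urls` and `write_url_column` are made up for illustration.

```python
import csv


def post_urls(base_url, first_id, last_id):
    """Return one post URL per ID in the inclusive range first_id..last_id.

    Assumption: posts are reachable at <base_url>/post/<id>.
    """
    base = base_url.rstrip("/")
    return [f"{base}/post/{post_id}" for post_id in range(first_id, last_id + 1)]


def write_url_column(urls, out):
    """Write the URLs to an open text file, one URL per CSV row."""
    writer = csv.writer(out)
    for url in urls:
        writer.writerow([url])
```

To use it, open a file with `open("urls.csv", "w", newline="")` and pass it to `write_url_column(post_urls("https://dmv.social", 1, 500), f)`. The ID range here is only a placeholder. Since the IDs also cover federated posts, a full range would produce the roughly 2 million links mentioned above.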

    • BakuOP
      English · 3 upvotes · 9 months ago

      Oh, good idea, thank you! Yeah, I think because of the federation stuff, it should persist, although I think that will complicate searching and finding things. I’m pretty sure this is the largest instance to go down to date, so I’d rather be safe than lose things, even if it is only a small instance.

      This does make me a bit nervous about how archiving larger instances will look when one eventually dies, though. A spider that logs everything into a spreadsheet, which then gets split into different groups, would probably be the best option. Or maybe a local ArchiveBox setup could work too. All the Lemmy admins seem fairly reasonable though, so perhaps they might even upload everything directly to the Internet Archive themselves.
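The "split into different groups" step could look something like the sketch below. It assumes the links are already collected in a list and simply cuts them into fixed-size batches, one CSV file per batch. The names `batches` and `write_batches`, the `batch` file prefix, and the batch size are all placeholders.

```python
import csv
from itertools import islice


def batches(urls, size):
    """Yield consecutive lists of at most `size` URLs."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_batches(urls, size, prefix="batch"):
    """Write each batch to its own one-column CSV and return the file paths."""
    paths = []
    for n, batch in enumerate(batches(urls, size), start=1):
        path = f"{prefix}_{n:03d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([url] for url in batch)
        paths.append(path)
    return paths
```

Calling `write_batches(all_urls, 1000)` would produce `batch_001.csv`, `batch_002.csv`, and so on, which can then be submitted one at a time.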

  • grue@lemmy.world
    English · 6 upvotes · 9 months ago

    I don’t know enough about how ActivityPub works to be sure, but I suspect the right way to archive a Lemmy instance would be to create software that acts like another instance, federates with the one you want to archive, and saves the raw stream of ActivityPub packets.
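As a very rough illustration of the "saves the raw stream" part only, the sketch below is a bare HTTP inbox that appends every JSON body it receives to a file, one activity per line. It is not a working federation peer: a real one would also need an actor document, WebFinger discovery, signed Follow requests, and HTTP signature verification, none of which is shown. The file name `activities.jsonl`, the port, and the names `record_activity`, `InboxHandler`, and `serve` are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ARCHIVE_PATH = "activities.jsonl"  # hypothetical output file


def record_activity(raw_body, out):
    """Append one received activity to `out` as a single JSON line.

    Returns True if the body was valid JSON and was written, else False.
    """
    try:
        activity = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False
    out.write(json.dumps(activity, ensure_ascii=False) + "\n")
    return True


class InboxHandler(BaseHTTPRequestHandler):
    """Accepts POSTs on any path and stores the body without interpreting it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        with open(ARCHIVE_PATH, "a", encoding="utf-8") as out:
            ok = record_activity(body, out)
        # 202 Accepted for stored activities, 400 for bodies that are not JSON.
        self.send_response(202 if ok else 400)
        self.end_headers()


def serve(port=8080):
    """Run the inbox until interrupted."""
    HTTPServer(("", port), InboxHandler).serve_forever()
```

Storing the activities unmodified keeps the archive independent of any one database schema, so whatever later reads it can decide how to interpret each message.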

    • BakuOP
      English · 4 upvotes · 9 months ago

      Oh, yeah, you’re probably right. Unfortunately I absolutely do not have the knowledge required to do that, but I’ll keep it in mind. Thanks

  • Michael Ten @lemmy.world
    3 upvotes · 9 months ago

    Maybe a plug-in for the Lemmy server could be developed to automatically back up and/or restore instances from Arweave. Some protocol could be used to turn the instances into JSON, which could then be uploaded as documents and parsed, or something like that. The JSON could then potentially be restored. There might be many pages for a large instance, but they could perhaps be organized in a thoughtful and functional way.
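To make the JSON round trip a bit more concrete, here is a small sketch that groups posts by community, serializes each group to its own JSON document, and reads them back. The post fields (`id`, `community`, `title`) are invented for the example and are not Lemmy's actual schema, and the upload to Arweave is left out entirely.

```python
import json
from collections import defaultdict


def export_by_community(posts):
    """Return {community name: JSON text} with one document per community."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["community"]].append(post)
    return {
        name: json.dumps(items, ensure_ascii=False, indent=2)
        for name, items in grouped.items()
    }


def restore(documents):
    """Parse the exported documents back into a flat list of posts."""
    posts = []
    for text in documents.values():
        posts.extend(json.loads(text))
    return posts
```

One document per community is just one possible way to organize the pages. Splitting by month or by post ID range would work the same way.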