This is terribly hard to write. If you flushed your cache right now, you would see all the newest posts without images. These now return 404s, even though the images exist. In 2 hours everyone will see this. Unfortunately there is no going back, and no way to recover the key store for all the “new” images.

What happened?

After the picture migration from our local file store to our object storage, I made a configuration change so that our Docker containers no longer referenced the internal file store. As a result, the picture service started from scratch with a completely empty internal database 😔

What makes this worse is that the database lived inside the ephemeral container. When the containers are recreated, that data is lost. This happened multiple times over the 2-day period.

What made this harder to debug was that our CDN caching hid the issue, as we had a long cache time to reduce the load on our server.

The good news is that after you read this post, every picture will be correctly uploaded and added to the internal picture service database! 😊 The “better” news is that all the original images from the 28th of June and before will start working again instantly.

Timeframe

The issue existed from the 29th of June to the 1st of July.

Resolution

Right now. 1st of July 8:48 am UTC.
From now on, everything will work as expected.

Going forward

Our picture service migration has been fraught with issues, and I cannot express how annoyed and disheartened I am by the accidents that have occurred. I have yet to provide a service that I am happy with.

I am very sorry that this happened and I will strive to do better! I hope you can all accept this apology.

Tiff

  • jeremy@reddthat.com
    1 year ago

    Containers are complex! You’re doing great.

    Honestly, stateful services in containers are… often a lot of work.

    • tsz@reddthat.com
      1 year ago

      “often a lot of work”

      I have yet to find a legitimate use case where the infrastructure, time, etc. required to get this right results in a better product.