Upcoming 0.18 upgrade, 404 errors and infrastructure costs

sunaurus@lemm.ee · edit-2 1 year ago

two_wheel2@lemm.ee · 1 year ago

Looks great! Thanks for doing this! I don’t see anywhere here the approximate monthly costs… only what the money is being spent on. Do you have a figure for how much goes into running this instance?

sunaurus@lemm.ee · 1 year ago

The current projected bill for our whole infra in the month of June is $147. This covers the load balancer, 3 servers + database server, object storage for image uploads and our e-mail service. This may increase a little bit if we go higher than expected on bandwidth, object storage or outgoing e-mails.

two_wheel2@lemm.ee · edit-2 1 year ago

Alright, I’m tossing my tiny hat into the sponsor ring. Thanks so much for putting this community together! I’m excited to see it grow. Just out of curiosity, what does the incremental cost look like? Does it scale well with users? Or does it explode a little bit?

sunaurus@lemm.ee · edit-2 1 year ago

That’s greatly appreciated!

In terms of costs of scaling, I would say we’re positioned a bit better than many other Lemmy instances at the moment, thanks to the fact that we employ horizontal scaling as much as possible for the Lemmy software itself.

By the way, AFAIK, lemm.ee is the only non-experimental Lemmy instance that has chosen to go with horizontal scaling so far. If anybody knows of any other instance that is doing it, I would be super interested to know about it! All the admins I’ve spoken to so far myself have confirmed that they are only doing vertical scaling.

More technical details below for anybody who is interested:

There are two approaches you can generally take for scaling - horizontal, where you add more load balanced nodes of more or less the same power, or vertical, where you increase the power of an individual node (of course a mixture of both is also possible).

One of the benefits of horizontal scaling is that in most cases, it’s significantly more flexible than vertical scaling. For example, at my current cloud provider, the only upgrade path for vertically scaling a server would be 8 CPU -> 16 CPU -> 32 CPU -> 40 CPU. So if you’re on a 16 CPU server and you need just a little bit more headroom, then your only option is to upgrade to the 32 CPU server, which is straight up double the power (and cost!). Meanwhile, with horizontal scaling, you can just keep adding smaller servers (say 2 CPU each) one at a time, thus growing costs more gradually and appropriately for your actual needs.

So for lemm.ee, this horizontal scaling means that when our backend servers start getting overloaded, I can just add one or two more servers without exponentially increasing costs.
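
To make the step sizes concrete, here is a rough sketch in Python. The CPU tiers and the 2 CPU node size come from the example above; the function names are made up for illustration, and the sketch assumes capacity scales linearly with CPU count, which real workloads only approximate.

```python
import math

# Vertical tiers from the upgrade path described above (CPU counts).
VERTICAL_TIERS = [8, 16, 32, 40]
# Size of each small node in the horizontal example (CPUs per node).
NODE_CPUS = 2


def vertical_capacity(needed_cpus: int) -> int:
    """Return the smallest single-server tier that covers the need."""
    for tier in VERTICAL_TIERS:
        if tier >= needed_cpus:
            return tier
    raise ValueError("no single server is large enough")


def horizontal_capacity(needed_cpus: int) -> int:
    """Return the total CPUs bought when adding 2 CPU nodes one at a time."""
    return math.ceil(needed_cpus / NODE_CPUS) * NODE_CPUS


if __name__ == "__main__":
    for needed in (16, 18, 24, 33):
        v = vertical_capacity(needed)
        h = horizontal_capacity(needed)
        print(f"need {needed:>2} CPUs: vertical buys {v}, horizontal buys {h}")
```

Needing 18 CPUs is the "just a little bit more headroom" case: the vertical path jumps to 32, while the horizontal path stops at 18.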

OneDimensionPrinter@lemm.ee · 1 year ago

As someone who has “been there and done that” at a much larger scale than many devs may ever get a chance to (not a brag, it can suck royally) this really seems like the smart choice.

This is effectively a basic web server scenario, and horizontal scaling tends to work really well up to a point. And frankly it’ll be a long while before that becomes the bottleneck.

Smart choices you’re making. All the best and I’m happy to help out monetarily where I can!

electromage@lemm.ee · 1 year ago

Are your servers in one geographic region? Could you scale across regions for better performance?

sunaurus@lemm.ee · 1 year ago

I am already leveraging Cloudflare’s globally distributed cache, which helps improve performance even if you’re far away from the backend server. But this only helps partially, not with all types of requests.

lemm.ee is hosted in central Europe, and based on monitoring, it does seem that most users are having a pretty decent experience on lemm.ee regardless of their geographic location so far. The one key exception is short windows of database load spikes, which last for roughly 10 seconds every 5 minutes. During these spikes, everybody suffers equally, regardless of where they are in the world 😅.
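
For a sense of scale, here is a back-of-the-envelope calculation in Python. It takes the "roughly 10 seconds every 5 minutes" figure at face value, so the results are estimates rather than measurements.

```python
# Figures quoted above, treated as exact for the sake of the estimate.
SPIKE_SECONDS = 10
INTERVAL_SECONDS = 5 * 60

# Share of wall-clock time spent inside a spike window.
spike_fraction = SPIKE_SECONDS / INTERVAL_SECONDS

# Number of spike windows in a 24-hour day, and their combined length.
spikes_per_day = 24 * 60 * 60 // INTERVAL_SECONDS
degraded_minutes_per_day = spikes_per_day * SPIKE_SECONDS / 60

print(f"{spike_fraction:.1%} of the time is spent in a spike")
print(f"{spikes_per_day} spikes, about {degraded_minutes_per_day:.0f} minutes per day")
```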

But in general I agree with the sibling comment by @Notorious - rather than scaling one instance to be some massive globally distributed powerhouse, it makes sense to spread out the load amongst a lot of different instances.

electromage@lemm.ee · 1 year ago

Thank you for your work and communication! I agree it doesn’t make sense to invest in global infrastructure unless everyone does it, and the return wouldn’t be worth it. We’ll just have to get used to some performance issues as the fediverse takes off!

OneDimensionPrinter@lemm.ee · 1 year ago

Are the DB spikes ACTUALLY every 5 minutes or is that just kind of a guess? I ask because if it’s consistent, it’s gotta be some sidecar process somewhere in the stack that can be fiddled with.

That said, it really sounds like you know what you’re doing already so I’ll just go play with my new communities.

sunaurus@lemm.ee · 1 year ago

The spikes are caused by a specific recurring process which runs every 5 minutes. I have already significantly optimized it with a patch on lemm.ee, and I’m working on getting it merged upstream as well!

Notorious@lemm.ee · 1 year ago

My personal opinion is that this is outside the scope of a single instance. The whole idea behind Lemmy is to have multiple instances to accommodate different geos and different languages.

electromage@lemm.ee · 1 year ago

I think this could be problematic if instances aren’t providing a consistent user experience in different regions. If my Flashlight community is on an instance in California, and my Linux community is in Finland, I’m going to have a very asymmetric experience.

sunaurus@lemm.ee · 1 year ago

Home instances act as mirrors for posts and comments, so the experience should still be quite symmetric for you overall if you’re browsing both communities from the same instance.

xavier666@lemm.ee · 1 year ago

For storage, I can understand how horizontal scaling works (add more storage nodes to, say, GlusterFS). But how does it work for CPU? Since an added 2 CPU VM can physically be on another server, it would need lemmy to work in a highly distributed manner, i.e., CPU instructions would need to cross the network.

Is this distributed feature a part of lemmy or is there another abstraction layer?

sunaurus@lemm.ee · edit-2 1 year ago

This is where our load balancer comes in. All requests go through the load balancer, and this load balancer will try to evenly distribute the requests to all of our backend servers.
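
As an illustration only, here is a minimal round-robin sketch in Python. Which balancing algorithm lemm.ee actually uses is not stated here, so round robin is an assumption, and the class and backend names are hypothetical. The point it shows is that each request is handed whole to a single backend, so no CPU instructions have to cross the network.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in a fixed rotation, one per incoming request."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        # cycle() repeats the list forever: 1, 2, 3, 1, 2, 3, ...
        self._rotation = cycle(list(backends))

    def pick(self) -> str:
        """Return the backend that should handle the next request."""
        return next(self._rotation)


# Hypothetical backend names, not lemm.ee's real hosts.
balancer = RoundRobinBalancer(["backend-1", "backend-2", "backend-3"])
assignments = [balancer.pick() for _ in range(6)]
print(assignments)
```

Adding capacity in this model just means appending another backend to the list.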

Is this distributed feature a part of lemmy … ?

In fact it’s the opposite - Lemmy has so far had some assumptions built in to the code which make it quite hard to run on multiple servers. I have made some modifications in order to improve this (and contributed those modifications back to the main repo as well). It’s one of the things I want to keep improving as we grow.

xavier666@lemm.ee · edit-2 1 year ago

Here is my oversimplified understanding of the backend of lemm.ee: [diagram not preserved]

Am I correct? Or is there another loadbalancer in front of the DB?

Sorry for asking so many questions, but I’m new to system design and trying to learn about practical deployments.

sunaurus@lemm.ee · edit-2 1 year ago

That’s pretty close, but there are some nuances.

One of the servers is currently exclusively dedicated to handling images (processing, indexing, resizing, uploading to object storage)
One of the servers is only handling Lemmy HTTP requests
One of the servers is handling Lemmy HTTP requests + at the same time also handling Lemmy background tasks (different cleanups, updating the front page rankings, etc)
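
The split above could be modelled roughly like this in Python. Everything here is a sketch under assumptions: the server names, the "/pictrs/" image prefix and the routing rule are illustrative guesses, not lemm.ee's actual configuration.

```python
from itertools import cycle

# Hypothetical names; the roles mirror the three servers listed above.
SERVERS = {
    "server-a": {"images"},
    "server-b": {"http"},
    "server-c": {"http", "background"},
}

# Assumed URL prefix for image traffic; the real routing rule is not stated.
IMAGE_PREFIX = "/pictrs/"


def servers_with(role: str) -> list[str]:
    """Return the names of all servers that carry the given role."""
    return sorted(name for name, roles in SERVERS.items() if role in roles)


_image_rotation = cycle(servers_with("images"))
_http_rotation = cycle(servers_with("http"))


def route(path: str) -> str:
    """Send image paths to the image server, everything else to HTTP servers."""
    if path.startswith(IMAGE_PREFIX):
        return next(_image_rotation)
    return next(_http_rotation)
```

Background tasks do not appear in `route` because they are not triggered by incoming requests; in this model they simply run on whichever server carries the "background" role.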

Additionally, we are not using Docker at all for lemm.ee. Not that I have anything against Docker - I use it regularly in other projects - it just wouldn’t provide any advantages for lemm.ee at the moment.

xavier666@lemm.ee · 1 year ago

Thanks for the clarifications. I now understand the architecture of lemm.ee.

However, the way you have horizontally scaled things had to be done manually. You basically decoupled different lemmy functionalities and put them on different servers. It’s not as simple as setting an env variable for the number of servers.

Also, with this approach I feel like it’s possible some servers will be loaded more than others. E.g., server 1, which handles images, will be more CPU/RAM-heavy, whereas server 2, which handles HTTP requests, will be mostly network-heavy. So there will be cases where the scaling is not uniform.

Please don’t consider this as criticism (I personally just play around with my Raspberry Pi) but rather as observations.

bric@lemm.ee · 1 year ago

Same. I’m not putting in a ton, but monthly donations go a long way to help with monthly server costs. We just need 150 people to put in $1 a month and we’ll be covered indefinitely.

two_wheel2@lemm.ee · 1 year ago

Exactly. I’ve tossed in $5/mo and I literally just realized that with Reddit never in my WILDEST DREAMS would I have imagined kicking in some money for something like gold or trophies or even Apollo (RIP), but $5 a month to contribute to supporting a distributed community of people beyond myself feels like nothing to me. I think that speaks to the potential federation + good will can offer the world

OneDimensionPrinter@lemm.ee · 1 year ago

Shit. That’s a bunch of hardware/services. I hope the donations keep coming in. I’ll gladly drop a few bucks a month for quality updates and a relatively stable instance.

Thank you for running this so I don’t have to deal with it myself.

dan@upvote.au · 1 year ago

If they’re all powerful servers, $147 is pretty good for that many of them! Out of curiosity, are you using Hetzner? VPSes or physical servers?

sunaurus@lemm.ee · 1 year ago

Not Hetzner. We’re on VPSes for now!

dan@upvote.au · 1 year ago

Oh cool. That makes sense. Which provider?

AndromedusGalacticus@lemm.ee · 1 year ago

Yeah, I feel being transparent about costs will bring a lot more goodwill and help people want to participate in the goals of the server.

Hello, fellow lemmings!

Upcoming 0.18 upgrade

Why do we even want 0.18?

Random 404 errors

Server costs

Pinning updates on the front page