Edit: it seems like my explanation turned out to be too confusing. In simple terms, my topology would look something like this:
I would have a reverse proxy hosted in front of multiple instances of git servers (let's take 5 for now). When a client performs an action, like pulling from or pushing to a repo, it would go through the reverse proxy to one of the 5 instances. The changes would then be synced from that instance to the rest, achieving a highly available architecture.
Basically, I want a highly available git server. Is this possible?
I have been reading GitHub’s blog on Spokes, their distributed system for Git. It’s a great idea except I can’t find where I can pull and self-host it from.
Any ideas on how I can run a distributed cluster of Git servers? I’d like to run it in 3+ VMs + a VPS in the cloud so if something dies I still have a git server running somewhere to pull from.
Thanks
What does this even mean? You want to replicate between git repositories? Can you do that with receive/update hooks on the servers?
Apologies for not explaining better. I want to run a load balancer in front of multiple instances of a git server. When my client performs an action like a pull or a push, it will go to one of the 5 instances, and the changes will then be synced to the rest.
I have edited the post to hopefully make my thoughts a bit more clear
I wonder if you could use HAProxy for that. It’s usually used with web servers. This is a pretty surprising request though, since git is pretty fast. Do you have an actual real world workload that needs such a setup? Otherwise why not just have a normal setup with one server being mirrored, and a failover IP as lots of VPS hosts can supply?
And, can you use round robin DNS instead of a load balancer?
I think I messed up my explanation again.
The load balancer in front of my git servers doesn't really matter; I can use whatever, really. What matters is: how do I make sure that when the client writes to a repo on one of the 5 servers, the changes are synced in real-time to the other 4 as well? Running rsync every 0.5 seconds doesn't seem to be a viable solution.
Why do you want 5 git servers instead of, say, 2? Are you after something more than high availability? Are you trying to run something like GitHub where some repos might have stupendous concurrent read traffic? What about update traffic?
What happens if the servers sometimes get out of sync for 0.5 sec or whatever, as long as each is in a consistent state at all times?
Anyway, my first idea isn't rsync but rather to use update hooks to replicate pushes to the other servers, so the updates will still look atomic to clients. Alternatively, use a replicated file system under Ceph or the like, so you can quickly migrate failed servers. That's a standard cloud hosting setup.
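For the hook idea, here's a minimal sketch using the post-receive hook (which fires once after all refs are updated); the `replica*` remote names are placeholders you'd set up beforehand with `git remote add`:

```sh
#!/bin/sh
# post-receive hook on whichever server accepted the push: mirror all
# refs out to the other servers so the update propagates immediately
for replica in replica2 replica3 replica4 replica5; do
    git push --mirror "$replica" || echo "replication to $replica failed" >&2
done
```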
What real world workload do you have, that appeared suddenly enough that your devs couldn't stay on top of it, and you find yourself seeking advice from us relatively clueless dweebs on Lemmy? It's not a problem most git users deal with. Git is pretty fast and most users are ok with a single server and a backup.
Thanks for the comment. There's no special use-case: it'll just be me and a couple of friends using it anyway. But I would like to make it highly available. It doesn't need to be 5; 2 or 3 would be fine too, but I don't think the number changes the concept.
Ideally I’d want all servers to be updated in real-time, but it’s not necessary. I simply want to run it like so because I want to experience what the big cloud providers run for their distributed git services.
Thanks for the idea about update hooks, I’ll read more about it.
Well the other choice was Reddit so I decided to post here (Reddit flags my IP and doesn’t let me create an account easily). I might ask on a couple of other forums too.
Thanks
I see, fair enough. Replication is never instantaneous, so do you have definite bounds on how much latency you’ll accept? Do you really want independent git servers online? Most HA systems have a primary and a failover, so users only see one server. If you want to use Ceph, in practice all servers would be in the same DC. Is that ok?
I think I’d look in one of the many git books out there to see what they say about replication schemes. This sounds like something that must have been done before.
Well it’s a tougher question to answer when it’s an active-active config rather than a master slave config because the former would need minimum latency possible as requests are bounced all over the place. For the latter, I’ll probably set up to pull every 5 minutes, so 5 minutes of latency (assuming someone doesn’t try to push right when the master node is going down).
I don't think the likes of GitHub work on a master-slave configuration. They're probably on the active-active side of things for performance. I'm surprised I couldn't find anything on this from Codeberg though; you'd think they'd have already solved this problem and might have published something. Maybe I missed it.
I didn’t find anything in the official git book either, which one do you recommend?
Before you can decide on how to do this, you’re going to have to make a few choices:
Authentication and Access
There's two main ways to expose a git repo, HTTPS or SSH, and they both have pros and cons here:

- HTTPS: A standard sort of protocol to proxy, but you'll need to make sure you set up authentication on the proxy properly so that only those who should have access can get it. The git client will need to store a username and password to talk to the server, or you'll have to enter them on every request. `gitweb` is a CGI that provides a basic, but useful, web interface.
- SSH: Simpler to set up, and authentication is a solved problem. Proxying it isn't hard: just forward the port to any of the backend servers, which avoids decrypting on the proxy. You will want to use the same host key on all the servers though (see the sketch below), or SSH will refuse to connect. Doesn't require any special setup.
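Sharing the host key is just a matter of copying server1's key files to the others, roughly like this (assuming a standard OpenSSH layout; hostnames are placeholders):

```sh
# copy server1's host keys to another backend so clients see a single
# SSH identity no matter which server answers
scp /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub server2:/etc/ssh/
ssh server2 systemctl restart sshd
```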
Replication
Git is a distributed version control system, so you could replicate it at that level; alternatively you could use a replicated file system, or simple file-based replication. Each has its own trade-offs.

- Git replication: Using `git pull` to replicate between repositories is probably going to be your most reliable option, as it's the job git was built for, and it doesn't rely on messing with git's underlying files directly. The one caveat is that if you push to different servers in quick succession you may cause a merge conflict, which would break your replication. The cleanest way to deal with that is to have the load balancer send all requests to server1 if it's up, and only switch to the next server if all the prior ones are down. That way writes will all be going to the same place. Then set up replication in a loop, with server2 pulling from server1, server3 pulling from server2, and so on up to server1 pulling from server5. With frequent pulls, changes that are committed to server1 will quickly replicate to all the other servers (see the cron sketch after this list). This is effectively a shared-nothing solution, as none of the servers share resources, which makes it easier to geographically separate them. The load balancer could be replaced by a CNAME record in DNS, with a daemon that updates it to point to the correct server.
- Replicated filesystem: Git stores its data in a fairly simple file structure, so placing that on a replicated filesystem such as GlusterFS or Ceph would mean multiple servers could use the same data. From experience, this sort of thing is great when it's working, but it can be fragile and break in unexpected ways. You don't want to be up at 2am trying to fix a file replication issue if you can avoid it.
- File replication: This is similar to the git replication option, in that you have to be very aware of the risk of conflicts. A similar strategy would probably work, but I'm not sure it brings you any advantages.
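As a sketch of that pull loop, each server could run a cron job that mirrors all refs from the server before it in the loop (the repo path and hostname here are placeholders):

```sh
# crontab entry on server2: every minute, mirror every ref from server1
# into the local bare copy of the repo
* * * * * git --git-dir=/srv/git/myrepo.git fetch ssh://git@server1/srv/git/myrepo.git '+refs/*:refs/*'
```

The forced refspec keeps each follower an exact copy of its upstream; as long as writes only ever land on server1, the force never discards anything.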
I think my preferred solution would be to have SSH access to the git servers and to set up pull-based replication on a fairly fast schedule (where fast is relative to how frequently you push changes). You mention having a VPS as one of the servers, so you might want to push changes to that rather than have it be able to connect to your internal network.
A useful property of git is that, if the server is missing changesets you can just push them again. So if a server goes down before your last push gets replicated, you can just push again once the system has switched to the new server. Once the first server comes back online it’ll naturally get any changesets it’s missing and effectively ‘heal’.
This is a fantastic comment. Thank you so much for taking the time.
I wasn't planning to run a GUI for my git servers unless really required, so I'll probably use SSH. Thanks, yes, that makes the reverse-proxy part a lot easier.
I think your idea of having a designated "master" (server 1) with rolling updates to the rest of the servers is brilliant. The replication procedure becomes a lot easier this way, and it also removes the need for the reverse proxy: I can just use Keepalived and set up weights to make one of them the master and the rest slaves for failover. It also won't do round-robin, so no special handling for sticky sessions. This is great news from the networking perspective of this project.
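Something like this minimal keepalived config is what I have in mind (values are illustrative, and the priority would drop on each subsequent server):

```
vrrp_instance git_vip {
    state MASTER            # BACKUP on the other servers
    interface eth0
    virtual_router_id 51
    priority 150            # lower on each subsequent server
    virtual_ipaddress {
        192.0.2.10/24       # the virtual IP clients connect to
    }
}
```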
Hmm, you said to push repos to the remote git server instead of having it pull? I was going to create a WireGuard tunnel and have it accessible from my network for some stuff, but I guess that makes sense.
Thanks again for the wonderful comment.
If I am understanding correctly, I would run Forgejo in a k8s/k3s pod.

This will be your starting point, but you would have to modify the setup to bring it into k8s or k3s.
Thank you. I did think of this but I’m afraid this might lead me into a chicken and egg situation, since I plan to store my Kubernetes manifests in my git repo. But if the Kubernetes instances go down for whatever reason, I won’t be able to access my git server anymore.
I edited the post which will hopefully clarify what I’m thinking about
I would have a standalone Forgejo server to act as your infrastructure server. Make it separate from your production k8s/k3s environment.
If something knocks out your infrastructure Forgejo instance, then your prod instance will continue to work. If something knocks out your prod, then your infrastructure instance is still there to pull from. One of the reasons I suggest k8s/k3s: if something happens, k8s/k3s will try to automatically bring the broken node back online.
You mean have two git servers, one “PROD” and one for infrastructure, and mirror repos in both? I suppose I could do that, but if I were to go that route I could simply create 5 remotes for every repo and push to each individually.
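Git even supports the "push to each individually" part natively, via multiple push URLs on a single remote (hostnames and paths here are placeholders):

```sh
# one logical remote whose single push fans out to every server
git remote add all ssh://git@server1/srv/git/myrepo.git
git remote set-url --add --push all ssh://git@server1/srv/git/myrepo.git
git remote set-url --add --push all ssh://git@server2/srv/git/myrepo.git
git remote set-url --add --push all ssh://git@server3/srv/git/myrepo.git
git push all main   # pushes to all three servers in turn
```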
For the k8s suggestion - what happens when my k8s cluster goes down, taking my git server along with it?
> chicken and egg situation, since I plan to store my Kubernetes manifests in my git repo
Not really.
K8s would use a “checked-out” visible representation, not the repo database itself.
Sorry, I don’t understand. What happens when my k8s cluster goes down taking my git server with it?
You do not let your k8s control instance look “live” at your git server during the start (or reformation) of the whole cluster. It needs the (repo and) files checked out somewhere locally, and this local “somewhere” must exist at start time.
Later, when your git is alive, you do a regular git pull for keeping it up to date.
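A sketch of that, with placeholder paths and hostname:

```sh
# one-time: keep a plain checkout of the manifests on the node's own
# disk, independent of the git server
git clone ssh://git@gitserver/srv/git/manifests.git /opt/manifests

# then refresh it on a schedule, e.g. a crontab entry:
#   */5 * * * * cd /opt/manifests && git pull --ff-only
```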
Oh, I get it. Auto-pull the repos to the master nodes' local storage in case something bad happens, and when it does, use the automatically pulled (and hopefully current) code to fix what broke.
Good idea
So, to be clear, GitHub is not git. Git is intrinsically distributed; GitHub is basically a repository management service.
I did some googling for about 10 seconds and afaik GitHub does not support any type of self-hosting. I know you can self-host GitLab, but I don't see a project for either GitHub or GitLab called Spokes.

Not knowing any more than this about what you actually want to accomplish, my advice would be to just figure out how to run your own git server (without the management fluff) and do a 3-2-1 backup scheme. You could of course also create a GitLab instance with an HA setup, plus back that up to the cloud.
Apologies for not explaining it properly. Essentially, I want to have multiple git servers (let's take 5 for now), have them automatically sync with each other, and run a load balancer in front. So when a client performs an action on a repository, it goes to one of the 5 instances and the changes are written to the rest.
I have edited the post, hopefully the explanation makes more sense now
GitHub didn't publish the source code for their project, previously known as DGit (Distributed Git) and now known as Spokes. The only mention of it is in a blog post on their website, but I don't have the link handy right now.
There is Radicle, for which you could run five nodes and have them seed each other.
Wouldn’t it be better to have highly available storage for the git repo?
Something like Ceph, Minio, Seaweedfs, GarageFS etc.
Cause git is filesystem-based.
Git is also a database.
Have you considered a distributed filesystem such as GlusterFS or DRBD? I believe those support synchronous replication, so writes will go to all the configured machines before the write is acknowledged. Performance will likely take a hit the greater the number of nodes in the cluster.
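A rough GlusterFS sketch of that, with placeholder hostnames and brick paths:

```sh
# three-way replicated volume: a write is acknowledged only after all
# replicas have it; then mount it where the bare repos live
gluster volume create gitdata replica 3 \
    server1:/bricks/gitdata server2:/bricks/gitdata server3:/bricks/gitdata
gluster volume start gitdata
mount -t glusterfs server1:/gitdata /srv/git
```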