GlusterFS woes

If you’re looking for the GlusterFS error ‘brick2.mount_dir not present’, jump to the end.

Time for another post 🙂

I’ve been using DigitalOcean for some time now, and I’m still tweaking my setup. One thing I really hope they sort out soon is a proper private LAN between your own droplets; for now we just have to use a VPN between them.

Being responsible for a new website can give a lot of headaches, especially when you have to try to guess just how popular it will be. So about a year ago I set up a new droplet to host the new site. Testing was going well, and I increased the droplet size before we launched to handle a spike. Sadly I underestimated just how busy it would get; based on the numbers I was given I think I was about right, but unfortunately those numbers were way off.

But each failure is just another learning curve 🙂 so that was fixed as quickly as possible, then the site got back to normal volume and we scaled it back down (yes, it’s a whole cost exercise, especially when you’re paying for it). Then we had the lead-up to Christmas. In an attempt not to repeat the problems at launch, I changed the whole configuration so that I could (if needed) take a server down while staying partly operational. This kind of worked, and it was needed when some bright spark promoted the site a day early and we hadn’t scaled back up!

Come the new year I decided it was time to seriously sort out the infrastructure for the site. It now has an online store and it’s important it keeps running; it’s not just a blog anymore. So I put in place the following setup (working around various obstacles).

DNS:

  • All the website’s name servers point to Cloudflare, and they handle the first web connection. It works really well on their free tier, and changes (adding new servers) are pretty quick to take effect.

DigitalOcean Droplets:

  • 1x Server running as a load balancer.
  • 2x Servers running as webservers.
  • 1x Server running as database server.
  • 1x Server running as an email server (not quite running yet).
  • At the same time as building this setup I decided to ditch Apache and move to nginx, so the load balancer and webservers are running that.

Software:

  • 4x Nginx (load balancer and webservers, plus installed on the database server for stats).
  • 2x Syncthing (webservers) to keep the www folders in sync.
  • 1x MySQL Server.
  • 4x OpenVPN (connections between load balancer and webservers, and between webservers and database).
  • 1x Redis Server (for session data; I tried nginx’s load-balancing options, but they still screwed up if I had to take one of the webservers out for maintenance, so I installed Redis on the database server). There’s a rough sketch of the load-balancer config just after this list.
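
For anyone wanting the shape of it, the load-balancer end is roughly the nginx config below. The hostnames, IPs and certificate lines are placeholders rather than my actual config, so treat it as a sketch:

upstream webservers {
    # the two webservers sitting behind the load balancer
    server 10.0.0.2:443;
    server 10.0.0.3:443;
}

server {
    listen 443 ssl;
    server_name www.example.com;
    # ssl_certificate / ssl_certificate_key lines omitted here

    location / {
        proxy_pass https://webservers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

The session side is the standard phpredis approach on each webserver: session.save_handler = redis and session.save_path pointing at the Redis instance on the database droplet, so it doesn’t matter which webserver nginx picks.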

As this progressed I dropped the VPN between the load balancer and webservers and just use HTTPS/SSL instead. Syncthing already has its own SSL built in, so I could leave that over the semi-private LAN. But I really would like to change MySQL to be encrypted and drop the VPN from that too; info on doing this for WordPress seems to be non-existent at the moment.
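
For what it’s worth, I’d expect the encrypted MySQL setup to look roughly like the sketch below. I haven’t tested this on our stack, so the paths, database and user names are placeholders: the server certificates go into my.cnf, the WordPress user is forced to use SSL, and wp-config.php sets the client flag.

# /etc/mysql/my.cnf on the database server (placeholder paths)
[mysqld]
ssl-ca   = /etc/mysql/ssl/ca.pem
ssl-cert = /etc/mysql/ssl/server-cert.pem
ssl-key  = /etc/mysql/ssl/server-key.pem

-- in MySQL: force the WordPress user to connect over SSL
GRANT ALL ON wordpress.* TO 'wpuser'@'%' REQUIRE SSL;

// in wp-config.php: tell mysqli to use SSL for the connection
define( 'MYSQL_CLIENT_FLAGS', MYSQLI_CLIENT_SSL );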

Roll forward a few months: this has been working but still has areas to resolve. Such as Syncthing: yes, it keeps the folders in sync, and it’s actually really good that I can also store them on another system easily, but it doesn’t listen to the OS for changes to files. Instead it polls every x seconds. Although there’s not much changing, updating plugins became a problem: if you click the update button it downloads, but then nginx sends your next request to another server where the plugin.zip isn’t there yet, so WordPress throws an error.
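
For context, that poll is the per-folder rescanIntervalS setting in Syncthing’s config.xml; the folder id, path and interval below are just examples, not my actual values:

<folder id="www" path="/var/www/" rescanIntervalS="60">
    <!-- device entries etc. -->
</folder>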

My whole reason for running Syncthing was that I wanted the files to be available on each server independently, so if Server A goes down it doesn’t matter; Server B has all the files locally anyway. NFS would still give me a single point of failure. On looking into resolving this, though, I remembered GlusterFS. I’d played with it a long time ago, but dropped it as a solution (can’t remember what I was doing or why it wasn’t working). Now it’s time to try it again. The downside is I’m back to needing VPNs, and OpenVPN isn’t the easiest when you want to quickly add a new server.

So I’ve done the following:

  • Added a new server just for the files (I don’t like Gluster being in a 2-replica setup in case there’s a problem; there should be a majority who think they are holding the correct file). There’s a rough sketch of the volume commands after this list.
  • Swapped out OpenVPN for tinc, which I have to say was one of the best decisions. It creates a mesh (only doable with OpenVPN by running quagga to manually force routes). Yes, there are downsides: I have no idea which server is actually connected to which, there’s no VPN status, and I can’t see how much traffic has gone between two particular servers (iptables helps but it’s not 100%).
  • Added another new server for Nagios and central logging.
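
For anyone following along, the volume setup is along these lines; the server names, volume name and brick paths are placeholders rather than my real ones:

# from any peer, once "gluster peer probe" has been run for the other servers
gluster volume create wwwdata replica 3 transport tcp fileserver:/GLUSTER/wwwdata webserverA:/GLUSTER/wwwdata webserverB:/GLUSTER/wwwdata force
gluster volume start wwwdata

# on each webserver, mount the volume over the old www folder
mount -t glusterfs localhost:/wwwdata /var/www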

There were a load of changes within a few weeks of each other, but I now have a setup I’m confident I can scale more quickly than ever before. Yes, it has single points of failure (load balancer, MySQL), but I know the load balancer is pretty static, so if it has a problem it can be wiped and redeployed quickly, and it would only take a few minutes to open the webservers to the world and let Cloudflare hit them directly. So MySQL is the real problem, and I’ll be addressing that one soon enough.

 

So now onto today’s problem 🙂

I’ve had Gluster running a few weeks, and I have our testing website (for theme changes etc.) set up on our webservers behind the load balancer. The last few days I’ve needed to do more extensive testing than just changing bits in a theme, so I’ve decided to split the tester site onto its own droplet (still behind the load balancer and with a VPN to the databases). I thought I may as well make use of Gluster here too (yes, it would be in a 2-replica setup, just itself and the fileserver – I don’t like that idea). So I brought up a new server and configured it: new users, firewall rules, tinc, nginx, php, etc.

I added Gluster and copied the /etc/hosts entries over from the other servers. All looked good. I ran gluster peer probe ServerX and it worked, then gluster peer status and I could see it fine. But on trying to add a new volume:

gluster volume create xxx-yyy-zzz replica 2 transport tcp FILESERVER:/GLUSTER/xxx.yyy-zzz TESTSERVER:/GLUSTER/xxx.yyy-zzz force

I was getting the error:

volume create: xxx-yyy-zzz: failed: Commit failed on localhost. Please check the log file for more details.

Checking the logs on both servers would show (with maybe a slight variation):

[2015-07-28 16:00:41.612907] E [glusterd-hooks.c:328:glusterd_hooks_run_hooks] 0-management: Failed to open dir /var/lib/glusterd/hooks/1/create/pre, due to No such file or directory
[2015-07-28 16:00:41.614499] E [glusterd-volume-ops.c:1811:glusterd_op_create_volume] 0-management: brick2.mount_dir not present
[2015-07-28 16:00:41.614587] E [glusterd-syncop.c:1288:gd_commit_op_phase] 0-management: Commit of operation 'Volume Create' failed on localhost

I tried a series of things to fix it:

  • I thought maybe the /GLUSTER/xxx-yyy-zzz directory needed to be created (I had already made /GLUSTER) – Nope.
  • I detached the peer and reattached – No.
  • I rebooted the file server and test server – No.
  • I detached, rebooted, reattached – No.
  • I tried creating the volume with just the test server and no replica – No.
  • I tried creating the volume on just the fileserver with no replica – Yes.

So the problem points to the new system, but it’s a brand new system, and they’re peers and connected.

  • I tried uninstalling and reinstalling gluster – No.
  • I tried uninstalling, purging and reinstalling – No.
  • I tried uninstalling, purging, manually deleting /var/lib/glusterd (probably a mistake that I didn’t detach first :() and reinstalling – No.
  • I have no idea why this won’t WORK!!!!!

Let’s go further back, check the VPN, ping the servers.

  • Ping fileserver from testserver – Yes.
  • Ping testserver from fileserver – Yes… hang on, that’s the wrong IP!! Yes, I’d copied an entry from webserverB into /etc/hosts and updated the name but missed the IP address. Idiot! Corrected that. Ping – OK.
  • Try gluster again – Yes.

So if you’re having problems and seeing brickX.mount_dir not present, make sure your name resolution (DNS or /etc/hosts entries) between servers is correct.

I don’t really know how the peer probe worked, but I think I must have run it from a server whose /etc/hosts was correct.
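
If you want to sanity-check that quickly, something like the following on every server would have caught my mistake (hostnames are placeholders):

# does each name resolve to the same address on every server?
getent hosts fileserver testserver
ping -c 1 testserver

# do the peers agree on who is in the pool?
gluster pool list
gluster peer status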

One thought on “GlusterFS woes”

  1. The challenge then is to walk away from the Unicorn.  You can’t have a single storage system to solve all of your woes.  However, you probably don’t have to live with tens or hundreds of storage systems either.
