GlusterFS woes

If you’re looking for the gluster error ‘brick2.mount_dir not present’,
jump to the end

Time for another post 🙂

I’ve been using DigitalOcean for some time now, and I’m still tweaking my setup. One thing I really hope they sort out soon is a proper private LAN between your own droplets; for now we just have to use a VPN between them.

Being responsible for a new website can give you a lot of headaches, especially when you have to guess just how popular it will be. So about a year ago I set up a new droplet to host the new site; testing went well, and I increased the droplet size before we launched to handle a spike. Sadly I underestimated just how busy it would get. Based on the numbers I was given I think I was about right, but unfortunately those numbers were way off.

But each failure is just another learning curve 🙂 so that was fixed as quickly as possible, and once the site settled back to normal volume we scaled it back down (yes, it’s a whole cost exercise, especially when you’re paying for it). Then came the lead-up to Christmas. In an attempt not to repeat the problems at launch, I changed the whole configuration so that I could (if needed) take a server down while staying partly operational. This kind of worked, and it was needed when some bright spark promoted the site a day early and we hadn’t scaled back up!

Come the new year I decided it was time to seriously sort out the infrastructure for the site. It now has an online store and it’s important it keeps running; it’s not just a blog anymore. So I put in place the following setup (working around various obstacles).

DNS:

  • All the websites’ name servers point to Cloudflare, which handles the first web connection. It works really well on their free tier, and changes (adding new servers) take effect pretty quickly.

DigitalOcean Droplets:

  • 1x Server running as a load balancer.
  • 2x Servers running as webservers.
  • 1x Server running as database server.
  • 1x Server running as an email server (not quite running yet).
  • At the same time as building this setup I decided to ditch Apache and move to nginx, so the load balancer and webservers are running that (a sketch of the load balancer config is below).
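For reference, a minimal sketch of what the load balancer side looks like in nginx. This isn’t my actual config; the hostnames, certificate paths and filename are all made up:

nano -w /etc/nginx/conf.d/loadbalancer.conf
# Pool of the two webservers (hypothetical names).
upstream backend {
    server web1.example.com:443;
    server web2.example.com:443;
}

server {
    listen 443 ssl;
    server_name www.example.com;
    ssl_certificate     /etc/nginx/ssl/example.crt;
    ssl_certificate_key /etc/nginx/ssl/example.key;

    location / {
        # The webservers also speak HTTPS (see below), so proxy over https.
        proxy_pass https://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}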

Software:

  • 4x Nginx (loadbalancer and webservers, and installed on database server for stats).
  • 2x Syncthing (webservers) to keep the www folders in sync.
  • 1x MySQL Server.
  • 4x OpenVPN (connections between loadbalancer and webservers, webservers and database).
  • 1x Redis Server (for session data; I tried nginx’s load-balancing options, but sessions still screwed up if I had to take one of the webservers out for maintenance, so I installed Redis on the database server – the php.ini change is sketched below).
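The Redis session part is just a php.ini change on each webserver. A sketch, assuming the phpredis extension is installed and using a made-up VPN address for the database server:

nano -w /etc/php5/fpm/php.ini
; Store PHP sessions in Redis rather than local files,
; so either webserver can pick up any session.
session.save_handler = redis
session.save_path = "tcp://10.0.0.3:6379"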

As this progressed I dropped the VPN between the load balancer and the webservers and just use HTTPS/SSL instead. Syncthing already has its own SSL built in, so I could leave that over the semi-private LAN. I really would like to change MySQL to be encrypted too and drop the VPN from that as well, but information on doing this for WordPress seems to be non-existent at the moment.
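For what it’s worth, here’s the direction I’d expect it to take; this is untested, so treat it as a sketch with assumed paths rather than a working recipe. MySQL needs server certificates configured, and WordPress can pass the SSL flag through to mysqli via the MYSQL_CLIENT_FLAGS constant:

# /etc/mysql/my.cnf on the database server (certificate paths are assumptions)
[mysqld]
ssl-ca=/etc/mysql/ssl/ca.pem
ssl-cert=/etc/mysql/ssl/server-cert.pem
ssl-key=/etc/mysql/ssl/server-key.pem

# and in wp-config.php on each webserver
define( 'MYSQL_CLIENT_FLAGS', MYSQLI_CLIENT_SSL );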

Roll forward a few months: this has been working, but there are still areas to resolve. Take Syncthing: yes, it keeps the folders in sync, and it’s actually really good that I can easily store them on another system too. But it doesn’t listen to the OS for changes to files; instead it polls every x seconds. Although there’s not much changing, updating plugins became a problem: you click the update button, the zip downloads, but then nginx sends your next request to another server where the plugin zip isn’t there, so WordPress throws an error.

My whole reason for running Syncthing was that I wanted the files to be available on each server independently, so if Server A goes down it doesn’t matter; Server B has all the files locally anyway. NFS would still give me a single point of failure. While looking into resolving this I remembered GlusterFS. I’d played with it a long time ago, but dropped it as a solution (I can’t remember what I was doing or why it wasn’t working). Now it’s time to try it again. The downside: I’m back to needing VPNs, and OpenVPN isn’t the easiest when quickly adding a new server.

So I’ve done the following:

  • Added a new server just for the files (I don’t like gluster in a 2-replica setup; if there’s ever a problem, there should be a majority that agrees which copy of a file is correct).
  • Swapped out OpenVPN for tinc, which I have to say was one of the best decisions. It creates a mesh, something only doable with OpenVPN by running Quagga and manually forcing routes. Yes, there are downsides: I have no idea which server is actually connected to which servers, there’s no VPN status, and I can’t see how much traffic has gone between two particular servers (iptables helps, but it’s not 100%). There’s a minimal sketch of a tinc node after this list.
  • Added another new server for nagios and central logging.
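Since tinc’s setup is scattered across a few files, here’s that minimal sketch of one node. The netname, hostnames, subnets and addresses are all made up:

# /etc/tinc/mynet/tinc.conf
Name = webserverA
ConnectTo = fileserver

# /etc/tinc/mynet/hosts/webserverA (copied to every other node; each node
# connects to whoever it knows about, which is what builds the mesh)
Address = 203.0.113.10
Subnet = 10.10.0.2/32

# /etc/tinc/mynet/tinc-up
#!/bin/sh
ifconfig $INTERFACE 10.10.0.2 netmask 255.255.255.0

Generate the keypair with tincd -n mynet -K, which appends the public key to the host file above.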

There were a load of changes within a few weeks of each other, but I now have a setup I’m confident I can scale more quickly than ever before. Yes, it still has single points of failure (the load balancer, MySQL), but I know if the load balancer has a problem it’s pretty static, so it can be wiped and redeployed quickly; failing that, it only takes a few minutes to open the webservers to the world and let Cloudflare hit them directly. So MySQL is the real problem, and I’ll be addressing that one soon enough.

 

So now onto today’s problem 🙂

I’ve had gluster running a few weeks, and our testing website (for theme changes etc.) is set up on the webservers behind the load balancer. The last few days I’ve needed to do more extensive testing than just changing bits in a theme, so I decided to split the tester site onto its own droplet (still behind the load balancer and with a VPN to the databases). I thought I may as well make use of gluster here too (yes, it would be in a 2-replica setup between itself and the fileserver; I don’t like that idea). So I brought up a new server and configured it: new users, firewall rules, tinc, nginx, php, etc.

I added gluster and copied the /etc/hosts entries over from the other servers. All looked good. I ran gluster peer probe ServerX and it worked; gluster peer status showed it fine. But on trying to add a new volume:

gluster volume create xxx-yyy-zzz replica 2 transport tcp FILESERVER:/GLUSTER/xxx.yyy-zzz TESTSERVER:/GLUSTER/xxx.yyy-zzz force

I was getting the error:

volume create: xxx-yyy-zzz: failed: Commit failed on localhost. Please check the log file for more details.

Checking the logs on both servers showed the following (with maybe slight variations):

[2015-07-28 16:00:41.612907] E [glusterd-hooks.c:328:glusterd_hooks_run_hooks] 0-management: Failed to open dir /var/lib/glusterd/hooks/1/create/pre, due to No such file or directory
[2015-07-28 16:00:41.614499] E [glusterd-volume-ops.c:1811:glusterd_op_create_volume] 0-management: brick2.mount_dir not present
[2015-07-28 16:00:41.614587] E [glusterd-syncop.c:1288:gd_commit_op_phase] 0-management: Commit of operation 'Volume Create' failed on localhost

I tried a series of things to fix it:

  • I thought maybe the /GLUSTER/xxx-yyy-zzz needed to be created (I already made /GLUSTER) – Nope.

  • I detached the peer and reattached – No.
  • I rebooted the file server and test server – No.
  • I detached, rebooted, reattached – No.
  • I tried creating the volume with just the test server and no replica – No.
  • I tried creating the volume on just the fileserver with no replica – Yes.

So the problem points to the new system, but it’s a brand new system, and they’re peers and connected.

  • I tried uninstalling and reinstalling gluster – No.
  • I tried uninstalling, purging and reinstalling – No.
  • I tried uninstalling, purging, manually deleting /var/lib/glusterd (probably a mistake that I didn’t detach first :() and reinstalling – No.
  • I have no idea why this won’t WORK!!!!!

Let’s go further back: check the VPN, ping the servers.

  • Ping fileserver from testserver – Yes.
  • Ping testserver from fileserver – Yes… hang on, that’s the wrong IP!! I’d copied an entry from webserverB into /etc/hosts, updated the name, but missed the IP address. Idiot! Corrected that. Ping – OK.
  • Try gluster again – Yes.

So if you’re having problems and seeing brickX.mount_dir not present, make sure your DNS (or /etc/hosts entries) between servers is correct.
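A quick way to spot-check this, run on every peer (the hostnames are whatever you have in your /etc/hosts):

for h in fileserver testserver webserverA webserverB; do
    getent hosts "$h"
done

Each line should print the IP you expect next to the name; any mismatch is your culprit.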

I don’t really know how the peer probe worked, but I think I must have run it from a server whose hosts file was correct.

Quagga Automatic Restart/Recovery

As I’m sure I’ve posted before, I use OpenVPN and Quagga to build up my network. After recently updating all my Ubuntu servers, something strange happened: Quagga, which had been pretty rock solid, started screwing up. Previously I’d had the odd problem where a VPN would drop out and somehow block coming back up, so I scripted some VPN checks to confirm each link was up; if not, the script restarts the VPN link that’s down. This has been working fine on each server, and with Quagga running, routes around the entire network just keep working.

Until, of course, the recent updates on each system, which seem to have introduced a fault with Quagga. Although Quagga remains running, all the routes disappear and just won’t come back. An error does get logged in one of the logs (I’ll try to find what the error was and update here), but the Quagga watchdog doesn’t see a problem since everything is still running. So I’ve put together a little script below that checks the routing table, and if there are no entries relating to other networks (i.e. non-local), it assumes Quagga has gone faulty and restarts it.

nano -w /usr/sbin/check_quagga_routes.sh
#!/bin/bash
checktime=`date`
echo "$checktime : Checking Routing..." >> /var/log/connection.info
# Look for VPN routes: /24 entries that aren't on the local interface (eth0).
routing=`route -n | grep -i 255.255.255.0 | grep -vi eth0`
if [ -z "$routing" ]
then
    # No routes to VPNs detected. Restart Quagga.
    /etc/init.d/quagga restart
    # Log the restart.
    echo "$checktime : VPN Routes NOT Detected. Restarting Quagga!" >> /var/log/connection.info
fi
echo "$checktime : Routing Check Complete." >> /var/log/connection.info
exit 0
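Don’t forget to make the script executable, otherwise cron will quietly do nothing:

chmod +x /usr/sbin/check_quagga_routes.sh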

The /etc/crontab entry is:-

*/1 * * * *   root     /usr/sbin/check_quagga_routes.sh > /dev/null 2>&1

This should mean that I won’t have to manually restart Quagga again if the fault occurs. Hopefully whatever has happened in the update will be fixed, but as far as I can see there’s no harm in leaving this in place.

Normally I’d opt for Nagios to run a check and, on failure, run a handler script, but since all the Nagios checks and handlers run across the VPN, as soon as the routes go down Nagios is pretty useless. So this has to run on each of the servers.

Twonky Server Slow Scanning

OK, so first a little about my setup.
I have Twonky running on a Raspberry Pi along with OpenVPN. The whole point is so that I can play my files when away from home. This worked great a few months ago: simply plug it into the internet, the VPN connects, the shares are connected, and Twonky scans the folders. It’s not perfect, in that Twonky can scan empty folders and remove stuff from its database, so it kinda screws with the playlists and I can never remember which episode I got up to. But it still works. Then I upgraded to 6.0.39 and things went a little weird. It used to complete a scan within a few minutes, but now it was taking over an hour. For the most part it didn’t really bother me: plug it in and leave it to do its thing. But if the VPN ever went a bit weird it could cause a full rescan, and it also seemed to use more data; previously a few MB, now it could be a few hundred MB.

It was more of an annoyance than anything. I did have a search around, but couldn’t find anything in the version changes that jumped out at me as the cause.

That is, until today.
Today I found an article on series and movie thumbnails http://server.vijge.net/tw-video-scraper/ so I grabbed the files. I’m not sure if they work 100% yet; I’m getting a symlink error when I run it manually, but it does pull and save a thumbnail. As I’m watching something, I don’t want to restart Twonky to test it.
But I noticed a few other scripts in the cgi-bin folder, in particular ffmpeg-video-thumb.desc.
Now the article does say to disable this, but it got me thinking: is this running on the Pi? So I decided to have a look. There are a lot more files in the cgi-bin for 6.0.39 than in previous versions, and this one will try to make a thumbnail for each video file. So I disabled the code by putting a # at the start of each line. It may not be the cleanest approach, but I want to be able to put it back if it breaks something.
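If you want to do the same, something like this comments out every line and keeps a .bak copy to restore from (the path to Twonky’s cgi-bin depends on your install, so it’s an assumption here):

# prefix every line with #, saving the original as ffmpeg-video-thumb.desc.bak
sed -i.bak 's/^/#/' /usr/local/twonky/cgi-bin/ffmpeg-video-thumb.desc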

I restarted Twonky on the Pi and watched the status page. It managed to scan everything across the VPN within a few minutes again, and looking at the network stats it probably pulled around 10MB of data.
So I think that’s solved this little problem of slow scanning in Twonky for me.

OpenVPN Connection Server 2 Server Ubuntu

I’ve got a few VPN links in place using OpenVPN and thought it’s about time I documented how, in case something goes wrong in the future.

First was the install:-

apt-get install openvpn

Once installed:-

cd /etc/openvpn

Then generate a key using:-

openvpn --genkey --secret static.key

Create the config file using:-

nano -w server2.conf
remote server2.example.co.uk
float
port 8008
dev tun203
ifconfig 192.168.204.203 192.168.203.204
proto udp
persist-tun
persist-local-ip
persist-remote-ip
comp-lzo
ping 15
secret /etc/openvpn/static.key
;route 192.168.203.0 255.255.255.0
chroot /var/empty
user nobody
group nogroup
log vpn.log
verb 0

Things that need changing are the server address and the local and remote IP addresses. Also check that the chroot directory, user and group exist.
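A quick pre-flight check for those (a sketch):

# the chroot directory must exist; create it if not
ls -ld /var/empty || mkdir -p /var/empty
# the unprivileged user and group should already exist on Ubuntu
id nobody
getent group nogroup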
Next copy the static.key from server1 to server2. Then create a server1.conf in /etc/openvpn/

nano -w server1.conf
remote server1.example.co.uk
float
port 8008
dev tun203
ifconfig 192.168.203.204 192.168.204.203
proto udp
persist-tun
persist-local-ip
persist-remote-ip
comp-lzo
ping 15
secret /etc/openvpn/static.key
;route 192.168.204.0 255.255.255.0
chroot /var/empty
user nobody
group nogroup
log vpn.log
verb 0

Notice how the IP configuration on this server is reversed. These IP addresses are only used by the tunnel and don’t have to be within your normal network range. However, for routing to multiple network segments I found it easier to keep these addresses within the network range, making it easier to trace where the fault lies if something goes wrong.
The last thing to check is that IP forwarding is enabled:

cat /proc/sys/net/ipv4/ip_forward
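If that returns 0, forwarding is off. To turn it on now and keep it on across reboots:

# enable it immediately
echo 1 > /proc/sys/net/ipv4/ip_forward
# make it permanent: uncomment or add this line in /etc/sysctl.conf
#   net.ipv4.ip_forward=1
# then reload
sysctl -p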

Once the connections are established, it’s probably worth having some routing info pushed to each server. For simple routing you can uncomment the route option in the configs above. For more advanced routing it’s worth installing Quagga.
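Bringing the links up and checking them is then just (on Ubuntu the init script starts every .conf it finds in /etc/openvpn):

/etc/init.d/openvpn restart
# from server1, ping server2's end of the tunnel
ping -c 3 192.168.203.204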

I’ll need to update this with better details on what the configs do, but that’ll get it running in a simple setup.