One of the best features of Ganeti is its ability to grow linearly by adding new servers easily. We recently purchased a new server to expand our ever growing production cluster and needed to rebalance cluster. Adding and expanding the cluster consisted of the following steps:

  1. Installing the base OS on the new node
  2. Adding the node to your configuration management of choice and/or installing ganeti
  3. Add the node to the cluster with gnt-node add
  4. Check Ganeti using the verification action
  5. Use htools to rebalance the cluster

For simplicity sake I'll cover the last three steps.

Adding the node

Assuming you're using a secondary network, this is how you would add your node:

gnt-node add -s <secondary ip> newnode

Now lets check and make sure ganeti is happy:

gnt-cluster verify

If all is well, continue on otherwise try and resolve any issue that ganeti is complaining about.

Using htools

Make sure you install ganeti-htools on all your nodes before continuing. It requires haskell so just be aware of that requirement. Lets see what htools wants to do first:

$ hbal -m
Loaded 5 nodes, 73 instances
Group size 5 nodes, 73 instances
Selected node group: default
Initial check done: 0 bad nodes, 0 bad instances.
Initial score: 41.00076094
Trying to minimize the CV...
1. g1.osuosl.bak:g2.osuosl.bak g5.osuosl.bak:g1.osuosl.bak 38.85990831 a=r:g5.osuosl.bak f
2. g3.osuosl.bak:g1.osuosl.bak g5.osuosl.bak:g3.osuosl.bak 36.69303985 a=r:g5.osuosl.bak f
3. g2.osuosl.bak:g4.osuosl.bak g5.osuosl.bak:g2.osuosl.bak 34.61266967 a=r:g5.osuosl.bak f


28. g3.osuosl.bak:g1.osuosl.bak g3.osuosl.bak:g5.osuosl.bak 4.93089388 a=r:g5.osuosl.bak
29. g2.osuosl.bak:g1.osuosl.bak g1.osuosl.bak:g5.osuosl.bak 4.57788814 a=f r:g5.osuosl.bak
30. g1.osuosl.bak:g3.osuosl.bak g1.osuosl.bak:g5.osuosl.bak 4.57312216 a=r:g5.osuosl.bak
Cluster score improved from 41.00076094 to 4.57312216
Solution length=30

I've shortened the actual output for the sake of this blog post. Htools automatically calculates which virtual machines to move and how using the least amount of operations. In most these moves, the VMs may simply be migrated, migrated & secondary storage replaced, or migrated, secondary storage replaced, migrated. In our environment we needed to move 30 VMs around out of the total 70 VMs that are hosted on the cluster.

Now lets see what commands we actually would need to run:

$ hbal -C -m

Commands to run to reach the above solution:

echo jobset 1, 1 jobs
echo job 1/1
gnt-instance replace-disks -n g5.osuosl.bak
gnt-instance migrate -f
echo jobset 2, 1 jobs
echo job 2/1
gnt-instance replace-disks -n g5.osuosl.bak
gnt-instance migrate -f
echo jobset 3, 1 jobs
echo job 3/1
gnt-instance replace-disks -n g5.osuosl.bak
gnt-instance migrate -f


echo jobset 28, 1 jobs
echo job 28/1
gnt-instance replace-disks -n g5.osuosl.bak
echo jobset 29, 1 jobs
echo job 29/1
gnt-instance migrate -f
gnt-instance replace-disks -n g5.osuosl.bak
echo jobset 30, 1 jobs
echo job 30/1
gnt-instance replace-disks -n g5.osuosl.bak

Here you can see the commands it wants you to execute. Now you can either put these all in a script and run them, split them up, or just run them one by one. In our case I ran them one by one just to be sure we didn't run into any issues. I had a couple of VMs not migration properly but those were exactly fixed. I split this up into a three day migration running ten jobs a day.

The length of time that it takes to move each VM depends on the following factors:

  1. How fast your secondary network is
  2. How busy the nodes are
  3. How fast your disks are

Most of our VMs ranged in size from 10G to 40G in size and on average took around 10-15 minutes to complete each move. Addtionally, make sure you read the man page for hbal to see all the various features and options you can tweak. For example, you could tell hbal to just run all the commands for you which might be handy for automated rebalancing.


Overall the rebalancing of our cluster went without a hitch outside of a few minor issues. Ganeti made it really easy to expand our cluster with minimal to zero downtime for our hosted projects.


comments powered by Disqus