Last update: Mar 17, 2014
Dennis van Dok
Update: I received the following explanation from Andras Korn in February 2012 about why I was seeing the performance issues that I raised in the original writeup. With his permission, I'm including it up front.
In order to maintain snapshot consistency, the copy-on-write update to the snapshot has to be committed to disk before the write to the origin volume is committed to disk. Otherwise the snapshot might not contain a copy of the original data in case power is lost. Most storage hardware doesn't support write barriers, which means the only way to ensure this kind of consistency is to sync disk buffers of the snapshot before issuing the write to the origin volume. And sync operations are slow on spinning media.
Essentially, having a snapshot turns async writes into sync writes (or, more precisely, every async write can potentially imply an additional sync write). If the snapshot is stored on the same disk as the origin volume, you're also causing a lot of seeking, which slows things down even further.
The good news is that this only happens if you overwrite parts of the origin volume that haven't been copied to the snapshot yet; if you keep overwriting the same bits, things'll be fast again after the first slow write.
Also, snapshots can be mounted read/write too (and they can be merged back into the origin volume, even while they're mounted IIRC). Thus, if you can umount the origin volume and mount the snapshot in its place, you can avoid much of the performance penalty (there'll still be more seeking, but no need to perform an additional sync write for each async write).
That said, it's still preferable from a snapshot performance perspective to store snapshots on SSDs, or to use a filesystem such as zfs (or btrfs?) that doesn't overwrite data in place anyway, making snapshotting cheap.
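As a sketch of the SSD suggestion above: lvcreate accepts a list of physical volumes to allocate from, so the snapshot's COW area can be placed on a different device than the origin. Here /dev/sdb1 stands in for a hypothetical SSD partition; adjust the names to your setup.

```shell
# Add the (hypothetical) SSD partition to the volume group...
pvcreate /dev/sdb1
vgextend vgtest /dev/sdb1
# ...and tell lvcreate to allocate the snapshot's COW space from it,
# by naming the PV at the end of the command.
lvcreate -L 2G -s -n snapshot /dev/vgtest/original /dev/sdb1
```

This keeps the extra sync writes off the spinning origin disk, and avoids the seek traffic between origin and COW area.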
Nov 1, 2007
After running some experiments, it turns out that using the snapshot feature of Logical Volume Management (LVM) under Linux causes abysmal disk write performance. This undermines the case for using snapshots on highly utilized volumes. I've witnessed performance degradation by a factor of 20 to 30.
There are many uses for LVM, scaling from home computers to large disk arrays. The performance problems that I ran into happened when using LVM snapshots on single disk systems. That is not to say that these problems wouldn't crop up in other situations, so I strongly urge you to do some kind of performance testing if you plan to use snapshots on any sort of system.
Below, I will present a few easy-to-follow steps to reproduce the tests, even if you don't have LVM configured. All you need is a sizeable blob of free disk space.
LVM has the ability to create a snapshot of a logical volume, which is like an instant copy of the original. Changes to the snapshot are not visible in the original and vice versa. This is done by using a technique called copy-on-write (COW). At the beginning the two volumes are identical, and reading data from the snapshot will refer to the corresponding block in the original volume. But when a write action takes place, first a copy is made of a chunk of the original data. Subsequent reads will either refer to the original data, if the block is still untouched, or to the copied chunk. As more and more changes happen, more chunks are allocated and LVM has to keep track of which chunks are used for which parts of the volume.
Note that it doesn't matter if the write action takes place on the original or on the snapshot: in both cases the copy action has to be done.
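To watch the COW mechanism in action, the snapshot's fill level can be polled while a test write runs. This is a sketch; the exact field name varies between LVM versions (snap_percent in older releases, data_percent in newer ones), and lvs may label the column Snap% or Data%.

```shell
# Show how full the snapshot's COW area is; the percentage grows
# as chunks are copied from the origin volume.
lvs vgtest
# Or poll once a second in a separate window during a test write:
watch -n1 'lvs -o lv_name,origin,snap_percent vgtest'
```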
There are two ways to proceed: the best way is to have a physical device, such as a spare disk or disk partition. If you have no spare disk, you can make do with a large empty file as a loop device. This will taint the performance tests with more overhead, but as we're really interested in the relative performance of snapshots versus no snapshots, this shouldn't matter much.
Let's say your disk device is /dev/hdx. First turn it into a physical volume (PV, in LVM terminology).
pvcreate /dev/hdx
Using this PV, generate a volume group (VG) named vgtest.
vgcreate vgtest /dev/hdx
Now proceed to 'Setting up the logical volume'.
Create a large empty file in your free space somewhere.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=5000
This creates a 5 GB file full of zeros. Now associate it with a loop device.
losetup /dev/loop0 /tmp/bigfile
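As an aside, the backing file from the dd step can also be created sparsely, which is instant instead of writing 5 GB of zeros. This is a sketch; note that writes through the loop device will then allocate blocks lazily, which adds its own overhead to the measurements.

```shell
# count=0 with seek=5000 writes nothing but truncates the file to
# 5000 MiB, so it has a 5 GB apparent size and occupies no blocks yet.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=0 seek=5000
ls -lsh /tmp/bigfile   # apparent size 5.0G, allocated size ~0
```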
Turn the loop device into a physical volume (PV, in LVM terminology).
pvcreate /dev/loop0
Using this PV, generate a volume group (VG) named vgtest.
vgcreate vgtest /dev/loop0
Create the 'original' logical volume on the volume group, but be careful not to use all the available space: we need to leave room for the snapshot's COW chunks. If the VG has 5 GB of space, 3 GB is enough.
lvcreate -L 3G -n original vgtest
For easier testing, let's put a filesystem on it…
mkfs.ext2 /dev/vgtest/original
…and mount it.
mkdir /scratch
mount /dev/vgtest/original /scratch
Generate a 1 GB file in the newly mounted filesystem.
sync; time sh -c "dd if=/dev/zero of=/scratch/moo bs=1M count=1000; sync"
Note that flushing the disk cache is necessary to measure the real performance.
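On reasonably recent dd builds, the flush can also be folded into dd's own timing with conv=fdatasync, so the transfer-rate line dd prints already includes the time to get the data on disk. A sketch (the path is the same illustrative one as above):

```shell
# conv=fdatasync makes dd call fdatasync() on the output file before
# exiting, so dd's reported throughput includes the flush to disk.
sync
time dd if=/dev/zero of=/scratch/moo bs=1M count=1000 conv=fdatasync
```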
Create a snapshot of the LV, reserving enough space to allow sufficient changes to be written.
lvcreate -L 2G -s -n snapshot /dev/vgtest/original
(you may need to load the dm-snapshot kernel module first, e.g. with modprobe dm-snapshot).
Now simply repeat the above test and observe the difference.
sync; time sh -c "dd if=/dev/zero of=/scratch/moo bs=1M count=1000; sync"
What is particularly illustrative is letting vmstat 1 run along in a separate window.
Results may depend on the hardware and operating system used. I have tested this on several types of systems, and a performance hit was seen in every case.
Dell PE1950, hardware RAID 1 with two ST3750640NS drives
CentOS 5.3, Linux 2.6.18-128.1.6.el5xen x86_64
without snapshot:
sync; time sh -c "dd if=/dev/zero of=/scratch/moo bs=1M count=1000; sync"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 10.3395 seconds, 101 MB/s

real    0m20.285s
user    0m0.000s
sys     0m1.084s
with snapshot:
sync; time sh -c "dd if=/dev/zero of=/scratch/moo bs=1M count=1000; sync"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 796.627 seconds, 1.3 MB/s

real    16m44.222s
user    0m0.000s
sys     0m1.132s
Dell PE1950, SEAGATE ST373455SS
Debian squeeze (testing), Linux 2.6.26-2-xen-amd64
without snapshot:
# sync; time sh -c "dd if=/dev/zero of=/mnt/moo bs=1M count=1000; sync"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.89106 s, 214 MB/s

real    0m11.801s
user    0m0.000s
sys     0m2.660s
with snapshot:
# sync; time sh -c "dd if=/dev/zero of=/mnt/moo bs=1M count=1000; sync"
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 54.0916 s, 19.4 MB/s

real    2m3.906s
user    0m0.004s
sys     0m3.340s
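For reference, the slowdown factor follows directly from the throughput figures dd reports; e.g. for the two machines above:

```shell
# Slowdown = throughput without snapshot / throughput with snapshot.
awk 'BEGIN { printf "CentOS box:  %.0fx slower\n", 101 / 1.3 }'
awk 'BEGIN { printf "Debian box:  %.0fx slower\n", 214 / 19.4 }'
```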
I don't want to dismiss snapshots as a useless feature – obviously they are used today by many people for various purposes. But it is a feature that needs to be used with care, because it has definite performance issues. I was confronted with these problems and surprised by them. More surprising was that I could not find web pages that shout in your face that snapshots spell trouble. All I could find was how convenient they are for doing live backups of your database, which I would like to see in real life on a database that grinds through dozens of transactions per second.
This is not a widely studied subject (by me), so I may be entirely wrong. I would be very happy if other people tried this out and sent me their findings.
Several people have sent me mail to comment on my findings. Most of them mentioned they experienced similar problems, so it is good to know I'm not alone! So far I've not had word from the LVM community about this issue.
John Newbigin actually repeated the tests on similar hardware. I include his findings here. Thanks, John!
After some reading and testing on similar hardware, I found that setting the snapshot chunksize to the largest possible value of 512k gave the best results in your dd test.
The default value of 64k seems too low, at least for my hardware. This was on an HP DL380 G5: 6 × 72 GB 10k RPM SAS drives in hardware RAID5 (Smart Array P400i, 512 MB cache, 25% read / 75% write).
These are the results I got by using a slightly modified version of your test script:
sync ; time sh -c "dd if=/dev/zero of=asdf bs=1M count=1000 ; sync"

Test on un-snapshotted disk: 5s

Test on the snapshotted disk:
512k   55s
256k   49s
128k   49s
64k    83s
32k    63s
16k   200s
8k    304s
4k    625s

On the snapshot:
512k   49s
256k   53s
128k   58s
64k    58s
32k    66s
16k   105s
8k    169s
4k    179s
When I did a test on the live server with the 512k chunk size, the server froze solid and was reset by the hardware watchdog.
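For reference, the chunk size John varied is set at snapshot creation time with lvcreate's -c/--chunksize option. A sketch, reusing the test volume names from above; 512k was the best value on his hardware, not a universal recommendation.

```shell
# Create the snapshot with a 512 KiB COW chunk size instead of the
# default 64 KiB (the value must be a power of two).
lvcreate -L 2G -s -c 512k -n snapshot /dev/vgtest/original
```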