Efficient Data Transfers – DD + PV – Performance Analysis
1.0 Introduction
Recently, I needed to observe the progress of a dd command while transferring data between devices. While searching online, I noticed that one common suggestion for getting progress information is to place the tool pv in the middle of the dd command.
The command typically looks as follows:
dd if=<source> | pv | dd of=<destination>
This works a treat and provides some very handy output in the following form:
1GB 0:00:02 [ 607MB/s] [<==============> ]
2.0 Performance Analysis
I was specifically interested in what effect, if any, introducing pv has on the dd command in terms of performance.
To begin the analysis, we can use strace to see what is happening at the system call level.
2.1 Analyzing Plain dd Usage
Let’s take the simple case of just using dd with the below command. Here we are reading one block of 1MB from /dev/zero and writing it to /dev/null.
dd if=/dev/zero of=/dev/null bs=1M count=1
Let’s see what size reads and writes dd performs. The strace output is as follows:
read(0, ""..., 1048576) = 1048576
write(1, ""..., 1048576) = 1048576
dd is very efficient: it does the read and the write with the exact block size we specified.
If we try even larger block sizes, say 100MB, dd does not chunk anything; it attempts the file operations with the specified block size.
strace dd if=/dev/zero of=/dev/null bs=100M count=1
read(0, ""..., 104857600) = 104857600
write(1, ""..., 104857600) = 104857600
2.2 Analyzing dd + pv Usage
What happens now when we introduce pv in the middle; what is the effect on the input/output devices?
Let’s run the following command to obtain strace logs of the input and output path of the dd operation.
strace -o if.txt dd if=/dev/zero bs=1M count=1 | pv | strace -o of.txt dd of=/dev/null
Looking now at the if.txt and of.txt and filtering out all the irrelevant information, we find the following:
DD input side
read(0, ""..., 1048576) = 1048576
write(1, ""..., 1048576) = 1048576
write(2, "1+0 records in\n1+0 records out\n", 31) = 31
write(2, "1048576 bytes (1.0 MB) copied", 29) = 29
write(2, ", 0.00262319 s, 400 MB/s\n", 25) = 25
DD output side
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, "", 1048576) = 0
write(2, "0+8 records in\n0+8 records out\n", 31) = 31
write(2, "1048576 bytes (1.0 MB) copied", 29) = 29
write(2, ", 0.00476568 s, 220 MB/s\n", 25) = 25
We now see that the output dd is reading and writing the 1MB in eight pieces of 128KB (131072 bytes) each. Let’s try setting the block size on the dd of portion as well. The command becomes:
strace -o if.txt dd if=/dev/zero bs=1M count=1 | pv | strace -o of.txt dd of=/dev/null bs=1M
I will spare you another strace dump, but the output in this case is exactly as above: the block size argument on the output dd makes no difference. dd does not buffer its input here, so it does not wait for 1MB to accumulate before posting the write to the output device.
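The filtering of if.txt and of.txt down to the calls that actually move data can be scripted. The sketch below is one way to do it, assuming the logs have the plain strace format shown above; summarize_io is a hypothetical helper name, not part of strace or dd.

```shell
# Summarize the data-moving calls in an strace log read from stdin.
# Only read/write on fd 0 and fd 1 are counted, so dd's status
# messages on stderr (fd 2) are ignored.
summarize_io() {
    awk '
        /^(read|write)\([01],/ {
            call = $1; sub(/\(.*/, "", call)        # "read" or "write"
            fd = $1; sub(/^[a-z]+\(/, "", fd)       # strip "read(" / "write("
            sub(/,.*/, "", fd)                      # keep only the fd number
            count[call " fd=" fd " bytes=" $NF]++   # $NF is the return value
        }
        END { for (k in count) print count[k], k }
    ' | LC_ALL=C sort -k2
}

# Usage (of.txt is the log written by "strace -o of.txt ..."):
#   summarize_io < of.txt
```

Each output line is a count followed by the call, the descriptor and the number of bytes returned, which makes a run of identical 131072-byte transfers easy to spot.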
It is now clear that pv is chunking data at 128KB boundaries. Let’s check the pv documentation:
-B BYTES, --buffer-size BYTES
Use a transfer buffer size of BYTES bytes. A suffix of “k”, “m”, “g”, or “t” can be added to denote kilobytes (*1024), megabytes, and so on. The default buffer size is the block size of the input file’s filesystem multiplied by 32 (512kb max), or 400kb if the block size cannot be determined.
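The default described in that excerpt lines up with the 131072-byte chunks seen in the trace. Here is a small arithmetic sketch, assuming an input block size of 4096 bytes (an assumption; the block size was not measured in this post) and taking the "32x, 512kb max" rule at face value. Both function names are hypothetical.

```shell
# Default pv buffer size per the documentation excerpt above:
# 32 times the input block size, capped at 512 KiB.
pv_default_buffer() {
    size=$(( $1 * 32 ))
    max=$(( 512 * 1024 ))
    if [ "$size" -gt "$max" ]; then
        size=$max
    fi
    echo "$size"
}

# Number of chunks needed to move one dd block through a buffer
# of the given size (rounded up).
chunks_per_block() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

pv_default_buffer 4096            # assumed 4096-byte block size
chunks_per_block 1048576 131072   # 1MB dd block, 128KB pv buffer
```

With those assumptions the buffer comes out at 131072 bytes and a 1MB block needs 8 chunks, which matches the "0+8 records in" line reported by the output dd.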
If we now change our command to be as follows:
strace -o if.txt dd if=/dev/zero bs=1M count=1 | pv --buffer-size=1m | strace -o of.txt dd of=/dev/null bs=1M
the strace output (judging by the byte counts, captured from a run with a 100MB block size and a matching buffer size; input side first, then output side) is:
read(0, ""..., 104857600) = 104857600
write(1, ""..., 104857600) = 104857600
write(2, "1+0 records in\n1+0 records out\n", 31) = 31
write(2, "104857600 bytes (105 MB) copied", 31) = 31
write(2, ", 0.287254 s, 365 MB/s\n", 23) = 23

read(0, ""..., 104857600) = 104857600
write(1, ""..., 104857600) = 104857600
read(0, "", 104857600) = 0
write(2, "1+0 records in\n1+0 records out\n", 31) = 31
write(2, "104857600 bytes (105 MB) copied", 31) = 31
write(2, ", 0.371378 s, 282 MB/s\n", 23) = 23
and life is pretty good: we are no longer making unnecessary system calls to transfer data out to the device.
3.0 Throughput Comparison
Now let’s measure how long a 10GB transfer takes in block sizes of 100MB, with and without pv:
time dd if=/dev/zero bs=100M count=100 | pv --buffer-size=100m | dd of=/dev/null bs=100M
10485760000 bytes (10 GB) copied, 9.65571 s, 1.1 GB/s
10485760000 bytes (10 GB) copied, 9.70629 s, 1.1 GB/s
real 0m9.713s
user 0m0.008s
sys 0m11.712s

time dd if=/dev/zero of=/dev/null bs=100M count=100
10485760000 bytes (10 GB) copied, 1.21998 s, 8.6 GB/s
real 0m1.224s
user 0m0.000s
sys 0m1.224s
Wow, that is roughly an 8x difference (9.7s vs 1.2s), and a similar gap is seen for transfers with a larger count value.
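The ratio can be checked directly from the numbers dd and time printed. The helpers below are a minimal sketch using awk for the floating-point arithmetic; slowdown and gb_per_s are hypothetical names, and GB is taken as 10^9 bytes, as in dd's own report.

```shell
# Ratio between two elapsed times (slow first, fast second).
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Throughput in GB/s (10^9 bytes per second) from bytes and seconds.
gb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1e9 }'
}

slowdown 9.713 1.224            # wall-clock times from the two runs
gb_per_s 10485760000 9.65571    # dd + pv
gb_per_s 10485760000 1.21998    # plain dd
```

Run as written, this prints 7.9, 1.1 and 8.6, so the pipeline with pv is a little under 8x slower on this synthetic /dev/zero to /dev/null transfer.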
So let’s run strace on pv and see what it’s doing during the transfer:
0.000057 write(1, ""..., 104857600) = 104857600
0.051468 alarm(0) = 1
0.000054 select(1, [0], [], NULL, {0, 90000}) = 1 (in [0], left {0, 89998})
0.000068 read(0, ""..., 104857600) = 104857600
0.043925 select(2, [], [1], NULL, {0, 90000}) = 1 (out [1], left {0, 89997})
0.000080 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7f93aa1fad40}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7f93aa1fad40}, 8) = 0
0.000072 alarm(1) = 0
0.000047 write(1, ""..., 104857600) = 104857600
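It seems that around every block transfer pv cancels and re-arms an alarm (alarm(0), then alarm(1)) and resets the SIGALRM disposition, in addition to calling select() on stdin before each read and on stdout before each write. Note that the first argument to select() is the highest descriptor number plus one, so select(2, [], [1], ...) is waiting on stdout, not stderr.
Let’s compare it with an strace of dd called with the same arguments (bs=100M count=100):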
0.011930 write(1, ""..., 104857600) = 104857600
0.000043 read(0, ""..., 104857600) = 104857600
0.011857 write(1, ""..., 104857600) = 104857600
0.000042 read(0, ""..., 104857600) = 104857600
0.011777 write(1, ""..., 104857600) = 104857600
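Each block takes dd approximately 0.01s on its own vs 0.1s when the data passes through pv. Reading the leading numbers as relative timestamps (the time since the previous call started), the select(), alarm() and rt_sigaction() calls each account for well under a millisecond; most of the 0.1s is spent in pv's own read() and write() on the pipes, roughly 0.044s and 0.051s per 100MB block.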
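To see where the time goes in a trace like the ones above, the per-call deltas can be totalled by system call. This is a sketch under one assumption: the first column is a relative timestamp of the kind strace -r prints, meaning the time since the previous call started, so each delta is charged to the previous call. time_by_syscall is a hypothetical helper name.

```shell
# Total the time spent per system call in a relative-timestamp trace
# read from stdin. The delta on each line is attributed to the call
# on the previous line, because it measures how long that call (plus
# any user-space work after it) took before the next call began.
time_by_syscall() {
    awk '
        {
            name = $2; sub(/\(.*/, "", name)   # "write(1," -> "write"
            if (prev != "") total[prev] += $1
            prev = name
        }
        END { for (k in total) printf "%s %.6f\n", k, total[k] }
    ' | LC_ALL=C sort
}

# Usage (pv.txt is a hypothetical file holding a trace like the one above):
#   time_by_syscall < pv.txt
```

The last call in the log gets no time attributed to it, since there is no following timestamp to measure it against.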
It is worthwhile to note that we have not considered any latencies which would typically be introduced by the source and/or destination devices. We are simply focused on the overhead which pv is introducing.
4.0 Real World Devices
To see what effect this has on a real-world device, I used dd and pv to read data from an HDD and an M.2 flash drive using the following commands:
dd if=/dev/sda1 bs=1M count=2000 of=/dev/null
dd if=/dev/sda1 bs=1M count=2000 | pv -B 1m | dd of=/dev/null bs=1M
The results are as follows:
For the HDD, using pv + dd vs plain dd had no discernible effect on throughput; the average read speed was 133MB/s in both cases.
For the M.2 flash drive, pv + dd had a read speed of 517MB/s vs 537MB/s with plain dd.
5.0 Conclusion
- Be sure to set pv's --buffer-size to match the block size given to dd, so that pv does not generate more write system calls than necessary.
- To get the absolute best performance on RAM disks or the fastest flash drives, plain dd is recommended, especially where the transfers need to happen in very small block sizes. Smaller block sizes imply more frequent write system calls and more associated pv overhead.
- For slow media and large block sizes, the dd + pv overhead gets lost in the noise and no noticeable penalty is observed; the throughput is dominated by the device latency.
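The first recommendation is easy to get wrong by hand, so here is a minimal sketch of a helper that builds the pipeline with one block size used in all three places. pipeline_cmd is a hypothetical name; the function only prints the command so it can be inspected before running, and it lower-cases the suffix for pv because the documentation excerpt above lists lower-case suffixes.

```shell
# Print a dd | pv | dd pipeline in which pv's buffer size matches
# the dd block size on both sides.
# Arguments: source, destination, block size in dd notation (e.g. 1M).
pipeline_cmd() {
    src=$1
    dst=$2
    bs=$3
    pv_bs=$(printf '%s' "$bs" | tr 'A-Z' 'a-z')   # 1M -> 1m for pv
    printf 'dd if=%s bs=%s | pv --buffer-size=%s | dd of=%s bs=%s\n' \
        "$src" "$bs" "$pv_bs" "$dst" "$bs"
}

pipeline_cmd /dev/zero /dev/null 1M
```

Printing the command rather than executing it is a deliberate choice here: dd writes to whatever destination it is given, so it is worth reading the line once before pasting it into a shell.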