Efficient Data Transfers – DD + PV – Performance Analysis

1.0 Introduction

Recently, I needed to observe the progress of a dd command while transferring data between devices. While searching online, I noticed that a common suggestion for obtaining progress information is to place the tool pv in the middle of the dd pipeline.

The command typically looks as follows:

dd if=<source> | pv | dd of=<destination> 

This works a treat and provides some very handy output in the following form:

 1GB 0:00:02 [ 607MB/s] [<==============>               ]
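As an aside, if pv is told the total transfer size it can also display a percentage and an ETA. A possible variant (the -s/--size option takes the expected byte count, which for a block device can be obtained with blockdev --getsize64) would be:

dd if=<source> bs=1M | pv -s <total-bytes> | dd of=<destination> bs=1M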

2.0 Performance Analysis

I was specifically interested in what effect, if any, introducing pv has on the dd command in terms of performance.
To begin the analysis, we can use strace to see what is happening at the system call level.

2.1 Analyzing Plain dd Usage

Let’s take the simple case of using dd on its own with the command below. Here we are reading one 1MB block from /dev/zero and writing it to /dev/null.

dd if=/dev/zero of=/dev/null bs=1M count=1

Let’s see what size reads and writes dd performs. The relevant strace output is as follows:

read(0, ""..., 1048576) = 1048576
write(1, ""..., 1048576) = 1048576

dd is very efficient: it performs the read and the write with the exact block size we specified.
If we try an even larger block size, say 100MB, dd does not chunk anything; it issues the file operations with the full specified block size.

strace dd if=/dev/zero of=/dev/null bs=100M count=1
read(0, ""..., 104857600) = 104857600
write(1, ""..., 104857600) = 104857600

2.2 Analyzing dd + pv Usage

What happens now when we introduce pv in the middle; what is the effect on the input/output devices?

Let’s run the following command to obtain strace logs of the input and output path of the dd operation.

strace -o if.txt dd if=/dev/zero bs=1M count=1 | pv | strace -o of.txt dd of=/dev/null

Looking now at if.txt and of.txt and filtering out the irrelevant lines, we find the following:

DD input side

read(0, ""..., 1048576) = 1048576
write(1, ""..., 1048576) = 1048576
write(2, "1+0 records in\n1+0 records out\n", 31) = 31
write(2, "1048576 bytes (1.0 MB) copied", 29) = 29
write(2, ", 0.00262319 s, 400 MB/s\n", 25) = 25

DD output side

read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, ""..., 1048576) = 131072
write(1, ""..., 131072) = 131072
read(0, "", 1048576)                    = 0
write(2, "0+8 records in\n0+8 records out\n", 31) = 31
write(2, "1048576 bytes (1.0 MB) copied", 29) = 29
write(2, ", 0.00476568 s, 220 MB/s\n", 25) = 25

We now see that the output dd is reading and writing the 1MB in pieces of 131072 bytes (128KB). Let’s try setting the block size on the output dd as well. The command becomes:

strace -o if.txt dd if=/dev/zero bs=1M count=1 | pv | strace -o of.txt dd of=/dev/null bs=1M 

I will spare you another strace dump, but the output in this case is exactly as above: the block size argument on the output dd makes no difference. By default dd does not buffer the input; it will not wait for a full 1MB to accumulate before posting the write to the output device.
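As an aside, GNU dd does offer an iflag=fullblock option which makes it keep reading until a full block has been accumulated. If we wanted the output dd to post full 1MB writes regardless of how pv chunks the data, something along these lines should work (a sketch, not retested here):

dd if=/dev/zero bs=1M count=1 | pv | dd of=/dev/null bs=1M iflag=fullblock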

It is now clear that pv is chunking the data at 128KB boundaries, so let’s check the pv documentation:

-B BYTES, --buffer-size BYTES
Use a transfer buffer size of BYTES bytes. A suffix of “k”, “m”, “g”, or “t” can be added to denote kilobytes (*1024), megabytes, and so on. The default buffer size is the block size of the input file’s filesystem multiplied by 32 (512kb max), or 400kb if the block size cannot be determined.

That explains the 128KB chunks we saw: presumably a reported block size of 4KB on the input, multiplied by 32. If we now change our command so that pv’s buffer size matches dd’s block size (switching to 100MB blocks this time):

strace -o if.txt dd if=/dev/zero bs=100M count=1 | pv --buffer-size=100m | strace -o of.txt dd of=/dev/null bs=100M

the strace output is:

read(0, ""..., 104857600) = 104857600
write(1, ""..., 104857600) = 104857600
write(2, "1+0 records in\n1+0 records out\n", 31) = 31
write(2, "104857600 bytes (105 MB) copied", 31) = 31
write(2, ", 0.287254 s, 365 MB/s\n", 23) = 23

read(0, ""..., 104857600) = 104857600
write(1, ""..., 104857600) = 104857600
read(0, "", 104857600)                  = 0
write(2, "1+0 records in\n1+0 records out\n", 31) = 31
write(2, "104857600 bytes (105 MB) copied", 31) = 31h
write(2, ", 0.371378 s, 282 MB/s\n", 23) = 23

and life is pretty good: we are no longer making unnecessary system calls to transfer the data out to the device.
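A quick way to double-check that the system call count has dropped is strace’s summary mode (-c), which prints a per-syscall count instead of every call, for example:

dd if=/dev/zero bs=100M count=1 | pv --buffer-size=100m | strace -c -o of_summary.txt dd of=/dev/null bs=100M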

3.0 Throughput Comparison

Now let’s measure the execution speed of a 10GB transfer using 100MB blocks:

time dd if=/dev/zero bs=100M count=100 | pv --buffer-size=100m | dd of=/dev/null bs=100M
10485760000 bytes (10 GB) copied, 9.65571 s, 1.1 GB/s
10485760000 bytes (10 GB) copied, 9.70629 s, 1.1 GB/s
real    0m9.713s
user    0m0.008s
sys     0m11.712s

time dd if=/dev/zero of=/dev/null bs=100M count=100
10485760000 bytes (10 GB) copied, 1.21998 s, 8.6 GB/s

real    0m1.224s
user    0m0.000s
sys     0m1.224s

Wow, almost a 10x difference, and the same ratio is seen for transfers with a larger count value.
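Put another way, the pv pipeline adds roughly (9.71 s - 1.22 s) / 100 blocks ≈ 0.085 s of extra time per 100MB block.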

So let’s run strace on pv and see what it’s doing during the transfer:

     0.000057 write(1, ""..., 104857600) = 104857600
     0.051468 alarm(0)                  = 1
     0.000054 select(1, [0], [], NULL, {0, 90000}) = 1 (in [0], left {0, 89998})
     0.000068 read(0, ""..., 104857600) = 104857600
     0.043925 select(2, [], [1], NULL, {0, 90000}) = 1 (out [1], left {0, 89997})
     0.000080 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7f93aa1fad40}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7f93aa1fad40}, 8) = 0
     0.000072 alarm(1)                  = 0
     0.000047 write(1, ""..., 104857600) = 104857600

It seems that an alarm signal is armed and re-armed around every block transfer, presumably to drive pv’s periodic display update, in addition to select() calls on stdin and stdout.
Let’s compare this with an strace of dd called with the same arguments (bs=100M count=100):

  0.011930 write(1, ""..., 104857600) = 104857600
  0.000043 read(0, ""..., 104857600) = 104857600
  0.011857 write(1, ""..., 104857600) = 104857600
  0.000042 read(0, ""..., 104857600) = 104857600
  0.011777 write(1, ""..., 104857600) = 104857600

Each 100MB block takes dd approximately 0.01s on its own versus roughly 0.1s when pv sits in the middle. Very little of that extra time is spent in the select() and alarm() bookkeeping calls themselves; the bulk goes into the additional read() and write() calls needed to shuttle every byte through two pipes.
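One way to separate the cost of pushing the data through the pipes from pv’s own bookkeeping would be to time the same transfer through a bare pipe, with no pv in the middle:

time dd if=/dev/zero bs=100M count=100 | dd of=/dev/null bs=100M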

It is worthwhile to note that we have not considered any latencies that would typically be introduced by the source and/or destination devices; we are focused purely on the overhead that pv introduces.

4.0 Real World Devices

To see what effect this has on real-world devices, I used dd and pv to read data from an HDD and an M.2 flash drive using the following commands (substituting the appropriate device node for the M.2 drive):

dd if=/dev/sda1 bs=1M count=2000 of=/dev/null
dd if=/dev/sda1 bs=1M count=2000 | pv -B 1m | dd of=/dev/null bs=1M

The results are as follows:
For the HDD, pv + dd versus plain dd had no discernible effect on throughput; the average read speed was 133MB/s in both cases.
For the M.2 flash drive, pv + dd achieved a read speed of 517MB/s versus 537MB/s for plain dd, a penalty of roughly 4%.
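A caveat when benchmarking real devices like this: the page cache can serve repeated reads and inflate the numbers. Dropping the cache between runs (as root), or adding iflag=direct to the reading dd, helps keep the comparison fair:

sync; echo 3 > /proc/sys/vm/drop_caches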

5.0 Conclusion

  • Be sure to set pv’s --buffer-size to whatever block size dd is using, to avoid pv generating more write system calls than necessary (see the example after this list).
  • To get the absolute best performance on RAM disks or the fastest flash drives, plain dd is recommended, especially where the transfers happen in very small block sizes; smaller block sizes mean more frequent write system calls and more of pv’s associated overhead.
  • For slow media and large block sizes, the dd + pv overhead gets lost in the noise and no noticeable penalty is observed; throughput is dominated by the device latency.
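Putting the above together, a reasonable starting point for a device-to-device copy would be something like the following (the device names and the 1MB block size are placeholders to adapt to your setup):

dd if=/dev/sdX bs=1M | pv -s $(blockdev --getsize64 /dev/sdX) -B 1m | dd of=/dev/sdY bs=1M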
