Transfer 15TB of Tiny Files

A few days ago, while setting up an Internet cafe system, I ran into the problem of transferring 600+ GB of gaming files from one machine to another.

Most of them were small files, so I had two options:

  1. Compress and then transfer, which costs a lot of CPU, and of course time.
  2. Transfer the files directly, which is very slow because there are so many tiny files.

In the end, I used the rsync method.

Relax time: here is the Windows startup animation.

1. So, How?

1.1. The Rsync method

My scenario was transferring files from Windows XP to a Linux server. For how to set up Cygwin on Windows XP/2003, please see Setup Cygwin in Windows XP/2003.

$ rsync -a -z -vvv -S -e 'ssh -o "ServerAliveInterval 7200"' \
    /path/of/source_dir root@destination_server:/path/of/destination_dir

This transfers /path/of/source_dir to /path/of/destination_dir over SSH using the rsync algorithm; the ServerAliveInterval option makes ssh send a keep-alive probe every 7200 seconds (2 hours) so an idle connection is not dropped during a long transfer.
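Before kicking off a long run, a dry run is a cheap sanity check. This is a sketch using the same placeholder paths as above; rsync's -n (--dry-run) and -i (--itemize-changes) flags list what would be transferred without actually sending anything:

$ rsync -a -n -i /path/of/source_dir root@destination_server:/path/of/destination_dir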

1.2. Compress & Package & Transfer

As mentioned above, the other option is to compress the files and then transfer them. Here is how to implement that using tar and pigz (parallel gzip) piped through nc, which beats the rsync method for the first transfer.

Source Machine

$ tar -cf - -C /path/of/small/files . | pigz | nc -l 9876
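One caveat: netcat variants differ. The command above matches the BSD/OpenBSD netcat; with the traditional netcat, the listening port needs an explicit -p flag:

$ tar -cf - -C /path/of/small/files . | pigz | nc -l -p 9876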

Destination Machine (Download)

$ nc source_machine_ip 9876 | pigz -d | tar -xf - -C /put/stuff/here
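For a transfer this large, a progress readout helps. Assuming pv is installed on the destination, it can be spliced into the pipeline to show throughput as the stream arrives:

$ nc source_machine_ip 9876 | pv | pigz -d | tar -xf - -C /put/stuff/here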

Destination Machine (Archive)

$ nc source_machine_ip 9876 > smallstuff.tar.gz
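The saved stream is just a gzip-compressed tarball, so it can be unpacked later with the same tools:

$ pigz -d < smallstuff.tar.gz | tar -xf - -C /put/stuff/here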

2. But which one is better?

The result: the nc + pigz solution always wins for the first-time transfer, regardless of whether the files are large or small.

My laptop has 4 CPU cores, so the rsync solution was about 3 to 4 times slower than the nc + pigz solution for the first-time transfer; pigz compresses on all four cores, while rsync's compression is single-threaded.

However, for incremental updates, rsync of course wins, since it only computes the changes, detecting modified files by comparing modification timestamp and size.
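As an aside, if modification times can't be trusted (for example, after a restore that reset timestamps), rsync's standard -c (--checksum) flag compares file checksums instead; it forces a full read of every file on both sides, so expect it to be much slower:

$ rsync -a -c /path/of/source_dir root@destination_server:/path/of/destination_dir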

Here is the script I used to generate 640,000 files, each 8 KB in size.

#!/bin/bash
# Generate 10 x 10 x 6,400 = 640,000 files of 8 KB each.
for i in {1..10}; do
  for j in {1..10}; do
    mkdir -p "test_$i/test_$j"
    cd "test_$i/test_$j" || exit 1
    for k in {1..6400}; do
      dd if=/dev/zero of="test_$k" bs=1024 count=8 > /dev/null 2>&1 &
    done
    wait                # let the backgrounded dd processes finish
    cd - > /dev/null
  done
done
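After the script finishes, a quick sanity check from the same directory confirms the result (640,000 files at 8 KB each is roughly 5 GB):

$ find . -type f -name 'test_*' | wc -l
$ du -sh .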

3. Conclusion

  1. For a first-time transfer, use pigz + nc.
  2. For incremental updates, use rsync.
