okbanlon

rsync

The rsync utility program is one of the most powerful (and dangerous) tools around when it comes to keeping directories and files synchronized. Many people fiddle around with it, get burned, and never touch it again - however, it's a really useful tool once you learn a few of its quirks.

In this document, we'll take a look at how rsync works and go through some examples ranging from the simple to the ridiculously complex. There are a myriad of ways to do all this stuff - this document is by no means an exhaustive study of rsync, just a handful of examples and some explanation. If you want the complete story on rsync, go get the O'Reilly book on it.

General Advice

Keep these simple guidelines in mind when playing with rsync:

Script your rsync commands. No matter how good you think you are, don't get in the habit of trying to type in rsync commands by hand. Don't do a lot of fancy mouse work to cut and paste rsync commands, either. Sooner or later, you will make a mistake - and believe me, rsync can make a hell of a mess in a hurry if you fumble-finger the command.
Use the --dry-run option first, and don't proceed until you understand and approve of the action rsync will undertake. Just take the time - it's better to say, "Whoops! That's not right!" than "Uh, Boss? I think I just blew away half the system with a fouled-up rsync command." The --dry-run option is your friend; use it wisely.
Try to avoid specifying the target directory as a variable if possible, especially when using the --delete option. The rsync program is read-only on the data in your source directory, but it's read/write and delete on the target directory if you're using --delete (which you will, in most cases). Everyone likes to abstract things like directory paths in shell variables, but this is one case where you'll sleep a lot better at night by explicitly specifying the target directory. If your shell variable gets messed up somehow, you could wind up specifying the wrong target directory for the rsync run. In addition to spraying data files all over the place and making a mess, the --delete option will happily chainsaw the destination directory as it wipes out stuff that doesn't exist in the source directory. I've seen some truly horrific train wrecks initiated with badly-handled variables used to specify destination directories with --delete.
Use the --blocking-io option if rsync exhibits flaky behavior during the file copy operations, especially when running against a remote machine. I don't particularly know or care exactly why this makes a big difference, but I beat my brains out one day over an rsync operation that threw weird, non-repeatable errors until I added --blocking-io to the options. Go figure.
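
The dry-run and destination advice above can be folded into a small wrapper. What follows is a minimal sketch, not a drop-in tool: check_dest and preview_then_sync are hypothetical helper names, and the script assumes rsync is on the PATH unless RSYNC is set.

```shell
#!/bin/sh
# Sketch only: check_dest and preview_then_sync are hypothetical helpers,
# not part of rsync. Assumes rsync is on the PATH unless RSYNC is set.
RSYNC=${RSYNC:-rsync}

# Refuse destinations that are empty, relative, "/", or missing.
check_dest() {
    case $1 in
        "") echo "refusing: destination is empty" >&2; return 1 ;;
        /)  echo "refusing: destination is /" >&2; return 1 ;;
        /*) ;;  # absolute path: fall through to the existence check
        *)  echo "refusing: destination is not absolute: $1" >&2; return 1 ;;
    esac
    if [ ! -d "$1" ]; then
        echo "refusing: destination does not exist: $1" >&2
        return 1
    fi
}

# Usage: preview_then_sync SOURCE/ DEST
preview_then_sync() {
    check_dest "$2" || return 1

    # Pass 1: report what would change without touching anything.
    "$RSYNC" --archive --delete --dry-run --verbose "$1" "$2" || return 1

    printf 'Apply these changes to %s? [y/N] ' "$2"
    read answer
    case $answer in
        y|Y) "$RSYNC" --archive --delete "$1" "$2" ;;
        *)   echo "Nothing changed." ;;
    esac
}
```

Called as preview_then_sync /some/source/ /some/target, it prints the dry-run report first and only runs the real sync after an explicit "y". An empty or mangled destination is rejected before rsync ever starts.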

Simple Directory Sync

In this example, I'm just syncing one directory with another. The 'source' directory here happens to be an NFS mount of a remote machine's directory, but that doesn't matter - you can sync any two directories on your machine this way.

#!/usr/bin/ksh
#
# Use 'rsync' to keep local copies of
# patchinfo files up-to-date with master
# copies on remote.machine

RSYNC=/software/opt/sfw/bin/rsync

SOURCEDIR=/net/remote.machine/foo/bar

$RSYNC --archive --quiet --delete --delete-excluded \
${SOURCEDIR}/ /spare_1/patchinfos

Simple, eh? Just remember the trailing slash on the "from" directory (it tells rsync to copy the directory's contents rather than the directory itself) and explicitly specify the destination directory (no variables that can get messed up) and you're done.
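
If the trailing slash seems like a minor detail, a quick experiment in a scratch directory shows what it changes. This is only a sketch: demo_trailing_slash is a hypothetical name, the file names are made up, and rsync is assumed to be on the PATH.

```shell
#!/bin/sh
# Sketch: show what the trailing slash on the source changes.
# demo_trailing_slash is a hypothetical name; it works in a scratch
# directory from mktemp and assumes rsync is on the PATH.
demo_trailing_slash() {
    work=$(mktemp -d) || return 1
    mkdir "$work/src" "$work/with_slash" "$work/without_slash"
    echo "hello" > "$work/src/file.txt"

    # Trailing slash: copy the CONTENTS of src into the destination.
    rsync --archive "$work/src/" "$work/with_slash"

    # No trailing slash: copy the directory src ITSELF into the destination.
    rsync --archive "$work/src" "$work/without_slash"

    # Print the resulting paths relative to the scratch directory.
    (cd "$work" && find with_slash without_slash -type f | sort)
    rm -rf "$work"
}
```

With the slash, the file lands at with_slash/file.txt; without it, rsync creates an extra level and the file lands at without_slash/src/file.txt. Combine that extra level with --delete and you can see how things go sideways.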

Accessing Files On Remote Machines

In addition to just syncing local directories (or things that appear as local directories, like NFS shares), you can use rsync to access files on remote systems. This is really what rsync is all about, and it's quite good at it.

The simplest way to do this is to operate rsync over an rsh connection to the remote machine. You can also do this over ssh, but that's left as an exercise for the student.

You don't have to know a lot about rsh to set up a simple password-less connection between two machines. Let's imagine a setup like this:

Local machine
Machine name: ALPHA
User name: FRED

Remote machine
Machine name: OMEGA
User name: BARNEY

You're FRED, and you want to sync a local directory on your ALPHA machine from a directory accessible to BARNEY on the remote OMEGA machine. In setups like this, both FRED and BARNEY are likely to be special accounts set up for automated software execution, not home accounts used by actual people.

Log in as BARNEY on the OMEGA machine and create a file called .rhosts in BARNEY's home directory. Put this line in it:

ALPHA FRED

This has the effect of allowing FRED on the ALPHA machine to log in (rsh) as BARNEY on OMEGA without a password. As you can imagine, it's easy to create some fairly serious security loopholes this way, so don't do this lightly.

If ALPHA is actually something like alpha.central.foo.com, you need to explicitly spell all that out.
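
If you'd rather script the .rhosts setup than edit the file by hand, something like the following would do it. Treat it as a sketch: add_rhosts_entry is a hypothetical helper, and the permission step rests on the assumption that your rsh daemon is picky about who can write to the file.

```shell
#!/bin/sh
# Sketch: append an rsh trust entry to a .rhosts file and lock it down.
# add_rhosts_entry is a hypothetical helper; run it as the remote
# account (BARNEY in this example).
# Usage: add_rhosts_entry HOME_DIR CLIENT_HOST CLIENT_USER
add_rhosts_entry() {
    rhosts="$1/.rhosts"
    entry="$2 $3"

    # Only add the line if it is not already there.
    if ! grep -qxF "$entry" "$rhosts" 2>/dev/null; then
        echo "$entry" >> "$rhosts"
    fi

    # Assumption: many rsh daemons ignore a .rhosts file that is
    # writable by anyone other than its owner.
    chmod 600 "$rhosts"
}
```

Logged in as BARNEY on OMEGA, you would run add_rhosts_entry "$HOME" alpha.central.foo.com FRED. Running it twice does not duplicate the entry.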

Once you've got BARNEY's .rhosts file in place, log into ALPHA as FRED. Then, try this:

rsh -l BARNEY OMEGA

You should immediately find yourself logged into OMEGA as BARNEY - no password prompt, no challenge/response, just "plop" and you're there, looking at a command prompt. If this is not what happens, seek professional help from your sysadmins - I'm not going to go into all the ways rsh can be fouled up.

Once you've got rsh set up for this automatic login behavior, you can run rsync on top of it. Here's a script that runs rsync over rsh to sync a remote machine directory over to a local directory:

#!/usr/bin/ksh
#
# Use 'rsync' to keep local copies of
# patchinfo files up-to-date with master
# copies on remote.machine
#
# This version of the script uses an 'rsh'
# session against the 'BARNEY' user account
# on OMEGA, rather than relying on
# the horribly slow NFS mounts. 

LOCAL_RSYNC=/software/opt/sfw/bin/rsync
REMOTE_RSYNC=/opt/SENSrsync/bin/rsync

SOURCEDIR=/builds/foo/bar/quux

$LOCAL_RSYNC --archive --quiet --blocking-io \
--delete --delete-excluded \
--rsh="rsh -l BARNEY" --rsync-path ${REMOTE_RSYNC} \
OMEGA:${SOURCEDIR}/ /spare_1/patchinfos

The --rsh option specifies the command FRED would use to establish an rsh session logged in as BARNEY on OMEGA, minus the OMEGA machine name (rsync supplies the host itself). Adjust as needed. The OMEGA:${SOURCEDIR}/ syntax alerts rsync to perform a remote connection rather than look at a locally-accessible directory. The --rsync-path option tells your local rsync where to find the rsync executable on the remote machine - when running rsync this way, your local rsync process fires up a partner rsync process on the remote system. The two rsync processes examine their respective directories and work together to move data and synchronize the systems.
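
For comparison, here is roughly what the same pull looks like with ssh as the transport. This is a sketch under assumptions, not a tested recipe for your site: it assumes FRED can already ssh to OMEGA as BARNEY without a password (with a key pair, for example), and pull_patchinfos is a hypothetical function name. The paths are the ones from the script above.

```shell
#!/bin/sh
# Sketch: the rsh-based pull from above with ssh as the transport.
# Assumes FRED can already ssh to OMEGA as BARNEY without a password
# (for example with a key pair); pull_patchinfos is a hypothetical name.
LOCAL_RSYNC=${LOCAL_RSYNC:-/software/opt/sfw/bin/rsync}
REMOTE_RSYNC=/opt/SENSrsync/bin/rsync
SOURCEDIR=/builds/foo/bar/quux

pull_patchinfos() {
    # --rsh now names ssh. --blocking-io is left out because the rsync
    # man page notes that ssh prefers non-blocking I/O.
    "$LOCAL_RSYNC" --archive --quiet \
        --delete --delete-excluded \
        --rsh="ssh -l BARNEY" --rsync-path="$REMOTE_RSYNC" \
        "OMEGA:${SOURCEDIR}/" /spare_1/patchinfos
}
```

The only substantive change from the rsh version is the value handed to --rsh; the remote path syntax and --rsync-path work the same way.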

The password-less login trick is not strictly required, but it's handy. You could fool around with hard-coding the remote account password in the script on ALPHA, but that's not generally a good idea. Setting up the .rhosts file on OMEGA means that FRED never needs to know BARNEY's password; the BARNEY password can be changed, too, without notifying FRED.

In situations where the remote machine is a production server and strict steps must be taken to avoid data loss or corruption, sysadmins often create the BARNEY account with group privileges that allow only read access to the file systems being read by rsync. Then, FRED (connected over rsh as BARNEY) can read the data but can't hurt it, even by accident.

Complex Inclusion/Exclusion Filtering

In this example, I've got a huge amount of data stored on a remote machine, and I want to sync only selected files out of the remote directory for local storage on my machine. The remote machine has a directory that contains over 20,000 subdirectories that follow a naming convention like 10000-01, 234567-42, 654321-98, and so on. I want to sync up the following information to my local file system:

The ######-## subdirectories themselves
Subdirectories one level below each ######-## directory (but NO DEEPER)
Files named patchinfo (found in the ######-## subdirectory) and pkginfo (found in first-level subdirectories of the ######-## directories), and files whose names begin with README (found in the ######-## directories).

There are a lot of other subdirectories and files in the source directory on the remote machine, but these specific files are the only ones I want.

Here's the ksh script:

#!/usr/bin/ksh
#
# Let's see if we can rsync just the README and pkginfo
# and patchinfo files from the expanded patch directories
# over to local disk..
#
# This version of the script uses an 'rsh'
# session against the 'BARNEY' user account
# on OMEGA, rather than relying on
# the horribly slow NFS mounts.


LOCAL_RSYNC=/software/opt/sfw/bin/rsync
REMOTE_RSYNC=/usr/local/bin/rsync

SOURCEDIR=/patches/intpatchroot/all_patchdb/patchdb

begin=`date`
echo "Started rsync of patchinfo, pkginfo and READMEs at $begin"

# The first 'include' picks up the patch directory (123456-78)
# The second 'include' picks up pkg subdirectories (SUNWfoo)
# The next three includes grab pkginfo, patchinfo, and README
# The final 'exclude' ignores everything else (including
# subdirectory trees).

$LOCAL_RSYNC --archive --quiet --blocking-io \
--rsh="rsh -l BARNEY" --rsync-path ${REMOTE_RSYNC} \
--include "[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]/" \
--include "[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]/*/" \
--include "patchinfo" --include "pkginfo" --include "README*" \
--exclude "*" \
--delete --delete-excluded \
OMEGA:${SOURCEDIR}/ \
/spare_1/patchdata_rsh

end=`date`
echo "Finished rsync of patchinfo, pkginfo and READMEs at $end"

Ta-da! You can assemble some truly spectacular combinations of include and exclude directives; see the man page for rsync (or a good book) for more information.
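
Before turning a filter stack like this loose on real data, it's worth trying it on a tiny local tree. The following is a sketch of that idea: demo_filters is a hypothetical name, the sample tree is made up, and rsync is assumed to be on the PATH.

```shell
#!/bin/sh
# Sketch: try the include/exclude rules on a tiny local tree before
# pointing them at real data. demo_filters is a hypothetical name and
# the sample paths are made up; assumes rsync is on the PATH.
demo_filters() {
    work=$(mktemp -d) || return 1
    patch="$work/src/123456-78"
    mkdir -p "$patch/SUNWfoo/reloc/usr/bin" "$work/dst"
    touch "$patch/patchinfo" "$patch/README.123456-78" "$patch/other.txt"
    touch "$patch/SUNWfoo/pkginfo" "$patch/SUNWfoo/reloc/usr/bin/foo"

    rsync --archive \
        --include "[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]/" \
        --include "[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]/*/" \
        --include "patchinfo" --include "pkginfo" --include "README*" \
        --exclude "*" \
        "$work/src/" "$work/dst"

    # List what actually arrived, relative to the destination.
    (cd "$work/dst" && find . -type f | sort)
    rm -rf "$work"
}
```

If the rules behave as described above, only patchinfo, the README file, and SUNWfoo/pkginfo should show up in the listing; other.txt and everything under SUNWfoo/reloc should stay behind.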

File-Driven Inclusion

In this example, I want to sync up selected subdirectories based on the contents of a couple of control files. The control files contain simple lines of text - strings in the form ######-## (one entry per line). The idea here is that some other process can manage the contents of the control files, and my script can dynamically adapt to the new set of requested subdirectories.

Here's the script:

#!/usr/bin/ksh
#
# Let's see how tricky it is to do a full sync
# on just the 'suspect' patches (as opposed to
# the whole population of patches)..

RSYNC=/software/opt/sfw/bin/rsync

SOURCEDIR=/net/foo/bar

TARGET1="/tester/a-list.txt"
TARGET2="/tester/b-list.txt"

RSYNC_LIST=/tester/data/suspect_rsync_includes

# Build a list of 'include' expressions, one for
# each suspect patch ID. Stick on a leading
# slash and trailing double asterisk and stuff
# the whole list into a file.

cat $TARGET1 $TARGET2 | sort -u | sed -e "s/^/\//" | sed -e "s/$/\*\*/" > $RSYNC_LIST

begin=`date`
echo "Started rsync of suspect patches at $begin"

# The 'include' picks up the patches listed in $RSYNC_LIST.
# The 'exclude' ignores everything else.

$RSYNC --archive --quiet \
--include-from=$RSYNC_LIST \
--exclude "*" \
--delete --delete-excluded \
${SOURCEDIR}/ /spare_1/suspects

end=`date`
echo "Finished rsync of suspect patches at $end"

The sort and sed game sucks the ######-## entries out of the two control files and dresses them up as rsync include expressions, prepending a forward slash and appending two asterisks to each ######-## entry. In other words, 123456-78 becomes /123456-78** in the $RSYNC_LIST file. The list of include expressions is fed into rsync as a file, rather than as a stack of command-line arguments.
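
The transformation is easy to check on its own, away from rsync. Here is a small sketch of just that step; make_includes is a hypothetical name, and it uses a single sed with two expressions instead of two sed processes, which is a stylistic swap rather than what the script above does.

```shell
#!/bin/sh
# Sketch: turn patch IDs (one per line) into rsync include patterns.
# make_includes is a hypothetical name; it reads the control files
# named on its command line and writes the patterns to stdout.
make_includes() {
    # sort -u drops duplicates; the sed expressions add a leading
    # slash and a trailing double asterisk to every line.
    cat "$@" | sort -u | sed -e 's|^|/|' -e 's|$|**|'
}
```

Running make_includes /tester/a-list.txt /tester/b-list.txt and eyeballing the output before handing it to --include-from is a cheap sanity check on whatever process maintains the control files.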

The syntax and coding of include directives is weird and subtle - in this case, we're syncing selected subdirectories from /net/foo/bar into /spare_1/suspects, recursing all the way down each selected ######-## subdirectory.

Summary

If you pay attention and take some time to learn the subtleties of rsync, you'll find it to be one of the most useful and powerful tools in your toolbox. You can pull off some really spectacular miracles with rsync, and you can just as easily create titanic messes by being careless. Be careful!