rsync
The rsync utility program is one of the most powerful (and
dangerous) tools around when it comes to keeping directories
and files synchronized. Many people fiddle around with it,
get burned, and never touch it again - however, it's a really
useful tool once you learn a few of its quirks.
In this document, we'll take a look at how rsync works and go
through some examples ranging from the simple to the
ridiculously complex. There are a myriad of ways to do
all this stuff - this document is by no means an exhaustive
study of rsync, just a handful of examples and some
explanation. If you want the complete story on rsync,
go get the O'Reilly book on it.
General Advice
Keep these simple guidelines in mind when playing with
rsync:
- Script your rsync commands. No matter how good
you think you are, don't get in the habit of trying to
type in rsync commands by hand. Don't do a lot of fancy
mouse work to cut and paste rsync commands, either.
Sooner or later, you will make a mistake - and
believe me, rsync can make a hell of a mess in a hurry
if you fumble-finger the command.
- Use the --dry-run option first, and don't proceed
until you understand and approve of the action rsync
will undertake. Just take the time - it's better
to say, "Whoops! That's not right!" than "Uh, Boss?
I think I just blew away half the system with a fouled-up
rsync command." The --dry-run option is your friend;
use it wisely.
- Try to avoid specifying the target directory as
a variable if possible, especially when using the
--delete option. The rsync program is read-only
on the data in your source directory, but it's read/write
and delete on the target directory if you're
using --delete (which you will, in most cases). Everyone
likes to abstract things like directory paths in shell
variables, but this is one case where you'll sleep a
lot better at night by explicitly specifying the
target directory. If your shell variable gets messed up
somehow, you could wind up specifying the wrong target
directory for the rsync run. In addition to spraying
data files all over the place and making a mess, the
--delete option will happily chainsaw the destination
directory as it wipes out stuff that doesn't exist in
the source directory. I've seen some truly horrific
train wrecks initiated with badly-handled variables
used to specify destination directories with --delete.
- Use the --blocking-io option if rsync exhibits
flaky behavior during the file copy operations, especially
when running against a remote machine. I don't
particularly know or care exactly why this makes a
big difference, but I beat my brains out one day over
an rsync operation that threw weird, non-repeatable
errors until I added --blocking-io to the options. Go
figure.
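Here's a minimal sketch of the dry-run and target-directory advice above. The function names safe_target and preview_sync are made up for this example (they are not part of rsync), and the script assumes rsync is on your PATH unless you set RSYNC yourself. Treat it as a starting point, not a finished safety net.

```shell
#!/bin/sh
# Hypothetical helper (not part of rsync): refuse obviously
# dangerous targets before any run that uses --delete.
safe_target() {
    case "$1" in
        ""|/) echo "refusing target '$1'" >&2; return 1 ;;
        /*)   ;;   # absolute path: acceptable so far
        *)    echo "target '$1' is not an absolute path" >&2; return 1 ;;
    esac
    [ -d "$1" ] || { echo "target '$1' is not a directory" >&2; return 1; }
}

# Preview only: --dry-run reports what would happen without
# copying or deleting anything. RSYNC defaults to 'rsync' on
# the PATH (an assumption; adjust for your site).
preview_sync() {
    src=$1 dst=$2
    safe_target "$dst" || return 1
    "${RSYNC:-rsync}" --archive --delete --dry-run --verbose \
        "${src%/}/" "$dst"
}
```

You might call it as preview_sync /net/remote.machine/foo/bar /spare_1/patchinfos, read the output, and only then run the real script. An empty or mangled variable passed as the target gets rejected before rsync ever starts.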
Simple Directory Sync
In this example, I'm just syncing one directory with
another. The 'source' directory here happens to be
an NFS mount of a remote machine's directory, but
that doesn't matter - you can sync any two directories
on your machine this way.
#!/usr/bin/ksh
#
# Use 'rsync' to keep local copies of
# patchinfo files up-to-date with master
# copies on remote.machine
RSYNC=/software/opt/sfw/bin/rsync
SOURCEDIR=/net/remote.machine/foo/bar
$RSYNC --archive --quiet --delete --delete-excluded \
${SOURCEDIR}/ /spare_1/patchinfos
Simple, eh? Just remember that trailing slash on the
"from" directory and explicitly specify the destination
directory (no variables that can get messed up) and you're
done.
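The trailing slash matters because rsync treats "dir/" as "the contents of dir" and "dir" as "the directory itself". The sketch below uses a hypothetical helper, with_slash, to force exactly one trailing slash onto a source path; the helper name is mine, not something rsync provides.

```shell
#!/bin/sh
# Hypothetical helper: return the path with exactly one trailing
# slash, so rsync copies the directory's contents rather than
# the directory itself.
with_slash() {
    p=$1
    # Strip any trailing slashes, then add a single one back.
    while [ "${p%/}" != "$p" ]; do p=${p%/}; done
    printf '%s/\n' "$p"
}

# With the slash:    .../foo/bar/  ->  /spare_1/patchinfos/<files>
# Without the slash: .../foo/bar   ->  /spare_1/patchinfos/bar/<files>
with_slash /net/remote.machine/foo/bar
with_slash /net/remote.machine/foo/bar/
```

Both calls print /net/remote.machine/foo/bar/. Forgetting the slash on a --delete run is worth a dry run all by itself: rsync would build a bar subdirectory in the target and treat whatever was previously synced at the top level as extraneous.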
Accessing Files On Remote Machines
In addition to just syncing local directories (or things
that appear as local directories, like NFS shares), you
can use rsync to access files on remote systems. This
is really what rsync is all about, and it's quite good
at it.
The simplest way to do this is to operate rsync over
an rsh connection to the remote machine. You can also
do this over ssh, but that's left as an exercise for
the student.
You don't have to know a lot about rsh to set up a
simple password-less connection between two machines.
Let's imagine a setup like this:
Local machine
  Machine name: ALPHA
  User name: FRED
Remote machine
  Machine name: OMEGA
  User name: BARNEY
You're FRED, and you want to sync a local directory on your
ALPHA machine from a directory accessible to BARNEY on the
remote OMEGA machine. In setups like this, both FRED and
BARNEY are likely to be special accounts set up for
automated software execution, not home accounts used by
actual people.
Log in as BARNEY on the OMEGA machine and create a file
called .rhosts in BARNEY's home directory. Put
this line in it:
ALPHA FRED
This has the effect of allowing FRED on the ALPHA machine
to log in (rsh) as BARNEY on OMEGA without a password.
As you can imagine, it's easy to create some fairly serious
security loopholes this way, so don't do this lightly.
If ALPHA is actually known by a fully qualified name like
alpha.central.foo.com, spell out that full name in the
.rhosts entry.
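As a rough sketch, the .rhosts entry can be written with a small function like the one below. The name write_rhosts is hypothetical. The chmod reflects the fact that many rsh daemons ignore a .rhosts file that is writable by anyone but its owner; check your own system's documentation before relying on that.

```shell
#!/bin/sh
# Hypothetical helper: write a one-line .rhosts file that lets
# <client_user> on <client_host> log in without a password.
# Run it as the remote account (BARNEY on OMEGA in this example).
# Note that it overwrites any existing file at that path.
write_rhosts() {
    rhosts_file=$1    # normally "$HOME/.rhosts"
    client_host=$2    # e.g. ALPHA, or its fully qualified name
    client_user=$3    # e.g. FRED
    printf '%s %s\n' "$client_host" "$client_user" > "$rhosts_file" &&
    chmod 600 "$rhosts_file"   # owner-only read/write
}
```

For the setup described here, BARNEY would run write_rhosts "$HOME/.rhosts" ALPHA FRED on OMEGA.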
Once you've got BARNEY's .rhosts file in place, log into
ALPHA as FRED. Then, try this:
rsh -l BARNEY OMEGA
You should immediately find yourself logged into OMEGA as
BARNEY - no password prompt, no challenge/response, just
"plop" and you're there, looking at a command prompt.
If this is not what happens, seek professional help from
your sysadmins - I'm not going to go into all the ways
rsh can be fouled up.
Once you've got rsh set up for this automatic login behavior,
you can run rsync on top of it. Here's a script that runs
rsync over rsh to sync a remote machine directory over to
a local directory:
#!/usr/bin/ksh
#
# Use 'rsync' to keep local copies of
# patchinfo files up-to-date with master
# copies on remote.machine
#
# This version of the script uses an 'rsh'
# session against the 'BARNEY' user account
# on OMEGA, rather than relying on
# the horribly slow NFS mounts.
LOCAL_RSYNC=/software/opt/sfw/bin/rsync
REMOTE_RSYNC=/opt/SENSrsync/bin/rsync
SOURCEDIR=/builds/foo/bar/quux
$LOCAL_RSYNC --archive --quiet --blocking-io \
--delete --delete-excluded \
--rsh="rsh -l BARNEY" --rsync-path ${REMOTE_RSYNC} \
OMEGA:${SOURCEDIR}/ /spare_1/patchinfos
The --rsh option specifies the command FRED would
use to establish an rsh session logged in as BARNEY,
minus the OMEGA machine name (rsync adds the host name
itself). Adjust as needed. The OMEGA:${SOURCEDIR}/
syntax tells rsync to make a remote connection
rather than look at a locally-accessible directory.
The --rsync-path option tells your local
rsync where to find the rsync executable on the remote
machine - when running rsync this way, your local
rsync process fires up a partner rsync process on
the remote system. The two rsync processes examine
their respective directories and work together to
move data and synchronize the systems.
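For the ssh route mentioned earlier, the change is mostly a matter of swapping the --rsh command. What follows is only a sketch: it assumes FRED already has key-based (password-less) ssh access to BARNEY on OMEGA, it reuses the paths from the script above, and the function name sync_over_ssh is made up for the example.

```shell
#!/bin/sh
# Sketch only: the same sync as the rsh script, but over ssh.
# Assumes key-based ssh login from FRED@ALPHA to BARNEY@OMEGA
# is already working. Paths are the ones used in the rsh script.
RSYNC=${RSYNC:-/software/opt/sfw/bin/rsync}
REMOTE_RSYNC=/opt/SENSrsync/bin/rsync
SOURCEDIR=/builds/foo/bar/quux

# Extra arguments (for example --dry-run) are passed through.
sync_over_ssh() {
    "$RSYNC" --archive --quiet "$@" \
        --delete --delete-excluded \
        --rsh="ssh -l BARNEY" --rsync-path="$REMOTE_RSYNC" \
        "OMEGA:${SOURCEDIR}/" /spare_1/patchinfos
}
```

Nothing runs until you call the function. A first call of sync_over_ssh --dry-run --verbose is a reasonable way to check the connection and the file list before letting --delete loose.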
The password-less login trick is not strictly required,
but it's handy. You could fool around with hard-coding
the remote account password in the script on ALPHA,
but that's not generally a good idea. Setting up the
.rhosts file on OMEGA means that FRED never needs to
know BARNEY's password; the BARNEY password can be changed,
too, without notifying FRED.
In situations where the remote machine is a production
server and strict steps must be taken to avoid data
loss or corruption, sysadmins often create the BARNEY
account with group privileges that allow only read access
to the file systems being read by rsync. Then, FRED
(connected over rsh as BARNEY) can read the data but
can't hurt it, even by accident.
Complex Inclusion/Exclusion Filtering
In this example, I've got a huge amount of data stored on a
remote machine, and I want to sync only selected files out
of the remote directory for local storage on my machine.
The remote machine has a directory that contains over
20,000 subdirectories that follow a naming convention
like 100000-01, 234567-42, 654321-98,
and so on. I want to sync up the following information
to my local file system:
- The ######-## subdirectories themselves
- Subdirectories one level below each ######-## directory
(but NO DEEPER)
- Files named patchinfo (found in the ######-##
subdirectory) and pkginfo (found in first-level
subdirectories of the ######-## directories), and
files whose names begin with README (found
in the ######-## directories).
There are a lot of other subdirectories and files in
the source directory on the remote machine, but these
specific files are the only ones I want.
Here's the ksh script:
#!/usr/bin/ksh
#
# Let's see if we can rsync just the README and pkginfo
# and patchinfo files from the expanded patch directories
# over to local disk..
#
# This version of the script uses an 'rsh'
# session against the 'BARNEY' user account
# on OMEGA, rather than relying on
# the horribly slow NFS mounts.
LOCAL_RSYNC=/software/opt/sfw/bin/rsync
REMOTE_RSYNC=/usr/local/bin/rsync
SOURCEDIR=/patches/intpatchroot/all_patchdb/patchdb
begin=`date`
echo "Started rsync of patchinfo, pkginfo and READMEs at $begin"
# The first 'include' picks up the patch directory (123456-78)
# The second 'include' picks up pkg subdirectories (SUNWfoo)
# The next three includes grab pkginfo, patchinfo, and README
# The final 'exclude' ignores everything else (including
# subdirectory trees).
$LOCAL_RSYNC --archive --quiet --blocking-io \
--rsh="rsh -l BARNEY" --rsync-path ${REMOTE_RSYNC} \
--include "[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]/" \
--include "[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]/*/" \
--include "patchinfo" --include "pkginfo" --include "README*" \
--exclude "*" \
--delete --delete-excluded \
OMEGA:${SOURCEDIR}/ \
/spare_1/patchdata_rsh
end=`date`
echo "Finished rsync of patchinfo, pkginfo and READMEs at $end"
Ta-da! You can assemble some truly spectacular
combinations of include and exclude directives;
see the man page for rsync (or a good book) for
more information.
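Two things are worth keeping in mind with a filter stack like this one: rsync checks the rules in order and the first match wins, so the final --exclude "*" only catches what the includes let through; and the bracket expressions are fussy about length. The sketch below uses a shell case statement to sanity-check names against the same ######-## glob. For a simple pattern like this the shell and rsync agree on what [0-9] means, but the function is_patch_dir is purely illustrative and is no substitute for a --dry-run.

```shell
#!/bin/sh
# Hypothetical check: does a name look like a ######-## patch
# directory? Same bracket glob as the first --include rule.
is_patch_dir() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

for name in 234567-42 654321-98 12345-67 SUNWfoo; do
    if is_patch_dir "$name"; then
        echo "$name: included"
    else
        echo "$name: excluded"
    fi
done
```

The first two names pass; 12345-67 (one digit short) and SUNWfoo do not, which is exactly how the first --include rule would treat them at the top level of the source directory.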
File-Driven Inclusion
In this example, I want to sync up selected subdirectories
based on the contents of a couple of control files. The
control files contain simple lines of text - strings in
the form ######-## (one entry per line). The idea here is
that some other process can manage the contents of the
control files, and my script can dynamically adapt to the
new set of requested subdirectories.
Here's the script:
#!/usr/bin/ksh
#
# Let's see how tricky it is to do a full sync
# on just the 'suspect' patches (as opposed to
# the whole population of patches)..
RSYNC=/software/opt/sfw/bin/rsync
SOURCEDIR=/net/foo/bar
TARGET1="/tester/a-list.txt"
TARGET2="/tester/b-list.txt"
RSYNC_LIST=/tester/data/suspect_rsync_includes
# Build a list of 'include' expressions, one for
# each suspect patch ID. Stick on a leading
# slash and trailing double asterisk and stuff
# the whole list into a file.
cat $TARGET1 $TARGET2 |
sort -u |
sed -e "s/^/\//" |
sed -e "s/$/\*\*/" > $RSYNC_LIST
begin=`date`
echo "Started rsync of suspect patches at $begin"
# The 'include' picks up the patches listed in $RSYNC_LIST.
# The 'exclude' ignores everything else.
$RSYNC --archive --quiet \
--include-from=$RSYNC_LIST \
--exclude "*" \
--delete --delete-excluded \
${SOURCEDIR}/ /spare_1/suspects
end=`date`
echo "Finished rsync of suspect patches at $end"
The sort and sed game sucks the ######-## entries out
of the two control files and dresses them up as rsync
include expressions, prepending a forward slash and
appending two asterisks to each ######-## entry.
In other words, 123456-78 becomes /123456-78** in the
$RSYNC_LIST file. The list of include expressions
is fed into rsync as a file, rather than as a
stack of command-line arguments.
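To see the transformation in isolation, here is a small sketch that feeds a few sample IDs through the same sort and sed steps. It adds one thing the script above does not have: a grep that drops any line that isn't exactly ######-##. That filter is my own suggestion, on the reasoning that a blank line in a control file would turn into the pattern /**, which matches everything. The function name make_includes is hypothetical.

```shell
#!/bin/sh
# Sketch: turn patch IDs into rsync include patterns, skipping
# any line that is not exactly ######-## (blank lines, typos).
make_includes() {
    grep -E '^[0-9]{6}-[0-9]{2}$' |
    sort -u |
    sed -e 's|^|/|' -e 's|$|**|'
}

# Sample input: a duplicate, a blank line, and a bogus entry.
printf '%s\n' 123456-78 654321-98 '' 123456-78 'oops' | make_includes
```

This prints /123456-78** and /654321-98**, one per line. In the real script the input would come from cat $TARGET1 $TARGET2 and the output would be redirected to $RSYNC_LIST.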
The syntax and coding of include directives is
weird and subtle - in this case, we're syncing
selected subdirectories from /net/foo/bar into
/spare_1/suspects, recursing all the way down
each selected ######-## subdirectory.
Summary
If you pay attention and take some time to learn the
subtleties of rsync, you'll find it to be one of the
most useful and powerful tools in your toolbox. You
can pull off some really spectacular miracles with
rsync, and you can just as easily create titanic
messes by being careless. Be careful!