Friends & Family Fileserver & Backup Idea
patrick — Mon, 2007-07-02 11:47
This was originally started somewhere around 2007-06-27, but it has taken several days of digging through information and sorting ideas to put it together.
I've been wanting to set up a raid5 array on a local file server for a while. I'm wanting something somewhat small and hopefully with a lower power consumption than a standard ATX. Something I can haul along with me fairly easily if and when I move again. So I've been looking for ARM, MIPS, and mini-ITX motherboards again, but considering I've not really found much for ARM or MIPS I think I may just go with a mini-ITX. Something similar to the Tux Server Project is my target.
And then the web/file/etc server at my parents' house went down last Monday or Tuesday. Long story short - 1 of the drives in the raid array is dying and there's no spare drive.
Back to my local file server... I started thinking about how I could get some sort of backup or redundancy for my file server so I wouldn't lose data. I could set up a 2nd raid5 array and then set both raid arrays as a raid1 array (mirrored). While that's a possibility, it works against being portable and having low power consumption. Not to mention what if - flood, tornado, stolen, thrown out a window - there goes everything.
I've thought about remote backups before, but I don't really want to be paying somebody else some sort of fee (monthly? yearly?) and maybe get stuck with limits on how often I can upload/download data... plus, do they back up the data as well or is it just a second location for the data? Plus, why pay a monthly fee for something if I can set it up myself?
I've looked at rsync before and thought it might be a great idea for keeping stuff backed up between computers on a LAN. For those that don't know, rsync allows you to keep 2 locations in sync. 2 directories, 2 computers, 2 whatever. One thing that makes rsync preferable over ftp, sftp, scp, etc. is that rsync works out whether a certain file needs to be transferred or not: by default it compares file size and modification time (or full checksums if you pass --checksum), and for files that did change it uses checksums to send only the parts that differ over the network. Unfortunately, again with storing backups in 1 location, the idea of keeping rsync'ed copies on a LAN is prone to flood, tornado, etc.
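A quick way to get a feel for what rsync would transfer is a dry run between two throwaway directories. The sketch below is only an illustration: the preview_sync helper and the temporary directories are made up for the example, and it assumes rsync is installed on the machine.

```shell
#!/bin/bash
# Sketch: preview an rsync between two local directories without copying.
# preview_sync is a hypothetical helper, not part of rsync itself.

preview_sync() {
    # $1 = source directory, $2 = destination directory
    if ! command -v rsync >/dev/null 2>&1; then
        echo "rsync is not installed" >&2
        return 0
    fi
    # --dry-run makes no changes; --itemize-changes lists each file
    # rsync would create or update. Adding --checksum would compare
    # file contents instead of size and modification time.
    rsync --archive --dry-run --itemize-changes "$1/" "$2/"
}

# Throwaway directories, purely for demonstration.
src="$(mktemp -d)"
dst="$(mktemp -d)"
echo "hello" > "$src/notes.txt"

preview_sync "$src" "$dst"

rm -rf "$src" "$dst"
```

Because of --dry-run the destination stays empty; dropping that flag would perform the actual copy.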
The mirrored raid5 solution got me to thinking about all of my previous ideas, but I'm wanting to make things easier... I've tried working out various backup ideas before, but the only 1 I've really implemented is Subversion when dealing with code. It's one of those "if I have to do too much every single time to get it to work, then I might not get around to it" kind of issues. Sure, I've done tape, CD, DVD, flash drive backups at different times, but these require me to sit down and waste my time in preparation for it. Most of the time it also means it's running while I'm awake so I can switch the backup media... Which means all I get to accomplish during that period of time is to watch TV and tap my finger as I wait for the process to get done.
I'm a geek, gosh darn it! Not only that, I'm lazy! Surely there's a way for me to set something up and just leave it to run by itself. Sure, at some point I'll probably have to intervene and do some maintenance, but every time I need a backup? I just want something that works without me messing with it except to fix hardware issues on occasion.
I'm thinking of fixing up my parents' file server in Kansas City, MO and then setting up my own locally (currently Tulsa, OK) - both with about the same amount of disk space. After setting them up I'll do an initial rsync on my parents' LAN. With a file server in 2 different locations I can then set them up to do a daily rsync. Unfortunately I've run into a couple of snags with the idea.
In looking at the various raid controllers, hard drive enclosures and hard drives, some things are fairly apparent from reading the detailed information. For other stuff I'm not sure where to look for answers...
- If a drive in the raid array goes bad, will the computer beep (this would be useful as my parents could call and say that the computer is beeping)? Or at the least, will the hard drive enclosure light up in a certain way or do something else to notify you that a drive has gone bad?
- Will some brands of RAID cards work together? I.e. if I get a 4-port RAID controller today and later I want 8 hard drives in my array, can I buy another 4-port RAID controller or am I going to have to toss the 4-port and get an 8-port?
- Are there any available hardware RAID cards that support raid6?
- How closely do the hard drives in a RAID array have to match? Do they only have to match in size or do they have to match in the number of cylinders, heads, and/or sectors as well? So far everyone I've talked to simply uses the exact same brand and model for all the hard drives... which works fine, but what happens in 3+ years after the warranty is up on the hard drive and you can no longer find any more of that specific brand and model?
What I'm looking at is that each user's local fileserver will be the 1 allowed to delete that user's files, and the other servers simply act as storage devices. The problem area is the public storage... I think at a certain point I may announce (by calling or emailing them) that a certain day is a maintenance day and this will be a deletion day... If there are certain files in public storage that need to be deleted, they should send me a list or something... Otherwise I'll simply go through on my server and delete public files that no longer need to be around (do I really need the last 6 versions of Firefox installation files for Windows?). After that is done I can run an rsync --delete command, pushing deletions out to the other computers.
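A deletion-day push could look roughly like the sketch below. It only prints the rsync commands rather than running them, and it assumes the same placeholder hostnames, backupuser account and /home/public/ path used elsewhere in this post; the deletion_push_commands function name is made up for the example.

```shell
#!/bin/bash
# Sketch only: print the rsync commands for a "deletion day" push of
# public storage. Hostnames and the backupuser account are placeholders.
remoteservers="some.dynamic.domain.com another.dynamic.domain.com"

deletion_push_commands() {
    local remote
    for remote in $remoteservers; do
        # First a preview: --dry-run changes nothing and
        # --itemize-changes lists what would be sent or deleted.
        echo "rsync --dry-run --itemize-changes --delete --compress --archive /home/public/ backupuser@$remote:/home/public/"
        # Then the real push, which removes files on the remote
        # that no longer exist locally.
        echo "rsync --delete --compress --archive /home/public/ backupuser@$remote:/home/public/"
    done
}

deletion_push_commands
```

Running the preview first is cheap insurance, since --delete removes anything on the receiving side that is missing from the sending side.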
My plan is to have a cron job that will execute a shell script once a day from 1 of the computers that will look something like this -
#!/bin/bash

# all of the users along with @domain to show which
# server is their home server.
users="patrick@my.dynamic.domain.com
mom@some.dynamic.domain.com
dad@some.dynamic.domain.com
sister@another.dynamic.domain.com
bro-in-law@another.dynamic.domain.com"

# the server that this script is being run on
localserver="my.dynamic.domain.com"

# the other servers in the rsync backup queue.
remoteservers="some.dynamic.domain.com
another.dynamic.domain.com"

# $users is left unquoted on purpose so it splits on whitespace
for user in $users; do
    # the username is everything before the @
    username="${user%%@*}"
    # the servername is everything after the @
    servername="${user#*@}"

    # If the user's home server is not this server,
    # then we need to pull their home dir from it.
    # Anything that the user deleted on their home
    # server should be deleted on this server as well.
    #
    # we skip users whose home server is this server
    # because there's no need to sync it with itself.
    if [ "$servername" != "$localserver" ]; then
        rsync --delete --compress --archive \
            "backupuser@$servername:/home/$username/" \
            "/home/$username/"
    fi

    # local server is not in the remoteservers list
    # so we don't have to worry about excluding
    # it here.
    #
    # push the user's home dir to every remote server
    # except the user's own home server.
    for remote in $remoteservers; do
        if [ "$servername" != "$remote" ]; then
            rsync --delete --compress --archive \
                "/home/$username/" \
                "backupuser@$remote:/home/$username/"
        fi
    done
done

# sync the public storage space
for remote in $remoteservers; do
    # copy files from remote to local
    rsync --compress --archive \
        "backupuser@$remote:/home/public/" \
        /home/public/
    # copy files from local to remote
    rsync --compress --archive \
        /home/public/ \
        "backupuser@$remote:/home/public/"
done

# second pass: push again so that earlier remotes also
# receive the files pulled from later remotes
for remote in $remoteservers; do
    # copy files from local to remote
    rsync --compress --archive \
        /home/public/ \
        "backupuser@$remote:/home/public/"
done
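The per-user logic in the script above can be checked without touching the network by printing a plan instead of calling rsync. This is just a sketch: plan_user_sync is a hypothetical helper, and the hostnames are the same placeholders as in the script.

```shell
#!/bin/bash
# Sketch: print the per-user sync plan instead of running rsync.
localserver="my.dynamic.domain.com"
remoteservers="some.dynamic.domain.com another.dynamic.domain.com"

plan_user_sync() {
    # $1 is a user@homeserver entry, as in the users list.
    local entry username servername remote
    entry="$1"
    username="${entry%%@*}"     # everything before the first @
    servername="${entry#*@}"    # everything after the first @

    # Pull from the user's home server unless that is this machine.
    if [ "$servername" != "$localserver" ]; then
        echo "pull $username from $servername"
    fi
    # Push to every remote except the user's home server.
    for remote in $remoteservers; do
        if [ "$servername" != "$remote" ]; then
            echo "push $username to $remote"
        fi
    done
}

plan_user_sync "patrick@my.dynamic.domain.com"
plan_user_sync "mom@some.dynamic.domain.com"
```

For a user whose home is this server the plan is push-only; for a user on a remote server it is one pull followed by pushes to the other remotes.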
The biggest issue I have with this is the public storage. My understanding of the way rsync works is that it has to work out which files are different in 1 direction and copy those over, then redo the same comparison in the reverse direction in order to copy files back. The biggest problem that I can see is that the files on the local server and the first remote server will be backed up on all of the remaining remote servers, but the files on the following remote servers won't be replicated on the preceding remote servers until the next time the script is run.
One option might be, once the public storage on the local server has been updated from all of the remote servers, to drop the last remote server from the list (it already has a complete copy of all the other remote servers), reverse the list of remote servers, and run the rsync again. However, each time rsync is run the servers have to calculate the file differences between them. For a couple hundred megabytes this is probably not that big of a deal, but I'm currently looking at having file servers with between 500GB and 1TB worth of storage space available.
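The "drop the last server, then reverse the list" step can be sketched as a small shell function. push_order is a made-up name, and third.dynamic.domain.com is a hypothetical extra host added only so the reversal is visible.

```shell
#!/bin/bash
# Sketch: given the remote servers in the order they were synced,
# print the order for the second pass: last server dropped, rest reversed.

push_order() {
    local reversed host
    reversed=""
    # Stop when one argument is left, so the last server is dropped.
    while [ "$#" -gt 1 ]; do
        reversed="$1 $reversed"   # prepending reverses the order
        shift
    done
    for host in $reversed; do
        printf '%s\n' "$host"
    done
}

push_order some.dynamic.domain.com another.dynamic.domain.com third.dynamic.domain.com
```

With only one remote server the function prints nothing, which matches the idea that a single remote already has everything after the first pass.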
Even after going through several more pages on rsync calculations I still haven't figured out which server does what calculations. It did, however, give my subconscious some more time to think about the rsync controller/round robin issue and I think I've got a bit better of an idea. The shell script above was modified (red is deletions, green is additions) to reflect my new thoughts on how I could deal with the issue of keeping all of the servers in sync without trying to figure out a way to do multiple loops making rsync calls.
Redbeard (not verified) — Sat, 2007-07-07 01:25
Oh, and then there's Unison. Looks like it might be a perfect match. Found the link after reading a bit more about csync2 (which is at http://oss.linbit.com/csync2/ ). It's at http://www.cis.upenn.edu/~bcpierce/unison/ .
Redbeard (not verified) — Sat, 2007-07-07 01:20
I'll have to do some more research, but there are better methods than rsync for what you want. Specifically, there's a filesystem designed for remote syncing. The csync tool might work too. I think it's relatively new but it appears to be more robust for what you're looking at than rsync.
In your case I think DRBD is overkill, but it might work, too. It's designed for high availability. Basically you keep a system with an extra drive and everything that's written to the master gets written to the extra drive. The URLs are http://www.linux-ha.org/DRBD and http://www.drbd.org/ . But it might not even work across internet-level connections.
Another thought, along the RAID lines, is a nifty new device called the Drobo. Put up to four drives in and it makes them redundant. They can be mismatched in size. There are dummy indicators (LEDs across the bottom showing how "full" the system is, flashing lights when a drive is having problems). It's hot swap. Pull out a drive, put in a new one, and it auto-syncs. And flashes lights to let you know what it's doing. At the moment it's a strictly USB 2.0 device, but that was just so they could get it to market. Of course, the big drawback is that it is $400 (IIRC) for just the unit. No drives. URL is http://www.drobo.com/ .
Well, there's my drool for the night :)
Michael
David S (not verified) — Mon, 2007-07-02 18:38
huh? ^_^
;)