System Backup

Overview

There are a million and one ways to configure backups; everything is situationally dependent. For our purposes, these are the challenges and needs of a laptop user, cloud server administrator, or simply a FOSS advocate:

  • Random uptime, laptops suspend/hibernate frequently
  • Slow upload bandwidth on home DSL-style networks or spotty wifi
  • Avoid vendor lock-in on both the client and server side
  • Portability on the client and server side between distributions
  • Reduce server administration - use a cloud files/storage provider
  • Include standardized encryption algorithms in the backups
  • Ability to tune backups granularly on a per-case basis
  • Ability to verify backups and restore, even if it's a manual action (worst case)


Given these principles, the solution being implemented will use:

  • duplicity as the core backup engine (encrypted, incremental archives)
  • duply as a simple profile-based wrapper around duplicity
  • GnuPG for standardized encryption of the backup archives
  • python boto as the Google Cloud Storage / S3 backend library
  • fcron for scheduling on systems with irregular uptime
  • Google Cloud Storage with Durable Reduced Availability as the storage provider

More specifically, this article uses Arch Linux and Google Cloud Storage with Durable Reduced Availability storage to keep costs trivial (truly - under $5/mo, see the pricing) by excluding things like Downloads and Music. There are better ways to back up data like that; what you're really after are the irreplaceable things that matter (pictures, documents, configs, etc.).

All software is available in every distro (within reason); just replace the Arch pacman installs with yum/apt-get/emerge/etc. as appropriate. You may wish to consider using a custom PPA (ppa:duplicity-team/ppa) or similar source if your distribution's mainline version is too old.
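
For example, on a Debian/Ubuntu style system the equivalent install might look something like this (exact package names can vary by release):

# Debian/Ubuntu example - the PPA is only needed if the archive version is too old
# (add-apt-repository may require the software-properties package)
add-apt-repository ppa:duplicity-team/ppa
apt-get update
apt-get install duplicity duply python-boto gnupg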


Installation

These actions are typically performed as root or with sudo - adjust as needed.


fcron

Like anacron, fcron assumes the computer is not always running; unlike anacron, it can schedule events at intervals shorter than a single day, which is useful for systems that suspend/hibernate regularly. For an always-on system like a cloud server there's no real need to replace the standard cron package, however.

When replacing cronie with fcron, be aware that the spool directory is /var/spool/fcron and that the fcrontab command is used instead of crontab to edit the user crontabs. These crontabs are stored in a binary format, with the text version stored next to them as foo.orig in the spool directory. Any scripts which manually edit user crontabs may need to be adjusted due to this difference in behavior.

A quick scriptlet which will replace cronie and convert traditional user crontabs to fcron format:

# stop and disable the existing cron daemon
systemctl stop cronie; systemctl disable cronie

# install fcron
pacman -Sy; pacman -S fcron

# convert each existing user crontab to fcron format
cd /var/spool/cron && (
 for ctab in *; do
  fcrontab ${ctab} -u ${ctab}
 done
)

# enable and start fcron
systemctl start fcron; systemctl enable fcron


duplicity / boto / gnupg / duply

Duplicity has minimal dependencies for a Python application; there is no need to pip install a lot of extra modules. Many other backends are available, however we are only focusing on GCS and/or S3 here. If a duply package is not available in your distribution, it's a single bash script - visit the website and grab a copy, then place it somewhere handy (such as ~/bin/duply) as needed.

Duplicity 0.6.22 or newer is required for Google Cloud Storage

Arch

These are in the default repositories - most folks have gnupg already installed:

pacman -Sy; pacman -S duplicity python2-boto gnupg --noconfirm

Duply is currently in AUR - many folks use pacaur (which uses cower for the heavy lifting):

pacaur -S duply

CentOS

These are split between the main repositories and EPEL repo:

It's common for the CentOS/EPEL repositories to be behind in versions - I recommend downloading the SRPMs from EPEL and rebuilding new RPMs using the latest versions of the duplicity, duply and python-boto packages.

yum -y install duplicity python-boto duply gnupg


gsutil

While not strictly necessary, having gsutil installed and ready makes dealing with unexpected issues easy; for example, while working on your Excludes you may try several times until you get it perfect -- gsutil makes it easy to delete en masse or list bucket contents. Highly recommended - use a tools subdirectory for things like this:

[ ! -d ~/tools ] && mkdir ~/tools; cd ~/tools
wget https://storage.googleapis.com/pub/gsutil.tar.gz
tar -zxf gsutil.tar.gz; rm gsutil.tar.gz

# Arch default is python v3, gsutil needs python v2
sed -i.orig '1 s/python/python2/g' gsutil/gsutil


Setup

These actions are all performed as yourself, not root, for a typical backup of a home directory. This is where your usage pattern of a Linux system comes into play -- I personally have a ~/System/ directory where I copy any config change made outside my home directory (i.e. system level). Keep your home directory on a separate partition or Logical Volume and encrypt it with LUKS.

This makes your home directory the source of all evil, and keeps moving to a new machine, a fresh laptop install, or even a switch from one distro to another very contained - just copy the home directory over. It also makes backups trivial: back up your home directory; the rest is disposable and easier to re-install than to restore.

This same method can apply to a cloud server -- implement a methodology so that all your routine database dumps, git trees and so forth are parented under a single higher directory instead of scattered about the filesystem. When making edits to system-level files such as /etc/httpd/conf/httpd.conf, /etc/php.ini, /etc/my.cnf, cron tasks and so on, copy them (or just use symlinks) over to a collected location under one tree for easier backup and restore. This has the side effect of making your whole infrastructure portable to another server with little work.
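
As a hypothetical sketch of that collected-tree approach (the paths are illustrative only):

# illustrative only - mirror system-level configs under one backed-up tree
mkdir -p ~/System/etc/httpd/conf
cp /etc/httpd/conf/httpd.conf ~/System/etc/httpd/conf/
cp /etc/php.ini /etc/my.cnf ~/System/etc/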


Google Cloud Storage

This section is always subject to change as it depends on the Google web links in question - they frequently update and shift things around. So as an overview, our mission is to:

  1. Enable Google Cloud Storage for your Google account
  2. Set up a Project and attach billing (credit card) to it
  3. Enable Interoperable Access and generate Storage Access Keys
  4. Configure gsutil for random operations
  5. Set up a Bucket for each backup (i.e. laptop)

The first four are one-time only; the last one is repeated for each backupset you configure with duply/duplicity later. The setup generated by this section is then easily copied to another laptop for use on a second backup, etc.

Enable GCS

Log into your Google account, then visit https://cloud.google.com/products/cloud-storage/ and click the "Get Started" or "Go to my console" (or similar) link usually at the top and follow any instructions to get the basics set up. You may have to agree to Terms of Service and all that jazz - do the needful.

Create a Project

A Project is the higher-level umbrella where you will be billed for usage; as of this writing the URL is https://console.developers.google.com/project - click Create Project and choose a meta-level name like Backups (not a specific laptop name). A PROJECT ID will be displayed; jot that down in your notepad for use later.

On the left of Google's console, click Billing to connect the new Project to your credit card -- follow the instructions as appropriate.

Interoperable Access

This part can be the most confusing, as Google's webUI seems to be in flux a lot and the exact links change. As of this writing, the way to access the area:

  1. Click into your Project from the console
  2. On the left, click Storage then Cloud Storage
  3. Click Project Dashboard which opens a new tab

You're in a different UI at this point - on the left should be Google Cloud Storage with two sub-menus, Storage Access and Interoperable Access. Enable Interoperable Access, then generate new Keys and jot down both parts of the Key in the notepad (one is secret).

Configure gsutil

Use the gsutil tool to generate a default configuration:

cd ~/tools/gsutil
./gsutil config -a

This creates the ~/.boto file, which has a lot of comments. You need to insert the Project ID and Google keypair from the above steps. With the comments stripped out, here are the required portions - replace AAA, BBB and ZZZ with your data:

[Credentials]
gs_access_key_id = AAAAAAAAAAAAAAAAAAAA
gs_secret_access_key = BBBBBBBBBBBBBBBBBBBB
[Boto]
https_validate_certificates = True
[GSUtil]
content_language = en
default_api_version = 1
default_project_id = ZZZZZZZZZZZZ
[OAuth2]

Create a Bucket

A bucket is where the encrypted tarballs will actually be stored, so you want a Bucket for each backupset (system) you'll configure. A good choice might be the short name of the machine; "mylaptop" will be used here as an example. We'll enable Durable Reduced Availability on the bucket to save money as well. The gsutil tool can perform many actions; just run ./gsutil --help and investigate.

cd ~/tools/gsutil
./gsutil mb -c DRA gs://mylaptop/

Bucket names are global across all users; you may get an error if the name you chose is already in use.
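
A quick sanity check that the new bucket exists and is reachable (it should list as empty until the first backup runs):

cd ~/tools/gsutil
./gsutil ls gs://mylaptop/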


GnuPG

Generate a standard key specifically for use with your backups; because the passphrase will be stored in plaintext in the duply config in your home directory, create a new key with a unique password rather than reusing one of your existing keys. This keypair can then be copied to your other systems so all backups are encrypted with a common key.

Creating a key

Create a standard GPG key:

GnuPG 2.0 and earlier:

echo "pinentry-program /usr/bin/pinentry-curses" >> ~/.gnupg/gpg-agent.conf
GPG_AGENT_INFO=""; gpg --gen-key

GnuPG 2.1 and later:

echo "allow-loopback-pinentry" >> ~/.gnupg/gpg-agent.conf
gpg --gen-key

Creating a key requires entropy to be generated by the system. If this is a virtual instance (i.e. Virtualbox guest) consider installing rng-tools and starting the rngd daemon to provide the required entropy.
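
On Arch, for example, that would look something like this (the package is rng-tools and the service is rngd; names may differ on other distributions):

pacman -S rng-tools
systemctl start rngd; systemctl enable rngd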

Check the key is available:

$ gpg --list-keys QQQQQQQQ
pub   2048R/QQQQQQQQ 2014-07-18
uid       [ultimate] duply <duply@localhost>
sub   2048R/RRRRRRRR 2014-07-18

$ gpg --list-secret-keys QQQQQQQQ
sec   2048R/QQQQQQQQ 2014-07-18
uid                  duply <duply@localhost>
ssb   2048R/RRRRRRRR 2014-07-18

Migrating a key

If you are already using a key on another system, it can be exported and imported so that all your backups upstream are encrypted with the same key. First, export the public and private keys on the source and copy them to the new system:

gpg --export -a QQQQQQQQ > duply_public.asc
gpg --export-secret-keys -a QQQQQQQQ > duply_secret.asc
scp duply*.asc user@remote:

On the new device, import the keys:

gpg --import duply_public.asc
gpg --import duply_secret.asc

Finally, edit the key and set trust to Ultimate:

gpg --edit-key QQQQQQQQ

Command> trust
  [...]
  5 = I trust ultimately
Your decision? 5

Import Preferences

During import, if the secret key was created with a newer GPG than the version on the destination system, you may get a warning about preferences for unavailable algorithms and an offer to fix them - choose Yes:

$ gpg --import duply_secret.asc 
gpg: key QQQQQQQQ: secret key imported
gpg: key QQQQQQQQ: "duply <duply@localhost>" not changed
gpg: WARNING: key QQQQQQQQ contains preferences for unavailable algorithms on these user IDs:
gpg:          "duply <duply@localhost>": preference for cipher algorithm 1
gpg: it is strongly suggested that you update your preferences and
gpg: re-distribute this key to avoid potential algorithm mismatch problems

Set preference list to:
     Cipher: AES256, AES192, AES, CAST5, 3DES
     Digest: SHA256, SHA1, SHA384, SHA512, SHA224
     Compression: ZLIB, BZIP2, ZIP, Uncompressed
     Features: MDC, Keyserver no-modify
Really update the preferences? (y/N) y

For the curious, this usually means the newer version of GPG supports a cipher or digest that the older one needs to remove from the preference list; for example, a key created on the newer GPG lists the IDEA cipher:

Source GPG 2.0.26:

gpg> showpref
[ultimate] (1). duply <duply@localhost>
     Cipher: AES256, AES192, AES, CAST5, 3DES, IDEA
     Digest: SHA256, SHA1, SHA384, SHA512, SHA224
     Compression: ZLIB, BZIP2, ZIP, Uncompressed
     Features: MDC, Keyserver no-modify

Destination GPG 2.0.14:

Command> showpref
[ultimate] (1). duply <duply@localhost>
     Cipher: AES256, AES192, AES, CAST5, 3DES
     Digest: SHA256, SHA1, SHA384, SHA512, SHA224
     Compression: ZLIB, BZIP2, ZIP, Uncompressed
     Features: MDC, Keyserver no-modify


Duply

Generate a default configuration for the backupset - we'll use the same name as the laptop and Bucket, "mylaptop":

duply mylaptop create

This creates two files that need to be edited:

~/.duply/mylaptop/conf
~/.duply/mylaptop/exclude

There are two other files that can be configured, pre and post, which run commands before and after a duply backup. These are not created by default, but they might come in handy if you need to mount/umount a filesystem, dump a database, etc. as part of the process. See the duply documentation for further info.
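
As a hypothetical example (the database and dump path are made up), a pre script that dumps a database into the backed-up tree might look like:

#!/bin/bash
# ~/.duply/mylaptop/pre - runs before each backup of this profile
# illustrative only: dump all databases into the tree that gets backed up
mysqldump --all-databases > /home/CCCCCC/System/db/alldbs.sql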

conf

Very similar to the gsutil setup, you'll need to configure the GCS data in this file for storing your backups, as well as all the other settings as to what should be backed up, retention periods and so forth. This part is situationally dependent -- I choose to manage my own Full backups manually since they require over 9 hours to upload and I need to disable Suspend on my laptop. Given that, my configuration looks like so (without all the comments):

The use of GPG_OPTS='--pinentry-mode loopback' is required for GnuPG 2.1 and later, along with the 'allow-loopback-pinentry' setting in ~/.gnupg/gpg-agent.conf shown above. Failure to configure these will result in the passphrase not working in unattended mode.

With duply 1.10 and above, do not set TARGET_USER and TARGET_PASS in this config file - they are now configured elsewhere, see below.

GPG_KEY='XXXXXXXX'
GPG_PW='YYYYYYYYY'
GPG_OPTS='--pinentry-mode loopback'
TARGET='gs://mylaptop'
TARGET_USER='AAAAAAAAAAAAAAAAAAAA'
TARGET_PASS='BBBBBBBBBBBBBBBBBBBB'
SOURCE='/home/CCCCCC'
FILENAME='.duplicity-ignore'
DUPL_PARAMS="$DUPL_PARAMS --exclude-if-present '$FILENAME'"
MAX_AGE=2M
MAX_FULL_BACKUPS=2

...where you're obviously replacing AAA, BBB, CCC, XXX and YYY with your information as created above. This file should be mode 0600 so that only you can read it, as it contains both your GPG key password and GCS access keypair.
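
For example:

chmod 600 ~/.duply/mylaptop/conf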

exclude

Configure the exclude file to ignore things you do not want in the backup - it uses globs (wildcards) to make things a little easier. As an average MATE desktop user with the typical applications, here is a basic exclude file that tends to work as a good starting point:

- /home/*/Downloads
- /home/*/Misc
- /home/*/Movies
- /home/*/Music
- /home/*/VirtualBox**
- /home/*/abs
- /home/*/builds
- /home/*/tools/android-sdk-linux
- /home/*/tools/jdk**
- /home/*/.ICEauthority
- /home/*/.Xauthority
- /home/*/.adobe
- /home/*/.android/cache
- /home/*/.cache
- /home/*/.cddb*
- /home/*/.config/**metadata*
- /home/*/.config/*/sessions
- /home/*/.config/*session*
- /home/*/.config/VirtualBox
- /home/*/.config/libreoffice
- /home/*/.config/pulse
- /home/*/.gstreamer*
- /home/*/.hplip
- /home/*/.icons
- /home/*/.java/deployment
- /home/*/.java/fonts
- /home/*/.local/share/gvfs-metadata
- /home/*/.local/share/icons
- /home/*/.macromedia
- /home/*/.mozilla/firefox/Crash**
- /home/*/.mozilla/firefox/*/storage
- /home/*/.purple/icons
- /home/*/.thumbnails
- /home/*/.thunderbird/**.msf
- /home/*/.thunderbird/Crash**
- /home/*/.xsession-errors*

Everyone will have a slight variation on this file; adjust as needed. It's a bit difficult to get globbing to work right with dot-files, so I tend to just avoid that specific pattern.
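
Also remember the FILENAME / --exclude-if-present setting in conf above: any directory containing a .duplicity-ignore marker file is skipped entirely, which is handy for one-off exclusions (the path here is illustrative):

# skip a scratch directory without touching the exclude file
touch ~/scratch/.duplicity-ignore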


Backup

This part is dead simple - just run duply with the name of the backupset; it will detect that this is the first run and trigger a full backup. Be sure to use screen, disable suspend/hibernate, etc. if your backup is going to take a really long time:

duply mylaptop backup

With duply 1.10 and above, you must first export the environment variables containing your Google Cloud Storage API credentials; see the "Scripted Backups" section below for what's needed.

You may wish to increase verbosity and/or add --dry-run to the conf file the first time to ensure what you think is happening is actually happening. The default conf generated has the options and instructions present to set those up. For normal usage just use the default verbosity level.
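
As a quick sketch, temporarily adding something like the following to conf (and removing it once you're satisfied) logs more detail and makes a trial run without uploading anything:

# temporary settings for a trial run - remove when satisfied
VERBOSITY=8
DUPL_PARAMS="$DUPL_PARAMS --dry-run"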

From this point, every time you run duply mylaptop backup it will detect the existing full backup and run incrementals instead; how much and how long it takes depends on your changeset each time. You can also force a full or incremental by using full or incr instead of backup as the command. The bkp action will skip the pre/post files execution.
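
For example:

duply mylaptop full    # force a full backup
duply mylaptop incr    # force an incremental backup
duply mylaptop bkp     # back up without running the pre/post scripts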


Cleanup

Duply has several actions to help keep track of your backups - verify to show changed local files since the backup, status to list the upstream full and incremental statistics, and various forms of purge to flush older backups. Keep in mind that you cannot purge incrementals until a full backup supersedes them, so there's a bit of an art to knowing when you should generate a new full backup.

For a laptop it's probably sufficient to perform a full backup once a month (or less) and roll with the incrementals, unless you have massive amounts of change. For a scenario where you're backing up databases and such it might make sense to perform full backups weekly, since the size of your incrementals will grow rapidly and consume space. After a full backup, purge the incrementals older than that full backup to save space.

duply mylaptop status
duply mylaptop full
duply mylaptop purge
duply mylaptop purge --force

The purge actions will respect the settings in conf as outlined above. From time to time you may wish to use the cleanup action, which attempts to find orphaned backup pieces, and the fetch action to spot-test restoring files from the online backups to ensure everything is working as intended.
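
For example (the fetched path is illustrative):

duply mylaptop cleanup --force
# restore a single file to /tmp as a spot check
duply mylaptop fetch Documents/notes.txt /tmp/notes.txt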


Scripted Backups

Create a small shell script that will be run via fcron to perform the backup, check status and email the results to yourself from the saved logfile. I use a very basic script with mailx (the standard commandline mail):

#!/bin/bash

MAILTO="me@mydomain.com"
LOGDIR="/home/CCCCCC/.logs"
TIMESTAMP=$(date +%Y-%m-%d_%H%M)
MAILSUB="mylaptop backup report: ${TIMESTAMP}"
LOGFILE="${LOGDIR}/duply_${TIMESTAMP}.log"

# Duply 1.10+ requires ENV vars
export GS_ACCESS_KEY_ID='<my API user, the old TARGET_USER in duply>'
export GS_SECRET_ACCESS_KEY='<my secret key, the old TARGET_PASS in duply>'

echo "" >> ${LOGFILE}
duply mylaptop backup 1>>${LOGFILE} 2>&1
duply mylaptop status 1>>${LOGFILE} 2>&1
echo "" >> ${LOGFILE}

cat ${LOGFILE} | mail -s "${MAILSUB}" ${MAILTO}
find "${LOGDIR}" -type f -mtime +30 -delete
exit 0


Scheduled Backups

Insert the above script into your fcrontab using the %daily keyword; with 0 in the minutes field and * in the hours, fcron will run any missed job at the next hour after the system comes back online:

%daily,mail(no) 0 * /home/CCCCCC/bin/mylaptopduply.sh

See the fcrontab(5) man page for more information.


Manual Recovery

The encrypted backup files are uploaded in chunks of 25M by default (configurable in the duply backupset's conf file); these files can be downloaded with a web browser from GCS, decrypted and untarred manually in the worst-case scenario. The GPG key is still required, so be sure your ~/.gnupg keychain is backed up in some fashion outside of your duply/duplicity backups. Assuming the GPG key used to encrypt is still available:

  1. Go to the GCS console in a web browser
  2. Click the Project, Storage, Cloud Storage, Storage Browser
  3. Click into the Bucket, find a file you think might have what you need
  4. Download the file to your local system

Once downloaded, decrypt it (you will be prompted for the GPG secret key password) and untar:

gpg -d duplicity-inc.20140802T010001Z.to.20140802T170542Z.vol1.difftar.gpg > recover.tar
tar -xf recover.tar

This process is really only for recovery in a disaster; since the files upstream are chunked tarballs, it's rather hit-or-miss knowing which backup file might contain the specific file you need. There are manifests and signatures upstream as well, so downloading and perusing those first might help.

gpg -d duplicity-inc.20140802T010001Z.to.20140802T170542Z.manifest.gpg > recover.manifest
less recover.manifest

It would definitely be quicker to download all the manifests first, decrypt them, then grep them to find the target; then you can download just the backup file in question. If your gsutil is working, it can be used instead of a browser:

mkdir ~/recover; cd ~/recover
../tools/gsutil/gsutil cp gs://mylaptop/*.manifest.gpg .
for ii in *.gpg; do gpg -d "${ii}" > "${ii%%.gpg}"; done;
grep "some filename" *.manifest

It would be easier (and quicker) to set up another Linux instance and use duply/duplicity to recover the data properly, but the option is there to do it all by hand.


Backend Portability

An extension of the Manual Recovery approach is to copy all GCS files down to a local path using gsutil (or a web browser), copy/upload them to a filesystem or different provider, then reconfigure your duply config to use the new backend without losing your existing backups. This might even be used just to download a copy of everything onto a USB drive kept in a fireproof safe.

Given the duplicity gpg-tarball storage design, your solution is upstream provider independent - the backup files can be ported from one backend to another with a bit of scripting and elbow grease. This could also be leveraged to keep a backup on different providers at the same time or use different backends for short vs. long term storage.
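
A rough sketch of such a copy using gsutil (the destination path and new TARGET value are illustrative only):

# pull everything in the bucket down to local/removable storage
mkdir -p /mnt/usb/mylaptop-backup
~/tools/gsutil/gsutil -m cp gs://mylaptop/* /mnt/usb/mylaptop-backup/

# then point the duply profile at the new location in ~/.duply/mylaptop/conf, e.g.:
#   TARGET='file:///mnt/usb/mylaptop-backup'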

