Using HPSSPy

Introduction

The primary command-line interface to HPSSPy is the script missing_from_hpss, which is automatically generated by the package install process. If you need to generate this script manually, it is equivalent to:

#!/usr/bin/env python
from sys import exit
from hpsspy.scan import main
exit(main())

Options

There are several command-line options; missing_from_hpss --help will display all of them. Only the short forms of the options are shown here.

-c DIR

Cache files (described below) are written to $HOME/cache by default. This option allows the user to choose any directory.

-D

Delete and recreate the disk cache file (described below).

-E

Exit if an error is detected while processing files on disk or on HPSS.

-H

Delete and recreate the HPSS cache file (described below).

-l N

Limit archive files to this size in GB. The default is 1024 GB (1 TB).

-p

Issue the HPSS commands needed to back up the files that were found to be missing from HPSS.

-r N

Issue a progress report on how many files have been analyzed after N files (default 10,000).

-t

Test mode. Try not to make any changes. Also pretend that there are no files backed up to HPSS.

-v

Print lots of extra information.

--version

Print a version string and exit.

Besides the options described above, missing_from_hpss requires two positional arguments:

missing_from_hpss config.json section

The two arguments are the path to a configuration file and the section of that file to process. Both are described in detail in the configuration document.
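When the scan is launched from another Python program, the command line can be assembled from the options listed above. The following is a minimal sketch, not part of HPSSPy: build_command and run_scan are hypothetical helper names, and the sketch assumes missing_from_hpss is installed and on the PATH. It uses only the -t, -v and -c options and the two positional arguments described above.

```python
import subprocess


def build_command(config, section, cache_dir=None, test=True, verbose=False):
    """Assemble an argument list for missing_from_hpss (hypothetical helper)."""
    command = ["missing_from_hpss"]
    if test:
        command.append("-t")  # test mode
    if verbose:
        command.append("-v")  # print extra information
    if cache_dir is not None:
        command.extend(["-c", cache_dir])  # alternate cache directory
    # The two required positional arguments come last.
    command.extend([config, section])
    return command


def run_scan(config, section, **kwargs):
    """Run the scan and return its exit status (requires an installed HPSSPy)."""
    return subprocess.run(build_command(config, section, **kwargs)).returncode


print(build_command("config.json", "section"))
```

Test mode is the default in this sketch so that an accidental call does not alter anything on HPSS.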

Cache Files

missing_from_hpss uses a few cache files primarily to reduce memory footprint. These files will be stored in $HOME/cache by default. The files are:

Disk Cache

A CSV file of the form disk_cache_<section>.csv, where <section> is the section (as defined above) specified on the command-line. The columns are file name, file size in bytes and modification time.

HPSS Cache

A CSV file of the form hpss_cache_<section>.csv, where <section> is the section (as defined above) specified on the command-line. The columns are file name, file size in bytes and modification time.

Missing File Cache

A JSON file of the form missing_files_<section>.json, where <section> is the section (as defined above) specified on the command line. It maps each HPSS archive file to the files that belong in that archive. It also records the size of each resulting archive file, not counting the small overhead added by the archive creation process.

These files are not cleaned up by default because they are very useful for debugging purposes.
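Because the cache files are plain CSV and JSON, they can be inspected with the Python standard library when debugging. The following is a minimal sketch, not part of HPSSPy. The helper names read_cache and summarize_missing are hypothetical, the sample file names are invented, and the layout details are assumptions: a possible header row in the CSV files, and a JSON layout in which each archive name maps to an object with "files" and "size" keys. Check a real cache file before relying on either.

```python
import csv
import io
import json


def read_cache(lines):
    """Parse disk/HPSS cache rows into {file name: (size in bytes, mtime)}.

    Assumes three columns per row; a row whose size column is not an
    integer (for example a header row) is skipped.
    """
    entries = {}
    for row in csv.reader(lines):
        if len(row) != 3:
            continue
        name, size, mtime = row
        if not size.strip().isdigit():
            continue
        # The modification time is kept as text because its exact format
        # is not specified here.
        entries[name] = (int(size), mtime)
    return entries


def summarize_missing(missing):
    """Return {archive file: (number of member files, recorded size)}.

    ASSUMED layout: {"archive.tar": {"files": [...], "size": N}}.
    """
    return {archive: (len(entry["files"]), entry["size"])
            for archive, entry in missing.items()}


# Invented sample data standing in for disk_cache_<section>.csv.
disk_csv = io.StringIO(
    "Name,Size,Mtime\n"
    "images/a.fits,1024,1500000000\n"
    "images/b.fits,2048,1500000100\n"
)
disk = read_cache(disk_csv)
total_bytes = sum(size for size, _ in disk.values())

# Invented sample data standing in for missing_files_<section>.json.
missing = json.loads(
    '{"images.tar": {"files": ["images/a.fits", "images/b.fits"], "size": 3072}}'
)
summary = summarize_missing(missing)
print(total_bytes, summary)
```

For a real cache file, pass an open file object instead of the sample text, for example open(path, newline="") for the CSV files.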

Testing and Quality Assurance

To test a configuration file, run missing_from_hpss with the --test option (short form -t, described above). Aside from creating cache files in the cache directory, this mode will not alter any data, either on disk or on HPSS.

In addition to validating JSON files and regular expressions, as described in the configuration document, missing_from_hpss will:

  1. Make sure all regular expressions are actually used.

  2. Make sure all files actually match one and only one regular expression.

  3. Create a manifest file listing the files on disk that were matched and the archive file each one maps to. This is the same file as the “Missing File Cache” described above.

  4. Make sure that every archive file is smaller than a user-defined size limit (default 1 TB), which can be set with the -l option.
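The checks listed above can be illustrated with a short, self-contained sketch. This is not HPSSPy's implementation: check_mapping and oversized are hypothetical names, the patterns and file names are invented, and the sketch assumes that patterns are matched from the start of the file name with re.match and that 1 GB means 1024**3 bytes.

```python
import re


def check_mapping(patterns, files):
    """Map files to archives and report problems.

    patterns: {regular expression: archive file name}
    files: iterable of file names
    Returns (mapping, unused, unmatched, multiple).
    """
    compiled = {p: re.compile(p) for p in patterns}
    used = set()
    mapping = {}      # archive file -> list of member files (the manifest)
    unmatched = []    # files that match no expression
    multiple = []     # files that match more than one expression
    for name in files:
        hits = [p for p, regex in compiled.items() if regex.match(name)]
        used.update(hits)
        if len(hits) == 1:
            mapping.setdefault(patterns[hits[0]], []).append(name)
        elif hits:
            multiple.append(name)
        else:
            unmatched.append(name)
    unused = sorted(set(patterns) - used)  # expressions that matched nothing
    return mapping, unused, unmatched, multiple


def oversized(mapping, sizes, limit_gb=1024):
    """Return archives whose member files total more than limit_gb."""
    limit = limit_gb * 1024**3  # ASSUMPTION: binary gigabytes
    return sorted(archive for archive, members in mapping.items()
                  if sum(sizes[m] for m in members) > limit)


# Invented example data.
patterns = {
    r"images/.*\.fits$": "images.tar",
    r"logs/.*\.log$": "logs.tar",
    r"spectra/.*$": "spectra.tar",
}
sizes = {"images/a.fits": 1024, "images/b.fits": 2048,
         "logs/run.log": 10, "README.txt": 5}
mapping, unused, unmatched, multiple = check_mapping(patterns, sizes)
print(mapping, unused, unmatched, multiple, oversized(mapping, sizes))
```

In this example the spectra pattern is reported as unused and README.txt as unmatched, which are the kinds of problems a test run is meant to reveal before any data is sent to HPSS.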

HPSSPy Library

For programmatic access to HPSS, the HPSSPy library provides equivalents of os and os.path that operate on the HPSS filesystem.