Welcome to HPSSPy’s documentation!¶
Introduction¶
HPSSPy is a Python package for interacting with the HPSS tape storage system at NERSC. It is currently being developed on GitHub.
Requirements¶
HPSSPy assumes that the HPSS utilities hsi and htar are installed. As of 2023, these utilities are only available within the NERSC environment.
HPSSPy expects these utilities to exist in the directory ${HPSS_DIR}/bin, so be sure the environment variable HPSS_DIR is defined.
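A quick pre-flight check of this requirement can be sketched in Python. This is only an illustration: the helper name find_hpss_tools is hypothetical and is not part of HPSSPy; it simply looks for hsi and htar under ${HPSS_DIR}/bin.

```python
import os
from pathlib import Path


def find_hpss_tools(environ=os.environ, tools=("hsi", "htar")):
    """Return a dict mapping each tool name to its expected path.

    Hypothetical helper, not part of HPSSPy. Raises KeyError if HPSS_DIR
    is undefined and FileNotFoundError if a tool is missing.
    """
    bin_dir = Path(environ["HPSS_DIR"]) / "bin"  # KeyError if undefined
    found = {}
    for tool in tools:
        candidate = bin_dir / tool
        if not candidate.is_file():
            raise FileNotFoundError(f"{candidate} does not exist")
        found[tool] = str(candidate)
    return found
```

Calling find_hpss_tools() before anything else fails early, with a clear message, on systems where the HPSS utilities are not available.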
Contents¶
HPSSPy Configuration¶
Introduction¶
The primary HPSSPy command-line program, missing_from_hpss, is configured with a JSON file. Both the JSON standard and the Python json library are very strict. However, there is a quick way to check the validity of a JSON file:
python -c 'import json; j = open("config.json"); data = json.load(j); j.close()'
where "config.json" should be replaced with the name of the file to be tested.
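The same check can be written as a small function that also reports where parsing failed. This is a minimal sketch; the name check_json is hypothetical and is not part of HPSSPy.

```python
import json


def check_json(filename):
    """Return (True, None) if filename parses as JSON, else (False, message)."""
    try:
        with open(filename) as j:
            json.load(j)
    except json.JSONDecodeError as e:
        # lineno and colno point at the location of the first syntax error.
        return False, f"{filename}:{e.lineno}:{e.colno}: {e.msg}"
    return True, None
```

A trailing comma or a single-quoted string, both common mistakes in hand-edited files, is reported with the line and column of the problem.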
The top-level JSON container should be an “object”, equivalent to a Python dict. The simplest possible file that satisfies this requirement is:
{
}
Obviously, that’s not very much to go on. You will need further data described below.
Metadata¶
The configuration file should contain a top-level keyword "__config__". The value should itself be a dict, containing some important metadata:
{
    "__config__": {
        "root": "/global/project/projectdirs/my_project",
        "hpss_root": "/nersc/projects/my_project",
        "physical_disks": ["my_project"]
    }
}
- root
The directory that contains all the data associated with the project.
- hpss_root
The path on the HPSS tape system that will contain the backups.
- physical_disks
If the data are spread across several physical disks and linked into the root path via symlinks, the various physical disks need to be listed here. If the value is equivalent to False, e.g., null, false, or [], this means that the "root" disk contains all the physical data. If the value is equivalent to a one-item list containing os.path.basename(root), then this also means that the "root" disk contains all the physical data. A list of simple names generates the physical disks by substitution on the basename of the "root" value. More complicated configurations are possible; see hpsspy.scan.physical_disks().
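The simple cases of "physical_disks" described above can be restated as code. This is an illustrative sketch only, not the implementation of hpsspy.scan.physical_disks(); the function name and the disk names my_project_01 and my_project_02 are hypothetical.

```python
from os.path import basename


def sketch_physical_disks(root, physical_disks):
    """Illustrative only: expand the simple "physical_disks" cases."""
    name = basename(root)
    # null, false and [] all mean: everything lives on the root disk.
    if not physical_disks:
        return [root]
    # A one-item list naming the root's own basename means the same thing.
    if list(physical_disks) == [name]:
        return [root]
    # Otherwise substitute each simple name for the basename of root.
    parent = root[: len(root) - len(name)]
    return [parent + disk for disk in physical_disks]


root = "/global/project/projectdirs/my_project"
print(sketch_physical_disks(root, ["my_project_01", "my_project_02"]))
```

The more complicated configurations mentioned above are deliberately not covered by this sketch.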
Sections¶
Inside the root directory, as described above, there may be several top-level directories. For the purposes of this documentation, these are called “sections” or “releases”. The terms are interchangeable. Each section has configuration items that describe its structure:
{
    "__config__": {
        "root": "/projects/my_project",
        "hpss_root": "/hpss/projects/my_project",
        "physical_disks": ["my_project"]
    },
    "data": {
        "__exclude__": [],
        "d1": {
            "d1/batch/.*$": "d1/batch.tar",
            "d1/([^/]+\\.txt)$": "d1/\\1",
            "d1/templates/[^/]+$": "d1/templates/templates_files.tar"
        }
    }
}
The missing_from_hpss command works on one section at a time. The name of the section is passed on the command-line:
missing_from_hpss config.json data
This would read the "data" section above.
Each section should have an "__exclude__" keyword, whose value is a list of files to be ignored. In the example above, in order to ignore the file /projects/my_project/data/d1/README.html, the "__exclude__" value would be ["d1/README.html"]. Note that this is relative to the path /projects/my_project/data, since "data" is the section being processed. Generally, this should only be used for a handful of top-level files, like README files. For more precise exclusion, see the "EXCLUDE" statement below.
In the special case where a section contains only files, and no subdirectories, the special pseudo-subdirectory "__top__" can be used to contain the configuration.
Mapping File Names to HPSS Archives¶
Within a section, each immediate subdirectory should be described with a keyword in the configuration file. If a subdirectory is not described, missing_from_hpss will complain, but this will not necessarily cause it to fail. In the example above, /projects/my_project/data/d1 is configured.
There are many possible ways to bundle files for archiving. Generally you want to make archives as large as possible, without spilling onto multiple tapes. However, with highly structured, deeply-nested directory structures, this isn’t always the best way to do it from a data retrieval viewpoint.
Consider this scenario. /projects/my_project/data has been archived to ten tape archives called data00.tar, data01.tar, … data09.tar. The file /projects/my_project/data/d1/templates/d1_template_05.fits needs to be recovered. Which tape archive contains it?
Now consider the scenario where the files in /projects/my_project/data/d1/templates have been archived to /hpss/projects/my_project/data/d1/templates/d1_templates_files.tar. Now is it easier to recover the file?
One should still try to make archives as big as possible, but generally speaking, long-term archiving of large, complex data sets should be done by someone who actually knows the structure of the data set.
In coding terms we describe a portion of a directory tree hierarchy using regular expressions to match files in that portion. Then we map files that match that regular expression to tape archive files.
Finally, it should be noted that the configuration of each section is organized by subdirectory in order to speed up the process of mapping files to backup files. Instead of looking through every possible configuration of files, only the configurations in a subdirectory need to be considered when examining files in that subdirectory.
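The mapping idea can be sketched with Python's re module, using the "d1" block from the example configuration. This is a simplified illustration, not the code used by missing_from_hpss; the helper map_file and the sample file names are hypothetical.

```python
import json
import re

# The "d1" block from the example configuration, as JSON text.
CONFIG_TEXT = r'''
{
    "d1/batch/.*$": "d1/batch.tar",
    "d1/([^/]+\\.txt)$": "d1/\\1",
    "d1/templates/[^/]+$": "d1/templates/templates_files.tar"
}
'''


def map_file(filename, mapping):
    """Return the archive name for filename, or None if nothing matches."""
    for pattern, target in mapping.items():
        m = re.match(pattern, filename)
        if m is not None:
            # Expand \1, \2, ... in the archive name from the matched groups.
            return m.expand(target)
    return None


mapping = json.loads(CONFIG_TEXT)
for name in ("d1/batch/a/b.dat", "d1/notes.txt", "d1/templates/t05.fits"):
    print(name, "->", map_file(name, mapping))
```

Every file under d1/batch maps to the same archive, d1/batch.tar, while each TXT file directly in d1 maps to a copy of itself.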
Regular Expression Details¶
The HPSSPy package and missing_from_hpss will validate the regular expressions used in the configuration file, in addition to checking the overall validity of the JSON file itself. That is, a bad regular expression will be rejected before it has any chance to “touch” any real data.
The regular expressions should follow Python’s conventions, described in re. In addition to those conventions, this package imposes some additional requirements, conventions and idioms:
Requirements
- Backslashes must be escaped in JSON files. For example, the metacharacter \d (match a single decimal digit) becomes \\d.
- Regular expressions should end with the end-of-line marker $.
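Both requirements can be checked with a few lines of Python. This is a sketch of the kind of validation described above, not the package's own validator; check_pattern and the d1/run pattern are hypothetical.

```python
import json
import re


def check_pattern(pattern):
    """Return a list of problems found in one regular expression string."""
    problems = []
    if not pattern.endswith("$"):
        problems.append("does not end with $")
    try:
        re.compile(pattern)
    except re.error as e:
        problems.append(f"does not compile: {e}")
    return problems


# In the JSON text the backslash is doubled; after parsing it is single.
text = r'{"d1/run\\d+/.*$": "d1/runs.tar"}'
(pattern,) = json.loads(text)  # unpack the single key
print(pattern, check_pattern(pattern))
```

After json.loads() the key contains a single backslash, which is what the re module expects.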
Conventions
- Any archive file name ending in .tar is assumed to be an HTAR file, and that command will be used to construct it.
- Any archive file not ending in .tar will simply be copied to HPSS as is.
- The special string "EXCLUDE" can be used to prevent backups of parts of a directory tree that might otherwise be archived. For example, "d1/data/preproc/.*$" : "EXCLUDE" would prevent the preproc directory from being backed up, even if other parts of d1/data were configured for backup.
- The special string "AUTOMATED" behaves the same way as "EXCLUDE", but is a human-readable way to denote data sets that are backed up by automation independently of missing_from_hpss, as opposed to not being backed up at all.
- When constructing an archive file, missing_from_hpss will obtain the directory it needs to archive from the name of the archive file, not the regular expression itself. This is because regular expression substitution is performed on the archive file name. For example, batch.tar means “archive a batch/ directory”. For longer file names, any “prefix” of the file name will be stripped off, and the “suffix” of the file will be used. For example, d1/data_d1_batch.tar also means “archive a batch/ directory”, because data_d1_ is recognized as a prefix and stripped off. In particular, this allows directory names to contain underscores.
- An archive filename that ends with _files.tar, e.g. foo/bar_files.tar, is a signal to missing_from_hpss to construct the archive file in a certain way, not by descending into a directory, but by constructing an explicit list of files and building an archive file out of that.
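The conventions above amount to a small decision table on the archive name. The sketch below restates that table; the function classify_target and its return labels are hypothetical and do not appear in HPSSPy.

```python
def classify_target(target):
    """Return a short label for how an archive target would be handled."""
    if target in ("EXCLUDE", "AUTOMATED"):
        return "skip"            # not backed up by missing_from_hpss
    if target.endswith("_files.tar"):
        return "htar-file-list"  # HTAR built from an explicit list of files
    if target.endswith(".tar"):
        return "htar-directory"  # HTAR built by descending into a directory
    return "copy"                # copied to HPSS as is


for target in ("d1/batch.tar", "foo/bar_files.tar", "d1/README.txt", "EXCLUDE"):
    print(target, "->", classify_target(target))
```

The _files.tar test has to come before the plain .tar test, because every name ending in _files.tar also ends in .tar.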
Idioms
- Archive the entire contents of a directory into a single file: "foo/.*$" : "foo.tar".
- Archive several subdirectories of a directory, each into their own file: "foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar". The name of the directory matched in parentheses will be substituted into the file name.
- Archive arbitrary subdirectories of a set of subdirectories: "d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$" : "d1/foo/\\1/d1_foo_\\1_\\2.tar".
- Match files in a directory, but not any files in any subdirectory: "foo/[^/]+$" : "foo_files.tar". See also the _files.tar convention mentioned above.
- Group some but not all subdirectories in a directory into a single archive file for efficiency: "foo/([0-9])([0-9][0-9])/.*$" : "foo/foo_\\1XX.tar". Note the ending of the archive file, and that the directories have to have a very uniform naming convention (three and only three digits in this example). Also, the placeholder X needs to be at the end of the file name.
- Do not create an archive file, just copy the file, as is, to HPSS: "d1/README\\.txt$" : "d1/README.txt". Similarly, for a set of TXT files: "d1/([^/]+\\.txt)$" : "d1/\\1".
- An example with lots of substitutions: "d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar".
Finally, for truly monumentally-complicated directory trees, there is a JSON file included with this distribution describing the SDSS data tree that can be used for examples. To view the equivalent files and directories for section "dr12", for example, visit https://data.sdss.org/sas/dr12.
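Two of the idioms listed earlier, grouping by leading digit and multiple substitutions, can be tried out directly with the re module before they go into a configuration file. The sketch below uses Python raw strings, so the backslashes are single rather than doubled as in JSON; the helper archive_for and the sample file names are hypothetical.

```python
import re

# (pattern, archive name) pairs taken from the idioms, written as raw strings.
IDIOMS = [
    (r"foo/([0-9])([0-9][0-9])/.*$", r"foo/foo_\1XX.tar"),
    (r"d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$",
     r"d1/foo/\1/spectra-\2/\1_spectra-\2_\3.tar"),
]


def archive_for(filename):
    """Return the archive name for filename, or None if no idiom matches."""
    for pattern, target in IDIOMS:
        m = re.match(pattern, filename)
        if m:
            return m.expand(target)
    return None


for name in ("foo/123/a.dat", "foo/199/b.dat", "foo/200/c.dat",
             "d1/foo/v5_7/sub-0123/55555/spec.fits"):
    print(name, "->", archive_for(name))
```

Directories 123 and 199 land in the same archive, foo/foo_1XX.tar, which is the point of the grouping idiom.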
Using HPSSPy¶
Introduction¶
The primary command-line interface to HPSSPy is the script missing_from_hpss, which is automatically generated by the package install process. If you need to generate this script manually, it is equivalent to:
#!/usr/bin/env python
from sys import exit
from hpsspy.scan import main
exit(main())
Options¶
There are several command-line options. missing_from_hpss --help will display all of them. Just the short versions of the options are shown here.
- -c DIR
Cache files (described below) are written to $HOME/cache by default. This option allows the user to choose any directory.
- -D
Delete and recreate the disk cache file (described below).
- -E
Exit if an error is detected while processing files on disk or on HPSS.
- -H
Delete and recreate the HPSS cache file (described below).
- -l N
Limit archive files to this size in GB. The default is 1024 GB (1 TB).
- -p
Issue the HPSS commands necessary to actually back up the files found that need to be backed up.
- -r N
Issue a progress report on how many files have been analyzed after N files (default 10,000).
- -t
Test mode. Try not to make any changes. Also pretend that there are no files backed up to HPSS.
- -v
Print lots of extra information.
- --version
Print a version string and exit.
Besides the options described above, missing_from_hpss requires two positional arguments:
missing_from_hpss config.json section
The two arguments are the path to a configuration file and a section of that file to process. These are extensively described in the configuration document.
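When missing_from_hpss is driven from another Python program, the options and positional arguments can be assembled as a list. This is a sketch only: build_command is a hypothetical helper, and it uses just the -t and -v options described above.

```python
import shutil
import subprocess


def build_command(config, section, test=True, verbose=False):
    """Assemble an argument list for missing_from_hpss."""
    cmd = ["missing_from_hpss"]
    if test:
        cmd.append("-t")   # test mode: try not to make any changes
    if verbose:
        cmd.append("-v")   # print lots of extra information
    cmd += [config, section]
    return cmd


cmd = build_command("config.json", "data", test=True, verbose=True)
if shutil.which(cmd[0]) is not None:   # only run where HPSSPy is installed
    subprocess.run(cmd, check=False)
else:
    print("would run:", " ".join(cmd))
```

Defaulting to test mode is a deliberate choice in this sketch, so that a first run of a new configuration does not change anything on HPSS.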
Cache Files¶
missing_from_hpss uses a few cache files primarily to reduce memory footprint. These files will be stored in $HOME/cache by default. The files are:
- Disk Cache
A CSV file of the form disk_cache_<section>.csv, where <section> is the section (as defined above) specified on the command-line. The columns are file name, file size in bytes and modification time.
- HPSS Cache
A CSV file of the form hpss_cache_<section>.csv, where <section> is the section (as defined above) specified on the command-line. The columns are file name, file size in bytes and modification time.
- Missing File Cache
A JSON file of the form missing_files_<section>.json, where <section> is the section (as defined above) specified on the command-line. It contains a map of HPSS archive files to the files that belong in that archive. In addition, the size of the resulting files (modulo small overheads from the archive file creation process) will be saved to this file.
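Because the disk and HPSS caches are plain CSV, they are easy to inspect when debugging. The sketch below sums file sizes from a made-up disk cache; the header row, the column names Name, Size and Mtime, and the sample rows are all assumptions and should be checked against a real cache file.

```python
import csv
import io

# Hypothetical disk cache contents; the header row is an assumption.
SAMPLE = """Name,Size,Mtime
d1/batch/a.dat,1024,1500000000
d1/batch/b.dat,2048,1500000100
d1/README.txt,100,1500000200
"""


def total_bytes(handle, prefix=""):
    """Sum the sizes of cached files whose names start with prefix."""
    total = 0
    for row in csv.DictReader(handle):
        if row["Name"].startswith(prefix):
            total += int(row["Size"])
    return total


print(total_bytes(io.StringIO(SAMPLE), "d1/batch/"))
```

With a real cache, the io.StringIO object would be replaced by an open file handle on disk_cache_<section>.csv.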
These files are not cleaned up by default because they are very useful for debugging purposes.
Testing and Quality Assurance¶
To test a configuration file, just run missing_from_hpss with the --test option as described above. Aside from creating cache files in a directory as described above, this mode will not alter any data, either on disk or on HPSS.
In addition to validating JSON files and regular expressions, as described in the configuration document, missing_from_hpss will:
Make sure all regular expressions are actually used.
Make sure all files actually match one and only one regular expression.
Create a manifest file containing the actual files on disk matched and the archive file they map to. This is one and the same as the “Missing File Cache” described above.
Make sure that all archive file sizes are less than a user-defined limit (default 1 TB), configurable on the command-line.
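The first two checks can be sketched in a few lines: every pattern should match at least one file, and every file should match exactly one pattern. This is an illustration of the idea, not the code in hpsspy.scan; the function audit and the sample patterns and file names are hypothetical.

```python
import re


def audit(patterns, filenames):
    """Return (unused_patterns, unmatched_files, multiply_matched_files)."""
    compiled = [re.compile(p) for p in patterns]
    used = set()
    unmatched, multiple = [], []
    for name in filenames:
        hits = [r.pattern for r in compiled if r.match(name)]
        used.update(hits)
        if not hits:
            unmatched.append(name)
        elif len(hits) > 1:
            multiple.append(name)
    unused = [p for p in patterns if p not in used]
    return unused, unmatched, multiple


patterns = [r"d1/batch/.*$", r"d1/([^/]+\.txt)$", r"d1/.*\.txt$", r"d2/.*$"]
files = ["d1/batch/a.dat", "d1/notes.txt", "d1/other/x.fits"]
print(audit(patterns, files))
```

In this sample, d2/.*$ is never used, d1/other/x.fits matches nothing, and d1/notes.txt is claimed by two patterns.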
HPSSPy Library¶
For programmatic access to HPSS, the HPSSPy library provides equivalents of os and os.path that operate on the HPSS filesystem.
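As a rough illustration of how the library might be used, the sketch below separates a directory listing into subdirectories and files using the name and isdir properties documented for HpssFile. The helper split_listing is hypothetical, and stand-in objects replace real HpssFile instances so that the example runs without hsi.

```python
from types import SimpleNamespace


def split_listing(listing):
    """Split HpssFile-like objects into directory names and file names."""
    dirs = [f.name for f in listing if f.isdir]
    files = [f.name for f in listing if not f.isdir]
    return dirs, files


# Stand-ins for HpssFile objects so the sketch runs without hsi.
fake = [SimpleNamespace(name="d1", isdir=True),
        SimpleNamespace(name="README.txt", isdir=False)]
print(split_listing(fake))

# At NERSC, with HPSS_DIR defined, the real call would be along the lines of:
# import hpsspy.os
# split_listing(hpsspy.os.listdir("/nersc/projects/my_project"))
```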
HPSSPy API¶
hpsspy¶
Python interface to the HPSS system.
hpsspy.os¶
Reproduces some features of the Python built-in os.
- hpsspy.os.chmod(path, mode)[source]¶
Reproduces the behavior of os.chmod() for HPSS files.
- Parameters:
- Raises:
HpssOSError – If the underlying hsi reports an error.
- hpsspy.os.listdir(path)[source]¶
List the contents of an HPSS directory, similar to os.listdir().
- Parameters:
path (str) – Directory to examine.
- Returns:
A list of HpssFile objects.
- Return type:
- Raises:
HpssOSError – If the underlying hsi reports an error.
- hpsspy.os.lstat(path)[source]¶
Perform the equivalent of os.lstat() on the HPSS file path.
- Parameters:
path (str) – Path to file or directory.
- Returns:
An object that contains information similar to the data returned by os.stat().
- Return type:
- Raises:
HpssOSError – If the underlying hsi reports an error.
- hpsspy.os.makedirs(path, mode=None)[source]¶
Reproduces the behavior of os.makedirs().
- Parameters:
- Raises:
HpssOSError – If the underlying hsi reports an error.
Notes
Unlike os.makedirs(), attempts to create existing directories raise no exception.
- hpsspy.os.mkdir(path, mode=None)[source]¶
Reproduces the behavior of os.mkdir().
- Parameters:
- Raises:
HpssOSError – If the underlying hsi reports an error.
Notes
Unlike os.mkdir(), attempts to create existing directories raise no exception.
- hpsspy.os.stat(path, follow_symlinks=True)[source]¶
Perform the equivalent of os.stat() on the HPSS file path.
- Parameters:
path (str) – Path to file or directory.
follow_symlinks (bool, optional) – If False, makes stat() behave like os.lstat().
- Returns:
An object that contains information similar to the data returned by os.stat().
- Return type:
- Raises:
HpssOSError – If the underlying hsi ls reports an error.
- hpsspy.os.walk(top, topdown=True, onerror=None, followlinks=False)[source]¶
Traverse a directory tree on HPSS, similar to os.walk().
- Parameters:
- Returns:
This function can be used in the same way as os.walk().
- Return type:
iterable
hpsspy.os.path¶
Reproduces some features of the Python built-in os.path.
- hpsspy.os.path.isdir(path)[source]¶
Reproduces the behavior of os.path.isdir() for HPSS files.
- hpsspy.os.path.isfile(path)[source]¶
Reproduces the behavior of os.path.isfile() for HPSS files.
- hpsspy.os.path.islink(path)[source]¶
Reproduces the behavior of os.path.islink() for HPSS files.
hpsspy.scan¶
Functions for scanning directory trees to find files in need of backup.
- hpsspy.scan._options()[source]¶
Parse command-line options.
- Returns:
The parsed command-line arguments.
- Return type:
- hpsspy.scan.compile_map(old_map, section)[source]¶
Compile the regular expressions in a map.
- Parameters:
- Returns:
A new dictionary containing compiled regular expressions.
- Return type:
- hpsspy.scan.extract_directory_name(filename)[source]¶
Extract a directory name from a HTAR filename that may contain various prefixes.
- hpsspy.scan.files_to_hpss(hpss_map_cache, section)[source]¶
Create a map of files on disk to HPSS files.
- Parameters:
- Returns:
A tuple containing the compiled mapping and an additional configuration dictionary.
- Return type:
tuple()
- hpsspy.scan.find_missing(hpss_map, hpss_files, disk_files_cache, missing_files, report=10000, limit=1024.0)[source]¶
Compare HPSS files to disk files.
- Parameters:
hpss_map (dict) – A mapping of file names to HPSS files.
hpss_files (dict) – The list of actual HPSS files.
disk_files_cache (str) – Name of the disk cache file.
missing_files (str) – Name of the file that will contain the list of missing files.
report (int, optional) – Print an informational message when N files have been scanned.
limit (float, optional) – HPSS archive files should be smaller than this size (in GB).
- Returns:
True if no serious problems were found.
- Return type:
- hpsspy.scan.main()[source]¶
Entry-point for command-line scripts.
- Returns:
An integer suitable for passing to sys.exit().
- Return type:
- hpsspy.scan.physical_disks(release_root, config)[source]¶
Convert a root path into a list of physical disks containing data.
- hpsspy.scan.process_missing(missing_cache, disk_root, hpss_root, dirmode='2770', test=False)[source]¶
Convert missing files into HPSS commands.
- Parameters:
missing_cache (str) – Name of a JSON file containing the missing file data.
disk_root (str) – Missing files are relative to this root on disk.
hpss_root (str) – Missing files are relative to this root on HPSS.
dirmode (str, optional) – Create directories on HPSS with this mode (default drwxrws---).
test (bool, optional) – Test mode. Try not to make any changes.
- hpsspy.scan.scan_disk(disk_roots, disk_files_cache, overwrite=False)[source]¶
Scan a directory tree on disk and cache the files found there.
- hpsspy.scan.scan_hpss(hpss_root, hpss_files_cache, overwrite=False)[source]¶
Scan a directory on HPSS and return the files found there.
- hpsspy.scan.validate_configuration(config)[source]¶
Check the configuration file for validity.
- Parameters:
config (str) – Name of the configuration file.
- Returns:
An integer suitable for passing to sys.exit().
- Return type:
hpsspy.util¶
Low-level utilities.
- class hpsspy.util.HpssFile(*args)[source]¶
This class is used to store and access an HPSS file’s metadata.
- Parameters:
args (iterable) – This object will normally be initialized by a tuple produced by hpsspy.os.listdir().
- htar_contents()[source]¶
Return (and cache) the contents of an htar file.
- Returns:
List containing the contents.
- Return type:
- property isdir¶
True if the file is a directory or a symbolic link that points to a directory.
- property islink¶
True if the file is a symbolic link.
- property name¶
Name of the file.
- property path¶
Full path to the file.
- property readlink¶
Destination of symbolic link.
- property st_mode¶
File permission mode.
- property st_mtime¶
File modification time.
- hpsspy.util.get_tmpdir(**kwargs)[source]¶
Return the path to a suitable temporary directory.
Resolves the path to the temporary directory in the following order:
- If tmpdir is present as a keyword argument, the value is returned.
- If TMPDIR is set, its value is returned.
- If neither is set, /tmp is returned.
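The lookup order can be restated as a short function. This is an illustrative re-statement of the documented behavior, not the source of hpsspy.util.get_tmpdir(); the name resolve_tmpdir and the environ parameter (added here so the logic can be exercised without touching the real environment) are hypothetical.

```python
import os


def resolve_tmpdir(environ=None, **kwargs):
    """Illustrative re-statement of the lookup order; not the library code."""
    environ = os.environ if environ is None else environ
    if "tmpdir" in kwargs:        # 1. explicit keyword argument wins
        return kwargs["tmpdir"]
    if "TMPDIR" in environ:       # 2. then the TMPDIR environment variable
        return environ["TMPDIR"]
    return "/tmp"                 # 3. finally the fixed fallback


print(resolve_tmpdir(environ={"TMPDIR": "/scratch/tmp"}, tmpdir="/my/tmp"))
```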
- hpsspy.util.hsi(*args, **kwargs)[source]¶
Run hsi with arguments.
- Parameters:
args (tuple()) – Arguments to be passed to hsi.
tmpdir (str, optional) – Write temporary files to this directory. Defaults to the value returned by hpsspy.util.get_tmpdir(). This option must be passed as a keyword!
- Returns:
The standard output from hsi.
- Return type:
- Raises:
KeyError – If the HPSS_DIR environment variable has not been set.
Release Notes¶
0.7.0 (2023-07-17)¶
0.6.1 (2022-05-20)¶
Bumped version due to malformed PyPI upload.
0.6.0 (2022-05-20)¶
Reorganization of package structure and metadata; no changes to user-facing API.
Support full-precision timestamps on HPSS files (PR #14).
0.5.1 (2019-08-20)¶
Unused patterns and over-large backup files no longer trigger a critical error (PR #12).
0.5.0 (2019-05-18)¶
This release drops support for Python 2.
0.4.2 (2019-01-29)¶
Further fixes for mapping HTAR file names back to directories (PR #6).
0.4.1 (2019-01-16)¶
Handle directory names that contain underscore characters; improve test coverage (PR #4).
0.4.0 (2017-08-10)¶
Add --version option.
Add Python 3.6, remove 3.3.
Add many quality-assurance checks and additional documentation (PR #2).
0.3.0 (2017-01-18)¶
General refresh of Python code, documentation, test suite. However, no major changes to the API.
Command-line inputs are no longer rigidly restricted to SDSS or DESI.
0.2.1 (2015-04-22)¶
Fixed some setup.py errors, no code changes.
0.2.0 (2015-04-22)¶
Moved configuration items to JSON files.
Started adding support for DESI.
Add tests to util subpackage.
Add __future__ statements.
Clean up API documentation.
Minor bug fixes.
0.1.0 (2015-03-25)¶
Initial release. Used to scan all SDSS data.