Using remote archive storage

1. Overview

1.1. Basic overview

Remote archive storage is storage hosted on tape at a remote location. Currently there is only one archive provider: SURF.

The following guidelines apply to this archive storage:

  • It is remote storage intended to be written once and read rarely.
  • Performance is designed for occasional access (once a year or less) to existing data on the remote tape storage.
  • Archive is not a backup! It provides storage at a lower cost than normal disk storage.
  • Tape data does not have a backup, so be extremely careful when deleting data.
  • If file permissions and metadata need to be preserved, package the files before uploading (with tar or a similar tool).
  • File size must be considered: avoid small files (see 'Best practices' below).
  • Tape storage has the ISO 27001 certification.
  • Data is stored in two physical locations in the Netherlands.

1.2. How it works

The archive is mounted automatically when a user navigates to the /groups/[GROUP]/arc[XX] folder. At that moment the storage from the remote server is mounted on the folder. It remains accessible until a certain idle time is reached.

The data-manager account of a group is the only account that has read or write access to the archive folder of that group. This prevents files from being accidentally recalled online when not needed, and it ensures that all files are stored in the correct format (see 'Best practices' below).

2. Managing data

After some time, all files on the remote archive server are automatically migrated to tape. When this happens, all folders and files are still visible in the directory structure. Any (sub)directory can be accessed and ls lists the files. All filenames, permissions and metadata (age, size, ownership) can be seen.

The difference is that the file content is no longer directly available; it has to be recalled first. Any operation on the file content, whether a modification (compressing, editing with vim/nano) or a read (with less, cat or grep), blocks until the file has been retrieved from tape, which takes a long time. During this time the command line is stuck and unusable.

Therefore the correct procedure is to first stage (recall from the tape) the file, and access the content when it is available again.

2.1. Data states

On the remote server the data migrates from disk to tape, passing through different states along the way. As long as the data is online (on disk), it is available to the user and can be read or modified.

State        Code  Online (data on disk)  Offline (data on tape)  Explanation
Regular      REG   Yes                    No                      Files are only on disk. File content can be accessed and changed.
Migrating    MIG   Yes                    Not yet                 File content is being copied from disk to tape. Content is still available.
Dual-state   DUL   Yes                    Yes                     Content is both on disk and on tape.
Offline      OFL   No                     Yes                     Content is no longer online (on disk); it is only on tape. It can be recalled to disk.
Unmigrating  UNM   Not yet                Yes                     File content is being copied from tape and is not available until the copy is finished.

Note that the folders are always online (in state REG) and as such you can always browse folders and check file permissions and their metadata information.
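
The rule "stage first, then read" can be turned into a small guard. The following is only a sketch with hypothetical function names (dmls_state, safe_to_read); they are not part of arc_surf. It assumes that a line of --dmls output carries the state code in parentheses just before the file name, as in the workflow example in section 3.

```shell
# Hypothetical helper: extract the state code (REG, MIG, DUL, OFL, UNM)
# from a single line of `arc_surf --dmls` output.
# Assumption: the state is printed in parentheses before the file name,
# e.g. "... 2024-11-26 18:08 (OFL) project-x.tar".
dmls_state() {
    printf '%s\n' "$1" | sed -n 's/.*(\([A-Z][A-Z][A-Z]\)) .*/\1/p'
}

# Hypothetical guard: succeed only when the content is online (on disk),
# which per the table above is the case for REG, MIG and DUL.
safe_to_read() {
    case "$(dmls_state "$1")" in
        REG|MIG|DUL) return 0 ;;
        *)           return 1 ;;
    esac
}
```

A script could call safe_to_read on the --dmls line of a file and only run cat, less or grep when it succeeds, so the shell does not block on a tape recall.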

3. Workflow example

How to upload and modify states of the remote files.

User

Become the data manager

   user $ sudo -u [group]-dm bash

Bundling

(optional, but highly recommended) Prepare the data by merging multiple files/folders into one compressed tar file

   dm-user $ tar -czvf /groups/[group]/arc01/projects/project-x.tar.gz /groups/[group]/prmXX/projects/x/*
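
As a variation on the command above, the bundle can be created with relative paths and checked right away. This is a minimal sketch; bundle_dir is a hypothetical helper, not an existing tool. It uses the standard tar options -C (change directory before adding files) and -t (list the archive).

```shell
# Hypothetical helper: bundle directory "$2" (located inside parent "$1")
# into the compressed tar file "$3", storing relative paths, then list the
# archive to confirm that it can be read back.
bundle_dir() {
    tar -czf "$3" -C "$1" "$2" && tar -tzf "$3" > /dev/null
}

# Example call (paths follow the placeholders used in this guide):
#   bundle_dir /groups/[group]/prmXX/projects x /groups/[group]/prmXX/project-x.tar.gz
```

Because tar stores permissions, ownership and timestamps inside the bundle, this also covers the guideline about preserving metadata.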

Uploading

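Upload file(s) to the archive (not needed for a tar file that was already written directly into the archive folder, as in the bundling example)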

   dm-user $ cp /groups/[group]/prmXX/project-x.tar /groups/[group]/arc01/projects/project-x.tar

Checksum

(highly recommended) If the file was copied recently, it may still be on regular disk on the remote archive server, so its sha256sum value can be calculated with a single remote command

   dm-user $ arc_surf --sha256sum /groups/[GROUP]/arcXX/subfolder/file
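
To compare the remote value with the local copy, something along these lines could be used. It is a sketch under one explicit assumption: the exact output format of arc_surf --sha256sum is not documented here, so the helper simply looks for the first 64-character hexadecimal string in the output. The function names are hypothetical.

```shell
# Print the first 64-character hexadecimal string found on stdin.
extract_sha256() {
    grep -Eo '[0-9a-f]{64}' | head -n 1
}

# Hypothetical helper: compare the local digest of file "$1" with the text
# "$2" returned by the remote checksum command.
# Assumption: the remote output contains the digest as 64 hex characters.
sha256_matches() {
    local_sum=$(sha256sum "$1" | extract_sha256)
    remote_sum=$(printf '%s\n' "$2" | extract_sha256)
    [ -n "$local_sum" ] && [ "$local_sum" = "$remote_sum" ]
}

# Example call:
#   sha256_matches /groups/[group]/prmXX/project-x.tar \
#       "$(arc_surf --sha256sum /groups/[group]/arc01/projects/project-x.tar)" \
#       && echo "checksums match"
```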

Migrating

(optional) If the file is still online, it can be moved to tape manually (or simply wait for it to be migrated automatically)

   dm-user $ arc_surf --dmput /groups/[group]/arc01/projects/project-x.tar
   Submitted to remote host, waiting for reply ...
   ( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.5eHsc2kAPj )

Status

Check the file status

   dm-user $ arc_surf --dmls /groups/[group]/arc01/projects/project-x.tar
   Submitted to remote host, waiting for reply ...
   ( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.ECc4X0dAEz )
   -rw-r-----  1 dm-user    dm-user    10485760000 2024-11-26 18:08 (OFL) project-x.tar

Unmigrating

If the file is offline, it can be recalled to disk (staged online) with

   dm-user $ arc_surf --dmget /groups/[group]/arc01/projects/project-x.tar
   Submitted to remote host, waiting for reply ...
   ( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.EeHDV2kAPj )

Check the status again

   dm-user $ arc_surf --dmls /groups/[group]/arc01/projects/project-x.tar
   Submitted to remote host, waiting for reply ...
   ( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.qo7tO9CtVB )
   -rw-r-----  1 dm-user    dm-user    10485760000 2024-11-26 18:08 (UNM) project-x.tar
   dm-user $ # note that the file is unmigrating now

Now wait until the status changes from UNM (unmigrating) to DUL (dual-state) or REG (regular).
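
The waiting step can be automated with a simple polling loop. The helper below is a hypothetical sketch, not part of arc_surf. It assumes that --dmls prints the state code in parentheses as in the output above, and it reads the status command from the STATUS_CMD variable (an illustrative name, defaulting to arc_surf --dmls) so that the loop can be tried out without touching the archive.

```shell
# Hypothetical polling helper.
#   $1 = file path, $2 = pause between checks in seconds,
#   $3 = maximum number of checks.
# Returns 0 once the file is DUL or REG, 1 if the limit is reached.
wait_until_online() {
    tries=0
    while [ "$tries" -lt "$3" ]; do
        # Assumption: the state code appears in parentheses, e.g. "(UNM)".
        state=$(${STATUS_CMD:-arc_surf --dmls} "$1" \
            | sed -n 's/.*(\([A-Z][A-Z][A-Z]\)) .*/\1/p' | head -n 1)
        case "$state" in
            DUL|REG) return 0 ;;
        esac
        tries=$((tries + 1))
        sleep "$2"
    done
    return 1
}

# Example call: check every 10 minutes, give up after 144 checks (24 hours).
#   wait_until_online /groups/[group]/arc01/projects/project-x.tar 600 144
```

Since recalls from tape are slow, a long pause between checks is a sensible choice; frequent polling only adds load on the remote host.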

4. Other command line options

Use the --help argument to get more information

   dm-user $ arc_surf --help
   Provide one of the following arguments
    --dmfind-reg <path>      print regular files / files that reside only on disk
    --dmfind-mig <path>      print files that are being copied from disk to tape
    --dmfind-dul <path>      print files that reside both online and offline
    --dmfind-ofl <path>      print data that is no longer on disk (is on tape)
    --dmfind-unm <path>      print files which are being copied from tape to disk
    --dmget      <path>      recall / stage online FROM TAPE
    --dmls       <path>      list state
    --dmput      <path>      send to offline / stage TO TAPE
    --sha256sum  <path>      compute the sha256sum of the file

5. Best practices

File sizes are extremely important for the archive. Tape storage performs better and is easier to manage when the files are large.

Therefore

  • files should be between 1 and 100 GB in size (checksum files are an exception)
  • the average file size should not be lower than 1 GB
  • keep in mind that the archive filesystem was built around the idea of accessing the data content only occasionally (once or twice a year at most)

The average file size is monitored, and groups whose average file size is below 1 GB will have their accounts locked.
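
A quick way to check a directory against these limits before uploading is sketched below. size_report is a hypothetical helper, it requires GNU find (for -printf), and it assumes 1 GB = 10^9 bytes; the exact threshold used by the monitoring is not specified here. It only reads file metadata, not file content.

```shell
# Hypothetical helper: report the number of files, the average file size
# and the number of files smaller than 1 GB below directory "$1".
# Assumption: 1 GB is counted as 10^9 bytes.
size_report() {
    find "$1" -type f -printf '%s\n' | awk '
        { total += $1; if ($1 < 1e9) small++ }
        END {
            avg = (NR > 0) ? total / NR : 0
            printf "files=%d avg_bytes=%.0f under_1GB=%d\n", NR, avg, small
        }'
}

# Example call:
#   size_report /groups/[group]/prmXX/projects
```

If under_1GB is high or avg_bytes is below 1000000000, bundling the small files with tar first brings the data in line with the guidelines.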

6. Performance

The speed of upload and download depends on the following conditions:

  • the total network bandwidth used by all users on the Login node
  • (when restoring data) the utilization of the prm/tmp disks by all users
  • the data and network load on the remote tape archive system that hosts the data

So far, tests have shown upload speeds between 30 and 50 MB/s. This means that archiving and restoring large datasets can take anywhere from several hours to several days, depending on the size.
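
To plan a transfer, the duration can be estimated from the dataset size and the speeds quoted above. The helper below is an illustrative sketch (transfer_hours is a hypothetical name) that assumes 1 MB = 10^6 bytes and a constant transfer speed, which real transfers will not have.

```shell
# Hypothetical helper: estimate the transfer time in hours for "$1" bytes
# at a constant speed of "$2" MB/s (1 MB = 10^6 bytes).
transfer_hours() {
    awk -v bytes="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", bytes / (mbps * 1e6) / 3600 }'
}

# Example: a 10 TB dataset (10^13 bytes)
#   transfer_hours 10000000000000 30   # prints 92.6
#   transfer_hours 10000000000000 50   # prints 55.6
```

For this 10 TB example the estimate is roughly 2.3 to 3.9 days.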

7. Issues

So far all known bugs have been resolved, but it could still happen that

  • the archive folder is not available - this should not happen, but the remote system may be temporarily down; please inform the helpdesk,
  • download/upload performance occasionally drops - this most probably depends on the Login node usage (and data copied by other users); notify the helpdesk if it persists for a longer period,
  • submitted commands do not return results - this happened in the first implementation of the archive solution and should be fixed now.

If you experience any issues with the archive solution, please notify the helpdesk.

8. Additional information

https://servicedesk.surf.nl/jira/servicedesk/customer/kb/view/1474651?applicationId=1ce5558f-6f9a-3c77-9454-661953e955cb&spaceKey=WIKI&portalId=13&title=Data%20Archive

(from Feb. 2025)

Where is my data stored?

The Data Archive maintains two tape libraries for security and redundancy in two physically separate locations in the Amsterdam and Haarlemmermeer municipalities.  When data is uploaded to the Data Archive using SSH, (HPN)SCP, SFTP, rsync, GridFTP, iRODS, etc. it ends up on an online disk space managed by the Data Migration Facility (DMF). The DMF will then manage the careful migration of files from the disk space to two tape libraries until your data is available on both tape libraries. Once your data is safely stored in the two tape libraries it may be removed from the disk space (aka offline). Offline data can be interacted with in the same manner as online data though users may notice a delay in access time.