Using remote archive storage
1. Overview
1.1. Basic overview
Remote archive storage is hosted on tape at a remote location. Currently there is only one archive provider: SURF.
The following guidelines apply to this archive storage:
- It is remote storage intended to be written once and read rarely.
- Performance is designed for occasional access (once a year or less) to existing data on the remote tape storage.
- Archive is not a backup! It provides storage at a lower cost than normal disk storage.
- Tape data does not have a backup, so be extremely careful when deleting data.
- If file permissions and metadata need to be kept, the files should be packaged before uploading (using tar or a similar tool).
- File size must be considered - there should be no small files (see 'Best practices' below).
- The tape storage is ISO 27001 certified.
- Data is stored in two physical locations in the Netherlands.
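As a minimal sketch of the packaging guideline above, the following bundles a folder into one compressed tar file and then lists the permissions, owner, size and timestamp stored inside it. The folder and file names are hypothetical and are created in a temporary location purely for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch area standing in for a real project folder.
workdir=$(mktemp -d)
mkdir -p "$workdir/project-x/results"
echo "sample data" > "$workdir/project-x/results/run1.txt"
chmod 640 "$workdir/project-x/results/run1.txt"

# -c create, -z gzip-compress, -f output file.
# tar records permissions, ownership and timestamps when creating an archive.
# -C changes directory first so the tar file holds relative paths.
tar -czf "$workdir/project-x.tar.gz" -C "$workdir" project-x

# -t lists the contents, -v shows the stored permissions, owner, size and time.
tar -tzvf "$workdir/project-x.tar.gz"
```

When such a tar file is unpacked later, the `-p` option of tar asks it to restore the recorded permissions.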
1.2. How it works
The archive is mounted automatically when a user navigates to the /groups/[GROUP]/arc[XX] folder. At that moment the storage from the remote server is mounted on that folder. It remains accessible until a certain idle time is reached.
The data-manager account of a group is the only account that has read or write access to that group's archive folder. This prevents files from being accidentally recalled online when not needed, and ensures that all files are stored in the correct format (see 'Best practices' below).
2. Managing data
After some time, all files on the remote archive server are automatically migrated to tape. When this happens, all folders and files remain visible in the directory structure. Any (sub)directory can still be accessed and ls lists the files. All filenames, permissions and metadata (age, size, ownership) can be seen.
The difference is that the file content is no longer directly available - it has to be recalled first. Any operation on the file content, such as modifying it (compressing, editing with vim/nano) or reading it (with less, cat or grep), makes the command hang while the file is retrieved from tape, which takes a lot of time. During this time the command line is unusable.
Therefore the correct procedure is to first stage (recall from the tape) the file, and access the content when it is available again.
2.1. Data states
On the remote server the data migrates from disk to tape, passing through different states along the way. As long as the data is online (on disks), it is available to the user and can be read or modified.
State | Code | Online (data on disks) | Offline (data on tape) | Explanation |
---|---|---|---|---|
Regular | REG | Yes | No | Files are only on disk. File content can be accessed and changed. |
Migrating | MIG | Yes | Not yet | File content is being copied from disks to tape. Content is still available. |
Dual-state | DUL | Yes | Yes | Content is both on disk and on tape. |
Offline | OFL | No | Yes | Content is no longer online (on disks). It is only on tape. Can be recalled back to disks. |
Unmigrating | UNM | Not yet | Yes | File content is being copied from tape and is not available until the copy is finished. |
Note that folders are always online (in state REG), so you can always browse folders and check file permissions and metadata.
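The state table can be turned into a small check before touching file content. The sketch below extracts the state code from one line of `arc_surf --dmls` output (using the line format shown in the workflow example in this document) and reports whether the content is on disk. The function names `state_of` and `is_online` are hypothetical helpers, not part of `arc_surf`.

```shell
#!/usr/bin/env bash

# Extract the state code from one line of `arc_surf --dmls` output.
# Assumes the state is the first three-letter uppercase code in parentheses.
state_of() {
    sed -n 's/^[^(]*(\([A-Z]\{3\}\)).*$/\1/p' <<< "$1"
}

# Return 0 when the file content is on disk, 1 when it must be recalled first,
# 2 when the state code is not one of the five listed in the table.
is_online() {
    case "$1" in
        REG|MIG|DUL) return 0 ;;
        OFL|UNM)     return 1 ;;
        *)           echo "unknown state: $1" >&2; return 2 ;;
    esac
}

line='-rw-r----- 1 dm-user dm-user 10485760000 2024-11-26 18:08 (OFL) project-x.tar'
state=$(state_of "$line")
if is_online "$state"; then
    echo "$state: content can be read directly"
else
    echo "$state: recall the file before reading it"
fi
```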
3. Workflow example
How to upload files and change the state of the remote files.
User
Become the data manager
user $ sudo -u [group]-dm bash
Bundling
(optional, but highly recommended) Prepare the data by merging multiple files/folders into one compressed tar file
dm-user $ tar -czvf /groups/[group]/arc01/projects/project-x.tar.gz /groups/[group]/prmxx/projects/x/*
Uploading
Upload file(s) to the archive
dm-user $ cp /groups/[group]/prmXX/project-x.tar /groups/[group]/arc01/projects/project-x.tar
Checksum
(highly recommended) If the file was copied recently, it may still be on regular disks on the remote archive server, so you can simply issue a remote command to calculate its sha256sum value
dm-user $ arc_surf --sha256sum /groups/[GROUP]/arcXX/subfolder/file
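To make use of the remote checksum, it can be compared with one computed locally on the original file. The following is a sketch only: the exact output format of `arc_surf --sha256sum` is not described here, so the remote digest is assumed to be copied by hand and passed in as a bare hex string. The helper names `local_sha256` and `verify_sha256` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print only the hex digest of a local file (sha256sum prints "<digest>  <name>").
local_sha256() {
    sha256sum "$1" | awk '{print $1}'
}

# Compare a local file against a digest obtained elsewhere.
# $2 is assumed to be the bare hex digest copied from the remote output.
verify_sha256() {
    local actual
    actual=$(local_sha256 "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: checksums match"
    else
        echo "MISMATCH: local $actual, remote $2" >&2
        return 1
    fi
}

# Illustration with a throwaway file standing in for the real tar file.
sample=$(mktemp)
printf 'abc' > "$sample"
verify_sha256 "$sample" "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```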
Migrating
(optional) If the file is still online, it can be moved to tape manually (or you can simply wait for it to be moved there automatically)
dm-user $ arc_surf --dmput /groups/[group]/arc01/projects/project-x.tar
Submitted to remote host, waiting for reply ...
( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.5eHsc2kAPj )
Status
Check the file status
dm-user $ arc_surf --dmls /groups/[group]/arc01/projects/project-x.tar
Submitted to remote host, waiting for reply ...
( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.ECc4X0dAEz )
-rw-r----- 1 dm-user dm-user 10485760000 2024-11-26 18:08 (OFL) project-x.tar
Unmigrating
If the file is offline, we can call it back to disks (stage it online) with
dm-user $ arc_surf --dmget /groups/[group]/arc01/projects/project-x.tar
Submitted to remote host, waiting for reply ...
( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.EeHDV2kAPj )
Check the status again
dm-user $ arc_surf --dmls /groups/[group]/arc01/projects/project-x.tar
Submitted to remote host, waiting for reply ...
( You can press CTRL+C and check later for the output in /var/cache/arcq//output/tmp.qo7tO9CtVB )
-rw-r----- 1 dm-user dm-user 10485760000 2024-11-26 18:08 (UNM) project-x.tar
dm-user $ # note that the file is unmigrating now
Now wait until the status changes from UNM (unmigrating) to DUL (Dual-state) or REG (Regular state).
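The waiting step can be scripted as a simple polling loop. This is a sketch under stated assumptions: it presumes that `arc_surf --dmls` prints the state in parentheses as in the example output above and that it can be called repeatedly without side effects. `LIST_CMD` and `POLL_SECONDS` are hypothetical settings introduced only for this example.

```shell
#!/usr/bin/env bash

# Command used to list the state, and seconds to wait between checks.
# Both are hypothetical knobs for this sketch.
LIST_CMD=${LIST_CMD:-"arc_surf --dmls"}
POLL_SECONDS=${POLL_SECONDS:-600}

# Poll a file's state until its content is back on disk.
wait_until_online() {
    local path=$1 output
    while true; do
        # Word splitting of LIST_CMD is intentional (command plus its option).
        output=$($LIST_CMD "$path")
        case "$output" in
            *"(DUL)"*|*"(REG)"*)
                echo "online: $path"
                return 0 ;;
            *"(UNM)"*)
                # Still being copied back from tape; check again later.
                sleep "$POLL_SECONDS" ;;
            *"(OFL)"*)
                echo "still offline, recall it with --dmget first: $path" >&2
                return 1 ;;
            *)
                echo "unexpected output: $output" >&2
                return 1 ;;
        esac
    done
}
```

It would be called as `wait_until_online /groups/[group]/arc01/projects/project-x.tar` after the `--dmget` request has been submitted.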
4. Other command line options
Use the --help argument to get more information
dm-user $ arc_surf --help
Provide one of the following arguments
--dmfind-reg <path> print regular files / files that reside only on disk
--dmfind-mig <path> print files that are being copied from disk to tape
--dmfind-dul <path> print files that reside both online and offline
--dmfind-ofl <path> print data that is no longer on disk (is on tape)
--dmfind-unm <path> print files which are being copied from tape to disk
--dmget <path> recall / stage online FROM TAPE
--dmls <path> list state
--dmput <path> send to offline / stage TO TAPE
--sha256sum <path> compute the sha256sum of the file
5. Best practices
File sizes are extremely important for the archive. Tape storage performance and management are better when the files are large.
Therefore
- files should be between 1 GB and 100 GB in size (checksum files are an exception)
- the average file size should not be lower than 1 GB
- the archive filesystem was built around the idea of occasional access to the data content (once or twice a year at most)
The average file size is monitored, and groups whose average file size is below 1 GB will have their accounts locked.
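Before uploading, the file sizes in a folder can be checked against these guidelines. The sketch below reports the number of files, the average size and how many files fall below a threshold. It relies on GNU find (`-printf`) and assumes 1 GB means 1000000000 bytes; the function name `size_report` is hypothetical. It reads only file sizes, which are metadata, not file content.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Report file count, average size and number of files under a threshold.
# $1 = folder to scan, $2 = threshold in bytes (default: 1 GB, decimal).
size_report() {
    local dir=$1 min_bytes=${2:-1000000000}
    find "$dir" -type f -printf '%s %p\n' | awk -v min="$min_bytes" '
        { total += $1; n++; if ($1 < min) small++ }
        END {
            if (n == 0) { print "files=0"; exit }
            printf "files=%d average_bytes=%.0f below_threshold=%d\n", n, total / n, small
        }'
}

# Illustration with two tiny throwaway files and a 200-byte threshold.
demo=$(mktemp -d)
head -c 100 /dev/zero > "$demo/a.bin"
head -c 300 /dev/zero > "$demo/b.bin"
size_report "$demo" 200
```

A folder with many files below the threshold is a candidate for bundling into a tar file first.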
6. Performance
The speed of upload and download depends on the following conditions
- the total bandwidth usage of the network by all the users on the Login node
- (for restoring the data) the utilization of the prm/tmp disks by all users
- load of the data and network on the remote tape archive system that hosts the data
So far, tests have shown upload speeds between 30 and 50 MB/s. This means that archiving and restoring large datasets can take anywhere from several hours to several days, depending on their size.
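A rough transfer time can be estimated from the dataset size and these speeds. The sketch below assumes decimal units (1 GB = 1000 MB) and a constant transfer rate, which real transfers will not have; the function name `transfer_hours` is hypothetical.

```shell
#!/usr/bin/env bash

# Estimate transfer time in hours.
# $1 = size in GB, $2 = speed in MB/s.
transfer_hours() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / mbps / 3600 }'
}

transfer_hours 100 30      # one 100 GB file at 30 MB/s  -> 0.9
transfer_hours 10000 50    # 10 TB at 50 MB/s            -> 55.6
transfer_hours 10000 30    # 10 TB at 30 MB/s            -> 92.6
```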
7. Issues
So far the known bugs have been resolved, but it could still happen that
- the archive folder is not available - this should not happen, but the remote system may be temporarily down; please inform the helpdesk,
- download/upload performance occasionally drops - this most probably depends on the Login node usage (and data copying by other users); notify the helpdesk if it persists for a longer period,
- submitted commands do not return results - this happened in the first implementation of the archive solution and should be fixed now.
If you experience any issues with the archive solution, please notify the helpdesk.
8. Additional information
https://servicedesk.surf.nl/jira/servicedesk/customer/kb/view/1474651?applicationId=1ce5558f-6f9a-3c77-9454-661953e955cb&spaceKey=WIKI&portalId=13&title=Data%20Archive
(from Feb. 2025)
Where is my data stored?
The Data Archive maintains two tape libraries for security and redundancy in two physically separate locations in the Amsterdam and Haarlemmermeer municipalities. When data is uploaded to the Data Archive using SSH, (HPN)SCP, SFTP, rsync, GridFTP, iRODS, etc. it ends up on an online disk space managed by the Data Migration Facility (DMF). The DMF will then manage the careful migration of files from the disk space to two tape libraries until your data is available on both tape libraries. Once your data is safely stored in the two tape libraries it may be removed from the disk space (aka offline). Offline data can be interacted with in the same manner as online data though users may notice a delay in access time.