...
Home is not a high performance file system. NFS is tried and true, but it is not designed to handle the stresses a High Performance Computing cluster can place on it. As a result, home should not be used for writing output from multiple jobs or from jobs that generate a lot of data. Jobs of this nature should take advantage of Koa Scratch instead, which is a file system designed for the workloads an HPC cluster generates.
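For example, a job can write its output under the scratch path instead of under home. The following is a minimal Python sketch assuming the scratch path documented later on this page; the job directory and file names are made up for illustration:

```python
import os
from pathlib import Path

# Hypothetical example: build an output directory on Koa Scratch for a job.
# The scratch path matches the one documented on this page; the job and
# file names below are made up for illustration.
scratch = Path("/mnt/lustre/koa/scratch") / os.environ["USER"]
outdir = scratch / "my_job_output"          # hypothetical job directory
outdir.mkdir(parents=True, exist_ok=True)

# Write large or frequent job output here instead of under $HOME.
with open(outdir / "results.dat", "w") as fh:
    fh.write("large job output goes here\n")
```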
...
File system | Per user storage quota | Per user file limit | Compression | Persistent |
---|---|---|---|---|
ZFS + NFS v4 | 50 GB | N/A | zstd-3 | Yes |
...
...
Scratch
Scratch, also reachable through the symlink “koa_scratch”, provides each user access to an 800 TB pool of storage on which files may live for 90 days from the last time they were modified. In total, this file system can support up to 400,000,000 files and directories. Users are not given an individual quota, allowing for flexibility based on need. Scratch directly accesses the underlying Koa Storage system, providing the highest performance possible from the storage system.
Performance
Scratch provides direct access to the underlying high performance file system, Koa Storage, which utilizes Lustre. Lustre is designed for situations where many servers and workloads need to read and write data as quickly as possible. While Lustre works best with long sequential reads and writes, methods exist to work around this limitation when a job needs to read many small files as input, such as packing them into a squashfs image.
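As an illustration of the squashfs approach, the sketch below packs a directory of many small input files into a single image so it can be read from Scratch with large sequential reads. The paths are placeholders, and it assumes mksquashfs is installed where it runs (the zstd compressor option additionally assumes a squashfs-tools build with zstd support):

```python
import subprocess
from pathlib import Path

# Pack a directory tree of many small input files into one squashfs image
# so it can be read from Scratch as a single large file. Paths are
# placeholders; mksquashfs must be installed on the node where this runs,
# and the zstd compressor option assumes a squashfs-tools build with zstd.
small_files_dir = Path("my_dataset")        # hypothetical directory of small files
image = Path("my_dataset.sqsh")             # hypothetical output image

subprocess.run(
    ["mksquashfs", str(small_files_dir), str(image), "-comp", "zstd"],
    check=True,
)
# The resulting image can then be mounted (for example inside a container
# runtime) and its contents read without touching every small file on Lustre.
```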
For scalability and performance, files written to Scratch use a progressive file layout, in which different types of storage, or more storage targets overall, are recruited at certain size boundaries to store parts of a file.
Automatic File Purging
Scratch is not a persistent storage location for users' data. Scratch provides a 90 day grace period after a file was last written before the file is automatically removed from the file system.
Parameters of the automated file purge
Only files under ~/koa_scratch/ or /mnt/lustre/koa/scratch/${USER} will be subject to purge
Files and folders not modified for *90 days* will be deleted from scratch
The purge process will run daily
The process cannot be paused for individual users and files that are removed cannot be recovered
In the case the file system still does not drop below the 85-90% threshold, ITS-CI will contact the users with the largest occupancy of scratch to voluntarily reduce their usage, and/or will purge files from oldest to newest until usage is back below 70% of scratch.
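Users can check ahead of time which of their files are at risk. Below is a small user-side sketch (this is not the actual purge process) that lists files under your scratch directory that have not been modified in the last 90 days:

```python
import os
import time
from pathlib import Path

# User-side check (not the actual purge job): list files under your scratch
# directory that have not been modified in 90 days and are therefore
# candidates for the automatic purge.
PURGE_AGE = 90 * 24 * 60 * 60               # 90 days, in seconds
scratch = Path("/mnt/lustre/koa/scratch") / os.environ["USER"]
now = time.time()

for path in scratch.rglob("*"):
    if path.is_file() and now - path.stat().st_mtime > PURGE_AGE:
        print(path)
```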
Details
Scratch, or more specifically Koa Storage, utilizes Lustre with ZFS as its underlying file system. For transport to Koa, it also utilizes RDMA across multiple servers, all connected with 200 Gbit InfiniBand. The ZFS components are set up to provide the same zstd-3 compression seen on the home file system, providing space savings and, in some cases, faster access to files, since less data needs to be read from the slower storage media. The underlying storage system utilizes a mixture of spinning enterprise hard drives (HDD) and solid state storage (SAS SSD and NVMe).
Meta Data
Metadata is stored on targets separate from file data and, for performance reasons, resides entirely on solid state drives. The metadata is split up among multiple targets, which are then served out by different servers. In case of a server failure, “failing over” the storage to another server is possible, allowing for minimal downtime. Each folder is assigned to one of the metadata storage targets, and all files under it are also assigned to that target. Load balancing is done in some cases to keep the different metadata targets from growing too far out of sync in size.
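To see which metadata target a given directory is assigned to, the Lustre `lfs` client utility can be queried. A brief sketch, assuming a Lustre client with `lfs` in the path; the directory shown is a placeholder:

```python
import subprocess

# Report which metadata target (MDT) a directory is assigned to, assuming a
# Lustre client with the `lfs` utility available. The path is a placeholder
# for a directory you own under scratch.
directory = "/mnt/lustre/koa/scratch/username/project"   # hypothetical path

# `lfs getdirstripe` prints the MDT index and directory striping information.
subprocess.run(["lfs", "getdirstripe", directory], check=True)
```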
Object Data
Object data is stored on a mixture of spinning enterprise hard drives, SAS solid state drives, and NVMe. Of the current storage (7 PB), about one seventh of Koa's object data storage is either SAS SSD or NVMe.
Storage targets are split into different configurations, each providing at least two-disk parity. Koa currently has 58 Object Storage Targets, of which 10 are SAS SSD or NVMe.
Object data written to Scratch uses a progressive file layout, which is currently set up to follow these rules (see the sketch after this list):
The first 512 KB of every file is written to some form of solid state storage (SSD or NVMe)
Next, file data up to 64 MB is written to a single HDD target
Next, file data up to 512 MB is written to two HDD targets in 4 MB stripes
Next, file data up to 1 GB is written to four HDD targets in 4 MB stripes
Finally, any remaining data for a file is written to eight HDD targets in 4 MB stripes
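To make these boundaries concrete, the sketch below computes how a file of a given size would be split across those components. It only mirrors the rules as listed above; it does not query Lustre:

```python
# A small worked example of the progressive file layout boundaries listed
# above: given a file size, report how many bytes fall into each component.
# This only mirrors the documented rules; it does not query Lustre.

KB, MB, GB = 1024, 1024**2, 1024**3

# (upper boundary of the component, description), per the list above
COMPONENTS = [
    (512 * KB, "flash (SSD/NVMe) component"),
    (64 * MB,  "1 HDD target"),
    (512 * MB, "2 HDD targets, 4 MB stripes"),
    (1 * GB,   "4 HDD targets, 4 MB stripes"),
    (float("inf"), "8 HDD targets, 4 MB stripes"),
]

def pfl_breakdown(file_size: int) -> list[tuple[str, int]]:
    """Return (component description, bytes stored there) for a file."""
    breakdown, previous = [], 0
    for boundary, label in COMPONENTS:
        if file_size <= previous:
            break
        breakdown.append((label, min(file_size, boundary) - previous))
        previous = boundary
    return breakdown

# Example: a 2 GB file puts 512 KB on flash and the rest on HDD components.
for label, nbytes in pfl_breakdown(2 * GB):
    print(f"{label}: {nbytes} bytes")
```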
File system | Total Storage | Max Files and Directories | Compression | Persistent |
---|---|---|---|---|
Lustre + ZFS | 800 TB | 400,000,000 | zstd-3 | No (90 day purge) |
...