SquashFS

File systems on Koa have different properties. The file systems used for group/lab spaces, KoaStore (for-fee long-term storage), and scratch all limit how many files a user can have. Currently, all three of these spaces are set up to allow full use of the allocated space if the average size of the files stored there is 1 MB or larger.

In some cases, users may have inputs or datasets that consist of many files much smaller than 1 MB each, so they hit the file-count limit before exhausting the storage space they have been allocated. This limitation can be frustrating, but when the files do not need to be modified, using an archive format is ideal and in fact encouraged. For some, this could be a simple tar file; others need to access the files regularly without unpacking the contents of the archive. For a simple-to-use, read-only archive that can be mounted on Koa as a folder during a job's execution, we encourage users to consider a SquashFS archive.
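To judge whether a folder is a good candidate for archiving, it can help to count its files and compute their average size. The following is a minimal sketch using standard tools; the function name dir_file_stats is hypothetical, and the -printf option assumes GNU find, which is typical on Linux clusters but is not confirmed here for Koa.

```shell
# dir_file_stats DIR: print the number of regular files under DIR,
# their total size in bytes, and the average file size in bytes.
# Assumes GNU find (for -printf); the function name is illustrative only.
dir_file_stats() {
    find "$1" -type f -printf '%s\n' |
        awk '{ n++; total += $1 }
             END {
                 avg = (n > 0) ? int(total / n) : 0
                 printf "files=%d total_bytes=%.0f avg_bytes=%d\n", n, total, avg
             }'
}
```

For example, running dir_file_stats ~/koa_scratch/database prints one summary line. An average well below roughly one million bytes suggests the folder will hit the file-count limit before the space limit.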

SquashFS is a read-only file format. This means you cannot directly modify the files stored in the archive.

The benefits of using SquashFS in the three storage locations that limit the number of files stored are:

The archive counts as only a single file
Many smaller files are merged into one larger file, which raises the average file size
Less overhead on file access, since less network communication is needed to reach each file stored in the archive; as a result, access to the read-only data may perform better
Users can mount/unmount each archive and access it like any other storage location

Using SquashFS in this manner has also been tested and shown to be beneficial for the read-only databases used by the NCBI bioinformatics tool BLAST: https://arxiv.org/abs/2002.06129

Building a squashfs file

On Koa, users can create a squashfs archive using the mksquashfs command on a compute node. For example, assume we have a folder located at ~/koa_scratch/database which contains 10K files with an average file size of 512K. We could pack it into a single file, call it database.sqfs, and also save it in ~/koa_scratch with the following command, which selects zstd compression at level 3 (-comp zstd -Xcompression-level 3), sets a 1 MB block size (-b 1M), and omits extended attributes (-no-xattrs):

mksquashfs ~/koa_scratch/database ~/koa_scratch/database.sqfs -comp zstd -Xcompression-level 3 -b 1M -no-xattrs
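
Before removing the original folder, it may be worth inspecting the new archive. The sketch below wraps two unsquashfs options from squashfs-tools: -s, which prints superblock information such as the compression type and block size, and -l, which lists the contents without extracting them. The function name inspect_sqfs is hypothetical, and this assumes unsquashfs is available alongside mksquashfs.

```shell
# inspect_sqfs ARCHIVE: print the archive's superblock summary, then the
# first few entries of its file listing. The function name is illustrative.
inspect_sqfs() {
    # -s shows superblock information (e.g. compression and block size)
    unsquashfs -s "$1" || return 1
    # -l lists the archive contents without extracting anything
    unsquashfs -l "$1" | head -n 20
}
```

For the archive built above, this would be called as inspect_sqfs ~/koa_scratch/database.sqfs.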

Mounting and using a squashfs file

Once the archive is created, it can be mounted and accessed on a per-node basis with the squashfuse_ll command. While plain squashfuse also works, we recommend the _ll variant because it lets you set up automatic cleanup of the mount when it has not been accessed for a while. This is useful when you have multiple jobs on the same node all accessing the same archive from the same location: with the timeout, none of your jobs has to unmount the archive itself, yet the mount will eventually be removed automatically once it no longer sees any file accesses.

Note: the squashfs archive can only be mounted on a folder that resides in /tmp on a given node. Also be aware that squashfuse mounts the archive only on the node where the command is executed, and the mount is accessible only on that node. An archive can be mounted on multiple nodes at the same time, at the same path on each node that needs access to it.

mkdir /tmp/database
squashfuse_ll -otimeout=43200 ~/koa_scratch/database.sqfs /tmp/database

In the above example, database.sqfs will be unmounted from /tmp/database after 43,200 seconds (12 hrs) of inactivity.
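
When several jobs on one node share the same archive, each job script can attempt the mount without failing if another job got there first. The following is a minimal sketch, assuming the mountpoint utility from util-linux is available on the node; the function name mount_sqfs and its default timeout argument are hypothetical. It does not guard against two jobs attempting the mount at the same instant.

```shell
# mount_sqfs ARCHIVE MOUNTDIR [TIMEOUT]: mount ARCHIVE on MOUNTDIR unless
# something is already mounted there. TIMEOUT is the number of seconds of
# inactivity before the automatic unmount (default 43200, i.e. 12 hours).
# The function name and the default are illustrative only.
mount_sqfs() {
    mkdir -p "$2" || return 1
    # mountpoint -q exits 0 when MOUNTDIR is already a mount point,
    # for example because another job on this node mounted the archive.
    if mountpoint -q "$2"; then
        return 0
    fi
    squashfuse_ll -otimeout="${3:-43200}" "$1" "$2"
}
```

A job script could then call mount_sqfs ~/koa_scratch/database.sqfs /tmp/database before reading from /tmp/database.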

If you know you no longer need the squashfs archive mounted on a given node, you can unmount it with the umount command, though this may be optional if a sensible timeout is set with the squashfuse_ll command:

umount /tmp/database
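
Because the timeout may already have removed the mount, a cleanup step at the end of a job can check before unmounting. This is a small sketch under the same assumption that the mountpoint utility is available; the function name unmount_sqfs is hypothetical.

```shell
# unmount_sqfs MOUNTDIR: unmount MOUNTDIR only if it is currently a mount
# point, so the call is harmless when the timeout has already cleaned up.
# The function name is illustrative only.
unmount_sqfs() {
    if mountpoint -q "$1"; then
        umount "$1"
    fi
}
```

If another job on the node is still reading from the archive, umount is expected to fail because the mount is busy, and the timeout remains the fallback.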

https://www.mankier.com/1/mksquashfs

https://github.com/vasi/squashfuse