What Is Lustre?
Lustre Overview
Lustre is a parallel distributed filesystem used at most large-scale computing facilities. It allows NRAO desktops, public machines, and clusters at a particular site to share a large file space, removing the need to repeatedly copy data between systems for processing. Lustre is designed primarily for performance, which it achieves by aggregating individual disk throughput across a large number of disks. As a side effect, the resulting storage volume is typically large compared to desktop storage. As of September 26, 2023, NRAO/NM's Lustre filesystem provides 2.2PB of storage and can sustain ~10GB/s reads or writes; the NAASC and CV Lustre filesystems have similar I/O performance, with 3.0PB and 2.8PB respectively. For similarly designed systems, each new OSS contributes ~5GB/s of I/O.
The described Lustre configuration is designed to produce maximum throughput and storage volume for minimal cost. The cost per node is at most 60% greater than the raw cost of the disks. It is not a suitable design for high availability across a large number of nodes, nor for workloads dominated by large numbers of small I/Os. The configuration attempts to balance disk spindle speed limits (180MB/s per disk), RAID card limits (~500MB/s per card), chassis capacity (44 disks), uniform distribution of data across 2^n data disks (so that 1MB I/Os stripe uniformly), and network throughput via InfiniBand (40Gb/s or higher).
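The 2^n striping constraint above can be illustrated with a small sketch. This is a toy model, not NRAO code: real Lustre placement starts each file at an offset chosen by the metadata server, and stripe count and size are per-file tunables.

```python
# Toy model of round-robin striping: a file written with a 1MB stripe
# size across 2**n OSTs lands chunk i on OST (i % num_osts), so 1MB
# I/Os distribute uniformly across the data disks.

STRIPE_SIZE = 1 << 20          # 1MB stripe size, as in the text
NUM_OSTS = 8                   # 2**3 targets, illustrating the 2**n design

def chunk_to_ost(offset: int, num_osts: int = NUM_OSTS,
                 stripe_size: int = STRIPE_SIZE) -> int:
    """Return the OST index holding the byte at `offset`."""
    return (offset // stripe_size) % num_osts

# A 32MB file spreads its 32 chunks evenly: 4 chunks on each OST.
file_size = 32 * (1 << 20)
counts = [0] * NUM_OSTS
for off in range(0, file_size, STRIPE_SIZE):
    counts[chunk_to_ost(off)] += 1
print(counts)  # -> [4, 4, 4, 4, 4, 4, 4, 4]
```

Because the OST count is a power of two, any power-of-two file size divides evenly across the targets, which is the uniformity the design aims for.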
A schematic drawing of the original Lustre system at NRAO/NM can be found here: lustre-schematic.pdf. The Lustre system has grown substantially since this drawing was made, but the concept is the same.
Lustre Usage
The Lustre file system behaves much like any other Unix filesystem with respect to ownership and permissions. Computers must be explicitly set up to work with Lustre.
Observers with accounts like nm-<number> or cv-<number> already have a lustre area set up for them. It is their home account. Other users will need an area like /lustre/aoc/users/<username> or /lustre/cv/users/<username> set up for them by the local helpdesk. This area will be the same on any machine which can mount that Lustre file system. Note that NAASC Lustre is intended as "scratch" storage for ALMA dataflow operations.
By default, the Archive Access Tool writes data to subdirectories in /home/e2earchive (NM systems only). On systems that support Lustre, that should be referenced as /lustre/aoc/ftp/e2earchive to ensure all reads and writes are done via Lustre and not NFS. Lustre can be up to 10x faster than NFS.
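The path substitution above is mechanical; as a purely hypothetical illustration (this helper is not an NRAO tool), it amounts to a prefix rewrite:

```python
# Hypothetical helper: rewrite an NFS-visible archive path to its
# Lustre equivalent so reads and writes bypass NFS. Prefixes are the
# ones given in the text; the function itself is illustrative only.
NFS_PREFIX = "/home/e2earchive"
LUSTRE_PREFIX = "/lustre/aoc/ftp/e2earchive"

def to_lustre(path: str) -> str:
    """Map a /home/e2earchive path to its /lustre/aoc equivalent."""
    if path == NFS_PREFIX or path.startswith(NFS_PREFIX + "/"):
        return LUSTRE_PREFIX + path[len(NFS_PREFIX):]
    return path  # non-archive paths pass through unchanged

print(to_lustre("/home/e2earchive/project/raw.ms"))
# -> /lustre/aoc/ftp/e2earchive/project/raw.ms
```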
FAQ
- Background
- What is Lustre?
Lustre is a parallel distributed file system. Data is distributed across multiple RAID arrays hosted on multiple servers. The entire set of arrays is then presented to the client as a single coherent file system.
- What is its purpose?
We're using Lustre mostly for performance. Local disks are limited by spindle speed and head seek time. AIPS and CASA should see significant improvements in performance on machines with high speed access to Lustre. Parallel variants of CASA or Obit will see even larger gains. Lustre performance scales with the number of arrays, so adding storage improves total throughput and storage capacity.
- Usage
- Where should I write data?
Observers with accounts like nm-<number> or cv-<number> should write data to their home account (~/) as it is already on Lustre. Other users should write data to an area like /lustre/aoc/users/<username> or /lustre/cv/users/<username>. Contact your local helpdesk to create such areas.
- Can I write anything to Lustre?
Technically yes, but practically you shouldn't. Lustre is not designed for general storage. It is intended for short-term storage to facilitate data reduction of EVLA/VLBA or other instrument data.
- CASA
- Nothing specific at this time.
- AIPS
- How do I set up AIPS to see Lustre?
See AIPS on Lustre. It involves having a personal .dadevs.always file in your home account (~/) that tells AIPS to look at the central AIPS configuration files and then append your Lustre area. That way you see the local system disks plus your Lustre area whether you're on your desktop, a public machine or the eventual cluster.
- Why can I see my Lustre AIPS area on machine-A but not on machine-B?
Most likely because the second machine is not set up to support Lustre.
- Access
- Can I access Lustre from any machine?
No, only machines configured for Lustre can access it. It is not available on all Linux machines, and it is currently unavailable for Windows and Mac systems.
- How do I get access from my Linux desktop?
Contact your local helpdesk. They will need to install kernel modules and, if local networking permits, a 10Gb/s network card on your system. If you only have a 1Gb/s network card, performance will be limited.
- Is there any difference between accessing Lustre from my machine versus the cluster or public machines?
Superficially no. All machines see exactly the same filesystem. Some machines, notably the cluster, will have much faster access.
- Performance
- How fast is Lustre?
It depends. (NOTE: these figures date from 2016.) The aggregate bandwidth of the NRAO/NM Lustre file system is about 22,000MByte/second (MB/s). A single cluster node with an InfiniBand connection can sustain about 3,000MB/s writes and 2,000MB/s reads. A single process on a single cluster node can sustain about 570MB/s writes and 500MB/s reads. However, as the available space in Lustre decreases, so does the performance.
- What causes the various limits?
The aggregate rate is limited by the total throughput of the RAID arrays. The NRAO/NM Lustre consists of 24 arrays, each capable of around 350MB/s. The 700MB/s client rate is limited by the client's ability to reassemble network replies from multiple storage nodes. The per-process rate is limited by individual disk array speeds and/or packet-reassembly overhead.
- Why can't I get that speed on my desktop?
Some desktops are limited to ~100MB/s by their gigabit network connection, which is further limited by contention on inter-floor and inter-switch links. We have deprecated such systems in favor of installing 10Gb/s cards, which should be able to sustain 500 to 700MB/s.
- How does that compare to local disk?
A modern local disk can sustain around 110MB/s for a single task, but performance drops off dramatically as multiple processes contend for access. In some cases, remote access to Lustre will be faster than local disk despite network limits. Desktops, cluster nodes and public machines with high speed access will experience much better performance to Lustre than to local disk.
- Capacity and Data Retention
- How large is the Lustre storage?
As of September 26, 2023, the NRAO/NM Lustre system has 2.2PB of addressable storage; NAASC Lustre has 3.0PB and CV Lustre 2.8PB. As we need more performance or capacity, we will add additional storage nodes. Each node increases aggregate performance by approximately 5GB/s.
- Is the data backed up?
Mostly no, as it is not practical to back up a file system of this size. All data is presumed to be transient, so data loss would be annoying but not disastrous. We do not expect data loss, but failures can happen. Users should back up critical data: CASA tables, images, etc.
NOTE: There is a "disaster recovery" copy of CV Lustre, but it can lag the actual content by a month or more.
- How long can I leave data on the Lustre file system?
Currently there is no limit. Bear in mind that Lustre is designed for performance; it is not intended for mass, long-term storage. We will almost certainly implement an aging policy unless users can be convinced to voluntarily limit usage. The final cleaning frequency will be a balance of usage and budget.
- Is there a limit to how much data I can store on Lustre?
Currently no. Lustre does support quotas but we are not using them yet. As we gain more experience with usage we may implement size limits in addition to age limits.
- Miscellaneous
- How stable is Lustre?
Lustre has been running at the EVLA and AOC in production and test modes since 2010, and at the NAASC since shortly thereafter. So far we've had no real problems. All components must be functioning for the file system as a whole to function. We may have to implement regular periodic maintenance as the server and client base grows.
- What's the most frequent failure?
We average approximately one failed Lustre hard drive every two months. Drives are hot-swappable, so these failures cause no downtime.
- What is /lustre/aoc/ftp? And /lustre/cv/ftp?
Those are the ftp areas for ftp.aoc.nrao.edu and ftp.cv.nrao.edu, respectively; each is the same area as /home/ftp. If you're writing data to your ftp area for others to access from outside the NRAO, you can refer to it as /lustre/aoc/ftp or /lustre/cv/ftp. This has a slight performance advantage over /home/ftp, since the latter uses NFS.
- Why /lustre/aoc?
To differentiate the Lustre file system at the Science Operations Center in Socorro from the EVLA Widar output area at /lustre/evla, as well as from the Lustre file systems at CV and GB.
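The capacity answers above note that each added OSS contributes roughly 5GB/s of aggregate bandwidth. A back-of-the-envelope sketch of that scaling (the starting figure below is illustrative, not an inventory of the actual NRAO systems):

```python
# Aggregate bandwidth grows linearly with OSS count: roughly 5GB/s per
# additional storage node, per the figures quoted in this document.
PER_OSS_GBPS = 5.0

def aggregate_after_adding(current_gbps: float, new_nodes: int) -> float:
    """Estimated aggregate bandwidth after adding `new_nodes` OSS nodes."""
    return current_gbps + new_nodes * PER_OSS_GBPS

# e.g. a ~10GB/s system grows to ~25GB/s after three more OSS nodes
print(aggregate_after_adding(10.0, 3))  # -> 25.0
```

This linear scaling is why adding storage nodes improves both capacity and total throughput at the same time.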
Lustre Disk Use Policy
The Lustre Disk Use Policy is only viewable from within the NRAO network.
Definitions
- OSS:
- Object Storage Server; consists of one or more OSTs and stores the actual block data.
- OST:
- Object Storage Target; the physical disks, consisting of one or more disks in a RAID configuration.
- MDS:
- Metadata Server; consists of the MDT and MGS, and stores file metadata (owner, timestamps, permissions, etc.).
- MDT:
- Metadata Target; the physical disk which contains the metadata.
- MGS:
- Management Server; stores the configuration information used by OSS, MDS, and client nodes.
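The definitions above nest: OSTs group under an OSS, and total filesystem capacity is the sum of the OST capacities. A toy model with made-up sizes (not actual NRAO inventory):

```python
# Toy model of the component hierarchy: each OSS hosts one or more OSTs
# (RAID arrays); the filesystem's capacity is the sum over all OSTs.
# Capacities in TB are invented purely for illustration.
oss_nodes = {
    "oss-1": {"ost-0": 90, "ost-1": 90},
    "oss-2": {"ost-2": 90, "ost-3": 90},
}

def total_capacity_tb(nodes: dict) -> int:
    """Sum OST capacities across all OSS nodes."""
    return sum(size for osts in nodes.values() for size in osts.values())

print(total_capacity_tb(oss_nodes))  # -> 360
```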