Lustre FAQ
Background
- What is Lustre?
- Lustre is a parallel distributed file system. Data is distributed across multiple RAID arrays hosted on multiple servers. The entire set of arrays is then presented to the client as a single coherent file system.
- What is its purpose?
- We're using Lustre mostly for performance. Local disks are limited by spindle speed and head seek time. AIPS and CASA should see significant improvements in performance on machines with high speed access to Lustre. Parallel variants of CASA or Obit will see even larger gains. Lustre performance scales with the number of arrays, so adding storage improves total throughput and storage capacity.
Usage
- Where should I write data?
- Observers with accounts like nm-4386 or cv-4386 should write data to their home account (~/) as it is already on Lustre. Other users should write data to an area like /lustre/aoc/observers/<username> or /lustre/naasc/users/<username>. Contact your local helpdesk to create such areas.
- Can I write anything to Lustre?
- Technically yes, but in practice you shouldn't. Lustre is not designed for general storage. It's intended for short-term storage to facilitate data reduction of VLA/VLBA/ALMA or other instrument data.
- How do I set up AIPS to see Lustre?
- See AIPS on Lustre. It involves having a personal .dadevs.always file in your home account (~/) that tells AIPS to look at the central AIPS configuration files and then append your Lustre area. That way you see the local system disks plus your Lustre area whether you're on your desktop, a public machine or the eventual cluster.
- Why can I see my Lustre AIPS area on machine-A but not on machine-B?
- Most likely because the second machine is not set up to support Lustre. Contact your local helpdesk to get a different machine or Lustre access added to the machine you are using.
Access
- Can I access Lustre from any machine?
- No, only machines configured for Lustre can access it. It is not accessible on all Linux machines, and it is currently unavailable for Windows and OS X machines.
- How do I get access from my Linux desktop?
- Contact your local helpdesk. They will need to install kernel modules and a 10Gb/s network card on your system.
- Is there any difference between accessing Lustre from my machine versus cluster or public machines?
- Superficially no. All machines see exactly the same filesystem. Some machines, notably the clusters, will have much faster access.
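One way to tell whether a particular Linux machine has Lustre available is to look for file systems of type lustre in its mount table. The following is a minimal sketch and not an official tool: the function names are made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, file system type, options). It does not resolve symbolic links, so a home directory that merely links into Lustre would need to be resolved first.

```python
def lustre_mounts(mounts_text):
    """Return mount points whose file system type is 'lustre'.

    mounts_text is the content of a Linux mount table such as /proc/mounts,
    where each line reads: device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "lustre":
            points.append(fields[1])
    return points


def path_on_lustre(path, mounts_text):
    """True if path equals or lies below one of the Lustre mount points."""
    path = path.rstrip("/") or "/"
    for point in lustre_mounts(mounts_text):
        point = point.rstrip("/") or "/"
        if point == "/" or path == point or path.startswith(point + "/"):
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            table = handle.read()
    except OSError:
        table = ""  # not a Linux machine, or the mount table is unreadable
    print("Lustre mount points:", lustre_mounts(table) or "none found")
```

If the script reports no Lustre mount points, the machine is not configured for Lustre and the helpdesk will need to set it up.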
Performance
- How fast is Lustre?
- It depends. The aggregate bandwidth of the NMASC Lustre file system is about 22,000MB/s (megabytes per second). A single cluster node with an InfiniBand connection can sustain about 3,000MB/s writes and 2,000MB/s reads. A single process on a single cluster node can sustain about 570MB/s writes and 500MB/s reads. However, performance decreases as the available space in Lustre decreases.
- What causes the various limits?
- The aggregate rate is limited by the combined throughput of the RAID arrays. The NMASC Lustre consists of 24 arrays each capable of around 350MB/s. The 700MB/s client rate is limited by its ability to reassemble network replies from multiple storage nodes. The per-process rate is limited by individual disk array speeds and/or packet reassembly overhead.
- Why can't I get that speed on my desktop?
- Some desktops are limited to 100MB/s by their gigabit network, which is further limited by contention for inter-floor and inter-switch connections. We have deprecated such systems in favor of installing 10Gb/s cards, which should be able to sustain 500MB/s to 700MB/s.
- How does that compare to local disk?
- A modern local disk can sustain around 110MB/s for a single task, but performance drops off dramatically as multiple processes contend for access. In some cases, remote access to Lustre will be faster than local disk despite network limits. Desktops, cluster nodes and public machines with high-speed access will see much better performance from Lustre than from local disk.
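To put the rates above in perspective, the following sketch converts them into the time needed to read a data set once. It is only an illustration: the 1TB data set size is hypothetical, the helper names are invented, and the calculation assumes a sustained rate with no contention and decimal units (1GB = 1000MB).

```python
def transfer_seconds(size_gb, rate_mb_per_s):
    """Seconds to move size_gb gigabytes at a sustained rate in MB/s."""
    return size_gb * 1000.0 / rate_mb_per_s


def link_ceiling_mb_per_s(gigabits_per_s):
    """Theoretical ceiling of a network link in MB/s.

    Uses 8 bits per byte and ignores protocol overhead, so real
    throughput will be lower.
    """
    return gigabits_per_s * 1000.0 / 8.0


# Sustained rates quoted in this FAQ, in MB/s.
RATES_MB_PER_S = {
    "gigabit desktop": 100,
    "local disk, single task": 110,
    "single process on a cluster node (read)": 500,
    "10Gb/s desktop (upper figure)": 700,
}

if __name__ == "__main__":
    size_gb = 1000  # hypothetical 1TB data set
    for label, rate in RATES_MB_PER_S.items():
        hours = transfer_seconds(size_gb, rate) / 3600
        print(f"{label:42s} {rate:5d} MB/s  {hours:5.2f} h")
    print("gigabit link ceiling:", link_ceiling_mb_per_s(1), "MB/s")
    print("10Gb/s link ceiling: ", link_ceiling_mb_per_s(10), "MB/s")
```

The link ceilings (125MB/s for gigabit, 1250MB/s for 10Gb/s) show why a gigabit desktop tops out near 100MB/s in practice, while a 10Gb/s card leaves room for the 500MB/s to 700MB/s quoted above.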
Capacity and Data Retention
- How large is the Lustre storage?
- As of Dec. 2017, NMASC's Lustre filesystem is 1.4PB and NAASC's Lustre filesystem is 819TB. As we need more performance or capacity we will add storage nodes. Each node will add about 176TB of storage depending on disk size.
- Is the data backed up?
- No, it's not practical to back up a file system this size. All data is presumed to be transient, so data loss is annoying but not disastrous. We do not expect data loss but failure can happen. Users should back up critical data, CASA tables, images etc.
- How long can I leave data on the Lustre file system?
- Currently there is no limit. Bear in mind that Lustre is designed for performance; it is not intended for mass, long-term storage. We are implementing quotas to keep disk usage in check.
- Is there a limit to how much data I can store on Lustre?
- Yes. We are beginning to implement quotas as well as monitor disk usage. Observers in /lustre/aoc/observers or /lustre/naasc/observers are limited to 5TB. Staff scientists in /lustre/aoc/users and /lustre/naasc/users are also limited to 5TB.
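To get a rough idea of how close an area is to the 5TB limit, the apparent size of the files under it can be summed and compared with the quota. This is a sketch only: the function names are hypothetical, it assumes 5TB means 5 x 1000^4 bytes, and it adds up file sizes rather than the usage the Lustre servers actually account for. Where quotas are enabled, Lustre's own lfs quota command reports the enforced figures and is the better source. Walking a very large directory tree can also take a long time.

```python
import os

# Assumption: the 5TB limit is counted in decimal terabytes.
QUOTA_BYTES = 5 * 1000**4


def directory_bytes(top):
    """Sum the apparent sizes of all files below top, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(top):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def quota_fraction(used_bytes, quota_bytes=QUOTA_BYTES):
    """Fraction of the quota consumed (1.0 means the limit is reached)."""
    return used_bytes / quota_bytes


if __name__ == "__main__":
    area = os.path.expanduser("~")  # replace with your own Lustre area
    used = directory_bytes(area)
    print(f"{area}: {used / 1000**4:.3f} TB, {quota_fraction(used):.1%} of 5TB")
```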
Miscellaneous
- How stable is Lustre?
- Lustre has been running at the EVLA and AOC in production and test modes since 2010. So far we've had no real problems. All components must be functioning for the file system as a whole to function. We may have to implement periodic regular maintenance as the server and client base grows.
- What's the most frequent failure?
- We average approximately one failed Lustre hard drive every two months. The drives are hot-swappable and cause no downtime.
- What is /lustre/aoc/ftp?
- That's the ftp area. It's the same as /home/ftp. If you're writing data to your ftp area for others to access from outside the NRAO you can refer to it as /lustre/aoc/ftp. This has a slight performance advantage over /home/ftp since the latter uses NFS.
- Why /lustre/aoc?
- To differentiate the Lustre file system in the Science Operations Center in Socorro from the EVLA WIDAR output area at /lustre/evla as well as Lustre file systems at CV and GB.