Preventing and Recovering from Disk Full Errors
It is important to monitor the disk usage of Geode members. If a member lacks sufficient disk space for a disk store, the member attempts to shut down the disk store and its associated cache, and logs an error message. A shutdown due to a member running out of disk space can cause loss of data, data file corruption, log file corruption and other error conditions that can negatively impact your applications.
After you make sufficient disk space available to the member, you can restart the member.
You can prevent disk full errors using the following techniques:
- If you are using the ext4 file system, we recommend that you pre-allocate disk store files and disk store metadata files. Pre-allocation reserves disk space for these files and leaves the member in a healthy state when the disk store and regions are shut down, allowing you to restart the member once sufficient disk space has been made available. Pre-allocation is enabled by default.
- Configure the disk usage thresholds (disk-usage-warning-percentage and disk-usage-critical-percentage) for the disk. By default, the warning threshold is 90% and the critical threshold, at which the cache is shut down, is 99%.
- Follow the recommendations in Optimizing a System with Disk Stores for general disk management best practices.
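To illustrate how the two thresholds relate to the space used on a volume, the following is a minimal standalone sketch that uses only the Java standard library. It is not Geode's internal check. The class name DiskUsageCheck, its methods, and the choice to treat a value exactly at a threshold as having reached it are assumptions made for illustration. The 90 and 99 values mirror the defaults described above.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Geode: reports how full a volume is
// relative to a warning threshold and a critical threshold.
public class DiskUsageCheck {

    public enum Level { OK, WARNING, CRITICAL }

    // Percentage of the volume's total capacity that is currently in use.
    public static float usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0f; // some file systems report no capacity; treat as unknown
        }
        return 100f * (totalBytes - usableBytes) / totalBytes;
    }

    // Assumption: a value equal to a threshold counts as having reached it.
    public static Level classify(float usedPercent, float warningPercent, float criticalPercent) {
        if (usedPercent >= criticalPercent) {
            return Level.CRITICAL;
        }
        if (usedPercent >= warningPercent) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    public static void main(String[] args) throws IOException {
        // Directory to inspect; defaults to the current working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(dir);
        float used = usedPercent(store.getTotalSpace(), store.getUsableSpace());
        System.out.printf("%s: %.1f%% used, level %s%n",
                dir.toAbsolutePath(), used, classify(used, 90f, 99f));
    }
}
```

Running the class with a disk store directory as its argument prints the usage of the volume that holds that directory. A check like this could be scheduled outside the member to raise an alert before the warning threshold is reached.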
When a disk write fails due to disk full conditions, the member is shut down and removed from the cluster.
Recovering from Disk Full Errors
If a member of your cluster fails due to a disk full error condition, add or make additional disk capacity available and attempt to restart the member normally. If the member does not restart and there is a redundant copy of its regions in a disk store on another member, you can restore the member using the following steps:
- Delete or move the disk store files from the failed member.
- Use the gfsh show missing-disk-stores command to identify any missing data. You may need to manually restore this data.
- Revoke the missing disk stores using the revoke missing-disk-store gfsh command.
- Restart the member.
See Handling Missing Disk Stores for more information.
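The first recovery step, moving the disk store files off the failed member's volume, can be sketched with the Java standard library as shown below. This is an illustrative sketch, not a Geode utility. The class name MoveDiskStoreFiles and both directory arguments are hypothetical, and the sketch assumes that the source directory is dedicated to the disk store, so every regular file in it belongs to that disk store. Choose a holding directory on a volume with free space, and run this only while the member is stopped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Geode: moves every regular file out of a
// disk store directory into a holding directory so the files are preserved.
public class MoveDiskStoreFiles {

    // Returns the number of files moved. Subdirectories are left in place.
    public static int moveAside(Path diskStoreDir, Path holdingDir) throws IOException {
        Files.createDirectories(holdingDir);

        // Collect the file list first so the directory is not modified
        // while it is still being listed.
        List<Path> files;
        try (Stream<Path> entries = Files.list(diskStoreDir)) {
            files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path file : files) {
            Files.move(file, holdingDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: MoveDiskStoreFiles <disk-store-dir> <holding-dir>");
            return;
        }
        int moved = moveAside(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Moved " + moved + " file(s) to " + args[1]);
    }
}
```

Moving the files rather than deleting them keeps a copy available in case some data has to be restored manually after the missing disk stores are revoked.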