Maintaining Cache Consistency
Maintaining data consistency between caches in a distributed Geode system is vital for ensuring its functional integrity and preventing data loss.
General Guidelines
Before Restarting a Region with a Disk Store, Consider the State of the Entire Region
Note: If you revoke a member’s disk store, do not restart that member with its disk stores—in isolation—at a later time.
Geode stores information about your persisted data and prevents you from starting a member with a revoked disk store in the running system. But Geode cannot stop you from starting a revoked member in isolation, and running with its revoked data. This is an unlikely situation, but it is possible to do:
- Members A and B are running, both storing Region data to disk.
- Member A goes down.
- Member B goes down.
- At this point, Member B has the most recent disk data.
- Member B is not usable. Perhaps its host machine is down or cut off temporarily.
- To get the system up and running, you start Member A, and use the command line tool to revoke Member B’s status as member with the most recent data. The system loads Member A’s data and you run forward with that.
- Member A is stopped.
- At this point, both Member A and Member B have information in their disk files indicating they are the gold copy members.
- If you start Member B, it will load its data from disk.
- When you start Member A, the system will recognize the incompatible state and report an exception, but by this point, you have good data in both files, with no way to combine them.
Understand Cache Transactions
Understanding the operation of Geode transactions can help you minimize situations where the cache could get out of sync.
Transactions do not work in distributed regions with global scope.
Transactions provide consistency within one cache, but the distribution of results to other members is not as consistent.
Multiple transactions in a cache can create inconsistencies because of read committed isolation. Since multiple threads cannot participate in a transaction, most applications will be running multiple transactions.
An in-place change to directly alter a key’s value without doing a put can result in cache inconsistencies. With transactions, it creates additional difficulties because it breaks read committed isolation. If at all possible, use copy-on-read instead.
In distributed-no-ack scope, two conflicting transactions in different members can commit simultaneously, overwriting each other as the changes are distributed.
If a cache writer exists during a transaction, then each transaction write operation triggers a cache writer’s related call. Regardless of the region’s scope, a transaction commit can invoke a cache writer only in the local cache and not in the remote caches.
A region in a cache with transactions may not stay in sync with a region of the same name in another cache without transactions.
Two applications running the same sequence of operations in their transactions may get different results. This could occur because operations happening outside a transaction in one of the members can overwrite the transaction, even in the process of committing. This could also occur if the results of a large transaction exceed the machine’s memory or the capacity of Geode. Those limits can vary by machine, so the two members may not be in sync.
Guidelines for Multi-Site Deployments
Optimize socket-buffer-size
In a multi-site installation using gateways, if the link between sites is not tuned for optimum throughput, it could cause messages to back up in the cache queues. If a queue overflows because of inadequate buffer sizes, it will become out of sync with the sender and the receiver will be unaware of the condition. You can configure the send-receive buffer sizes of the TCP/IP connections used for data transmissions by changing the socket-buffer-size attribute of the gateway-sender and gateway-receiver elements in the cache.xml
file. Set the buffer size by determining the link bandwidth and then using ping to measure the round-trip time.
When optimizing socket-buffer sizes, use the same value for both gateway senders and gateway receivers.
Prevent Primary and Secondary Gateway Senders from Going Offline
In a multi-site installation, if the primary gateway server goes offline, a secondary gateway sender must take over primary responsibilities as the failover system. The existing secondary gateway sender detects that the primary gateway sender has gone offline, and a secondary one becomes the new primary. Because the queue is distributed, its contents are available to all gateway senders. So, when a secondary gateway sender becomes primary, it is able to start processing the queue where the previous primary left off with no loss of data.
If both the primary gateway sender and all its secondary senders go offline and messages are in their queues, data loss could occur, because there is no failover system.
Verify That isOriginRemote Is Set to False
The isOriginRemote flag for a server or a multi-site gateway is set to false by default, which ensures that updates are distributed to other members. Setting its value to true in the server or the receiving gateway member applies updates to that member only, so updates are not distributed to peer members.