Pivotal GemFire® v8.0

Handling Forced Cache Disconnection

A GemFire member may be forcibly disconnected from a GemFire distributed system if the member is unresponsive for a period of time, or if a network partition separates one or more members into a group that is too small to act as the distributed system.

After being disconnected from a distributed system, a GemFire member automatically shuts down and then restarts into a "reconnecting" state, in which it periodically attempts to rejoin the distributed system. If the member succeeds in reconnecting, it rebuilds its view of the distributed system from existing members and receives a new distributed system ID.

When a locator is in the reconnecting state, it provides no discovery services for the distributed system. Therefore, if all locators in the system are in the "reconnecting" state, you may need to manually restart at least one locator for the distributed system to recover.

Note: Automatic reconnect is supported by members that you start with the gfsh utility or with the ServerLauncher or LocatorLauncher interface.

By default, a GemFire member tries to reconnect to the distributed system until it reaches the maximum number of attempts (the max-num-reconnect-tries property) or until it is told to stop by a call to the DistributedSystem.stopReconnecting() or Cache.stopReconnecting() method. You can configure the amount of time that GemFire waits between reconnection attempts by using the max-wait-time-reconnect property. You can disable automatic reconnection entirely by setting the disable-auto-reconnect GemFire property to "true".
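The three properties named above can be collected in a standard java.util.Properties object, as in the following minimal sketch. The class name ReconnectSettings, the method name, and the sample values are hypothetical, and the sketch assumes that max-wait-time-reconnect is expressed in milliseconds; check the GemFire property reference before relying on a unit or default.

```java
import java.util.Properties;

// Illustrative helper (hypothetical, not part of GemFire): gathers the
// reconnect-related properties named in this topic into one Properties object.
final class ReconnectSettings {
    private ReconnectSettings() {}

    static Properties reconnectProperties(int maxTries, long waitMillis,
                                          boolean disableAutoReconnect) {
        Properties props = new Properties();
        // Maximum number of reconnect attempts before the member gives up.
        props.setProperty("max-num-reconnect-tries", Integer.toString(maxTries));
        // Wait between attempts; the millisecond unit is an assumption here.
        props.setProperty("max-wait-time-reconnect", Long.toString(waitMillis));
        // "true" turns automatic reconnection off entirely.
        props.setProperty("disable-auto-reconnect", Boolean.toString(disableAutoReconnect));
        return props;
    }
}
```

The same keys can instead be written as plain key=value lines in the member's gemfire.properties file.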

You can use DistributedSystem and Cache callback methods to perform actions during the reconnect process, or to cancel the reconnect process if necessary.

The DistributedSystem and Cache APIs provide several methods that you can use to take action while a member is reconnecting to the distributed system:
  • DistributedSystem.isReconnecting() returns true if the member is in the process of reconnecting and recreating the cache after having been removed from the system by other members, or has shut down due to missing Roles and is reconnecting.
  • DistributedSystem.waitUntilReconnected(long, TimeUnit) waits for a period of time, and then returns a boolean value to indicate whether the member has reconnected to the DistributedSystem. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the member shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
  • DistributedSystem.getReconnectedSystem() returns the reconnected DistributedSystem.
  • DistributedSystem.stopReconnecting() stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
  • Cache.isReconnecting() returns true if the cache is attempting to reconnect to a distributed system.
  • Cache.waitForReconnect(long, TimeUnit) waits for a period of time, and then returns a boolean value to indicate whether the DistributedSystem has reconnected. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the cache shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
  • Cache.getReconnectedCache() returns the reconnected Cache.
  • Cache.stopReconnecting() stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
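The Cache methods listed above might be combined as in the following sketch, which waits a bounded time for the reconnect and cancels it if the wait times out. To keep the example self-contained, it is written against ReconnectingCache, a hypothetical stand-in interface that mirrors only the four method names from the list; it is not the GemFire Cache type, and the InterruptedException on the wait method is an assumption about the real signature.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the reconnect-related methods of Cache listed above.
interface ReconnectingCache {
    boolean isReconnecting();
    boolean waitForReconnect(long time, TimeUnit units) throws InterruptedException;
    ReconnectingCache getReconnectedCache();
    void stopReconnecting();
}

final class ReconnectHelper {
    private ReconnectHelper() {}

    /**
     * Waits up to the given time for the old cache to reconnect.
     * Returns the reconnected cache, or null if the wait timed out or was
     * interrupted. On timeout, any reconnect still in progress is cancelled.
     */
    static ReconnectingCache awaitReconnectOrGiveUp(ReconnectingCache old,
                                                    long timeout, TimeUnit unit) {
        try {
            if (old.waitForReconnect(timeout, unit)) {
                return old.getReconnectedCache();
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller; leave the reconnect running.
            Thread.currentThread().interrupt();
            return null;
        }
        if (old.isReconnecting()) {
            // Timed out: stop retrying so the member stays disconnected.
            old.stopReconnecting();
        }
        return null;
    }
}
```

Whether to cancel on timeout is a design choice of this sketch; an application that prefers to keep retrying can simply omit the stopReconnecting() call.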

Once the cache has reconnected, applications must fetch new references to the Cache, Regions, DistributedSystem, and other artifacts. The old references continue to throw cancellation exceptions such as CacheClosedException(cause=ForcedDisconnectException).
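One way an application might avoid holding stale references is to keep the current instance behind a small holder and swap it after a reconnect, as in the following sketch. RefreshableHandle is a hypothetical helper, not a GemFire class, and the sketch assumes that the supplier returns null when no reconnected instance is available yet.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder: application code always asks the handle for the
// current instance instead of caching a reference that may go stale.
final class RefreshableHandle<T> {
    private final AtomicReference<T> current;

    RefreshableHandle(T initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Returns whichever instance is current right now. */
    T get() {
        return current.get();
    }

    /**
     * Replaces the held reference with the reconnected instance, if there is one.
     * Returns true if the reference was swapped, false if the supplier gave null.
     */
    boolean refreshFrom(Supplier<? extends T> reconnected) {
        T fresh = reconnected.get();
        if (fresh == null) {
            return false; // nothing to swap in yet (assumed meaning of null)
        }
        current.set(fresh);
        return true;
    }
}
```

With a handle that holds a Cache, the swap could be written as handle.refreshFrom(oldCache::getReconnectedCache), after which Regions and other artifacts would be looked up again from handle.get().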

See the GemFire DistributedSystem and Cache Java API documentation for more information.