EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM

Issue: –  EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM

Details: –

What is a parity error?

There are two types of parity errors:

– soft parity errors – caused by an environmental disruption; these are usually one-time errors that do not recur on the device
– hard parity errors – caused by a physical malfunction; these recur constantly

For more detail on parity errors, see the background below:
Background

What is a processor or memory parity error?

Parity checking is the storage of an extra binary digit (bit) in order to represent the parity (odd or even) of a small amount of computer data (typically one byte) while that data is stored in memory. When the data is read back, the parity value recalculated from it is compared with the stored parity bit. If the two values differ, this indicates a data error: at least one bit has changed due to data corruption.
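The check described above can be sketched in a few lines of Python. This is only an illustration of even parity on a single byte; the function names are hypothetical and it does not model how the switch hardware implements the check. It also shows a known limit of a single parity bit: it only notices an odd number of flipped bits.

```python
def parity_bit(byte):
    """Return the even-parity bit for one byte: 1 if the count of 1-bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def has_parity_error(byte, stored_parity):
    """Recompute parity on read and compare it with the stored parity bit."""
    return parity_bit(byte) != stored_parity


original = 0b10110010              # four 1-bits, so the even-parity bit is 0
stored = parity_bit(original)      # written to memory alongside the data

single_flip = original ^ 0b00000100   # one bit flipped while in memory
double_flip = original ^ 0b00000110   # two bits flipped while in memory

print(has_parity_error(original, stored))      # False
print(has_parity_error(single_flip, stored))   # True
print(has_parity_error(double_flip, stored))   # False: an even number of flips goes unnoticed
```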

Within a computer system, electrical or magnetic interference from internal or external causes can cause a single bit of memory to spontaneously flip to the opposite state. This event makes the original data bits invalid and is known as a parity error.

Such memory errors, if undetected, may have no noticeable consequences, or they may cause permanent corruption of stored data or a machine crash.

There are many causes of memory parity errors, which are classified as either soft parity errors or hard parity errors.

Soft Errors

Most parity errors are caused by electrostatic or magnetic-related environmental conditions.

The majority of single-event errors in memory chips are caused by background radiation (such as neutrons from cosmic rays), electromagnetic interference (EMI), or electrostatic discharge (ESD). These events may randomly change the electrical state of one or more memory cells or may interfere with the circuitry used to read and write memory cells.

Known as soft parity errors, these events are typically transient or random and usually occur once. Soft errors can be minor or severe:

– Minor soft errors that can be corrected without component reset are single event upsets (SEUs).
– Severe soft errors that require a component or system reset are single event latchups (SELs).

Soft errors are not caused by hardware malfunction; they are transient and infrequent, are most likely an SEU, and are caused by an environmental disruption of the memory data.

If you encounter soft parity errors, analyze recent environmental changes that have occurred at the location of the affected system. Common sources of ESD and EMI that may cause soft parity errors include:

– Power cables and supplies
– Power distribution units
– Uninterruptible power supplies
– Lighting systems
– Power generators
– Nuclear facilities (radiation)
– Solar flares (radiation)

Hard Errors

Other parity errors are caused by a physical malfunction of the memory hardware or by the circuitry used to read and write memory cells.

Hardware manufacturers take extensive measures to prevent and test for hardware defects. However, defects are still possible; for example, if any of the memory cells used to store data bits are malformed, they may be unable to hold a charge or may be more vulnerable to environmental conditions.

Similarly, while the memory itself may be operating normally, any physical or electrical damage to the circuitry used to read and write memory cells may also cause data bits to be changed during transfer, which results in a parity error.

Known as hard parity errors, these events are typically very frequent and repeated and occur whenever the affected memory or circuitry is used. The exact frequency depends on the extent of the malfunction and how frequently the damaged equipment is used.

Remember that hard parity errors are the result of a hardware malfunction and reoccur whenever the affected component is used.

If you encounter hard parity errors, analyze physical changes that have occurred at the location of the affected system. Common sources of hardware malfunction that may lead to hard parity errors include:

– Power surges (no ground)
– ESD
– Overheating or cooling
– Incorrect or partial installation
– Component incompatibility
– Manufacturing defect

How to identify the module:

%EARL-SW2_STBY-1-EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM  -> Standby SUP in Switch 2 (VSS)

%EARL-DFC4-1-EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM  -> Module 4

%EARL-SW1_DFC2-1-EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM  -> Module 2 in Switch 1
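The mapping from the message prefix to the module can be sketched as a small parser. The regular expression below is inferred only from the three sample messages above and is an assumption; other facility strings (for example, from an active supervisor) are not covered and simply return None. The function name `locate_module` is hypothetical.

```python
import re

# Pattern inferred from the three sample messages only (an assumption);
# other platforms or releases may format the facility field differently.
EARL_PATTERN = re.compile(
    r"%?EARL-(?:SW(?P<switch>\d+)_)?(?P<unit>STBY|DFC(?P<module>\d+))"
    r"-\d+-EXCESSIVE_PARITY_ERROR"
)


def locate_module(log_line):
    """Return a human-readable location for the module named in the log line."""
    match = EARL_PATTERN.search(log_line)
    if match is None:
        return None  # format not covered by this sketch
    if match.group("unit") == "STBY":
        where = "Standby SUP"
    else:
        where = f"Module {match.group('module')}"
    if match.group("switch"):
        where += f" in Switch {match.group('switch')}"
    return where


samples = [
    "%EARL-SW2_STBY-1-EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM",
    "EARL-DFC4-1-EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM",
    "%EARL-SW1_DFC2-1-EXCESSIVE_PARITY_ERROR: EARL 0: Parity error detected in VRAM",
]
for line in samples:
    print(locate_module(line))
# Standby SUP in Switch 2
# Module 4
# Module 2 in Switch 1
```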

Solution: – Re-seat the affected module and monitor it for 48 hours. If the error recurs, replace the module.
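The monitoring rule can be expressed as a small decision helper. This is a minimal sketch, assuming you already have the timestamps of parity-error log entries for the module; the function name and the returned strings are hypothetical, and only the 48-hour window comes from the solution above.

```python
from datetime import datetime, timedelta

MONITOR_WINDOW = timedelta(hours=48)  # monitoring period from the solution above


def next_action(reseated_at, error_times, now):
    """Suggest a next step from parity-error timestamps seen after a re-seat."""
    window_end = reseated_at + MONITOR_WINDOW
    recurred = any(reseated_at < t <= window_end for t in error_times)
    if recurred:
        return "replace module"        # error repeated: treat as a hard error
    if now < window_end:
        return "keep monitoring"       # window still open, nothing seen yet
    return "no recurrence within 48 hours"


reseat = datetime(2024, 1, 1, 9, 0)    # example values, not from a real device
print(next_action(reseat, [reseat + timedelta(hours=5)], reseat + timedelta(hours=6)))
print(next_action(reseat, [], reseat + timedelta(hours=6)))
print(next_action(reseat, [], reseat + timedelta(hours=49)))
```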

SATCTRL-FEX-4-SOHMS_DIAG_WARN

Issue:  SATCTRL-FEX108-4-SOHMS_DIAG_WARN error on Nexus 5K

Device affected

Nexus 2K

Issue details

%SATCTRL-FEX108-4-SOHMS_DIAG_WARN: FEX-108 Module 1: Runtime diag detected minor event: Correctable ECC errors <dev=0, count=1>

A single bit stored in memory is being corrupted, but it is corrected each time before it can cause a problem. This may be a hardware fault with no current impact, or it may be a transient issue. If it is transient, a reload might clear it; otherwise, the FEX must be replaced. A single-bit error that is corrected every time should have no impact on the system beyond the log messages it generates. If the condition worsens and affects multiple bits, however, the errors are no longer correctable, and the FEX module reloads.
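To make the difference between a correctable single-bit error and an uncorrectable multi-bit error concrete, here is a minimal sketch of a textbook single-error-correcting, double-error-detecting code (extended Hamming on 4 data bits). The ECC scheme actually used in the FEX is not stated here, so this is an assumption chosen purely for illustration; the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # makes the total number of 1-bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2              # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1              # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None     # even number of flips: detected, not fixable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_bit = list(word)
one_bit[6] ^= 1                          # single-bit upset
two_bit = list(one_bit)
two_bit[2] ^= 1                          # a second bit flips as well

print(decode(word))      # ('ok', 11)
print(decode(one_bit))   # ('corrected', 11)
print(decode(two_bit))   # ('uncorrectable', None)
```

The single-bit case corresponds to the "Correctable ECC errors" log entry: the data is repaired and only a message is generated. The two-bit case is detected but cannot be repaired, which is the situation where a reload would be expected.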

Cases requiring RMA – multiple errors

Sat xx  2 01:17:53 2017@498402 (112/212/0x0): FEX-111 Module 1: Runtime diag detected minor event: Correctable ECC errors <dev=0, count=3>

Sat xx  2 06:17:53 2017@513576 (112/212/0x0): FEX-111 Module 1: Runtime diag detected minor event: Correctable ECC errors <dev=0, count=1>

This error means that a single-bit ECC (error correction) fix was made on FEX memory. It is harmless because the hardware was able to correct the memory error via ECC, so the FEX does not reboot. ECC is the memory protection in the FEX; a corrected error means that there was a problem, but ECC compensated for it.
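To judge whether the messages are recurring on a given FEX, the log lines can be tallied per FEX. This is a minimal sketch; the regular expression is inferred only from the sample messages in this section, and the function name is hypothetical. No threshold for replacement is implied beyond "recurring".

```python
import re
from collections import defaultdict

# Pattern inferred from the sample log lines in this section (an assumption).
ECC_PATTERN = re.compile(
    r"FEX-(?P<fex>\d+) Module (?P<module>\d+): Runtime diag detected minor event: "
    r"Correctable ECC errors <dev=(?P<dev>\d+), count=(?P<count>\d+)>"
)


def ecc_counts_by_fex(lines):
    """Map each FEX number to the list of 'count' values seen, in log order."""
    counts = defaultdict(list)
    for line in lines:
        match = ECC_PATTERN.search(line)
        if match:
            counts[int(match.group("fex"))].append(int(match.group("count")))
    return dict(counts)


logs = [
    "%SATCTRL-FEX108-4-SOHMS_DIAG_WARN: FEX-108 Module 1: Runtime diag detected "
    "minor event: Correctable ECC errors <dev=0, count=1>",
    "Sat xx  2 01:17:53 2017@498402 (112/212/0x0): FEX-111 Module 1: Runtime diag "
    "detected minor event: Correctable ECC errors <dev=0, count=3>",
    "Sat xx  2 06:17:53 2017@513576 (112/212/0x0): FEX-111 Module 1: Runtime diag "
    "detected minor event: Correctable ECC errors <dev=0, count=1>",
]
print(ecc_counts_by_fex(logs))   # {108: [1], 111: [3, 1]}
```

A FEX with several entries, such as FEX-111 here, is the recurring case.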

Action Plan: –

Replace the Nexus 2K if the logs keep recurring.