- ABSTRACT
ABSTRACT
This program was primarily developed for use in those
cases where a customer has an operating system that does not
have the DSA Bad Block Replacement capability already in it.
Several operating systems already have a method of doing DSA
Bad Block Replacement (BBR) and would only rarely need the
services of this program (VMS is one of many). However, there
are cases where even with an operating system that has the BBR
capability, the need to replace a specific LBN, or allow this
program to search for bad blocks, may be appropriate.
In those cases where a customer is running an operating
system that does not have the BBR capability, bad blocks may
develop and show up as ECC errors in the system error log.
When this occurs, this program can be used to replace those bad
blocks and eliminate the logging of ECC errors related to
specific bad block(s). Bad blocks are known to become worse
with time. What starts out as a 6-symbol ECC error may
eventually become an uncorrectable ECC error if action is not
taken to replace that block.
- OPERATIONAL_CONCEPT
OPERATIONAL CONCEPT
In order to verify that an LBN indeed needs replacement,
this program (like all other operating systems with BBR
capability) tries to "verify" that the bad block report is not
a transient error and in fact relates to a defect in the media.
If every report of a bad block (such as an ECC error) caused
replacement without question, there would be a substantial
possibility of replacing blocks that have no defect. Errors
are often caused by the environment (static, electrical
problems, etc.). A block whose reported error is not related
to a media defect should not be replaced. Thus, an attempt is
made to verify that a bad block report can be duplicated, which
makes a media defect the more probable cause. This is the way
it happens in the most usual case:
1. This program uses the MSCP ACCESS command to do large Byte
count transfers (Read) from the drive being tested. It
starts with LBN 0 and progresses to the highest user LBN on
the drive. The ACCESS command moves data from the drive to
the controller, where the data is checked for any errors.
The data IS NOT actually moved to host memory. This way we can
read data at the fastest possible rate and thereby make it
possible for more passes to be made across the media in the
shortest possible amount of time.
2. If an error (let's say an 8-symbol ECC error) is detected by
the controller, a flag called the Bad Block Reported flag is
set. This flag is reported to this program in what is called
an "end packet". This packet describes the
success or failure of the ACCESS command that we gave the
controller. The end packet also provides the LBN address
that caused the error.
3. When this program sees the Bad Block Reported flag, it
takes the LBN address that caused the flag to be set and
issues a series of READ commands to that LBN. A series of 32
reads is attempted. If a second Bad Block report occurs, the
program terminates the testing and replaces the block.
4. If 32 re-READs are accomplished without error or bad block
report, then the error is considered a transient and
replacement is not attempted.
5. HOWEVER, if it is considered a transient error, the LBN
address is logged in a memory table called the transient
error table. If, on a subsequent pass across the media,
this same LBN address reports an error or Bad Block, the
LBN is replaced without argument and without any further
testing. In other words, if we have a transient error and
then later this same LBN reports another error, it is
replaced.
THAT IS WHY RUNNING THIS PROGRAM FOR MULTIPLE PASSES IS
BENEFICIAL. The transient error table is kept in memory and is
kept updated as long as the program is running. If the program
is terminated, the table and any entries in it are lost. The
benefit of this table is that a marginal bad block, one that
does not show an error each time it is read, can be replaced.
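The decision made for each bad block report can be sketched as
follows. This is only an illustration of the steps described
above, not the program's actual code. The function name, the
"read_reports_bad_block" callback (a stand-in for one MSCP READ
of the LBN) and the use of a Python set for the transient error
table are all assumptions made for the example.

```python
from typing import Callable, Set

REREAD_COUNT = 32  # number of re-READs attempted, as described in step 3


def handle_bad_block_report(
    lbn: int,
    read_reports_bad_block: Callable[[int], bool],
    transient_table: Set[int],
) -> str:
    """Decide what to do with an LBN that just raised Bad Block Reported.

    Returns "replace" or "transient".
    read_reports_bad_block is a hypothetical stand-in for one READ of
    the LBN; it returns True when a bad block is reported again.
    """
    # Step 5: an LBN already logged as transient is replaced
    # without any further testing.
    if lbn in transient_table:
        return "replace"

    # Step 3: re-read the LBN up to 32 times; a second report
    # ends the testing and the block is replaced.
    for _ in range(REREAD_COUNT):
        if read_reports_bad_block(lbn):
            return "replace"

    # Steps 4 and 5: all re-reads were clean, so the error is treated
    # as transient and the LBN is remembered for later passes.
    transient_table.add(lbn)
    return "transient"
```

Because the table only lives as long as the set passed in, the
sketch also shows why a second pass in the same run can replace
a block that the first pass left alone.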
- ATTACH
! For VAX 11/780
ATT DW780 HUB DWn 3 4 ! attaches the DW780
! For VAX 11/750
ATT DW750 HUB DWn ! attaches the DW750
! For VAX 11/730
ATT DW730 HUB DWn ! attaches the DW730
! For VAX 8200
ATT DWBUA HUB DWn 0 4 ! attaches the DWBUA
! For VAX 8800
ATT NBIA HUB NBIAm n ! attaches the BI
ATT NBIB NBIAn NBIBm N M ! attaches the BI adapter
ATT DWBUA NBIBm DWn 0 4 ! attaches the DWBUA
ATT KDB50 HUB DUa ! attaches the KDB50 for 8200
ATT KDB50 DBIBm DUa 5 ! attaches the KDB50 for 8800
ATT UDA50 DWn DUa 772150 154 5 2 ! attaches the UDA50
ATT RA80 DUa DUan ! attaches the RA80
ATT RA81 DUa DUan ! attaches the RA81
ATT RA60 DUa DJan ! attaches the RA60
ATT RA82 DUa DUan ! attaches the RA82
- DEVICES
Devices Supported by this program
EVRLK supports the UDA50 and KDB50 controllers and the RA60, RA80,
RA81, and RA82 disk drives. The UDA50 may be attached to the VAX-11 UNIBUS
adapter (DW780, DW750, DW730), or the VAX-BI UNIBUS adapter (DWBUA).
The KDB50 must be attached to the HUB (for VAX 8200), or the NBIB (VAX 8800).
- KDB50
Device: KDB50
Link: HUB or DBIBn
Generic name: DUa
Additional information:
BI Node Number (HEX) [hex 0-f] ? <n>
BR [4-7] ? <5>
- RA60
Device: RA60
Link: DUa
Generic name: DJan
- RA80
Device: RA80
Link: DUa
Generic name: DUan
- RA81
Device: RA81
Link: DUa
Generic name: DUan
- RA82
Device: RA82
Link: DUa
Generic name: DUan
- UDA50
Device: UDA50
Link: DWn
Generic name: DUa
Additional information:
UNIBUS address (UDAIP): [octal 760000-777774] <772150>
UNIBUS vector: [octal 4-774] <154>
UNIBUS BR level: [decimal 4-7] <5>
UNIBUS bandwidth (Burst Rate) [decimal 0-63] <0>
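
The ranges listed above can be checked before the ATTACH command
is typed. The following is a minimal sketch that tests values
against those ranges only. The function name and its return
format are made up for the example, and the diagnostic
supervisor may apply further checks that are not covered here.

```python
def check_uda50_parameters(address_octal, vector_octal, br_level, burst_rate):
    """Return a list of range problems for the UDA50 attach parameters.

    The address and vector are passed as octal strings, the way they
    are typed at the prompt; BR level and burst rate are decimal ints.
    """
    problems = []
    address = int(address_octal, 8)   # UDAIP address is entered in octal
    vector = int(vector_octal, 8)     # vector is entered in octal
    if not 0o760000 <= address <= 0o777774:
        problems.append("UNIBUS address outside octal 760000-777774")
    if not 0o4 <= vector <= 0o774:
        problems.append("UNIBUS vector outside octal 4-774")
    if not 4 <= br_level <= 7:
        problems.append("UNIBUS BR level outside decimal 4-7")
    if not 0 <= burst_rate <= 63:
        problems.append("UNIBUS burst rate outside decimal 0-63")
    return problems
```

With the default values shown in the prompts (772150, 154, 5, 0)
the returned list is empty.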
- END_OF_PASS_SUMMARY
RUN SUMMARY DISPLAYED AT THE END OF PASS
The operator is notified of End of Pass, for the drive
being tested, by the following:
1. EOM reached. Elapsed runtime hh:mm:ss
The End of Media is reached in Automatic
replacement mode when the highest LBN on the media
has been tested.
2. EOF detected.
The End of File is reached in Manual replacement
mode when the operator enters a null LBN for
replacement (CARRIAGE RETURN).
When End Of Pass (EOP) is reached in either Automatic or
Manual mode, the following pass summary is displayed for the
drive being tested. This summary will also be displayed by the
operator "PRINT" command at the diagnostic supervisor prompt.
REPLACEMENT INFORMATION FOR THIS PASS
BAD BLOCKS reported: ddd.
ECC error detected: ddd.
FORCED ERRORs detected: ddd.
FORCED ERRORs written: ddd.
PRIMARY replacement: ddd.
SECONDARY replacements: ddd.
RBNs marked unusable: ddd.
Total replacements made: ddd.
The entries should be self-explanatory in light of the other
help available. The bottom line indicates the total number of
replacements that were made, which should not exceed a few
hundred unless something more is wrong with the drive. The
number of Forced Errors written indicates the number of
Uncorrectable ECC (ECH) errors that were encountered. If any
forced errors were written or detected, be sure to restore the
customer data from the back-up that was made before you
started execution of this program.
If excessive replacements were made, read RA81-TT-12 for a
possible explanation and recovery procedure.
- ERRORS
Error reporting:
This program will report all fatal System, controller,
MSCP, BBR and Error log errors. SA register contents, MSCP Status
and current LBN are displayed. After displaying the error
information, the program will drop the drive or terminate.
If you see any of the indicated errors reported, you
should discontinue the use of this program and use the other
diagnostics, or system error logger, to better isolate any
problems. Some of the status codes (HELP EVRLK STATUS_CODES)
could be taken for errors, so be sure you know what the program
is trying to tell you.
There is always a possibility of a drive/controller generating
a hard error that will cause an incomplete replacement. If you
get a hard error, you should run at least one pass of this
program, in automatic mode, to take care of any incomplete
replacements and any other bad blocks.
Listed below are the categories of hard errors. You can
request help for the category you are interested in and get a
list of the errors within that category. Also available are
decoding charts that provide the meanings of many of the entries
in an error report (such as a chart of the Status/Event Codes:
"HELP EVRLK ERRORS STATUS_EVENT").
- MSCP_ERRORS
MSCP command/response errors:
The following sections list the possible errors reported
that indicate the failure of an MSCP command issued by the
host. Remember that a period after a number indicates that the
number is in Decimal. Many of these errors will display an
item called the "Endmsg status:" This stands for End Message
Status and provides detail on the error that occurred. The
meanings of the Endmsg status can be found by looking it up in
the list of Status/Event codes provided by typing: "HELP EVRLK
ERRORS STATUS_EVENT"
Also, an item called "Endmsg Flag" will be displayed for
many of the errors. The Endmsg Flags report additional status
to the host.
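
The trailing-period convention can be illustrated with a short
sketch for anyone transcribing values from an error report. The
function name is invented for the example, and the radix used
when no period is present is an assumption that the caller must
supply, since it differs from field to field (for instance, (X)
marks hex and (O) marks octal in the reports).

```python
def parse_displayed_number(text: str, default_base: int = 16) -> int:
    """Convert a number copied from an error report to an integer.

    A trailing period means the value is decimal; otherwise the
    caller-supplied radix (hex by default in this sketch) is used.
    """
    text = text.strip()
    if text.endswith("."):
        return int(text[:-1], 10)
    return int(text, default_base)
```

For example, an LBN shown as "891424." is read as decimal, while
a UNIBUS address such as "772150" would be parsed with a
default_base of 8.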
- SET_CHAR
Failed SET CONTROLLER CHARACTERISTICS:
"SET CONTROLLER CHARACTERISTICS" is an MSCP command that
is used by the host to set certain controller characteristics
(Like timeouts and to enable various kinds of messages). This
error occurs when command completion status (Endmsg status)
indicates a problem executing this command. This type of error
will be reported in the following manner:
FAILED SET CONTROLLER CHARACTERISTICS;
Host access timeout not disabled.
Default timeout is 60 seconds.
Error logging is not enabled.
Endmsg status: xxx
The default host access timeout is retained, and error
logging will not be enabled. Errors are reported via the "End
Message" Status field. The meanings of the Endmsg status can be
found by looking it up in the list of Status/Event codes
provided by typing: "HELP EVRLK ERRORS STATUS_EVENT"
Processing of the specified unit (drive) will continue.
However, if this error message is seen, the user should
terminate the test and run it again. If the problem is
reported again, it should be investigated using other
diagnostics.
- READ
Failed READ:
This error occurs when an MSCP "READ" command fails with
an end message status (Endmsg status:) other than Success or
Data Error. The error is reported in the following manner:
Failed READ; LBN: dddddd.
Endmsg flags: xxx (X)
Endmsg status: xxx (X)
This error can be the result of many different conditions.
The host issued a "READ" command to the controller being used
for testing and then received an End Packet indicating that an
error was detected while trying to execute that command. The
LBN is the one the "READ" was attempting to access, at the time
the error occurred, as reported by the End Packet. The
meanings of the Endmsg status can be found by looking it up in
the list of Status/Event codes provided by typing: "HELP EVRLK
ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
- WRITE
Failed WRITE:
This error occurs when an MSCP "WRITE" command fails with
an end message status (Endmsg status:) other than Success or
Data Error. The error is reported in the following manner:
Failed WRITE; LBN: dddddd.
Endmsg flags: xxx (X)
Endmsg status: xxx (X)
This error can be the result of many different conditions.
The host issued a "WRITE" command to the controller being used
for testing and then received an End Packet indicating that an
error was detected while trying to execute that command. The
LBN is the one the "WRITE" was attempting to write at the time
the error occurred, as reported by the End Packet. The
meanings of the Endmsg status can be found by looking it up in
the list of Status/Event codes provided by typing: "HELP EVRLK
ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
- ACCESS
Failed ACCESS:
This error occurs when an MSCP "ACCESS" command fails with
an end message status (Endmsg status:) other than Success or
Data Error. PLEASE READ THIS COMPLETE ERROR DESCRIPTION. The
error is reported in the following manner:
Failed ACCESS; LBN: dddddd.
Endmsg flags: xxx (X)
Endmsg status: xxx (X)
This error can be the result of many different conditions.
The host issued an "ACCESS" command to the controller being
used for testing and then received an End Packet indicating
that an error was detected while trying to execute that
command. The LBN is the one the "ACCESS" was attempting to
read at the time the error occurred, as reported by the End
Packet. An ACCESS command is the MSCP command used by this
program to read all the LBNs on the media. The ACCESS command
causes a read type operation and checks for any error conditions
(Bad Block), but does not transfer any data to the host memory.
The purpose of this command is to verify that data can be
read without error. The meanings of the Endmsg status can be
found by looking it up in the list of Status/Event codes
provided by typing: "HELP EVRLK ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
- REPLACE
Failed REPLACE:
This error occurs when an MSCP "REPLACE" command fails
with an end message status (Endmsg status:) other than Success
or Data Error. The error is reported in the following manner:
Failed REPLACE; LBN: dddddd.
Endmsg flags: xxx (X)
Endmsg status: xxx (X)
This error can be the result of many different conditions.
The host issued a "REPLACE" command to the controller being
used for testing and then received an End Packet indicating
that an error was detected while trying to execute that
command. The LBN is the one the "REPLACE" was attempting to
use at the time the error occurred, as reported by the End
Packet. A REPLACE command is the MSCP command used by programs
that do bad block replacement to actually cause the bad LBN to
be logically replaced. The execution of the REPLACE command is
very involved and leaves little room for error. The header of
the bad LBN is written with special codes and the replacement
RBN is allocated to the bad LBN.
If this error is seen, the media should be reformatted using
the field formatter program. The failure of this command could
be due to a defect in the drive, other than media bad blocks.
Be sure that any other causes for a failure like this are
investigated and any appropriate repairs made before
attempting to reformat the media. A failure like this one is
one of the good reasons for using the formatter. However, be
sure not to use "RECONSTRUCT" mode of the formatter (See
RA81-TT-19). The meanings of the Endmsg status can be found by
looking it up in the list of Status/Event codes provided by
typing: "HELP EVRLK ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
- CMD_REF
Command/Response reference number mismatch:
In the Mass Storage Control Protocol (MSCP), each command
issued to the controller is given a "unique" reference number.
In this way, the host can distinguish this command, and any
responses, from any other commands that may be issued. When a
response to a command is received, the host attempts to
associate the "unique" reference number with a command that has
not yet received a response. If a response reference number
cannot be matched to a command with the same number (from
commands that have not had responses yet), then this error
occurs. This error is reported in the following manner:
Command/response reference number mismatch;
Command ref: xxxx
Endmsg ref: xxxx
Endmsg flags: xxx
Endmsg status: xxx
The user should find that the Command ref and Endmsg ref
do not match. Why this would happen is hard to say. The
source of this kind of trouble is not usually the drive being
tested. Rather, it is more likely that a controller or "system"
problem exists. The meanings of the Endmsg status can be found
by looking it up in the list of Status/Event codes provided by
typing: "HELP EVRLK ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
For this kind of error, you may find that the Endmsg flags
and Endmsg status do not indicate an error. In this case,
a "system" type error is the more likely cause.
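
The matching of a response to an outstanding command can be
pictured with the sketch below. It is an illustration of the
idea only. The dictionary of outstanding commands and the
function name are assumptions made for the example and do not
describe how this program stores its command context.

```python
from typing import Dict, Optional


def match_response(outstanding: Dict[int, str], endmsg_ref: int) -> Optional[str]:
    """Match an end message to a command that is still awaiting a response.

    outstanding maps each command reference number to a description of
    the command. On a match the entry is removed and returned; None
    means no outstanding command carries that reference number, which
    is the mismatch condition described above.
    """
    return outstanding.pop(endmsg_ref, None)
```

A result of None corresponds to the case where the Command ref
and the Endmsg ref in the report do not agree.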
- ENDCODE
Fatal Endcode detected:
An "end message" is the means by which the controller
tells the host how a command was processed and whether
any errors occurred while executing it. This error
indicates that a fatal status was reported by an end message.
The error is reported in the following manner:
Fatal endcode detected;
Endmsg endcode: xxxx
Since the error is reporting Fatal end message status, you
should find an "Endmsg status" code displayed. The
meanings of the Endmsg status can be found by looking it up in
the list of Status/Event codes provided by typing: "HELP EVRLK
ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
- VERIFY
Failed Replacement Verification:
After a "REPLACE" command has been issued and completed
with good results, the customer's data must be written to the
replacement block (RBN). At this point in time, the original
bad LBN has been written with a code that will cause all
references to the LBN address to be "revectored" to the
replacement block. The host issues a "WRITE" command,
with the compare modifier, and the customer data will end up in
the replacement block. If this process reports a problem, this
error message will result:
Failed Replacement verification; LBN: dddddd.
Endmsg flags: xxx (X)
Endmsg status: xxx (X)
The program assumes that the cause for the error is that
the replacement block is bad. If this is the case, a retry
of the replacement process takes place. This time another
replacement block is used and the original one is marked
unusable. The meanings of the Endmsg status can be found by
looking it up in the list of Status/Event codes provided by
typing: "HELP EVRLK ERRORS STATUS_EVENT"
The Endmsg Flags report additional status to the host.
The meanings of the Endmsg Flags can be found by typing: "HELP
EVRLK ERRORS COMMAND_RESPONSE_ERRORS ENDMSG_FLAGS"
- ENDMSG_FLAGS
End message flags:
Bit flags, collectively called end flags, are used to report
various conditions detected due to this command but not
directly related to success or failure. The following flags
are defined:
Bad Block Reported: (200 OCTAL -- Bit 7) (80 Hex)
Set if one or more bad blocks were detected and the
host is expected to perform bad block replacement.
Indicates that the host should replace the bad block
identified by the LBN provided.
Bad Blocks Unreported: (100 OCTAL -- Bit 6) (40 Hex)
Set if one or more bad blocks were detected and not
reported in the "first bad block" field of an End Message
Packet.
If the "Bad Block Reported" flag is also set, this
indicates that two or more bad blocks were detected, and
the host (This program) should perform bad block
replacement. In this case the "first bad block" (LBN:)
field only reports the first bad block in the transfer.
If this happens, the program should be run for multiple
passes.
Error Log Generated: (40 OCTAL -- Bit 5) (20 Hex)
Set if one or more error log messages were generated
that refer to this command -- i.e., that contain this
command's command reference number. This flag allows the
host to save any outstanding command context that it
wishes to include in the error report. The MSCP server
must send the error log messages either before or shortly
after it sends the end message containing this flag.
Depending upon the type of error log report referenced by
this flag, this program will determine whether to display
the error log report or not. In most cases, the Disk
Transfer Error Log report is used to report blocks that
need replacement. Displaying this error log report is
somewhat meaningless, since the program will make a
replacement due to having received this error log type.
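
As an aid to reading the "Endmsg flags:" value in an error
report, the three flags defined above can be decoded with a
short sketch. The masks are the ones listed in this section; the
function and table names are invented for the example.

```python
# Masks taken from the flag definitions above (bits 7, 6 and 5).
END_FLAGS = (
    (0x80, "Bad Block Reported"),
    (0x40, "Bad Blocks Unreported"),
    (0x20, "Error Log Generated"),
)


def decode_end_flags(flags: int) -> list:
    """Return the names of the end flags that are set in flags."""
    return [name for mask, name in END_FLAGS if flags & mask]
```

A value of C0 hex, for instance, decodes to both Bad Block
Reported and Bad Blocks Unreported, which is the case where
running multiple passes is recommended.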
- CTRLR_INIT
Controller initialization errors:
The following errors can be reported during controller
initialization. These are basically "HARD" serious errors and
usually indicate a very sick controller. This program, as with
any operating system, must first initialize the controller, to
prepare it for use. Any unexpected or error values read from the
controller register will be displayed.
- ADDRESS
Controller address error occurred:
This error occurs when the program attempts to read a
controller register and a memory access error occurs. This
can be caused by the operator specifying an incorrect
controller address. The specified Unit will be dropped from
testing. This type of error will be reported in the following
manner:
ADDRESS ERROR OCCURRED WHILE ACCESSING CONTROLLER REGISTER;
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
UDASA expected: xxxx (X)
XOR: xxxx (X)
Check controller address.
The SA address displays the UNIBUS address of the
controller register that was being accessed at the time of the
error. This is most likely caused by the operator specifying
an incorrect controller address. Unit is dropped from testing.
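
The "XOR:" line is not defined further in this help text. On the
assumption that it is the bitwise exclusive-OR of the "contents"
and "expected" values (so that each 1 bit marks a bit that
differs), it can be reproduced as in the sketch below. The
function names are made up for the example.

```python
def sa_xor(contents: int, expected: int) -> int:
    """Exclusive-OR of two 16-bit register values (assumed meaning of XOR:)."""
    return (contents ^ expected) & 0xFFFF


def differing_bits(contents: int, expected: int) -> list:
    """List the bit numbers (0-15) where contents and expected differ."""
    difference = sa_xor(contents, expected)
    return [bit for bit in range(16) if difference & (1 << bit)]
```

Under that assumption, contents of 8000 hex against an expected
value of 0800 hex would differ in bits 11 and 15.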
- CONTROLLER_DIAG
Controller resident diagnostics detected failure:
This error indicates that the controller resident
diagnostics reported an error during initialization. This type
of error will be reported in the following manner:
CONTROLLER RESIDENT DIAGNOSTICS DETECTED FAILURE;
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
UDASA expected: xxxx (X)
XOR: xxxx (X)
The SA register contains the detected error code. The
specified Unit will be dropped from testing. Decoding
information for the SA register contents (error code) can be
obtained by typing: "HELP EVRLK ERRORS SA_CODES"
- STEP_BIT
Step bit error in SA register during initialization:
This error indicates that during the 4 step (phase)
initialization of the controller an error was detected. This
type of error will be reported in the following manner:
STEP BIT ERROR IN SA REGISTER DURING INITIALIZATION;
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
UDASA expected: xxxx (X)
XOR: xxxx (X)
The SA address is the UNIBUS address of the controller
register that was being accessed, during the 4 step
initialization of the controller, when the error was detected.
The SA contents is what the controller register contained when
the error was detected and may be an error code. SA expected
is what this program expected, for correct results during that
step in the initialization. Decoding information for the SA
register contents (error code) can be obtained by typing:
"HELP EVRLK ERRORS SA_CODES"
- STEP_3
SA register did not zero after step 3 write:
This error occurs when the SA register does not zero
during Step 3 of controller initialization. This type of error
will be reported in the following manner:
SA register did not zero after step 3 write:
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
UDASA expected: xxxx (X)
XOR: xxxx (X)
The SA address is the UNIBUS address of the controller
register that was being accessed, during the 4 step
initialization of the controller, when the error was detected.
The SA contents is what the controller register contained when
the error was detected and may be an error code. SA expected
is what this program expected, for correct results during that
step in the initialization. Decoding information for the SA
register contents (error code) can be obtained by typing:
"HELP EVRLK ERRORS SA_CODES"
- SA_INIT
SA register error during initialization:
This is the most common of the Controller Initialization
Errors that can be reported. The SA register Error bit (Bit 15)
should be set. This type of error will be reported in the
following manner:
SA REGISTER ERROR DURING INITIALIZATION;
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
UDASA expected: xxxx (X)
XOR: xxxx (X)
The SA address is the UNIBUS address of the controller
register that was being accessed, during the 4 step
initialization of the controller, when the error was detected.
The SA contents is what the controller register contained when
the error was detected and may be an error code. SA expected
is what this program expected, for correct results during that
step in the initialization. Decoding information for the SA
register contents (error code) can be obtained by typing:
"HELP EVRLK ERRORS SA_CODES"
- CLEAR
Controller did not clear ring structure in host memory:
An error occurred during initialization of the Host
communication area. At a certain point during host/controller
initialization, the controller is responsible for clearing the
host communication area (Rings-In Host Memory) and this error
results from a failure of the controller to perform this
function (As DETECTED BY THE HOST).
CONTROLLER DID NOT CLEAR RING STRUCTURE IN HOST MEMORY;
Host address: xxxxxx (X)
The Host address is a host memory location in the host
communication area that the controller was to have cleared.
The fault could be in the controller, or the controller may
have had problems accessing the host memory locations of the
communications area.
- CONTROLLER
Host/controller communication errors:
The following errors can be reported during normal program
execution. These are basically "HARD" serious errors and
usually indicate a very serious problem. Errors are reported
via the "Status/Address" (SA) register.
- NO_INTERRUPT
No interrupt received for 30 seconds:
This error occurs when an interrupt timeout occurs during
a wait for an MSCP end message. The response ring buffer is only
one slot and an interrupt is expected for every END PACKET that
the controller sends to the host.
NO INTERRUPT RECEIVED FOR 30 SECONDS;
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
The Host got tired of waiting for an expected interrupt
from the controller. This does not necessarily mean
that the controller was at fault. The SA address is the UNIBUS
address of the controller Status/Address register that was
being used at the time the problem was detected by the host.
If the controller detected a problem that accounted for no
interrupt being delivered to the host, the Status/Address (SA)
register contents should show an error code and the SA register
error bit should be set (Bit 15). The specified Unit will be
dropped from testing.
- SA_REGISTER
Fatal error reported in SA register:
This error can occur when the Error Bit is set in the SA
register, during normal on-line use of the controller. This is
a common error reporting mechanism for the controller and this
type of error may occur more often than most others. The error
bit (Bit 15) setting causes the controller to discontinue
fetching command packets from the host. The host will read the
SA register, detect that an error condition exists, and
display the contents of the SA register in this message.
FATAL ERROR REPORTED IN SA REGISTER;
UDASA address: oooooo (O)
UDASA contents: xxxx (X)
The SA register contains the detected error code. The
specified Unit will be dropped from testing. Decoding
information for the SA register contents (error code) can be
obtained by typing: "HELP EVRLK ERRORS SA_CODES"
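
Checking a displayed "UDASA contents" value for the error bit can
be sketched as follows. Only the position of the error bit
(Bit 15) comes from this help text; the function name is
hypothetical, and the error code itself should still be looked
up with "HELP EVRLK ERRORS SA_CODES".

```python
SA_ERROR_BIT = 1 << 15  # Bit 15 of the SA register, per the text above


def sa_reports_error(sa_contents: int) -> bool:
    """Return True when the SA register value has the error bit set."""
    return bool(sa_contents & SA_ERROR_BIT)
```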
- ERRORLOG
Errorlog packets:
There are several sources of error information in an RA
subsystem. One of the most common is the error log packets that
the controller assembles and sends to the host. There are 5
types of error log packets. Three of these types can be
reported by this program.
1. SDI Error Log Packet (Drive detected errors).
2. Controller Error Log Packet.
3. Host Memory Access Error Log Packet.
- CONTROLLER
Controller error log packet:
This error occurs when an Error log packet is received
specifying that a Controller detected an error within itself.
The Unit being tested will be dropped. This type of error will
be reported in the following manner:
Controller Error Reported:
Status/event: xxx
The Status/Event code will give you an indication of what
the trouble may be. A list of the Status/Event codes and their
meanings can be obtained by typing:
"HELP EVRLK ERRORS STATUS_EVENT"
- MEMORY
Host memory access error log packet:
This error occurs when an Error log packet is received
specifying that a Host Memory Access Error was detected. This
type of error packet reports problems that the controller
experienced "dealing" with host memory (Like while doing data
transfers etc). This type of error will be reported in the
following manner:
Host Memory Access Error reported;
Status/event: xxx
Host address: xxxxxx
The host bus (such as UNIBUS) address that was being used at
the time of the error is displayed. The Status/Event
code will give you an indication of what the problem was. A
list of the Status/Event codes and their meanings can be
obtained by typing: "HELP EVRLK ERRORS STATUS_EVENT"
- SDI
SDI error log packet:
This error occurs when an Error log packet is received
specifying that an SDI Error or Drive Detected Error was
detected. If a Status/Event code of EB (S/E Codes are in HEX)
is reported, a valid Drive error code may be reported. The
Drive Error Code is displayed along with the Drive dependent
information in the packet, IN HEXADECIMAL. This type of error
will be reported in the following manner.
SDI Error reported; LBN: dddddd.
Status/event: xxx
Drive error code: xxxxxx
Drive dependent information (X)
Byte: 15 14 13 12 11 10 09 08 07 06 05 04
Contents: xx xx xx xx xx xx xx xx xx xx xx xx
The LBN is the Logical Block Number that was being
accessed at the time the error was detected. Many LBN
addresses do not relate directly to the occurrence of an error,
so the LBN address may be zero when you do not expect it to
be. The LBN is always reported by this program in decimal.
The "Drive Dependent Information", and Drive Error Code, are
typed in HEX because that's the way this information is
provided in most of the drive service manuals. You can look-up
the drive error code in the drive service manual for the
meaning of the reported problem. Most of the drive service
manuals (Like RA80 and RA81) provide some information on the
meanings of the bytes displayed for the Drive Dependent
information. This information is sometimes known as the "drive
specific status bytes" of the "extended status".
- INITIALIZATION
Initialization errors:
The following sections list the possible errors reported
during initialization of program tables, etc. These errors
could indicate a possible "system" problem or programming
error. Only one type of error exists in this category.
This error occurs when the memory table allocation fails.
Possibly insufficient memory exists on the system to support
the testing of the specified number of drives. The error is
reported in the following manner:
FAILED DYNAMIC MEMORY ALLOCATION FOR SELECTED UNITS;
RESTART PROGRAM AND SELECT FEWER UNITS.
- RCT
RCT read/write errors:
The following sections list the possible errors reported
when a READ or WRITE to the Replacement Control Table (RCT)
occurs. The RCT does not require special MSCP commands, just
normal READ and WRITE commands with the LBN address such that
the commands are going to reference blocks in the RCT. An
error is reported only once for a particular RCT block.
Depending upon the severity of the error, program execution may
or may not continue. There must be at least one good copy of
an RCT block for execution to continue. There are four copies
of the RCT and at least one good copy must be found for
replacements to occur. If a good copy cannot be obtained, then
the media must be reformatted. Reformatting for
this kind of error is a good use of the formatter. However, be
sure that a good copy of the Format Control Table (FCT) exists
(See RA81-TT-19).
Replacements of bad blocks within the RCT are not possible
(That's why there are four copies). You may see RCT errors
frequently; however, they are not always fatal.
Example of "soft" errors in RCT blocks:
SCANNING RCT...
RCT copy 1, block 352 ( 891424.) Status/event: 000110
RCT copy 2, block 352 ( 892189.) Status/event: 000710
RCT copy 1, block 552 ( 891624.) Status/event: 000110
The meaning of the Status/Event codes (Error codes) can be
obtained by typing "HELP EVRLK ERRORS STATUS_EVENT". The
condition becomes fatal only when all 4 copies of the same
block number have an error. That would mean that a good copy
of the RCT could not be assembled.
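The check described above can be sketched in a few lines of
Python. This is only an illustration, not part of EVRLK: the
function name fatal_blocks is hypothetical, and the line layout
is assumed to match the "SCANNING RCT..." example exactly.

```python
import re
from collections import defaultdict

# Line layout copied from the "SCANNING RCT..." example output.
LINE = re.compile(
    r"RCT copy\s+(\d+), block\s+(\d+)\s+\(\s*(\d+)\.\)\s+Status/event:\s+(\d+)"
)

RCT_COPIES = 4  # the text states that there are four copies of the RCT


def fatal_blocks(report, copies=RCT_COPIES):
    """Return the RCT block numbers that reported an error in every copy."""
    bad = defaultdict(set)          # block number -> set of copies with an error
    for line in report.splitlines():
        m = LINE.search(line)
        if m:
            copy, block = int(m.group(1)), int(m.group(2))
            bad[block].add(copy)
    return sorted(b for b, seen in bad.items() if len(seen) >= copies)


sample = """\
RCT copy 1, block 352 ( 891424.) Status/event: 000110
RCT copy 2, block 352 ( 892189.) Status/event: 000710
RCT copy 1, block 552 ( 891624.) Status/event: 000110
"""
print(fatal_blocks(sample))  # prints []
```

An empty list means every listed block still has at least one
copy that reported no error, so a good RCT can be assembled.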
- READ
Failed RCT READ:
Simply stated, a READ command was issued to a block in one
of the copies of the RCT and an error was reported back for
that command. This is not necessarily fatal, as there are four
copies of the RCT and only one copy needs to hold a good
version of the block that is needed. The error is
reported in the following manner:
Failed RCT READ;
RCT copy: ddd
Block no: ddd
Endmsg status: xxx
The RCT copy defines which of the four copies was being
referenced when the error was detected. Block no is the Block
number (LBN) that was being referenced within the copy, when
the error was detected. The Endmsg status should provide an
indication of what the error condition was and can be
interpreted from the list of Status/Event codes by typing "HELP
EVRLK ERRORS STATUS_EVENT". Program execution should continue.
- WRITE
Failed RCT WRITE:
Simply stated, a WRITE command was issued to a block in
one of the copies of the RCT and an error was reported back for
that command. This is not necessarily fatal, as there are four
copies of the RCT and only one copy needs to hold a good
version of the block that is needed. The error is
reported in the following manner:
Failed RCT WRITE;
RCT copy: ddd
Block no: ddd
Endmsg status: xxx
The RCT copy defines which of the four copies was being
referenced when the error was detected. Block no is the Block
number (LBN) that was being referenced within the copy, when
the error was detected. The Endmsg status should provide an
indication of what the error condition was and can be
interpreted from the list of Status/Event codes by typing "HELP
EVRLK ERRORS STATUS_EVENT". Program execution should continue.
- READ_ALL
Failed READ of all copies of RCT:
This is serious. This error occurs when the program
detects a failure during a READ of all copies of a particular
RCT block. Replacements cannot be done unless enough good RCT
blocks exist to make up a good RCT copy. This program will not
do replacements, under this condition, and neither can any
operating system that has the capability of doing replacements
(Dynamic BBR). The error is reported in the following manner:
Failed READ of all copies of RCT;
Block no: ddd
Endmsg status: xxx
The Block no is the block (LBN) that was being referenced.
This same block exists in each of the RCT copies.
References to this block in each of the copies resulted in an
error. The Endmsg status should provide an indication of what
the error condition was and can be interpreted from the list of
Status/Event codes by typing "HELP EVRLK ERRORS STATUS_EVENT".
The Unit (drive) is dropped from testing.
The appropriate recovery for this kind of error should be
to first make sure that the drive does not have a hardware
problem causing data errors that are not related to the media.
If you're certain that the basic drive is functioning properly,
the recovery would be to try reformatting the media. The
Format Control Table (FCT) must be able to be used during the
format (Read RA81-TT-19). Once this has been accomplished, you
can re-run this program, to assure yourself that the drive is
now operating properly. If the problem persists, after having
reformatted, the HDA/Pack may need replacement.
- WRITE_ALL
Failed WRITE of all copies of RCT:
This is serious. This error occurs when the program
detects a failure during a WRITE of all copies of a particular
RCT block. Replacements cannot be done unless enough good RCT
blocks exist to make up a good RCT copy. This program will not
do replacements, under this condition, and neither can any
operating system that has the capability of doing replacements
(Dynamic BBR). The error is reported in the following manner:
Failed WRITE of all copies of RCT;
Block no: ddd
Endmsg status: xxx
The Block no is the block (LBN) that was being referenced.
This same block exists in each of the RCT copies.
References to this block in each of the copies resulted in an
error. The Endmsg status should provide an indication of what
the error condition was and can be interpreted from the list of
Status/Event codes by typing: "HELP EVRLK ERRORS STATUS_EVENT".
The Unit (drive) is dropped from further testing.
The appropriate recovery for this kind of error should be
to first make sure that the drive does not have a hardware
problem causing data errors that are not related to the media.
If you're certain that the basic drive is functioning properly,
the recovery would be to try reformatting the media. The
Format Control Table (FCT) must be able to be used during the
format (Read RA81-TT-19). Once this has been accomplished, you
can re-run this program, to assure yourself that the drive is
now operating properly. If the problem persists, after having
reformatted, the HDA/Pack may need replacement.
- NO_NULL
No Null descriptor entry found in RCT:
This is serious. In the case of the RA81, the Replacement
Control Table (RCT) contains descriptors for 17,000 plus
replacement blocks. If they are all used (No Null Descriptor)
then something BIG is wrong. The descriptors could in fact all
be used or possibly the RCT has been written with Garbage,
which makes it appear that all replacement blocks have been
used. Nevertheless, if this error occurs, you should read
RA81-TT-12. The recovery will be to reformat the media. If
the drive is operating normally, this should fix the RCT, as
long as the Format Control Table (FCT) is good (See
RA81-TT-19). You can re-run this program after having
reformatted, just to be sure the problem has been corrected.
The drive will be dropped from further testing. The error is
reported in the following manner:
No Null descriptor entry found in RCT;
RCT is probably corrupt.
Reformat media.
- CRASH
Crash Occurred during previous replacement Phase (1 or 2)
The following is not necessarily an ERROR. It may have
been the result of an error but when detected is not an error
condition. The condition described is the detection of an
incomplete replacement (i.e. a replacement that for some
reason was not finished). This condition is detected by this
program when scanning the RCT. Action is taken to complete the
replacement and the following output gives the results of this
operation.
Crash occurred during previous replacement Phase 1.
Attempting Recovery...
LBN: 679864. Status: RBN: 13331. Operation: SEC
This example indicates that a crash occurred, sometime in
the past, during a phase 1 replacement. Replacement at phase 1
will assign a replacement RBN, and attempt recovery. If this
program shows a crash recovery, it would be wise to allow at
least one pass of this program, in Automatic Mode, to take
place. No "Status" will be listed in this situation, as the
recovery process does not know the original reason (status)
replacement was requested.
- SA_CODES
Controller error codes (reported by SA register):
The SA (Status Address) register can display an error code
during the initialization of the controller as well as when it
is "on-line" to the host. However, the format of the SA
register is different for reporting initialization error codes,
as opposed to when it is on-line.
SA REGISTER (CONTROLLER) ERROR CODES DURING INITIALIZATION
----------------------------------------------------------
15 11 10 0
+-+-+-+-+-+-------------------+ SA REGISTER
|E|S|S|S|S| INITIALIZATION | Formatted as would be
|R|4|3|2|1| ERROR CODE | represented when an error
+-+-+-+-+-+-------------------+ occurs during initialization.
^|-------|
| \____ STEP BITS. Initialization is a four step
| process. These bits describe which initialization
| step the error occurred in.
|
|__________ ERROR BIT. Qualifies the other bits of the
register as containing valid information
relating to an error that occurred during
initialization.
UDA50 initialization
SA register error codes
---------------------------
Possible
Octal Hex Error Description FRU
------ ---- ---------------------------------- -------
104000 8800 Fatal Sequence Error 7485
104040 8820 D processor ALU 7485
104041 8821 D processor control rom parity error 7485
105102 8A42 D proc with no board #2 or RAM Parity
error 7486
105105 8A45 D proc RAM Buffer Error 7486
105152 8A6A D proc SDI Error 7486
105153 8A6B D proc Write Mode Wrap SERDES Error 7486
105154 8A6C D proc Read SERDES,RSGEN & ECC Error 7486
106040 8C20 U proc ALU Error 7485
106041 8C21 U proc Control Register Error 7485
106042 8C22 U proc DFAIL/Control Rom Parity/BD #1 7485
106047 8C27 U proc Constant PROM error with D proc
running SDI test 7485
106055 8C2D Unexpected trap found,Abort Diagnostic 7485
106071 8C39 U proc constant PROM Error 7485
106072 8C3A U proc Control ROM Parity Error 7485
106200 8C80 Step 1 Data Error (MSB not set) 7485
107103 8E43 U proc RAM Parity Error 7486
107107 8E47 U proc RAM Buffer Error 7486
107115 8E4D Test Count was wrong (BD#2) 7486
112300 94C0 Step 2 Error 7485
122240 A4A0 NPR Error 7485
122300 A4C0 Step 3 Error 7485
142300 C4C0 Step 4 Error 7485
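As a worked illustration of the initialization layout, the
following Python sketch splits a 16-bit SA value into its error
bit, step bits and error code. It is not part of EVRLK; the
function name decode_sa_init is hypothetical, and the bit
positions are taken from the diagram above (bit 15 is the error
bit, bits 14-11 are S4 through S1, bits 10-0 hold the code).

```python
def decode_sa_init(sa):
    """Split a 16-bit SA value using the initialization-time layout."""
    error = bool(sa & 0x8000)            # bit 15: ERROR BIT
    steps = [n for n in (1, 2, 3, 4)     # bits 11-14: step bits S1..S4
             if sa & (1 << (10 + n))]
    code = sa & 0x07FF                   # bits 0-10: initialization error code
    return error, steps, code


error, steps, code = decode_sa_init(0o104041)
print(error, steps, oct(code))           # True [1] 0o41
print(format(0o104041, "04X"))           # 8821, the Hex column for this row
```

Applied to the "Step 2 Error" row (octal 112300), the same
function reports step bit S2, which agrees with the description
in the table.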
Controller error codes while on-line to host (SA register)
----------------------------------------------------------
This section provides a list of the various controller
error codes that can reside in the SA register, when the
controller detects a hard error during on-line use of the
controller. Do not get these confused with the SA register
initialization error codes ABOVE. They are different.
15 10 0
+-+-+-+-+-+----------------+ SA REGISTER
|E| | ONLINE | Formatted as would be represented
|R| | Error Code | when the UDA is "ONLINE" to the
+-+-+-+-+-+----------------+ Host.
^
|___ ERROR BIT (Error Code will be displayed in bits 0 - 10)
If bit 15 is NOT set, the contents of bits 0 - 10 are
undefined.
UDA/KDA ONLINE
SA REGISTER ERROR CODES
-------------------------
POSSIBLE
Octal Hex Error Description FRU
------ --- --------------------------- -----
100001 8001 UNIBUS Packet Read Error 7485*
100002 8002 UNIBUS Packet Write Error 7485*
100003 8003 UDA ROM & RAM Parity Error 7485/7486
100004 8004 UDA RAM Parity Error 7486
100005 8005 UDA ROM Parity error 7485
100006 8006 UNIBUS Ring Read Error 7485*
100007 8007 UNIBUS Ring Write Error 7485*
100010 8008 UNIBUS Interrupt Master Failure 7485
100011 8009 Host Access Timeout Error 7485*
100012 800A Host Exceeded Command Limit 7485*
100013 800B UNIBUS Bus Master Failure 7486
100014 800C DM XFC Fatal Error 7486
100015 800D Hardware Timeout of Instruction Loop 7485*
100016 800E Invalid Virtual Ckt Identifier 7485*
100017 800F Interrupt Write Error on UNIBUS 7485*
* - denotes possible host CPU error.
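The on-line layout can be decoded in the same spirit. The
Python sketch below is only an illustration: decode_sa_online
and ONLINE_CODES are hypothetical names, and the dictionary
holds just a few rows copied from the table above. The key
point it shows is that bits 0 - 10 are looked at only when bit
15 is set.

```python
# A few rows copied from the on-line table (keys are bits 0-10 of the code).
ONLINE_CODES = {
    0o01: "UNIBUS Packet Read Error",
    0o04: "UDA RAM Parity Error",
    0o11: "Host Access Timeout Error",
}


def decode_sa_online(sa):
    """Return (code, description), or None when bit 15 is clear."""
    if not sa & 0x8000:
        return None                      # bits 0-10 are undefined
    code = sa & 0x07FF
    return code, ONLINE_CODES.get(code, "not in this excerpt")


print(decode_sa_online(0o100011))        # (9, 'Host Access Timeout Error')
print(decode_sa_online(0o000011))        # None
```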
- STATUS_EVENT
Status/event codes:
The following list shows the various Status/Event codes
used in many of the messages displayed by this program.
Certain deviations may exist for specific controllers. This is
the "generic" list. Remember, this program reports
Status/Event codes in HEX.
DESCRIPTION
Octal Hex -----------------------------------------------
0 0 Successful completion
1 1 Invalid command (high byte = byte offset of bad
command field)
2 2 Command aborted
3 3 Drive off line (unknown unit or Online to
another controller)
4 4 Drive available
5 5 Media Format Error
6 6 Unit Write Protected
7 7 Compare error (on compare command or compare
modifier)
10 8 Data Error (Or may have been written with
"Forced Error Flag")
11 9 Host Buffer Access Error
12 A Controller Error (Command Time Out/Retry
Exceeded)
13 B Drive Error
40 20 Spin down ignored (multi-unit drives only)
43 23 No volume mounted or drive run/stop switch out
45 25 Format Control Table unreadable - EDC Error
51 29 Odd transfer start address in MSCP packet
52 2A SERDES overrun error (controller probably
broken)
* 53 2B SDI level two response timeout (Or maybe seek
incomplete) (53/2B also shows up with a drive
error as a result NOT CAUSE)
100 40 Still Connected (multi-unit drives only)
103 43 Drive inoperative (UDA cannot communicate with
drive)
105 45 Format Control Table unreadable - Invalid
Sector Header
110 48 Header compare error (header not found and no
revector)
111 49 Odd byte count in MSCP packet
112 4A EDC error (SERDES broken or EDC written bad --
Controller)
*113 4B Invalid SDI level two response (unsuccessful or
trash)
145 65 Format Control Table unreadable - Data Sync
Timeout
150 68 Data Sync timeout (Data Sync field in sector)
151 69 UNIBUS nonexistent memory error
152 6A Inconsistent controller state
153 6B Positioner Error (headers consistent but not on
cyl)
200 80 Duplicate Unit Numbers
203 83 Drive offline and duplicate unit numbers
211 89 UNIBUS parity error (UNIBUS read)(Host Memory
Parity?)
*213 8B Lost read/write ready during or between
transfers (213 also shows up with drive
errors as a result NOT CAUSE)
245 A5 Drive not 512 Byte format (16 bit format)
253 AB Lost Drive clock during data/SDI transfer
305 C5 Drive not formatted or Format Control Table
Corrupted
*313 CB Lost drive receiver ready during transfer (see
213 comment)
345 E5 ECC error on FCT read (media format - FCT
unreadable)
350 E8 Uncorrectable ECC error (This uncorrectable
data may be re-written to an RBN with the
Forced Error Flag)
353 EB Drive detected error (DRIVE HAD ERROR! -- Find
drive error code)
400 100 Already Online (in response to ONLINE)
403 103 Drive offline. By field service or internal
Diagnostics
410 108 One symbol ECC error
413 10B RTDS pulse/parity error (IS UDA M7486 UP TO
REV ?)
450 128 Two symbol ECC error
510 148 Three symbol ECC error
550 168 Four symbol ECC error
610 188 Five symbol ECC error
650 1A8 Six symbol ECC error
710 1C8 Seven symbol ECC error
750 1E8 Eight symbol ECC error
10006 1006 Drive software write protected
20006 2006 Drive hardware write protected
KEY: EDC = Error Detection Code (Written in each sector)
SDI = Standard Disk Interface (Drive Bus)
SERDES = SERializer/DESerializer (In Controller)
FCT = Format Control Table (Written on media surface)
ECC = Error Correction Code (Written in each sector)
NOTE
Codes 53, 113, 213 and 313 (octal) occur most often as
the RESULT of a "Drive Error" (S/E Code EB
HEX, 353 octal). You should look for a problem in the
drive before believing that these codes really
represent the description of your problem.
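Because the table gives each code in both octal and hex, a
short conversion sketch may help when a message shows one radix
and a manual shows the other. The Python below is illustrative
only. The split into a "major" code (low five bits) and a
"subcode" (remaining high bits) is an observation drawn from
the table itself, where for example every ECC entry ends in
octal 10; this help text does not state that rule, so treat it
as an assumption. The name split_status is hypothetical.

```python
def split_status(value):
    """Return (major, subcode): low five bits, and the bits above them."""
    return value & 0o37, value >> 5


for octal_text in ("350", "410", "750"):
    value = int(octal_text, 8)           # read the text as an octal number
    major, sub = split_status(value)
    print(f"octal {value:o} = hex {value:X}: major {major:o} (octal), subcode {sub}")
```

For the three ECC rows used here the major code comes out as
octal 10 (Data Error) each time, and only the subcode changes.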
- TRANSIENT
Transient error table overflow:
The Transient Error Table, that is kept in memory as long
as this program is running, has the ability to keep track of
512 Transient errors for each drive being tested. This table
should be of sufficient size for all cases where bad blocks
that are only marginally bad need to be tracked. If this table
overflows, a message is typed and the testing terminates. IF
THIS HAPPENS, YOU HAVE MORE TROUBLE THAN WHAT THIS PROGRAM CAN
HELP WITH. This message is reported in the following manner:
Transient error table overflow;
ADDITIONAL MAINTENANCE REQUIRED.
If you log more than 512 transients, then the table will
in fact overflow and it may be more than bad blocks causing the
errors you see. Most likely you have a data path problem
giving false indications of ECC errors or something like a
worn-out spindle ground brush.
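The overflow rule can be pictured with a small fixed-capacity
table. This Python sketch is hypothetical and says nothing
about how EVRLK actually stores its table; the class name
TransientTable and its methods are invented for illustration.
Only the limit of 512 entries per drive comes from the text.

```python
class TransientTable:
    """Per-drive record of LBNs whose bad block report could not be verified."""

    def __init__(self, capacity=512):    # 512 is the limit quoted in the text
        self.capacity = capacity
        self.lbns = []

    def log(self, lbn):
        """Record one transient; raise once the table is already full."""
        if len(self.lbns) >= self.capacity:
            raise OverflowError("Transient error table overflow")
        self.lbns.append(lbn)


table = TransientTable()
for lbn in range(512):
    table.log(lbn)                       # the first 512 entries fit
try:
    table.log(512)                       # entry number 513 does not
except OverflowError as exc:
    print(exc)                           # Transient error table overflow
```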
The recovery from this condition would be to first fix
whatever is causing the high transient error rate, then reformat
the media. Most likely there are many replacements that were
made on blocks that were in fact not bad. Reformatting will
put the media back to a known state (Please read RA81-TT-12).
You may want to run several passes of this program, after
having reformatted, just to be sure that the formatter found
all the bad spots.
For additional information type:
"HELP EVRLK BACKGROUND OPERATIONAL_CONCEPT"
- UNIT
Unit initialization errors:
The following errors are reported when an error occurs
during the initialization of a unit (Drive). These are
basically "HARD" serious errors and usually indicate a problem
either in a drive or communicating with a drive. Errors are
reported via the "End Message" Status field (Endmsg status:). If
the problem is severe, the specified unit (drive) will not
continue being tested.
NOTE
The meanings of the Endmsg status can be found
by looking it up in the list of Status/Event
codes provided by typing: "HELP EVRLK ERRORS
STATUS_EVENT".
- GET_UNIT_STATUS
Failed GET UNIT STATUS:
"Get Unit Status" is an MSCP command that is used by the
host to get certain information about a drive connected to an
RA drive controller. This error occurs when an attempt to "GET
UNIT STATUS" fails during Unit initialization. This type of
error will be reported in the following manner:
FAILED GET UNIT STATUS;
Drive ddd is not accessible.
Endmsg status: xxx
Drive ddd is the decimal drive logical unit address that
the "Get Unit Status" was issued for. Possibly the drive port
switches are not pushed in or the incorrect drive address was
given to the program. Errors are reported via the "End
Message" Status field (Endmsg status:). The meanings of the
Endmsg status can be found by looking it up in the list of
Status/Event codes provided by typing: "HELP EVRLK ERRORS
STATUS_EVENT".
- ONLINE
Failed ONLINE:
"ONLINE" is an MSCP command that is used by the host to
bring a unit (drive) Online to the controller. It also makes
certain drive-specific information available to the host and
makes the drive usable. This error occurs when command
completion status indicates a problem executing this command
successfully. This type of error will be reported in the
following manner:
Failed ONLINE;
Endmsg status: xxx
Errors are reported via the "End Message" Status field
(Endmsg status:). The meanings of the Endmsg status can be found
by looking it up in the list of Status/Event codes provided by
typing: "HELP EVRLK ERRORS STATUS_EVENT".
- HELP
VDS Bad Block Replacement Utility:
This program is a highly specialized utility for use in
those cases where an RA drive has been diagnosed to have bad
blocks on the media. These bad blocks may be causing ECC
errors that can be isolated to a specific Logical Block
Number(s) (LBN). This program will replace those blocks that
it either defines as bad (AUTOMATIC Replacement Mode) or will
take an LBN address that the user supplies and replace only
that block (MANUAL Mode). There is also a VERIFY mode that
allows the user to quickly determine if any bad blocks exist.
In order to properly execute this program in AUTOMATIC or MANUAL
mode, the following steps must be followed without exception.
Unpredictable results could occur if they are not followed.
1. Customer must backup the data from the drive(s) to be used
and verify its correctness.
2. Execute this program on the drive(s). When performing
AUTOMATIC replacement, multiple passes on each drive are
recommended (DS>Start/pass:nnn).
3. Customer must restore the backed up data to those drives
from which it was backed up and again verify its correctness.
- INIT_INFO
Initialization information:
Each Unit selected for test is brought Online, and the
following information is displayed for that unit (Drive).
Remember a unit is not necessarily the logical unit address of
a drive.
INITIALIZATION INFORMATION FOR _DUan
Controller: UDA50
Address: 172150
Drive no: d
Volume SN: dddd.
Volume size: dddddd.
o The CONTROLLER will be a UDA50.
o DRIVE NO: This is the logical address of the drive being
initialized.
o VOLUME SN: This is the Serial Number of the HDA or Pack.
This information comes from the Format Control Table (FCT),
which is read during the execution of the MSCP "ONLINE"
command.
o VOLUME SIZE: This is the number of user LBN's on the
media. For example, the RA81 formatted for a VAX (16 bit)
has 891072 (decimal) user LBN's. Actually, starting with
LBN zero, the last user LBN would be 891071.
- LBN_STATUS_CODES
LBN status/ Replacement operations:
This program was designed to not only replace bad blocks
but also to provide understandable information about the
"integrity" of the media and the replacements that are
accomplished. As such, many different types of messages are
provided. Errors are listed separately under "HELP EVRLK
ERRORS". Anything that is not an error will be addressed in
this section.
The most common types of messages relate to the
replacement of a block or what the program finds when a "read"
of the block occurs. The following decode chart is displayed
on the user terminal, when this program is first run to give
the meanings of what is found or what action is taken on a
block.
LBN Status codes:
BBR - Bad Block reported.
TRA - Transient Error table entry.
ECH - Hard ECC Error encountered.
HCE - Header Compare Error encountered.
DST - Data Sync Timeout encountered.
FER - Forced Error encountered.
WFE - Block written with FE.
RBN Operation types:
PRIMARY - Primary replacement for LBN.
SECONDARY - Secondary replacement for LBN.
UNUSABLE - RBN marked unusable.
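When many of these report lines have been captured from the
console, a small parser can help tabulate them. The Python
sketch below is illustrative only: parse_report and the field
names are hypothetical, the line layout is assumed to follow
the "LBN: ... Status: ... RBN: ... Operation: ..." examples in
this help text, and the STATUS dictionary simply copies the
decode chart above.

```python
import re

STATUS = {  # copied from the "LBN Status codes" chart
    "BBR": "Bad Block reported",
    "TRA": "Transient Error table entry",
    "ECH": "Hard ECC Error encountered",
    "HCE": "Header Compare Error encountered",
    "DST": "Data Sync Timeout encountered",
    "FER": "Forced Error encountered",
    "WFE": "Block written with FE",
}

# Status may be blank, and the RBN/Operation pair may be absent.
REPORT = re.compile(
    r"LBN:\s*(\d+)\.\s+Status:(?:\s+((?!RBN\b)\w+))?"
    r"(?:\s+RBN:\s*(\d+)\.\s+Operation:\s*(\w+))?"
)


def parse_report(line):
    """Return a dict describing one report line, or None if it does not match."""
    m = REPORT.search(line)
    if not m:
        return None
    lbn, status, rbn, operation = m.groups()
    return {
        "lbn": int(lbn),
        "status": status,
        "meaning": STATUS.get(status),
        "rbn": int(rbn) if rbn else None,
        "operation": operation,
    }


print(parse_report("LBN: 23271. Status: BBR RBN: 456. Operation: PRIMARY"))
print(parse_report("LBN: 23271. Status: TRA"))
```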
- PRIMARY
Primary replacement:
The LBN was tested, and the Primary replacement RBN was
used to store the LBN's data.
LBN: 23271. Status: BBR RBN: 456. Operation: PRIMARY
- SECONDARY
Secondary replacement:
The LBN was tested, and the BBR flag was set repeatedly.
The LBN's Primary replacement RBN is currently used by another
LBN. A Secondary replacement RBN was used.
LBN: 23271. Status: BBR RBN: 455. Operation: SECONDARY
- UNUSABLE
Replacement block marked unusable:
Upon reading data from an RBN, an error resulted. This
indicated that the RBN is bad and must be marked as unusable.
Also, the customer data residing in the RBN must be moved to a
new RBN. A Secondary replacement always results from this
condition.
LBN: 23271. Status: BBR RBN: 456. Operation: UNUSABLE
LBN: 23271. Status: BBR RBN: 455. Operation: SECONDARY
- ECH/WFE
Uncorrectable ECC error/Write with Forced Error:
An LBN is read repeatedly with an uncorrectable ECC error
(ECH). The indicated replacement operation is performed. If
enabled, the LBN data is written with the Forced Error (FE)
flag set, and the following information appears.
LBN: 23271. Status: WFE RBN: 456. Operation: SECONDARY
If the question about writing uncorrectable ECC errors
WITHOUT the Forced Error Flag is answered with a NO, the
corrupted data is written to the RBN as is, and the following
information appears:
LBN: 23271. Status: ECH RBN: 456. Operation: SECONDARY
- FER
Forced Error encountered:
When an uncorrectable ECC error is encountered, several
attempts are made to read the data correctly (Using every means
available). If these attempts fail, the block generating the
Uncorrectable ECC error is assumed to be bad and in need of
replacement. If you have a system that does "Dynamic Bad Block
Replacement" (Like VMS, RSTS, RSX, IAS), or you're running
this program, the replacement process will take the
uncorrectable (Corrupt) data and move it to a good replacement
block (RBN). Now, we have a condition where we have corrupt
data in a good block and if left this way the corrupt data
would be read with no "indication" that the data is "corrupt".
Therefore, in order to tell a user that the data was at one
time uncorrectable and exists now as "corrupted" data (In
a good RBN), the "Forced Error Flag" is attached to the block
of "corrupted" data. When this block is read by a user, this
flag (Inverted EDC character) is also read and is intended to
inform the user that the requested data is "not reliable". THE
FORCED ERROR FLAG IS NOT (REPEAT NOT) AN ERROR! WHEN READ,
IT WILL NOT MAKE AN ENTRY IN THE SYSTEM ERROR LOG. IT IS
REPORTED AS A "STATUS" CODE 10 (OCTAL) IN A TRANSFER REQUEST
"END PACKET" (This program calls the end packet status field
the "Endmsg status", in several message types that can be
displayed).
LBN: 23271. Status: FER
- TRA
Transient error table entry:
During the "scan" for bad blocks, the BBR flag was set for
a particular LBN. Once this condition was noted, the program
tries to verify the report of a bad block. If the bad block
report cannot be verified, the operator is notified of the LBN
and NO replacement is attempted. If you do not feel good about
an ECC error being reported and then the program not being able
to verify the error, you can use Manual Mode to replace the
reported LBN. Notice in the example below that no RBN or
Operation is shown. This indicates that a transient ECC error
occurred on the indicated LBN and no replacement was made.
LBN: 23271. Status: TRA
NOTE
Many times the occurrence of an ECC error that
is not repeatable can indicate a Data Path
problem or other conditions that are not
related to bad spots. If this condition occurs
many times during a pass of this program, I
would start to look for a data path problem or
some other problem causing transient errors.
- HCE
HEADER COMPARE ERROR
An LBN is read and a Header Compare Error is detected.
LBN's can be replaced for header errors as well as ECC errors.
If this occurs, the LBN will be replaced in a manner similar to
the following example:
LBN: 23271. Status: HCE RBN: 456. Operation: SECONDARY
- DST
DATA SYNC TIMEOUT
An LBN is read and a Data Sync Timeout is detected. LBN's
can be replaced for this error as well as ECC errors. The Data
Sync is a field that is between the header and data field. It
is used to "sync-up" the drive logic just before the data field
is read. If this occurs, the LBN will be replaced in a manner
similar to the following example:
LBN: 23271. Status: DST RBN: 456. Operation: SECONDARY
- BBR
BAD BLOCK REPORTED
Status Code BBR is used to indicate that when the block
was read an ECC error occurred. This ECC error was NOT
uncorrectable but the block needed replacement anyway. A request
for Bad Block Replacement is made and the block is
replaced with either a primary or secondary replacement block.
When this occurs, a message similar to the following is
displayed on the console terminal.
LBN: 23271. Status: BBR RBN: 456. Operation: SECONDARY
- ELAPSED_RUNTIME
Elapsed runtime:
In Automatic mode, the following runtime message is
periodically displayed to indicate that execution is in
progress:
LBN dddddd. Elapsed runtime is hh:mm:ss
The LBN listed indicates where on the media the program is
currently testing; the elapsed runtime is zeroed when each Unit
is selected for test.
- SCANNING_RCT
Scanning RCT... (CAN A GOOD COPY BE ASSEMBLED)
Just after typing the Initialization information for a
drive that is going to be tested, the statement
"Scanning the RCT..." is typed. This portion of the program
attempts to find enough good RCT blocks to account for a good
complete copy. Any error that it finds doing this is
displayed. These messages (Shown Below) are only status
messages and DO NOT necessarily mean that the drive is
defective. The status is shown in the following manner:
RCT copy 1, block 184 ( 891256.) Status/event: 000110
RCT copy 1, block 185 ( 891257.) Status/event: 000153
RCT copy 2, block 184 ( 892021.) Status/event: 000350
These messages do not become critical until the same block
number shows an error for each copy. In the above example
Block 184 is bad in two copies. This is not fatal, since there
are more than 2 copies in our example. You will get a hard
error report, if all copies of a block are bad. The
Status/Event codes can be interpreted by typing: "HELP EVRLK
ERRORS STATUS_EVENT".
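The number in parentheses on each status line is the LBN of the
RCT block. In the example output, the figures are consistent
with the RCT starting at the first LBN past the user area
(891072 for the RA81 volume size quoted elsewhere in this help
text) and with each copy occupying 765 blocks. Both values are
inferred from the example lines, not stated by this program, so
the Python sketch below treats them as assumptions; the name
rct_block_lbn is hypothetical.

```python
RA81_USER_LBNS = 891072   # RA81 volume size quoted in this help text
RCT_COPY_BLOCKS = 765     # assumption: 892021 - 891256 in the example output


def rct_block_lbn(copy, block,
                  first_rct_lbn=RA81_USER_LBNS,
                  copy_size=RCT_COPY_BLOCKS):
    """LBN of a block within a given RCT copy (copies are numbered from 1)."""
    return first_rct_lbn + (copy - 1) * copy_size + block


print(rct_block_lbn(1, 184))   # 891256
print(rct_block_lbn(1, 185))   # 891257
print(rct_block_lbn(2, 184))   # 892021
```

Other drive types will have different sizes, so both defaults
would need to be replaced with values for the drive in question.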
- ZERO_RCT_SIZE
ZERO RCT SIZE -- CAN'T DO REPLACEMENTS ON THIS PRODUCT
This message is telling you that you're running this program
on a drive that does not support Bad Block replacement. The
zero RCT size means that the drive has no "real" Replacement
Control Table and replacements are not possible. If you think that
you have bad blocks on a drive like this, you should consult
the appropriate service manual for action you need to take.
The message is reported in the following manner:
Zero RCT size detected:
BBR not supported by this drive.
Endmsg status: xxx
The Endmsg status will most likely not tell you much
(Unless something really out of the ordinary is happening).
The Endmsg status can be interpreted from the list of
Status/Event codes by typing: "HELP EVRLK ERRORS STATUS_EVENT".
The Unit (drive) is dropped from testing.
- MODES_OF_OPERATION
MODES OF OPERATION (VERIFY, AUTOMATIC, MANUAL)
This program will run in one of three modes
-------------------------------------------
VERIFY MODE
This allows the user to "scan" the media without the
program taking any action on what it finds. In this way,
you can get a "picture" of any problem found, without the
program writing in any way on the media. For additional
information, and instructions on invoking VERIFY mode,
type:
"HELP EVRLK MODES_OF_OPERATION VERIFY"
AUTOMATIC MODE
This is the mode used when you do not know what the bad
block LBN(s) are, or do not have the bad block LBN
address(es) available in decimal. Or, possibly you just
want the program to run unattended, and take action on any
blocks it finds bad. Normally you should use Manual Mode
and enter the bad blocks, given that you know (from the
error logger or other sources of information) what blocks
are bad. Large pass counts in Automatic Mode are
recommended (Run EVRLK/Pass:"number of passes"). For
additional information, and instructions on invoking
AUTOMATIC mode, type:
"HELP EVRLK MODES_OF_OPERATION AUTOMATIC"
MANUAL MODE
This is the preferred method for using the program and the
most efficient. If you can obtain a decimal address of the
bad block(s) you want replaced, this mode will take the
information (LBN address) and without hesitation replace
it. Operating System error loggers should provide the LBN
address of those blocks that generate ECC and Header
errors. For additional information, and instructions on
invoking MANUAL mode, type:
"HELP EVRLK MODES_OF_OPERATION MANUAL"
- AUTOMATIC
AUTOMATIC Replacement mode:
This is the mode used when you do not know what the bad
block LBN(s) are, or do not have the bad block LBN address(es)
available in decimal. Or, possibly you just want the program
to run unattended, and take action on any blocks it finds bad.
Normally you should use Manual Mode and enter the bad blocks,
given that you know (from the error logger or other sources of
information) what blocks are bad. Large pass counts in
Automatic Mode are recommended (Run EVRLK/Pass: "number of
passes").
Execution in this mode is entered by:
1. Ensure the Customer has backed-up and verified the
data from the subject drive.
2. Specify AUTOMATIC replacements.
3. Answer "Enable Replacements" with a "Yes".
4. After the completion of the program, the customer can
restore and verify the backed-up data.
+---------------------------------------------------------+
| If the system crashes (CPU failure, power fail etc) |
| DURING THE EXECUTION OF THIS PROGRAM (Either Manual or |
| Automatic Mode) IT IS REQUIRED that you run at least one|
| pass of this program after recovering from the crash |
| condition. Abrupt termination of this program could |
| leave an incomplete replacement. One quick pass of this |
| program would allow for the completion of any incomplete|
| replacements. Using the control "C" to terminate this |
| program is not considered an abrupt termination and |
| should not leave any incomplete replacements, although |
| use of control "C" to terminate this program is not |
| recommended. |
+---------------------------------------------------------+
CAUTION
It is imperative that every effort be used to establish
the fact that errors showing up in the error log are
indeed related to a bad block, before attempting to use
this program. If ECC errors showing up in an error log
are the result of a hardware problem, this program (When
used in Automatic Mode) could replace a considerable
number of blocks that are indeed good. If this happens,
and later you fix the problem causing the ECC errors, you
should reformat the HDA/Pack. A description of this
situation is given in RA81-TT-12. The information in
this Tech Tip applies to all RA series drives and would
be helpful in this situation.
- RUN_TIMES
RUN TIMES IN AUTOMATIC MODE
Sample Run times in Automatic Mode
----------------------------------
RA82 -- About 17 Minutes
RA81 -- About 9 Minutes
RA60 -- About 4.5 Minutes
RA80 -- About 3 Minutes
NOTE
These times increase when replacements occur.
If you have many bad blocks, the program will
take considerably longer.
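The sample times above can be used for a rough lower-bound estimate of
a multi-pass run. The sketch below assumes each figure is the time for
a single pass with no replacements; the function and table names are
illustrative and are not part of EVRLK.

```python
# Sample Automatic Mode run times in minutes, copied from the table above.
# Assumption: each figure is for one pass with no replacements.
SAMPLE_PASS_MINUTES = {"RA82": 17.0, "RA81": 9.0, "RA60": 4.5, "RA80": 3.0}


def estimate_minutes(drive_type: str, passes: int) -> float:
    """Return passes x sample time; replacements will add to this."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    return SAMPLE_PASS_MINUTES[drive_type.upper()] * passes


if __name__ == "__main__":
    for drive in SAMPLE_PASS_MINUTES:
        minutes = estimate_minutes(drive, 10)
        print(f"{drive}: 10 passes take at least {minutes:.0f} minutes")
```

Treat the result as a minimum, since every replacement extends the run.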
- MANUAL
MANUAL Replacement mode:
This is the preferred method for using the program and the
most efficient. If you can obtain a decimal address of the bad
block(s) you want replaced, this mode will take the information
(LBN address) and without hesitation replace it. Operating
System error loggers should provide the LBN address of those
blocks that generate ECC and Header errors. Sometimes you may
have to "convert" the LBN address from Hexadecimal or Octal, as
provided in the error log report, to decimal so it can be used
by this program. Take care to provide the correct decimal
address. The program will replace any LBN you specify, and if
you specify the wrong one, you will have one additional,
unnecessarily replaced block on the media.
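The radix conversion described above is easy to get wrong by hand. The
following is a minimal sketch of the conversion, not part of EVRLK; the
function name is hypothetical, and the sample values are simply the
hexadecimal and octal forms of the LBN 23271 used as an example
elsewhere in this help text.

```python
def lbn_to_decimal(value: str, radix: int) -> int:
    """Convert an LBN written in octal, decimal or hex to an integer."""
    if radix not in (8, 10, 16):
        raise ValueError("radix must be 8, 10 or 16")
    lbn = int(value.strip(), radix)
    if lbn < 0:
        raise ValueError("an LBN cannot be negative")
    return lbn


if __name__ == "__main__":
    # The same block as it might appear in a hex or an octal error log.
    print(lbn_to_decimal("5AE7", 16))   # hexadecimal input
    print(lbn_to_decimal("55347", 8))   # octal input
```

Both calls yield the same decimal LBN, which is the value to type at
the "Enter LBN to be replaced" question.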
If your operating system has Dynamic Bad Block Replacement
(BBR) -- Like VMS -- you will want to be sure that the block in
question has not already been replaced, before you replace it
in MANUAL mode. To do this run this program in VERIFY mode
(Type: "HELP EVRLK MODES_OF_OPERATION VERIFY") and answer yes
to the question "Display RCT Replacement Descriptors". This
will dump out the current set of replaced blocks and you can
look for the one in question.
If you have no idea what the LBN addresses of the bad
block(s) are, you can possibly run this program in Verify Mode
and if Verify finds any bad blocks, it will report them to you
in decimal. Taking these addresses and using them in Manual
Mode will result in the quickest method for replacing known
bad blocks. However, the customer's operating system has much
more time to find bad blocks than does Verify Mode. You should
use the error log for bad block determination whenever possible.
Execution in this mode is entered by:
1. Ensure the customer has backed up and verified the
data from the subject drive.
2. Specify MANUAL replacements.
3. Answer "Enable Replacements" with a "Yes".
4. Provide the LBN address of the bad block in decimal,
when asked for. A Carriage Return at this question
will exit MANUAL mode.
5. After the completion of the program, the customer can
restore the backed-up data and verify it.
- VERIFY
VERIFY MODE
This allows the user to "scan" the media without the
program taking any action on what it finds. In this way, you
can get a "picture" of any problem found, without the program
writing in any way on the media. This is extremely valuable,
since the drive can be write protected and thus the lengthy
Backup and Restore operations recommended need not be done. By
getting a "look" at the condition of the media, you can then
decide whether to commit to either Automatic or Manual mode and
do the backup and restore of the customer data. This mode can
also be used to get the replacement descriptors displayed (By
answering the question "Display Replacement Descriptors" with a
yes) without the Backup and Restore operations. This way you
can get a "look" at the total number of replacements quickly.
NOTE
The best way to identify bad blocks is the
Operating System Error Logger. Verify only
runs as long as allowed to, whereas, the error
logger will log possible bad blocks as they
occur. It is possible for a block to be
"pattern sensitive", in which case the pattern
currently stored in the block may not cause it
to show up as bad while VERIFY is running.
Execution in this mode is entered by:
1. Write protect the subject drive --- after the drive has
become ready. CAUTION: spinning up an RA80 (only) with it
write protected will result in a fault, because of a write
fail during the spin-up write tests.
2. Start the program and specify AUTOMATIC mode.
3. Answer the question "Enable replacements" with a "No".
Verify Mode can also be entered by specifying Manual Mode
with replacements disabled. In this way, you can "look" at the
"status" of a particular LBN.
- PROMPTS
Operator Prompts:
The program will require the user to answer several
questions, depending upon how the program is run. Several
questions also depend on whether MANUAL or AUTOMATIC
mode is selected (which is the first question asked). You
should be aware of how answering the questions affects the
operation of the program. For information about each kind of
question, see the list below.
- BACKUP
Have you backed-up customer data on all drives? [(No), Yes] Yes<cr>
Before execution of the utility, each drive must be backed
up. The operator must answer this question with a "Y" to continue.
Default is "No".
- AUTO_MODE
Automatic or manual replacement? [(AUTOMATIC), MANUAL] <cr>
The default response selects Automatic replacement mode; a
"MANUAL" selects Manual replacement mode. Default is "AUTOMATIC".
If AUTOMATIC REPLACEMENT MODE is selected, the entire user
LBN area of each Unit will be processed, and this program will
replace all blocks found to be bad. The selected drives will
be run "one at a time" in a serial fashion.
If MANUAL REPLACEMENT MODE is selected, the operator must
specify the desired block(s) to be replaced. In this mode, the
LBN address supplied to the program must be in decimal (Not
Octal, Hexadecimal etc). Only a single Unit can be processed
during each program execution, in Manual Mode.
For information on VERIFY MODE and additional information
on this subject type: "HELP EVRLK MODES_OF_OPERATION"
- ENTER_LBN
Enter LBN to be replaced (decimal) or <cr> to exit [(-1), 0, nnnnnn.]
This is asking for you to specify which LBN you wish to
have replaced. Be sure you are prepared to provide the LBN
address in Decimal. Different Operating Systems' error loggers
display the LBN address in various radixes (Like Hex, Octal,
Decimal etc). Be sure you know what you're dealing with and make
the proper conversions, if necessary.
The LBN entered is checked against the maximum number of
user LBN's available. If the limit check fails, the following
message appears:
Out of range or overflow
HI would be the maximum LBN address that can be replaced
on that drive. If the LBN passes the limit check, the
replacement is made and the operation is displayed.
Once you have provided the LBN address for the
replacement, and the replacement has occurred, this same
question will be asked. If you have no more LBN's for
replacement, just hit Carriage Return to exit.
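The handling of this question can be pictured with the sketch below.
It is only an illustration of the behavior described above, not the
program's actual code. The function name is hypothetical, "hi" stands
for the drive's maximum user LBN (which the caller must supply), and
treating a typed "-1" or a non-numeric answer the same way as shown
here is an assumption.

```python
from typing import Optional


def parse_lbn_answer(answer: str, hi: int) -> Optional[int]:
    """Return the LBN to replace, or None when the operator exits."""
    text = answer.strip()
    if text == "" or text == "-1":
        return None  # Carriage Return (default -1): leave Manual mode
    if text.endswith("."):
        text = text[:-1]  # a trailing period marks a decimal number
    if not (text.isascii() and text.isdigit()):
        raise ValueError("Out of range or overflow")
    lbn = int(text)
    if lbn > hi:
        raise ValueError("Out of range or overflow")
    return lbn


if __name__ == "__main__":
    hypothetical_hi = 1000  # illustrative limit, not a real drive size
    for answer in ("23.", "1000", ""):
        print(repr(answer), "->", parse_lbn_answer(answer, hypothetical_hi))
```

A caller would keep asking the question until the function returns
None, mirroring the way the program repeats the prompt after each
replacement.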
- DISPLAY_RCT
Display RCT replacement descriptors? [(no), Yes] <cr>
This parameter affects program execution in both
replacement modes. The Replacement Control Table (RCT) keeps a
"log" of all LBN's that get replaced and the Replacement Block
Number (RBN) that is used. In Automatic replacement mode,
display selection will cause the updated Replacement Control
Table (RCT) descriptors to be printed at the end of each pass,
for each Unit processed. In Manual replacement mode, this
question is asked twice: When the Unit is brought Online, the
operator may display the existing RCT replacement descriptors.
After all desired LBN's have been replaced, the operator may
display the updated RCT replacement descriptors. Default is "No".
If the display of the RCT replacement descriptors is
selected, the following description of the RCT contents is
provided on the console terminal. Remember that the period
used after a number indicates that the number is displayed in
decimal.
RCT Descriptor information for _DUan
RBN: 455. is SECONDARY replacement for LBN: 23271.
RBN: 456. is PRIMARY replacement for LBN: 23272.
RBN: 457. is UNUSABLE
After having described the "status" of all replacements
logged in the Replacement Control Table (RCT), The following
summary is provided:
RCT Descriptor summary for _DUan
Primary replacement blocks: ddd.
Secondary replacement blocks: ddd.
Unusable replacement blocks: ddd.
dddd. RBNs used out of ddddd. RBNs on media.
Displaying this information will provide an indication of
the TOTAL number of replacements that are "logged" in the RCT.
These replacements may have been made by the Operating System
(If the O/S has the capability of doing Bad Block Replacement),
the formatter or possibly this program. Although there is no
"specification" for how many replacements are too many, you
should be aware of certain conditions that would indicate a
problem. An RA81, for example, can accommodate 17,472 total
replacements. A good working HDA can have a thousand or more
primary replacements, however, the number of secondary
replacements should be small. If the display shows considerably
more than this, it could indicate that the drive has or had a data
path problem. ECC errors being generated by a Data Path problem can
cause a significant number of good blocks to be replaced before
the data path problem can be repaired. If you see what may
appear to be this condition, you should read RA81-TT-12. The
Tech Tip applies to all RA drives, even though it is written as
an RA81 Tech Tip. If this is the condition you sense, the tech
tip will recommend that you reformat the HDA. Now that this
program exists, it would be wise to not only reformat, to
recover from this condition, but to run as many passes of this
program as possible, after having reformatted. Any bad blocks
that the formatter may not have found, may be found by this
program and replaced.
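When the descriptor display is long, the primary, secondary and
unusable counts can be tallied mechanically from a captured console
log. The sketch below assumes the descriptor lines have exactly the
layout of the three sample lines shown above; the function name and
the regular expression are illustrative and are not part of EVRLK.

```python
import re
from collections import Counter

# Matches the two descriptor layouts shown in the sample display.
DESCRIPTOR = re.compile(
    r"RBN:\s*(\d+)\.\s+is\s+"
    r"(?:(PRIMARY|SECONDARY)\s+replacement\s+for\s+LBN:\s*(\d+)\.|(UNUSABLE))"
)


def tally_descriptors(lines):
    """Count PRIMARY, SECONDARY and UNUSABLE descriptor lines."""
    counts = Counter()
    for line in lines:
        match = DESCRIPTOR.search(line)
        if match:
            counts[match.group(2) or match.group(4)] += 1
    return counts


sample = [
    "RBN: 455. is SECONDARY replacement for LBN: 23271.",
    "RBN: 456. is PRIMARY replacement for LBN: 23272.",
    "RBN: 457. is UNUSABLE",
]

if __name__ == "__main__":
    for kind, count in sorted(tally_descriptors(sample).items()):
        print(f"{kind}: {count}")
```

A secondary count that is not small is the condition the discussion
above suggests checking against RA81-TT-12.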
- ENABLE_REPLACE
Enable replacements? [(Yes), No] <cr>
This is a unique feature of this program. Disabling
replacements (Answering with a NO) makes this program run in a
similar fashion to the HSC50 "Verify" program. In other words,
this program will "Scan" the media and report anything it finds
but not do any replacements. Using it in this mode could
possibly help you determine the "condition" of the media,
without having any replacements occur. Default is "Yes".
NOTE - This program is not intended for use as a diagnostic and
should not be run on drives/controllers that may have hardware
problems.
Using this program in this mode (Answering with a "No")
will disable all writes and therefore you can run it without
the normal requirement for a back-up and restore. However, we
cannot be held responsible for any errors or subsystem
problems that may create situations that cause this program (or
subsystem) to do unexpected things.
What you can do is run with replacements disabled just to see
if it's worth your time to do the extensive Backup and Restore
operations that are required when replacements are enabled.
Remember, you cannot run this program with replacements
enabled and the drives write protected. One pass may not find
all the "marginal" bad spots. It is highly recommended that
when running in the "disable replacement" mode (Answer=NO) that
you allow the program to run several passes.
This mode can also be used to get a quick "look" at the
number and types of replaced blocks that are "logged" in the
Replacement Control table (RCT). By going into this mode, and
answering the question "Display Replacement Descriptors" with a
yes, you can get these typed without the Backup and Restore.
When analyzing the descriptors that are typed, be sure you read
and understand the discussion for the question "Display
Replacement Descriptors" (above).
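The way the replacement mode answer and the "Enable replacements"
answer combine, as described above and under VERIFY MODE, can be
summarized in a few lines. This is only a sketch of the documented
behavior; the function names are hypothetical and do not correspond to
anything inside the program.

```python
def effective_mode(replacement_mode: str, enable_replacements: bool) -> str:
    """Return the mode the program effectively runs in."""
    mode = replacement_mode.strip().upper()
    if mode not in ("AUTOMATIC", "MANUAL"):
        raise ValueError("mode must be AUTOMATIC or MANUAL")
    # With replacements disabled, either mode only scans and reports.
    return mode if enable_replacements else "VERIFY"


def writes_to_media(enable_replacements: bool) -> bool:
    """Writes occur only when replacements are enabled."""
    return enable_replacements


if __name__ == "__main__":
    for mode in ("AUTOMATIC", "MANUAL"):
        for enabled in (True, False):
            print(mode, enabled, "->", effective_mode(mode, enabled))
```

Only the combinations that return VERIFY can be run with the drive
write protected.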
- ENABLE_FE
Enable write with Forced Error flag? [(Yes), No] <cr>
When an uncorrectable ECC error is encountered, several
attempts are made to read the data correctly (Using every means
available). If these attempts fail, the block generating the
Uncorrectable ECC error is assumed to be bad and in need of
replacement. Thus, if you have a system that does "Dynamic Bad
Block Replacement" (Like VMS, RSTS, RSX, IAS), or you're running
this program, the replacement process will take the
uncorrectable (Corrupt) data and move it to a good replacement
block (RBN). Now, we have a condition where we have corrupt
data in a good block and if left this way the corrupt data
would be read with no "indication" that the data is "corrupt".
Therefore, in order to tell a user that the data was at one
time uncorrectable and exists now as "corrupted" data (In
a good RBN), the "Forced Error Flag" is attached to the block
of "corrupted" data. When this block is read by a user, this
flag (Inverted EDC character) is also read and is intended to
inform the user that the requested data is "not reliable". The
default is "Yes".
The FORCED ERROR flag is not an error. When read,
it will not make an entry in the system error log. It is
reported as a "status" code 10 (octal) in a transfer request
"end packet" (This program calls the end packet status field
the "Endmsg status", in several message types that can be
displayed).
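Because the status code is given in octal, it is easy to misread it as
decimal ten. The sketch below shows the comparison, assuming the
"Endmsg status" value being tested holds only this status code; the
names are illustrative and are not taken from the program.

```python
# Status code 10 (octal), as stated above. Octal 10 is decimal 8.
FORCED_ERROR_STATUS = 0o10


def is_forced_error(endmsg_status: int) -> bool:
    """True when the end packet status is the Forced Error code."""
    return endmsg_status == FORCED_ERROR_STATUS


if __name__ == "__main__":
    displayed = "10"               # status as printed, in octal
    status = int(displayed, 8)     # parse the octal text
    print(status, is_forced_error(status))
    print(10, is_forced_error(10))  # decimal ten is a different value
```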
Some Operating Systems have trouble dealing with, or reporting,
the "Forced Error Flag". Also, some others do not have an
intelligent way of reporting the flag to the user (Makes it
appear as a "hardware error"). Currently (March 1985) UNIX type
systems (ULTRIX, Berkeley UNIX, ATT UNIX etc) can "give up" (I
assume that means crash) in certain situations when a "Forced
Error Flag" is read. This being what we are told, we added
this question to allow the user to disable the function that
writes the forced error flag to an RBN with the uncorrectable
data.
If you answer the question "Enable write with Forced Error
Flag" with a NO, you will put uncorrectable data into a good
RBN (Just normal replacement). The "corrupted" data will read
good, with NO indication of an error or the "corrupted data
flag" (Forced error Flag). Therefore, the possibility exists of
reading corrupted data with NO indication that it is corrupted.
If you follow the recommended procedure for running this
program and do the backup, and then restore the backed-up data
after running this program, you will not have a problem.
Restoring the backed-up data, to a drive that had this question
answered with a NO, should eliminate any problem condition for
these systems. For those systems that do not know how to
handle the "Forced Error Flag", following the procedure of
answering the question with a no and then doing a backed-up
data restore is a requirement WITHOUT EXCEPTION.