The target signals its readiness to receive write data via the R2T PDU. The target also uses the R2T PDU to request retransmission of missing Data-Out PDUs. In both cases, the PDU format is the same, but an R2T PDU sent to request retransmission is called a Recovery R2T PDU. Figure 8-11 illustrates the iSCSI BHS of an R2T PDU. All fields marked with "." are reserved.
Figure 8-11. iSCSI R2T BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:
  • Reserved This is 1 bit.
  • Reserved This is the I bit redefined as Reserved.
  • Opcode This is 6 bits long. It is set to 0x31.
  • F bit This is always set to 1.
  • Reserved This is 23 bits long.
  • TotalAHSLength This is 8 bits long. It is always set to 0.
  • DataSegmentLength This is 24 bits long. It is always set to 0.
  • LUN This is 64 bits (8 bytes) in length.
  • ITT This is 32 bits in length.
  • TTT This is 32 bits long. It contains a tag that aids the target in associating Data-Out PDUs with this R2T PDU. All values are valid except 0xFFFFFFFF, which is reserved for use by initiators during first burst.
  • StatSN This is 32 bits long. It contains the StatSN that will be assigned to this command upon completion. This is the same as the ExpStatSN from the initiator's perspective.
  • ExpCmdSN This is 32 bits long.
  • MaxCmdSN This is 32 bits long.
  • R2TSN This is 32 bits long. It uniquely identifies each R2T PDU within the context of a single SCSI task. Each task is identified by the ITT. This field is incremented by 1 for each new R2T PDU transmitted within a SCSI task. A retransmitted R2T PDU carries the same R2TSN as the original PDU. This field is also incremented by 1 for each Data-In PDU transmitted during bidirectional command processing.
  • Buffer Offset This is 32 bits long. It indicates the position of the first byte of data requested by this PDU relative to the first byte of all the data transferred by the SCSI command.
  • Desired Data Transfer Length This is 32 bits long. This field indicates how much data should be transferred in response to this R2T PDU. This field is expressed in bytes. The value of this field cannot be 0 and cannot exceed the negotiated value of MaxBurstLength (see the iSCSI Login Parameters section of this chapter).
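The field list above maps directly onto a fixed 48-byte header. The following Python sketch packs and unpacks an R2T BHS under the layout described above, assuming network (big-endian) byte order. The function names `pack_r2t` and `unpack_r2t` and the `max_burst_length` argument are illustrative choices, not part of any real iSCSI library.

```python
import struct

R2T_OPCODE = 0x31
RESERVED_TTT = 0xFFFFFFFF

# Opcode byte, F-bit byte, 2 reserved bytes, TotalAHSLength,
# 3-byte DataSegmentLength, 8-byte LUN, then eight 32-bit fields.
R2T_BHS = struct.Struct(">BB2xB3s8s8I")


def pack_r2t(lun, itt, ttt, stat_sn, exp_cmd_sn, max_cmd_sn,
             r2t_sn, buffer_offset, desired_length, max_burst_length):
    """Build the 48-byte BHS of an R2T PDU (illustrative helper)."""
    if len(lun) != 8:
        raise ValueError("LUN must be 8 bytes")
    if ttt == RESERVED_TTT:
        raise ValueError("0xFFFFFFFF is not a valid TTT in an R2T PDU")
    if not 0 < desired_length <= max_burst_length:
        raise ValueError("Desired Data Transfer Length out of range")
    return R2T_BHS.pack(
        R2T_OPCODE,   # two reserved bits (0) followed by the 6-bit opcode
        0x80,         # F bit set; the remaining 7 bits are reserved
        0,            # TotalAHSLength is always 0
        b"\x00" * 3,  # DataSegmentLength is always 0
        lun, itt, ttt, stat_sn, exp_cmd_sn, max_cmd_sn,
        r2t_sn, buffer_offset, desired_length,
    )


def unpack_r2t(bhs):
    """Split a 48-byte R2T BHS into named fields."""
    names = ("opcode", "flags", "total_ahs_length", "data_segment_length",
             "lun", "itt", "ttt", "stat_sn", "exp_cmd_sn", "max_cmd_sn",
             "r2t_sn", "buffer_offset", "desired_length")
    return dict(zip(names, R2T_BHS.unpack(bhs)))
```

A target that wants the second 256 KB of a write would, in this sketch, call `pack_r2t` with `buffer_offset=262144` and `desired_length=262144`.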
iSCSI supports PDU retransmission and PDU delivery acknowledgment on demand via the SNACK Request PDU. Each SNACK Request PDU specifies a contiguous set of missing single-type PDUs. Each set is called a run. Figure 8-12 illustrates the iSCSI BHS of a SNACK Request PDU. All fields marked with "." are reserved.
Figure 8-12. iSCSI SNACK Request BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:
  • Reserved This is 1 bit.
  • Reserved This is the I bit redefined as Reserved.
  • Opcode This is 6 bits long. It is set to 0x10.
  • F bit This is always set to 1.
  • Reserved This is 3 bits long.
  • Type This is 4 bits long. The SNACK Request PDU serves multiple functions, so RFC 3720 defines multiple SNACK Request PDU types. This field indicates the PDU function. The PDU format is the same for all SNACK Request PDUs regardless of type, but some fields contain type-specific information. All PDU types must be supported if an ErrorRecoveryLevel greater than 0 is negotiated during login (see the iSCSI Login Parameters section of this chapter). Currently, only four PDU types are defined (see Table 8-4). All other types are reserved.
  • Reserved This is 16 bits long.
  • TotalAHSLength This is 8 bits long.
  • DataSegmentLength This is 24 bits long.
  • LUN or Reserved This is 64 bits (8 bytes) long. It contains a LUN if the PDU type is DataACK. The value in this field is copied from the LUN field of the Data-In PDU that requested the DataACK PDU. Otherwise, this field is reserved.
  • ITT or 0xFFFFFFFF This is 32 bits long. It is set to 0xFFFFFFFF if the PDU type is Status or DataACK. Otherwise, this field contains the ITT of the associated task.
  • TTT or SNACK Tag or 0xFFFFFFFF This is 32 bits long. It contains a TTT if the PDU type is DataACK. The value in this field is copied from the TTT field of the Data-In PDU that requested the DataACK PDU. This field contains a SNACK Tag if the PDU type is R-Data. Otherwise, this field is set to 0xFFFFFFFF.
  • Reserved This is 32 bits long.
  • ExpStatSN This is 32 bits long.
  • Reserved This is 64 bits (8 bytes) long.
  • BegRun or ExpDataSN This is 32 bits long. For Data/R2T and Status PDUs, this field contains the identifier (DataSN, R2TSN, or StatSN) of the first PDU to be retransmitted. This value indicates the beginning of the run. Note that the SNACK Request does not request retransmission of data based on relative offset. Instead, one or more specific PDUs are requested. This contrasts with the FCP model. For DataACK PDUs, this field contains the initiator's ExpDataSN. All Data-In PDUs up to but not including the ExpDataSN are acknowledged by this field. For R-Data PDUs, this field must be set to 0. In this case, all unacknowledged Data-In PDUs are retransmitted. If no Data-In PDUs have been acknowledged, the entire read sequence is retransmitted beginning at DataSN 0. If some Data-In PDUs have been acknowledged, the first retransmitted Data-In PDU is assigned the first unacknowledged DataSN.
  • RunLength This is 32 bits long. For Data/R2T and Status PDUs, this field specifies the number of PDUs to retransmit. This field may be set to 0 to indicate that all PDUs with a sequence number equal to or greater than BegRun must be retransmitted. For DataACK and R-Data PDUs, this field must be set to 0.
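Because each SNACK Request covers one contiguous run, an initiator that has lost several non-adjacent PDUs needs one request per gap. The following Python sketch groups missing sequence numbers into (BegRun, RunLength) pairs. The function name `missing_runs` is hypothetical, and the sketch ignores 32-bit sequence number wraparound for simplicity.

```python
def missing_runs(received, expected_count):
    """Group the sequence numbers missing from range(expected_count)
    into contiguous (beg_run, run_length) pairs.

    Each pair corresponds to the BegRun and RunLength fields of one
    Data/R2T or Status SNACK Request.
    """
    have = set(received)
    runs = []
    start = None  # first sequence number of the gap currently being scanned
    for sn in range(expected_count):
        if sn not in have:
            if start is None:
                start = sn
        elif start is not None:
            runs.append((start, sn - start))
            start = None
    if start is not None:
        # The gap extends to the end of the expected range.
        runs.append((start, expected_count - start))
    return runs


# DataSN 2, 3, 6, 7, and 9 never arrived out of 10 expected PDUs.
print(missing_runs([0, 1, 4, 5, 8], 10))  # [(2, 2), (6, 2), (9, 1)]
```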
Table 8-4 summarizes the SNACK Request PDU types that are currently defined in RFC 3720. All PDU types excluded from Table 8-4 are reserved.
Table 8-4. iSCSI SNACK Request PDU Types
Type | Name | Function
0 | Data/R2T | Initiators use this PDU type to request retransmission of one or more Data-In or R2T PDUs. By contrast, targets use the Recovery R2T PDU to request retransmission of one or more Data-Out PDUs.
1 | Status | Initiators use this PDU type to request retransmission of one or more Login Response PDUs or a SCSI Response PDU. By contrast, targets do not request retransmission of SCSI Command PDUs.
2 | DataACK | Initiators use this PDU type to provide explicit, positive, cumulative acknowledgment for Data-In PDUs. This frees buffer space within the target device and enables efficient recovery of dropped PDUs during long read operations. By contrast, targets do not provide acknowledgment for Data-Out PDUs. This is not necessary because the SAM requires initiators to keep all write data in memory until a SCSI status of GOOD, CONDITION MET, or INTERMEDIATE-CONDITION MET is received.
3 | R-Data | Initiators use this PDU type to request retransmission of one or more Data-In PDUs that need to be resegmented. The need for resegmentation occurs when the initiator's MaxRecvDataSegmentLength changes during read command processing. By contrast, targets use the Recovery R2T PDU to request retransmission of one or more Data-Out PDUs if the target's MaxRecvDataSegmentLength changes during write command processing. Even when resegmentation is not required, initiators use this PDU type. If a SCSI Response PDU is received before all associated Data-In PDUs are received, this PDU type must be used to request retransmission of the missing Data-In PDUs. In such a case, the associated SCSI Response PDU must be retransmitted after the Data-In PDUs are retransmitted. The SNACK Tag must be copied into the duplicate SCSI Response PDU to enable the initiator to discern between the duplicate SCSI Response PDUs.

iSCSI initiators manage SCSI and iSCSI tasks via the TMF Request PDU. Figure 8-13 illustrates the iSCSI BHS of a TMF Request PDU. All fields marked with "." are reserved.
Figure 8-13. iSCSI TMF Request BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:
  • Reserved This is 1 bit.
  • I This is 1 bit.
  • Opcode This is 6 bits long. It is set to 0x02.
  • F bit This is always set to 1.
  • Function This is 7 bits long. It contains the TMF Request code of the function to be performed. iSCSI currently supports six of the TMFs defined in the SAM-2 specification and one TMF defined in RFC 3720 (see Table 8-5). All other TMF Request codes are reserved.
  • Reserved This is 16 bits long.
  • TotalAHSLength This is 8 bits long. It is always set to 0.
  • DataSegmentLength This is 24 bits long. It is always set to 0.
  • LUN or Reserved This is 64 bits (8 bytes) long. It contains a LUN if the TMF is ABORT TASK, ABORT TASK SET, CLEAR ACA, CLEAR TASK SET, or LOGICAL UNIT RESET. Otherwise, this field is reserved.
  • ITT This is 32 bits long. It contains the ITT assigned to this TMF command. This field does not contain the ITT of the task upon which the TMF command acts.
  • Referenced Task Tag (RTT) or 0xFFFFFFFF This is 32 bits long. If the TMF is ABORT TASK or TASK REASSIGN, this field contains the ITT of the task upon which the TMF command acts. Otherwise, this field is set to 0xFFFFFFFF.
  • CmdSN This is 32 bits long. It contains the CmdSN of the TMF command. TMF commands are numbered the same way SCSI read and write commands are numbered. This field does not contain the CmdSN of the task upon which the TMF command acts.
  • ExpStatSN This is 32 bits long.
  • RefCmdSN or Reserved This is 32 bits long. If the TMF is ABORT TASK, this field contains the CmdSN of the task upon which the TMF command acts. The case of linked commands is not explicitly described in RFC 3720. Presumably, this field should contain the highest CmdSN associated with the RTT. This field is reserved for all other TMF commands.
  • ExpDataSN or Reserved This is 32 bits long. It is used only if the TMF is TASK REASSIGN. Otherwise, this field is reserved. For read and bidirectional commands, this field contains the highest acknowledged DataSN plus one for Data-In PDUs. This is known as the data acknowledgment reference number (DARN). If no Data-In PDUs were acknowledged before connection failure, this field contains the value 0. The initiator must discard all unacknowledged Data-In PDUs for the affected task(s) after a connection failure. The target must retransmit all unacknowledged Data-In PDUs for the affected task(s) after connection allegiance is reassigned. For write commands and write data in bidirectional commands, this field is not used. The target simply requests retransmission of Data-Out PDUs as needed via the Recovery R2T PDU.
  • Reserved This is 64 bits long.
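Several TMF Request fields are meaningful only for certain functions. The following Python sketch restates those rules from the field descriptions above as a small lookup. The names `TMF` and `tmf_request_fields` are illustrative, and the sketch fills unused (reserved) fields with 0.

```python
from enum import IntEnum


class TMF(IntEnum):
    ABORT_TASK = 1
    ABORT_TASK_SET = 2
    CLEAR_ACA = 3
    CLEAR_TASK_SET = 4
    LOGICAL_UNIT_RESET = 5
    TARGET_WARM_RESET = 6
    TARGET_COLD_RESET = 7
    TASK_REASSIGN = 8


RESERVED_TAG = 0xFFFFFFFF

# Functions that carry a LUN, and functions that reference another task.
LUN_FUNCTIONS = {TMF.ABORT_TASK, TMF.ABORT_TASK_SET, TMF.CLEAR_ACA,
                 TMF.CLEAR_TASK_SET, TMF.LOGICAL_UNIT_RESET}
RTT_FUNCTIONS = {TMF.ABORT_TASK, TMF.TASK_REASSIGN}


def tmf_request_fields(function, lun=0, referenced_itt=0,
                       ref_cmd_sn=0, exp_data_sn=0):
    """Return the function-dependent fields of a TMF Request.

    Values supplied for fields that the chosen function does not use
    are replaced with the reserved value.
    """
    function = TMF(function)
    return {
        "LUN": lun if function in LUN_FUNCTIONS else 0,
        "RTT": referenced_itt if function in RTT_FUNCTIONS else RESERVED_TAG,
        "RefCmdSN": ref_cmd_sn if function == TMF.ABORT_TASK else 0,
        "ExpDataSN": exp_data_sn if function == TMF.TASK_REASSIGN else 0,
    }
```

For example, a TARGET WARM RESET built this way carries no LUN and an RTT of 0xFFFFFFFF regardless of the arguments passed in.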
Table 8-5 summarizes the TMF Request codes that are currently supported by iSCSI. All TMF Request codes excluded from Table 8-5 are reserved.
Table 8-5. iSCSI TMF Request Codes
TMF Code | TMF Name | Description
1 | ABORT TASK | This function instructs the Task Manager of the specified LUN to abort the task identified in the RTT field. This TMF command cannot be used to terminate TMF commands.
2 | ABORT TASK SET | This function instructs the Task Manager of the specified LUN to abort all tasks issued within the associated session. This function does not affect tasks instantiated by other initiators.
3 | CLEAR ACA | This function instructs the Task Manager of the specified LUN to clear the ACA condition. This has the same effect as ABORT TASK for all tasks with the ACA attribute. Tasks that do not have the ACA attribute are not affected.
4 | CLEAR TASK SET | This function instructs the Task Manager of the specified LUN to abort all tasks identified by the task set type (TST) field in the SCSI Control Mode Page. This function can abort all tasks from a single initiator or all tasks from all initiators.
5 | LOGICAL UNIT RESET | This function instructs the Task Manager of the specified LUN to abort all tasks, clear all ACA conditions, release all reservations, reset the logical unit's operating mode to its default state, and set a Unit Attention condition. In the case of hierarchical LUNs, these actions also must be taken for each dependent logical unit.
6 | TARGET WARM RESET | This function instructs the Task Manager of LUN 0 to perform a LOGICAL UNIT RESET for every LUN accessible via the target port through which the command is received. This function is subject to SCSI access controls and also may be subject to iSCSI access controls.
7 | TARGET COLD RESET | This function instructs the Task Manager of LUN 0 to perform a LOGICAL UNIT RESET for every LUN accessible via the target port through which the command is received. This function is not subject to SCSI access controls but may be subject to iSCSI access controls. This function also instructs the Task Manager of LUN 0 to terminate all TCP connections for the target port through which the command is received.
8 | TASK REASSIGN | This function instructs the Task Manager of the specified LUN to reassign connection allegiance for the task identified in the RTT field. Connection allegiance is reassigned to the TCP connection on which the TASK REASSIGN command is received. This function is supported only if the session supports an ErrorRecoveryLevel of two. This function must always be transmitted as an immediate command.

Each TMF Request PDU precipitates one TMF Response PDU. Figure 8-14 illustrates the iSCSI BHS of a TMF Response PDU. All fields marked with "." are reserved.
Figure 8-14. iSCSI TMF Response BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:
  • Reserved This is 1 bit.
  • Reserved This is the I bit redefined as Reserved.
  • Opcode This is 6 bits long. It is set to 0x22.
  • F bit This is always set to 1.
  • Reserved This is 7 bits long.
  • Response This is 8 bits long. This field indicates the completion status for the TMF command identified in the ITT field. RFC 3720 currently defines eight TMF Response codes (see Table 8-6). All other values are reserved.
  • Reserved This is 8 bits long.
  • TotalAHSLength This is 8 bits long. It is always set to 0.
  • DataSegmentLength This is 24 bits long. It is always set to 0.
  • Reserved This is 64 bits (8 bytes) long.
  • ITT This is 32 bits long.
  • Reserved This is 32 bits long.
  • StatSN This is 32 bits long.
  • ExpCmdSN This is 32 bits long.
  • MaxCmdSN This is 32 bits long.
  • Reserved This is 96 bits (12 bytes) long.
Table 8-6 summarizes the TMF Response codes that are currently supported by iSCSI. All TMF Response codes excluded from Table 8-6 are reserved.
Table 8-6. iSCSI TMF Response Codes
Response Code | Response Name | Description
0 | Function Complete | The TMF command completed successfully.
1 | Task Does Not Exist | The task identified in the RTT field of the TMF request PDU does not exist. This response is valid only if the CmdSN in the RefCmdSN field in the TMF request PDU is outside the valid CmdSN window. If the CmdSN in the RefCmdSN field in the TMF request PDU is within the valid CmdSN window, a function complete response must be sent.
2 | LUN Does Not Exist | The LUN identified in the LUN or Reserved field of the TMF request PDU does not exist.
3 | Task Still Allegiant | Logout of the old connection has not completed. A task may not be reassigned until logout of the old connection successfully completes with reason code "remove the connection for recovery".
4 | Task Allegiance Reassignment Not Supported | The session does not support ErrorRecoveryLevel 2.
5 | TMF Not Supported | The target does not support the requested TMF command. Some TMF commands are optional for targets.
6 | Function Authorization Failed | The initiator is not authorized to execute the requested TMF command.
255 | Function Rejected | The initiator attempted an illegal TMF request (such as ABORT TASK for a different TMF task).

The Reject PDU signals an error condition and rejects the PDU that caused the error. The Data segment (not shown in Figure 8-15) must contain the header of the PDU that caused the error. If a Reject PDU causes a task to terminate, a SCSI Response PDU with status CHECK CONDITION must be sent. Figure 8-15 illustrates the iSCSI BHS of a Reject PDU. All fields marked with "." are reserved.
Figure 8-15. iSCSI Reject BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:
  • Reserved This is 1 bit.
  • Reserved This is the I bit redefined as Reserved.
  • Opcode This is 6 bits long. It is set to 0x3F.
  • F bit This is always set to 1.
  • Reserved This is 7 bits long.
  • Reason This is 8 bits long. This field indicates the reason the erroneous PDU is being rejected. RFC 3720 currently defines 11 Reject Reason codes (see Table 8-7). All other values are reserved.
  • Reserved This is 8 bits long.
  • TotalAHSLength This is 8 bits long. It is always set to 0.
  • DataSegmentLength This is 24 bits long.
  • Reserved This is 64 bits (8 bytes) long.
  • ITT This is 32 bits long. It is set to 0xFFFFFFFF.
  • Reserved This is 32 bits long.
  • StatSN This is 32 bits long.
  • ExpCmdSN This is 32 bits long.
  • MaxCmdSN This is 32 bits long.
  • DataSN/R2TSN or Reserved This is 32 bits long. This field is valid only when rejecting a Data/R2T SNACK Request PDU. The Reject Reason code must be 0x04 (Protocol Error). This field indicates the DataSN or R2TSN of the next Data-In or R2T PDU to be transmitted by the target. Otherwise, this field is reserved.
  • Reserved This is 64 bits (8 bytes) long.
Table 8-7 summarizes the Reject Reason codes that are currently supported by iSCSI. All Reject Reason codes excluded from Table 8-7 are reserved.
Table 8-7. iSCSI Reject Reason Codes
Reason Code | Reason Name
0x02 | Data-Digest Error
0x03 | SNACK Reject
0x04 | Protocol Error
0x05 | Command Not Supported
0x06 | Immediate Command Rejected; Too Many Immediate Commands
0x07 | Task In Progress
0x08 | Invalid DataACK
0x09 | Invalid PDU Field
0x0a | Long Operation Reject; Cannot Generate TTT; Out Of Resources
0x0b | Negotiation Reset
0x0c | Waiting For Logout

The preceding discussion of iSCSI PDU formats is simplified for the sake of clarity. Comprehensive exploration of all the iSCSI PDUs and their variations is outside the scope of this book. For more information, readers are encouraged to consult IETF RFC 3720 and the ANSI T10 SAM-2, SAM-3, SPC-2, and SPC-3 specifications.

5. iSCSI Login Parameters

During the Login Phase, security and operating parameters are exchanged as text key-value pairs. As previously stated, text keys are encapsulated in the Data segment of the Login Request and Login Response PDUs. Some operating parameters may be re-negotiated after the Login Phase completes (during the Full Feature Phase) via the Text Request and Text Response PDUs. However, most operating parameters remain unchanged for the duration of a session. Security parameters may not be re-negotiated during an active session. Some text keys have a session-wide scope, and others have a connection-specific scope. Some text keys may be exchanged only during negotiation of the leading connection for a new session. Some text keys require a response (negotiation), and others do not (declaration). Currently, RFC 3720 defines 22 operational text keys. RFC 3720 also defines a protocol extension mechanism that enables the use of public and private text keys that are not defined in RFC 3720. This section describes the standard operational text keys and the extension mechanism. The format of all text key-value pairs is:
<key name>=<list of values>
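As a rough illustration of how a receiver might decode text keys from a Data segment, the following Python sketch splits the segment into key-value pairs. It assumes that each pair is terminated by a null byte and that a list of values is comma-separated, which is how RFC 3720 encodes them. The function name `parse_text_keys` is hypothetical.

```python
def parse_text_keys(data_segment: bytes) -> dict:
    """Decode a Login or Text PDU Data segment into {key: [values]}.

    Assumes null-terminated key=value pairs with comma-separated
    value lists; empty items (trailing terminator, padding) are skipped.
    """
    pairs = {}
    for item in data_segment.split(b"\x00"):
        if not item:
            continue
        key, sep, value = item.decode("utf-8").partition("=")
        if not sep:
            raise ValueError(f"not a key=value pair: {item!r}")
        pairs[key] = value.split(",")
    return pairs


segment = b"HeaderDigest=CRC32C,None\x00MaxConnections=4\x00"
print(parse_text_keys(segment))
# {'HeaderDigest': ['CRC32C', 'None'], 'MaxConnections': ['4']}
```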
The SessionType key declares the type of iSCSI session. Only initiators send this key. This key must be sent only during the Login Phase on the leading connection. The valid values are Normal and Discovery. The default value is Normal. The scope is session-wide.
The HeaderDigest and DataDigest keys negotiate the use of the Header-Digest segment and the Data-Digest segment, respectively. Initiators and targets send these keys. These keys may be sent only during the Login Phase. Values that must be supported include CRC32C and None. Other public and private algorithms may be supported. The default value is None for both keys. The chosen digest must be used in every PDU sent during the Full Feature Phase. The scope is connection-specific.
The SendTargets key is used by initiators to discover targets during a Discovery session. This key may also be sent by initiators during a Normal session to discover changed or additional paths to a known target. Sending this key during a Normal session is fruitful only if the target configuration changes after the Login Phase. This is because, during a Discovery session, a target network entity must return all target names, sockets, and TPGTs for all targets that the requesting initiator is permitted to access. Additionally, path changes occurring during the Login Phase of a Normal session are handled via redirection. This key may be sent only during the Full Feature Phase. The scope is session-wide.
The TargetName key declares the iSCSI device name of one or more target devices within the responding network entity. This key may be sent by targets only in response to a SendTargets command. This key may be sent by initiators only during the Login Phase of a Normal session, and the key must be included in the leading Login Request PDU for each connection. The scope is session-wide.
The TargetAddress key declares the network addresses, TCP ports, and TPGTs of the target device to the initiator device. An address may be given in the form of DNS host name, IPv4 address, or IPv6 address. The TCP port may be omitted if the default port of 3260 is used. Only targets send this key. This key is usually sent in response to a SendTargets command, but it may be sent in a Login Response PDU to redirect an initiator. Therefore, this key may be sent during any phase. The scope is session-wide.
The InitiatorName key declares the iSCSI device name of the initiator device within the initiating network entity. This key identifies the initiator device to the target device so that access controls can be implemented. Only initiators send this key. This key may be sent only during the Login Phase, and the key must be included in the leading Login Request PDU for each connection. The scope is session-wide.
The InitiatorAlias key declares the optional human-friendly name of the initiator device to the target for display in relevant user interfaces. Only initiators send this key. This key is usually sent in a Login Request PDU for a Normal session, but it may be sent during the Full Feature Phase as well. The scope is session-wide.
The TargetAlias key declares the optional human-friendly name of the target device to the initiator for display in relevant user interfaces. Only targets send this key. This key usually is sent in a Login Response PDU for a Normal session, but it may be sent during the Full Feature Phase as well. The scope is session-wide.
The TargetPortalGroupTag key declares the TPGT of the target port to the initiator port. Only targets send this key. This key must be sent in the first Login Response PDU of a Normal session unless the first Login Response PDU redirects the initiator to another TargetAddress. The range of valid values is 0 to 65,535. The scope is session-wide.
The ImmediateData and InitialR2T keys negotiate support for immediate data and unsolicited data, respectively. Immediate data may not be sent unless both devices support immediate data. Unsolicited data may not be sent unless both devices support unsolicited data. Initiators and targets send these keys. These keys may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The default settings support immediate data but not unsolicited data. The scope is session-wide for both keys.
The MaxOutstandingR2T key negotiates the maximum number of R2T PDUs that may be outstanding simultaneously for a single task. This key does not include the implicit R2T PDU associated with unsolicited data. Each R2T PDU is considered outstanding until the last Data-Out PDU is transferred (initiator's perspective) or received (target's perspective). A sequence timeout can also terminate the lifespan of an R2T PDU. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The range of valid values is 1 to 65,535. The default value is 1. The scope is session-wide.
The MaxRecvDataSegmentLength key declares the maximum amount of data that a receiver (initiator or target) can receive in a single iSCSI PDU. Initiators and targets send this key. This key may be sent during any phase of any session type and is usually sent during the Login Phase on the leading connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215. The default value is 8,192. The scope is connection-specific.
The MaxBurstLength key negotiates the maximum amount of data that a receiver (initiator or target) can receive in a single iSCSI sequence. This value may exceed the value of MaxRecvDataSegmentLength, which means that more than one PDU may be sent in response to an R2T PDU. This contrasts with the FC model. For write commands, this key applies only to solicited data. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215. The default value is 262,144. The scope is session-wide.
The FirstBurstLength key negotiates the maximum amount of data that a target can receive in a single iSCSI sequence of unsolicited data (including immediate data). Thus, the value of this key minus the amount of immediate data received with the SCSI command PDU yields the amount of unsolicited data that the target can receive in the same sequence. If neither immediate data nor unsolicited data is supported within the session, this key is invalid. The value of this key cannot exceed the target's MaxBurstLength. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215. The default value is 65,536. The scope is session-wide.
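To make the interaction between FirstBurstLength and MaxBurstLength concrete, the following Python sketch divides a write into immediate data, unsolicited data, and solicited bursts. It is a simplified model under stated assumptions: the function name `plan_write` and its parameters are hypothetical, and segmentation of each burst into PDUs by MaxRecvDataSegmentLength is not modeled.

```python
def plan_write(total, immediate, first_burst=65536, max_burst=262144,
               unsolicited_allowed=True):
    """Split a write of `total` bytes into its transfer phases.

    Returns (immediate_bytes, unsolicited_bytes, bursts), where bursts
    is a list of (buffer_offset, length) pairs, one per R2T PDU.
    """
    # Immediate data counts against FirstBurstLength.
    immediate = min(immediate, total, first_burst)
    unsolicited = 0
    if unsolicited_allowed:
        # FirstBurstLength minus immediate data leaves the unsolicited budget.
        unsolicited = min(first_burst - immediate, total - immediate)
    offset = immediate + unsolicited
    bursts = []
    while offset < total:
        # No single solicited burst may exceed MaxBurstLength.
        length = min(max_burst, total - offset)
        bursts.append((offset, length))
        offset += length
    return immediate, unsolicited, bursts


print(plan_write(600000, 8192))
# (8192, 57344, [(65536, 262144), (327680, 262144), (589824, 10176)])
```

With the default key values, a 600,000-byte write that carries 8,192 bytes of immediate data leaves 57,344 bytes for unsolicited Data-Out PDUs, and the remainder requires three R2T PDUs.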
The MaxConnections key negotiates the maximum number of TCP connections supported by a session. Initiators and targets send this key. Discovery sessions are restricted to one TCP connection, so this key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The range of valid values is 1 to 65,535. The default value is 1. The scope is session-wide.
The DefaultTime2Wait key negotiates the amount of time that must pass before attempting to log out a failed connection. Task reassignment may not occur until after the failed connection is logged out. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in seconds. The range of valid values is 0 to 3600. The default value is 2. A value of 0 indicates that logout may be attempted immediately upon detection of a failed connection. The scope is session-wide.
The DefaultTime2Retain key negotiates the amount of time that task state information must be retained for active tasks after DefaultTime2Wait expires. When a connection fails, this key determines how much time is available to complete task reassignment. If the failed connection is the last (or only) connection in a session, this key also represents the session timeout value. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in seconds. The range of valid values is 0 to 3600. The default value is 20. A value of 0 indicates that task state information is discarded immediately upon detection of a failed connection. The scope is session-wide.
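The two timers combine into a simple recovery timeline. The following Python sketch computes it from the time at which a connection failure is detected. The function name `recovery_window` is hypothetical, and times are plain seconds on whatever clock the caller uses.

```python
def recovery_window(failure_time, time2wait=2, time2retain=20):
    """Return (earliest_logout, state_discarded) for a failed connection.

    earliest_logout: when logout of the failed connection may first be
    attempted (DefaultTime2Wait after the failure).
    state_discarded: when task state may be dropped (DefaultTime2Retain
    after DefaultTime2Wait expires).
    """
    for value in (time2wait, time2retain):
        if not 0 <= value <= 3600:
            raise ValueError("timer values must be 0 to 3600 seconds")
    earliest_logout = failure_time + time2wait
    state_discarded = earliest_logout + time2retain
    return earliest_logout, state_discarded


print(recovery_window(100))  # (102, 122)
```

With the default values, task reassignment must complete between 2 and 22 seconds after the failure is detected.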
The DataPDUInOrder key negotiates in-order transmission of data PDUs within a sequence. Because TCP guarantees in-order delivery, the only way for PDUs of a given sequence to arrive out of order is to be transmitted out of order. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The default value requires in-order transmission. The scope is session-wide.
The DataSequenceInOrder key negotiates in-order transmission of data PDU sequences within a command. For sessions that support in-order transmission of sequences and retransmission of missing data PDUs (ErrorRecoveryLevel greater than zero), the MaxOutstandingR2T key must be set to 1. This is because requests for retransmission may be sent only for the lowest outstanding R2TSN, and all PDUs already received for a higher outstanding R2TSN must be discarded until retransmission succeeds. This is inefficient. It undermines the goal of multiple outstanding R2T PDUs. Sessions that do not support retransmission must terminate the appropriate task upon detection of a missing data PDU, and all data PDUs must be retransmitted via a new task. Thus, no additional inefficiency is introduced by supporting multiple outstanding R2T PDUs when the ErrorRecoveryLevel key is set to 0. Initiators and targets send the DataSequenceInOrder key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The default value requires in-order transmission. The scope is session-wide.
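The constraint linking these three keys can be written as a one-line check. The following Python sketch is only an illustration of the rule stated above; the function name `r2t_settings_consistent` is hypothetical.

```python
def r2t_settings_consistent(error_recovery_level, data_sequence_in_order,
                            max_outstanding_r2t):
    """Return True if the negotiated values satisfy the rule that
    in-order sequences combined with retransmission support
    (ErrorRecoveryLevel > 0) require MaxOutstandingR2T to be 1."""
    if data_sequence_in_order and error_recovery_level > 0:
        return max_outstanding_r2t == 1
    return True


print(r2t_settings_consistent(1, True, 4))   # False
print(r2t_settings_consistent(0, True, 4))   # True
```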
The ErrorRecoveryLevel key negotiates the combination of recovery mechanisms supported by the session. Initiators and targets send this key. This key may be sent only during the Login Phase on the leading connection. The range of valid values is 0 to 2. The default value is 0. The scope is session-wide.
The OFMarker and IFMarker keys negotiate support for PDU boundary detection via the fixed interval markers (FIM) scheme. Initiators and targets send these keys. These keys may be sent during any session type and must be sent during the Login Phase. The default setting is disabled for both keys. The scope is connection-specific.
The OFMarkInt and IFMarkInt keys negotiate the interval for the FIM scheme. These keys are valid only if the FIM scheme is used. Initiators and targets send these keys. These keys may be sent during any session type and must be sent during the Login Phase. These keys are expressed in 4-byte words. The range of valid values is 1 to 65,535. The default value is 2048 for both keys. The scope is connection-specific.
A mechanism is defined to enable implementers to extend the iSCSI protocol via additional key-value pairs. These are known as private and public extension keys. Support for private and public extension keys is optional. Private extension keys are proprietary. All private extension keys begin with "X-" to convey their proprietary status. Public extension keys must be registered with the IANA and must also be described in an informational RFC published by the IETF. All public extension keys begin with "X#" to convey their registered status. Private extension keys may be used only in Normal sessions but are not limited by phase. Public extension keys may be used in either type of session and are not limited by phase. Initiators and targets may send private and public extension keys. The scope of each extension key is determined by the rules of that key. The format of private extension keys is flexible but generally takes the form:
X-ReversedVendorDomainName.KeyName
The format of public extension keys is mandated as:
X#IANA-Registered-String
For more information about iSCSI text key-value pairs, readers are encouraged to consult IETF RFC 3720.
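The prefix rules for extension keys lend themselves to a simple check. The sketch below classifies a key name by its prefix; the function name classify_key, the label "standard" for keys without an extension prefix, and the sample key names are assumptions made for illustration.

```python
def classify_key(name):
    """Classify a text key by the prefix conventions described above."""
    if name.startswith("X-"):
        return "private"   # proprietary extension key
    if name.startswith("X#"):
        return "public"    # IANA-registered extension key
    return "standard"      # illustrative label for any other key


# Hypothetical key names, used only to exercise the function.
examples = ["X-com.example.KeyName", "X#ExampleRegisteredKey", "ErrorRecoveryLevel"]
labels = [classify_key(k) for k in examples]
```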

6. iSCSI Delivery Mechanisms

The checksum used by TCP does not detect all errors. Therefore, iSCSI must use its own CRC-based digests (as does FC) to ensure the utmost data integrity. This has two implications:
  • When a PDU is dropped due to digest error, the iSCSI protocol must be able to detect the beginning of the PDU that follows the dropped PDU. Because iSCSI PDUs are variable in length, iSCSI recipients depend on the BHS to determine the total length of a PDU. The BHS of the dropped PDU cannot always be trusted (for example, if dropped due to CRC failure), so an alternate method of determining the total length of the dropped PDU is required. Additionally, when a TCP packet containing an iSCSI header is dropped and retransmitted, the received TCP packets of the affected iSCSI PDU and the iSCSI PDUs that follow cannot be optimally buffered. An alternate method of determining the total length of the affected PDU resolves this issue.
  • To avoid SCSI task abortion and re-issuance in the presence of digest errors, the iSCSI protocol must support PDU retransmission. An iSCSI device may retransmit dropped PDUs (optimal) or abort each task affected by a digest error (suboptimal).
Additionally, problems can occur in a routed IP network that cause a TCP connection or an iSCSI session to fail. Currently, this does not occur frequently in iSCSI environments because most iSCSI deployments are single-subnet environments. However, iSCSI is designed in such a way that it supports operation in routed IP networks. Specifically, iSCSI supports connection and session recovery to prevent IP network problems from affecting the SAL. This enables iSCSI users to realize the full potential of TCP/IP. RFC 3720 defines several delivery mechanisms to meet all these requirements.
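To make the idea of a CRC-based digest concrete, the following sketch computes a CRC32C (Castagnoli) checksum, which is the CRC that RFC 3720 specifies for header and data digests. The text above does not name the polynomial, so treat that detail as supplied here. This bitwise version favors readability over speed, and the function name crc32c and the sample payload are illustrative.

```python
def crc32c(data):
    """Bitwise CRC32C over a bytes-like object (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# A single flipped bit changes the digest, so the receiver detects the error.
payload = b"example data segment"
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
digest_matches = crc32c(payload) == crc32c(corrupted)  # False
```

The standard check value for this CRC is 0xE3069283 for the ASCII input "123456789", which is a convenient way to validate an implementation.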
Error Recovery Classes
RFC 3720 permits each iSCSI implementation to select its own recovery capabilities. Recovery capabilities are grouped into classes to simplify implementation and promote interoperability. Four classes of recoverability are defined:
  • Recovery within a command (lowest class)
  • Recovery within a connection
  • Recovery of a connection
  • Recovery of a session (highest class)
RFC 3720 mandates the minimum recovery class that may be used for each type of error. RFC 3720 does not provide a comprehensive list of errors, but does provide representative examples. An iSCSI implementation may use a higher recovery class than the minimum required for a given error. Both initiator and target are allowed to escalate the recovery class. The number of tasks that are potentially affected increases with each higher class. So, use of the lowest possible class is encouraged. The two lowest classes may be used only in the Full Feature Phase of a session. Table 8-8 lists some example scenarios for each recovery class.
Table 8-8. iSCSI Error Recovery Classes
Class Name | Scope of Effect | Example Error Scenarios
Recovery Within A Command | Low | Lost Data-In PDU, Lost Data-Out PDU, Lost R2T PDU
Recovery Within A Connection | Medium-Low | Request Acknowledgement Timeout, Response Acknowledgement Timeout, Response Timeout
Recovery Of A Connection | Medium-High | Connection Failure (see Chapter 7, "OSI Transport Layer"), Explicit Notification From Target Via Asynchronous Message PDU
Recovery Of A Session | High | Failure Of All Connections Coupled With Inability To Recover One Or More Connections

Error Recovery Hierarchy
RFC 3720 defines three error recovery levels that map to the four error recovery classes. The three recovery levels are referred to as the Error Recovery Hierarchy. During the Login Phase, the recovery level is negotiated via the ErrorRecoveryLevel key. Each recovery level is a superset of the capabilities of the lower level. Thus, support for a higher level indicates a more sophisticated iSCSI implementation. Table 8-9 summarizes the mapping of levels to classes.
Table 8-9. iSCSI Error Recovery Hierarchy
ErrorRecoveryLevel | Implementation Complexity | Error Recovery Classes
0 | Low | Recovery Of A Session
1 | Medium | Recovery Within A Command, Recovery Within A Connection
2 | High | Recovery Of A Connection

At first glance, the mapping of levels to classes may seem counter-intuitive. The mapping is easier to understand after examining the implementation complexity of each recovery class. The goal of iSCSI recovery is to avoid affecting the SAL. However, an iSCSI implementation may choose not to recover from errors. In this case, recovery is left to the SCSI application client. Such is the case with ErrorRecoveryLevel 0, which simply terminates the failed session and creates a new session. The SCSI application client is responsible for reissuing all affected tasks. Therefore, ErrorRecoveryLevel 0 is the simplest to implement. Recovery within a command and recovery within a connection both require iSCSI to retransmit one or more PDUs. Therefore, ErrorRecoveryLevel 1 is more complex to implement. Recovery of a connection requires iSCSI to maintain state for one or more tasks so that task reassignment may occur. Recovery of a connection also requires iSCSI to retransmit one or more PDUs on the new connection. Therefore, ErrorRecoveryLevel 2 is the most complex to implement. Only ErrorRecoveryLevel 0 must be supported. Support for ErrorRecoveryLevel 1 and higher is encouraged but not required.
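Because each recovery level is a superset of the level below it, the set of classes available at a given level is cumulative. The following sketch expresses the mapping of Table 8-9 that way; the names CLASSES_ADDED_BY_LEVEL and supported_classes are hypothetical.

```python
# Classes introduced at each ErrorRecoveryLevel, per Table 8-9.
CLASSES_ADDED_BY_LEVEL = {
    0: ["Recovery Of A Session"],
    1: ["Recovery Within A Command", "Recovery Within A Connection"],
    2: ["Recovery Of A Connection"],
}


def supported_classes(level):
    """Return every recovery class available at the given ErrorRecoveryLevel."""
    if level not in CLASSES_ADDED_BY_LEVEL:
        raise ValueError(f"ErrorRecoveryLevel must be 0, 1, or 2, not {level}")
    classes = []
    # Each level is a superset of the lower levels, so accumulate upward.
    for lower in range(level + 1):
        classes.extend(CLASSES_ADDED_BY_LEVEL[lower])
    return classes
```

At level 0 only session recovery is available, while level 2 yields all four classes.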
PDU Boundary Detection
To determine the total length of a PDU without relying solely on the iSCSI BHS, RFC 3720 permits the use of message synchronization schemes. Even though RFC 3720 encourages the use of such schemes, no such scheme is mandated. That said, a practical requirement for such schemes arises from the simultaneous implementation of header digests and ErrorRecoveryLevel 1 or higher. As a reference for implementers, RFC 3720 provides the details of a scheme called fixed interval markers (FIM). The FIM scheme works by inserting an 8-byte marker into the TCP stream at fixed intervals. Both the initiator and target may insert the markers. Each marker contains two copies of a 4-byte pointer that indicates the starting byte number of the next iSCSI PDU. Support for the FIM scheme is negotiated during the Login Phase.
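The marker layout described above (two copies of a 4-byte pointer in 8 bytes) can be sketched as follows. This is an illustration only: the function names are hypothetical, big-endian (network) byte order is assumed, and the exact meaning of the pointer value should be taken from RFC 3720 rather than from this sketch.

```python
import struct

MARKER_LEN = 8  # two copies of a 4-byte pointer


def build_marker(next_pdu_pointer):
    """Pack the same 32-bit pointer twice (big-endian byte order is assumed)."""
    return struct.pack(">II", next_pdu_pointer, next_pdu_pointer)


def parse_marker(marker):
    """Return the pointer if both copies agree, otherwise None."""
    if len(marker) != MARKER_LEN:
        raise ValueError("a marker is exactly 8 bytes long")
    first, second = struct.unpack(">II", marker)
    return first if first == second else None
```

Carrying the pointer twice gives the receiver a simple consistency check: if the two copies differ, the marker cannot be trusted.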
PDU Retransmission
iSCSI guarantees in-order data delivery to the SAL. When PDUs arrive out of order due to retransmission, the iSCSI protocol does not reorder PDUs per se. Upon receipt of all TCP packets composing an iSCSI PDU, iSCSI places the ULP data in an application buffer. The position of the data within the application buffer is determined by the Buffer Offset field in the BHS of the Data-In/Data-Out PDU. When an iSCSI digest error or a dropped or delayed TCP packet causes a processing delay for a given iSCSI PDU, the Buffer Offset field in the BHS of other iSCSI data PDUs that are received error-free enables continued processing without delay regardless of PDU transmission order. Thus, iSCSI PDUs do not need to be reordered before processing. Of course, the use of a message synchronization scheme is required under certain circumstances for PDU processing to continue in the presence of one or more dropped or delayed PDUs. Otherwise, the BHS of subsequent PDUs cannot be read. Assuming this requirement is met, PDUs can be processed in any order.
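The role of the Buffer Offset field can be illustrated with a short sketch. Each data segment is copied to its own position in the application buffer, so the arrival order does not matter. The function name place_data and the sample segments are assumptions for illustration.

```python
def place_data(buffer, buffer_offset, data):
    """Copy one PDU's data segment into the application buffer at its offset."""
    end = buffer_offset + len(data)
    if end > len(buffer):
        raise ValueError("data segment extends past the end of the buffer")
    buffer[buffer_offset:end] = data


# Three 4-byte segments arrive out of order (for example, after a retransmission).
app_buffer = bytearray(12)
place_data(app_buffer, 8, b"IJKL")
place_data(app_buffer, 0, b"ABCD")
place_data(app_buffer, 4, b"EFGH")
```

After all three calls the buffer holds the data in its original order even though the last segment arrived first.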
Retransmission occurs as the result of a digest error, protocol error, or timeout. Despite differences in detection techniques, PDU retransmission is handled in a similar manner for data digest errors, protocol errors, and timeouts. However, header digest errors require special handling. When a header digest error occurs, and the connection does not support a PDU boundary detection scheme, the connection must be terminated. If the session supports ErrorRecoveryLevel 2, the connection is recovered, tasks are reassigned, and PDU retransmission occurs on the new connection. If the session does not support ErrorRecoveryLevel 2, the connection is not recovered. In this case, the SCSI application client must re-issue the terminated tasks on another connection within the same session. If no other connections exist within the same session, the session is terminated, and the SCSI application client must re-issue the terminated tasks in a new session. When a header digest error occurs, and the connection supports a PDU boundary detection scheme, the PDU is discarded. If the session supports ErrorRecoveryLevel 1 or higher, retransmission of the dropped PDU is handled as described in the following paragraphs. Note that detection of a dropped PDU because of header digest error requires successful receipt of a subsequent PDU associated with the same task. If the session supports only ErrorRecoveryLevel 0, the session is terminated, and the SCSI application client must re-issue the terminated tasks in a new session. The remainder of this section focuses primarily on PDU retransmission in the presence of data digest errors.
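The handling of header digest errors described above amounts to a small decision tree. The sketch below encodes it as a function that returns a label for the resulting action; the function name, parameter names, and action labels are all hypothetical and serve only to summarize the paragraph.

```python
RECOVER_CONNECTION = "recover connection, reassign tasks, retransmit on new connection"
REISSUE_SAME_SESSION = "terminate connection; client re-issues tasks on another connection"
NEW_SESSION = "terminate session; client re-issues tasks in a new session"
RETRANSMIT_PDU = "discard PDU; retransmit the dropped PDU"


def header_digest_error_action(boundary_detection, error_recovery_level, other_connections):
    """Summarize the outcome of a header digest error per the rules above."""
    if error_recovery_level not in (0, 1, 2):
        raise ValueError("ErrorRecoveryLevel must be 0, 1, or 2")
    if not boundary_detection:
        # Without boundary detection the connection must be terminated.
        if error_recovery_level == 2:
            return RECOVER_CONNECTION
        if other_connections > 0:
            return REISSUE_SAME_SESSION
        return NEW_SESSION
    # With boundary detection only the affected PDU is discarded.
    if error_recovery_level >= 1:
        return RETRANSMIT_PDU
    return NEW_SESSION
```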
Targets explicitly notify initiators when a PDU is dropped because of data digest failure. The Reject PDU facilitates such notification. Receipt of a Reject PDU for a SCSI Command PDU containing immediate data triggers retransmission if ErrorRecoveryLevel is 1 or higher. When an initiator retransmits a SCSI Command PDU, certain fields (such as the ITT, CmdSN, and operational attributes) in the BHS must be identical to the original PDU's BHS. This is known as a retry. A retry must be sent on the same connection as the original PDU unless the connection is no longer active. Receipt of a Reject PDU for a SCSI Command PDU that does not contain immediate data usually indicates a non-digest error that prevents retrying the command. Receipt of a Reject PDU for a Data-Out PDU does not trigger retransmission. Initiators retransmit Data-Out PDUs only in response to Recovery R2T PDUs. Thus, targets are responsible for requesting retransmission of missing Data-Out PDUs if ErrorRecoveryLevel is 1 or higher. Efficient recovery of dropped data during write operations is accomplished via the Buffer Offset and Desired Data Transfer Length fields in the Recovery R2T PDU. In the absence of a Recovery R2T PDU (in other words, when no Data-Out PDUs are dropped), all Data-Out PDUs are implicitly acknowledged by a SCSI status of GOOD in the SCSI Response PDU. When a connection fails and tasks are reassigned, the initiator retransmits a SCSI Command PDU or Data-Out PDUs as appropriate for each task in response to Recovery R2T PDUs sent by the target on the new connection. When a session fails, an iSCSI initiator does not retransmit any PDUs. At any point in time, an initiator may send a No Operation Out (NOP-Out) PDU to probe the sequence numbers of a target and to convey the initiator's sequence numbers to the same target. Initiators also use the NOP-Out PDU to respond to No Operation In (NOP-In) PDUs received from a target. A NOP-Out PDU may also be used for diagnostic purposes or to adjust timeout values. A NOP-Out PDU does not directly trigger retransmission.
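The Buffer Offset and Desired Data Transfer Length fields allow a target to request only the data that is missing. The following sketch shows one way a target might compute those values from the Data-Out segments it has received for a burst. The function name recovery_r2t_ranges and the representation of segments as (offset, length) pairs are assumptions for illustration.

```python
def recovery_r2t_ranges(burst_offset, burst_length, received):
    """Return (Buffer Offset, Desired Data Transfer Length) pairs for missing data.

    `received` is an iterable of (buffer_offset, length) pairs describing the
    Data-Out segments that arrived intact for this burst.
    """
    gaps = []
    cursor = burst_offset
    end = burst_offset + burst_length
    for offset, length in sorted(received):
        if offset > cursor:
            # Nothing arrived between cursor and this segment.
            gaps.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < end:
        gaps.append((cursor, end - cursor))
    return gaps
```

For a 4096-byte burst in which only the segments at offsets 0 and 2048 (1024 bytes each) arrived, the function returns [(1024, 1024), (3072, 1024)], one pair per Recovery R2T PDU.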
Initiators do not explicitly notify targets when a Data-In PDU or SCSI Response PDU is dropped due to data digest failure. Because R2T PDUs do not contain data, detection of a missing R2T PDU via an out-of-order R2TSN means a header digest error occurred on the original R2T PDU. When a Data-In PDU or SCSI Response PDU containing data is dropped, the initiator requests retransmission via a Data/R2T SNACK Request PDU if ErrorRecoveryLevel is 1 or higher. Efficient recovery of dropped data during read operations is accomplished via the BegRun and RunLength fields in the SNACK Request PDU. The target infers that all Data-In PDUs associated with a given command were received based on the ExpStatSN field in the BHS of a subsequent SCSI Command PDU or Data-Out PDU. Until such acknowledgment is inferred, the target must be able to retransmit all data associated with a command. This requirement can consume a lot of the target's resources during long read operations. To free resources during long read operations, targets may periodically request explicit acknowledgment of Data-In PDU receipt via a DataACK SNACK Request PDU. When a connection fails and tasks are reassigned, the target retransmits Data-In PDUs or a SCSI Response PDU as appropriate for each task. Initiators are not required to explicitly request retransmission following connection recovery. All unacknowledged Data-In PDUs and SCSI Response PDUs must be automatically retransmitted after connection recovery. The target uses the ExpDataSN of the most recent DataACK SNACK Request PDU to determine which Data-In PDUs must be retransmitted for each task. Optionally, the target may use the ExpDataSN field in the TMF Request PDU received from the initiator after task reassignment to determine which Data-In PDUs must be retransmitted for each task. If the target cannot reliably maintain state for a reassigned task, all Data-In PDUs associated with that task must be retransmitted. If the SCSI Response PDU for a given task was transmitted before task reassignment, the PDU must be retransmitted after task reassignment. Otherwise, the SCSI Response PDU is transmitted at the conclusion of the command or task as usual. When a session fails, an iSCSI target does not retransmit any PDUs. At any point in time, a target may send a NOP-In PDU to probe the sequence numbers of an initiator and to convey the target's sequence numbers to the same initiator. Targets also use the NOP-In PDU to respond to NOP-Out PDUs received from an initiator. A NOP-In PDU may also be used for diagnostic purposes or to adjust timeout values. A NOP-In PDU does not directly trigger retransmission.
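The BegRun and RunLength fields describe a contiguous run of missing PDUs, so an initiator can request several lost Data-In PDUs with one SNACK Request. The sketch below derives the runs from the DataSN values that were received; the function name snack_runs and its parameters are hypothetical.

```python
def snack_runs(received_datasns, expected_count):
    """Return (BegRun, RunLength) for each run of missing DataSN values.

    DataSN values 0 .. expected_count - 1 are assumed to have been sent.
    """
    received = set(received_datasns)
    runs = []
    beg = None
    for sn in range(expected_count):
        if sn not in received:
            if beg is None:
                beg = sn  # start of a new run of missing PDUs
        elif beg is not None:
            runs.append((beg, sn - beg))
            beg = None
    if beg is not None:
        runs.append((beg, expected_count - beg))
    return runs
```

If DataSN values 0, 1, 4, 5, and 7 arrive out of eight expected PDUs, the function returns [(2, 2), (6, 1)].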
iSCSI In-Order Command Delivery
According to the SAM, status received for a command finalizes the command under all circumstances. So, initiators requiring in-order delivery of commands can simply restrict the number of outstanding commands to one and wait for status for each outstanding command before issuing the next command. Alternatively, the SCSI Transport Protocol can guarantee in-order command delivery. This enables the initiator to maintain multiple simultaneous outstanding commands.
iSCSI guarantees in-order delivery of non-immediate commands to the SAL within a target. Each non-immediate command is assigned a unique CmdSN. The CmdSN counter must be incremented sequentially for each new non-immediate command without skipping numbers. In a single-connection session, the in-order guarantee is inherent due to the properties of TCP. In a multi-connection session, commands are issued sequentially across all connections. In this scenario, TCP cannot guarantee in-order delivery of non-immediate commands because TCP operates independently on each connection. Additionally, the configuration of a routed IP network can result in one connection using a "shorter" route to the destination node than other connections. Thus, iSCSI must augment TCP by ensuring that non-immediate commands are processed in order (according to the CmdSN) across multiple connections. So, RFC 3720 requires each target to process non-immediate commands in the same order as transmitted by the initiator. Note that a CmdSN is assigned to each TMF Request PDU. The rules of in-order delivery also apply to non-immediate TMF requests.
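One way to picture the target's role in a multi-connection session is a small holding area keyed by CmdSN. The sketch below delivers commands only when the next expected CmdSN has arrived; the class name CmdSNReorderer and its methods are hypothetical, and the 32-bit wrap-around reflects the width of the CmdSN field.

```python
class CmdSNReorderer:
    """Hold non-immediate commands until they can be delivered in CmdSN order."""

    def __init__(self, exp_cmdsn=0):
        self.exp_cmdsn = exp_cmdsn  # next CmdSN the target expects
        self.pending = {}           # CmdSN -> command received early

    def receive(self, cmdsn, command):
        """Accept a command from any connection; return those now deliverable."""
        self.pending[cmdsn] = command
        deliverable = []
        while self.exp_cmdsn in self.pending:
            deliverable.append(self.pending.pop(self.exp_cmdsn))
            self.exp_cmdsn = (self.exp_cmdsn + 1) % 2**32  # 32-bit counter
        return deliverable


target = CmdSNReorderer(exp_cmdsn=10)
early = target.receive(11, "WRITE on connection 2")   # held, CmdSN 10 is missing
in_order = target.receive(10, "READ on connection 1")  # releases both commands
```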
Immediate commands are handled differently than non-immediate commands. An immediate command is not assigned a unique CmdSN and is not subject to in-order delivery guarantees. The initiator increments its CmdSN counter after transmitting a new non-immediate command. Thus, the value of the initiator's CmdSN counter (the current CmdSN) represents the CmdSN of the next non-immediate command to be issued. The current CmdSN is also assigned to each immediate command issued, but the CmdSN counter is not incremented following issuance of immediate commands. Moreover, the target may deliver immediate commands to the SAL immediately upon receipt regardless of the CmdSN in the BHS. The next non-immediate command is assigned the same CmdSN. For that PDU, the CmdSN in the BHS is used by the target to enforce in-order delivery. Thus, immediate commands are not acknowledged via the ExpCmdSN field in the BHS. Immediate TMF requests are processed like non-immediate TMF requests. Therefore, marking a TMF request for immediate delivery does not expedite processing.
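The CmdSN assignment rules for immediate and non-immediate commands can be sketched from the initiator's side. In the following example, the class name InitiatorCmdSN and the method assign are hypothetical; the behavior follows the description above.

```python
class InitiatorCmdSN:
    """Track the initiator's current CmdSN as described above."""

    def __init__(self, start=1):
        self.current = start  # CmdSN of the next non-immediate command

    def assign(self, immediate=False):
        """Return the CmdSN to place in the BHS of the next command."""
        cmdsn = self.current
        if not immediate:
            # Only non-immediate commands advance the counter.
            self.current = (self.current + 1) % 2**32
        return cmdsn


counter = InitiatorCmdSN(start=5)
first = counter.assign()                   # non-immediate command
urgent = counter.assign(immediate=True)    # immediate command, counter unchanged
second = counter.assign()                  # next non-immediate, same CmdSN as urgent
```

Here the immediate command and the non-immediate command that follows it carry the same CmdSN.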
Note
The order of command delivery does not necessarily translate to the order of command execution. The order of command execution can be changed via TMF request as specified in the SCSI standards.

iSCSI Connection and Session Recovery
When ErrorRecoveryLevel equals 2, iSCSI supports stateful recovery at the connection level. Targets may choose whether to maintain state during connection recovery. When state is maintained, active commands are reassigned, and data transfer resumes on the new connection from the point at which data receipt is acknowledged. When state is not maintained, active commands are reassigned, and all associated data must be transferred on the new connection. Connections may be reinstated or recovered. Reinstatement means that the same CID is reused. Recovery means that a new CID is assigned. RFC 3720 does not clearly define these terms, but the definitions provided herein appear to be accurate. When MaxConnections equals one, and ErrorRecoveryLevel equals two, the session must temporarily override the MaxConnections parameter during connection recovery. Two connections must be simultaneously supported during recovery. Additionally, the failed connection must be cleaned up before recovery to avoid receipt of stale PDUs following recovery.
In a multi-connection session, each command is allegiant to a single connection. In other words, all PDUs associated with a given command must traverse a single connection. This is known as connection allegiance. Connection allegiance is command-oriented, not task-oriented. This can be confusing because connection recovery involves task reassignment. When a connection fails, an active command is identified by its ITT (not CmdSN) for reassignment purposes. This is because SCSI defines management functions at the task level, not at the command level. However, multiple commands can be issued with a single ITT (linked commands), and each linked command can be issued on a different connection within a single session. No more than one linked command can be outstanding at any point in time for a given task, so the ITT uniquely identifies each linked command during connection recovery. The PDUs associated with each linked command must traverse the connection on which the command was issued. This means a task may be spread across multiple connections over time, but each command is allegiant to a single connection at any point in time.
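Connection allegiance can be modeled as a table that maps each active ITT to the CID on which the outstanding command was issued. The sketch below is a simplified illustration of task reassignment after a connection failure; the class name AllegianceTable, its methods, and the sample tags are hypothetical.

```python
class AllegianceTable:
    """Track which connection (CID) each active command (by ITT) is allegiant to."""

    def __init__(self):
        self.cid_by_itt = {}

    def issue(self, itt, cid):
        """Record the connection on which the command with this ITT was issued."""
        self.cid_by_itt[itt] = cid

    def complete(self, itt):
        """Forget a command once its status has been received."""
        self.cid_by_itt.pop(itt, None)

    def reassign_from(self, failed_cid, new_cid):
        """Move every active command on the failed connection to a new one."""
        moved = sorted(itt for itt, cid in self.cid_by_itt.items() if cid == failed_cid)
        for itt in moved:
            self.cid_by_itt[itt] = new_cid
        return moved


table = AllegianceTable()
table.issue(itt=0x100, cid=1)
table.issue(itt=0x101, cid=2)
table.issue(itt=0x102, cid=1)
reassigned = table.reassign_from(failed_cid=1, new_cid=3)
```

Only the commands that were allegiant to the failed connection are moved; the command on connection 2 is unaffected.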
When a session is recovered, iSCSI establishes a new session on behalf of the SCSI application client. iSCSI also terminates all active tasks within the target and generates a SCSI response for the SCSI application client. iSCSI does not maintain any state for outstanding tasks. All tasks must be reissued by the SCSI application client.
The preceding discussion of iSCSI delivery mechanisms is simplified for the sake of clarity. For more information about iSCSI delivery mechanisms, readers are encouraged to consult IETF RFCs 3720 and 3783.