With the rise of multi-core, multi-threaded data processing systems, a key performance consideration is the coordination of the processing performed by multiple concurrent threads. In conventional systems, coordination of processing between threads is commonly achieved through the use of mutex (mutually exclusive) locks in memory. Typically, a mutex lock is associated with a portion of a shared data set, and prior to entering a code section that modifies that portion of the shared data set, a thread must acquire the associated mutex lock. After the code section is complete, the thread then releases the mutex lock, making it available for acquisition by the other concurrent threads.
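Purely by way of illustration (and not as part of any claimed hardware), the following Python sketch uses threading.Lock as a stand-in for a mutex lock to show the acquire/modify/release pattern described above; the shared counter, iteration count, and thread count are arbitrary examples.

```python
import threading

shared_counter = 0
counter_lock = threading.Lock()          # mutex guarding shared_counter

def worker(iterations):
    global shared_counter
    for _ in range(iterations):
        with counter_lock:               # acquire the mutex before entering the critical section
            shared_counter += 1          # modify the protected portion of the shared data set
        # leaving the "with" block releases the mutex for the other concurrent threads

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_counter)                    # 400000: the lock serializes the conflicting updates
```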
Mutex locks thus synchronize access to the shared data set among multiple concurrent threads, but do so with significant overhead, as each of the concurrent threads consumes cycles acquiring and releasing the mutex locks corresponding to various portions of the shared data set. The use of mutex locks also leads to thread stalls, since a thread requiring a mutex lock held by another concurrent thread must suspend processing until the mutex lock is released by that other thread.
Another commonly employed alternative or additional technique for coordinating the processing of multiple threads is the execution of atomic memory access instruction pairs, such as load-and-reserve and store-conditional. To support atomic memory access, each processing element is equipped with one reservation register per thread for holding the memory address of a reserved memory location. Upon execution of a load-and-reserve instruction in a thread, the load target address is recorded in the reservation register of that thread and is said to be "reserved" by the thread. While an address is reserved, storage-modifying operations by other threads are monitored to detect any storage-modifying access to the reserved address, which would cancel the reservation. If no intervening storage-modifying operation is detected following establishment of the reservation and prior to execution by the reserving thread of a store-conditional instruction targeting the reserved address, the store-conditional instruction succeeds in updating the reserved address. If, however, a storage-modifying operation targeting the reserved address is detected between establishment of the reservation and execution of the store-conditional instruction, the reservation is canceled, and the store-conditional instruction fails. In this case, the reserving thread is typically programmed to loop back and again obtain a reservation for the same address.
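The reservation semantics described above (one reservation register per thread, cancellation of a reservation by an intervening storage-modifying access, and the retry loop on failure) can be modeled in software as in the following Python sketch; the class and method names are illustrative only and do not correspond to any particular instruction set architecture.

```python
class ToyMemory:
    """Toy model of load-and-reserve / store-conditional semantics."""
    def __init__(self):
        self.cells = {}
        self.reservations = {}                 # thread id -> reserved address

    def load_and_reserve(self, tid, addr):
        self.reservations[tid] = addr          # record the address in this thread's reservation register
        return self.cells.get(addr, 0)

    def store(self, tid, addr, value):
        # A storage-modifying access cancels any other thread's reservation on addr.
        for other, reserved in list(self.reservations.items()):
            if other != tid and reserved == addr:
                del self.reservations[other]
        self.cells[addr] = value

    def store_conditional(self, tid, addr, value):
        if self.reservations.get(tid) != addr:
            return False                       # reservation was lost: the store fails
        self.store(tid, addr, value)
        del self.reservations[tid]             # a successful store-conditional clears the reservation
        return True

def atomic_increment(mem, tid, addr):
    # Typical retry loop: on failure, loop back and obtain a new reservation.
    while True:
        old = mem.load_and_reserve(tid, addr)
        if mem.store_conditional(tid, addr, old + 1):
            return old + 1

mem = ToyMemory()
print(atomic_increment(mem, tid=0, addr=0x100))   # -> 1
```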
As with lock-based thread synchronization, thread synchronization through the use of atomic memory access instruction pairs is also subject to high overhead costs. In particular, a separate reservation must be acquired (possibly many times) for each memory address to which access is synchronized, leading to substantial code overhead in cases in which access to numerous memory addresses is to be synchronized.
Because of the overhead concerns associated with conventional mutex locks and atomic memory access instruction pairs, recently developed multi-threaded data processing systems employ transactional processing in which a thread executes critical code sections atomically, only committing data results of a critical code section to the architected state if no other thread performs a conflicting data access during execution of the critical code section. Thus, in cases in which a conflicting data access is detected during transactional processing of a critical code section, the data processing system must discard any speculative data results obtained during execution of the critical code section and revert to the architected state of the thread.
In one embodiment, a data processing system that efficiently handles transactional processing includes a processor having an instruction sequencing unit, an execution unit, and a multi-level register file including a first level register file having a lower access latency and a second level register file having a higher access latency. Responsive to the processor processing a second instruction in a transactional code section to obtain as an execution result a second register value of a logical register, a mapper of the processor moves a first register value of the logical register, obtained as the execution result of an earlier first instruction, to the second level register file, places the second register value in the first level register file, marks the second register value as speculative, and replaces a first mapping for the logical register with a second mapping. Responsive to unsuccessful termination of the transactional code section, the mapper designates the second register value in the first level register file as invalid so that the first register value in the second level register file becomes the working value.
With reference now to FIG. 1, there is illustrated a high level block diagram of an exemplary data processing system 100 according to one embodiment. Data processing system 100 includes one or more (and in some embodiments, a multiplicity of) processor complexes 102 coupled for communication by an interconnect fabric 101, which may include, for example, one or more buses and/or switches.
Processor complexes 102, which may be implemented, for example, as chip multiprocessors (CMPs) or multi-chip modules (MCMs), each include at least one processor core 104 for processing data under the direction of program instructions. Each processor core 104 is capable of simultaneously executing multiple independent hardware threads of execution.
Each processor core 104 is supported by a cache hierarchy including one or more upper level caches 106 and a lowest level cache 108. As will be appreciated by those skilled in the art, the cache hierarchy provides processor cores 104 with low latency access to instructions and data retrieved from system memory. While it is typical for at least the highest level cache (i.e., that with the lowest access latency) to be on-chip with the associated core 104, the lower levels of cache memory (including lowest level cache 108) may be implemented as on-chip or off-chip, in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache. As indicated, the lowest-level cache 108 can be (but is not required to be) shared by multiple processor cores 104, and further can optionally be configured as a victim cache.
Processor complex 102 additionally includes one or more memory controllers (MCs) 110 each controlling read and write access to system (or main) memory, which is the lowest level of storage addressable by the real address space of processor complex(es) 102. In an exemplary embodiment, each memory controller 110 is coupled by a memory bus 112 to at least one respective memory channel 120, each of which includes one or more ranks of system memory.
As will be appreciated, data processing system 100 may further include additional unillustrated components, such as network interface cards, I/O adapters, non-volatile data storage, additional bus bridges, etc., as is known in the art.
Referring now to FIG. 2, there is depicted a more detailed view of a processor core of data processing system 100 of FIG. 1. Each processor core 104 includes an instruction sequencing unit (ISU) 200 for fetching and ordering instructions for execution in multiple concurrent hardware threads, as well as one or more execution units 202 for executing the instructions in the multiple concurrent hardware threads. In at least some embodiments, execution units 202 may execute at least some instructions within a given hardware thread out-of-order and may further employ known techniques of speculation and branch prediction to enhance cycles-per-instruction (CPI) performance.
Coupled to execution units 202 is one multi-level register file (RF) 204 per concurrent hardware thread supported by processor core 104. Each multi-level register file 204, which includes at least a first level register file 206 and a second level register file 208, contains a plurality of physical registers providing the lowest latency storage for source and destination operands of the instructions executed by execution units 202. First level register file 206 preferably contains fewer physical registers than second level register file 208 and/or is physically closer to execution units 202, and accordingly has a lower access latency than second level register file 208. The contents of first level register file 206 include both architected register values, which represent the current non-speculative state of a thread, as well as non-architected register values, which represent working values not yet committed to the architected state of the thread.
In at least one embodiment, first level register file 206 contains fewer physical registers than would be necessary to provide storage for the operands of all instructions concurrently undergoing execution by processor core 104. Accordingly, processor core 104 includes a mapper 210 including logic and data structures to map logical registers referenced by instructions executed by execution units 202 to particular physical registers in first level register file 206 and second level register file 208. Specifically, the depicted embodiment of mapper 210 includes an architected mapper cache (AMC) 212 for each thread that tracks the assignment of physical registers in first level register file 206 to logical registers referenced by instructions. Mapper 210 also includes a free list 214 for each thread that records the pool of physical registers in first level register file 206 that are free (currently unassigned to a logical register) and therefore available for assignment to a logical destination register referenced by a subsequent instruction in the thread. Finally, mapper 210 includes a register pointer table 216 for each thread that tracks the locations of architected register values evicted from first level register file 206 and placed into second level register file 208.
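Purely as a software illustration, one thread's mapping state can be modeled as in the following Python sketch; the dictionary layout, field names, and default sizes are assumptions adopted for the later sketches in this description (the AMC is modeled here as a flat dictionary, deferring its set-associative organization to the discussion below), not the actual hardware organization.

```python
def make_thread_map_state(num_l1_physical_regs=4):
    """Toy per-thread mapping state (names and layout are illustrative only)."""
    return {
        "amc":  {},   # AMC 212: logical register -> {"phys": n, "tx": bool} (presence = valid)
        "free": list(range(num_l1_physical_regs)),  # free list 214: unassigned physical registers
        "l1":   {},   # first level register file 206: physical register -> value
        "l2":   {},   # second level register file 208: slot -> (value, speculative?)
        "ptr":  {},   # register pointer table 216: logical register -> slot in the second level file
    }
```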
In contrast to first level register file 206, the physical registers in second level register file 208 are preferably unmapped, meaning that each physical register in second level register file 208 corresponds to a particular logical register in the register space of the instruction set architecture of processor core 104. As discussed further below, it is preferable if each logical register has two corresponding physical registers in second level register file 208, one to hold a speculative logical register value and another to hold a non-speculative or working logical register value. Thus, at any given time, second level register file 208 may contain architected, working, and speculative register values.
With reference now to FIG. 3, there is illustrated a more detailed view of an exemplary embodiment of the architected mapper cache (AMC) 212 of FIG. 2. In the exemplary embodiment, AMC 212 includes a plurality of entries 310 for buffering temporary mappings between logical registers referenced by instructions and physical registers within first level register file 206. Each such entry 310 includes a valid field 300 (e.g., a single bit) indicating whether that entry 310 is valid or invalid, a register mapping field 302 indicating a logical-to-physical register mapping, and a transactional field 304 indicating (e.g., by a single bit) whether the register value is a speculative value within a transactional code section or a non-speculative value.
In one exemplary embodiment, AMC 212 is organized as an N-way set-associative cache, with each of M sets containing multiple entries 310. In this embodiment, each of the M sets in AMC 212 corresponds (e.g., via index bits of a logical register number) to greater than N logical registers.
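As a hypothetical illustration of this organization, with M = 8 sets and N = 2 ways the set selection and lookup could proceed as sketched below; the parameter values and the use of a simple modulo on the logical register number are illustrative assumptions.

```python
NUM_SETS = 8        # M (illustrative)
WAYS_PER_SET = 2    # N (illustrative)

def make_amc():
    # M sets of N entries each; an invalid entry is represented by valid=False.
    return [[{"valid": False, "logical": None, "phys": None, "tx": False}
             for _ in range(WAYS_PER_SET)] for _ in range(NUM_SETS)]

def amc_set_index(logical_reg_num):
    # Low-order index bits of the logical register number select the set; with, say,
    # 32 logical registers, four logical registers share each two-entry set, i.e.
    # each set corresponds to more than N logical registers.
    return logical_reg_num % NUM_SETS

def amc_lookup(amc, logical_reg_num):
    """Return the matching valid entry 310 in the selected set, or None on a miss."""
    for entry in amc[amc_set_index(logical_reg_num)]:
        if entry["valid"] and entry["logical"] == logical_reg_num:
            return entry
    return None
```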
FIG. 4 is a high level logical flowchart of an exemplary process by which a mapper 210 allocates a register from a multi-level register file to an instruction in one embodiment. As a logical flowchart, FIG. 4 presents steps in a logical rather than strictly chronological sequence, meaning that in at least some embodiments certain of the illustrated steps can be performed in a different order than illustrated and/or in parallel.
Referencing FIG. 2 again, mapper 210 is coupled to receive each instruction issued to execution units 202. Thus, the exemplary process depicted in FIG. 4 begins at block 400 in response to mapper 210 receiving an instruction dispatched from instruction sequencing unit 200. Following block 400, the process proceeds to block 401, which illustrates mapper 210 determining whether or not the instruction demarks the end of a transactional code section of one of the concurrent hardware threads of processor core 104. As described further below with reference to FIGS. 7 and 8, the beginning and end of a transactional code section of a thread can be demarked, for example, by special purpose instruction opcodes, or alternatively, by fields of instructions that additionally perform arithmetic, logical and/or program control functions. In response to a determination at block 401 that the received instruction demarks the end of a transactional code section, the process depicted in FIG. 4 proceeds through page connector 403 to FIG. 6, which is described below.
Returning to block 401, if mapper 210 determines that the received instruction does not demark the end of a transactional code section of a thread, mapper 210 further determines at block 402 whether or not the instruction falls within a transactional code section of the thread. For example, mapper 210 can determine that processor core 104 is executing a transactional code section of a thread if mapper 210 has detected an instruction demarking the beginning of a transactional code section without detecting a corresponding instruction demarking the end of the transactional code section.
In response to mapper 210 determining at block 402 that the instruction does not belong to a transactional code section, the process proceeds to block 420, which is described below. If, however, mapper 210 determines at block 402 that the instruction falls within a transactional code section of a thread, mapper 210 further determines at block 404 whether or not the logical destination or target register referenced by the instruction collides with (i.e., matches) a logical destination register referenced by a previous instruction of the thread preceding the transactional code section. If not, the process proceeds to block 420, which is described below. If, however, mapper 210 determines that the logical destination register referenced by the instruction collides with a logical destination register referenced by a previous instruction of the thread preceding the transactional code section, the process proceeds to block 410.
Block 410 depicts mapper 210 establishing a checkpoint for the register value of the logical destination register referenced by the instruction in the transactional code section by transferring the register value from first level register file 206 to second level register file 208. In this manner, the non-speculative value of the logical destination register established prior to the transactional code section will be preserved in case the transactional code section is aborted. Mapper 210 additionally updates register pointer table 216 to identify the location of the non-speculative value of the logical register within second level register file 208. Thereafter, at block 414, mapper 210 adds the physical register in first level register file 206 to which the logical destination register was formerly mapped to free list 214, thus returning that physical register to the pool of physical registers available for mapping to a logical destination register. From block 414, the process proceeds to block 420.
At block 420, mapper 210 determines whether or not an eviction from AMC 212 is required in order to map the logical destination register referenced by the current instruction to a physical register in first level register file 206. To make this determination, mapper 210 can, for example, determine if there are any free (i.e., invalid) entries in AMC 212 in the set to which the logical destination register of the current instruction maps. If mapper 210 determines at block 420 that a free entry is available and therefore no eviction is required, the process proceeds from block 420 to block 424. Otherwise, the process passes from block 420 to block 422, which depicts mapper 210 evicting the value of a logical register from a physical register in first level register file 206 and installing the value of the logical register in the particular physical register in second level register file 208 corresponding to that logical register. The setting of the transactional field 304 associated with the evicted logical register value is similarly transferred from first level register file 206 and installed in second level register file 208. To track the register value in second level register file 208, mapper 210 then updates the relevant pointer in register pointer table 216 to point to the logical register value in second level register file 208.
Next, at blocks 424 and 430, mapper 210 allocates a physical register in first level register file 206 from free list 214 to hold the value of the logical destination register of the current instruction and removes the allocated physical register from free list 214. In addition, at block 432, mapper 210 records the mapping between the logical destination register and the physical register in an entry of AMC 212. In conjunction with recording the mapping, mapper 210 sets valid field 300 to indicate that the entry 310 is valid, and as indicated at blocks 440 and 442, also sets the transactional field 304 to indicate whether the instruction producing the register value is within a transactional code section (and therefore speculative) or not. Following block 440 or block 442, the process depicted in FIG. 4 ends at block 444.
As noted above, FIG. 4 presents steps in the destination register allocation process in a logical rather than strictly chronological manner. For example, in one preferred embodiment, mapper 210 performs the register allocation and update to free list 214 shown at blocks 424 and 430 at instruction dispatch time and subsequently performs the recording of the mapping in AMC 212 depicted at block 432 at instruction completion time. However, in an alternative embodiment, steps 424, 430 and 432 can all be performed at instruction completion time.
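For illustration only, the FIG. 4 flow can be summarized in software form using the toy per-thread state sketched earlier; set-associativity and dispatch/completion timing are ignored, the eviction victim is chosen arbitrarily, and all names are assumptions rather than the disclosed design.

```python
def allocate_destination(state, dest_reg, in_transaction, amc_capacity=2):
    """Toy model of blocks 404-442 of FIG. 4 for one thread."""
    amc, free, l1, l2, ptr = (state[k] for k in ("amc", "free", "l1", "l2", "ptr"))

    entry = amc.pop(dest_reg, None)
    if entry is not None:
        if in_transaction and not entry["tx"]:
            # Blocks 404/410: collision with a pre-transaction value -> checkpoint that value
            # in the second level register file and record its location in the pointer table.
            l2[(dest_reg, "ckpt")] = (l1[entry["phys"]], False)
            ptr[dest_reg] = (dest_reg, "ckpt")
        free.append(entry["phys"])          # block 414: recycle the physical register

    if len(amc) >= amc_capacity:
        # Blocks 420/422: no free entry -> evict a victim to the second level register
        # file, carry its transactional marking along, and update the pointer table.
        victim_reg, victim = next(iter(amc.items()))
        l2[(victim_reg, "spill")] = (l1[victim["phys"]], victim["tx"])
        ptr[victim_reg] = (victim_reg, "spill")
        free.append(victim["phys"])
        del amc[victim_reg]

    phys = free.pop(0)                      # blocks 424/430: allocate from the free list
    amc[dest_reg] = {"phys": phys, "tx": in_transaction}   # blocks 432/440/442
    return phys

# Example corresponding to add instructions 704 and 708 of FIG. 7 (described below):
state = {"amc": {}, "free": [0, 1], "l1": {}, "l2": {}, "ptr": {}}
state["l1"][allocate_destination(state, 5, in_transaction=False)] = "r5_val"
state["l1"][allocate_destination(state, 5, in_transaction=True)] = "r5'_val"
```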
With reference now to FIG. 5, there is illustrated a high level logical flowchart of an exemplary method of translating a reference by an instruction to a logical source register from which an instruction operand is to be obtained in accordance with one embodiment. The exemplary process begins at block 500 in response to receipt by mapper 210 of an instruction referencing at least one logical source register and thereafter proceeds to block 502. Block 502 depicts mapper 210 performing a lookup in AMC 212 to determine if the referenced logical source register is currently mapped to a physical register in first level register file 206.
Mapper 210 then determines at block 504 whether or not the lookup in AMC 212 resulted in a hit. If so, mapper 210 accesses the physical register in first level register file 206 indicated by the register mapping field 302 of the matching entry 310 to obtain the register value of the referenced logical source register (block 506). If, on the other hand, the lookup in AMC 212 results in a miss, mapper 210 accesses the physical register of the second level register file 208 assigned to the referenced logical source register to obtain the register value (block 510). As indicated above, second level register file 208 preferably contains multiple physical registers for each logical register. If second level register file 208 contains multiple register values for a given logical register, as will be the case if affirmative determinations are made at both of blocks 404 and 420 of FIG. 4, the logical register value to be accessed at block 510 is identified by the pointer in register pointer table 216 corresponding to the referenced logical register. Following either block 506 or block 510, the process passes to block 512, which illustrates mapper 210 providing the register value of the logical source register to execution units 202. Thereafter, the process ends at block 514.
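Continuing the same toy state layout, the lookup of FIG. 5 can be sketched as follows (again, illustrative only).

```python
def read_source(state, src_reg):
    """Toy model of blocks 502-512 of FIG. 5."""
    entry = state["amc"].get(src_reg)           # block 502: AMC lookup
    if entry is not None:                       # block 504: hit?
        return state["l1"][entry["phys"]]       # block 506: read the first level register file
    slot = state["ptr"][src_reg]                # block 510: the pointer table identifies which of
    value, _speculative = state["l2"][slot]     # the logical register's values to read from the
    return value                                # second level register file

# Example: r21 is still mapped in the AMC; r5 has been evicted to the second level file.
state = {
    "amc": {21: {"phys": 3, "tx": True}},
    "l1":  {3: "r21_val"},
    "l2":  {(5, "spill"): ("r5'_val", True)},
    "ptr": {5: (5, "spill")},
}
print(read_source(state, 21))   # -> 'r21_val' (AMC hit)
print(read_source(state, 5))    # -> "r5'_val" (AMC miss, second level register file)
```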
With a physical register allocated to the logical destination register of the instruction in accordance with FIG. 4 and the value(s) of the logical source register(s), if any, accessed in accordance with FIG. 5, the instruction can be executed by execution units 202. Following execution of the instruction, mapper 210 places execution results, if any, returned by execution units 202 in the physical register previously mapped to the logical destination register of the instruction.
Referring now to FIG. 6, there is depicted a high level logical flowchart of an exemplary process by which a mapper 210 handles the termination of transactional execution of a code section of a thread in one embodiment. The process shown in FIG. 6 begins at block 600, for example, in response to receipt by mapper 210 of an instruction demarking the termination of a transactional code section, as described above at block 401 of FIG. 4.
In response to detection of an instruction demarking the end of a transactional code section of a thread, the process proceeds to block 604, which depicts mapper 210 determining whether or not the transactional code section ended successfully ("passed"). A successful end of the transactional code section can be indicated, for example, by the opcode of the instruction demarking the end of the transactional code section or by a separate success or failure indication, which may be received from execution units 202 or ISU 200 and/or may reside in a condition register of multi-level register file 204.
If mapper 210 determines at block 604 that the transactional code section concluded successfully, mapper 210 resets any transactional fields 304 that are set in AMC 212 or second level register file 208 to indicate that the associated logical register values are no longer speculative (block 610). Thus, mapper 210 commits the heretofore speculative execution results of the transactional code section of the thread to the architected state of the thread. Thereafter, the process shown in FIG. 6 ends at block 624.
Referring now to block 620, if the transactional code section did not end successfully, mapper 210 discards the execution results of the transactional code section held in first level register file 206. In particular, as indicated at block 620, mapper 210 resets the valid field 300 of (i.e., invalidates) each entry 310 in AMC 212 in which transactional field 304 is set and returns the physical registers of first level register file 206 mapped by such entries to free list 214. In addition, mapper 210 discards any execution results of the transactional code section held in second level register file 208. For example, as indicated at block 622, mapper 210 invalidates in second level register file 208 any logical register values marked as speculative and updates each associated pointer in register pointer table 216 with a null pointer. Thereafter, the process depicted in FIG. 6 ends at block 624.
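In the same toy form, the pass/fail handling of FIG. 6 can be sketched as below; whether an aborted pointer reverts to the checkpointed value (as in FIG. 12, described later) or to a null pointer (block 622) is modeled here by falling back to the checkpoint slot when one exists, and all names remain illustrative assumptions.

```python
def end_transaction(state, passed):
    """Toy model of blocks 604-622 of FIG. 6 for one thread."""
    amc, free, l2, ptr = state["amc"], state["free"], state["l2"], state["ptr"]
    if passed:
        # Block 610: commit - clear every speculative (transactional) marking.
        for entry in amc.values():
            entry["tx"] = False
        for slot, (value, _spec) in list(l2.items()):
            l2[slot] = (value, False)
        return
    # Block 620: abort - invalidate speculative first level values and free their registers.
    for reg, entry in list(amc.items()):
        if entry["tx"]:
            free.append(entry["phys"])
            del amc[reg]
    # Block 622: abort - discard speculative second level values; a pointer to a discarded
    # value reverts to the checkpointed value if one exists, otherwise to a null pointer.
    for slot, (_value, spec) in list(l2.items()):
        if spec:
            del l2[slot]
            for reg, target in ptr.items():
                if target == slot:
                    ptr[reg] = (reg, "ckpt") if (reg, "ckpt") in l2 else None
```

In the r5 example of FIGS. 7 and 8 below, an abort removes the speculative mapping so that subsequent reads of r5 fall back to the checkpointed r5_val in the second level register file.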
With reference now to FIG. 7, there is illustrated a portion of an exemplary first thread 700 including a transactional code section 702. Prior to transactional code section 702 in the program order of first thread 700 are one or more non-transactional instructions, such as add instruction 704, which references logical destination register r5. Within transactional code section 702, the beginning of which is demarked by initiating instruction 706 (i.e., tm_start), are one or more instructions, such as add instruction 708, which conditionally update the architected state of the thread if the transactional code section completes successfully. As indicated, add instruction 708 also targets logical destination register r5. In exemplary first thread 700, transactional code section 702 can terminate in either of two ways, namely, successfully through the execution of ending instruction 712 (e.g., tm_end) or unsuccessfully through the execution of abort instruction 710 (e.g., tm_abort). As indicated previously, if transactional code section 702 terminates through the execution of ending instruction 712, the execution results of transactional code section 702 are committed to the architected state of thread 700, and if transactional code section 702 terminates through the execution of abort instruction 710, the execution results of transactional code section 702 are not committed to the architected state of thread 700 and are instead discarded.
Referring now to FIG. 8, there is depicted the contents of AMC 212, multi-level register file 204 and register pointer table 216 in the case in which processing of transactional code section 702 of FIG. 7 fails to complete successfully. In particular, in response to mapper 210 detecting a collision between the logical destination registers of add instructions 704 and 708 at block 404 of FIG. 4, mapper 210 establishes a checkpoint as described at block 410 of FIG. 4 by transferring the register value of logical destination register r5 produced by add instruction 704 (i.e., r5_val) from first level register file 206 to second level register file 208. Mapper 210 then allocates a free physical register in first level register file 206 to logical destination register r5, records the mapping between the allocated physical register and logical destination register r5 in register mapping field 302 of an entry 310 of AMC 212, and sets valid field 300 (e.g., to 1) as described at blocks 424 and 432 of FIG. 4. In addition, to indicate that the register value of logical destination register r5 of add instruction 708 (i.e., r5′_val) is speculative, mapper 210 sets transaction field 304 of the entry 310 in AMC 212. Thereafter, in response to receipt of abort instruction 710, mapper 210 resets valid field 300 of the entry 310 in AMC 212, as shown in FIG. 8 and described at block 620 of FIG. 6, thus discarding the speculative register value of logical destination register r5. Subsequent references to logical register r5 will be satisfied from second level register file 208, as described above with reference to block 510 of FIG. 5.
With reference now to FIG. 9, there is illustrated the contents of AMC 212, multi-level register file 204 and register pointer table 216 in the contrary case in which processing of transactional code section 702 of thread 700 of FIG. 7 completes successfully. In this case, in response to mapper 210 receiving ending instruction 712 rather than abort instruction 710, mapper 210 resets transaction field 304 to indicate that the register value of the logical destination register is no longer speculative, as described above with reference to block 610 of FIG. 6. In accordance with the process of FIG. 5, the new register value r5′_val will subsequently be accessed in lieu of the stale register value r5_val based upon lookup hits for logical register r5 in AMC 212.
Referring now to FIG. 10, there is depicted an exemplary second thread 1000 including a transactional code section 1002. As indicated by like reference numerals, second thread 1000 is similar to first thread 700 and includes previously described instructions 704, 706, 708, 710 and 712. However, transactional code section 1002 of second thread 1000 additionally includes add instruction 1004, which references logical destination register r21. In this example, it is assumed that logical register r21 maps to the same set in AMC 212 as logical register r5.
As indicated in FIG. 11, in response to mapper 210 detecting a collision between the logical destination registers of add instructions 704 and 708 as shown at block 404 of FIG. 4, mapper 210 establishes a checkpoint as described at block 410 of FIG. 4 by transferring the register value of logical destination register r5 produced by add instruction 704 (i.e., r5_val) from first level register file 206 to second level register file 208. Mapper 210 then allocates a free physical register in first level register file 206 to logical destination register r5, records the mapping between the allocated physical register and logical destination register r5 in register mapping field 302 of an entry 310 of AMC 212, and sets valid field 300 (e.g., to 1) as described at blocks 424 and 432 of FIG. 4. In addition, to indicate that the register value of logical destination register r5 of add instruction 708 (i.e., r5′_val) is speculative, mapper 210 sets transaction field 304 of the entry 310 in AMC 212.
In response to receipt of add instruction 1004, mapper 210 determines at block 420 of FIG. 4 that an entry 310 must be evicted from AMC 212 to accommodate the mapping for logical register r21. Accordingly, as shown in FIG. 11 and described with reference to block 422 of FIG. 4, mapper 210 evicts logical register value r5′_val from first level register file 206 and installs it in second level register file 208 in association with a set transaction field to identify logical register value r5′_val as speculative. As further shown in FIG. 11 and as also depicted in block 422 of FIG. 4, mapper 210 updates register pointer table 216 to point to logical register value r5′_val.
Referring now to FIG. 12, in the case that transactional code section 1002 terminates unsuccessfully (e.g., via abort instruction 710), mapper 210 resets valid field 300 of the entry 310 allocated to logical register r21 in AMC 212, as shown in FIG. 12 and described at block 620 of FIG. 6, thus discarding the speculative register value of logical destination register r21. Further, as shown in FIG. 12 and block 622 of FIG. 6, mapper 210 also updates register pointer table 216 to point to register value r5_val rather than r5′_val, thus discarding the speculative execution results of add instruction 708. Subsequent references to logical register r5 will thus be satisfied by accessing register value r5_val from second level register file 208, as described above with reference to block 510 of FIG. 5.
With reference now to FIG. 13, in the case that transactional code section 1002 of thread 1000 terminates successfully (e.g., via end instruction 712), mapper 210 resets transactional field 304 of the entry 310 allocated to logical register r21 in AMC 212, as illustrated in FIG. 13 and at block 610 of FIG. 6. Further, as shown in FIG. 13 and at block 610 of FIG. 6, mapper 210 similarly updates the transactional indication associated with logical register value r5′_val in second level register file 208. Thereafter, a reference to logical register r21 can be satisfied from first level register file 206, and a reference to logical register r5 will be satisfied with register value r5′_val from second level register file 208.
As has been described, an exemplary embodiment of a data processing system suitable for executing multiple concurrent hardware threads utilizing transactional code sections to synchronize memory access includes a processor having a multi-level register file. The multi-level register file includes a first level register file having a lower access latency and a second level register file having a higher access latency, where each of the first and second level register files includes a plurality of physical registers. As the processor executes a thread, a first register value of a logical register obtained as an execution result of a first instruction in the thread is placed in a first physical register of the first level register file. A first mapping between the logical register and the first physical register is also recorded in a mapping data structure, and the first register value is marked as non-speculative.
In response to processing a second instruction in a transactional code section of the thread to obtain as an execution result a second register value of the logical register, the first register value is moved to the second level register file, and the second register value is placed in a second physical register of the first level register file. Accordingly, the first mapping in the mapping data structure is replaced with a second mapping between the logical register and the second physical register. In addition, because the second instruction is in the transactional code section, the second register value is initially marked as speculative. In response to unsuccessful termination of the transactional code section, the second register value in the first level register file is designated as invalid, meaning the first register value in the second level register file becomes a working value of the logical register. On the other hand, in response to successful termination of the transactional code section, the second register value in the first level register file is designated as non-speculative.
It should thus be noted that upon entry of a transactional code section, there is no need to move data contained in the first level register file en masse into the second level register file. Instead, checkpoints in the second level register file are preferably established only for individual logical destination registers, on a register-by-register basis, as logical register collisions spanning the boundary of a transactional code section are detected. Further, upon unsuccessful termination of a transactional code section, there is no need to move data from the second level register file back into the first level register file. Instead, the invalidation of the mapping of the speculative data in the first level register file leaves the checkpointed data held in the second level register file as the working data.