CSI 3530 – Databases II Spring 2007Term Project — Symmetric Hash Join11 ObjectiveYou are to implement a new symmetric hash join query operator to replace the current hash join implementation.Adding the new operator will require modifications to both the optimizer component and theexecutor component of PostgreSQL .2 Hash JoinIn this section we introduce the regular hash join operator which is currently implemented in PostgreSQL8.1.4. Hash join can be used when the query includes join predicates of the form T1.attr1 = T2.attr2, whereT1 and T2 are two relations to be joined and attr1 and attr2 are join attributes with the same data type.One of the relations is designated as the inner relation while the other is designated as the outer relation.We will assume that T1 is the inner relation and T2 is the outer relation.The hash join has two consecutive phases: the building phase and the probing phase.Building Phase: In this phase, the inner relation (T1) is scanned. The hash value of each tuple is computedby applying a hash function f to the join attribute (attr1). The tuple is then inserted into a hashtable. No outputs are produced during this phase.The probing phase: When the inner relation has been completely processed and the hash table has beenconstructed, the outer relation T2 is scanned and its tuples are matched to those from the inner relation.This matching is done on two steps. First, the outer tuple’s join attribute, attr2, is hashed using thesame hash function f. The resulting hash value is used to probe the inner hash table. Next, if theappropriate bucket of the inner hash table is not empty, the attr1 value of each inner tuple in thebucket is compared to attr2 of the outer tuple. If the actual values match, then the correspondingtuples are returned as one tuple in the join result. This comparison of the actual attribute values of thetwo tuples is necessary because of the possibility of hash collisions (multiple attribute values havingthe same hash value).One property of this algorithm is that no join outputs are produced until all of the tuples of the innerrelation are completely processed and inserted into the hash table. This represents a bottleneck in the queryexecution pipeline. Symmetric hash join, which you will implement in this assignment, avoids this problem.One of the drawbacks of the hashing algorithm that was just described is that there must be sufficientmemory available to hold the hash table for the inner relation. This can be a problem for large relations.For this reason, PostgreSQL actually implements a more complex algorithm, known as Hybrid Hash Join,which avoids this problem. The hybrid hash join algorithm divides the tuples of the two relations intomultiple partitions, or batches. For example, each partition can be associated with a specific range of hashvalues. Only one partition of the inner relation’s tuples is kept in main memory at any time. The remainingbatches reside on disk. When an outer tuple is found to belong in a partition other than the current one(i.e. the memory-resident partition), it is written to a temporary file and its processing is postponed untilthe corresponding partition from the inner relation is retrieved. Partitions are processed in sequence untilall partitions have been processed.3 Symmetric Hash JoinHere we describe the symmetric hash join [1] that you are required to implement. The symmetric hash joinoperator maintains two hash tables, one for each relation. Each hash table uses a different hash function. Itsupports the traditional demand-pull pipeline interface. The symmetric hash join works as follows:1Do not redistribute this project which is partof an archive that Ihab Ilyas gave to me for my personal use in my course.1CSI 3530 – Databases II Spring 2007• Read a tuple from the inner relation and insert it into the inner relation’s hash table, using the innerrelation’s hash function. Then, use the new tuple to probe the outer relation’s hash table for matches.To probe, use the outer relation’s hash function.• When probing with the inner tuple finds no more matches, read a tuple from the outer relation. Insertit into the outer relation’s hash table using the outer relation’s hash function. Then, use the outertuple to probe the inner relation’s hash table for matches, using the inner table’s hash function.These two steps are repeated until there are no more tuples to be read from either of the two inputrelations. That is, the algorithm alternates between getting an inner tuple and getting an outer tuple untilone of the two input relations is exhausted, at which point the algorithm reads (and hashes and probes with)the remaining tuples from the other relation.Note that the symmetric hash join operator must be implemented to adhere to the demand-pull pipelineoperator interface. That is, on the first pull call, the algorithm should run until it can produce one joinoutput tuple. On the next pull call, the algorithm should pick up where it left off, and should run until itcan produce the second join output tuple, and so on. This incremental behaviour requires that some statebe saved between successive calls to the operator. Essentially, this state records where the algorithm left offafter the previous demand-pull call so that it can pick up from there on the next call. Clearly, this stateshould include an indication of whether the operator is currently working on an inner or an outer tuple,the current tuple itself, the current “probe position” within the inner or outer hash table, and so on. Thesymmetric hash join operator records this state into a structure called a state node.As an example of join execution, consider a join with join predicate T1.attr1 = T2.attr2. The join operatorwill incrementally load a hash table H1 for T1 by hashing attr1 using hash function f1, and another hashtable H2 for T2 by hashing attr2 using hash function f2. The symmetric hash join operator starts by gettinga tuple from T1, hashing its attr1 field using f1, and inserting it into H1. Then, it probes H2 by applying f2to attr1 of the current T1 tuple, returning any matched tuple pairs that it finds. Next, it gets a tuple fromT2, hashes it by applying f2 to attr2, and inserts it into H2. Then, it probes H1 by applying f1 to attr2 ofthe current T2 tuple, and returns any matches. This continues until all tuples from T1 and T2 have beenconsumed.4 PostgreSQL Implementation of Hash JoinIn this section, we present an introduction to two components of PostgreSQL that you will need to modifyin this assignment, namely the optimizer and the executor. Then, we describe the hash join algorithm thatis already implemented in PostgreSQL .4.1 OptimizerThe optimizer uses the output of the query parser to generate an execution plan for the executor. Duringthe optimization process, PostgreSQL builds Path trees representing the different ways of executing a query.It selects the cheapest Path that generates the desired outputs and converts it into a Plan to pass to theexecutor. A Path (or Plan) is represented as a set of nodes, arranged in a tree structure with a top-levelnode, and various sub-nodes as children. There is a one-to-one correspondence between the nodes in thePath and Plan trees. Path nodes omit information that is not needed during planning, while Plan nodesdiscard planning information that is not needed by executor.The optimizer builds a RelOptInfo structure for each base relation used in the query. RelOptInfo recordsinformation necessary for planning, such as the estimated number of tuples and their order. Base relations(baserel) are either primitive tables, or subqueries that are planned via a separate recursive invocation ofthe planner. A RelOptInfo is also built for each join relation (joinrel) that is considered during planning.A joinrel is simply a combination of baserel’s. There is only one join RelOptInfo for any given set ofbaserels. For example, the join {A ./ B ./ C} is represented by the same RelOptInfo whether it is builtby first joining A and B and then adding C, or by first joining B and C and then adding A. These differentmeans of building the joinrel are represented as different Paths. For each RelOptInfo we build a list ofPaths that represent plausible ways to implement the scan or join of that relation. Once we have considered2CSI 3530 – Databases II Spring 2007all of the plausible Paths for a relation, we select the cheapest one according to the planner’s cost estimates.The final plan is derived from the cheapest Path for the RelOptInfo that includes all the base relations ofthe query. A Path for a join relation is a tree structure, with the top Path node representing the join method.It has left and right subpaths that represent the scan or join methods used for the two input relations.The join tree is constructed using a dynamic programming algorithm. In the first pass we considerways to create joinrels representing exactly two relations. The second pass considers ways to makejoinrels that represent exactly three relations. The next pass considers joins of four relations, and soon. The last pCSI 3530代做、代写Databases、SQL程序设计ass considers how to make the final join relation that includes all of the relations involved inthe query. For more details about the construction of query Path and optimizer data structures, refer tosrc/backend/optimizer/README.4.2 ExecutorThe executor processes a tree of Plan nodes. The plan tree is essentially a demand-pull pipeline of tupleprocessing operations. Each node, when called, will produce the next tuple in its output sequence, or NULLif no more tuples are available. If the node is not a primitive relation-scanning node, it will have child node(s)that it calls recursively to obtain input tuples. The plan tree delivered by the planner contains a tree ofPlan nodes (struct types derived from struct Plan). Each Plan node may have expression trees associatedwith it to represent, e.g., qualification conditions.During executor startup, PostgreSQL builds a parallel tree of identical structure containing executorstate nodes. Every plan and expression node type has a corresponding executor state node type. Eachnode in the state tree has a pointer to its corresponding node in the plan tree, in addition to executor statedata that is needed to implement that node type. This arrangement allows the plan tree to be completelyread-only as far as the executor is concerned; all data that is modified during execution is in the statetree. Read-only plan trees simplify plan caching and plan reuse. Altogether there are four classes of nodesused in these trees: Plan nodes, their corresponding PlanState nodes, Expr nodes, and their correspondingExprState nodes.There are two types of Plan node execution: single tuple retrieval and multi-tuple retrieval. Theseare implemented using the functions ExecInitNode and MultiExecProcNode, respectively. In single tupleretrieval, ExecInitNode is invoked each time a new tuple is needed. In multi-tuple retrieval, the functionMultiExecProcNode is invoked only once to obtain all of the tuples, which are returned in a form of a hashtable or a bitmap. For more details about executor structures, refer to src/backend/executor/README.4.3 PostgreSQL Hash Join OperatorIn PostgreSQL , hash join is implemented in the file nodeHashjoin.c and creation of a hash table is implementedin the file nodeHash.c. A hash join node in the query plan has two subplans that represents theouter and the inner relations to be joined. The inner subplan must be of type HashNode.As was described in Section 2, PostgreSQL implements hybrid hash join so that it can deal with largerelations. Recall that hybrid hash join processes tuples in batches based on their hash values. To make theimplementation of the symmetric hash join algorithm simpler, you may ignore this additional complexity.Specifically, your symmetric hash join implementation may assume that both hash tables will fit memorywithout resorting to paritioning into multiple batches. However, you will need to find out how to disable theuse of multiple batches in the current implementation of hashing.4.4 Relevant FilesHere we present a list of files that are relevant to the assignment.• src/backend/executor/– nodeHashJoin.c: This file implements the actual processing of the hash join operator.– nodeHash.c: This file is responsible for creating and maintaining a hash table.• src/backend/optimizer/plan/3CSI 3530 – Databases II Spring 2007– createplan.c: This file contains the code that creates a hash join node in the query plan.• src/include/nodes/– execnodes.h: This file contains the structure HashJoinState that maintains the state of thehash join during execution.4.5 Main FunctionsThe implementation of the hash join operator consists of a number of functions as follows:• ExecHashJoin: This is the main function that is called each time a new tuple is required by the hashjoin node. Note that the first time this function is called, it has to create the hash table of the innernode. The join algorithm goes as described in Section 2.• ExecInitHashJoin: This function is responsible for initializing the state of the join node as well asinvoking initialization procedures for inner and outer nodes.• ExecHashJoinOuterGetTuple: When a new outer tuple is required, this function is invoked to getthe next tuple from the outer node. Note that after all tuples are retrieved from the outer relation, itis still necessary to retrieve the tuples that were previously saved to temporary files as soon as theircorrect batch of the inner hash table becomes available in the memory.• ExecHashJoinSaveTuple and ExecHashJoinGetSavedTuple: These support saving and retrievingthe outer tuples to/from temporary files. Saving a tuple of the outer relation is necessary when itsbatch number is not the same as the current memory resident batch of the hash table. Retrieving allsaved tuples is done after consuming all tuples of the outer relation.• ExecHashJoinNewBatch This function retrieves the next batch of the inner hash table.To implement hash node, a number of functions are provided in nodeHash.c. The following functionsare relevant to the assignment.• MultiExecHash: This function retrieves all tuples from the subnode and insert them into the hashtable. It returns the created hash table to the caller.• ExecHash: Pipelined execution of hash is not implemented by default in PostgreSQL . This functionwill simply return an error message indicating that this execution mode is not supported for hashnodes.• ExecHashGetBucketAndBatch: This function retrieves the bucket number and the batch number fora specific hash key.5 Problem StatementIn this assignment, you are to implement the symmetric hash join to replace the traditional hash join. Youshould alter the necessary files/functions to achieve all requirements. The assignment can be divided intosubtasks as follows:• Change the optimizer so that both the inner and outer relations are first processed by hash nodes beforebeing processed by the hash join operator. In the current implementation, only the inner relation ishashed, as described in Section 2. The hashing nodes is necessary to incrementally retrieve a tuplefrom each relation, insert it into the corresponding hash table and return it to the hash join node. Thiseffect can be implemented by modifying the function that creates the hash join node in the plannercomponent, which can found in the file createplan.c. (10%)• Modify the hashing algorithm to support pipelined execution instead of the multi-tuple execution modethat is currently implemented. This means that you will need to implement the ExecHash function,which currently returns an error message to indicate that this execution mode is not supported. (20%)4CSI 3530 – Databases II Spring 2007• For this assignment you can disable the use of multiple batches, i.e. you may assume that the wholehash table consists one batch that is resident in memory for the duration of the join processing. Modifythe file nodeHash.c to achieve this. Note that related code in nodeHashjoin.c that handles savingand retrieving tuples to temporary batch files can now be discarded. (0%)• Modify the structure of HashJoinState to support the symmetric hash join algorithm by alteringthe file execnodes.h. You need to replicate the existing structures to support bi-directional probing.(10%)• Replace the hash join algorithm with the symmetric hash join. This algorithm should be implementedin nodeHashjoin.c. After completing the join operation, your operator should print the number ofresulting tuples that were found by probing inner hash table and outer hash tables, respectively. (60%)• Disable other join operators (i.e. merge join and nested loop join) to force the query optimizer to use thesymmetric hash join operator. This can be done by modifying the configuration file postgresql.confwhich can be found in the database directory. (0%)You can test that the symmetric hash join is working properly by comparing the tuples resulting from anarbitrary join query before and after the modification. You can use the provided test case from the courseweb site to test your implementation.6 DeliverablesThe following files should be submitted : nodeHashjoin.c, nodeHash.c, execnodes.h and createplan.c.Although you do not have to change other files, you may need to explore additional files to understand howcertain procedures and structures are implemented.You can submit these files by sending them to the TA by email in a ZIPed file. To use the zip command,make sure that all the required files are in the current directory.Make sure to clearly emphasize parts of the files that you have modified by inserting comments beforeand after changes. Also, include any necessary comments to describe how modifications are done or how youimplemented new features. All comments should be preceded by ’CSI3530:’.References[1] Annita N. Wilschut and Peter M. G. Apers, Dataflow Query Execution in a Parallel Main-Memory Environment.In Proceedings of the First International Conference on Parallel and Distributed InformationSystems (PDIS 1991), Miami Beach, Florida, December 4-6, 1991, pp. 68–77.5转自:http://www.6daixie.com/contents/15/4484.html