Final Assignment (Assignment 4)

Objective:
We consider a special case of matrix multiplication:

    C := C + A*B

where A, B, and C are n x n matrices. This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:

    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k) * B(k,j)

In this assignment, we will design and implement a C program that performs computations on large matrices. The matrices are large enough that executing the program causes paging. Read pp. 225-228 and pp. 413-417 of the Patterson & Hennessy textbook.

Purpose:
Different ways of traversing and copying the matrices can have very different impacts on the program's runtime. You are required to observe these impacts and ultimately propose a design that efficiently leverages the mechanisms described below to achieve the best performance.

Instructions:
- You will be paired up in teams of four. Sign up for your team on the Blackboard wiki by April 12.
  o If you do not find a team, you will be randomly assigned one by the instructor.
- After teams are decided, you need to sign up for a presentation slot via the Blackboard wiki.
  o There are 4 groups of at most 6 teams each.
  o Each team signs up for a 10-minute time slot.
  o Each team must arrive before its group begins and stay until the last team of the group finishes.
- Each team must submit the following by 12:00 pm, April 30:
  o Presentation slides, to the Blackboard assignment link.
  o Report (write-up), to the Turnitin link, including:
    - The names of the people on your team.
    - The optimizations used or attempted.
    - The reason for any odd behavior in performance, if observed.
    - A description of which mechanisms your implementation uses and how much performance improvement each achieves. Please express this in terms of runtime.
    - How the performance changed when running your optimized code on a different machine (you can run the same program on each member's machine, since they have different hardware specs).
  o Source code, to the Blackboard assignment link.

Requirements:
1. Complete the given C program, using multiple optimization mechanisms to improve the execution runtime.
2. You are required to use all the mechanisms discussed in class and recitation.
3. You may only use the standard libraries and the intrinsics library to implement your program.
4. You are not allowed to create new threads in your implementation.
5. When compiling your code with gcc, you are allowed to use optimization level 3 (-O3).

Mechanisms:
1. Caching: You are required to try different cache block sizes in your code and use the block size that gives you the minimum runtime when integrated with the other techniques (SIMD and the superscalar mechanism).
2. SIMD: You are required to make use of Single Instruction Multiple Data (SIMD), i.e., performing a single operation on multiple data points simultaneously.
3. Superscalar mechanisms: A superscalar processor executes more than one instruction per clock cycle by simultaneously dispatching multiple instructions to different functional units on the processor.

Experiments:
- For each attempted optimization, you can vary the matrix size (for example, n = 128, 256, 512, 1024, 2048), the block size (m = 2^x), the number of unrolled instructions, etc.
- You need to measure the runtime of each experiment.
- You need to verify the correctness of each attempted optimization.

References:
- Matrix Multiplication
- Intel Intrinsics Guide
- GCC documentation
- Streaming SIMD Extensions

Grading Rubric:
1. If your code does not run, you will not receive any credit.
2. You are required to combine at least two optimization techniques; otherwise you will not receive any credit.
3. Combining two methods with proper experiments can earn you at most 50%.
4. Combining all 3 techniques can earn you an additional (bonus) 20%.
5. 50% of the points are reserved for the quality of your presentation and report (25% each), i.e., how you fine-tuned different parameters and combined several different approaches to reach your result.
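The experiment loop described above (time one configuration, then verify it against the naive kernel) can be sketched as a small self-contained program. This is only an illustration, not part of the provided skeleton: the function name run_experiment and the simple blocked kernel inside it are hypothetical, and it assumes n is a multiple of the block size.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Naive triple loop: the reference result for correctness checking. */
static void naive(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = 0.0;
            for (int k = 0; k < n; k++)
                cij += a[i*n + k] * b[k*n + j];
            c[i*n + j] = cij;
        }
}

/* Simple cache-blocked kernel; accumulates into c, so c must start at 0. */
static void blocked(int n, int bs, const double *a, const double *b, double *c)
{
    for (int si = 0; si < n; si += bs)
        for (int sj = 0; sj < n; sj += bs)
            for (int sk = 0; sk < n; sk += bs)
                for (int i = si; i < si + bs; i++)
                    for (int j = sj; j < sj + bs; j++) {
                        double cij = c[i*n + j];
                        for (int k = sk; k < sk + bs; k++)
                            cij += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = cij;
                    }
}

/* Times one (n, block size) configuration, verifies it against the naive
 * kernel, prints the runtime in ms, and returns 0 when the results match. */
int run_experiment(int n, int bs)
{
    double *a   = malloc((size_t)n * n * sizeof *a);
    double *b   = malloc((size_t)n * n * sizeof *b);
    double *c   = calloc((size_t)n * n, sizeof *c);  /* zeroed: blocked() accumulates */
    double *ref = malloc((size_t)n * n * sizeof *ref);
    for (int i = 0; i < n * n; i++) {
        a[i] = (double)rand() / RAND_MAX;
        b[i] = (double)rand() / RAND_MAX;
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    blocked(n, bs, a, b, c);
    gettimeofday(&end, NULL);
    double ms = (end.tv_sec - start.tv_sec) * 1000.0
              + (end.tv_usec - start.tv_usec) / 1000.0;

    naive(n, a, b, ref);
    int ok = 1;
    for (int i = 0; i < n * n; i++)
        if (fabs(c[i] - ref[i]) >= 1e-7)
            ok = 0;

    printf("n=%d block=%d: %.3f ms, %s\n", n, bs, ms,
           ok ? "correct" : "INCORRECT");
    free(a); free(b); free(c); free(ref);
    return ok ? 0 : 1;
}
```

In your report you would call run_experiment for each (n, block size) pair you want to compare and tabulate the printed runtimes.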
Sample Code for DGEMM

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <immintrin.h>

#define ALIGN __attribute__ ((aligned (32)))
#define SIZE 1024

double ALIGN a[SIZE * SIZE];
double ALIGN b[SIZE * SIZE];
double ALIGN c[SIZE * SIZE];
double ALIGN c1[SIZE * SIZE];

// naive matrix multiplication (reference result goes into c1)
void dgemm(int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            double cij = 0;
            for (k = 0; k < n; k++)
                cij = cij + a[i*n+k] * b[k*n+j];
            c1[i*n+j] = cij;
        }
    }
}

/* Implement this function with multiple optimization techniques. */
void optimized_dgemm(int n)
{
    // call any of your optimization attempts
}

int main(int argc, char** argv)
{
    int i, j;
    time_t t;
    struct timeval start, end;
    double elapsed_time;
    int check_correctness = 0;
    int correct = 1;

    if (argc > 1) {
        if (strcmp(argv[1], "corr") == 0) {
            check_correctness = 1;
        }
    }

    /* Initialize random number generator */
    srand((unsigned) time(&t));

    /* Populate the arrays with random values */
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            a[i*SIZE+j] = (double)rand() / (RAND_MAX + 1.0);
            b[i*SIZE+j] = (double)rand() / (RAND_MAX + 1.0);
            c[i*SIZE+j] = 0.0;
            c1[i*SIZE+j] = 0.0;
        }
    }

    gettimeofday(&start, NULL);
    /* Call your optimized function optimized_dgemm */
    optimized_dgemm(SIZE);
    gettimeofday(&end, NULL);

    /* For TA use only */
    if (check_correctness) {
        dgemm(SIZE);
        for (i = 0; i < SIZE; i++) {
            for (j = 0; j < SIZE; j++) {
                if (fabs(c[i*SIZE+j] - c1[i*SIZE+j]) >= 0.0000001) {
                    printf("%f != %f\n", c[i*SIZE+j], c1[i*SIZE+j]);
                    correct = 0;
                }
            }
        }
        if (correct)
            printf("Result is correct!\n");
        else
            printf("Result is incorrect!\n");
    }

    elapsed_time = (end.tv_sec - start.tv_sec) * 1000.0;
    elapsed_time += (end.tv_usec - start.tv_usec) / 1000.0;
    printf("dgemm finished in %f milliseconds.\n", elapsed_time);
    return 0;
}

Loop Unrolling Mechanism Example

void dgemm_unrolling(int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            double cij = 0;
            for (k = 0; k < n; k += 4) {
                double s1 = a[i*n+k]     * b[k*n+j];
                double s2 = a[i*n+(k+1)] * b[(k+1)*n+j];
                double s3 = a[i*n+(k+2)] * b[(k+2)*n+j];
                double s4 = a[i*n+(k+3)] * b[(k+3)*n+j];
                cij += s1 + s2 + s3 + s4;
            }
            c[i*n+j] = cij;
        }
    }
}

Cache Blocking Mechanism Example

#define BLOCK_SIZE 4
void do_block(int n, int si, int sj, int sk, double *a, double *b, double *c)
{
    int i, j, k;
    for (i = si; i < si + BLOCK_SIZE; i++)
        for (j = sj; j < sj + BLOCK_SIZE; j++) {
            double cij = c[i*n+j];
            for (k = sk; k < sk + BLOCK_SIZE; k++)
                cij += a[i*n+k] * b[k*n+j];
            c[i*n+j] = cij;
        }
}

void dgemm_blocking(int n)
{
    int i, j, k;
    for (i = 0; i < n; i += BLOCK_SIZE)
        for (j = 0; j < n; j += BLOCK_SIZE) {
            c[i*n+j] = 0;
            for (k = 0; k < n; k += BLOCK_SIZE)
                do_block(n, i, j, k, a, b, c);
        }
}

SIMD Mechanism Example

void dgemm_intrin(int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j += 4) {
            __m256d c4 = _mm256_load_pd(&c[i*n+j]);
            for (k = 0; k < n; k++) {
                __m256d a4 = _mm256_broadcast_sd(&a[i*n+k]);
                __m256d b4 = _mm256_load_pd(&b[k*n+j]);
                c4 = _mm256_add_pd(c4, _mm256_mul_pd(a4, b4));
            }
            _mm256_store_pd(&c[i*n+j], c4);
        }
    }
}

Example Report from a past semester

After implementing the 3 given examples, I tried many other things.

Things that sped up the runtime but were not discussed in detail, because the boosts were minuscule:
- Reordering the variables of sequential operations
- Manually precomputing repeatedly used arithmetic to relieve the CPU of that arithmetic
- Loop tiling + loop jamming to reduce branch checks, and thus fewer stalls
- Using the load intrinsic directly within the function, since it adds no latency or throughput cost
- Deleting declarations of objects that were no longer needed
- Performing all loads first, before executing operations (fewer hazards, fewer stalls)

Things that I tried but that did not decrease the runtime, so I dropped them:
- Using block sizes greater than xx
- Unrolling more than the optimal amount, to see the results
- Using the SSSE3 (_mm_stream_pd) function to initialize array c with 0's
- Using memcmp or memset from SSE4.1 and SSE4.2 to zero out the array
- Manually precomputing some repetitive arithmetic (this actually slowed it down)
- Using a logical shift left by 10 instead of multiplying by 1024 (slows it down)
- Removing the pointer parameters (unstable and spiky, but best result 104 ms)
- Initializing the global array with c = { 0 } so it is zeroed from the beginning

Things that I want to try but cannot, because they are not released yet:
- The AVX-512 intrinsics library, since it has functions that can perform twice as fast as the current ones

[Figure: average runtime for 1024 x 1024 (ms) per optimization combination: Blocking; Loop Unrolling + Blocking; Intrinsic + Blocking + Unrolling; the previous 3 + SLL multiply; the previous 3 + FMA intrinsic + sequential locality + temporal & spatial locality + unrolled do_block + loop interchange & optimization]

Reposted from: http://www.7daixie.com/2019043020176071.html
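The sample report above mentions an FMA-intrinsic variant. As a minimal sketch of what that change looks like, the separate _mm256_mul_pd / _mm256_add_pd pair in the dgemm_intrin loop can be replaced with a single _mm256_fmadd_pd. The function names here are illustrative, the runtime CPU check and scalar fallback are additions for portability (they are not in the course sample), and the kernel assumes n is a multiple of 4; it uses unaligned loads so it works on any heap buffer.

```c
#include <assert.h>
#include <immintrin.h>
#include <math.h>
#include <stdlib.h>

/* Scalar reference, used both as the fallback and for verification. */
static void dgemm_scalar(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = 0.0;
            for (int k = 0; k < n; k++)
                cij += a[i*n + k] * b[k*n + j];
            c[i*n + j] = cij;
        }
}

/* AVX+FMA version of the dgemm_intrin loop: one fused multiply-add per
 * iteration instead of a separate multiply and add. Accumulates into c,
 * so c must start at zero. Assumes n is a multiple of 4. */
__attribute__((target("avx,fma")))
static void dgemm_fma(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j += 4) {
            __m256d c4 = _mm256_loadu_pd(&c[i*n + j]);
            for (int k = 0; k < n; k++) {
                __m256d a4 = _mm256_broadcast_sd(&a[i*n + k]);
                __m256d b4 = _mm256_loadu_pd(&b[k*n + j]);
                c4 = _mm256_fmadd_pd(a4, b4, c4);  /* c4 += a4 * b4 */
            }
            _mm256_storeu_pd(&c[i*n + j], c4);
        }
}

/* Dispatch at runtime so the sketch also runs on CPUs without FMA.
 * In both paths, c is expected to be zero-initialized on entry. */
void dgemm_auto(int n, const double *a, const double *b, double *c)
{
    if (__builtin_cpu_supports("fma"))
        dgemm_fma(n, a, b, c);
    else
        dgemm_scalar(n, a, b, c);
}
```

The per-function target attribute lets this file compile without passing -mfma globally; gcc still requires a flag such as -O3 (allowed by the assignment) for reasonable performance.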