This can be solved using a technique called red-black, where we alternate between two arrays. [link] shows how the red-black version of the computation operates. This kills two birds with one stone! Now the mathematics is precisely correct, and there is no recurrence. Sounds like a real win-win situation.

Computing the new value for a cell

[Figure: a row of ten cells holding 100, 50, 25, 12.5, and then six zeros; the label "average" spans three adjacent cells, marking the neighbors whose mean becomes the new value of the center cell.]

Using two arrays to eliminate a dependency

[Figure: the "old" row of ten cells (100 followed by zeros) is averaged, and the results flow into a separate "fixed" row (100, 50, then zeros).]

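To see why the two-array version makes the mathematics precisely correct, write R for the array being read and B for the array being written during a sweep (this notation is ours, not the original text's). Each interior cell is then computed entirely from values belonging to the previous time step:

$$B_i = \frac{R_{i-1} + R_{i+1}}{2}, \qquad i = 2, \dots, 9$$

On the next sweep the roles of the two arrays swap, so no update ever reads a value that has already been changed during the current time step.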
The only downside to this approach is that it takes twice the memory storage and twice the memory bandwidth. There is another red-black approach that computes first the even elements and then the odd elements of the rod in two passes. This approach has no data dependencies within each pass. The ROD array never holds all the values from the same time step; either the odd or the even values are one time step ahead of the others. It ends up with a stride of two and doubles the bandwidth, but it does not double the memory storage required to solve the problem. The modified code, which implements the two-array version, is as follows (a sketch of the even/odd variant appears after the discussion of the output below):


      PROGRAM HEATRED
      PARAMETER(MAXTIME=200)
      INTEGER TICKS,I,MAXTIME
      REAL*4 RED(10),BLACK(10)
C
C     Initialize the rod: 100 degrees at the left end, 0 elsewhere
      RED(1) = 100.0
      BLACK(1) = 100.0
      DO I=2,9
        RED(I) = 0.0
      ENDDO
      RED(10) = 0.0
      BLACK(10) = 0.0
C
C     Each trip through the loop performs two sweeps (RED -> BLACK,
C     then BLACK -> RED), so TICKS advances by two
      DO TICKS=1,MAXTIME,2
        IF ( MOD(TICKS,20) .EQ. 1 ) PRINT 100,TICKS,(RED(I),I=1,10)
        DO I=2,9
          BLACK(I) = (RED(I-1) + RED(I+1) ) / 2
        ENDDO
        DO I=2,9
          RED(I) = (BLACK(I-1) + BLACK(I+1) ) / 2
        ENDDO
      ENDDO
C
  100 FORMAT(I4,10F7.2)
      END

The output for the modified program is:


% f77 heatred.f
heatred.f:
 MAIN heatred:
% a.out
   1 100.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
  21 100.00  82.38  66.34  50.30  38.18  26.06  18.20  10.35   5.18   0.00
  41 100.00  87.04  74.52  61.99  50.56  39.13  28.94  18.75   9.38   0.00
  61 100.00  88.36  76.84  65.32  54.12  42.91  32.07  21.22  10.61   0.00
  81 100.00  88.74  77.51  66.28  55.14  44.00  32.97  21.93  10.97   0.00
 101 100.00  88.84  77.70  66.55  55.44  44.32  33.23  22.14  11.07   0.00
 121 100.00  88.88  77.76  66.63  55.52  44.41  33.30  22.20  11.10   0.00
 141 100.00  88.89  77.77  66.66  55.55  44.43  33.32  22.22  11.11   0.00
 161 100.00  88.89  77.78  66.66  55.55  44.44  33.33  22.22  11.11   0.00
 181 100.00  88.89  77.78  66.67  55.55  44.44  33.33  22.22  11.11   0.00
%

Interestingly, the modified program takes longer to converge than the first version. It converges at time step 181 rather than 101. If you look at the first version, because of the recurrence, the heat ended up flowing faster from left to right because the left element of each average was already the next-time-step value. It may seem nifty, but it's wrong. There are other algorithmic approaches to solving partial differential equations, such as the "fast multipole method," that accelerate convergence "legally." Don't assume that the brute-force approach used here is the only method to solve this particular problem. Programmers should always look for the best available algorithm (parallel or not) before trying to scale up the "wrong" algorithm. For folks other than computer scientists, time to solution is more important than linear speed-up. Generally, in this problem, either approach converges to the same eventual values within the limits of floating-point representation.
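For comparison, here is a minimal sketch of the even/odd in-place variant described above. This is not code from the original program; the array name ROD, the ten-element rod with fixed values in elements 1 and 10, and the loop bounds are assumptions carried over from the two-array version:

C     Sketch only: in-place even/odd red-black sweeps over a single array.
C     ROD(1) and ROD(10) hold the fixed boundary values.
      DO TICKS=1,MAXTIME
C       First pass: update the even-numbered interior cells
        DO I=2,9,2
          ROD(I) = (ROD(I-1) + ROD(I+1)) / 2
        ENDDO
C       Second pass: update the odd-numbered interior cells
        DO I=3,9,2
          ROD(I) = (ROD(I-1) + ROD(I+1)) / 2
        ENDDO
      ENDDO

Within each pass, even-numbered cells read only odd-numbered neighbors and vice versa, so no element is both read and written in the same pass; that is what removes the dependency without a second array, at the cost of stride-two memory access.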

This heat flow problem is extremely simple, and in its red-black form, it's inherently very parallel with very simple data interactions. It's a good model for a wide range of problems where we are discretizing two-dimensional or three-dimensional space and performing some simple simulations in that space.

This problem can usually be scaled up by making a finer grid. Often, the benefit of scalable processors is to allow a finer grid rather than a faster time to solution. For example, you might be able to do a worldwide weather simulation using a 200-mile grid in four hours on one processor. Using 100 processors, you may be able to do the simulation using a 20-mile grid in four hours with much more accurate results. Or, using 400 processors, you can do the finer grid simulation in one hour.
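As a rough check on those numbers (assuming a two-dimensional surface grid and ideal, linear speed-up, neither of which the text states explicitly), refining from a 200-mile grid to a 20-mile grid gives

$$\left(\frac{200\ \text{miles}}{20\ \text{miles}}\right)^2 = 100$$

times as many grid cells, so 100 processors recover the original four-hour runtime on the finer grid, and quadrupling to 400 processors brings that run down to roughly one hour.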

Source: OpenStax, High performance computing. OpenStax CNX, Aug 25, 2010. Download for free at http://cnx.org/content/col11136/1.5