
Parallel substring match (Acceler8 2012)

WHC Team: Grigore Lupescu, Mihai Oprea
Prof: Emil Slusanschi

The algorithm is based on the same dynamic programming approach as the Intel algorithm, with a complexity of O(m*n). We implemented the solution in C++ with OpenMP and SSE4.2. In our case the icpc compiler did slightly better than g++.
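For reference, here is a minimal sketch of the textbook O(m*n) table this kind of approach usually starts from. It assumes the standard recurrence (L[i][j] = L[i-1][j-1] + 1 when the characters match, 0 otherwise) and is not taken from the Intel reference code. The function name longestCommonRun is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Textbook table: L[i][j] is the length of the common run ending at
// a[i-1] and b[j-1]. Assumed recurrence, not the reference implementation.
int longestCommonRun(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> L(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1]) {
                // A match extends the run on the same diagonal by one.
                L[i][j] = L[i - 1][j - 1] + 1;
                best = std::max(best, L[i][j]);
            }
        }
    }
    return best;
}
```

Each cell reads only its upper-left neighbour, which is why the full table never has to be stored.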

Low memory usage:
The main idea was to traverse a virtual L matrix diagonally, from (top, left) to (bottom, right), without storing intermediate results. A cell depends only on the previous cell of its own diagonal, so parallelizing at the diagonal level avoids data dependencies, and we could assign a set of diagonals to each thread.

Each thread works on one diagonal at a time. Using a running counter (we called it “maxS”), it can pinpoint the start (x1, y1) and stop (x2, y2) of a match and record them if maxS > threshold.
Memory use is thus reduced to
strLength1 * sizeof(char) + strLength2 * sizeof(char) + diagonalLength * sizeof(int)/(nrThreads*chunkSize) + solutionsFound * 6 * sizeof(int).
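The diagonal walk can be sketched as follows. This is an illustration under assumptions, not our contest code: Match, scanDiagonal and findMatches are hypothetical names, indices are 0-based and inclusive, and a run is recorded when maxS > threshold as described above.

```cpp
#include <string>
#include <vector>

// One reported match: a[x1..x2] equals b[y1..y2] (inclusive, 0-based).
struct Match { int x1, y1, x2, y2; };

// Walk a single diagonal starting at (i0, j0). Only the counter maxS is
// kept; no row, column or matrix is stored.
void scanDiagonal(const std::string& a, const std::string& b,
                  int i0, int j0, int threshold, std::vector<Match>& out) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    int maxS = 0;
    int i = i0, j = j0;
    for (; i < n && j < m; ++i, ++j) {
        if (a[i] == b[j]) {
            ++maxS;
        } else {
            // The run ended just before (i, j): record it if long enough.
            if (maxS > threshold) out.push_back({i - maxS, j - maxS, i - 1, j - 1});
            maxS = 0;
        }
    }
    // A run may also end at the border of the matrix.
    if (maxS > threshold) out.push_back({i - maxS, j - maxS, i - 1, j - 1});
}

// Visit every diagonal: first those starting on the top row (right to
// left), then those starting on the left column.
std::vector<Match> findMatches(const std::string& a, const std::string& b,
                               int threshold) {
    std::vector<Match> out;
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    for (int j0 = m - 1; j0 >= 1; --j0) scanDiagonal(a, b, 0, j0, threshold, out);
    for (int i0 = 0; i0 < n; ++i0) scanDiagonal(a, b, i0, 0, threshold, out);
    return out;
}
```

For example, findMatches("xABCABCy", "zzABCABCw", 3) reports the single run "ABCABC" as (x1, y1) = (1, 2) and (x2, y2) = (6, 7).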

Use OpenMP:
Because the diagonals are independent, we could split them into chunks across the threads. With OpenMP we used the following directive (we found a chunk size of 120 to be a good tradeoff):
#pragma omp parallel for default(none) private(slice) firstprivate(…) shared(vecPartial) schedule(dynamic, 120)
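A minimal sketch of how such a loop might look, under assumptions: the names Span and parallelScan are hypothetical, default(none) and the elided firstprivate list are left out to keep the sketch portable across compilers, and per-thread vectors merged in a critical section stand in for whatever vecPartial did in our code. Without -fopenmp the pragmas are ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Span { int x1, y1, x2, y2; };  // inclusive, 0-based indices

std::vector<Span> parallelScan(const std::string& a, const std::string& b,
                               int threshold) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    std::vector<Span> all;
    #pragma omp parallel
    {
        std::vector<Span> local;  // per-thread results: no locking in the hot loop
        // Diagonals are handed out in chunks of 120, as in the directive above.
        #pragma omp for schedule(dynamic, 120) nowait
        for (int d = 0; d < n + m - 1; ++d) {
            // Diagonal d starts on the top row for d < m, else on the left column.
            int i = (d < m) ? 0 : d - m + 1;
            int j = (d < m) ? m - 1 - d : 0;
            int maxS = 0;
            for (; i < n && j < m; ++i, ++j) {
                if (a[i] == b[j]) { ++maxS; continue; }
                if (maxS > threshold) local.push_back({i - maxS, j - maxS, i - 1, j - 1});
                maxS = 0;
            }
            if (maxS > threshold) local.push_back({i - maxS, j - maxS, i - 1, j - 1});
        }
        #pragma omp critical
        all.insert(all.end(), local.begin(), local.end());
    }
    // The merge order depends on thread scheduling, so sort before use.
    std::sort(all.begin(), all.end(), [](const Span& p, const Span& q) {
        return p.x1 != q.x1 ? p.x1 < q.x1 : p.y1 < q.y1;
    });
    return all;
}
```

Dynamic scheduling matters here because diagonals near the corners are much shorter than those in the middle, so equal-sized static blocks would leave some threads idle.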


Add SSE:
Next we added SSE4.2, comparing not one pair of characters (str1[X] == str2[Y]) but several {4, 8, 16} at a time, depending on minMatchLength. For example:
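// CMP_EQUAL_EACH is our mode flag (the intrinsic header calls it _SIDD_CMP_EQUAL_EACH)
__m128i R = _mm_cmpestrm(A, 4, B, 4, CMP_EQUAL_EACH);
memcpy(&ret, &R, 1);     // low byte of the result mask
if ((ret & 15) == 15)    // all 4 positions equal
    maxS += 4;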


We kept the inner loop simple and clean (test only for all 1s) and adjusted at a later stage. For example, if we used SSE with 4 comparisons on “ABCABC”, we would detect “BCAB” and then adjust to “ABCABC”. The gain was significant because the number of solutions was almost always low compared to the number of checks. We got almost linear speedup with this method: 16x for SSE_16, 8x for SSE_8…
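The adjustment step can be sketched like this. It is a portable stand-in, not our SSE code: std::memcmp on 4 bytes plays the role of the 4-wide compare, and Run, block4Equal and widen are hypothetical names.

```cpp
#include <cstring>
#include <string>

// A run of len equal characters starting at a[x1] and b[y1].
struct Run { int x1, y1, len; };

// Portable stand-in for the 4-wide SSE compare: true if all 4 bytes match.
bool block4Equal(const char* p, const char* q) {
    return std::memcmp(p, q, 4) == 0;
}

// Grow a block-sized hit to the full run, one character at a time.
Run widen(const std::string& a, const std::string& b, Run r) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    // Extend to the left while the preceding characters still match.
    while (r.x1 > 0 && r.y1 > 0 && a[r.x1 - 1] == b[r.y1 - 1]) {
        --r.x1; --r.y1; ++r.len;
    }
    // Extend to the right while the following characters still match.
    while (r.x1 + r.len < n && r.y1 + r.len < m &&
           a[r.x1 + r.len] == b[r.y1 + r.len]) {
        ++r.len;
    }
    return r;
}
```

With a = "xABCABCy" and b = "zzABCABCw", the block "BCAB" found at a[2] and b[3] widens to the run "ABCABC" starting at a[1] and b[2]. Because widening runs only on hits, its per-character cost stays off the hot path.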


Other optimisations:
- Reading chunks of data, not checking for {a,c,t,g}: ~+5%
- Filtering (to output correctly) is done in parallel: ~+5-10%
- Use of STL containers/functions: ~+10%
- Use of icpc with -axsse4.2 -O3 -fast: ~+5% (compared to g++)
- If (on SSE) (ret & 1) == 0, jump more elements: ~+5-15%


We noticed that:
- The program scales very well if the sequences are large enough
- Inverting the sequences shows no gain
- Using HT does not improve much (~5%), possibly due to SSE
- g++ and icpc give rather different timings (+-10%); bigger tests do better with icpc.


Compilation flags:
$(ICC_PATH)icpc -fopenmp -axsse4.2 -O3 -fast main.cpp utils.cpp process.cpp -o run

Results:

on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-13105960671904317035:
invalid benchmark - TIMEOUT

on a 40-cores HT machine, using 20 worker threads, running benchmark AE12CB-9510636300373457187:
real:17.32 user:313.36 sys:0.74 CPU:1813.51%

on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-9510636300373457187:
real:9.84 user:318.87 sys:0.95 CPU:3250.2%

on a 40-cores HT machine, using 20 worker threads, running benchmark AE12CB-14929312560780125584:
real:9.04 user:165.92 sys:0.68 CPU:1842.92%

on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-14929312560780125584:
real:5.22 user:172.38 sys:0.98 CPU:3321.07%

on a 12-cores HT machine, using 6 worker threads, running benchmark AE12CB-14929312560780125584:
real:21.58 user:127.02 sys:0.37 CPU:590.315%

on a 12-cores HT machine, using 12 worker threads, running benchmark AE12CB-14929312560780125584:
real:11.98 user:136.72 sys:0.37 CPU:1144.32%

on a 12-cores HT machine, using 24 worker threads, running benchmark AE12CB-14929312560780125584:
real:9.58 user:212.51 sys:0.56 CPU:2224.11%

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-10353053912364647132:
real:0.4 user:1.29 sys:0 CPU:322.5%

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-10353053912364647132:
real:0.2 user:1.31 sys:0 CPU:655%

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-16325737234926730915:
real:0.2 user:0.25 sys:0 CPU:125%

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915:
real:0.2 user:0.25 sys:0 CPU:125%
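To read the figures above: the CPU column appears to be (user + sys) / real expressed as a percentage, which is consistent with every line listed. The ratio of two real times gives the speedup between two thread counts. A small sketch of that arithmetic, with hypothetical helper names:

```cpp
// CPU utilisation as a percentage: total CPU seconds over wall-clock seconds.
double cpuPercent(double real, double user, double sys) {
    return 100.0 * (user + sys) / real;
}

// Wall-clock speedup when going from one configuration to another.
double speedup(double realBefore, double realAfter) {
    return realBefore / realAfter;
}
```

For benchmark AE12CB-9510636300373457187 on the 40-core machine, cpuPercent(17.32, 313.36, 0.74) gives about 1813.5, and speedup(17.32, 9.84) is about 1.76 when going from 20 to 40 worker threads.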
