Parallel substring match (Acceler8 2012)

WHC Team: Grigore Lupescu, Mihai Oprea
Prof: Emil Slusanschi

The algorithm is based on the same dynamic programming approach as the Intel algorithm, with a complexity of O(m*n). We implemented the solution in C++ with OpenMP and SSE4.2. As a compiler, icpc did slightly better than g++ in our case.

Low memory usage:

The main idea was to traverse a virtual L matrix diagonally, from (top, left) to (bottom, right), without storing intermediate results. A cell depends only on its predecessor on the same diagonal, so parallelizing at the diagonal level avoids data dependencies and lets us assign a set of diagonals to each thread.

Each thread works on one diagonal at a time. Using a counter (we called it “maxS”), it can pinpoint the start (x1, y1) and stop (x2, y2) of a match and record them if maxS > threshold.
Memory use is thus reduced to
strLength1 * sizeof(char) + strLength2 * sizeof(char) + diagonalLength * sizeof(int)/(nrThreads*chunkSize) + solutionsFound * 6 * sizeof(int).
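
The diagonal walk described above can be sketched as follows. This is a minimal serial illustration, not the contest code: the names Match, scanDiagonal and findMatches are hypothetical, and the sketch records a run when maxS >= minMatchLength, which is an assumption about how the threshold is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One match: s1[x1..x2] equals s2[y1..y2] (inclusive indices).
// Hypothetical type, introduced only for this sketch.
struct Match {
    std::size_t x1, y1, x2, y2;
};

// Walk the single diagonal starting at cell (startX, startY) of the virtual
// L matrix. Only a running counter is kept, never a matrix row or column.
void scanDiagonal(const std::string& s1, const std::string& s2,
                  std::size_t startX, std::size_t startY,
                  std::size_t minMatchLength, std::vector<Match>& out) {
    assert(minMatchLength > 0);
    std::size_t maxS = 0;  // length of the run of equal characters so far
    std::size_t x = startX, y = startY;
    for (; x < s1.size() && y < s2.size(); ++x, ++y) {
        if (s1[x] == s2[y]) {
            ++maxS;
        } else {
            // The run just ended: keep it if it is long enough.
            if (maxS >= minMatchLength)
                out.push_back({x - maxS, y - maxS, x - 1, y - 1});
            maxS = 0;
        }
    }
    // A run may also end at the edge of the matrix.
    if (maxS >= minMatchLength)
        out.push_back({x - maxS, y - maxS, x - 1, y - 1});
}

// Visit every diagonal: first those starting on the top row,
// then those starting on the left column.
std::vector<Match> findMatches(const std::string& s1, const std::string& s2,
                               std::size_t minMatchLength) {
    std::vector<Match> out;
    for (std::size_t y = 0; y < s2.size(); ++y)
        scanDiagonal(s1, s2, 0, y, minMatchLength, out);
    for (std::size_t x = 1; x < s1.size(); ++x)
        scanDiagonal(s1, s2, x, 0, minMatchLength, out);
    return out;
}
```

Because each call to scanDiagonal touches only its own counter and reads the two strings, the calls can be handed to different threads without synchronising on anything except the output container.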

Use OpenMP:
Because the diagonals are independent, we could split them into chunks and hand those out to the threads. With OpenMP we issued the following directive (we found a chunk size of 120 to be a good tradeoff):
#pragma omp parallel for default(none) private(slice) firstprivate(…) shared(vecPartial) schedule(dynamic, 120)
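
A rough sketch of how such a loop over diagonals might be arranged is shown below. It differs from the directive above: instead of a shared vecPartial it collects results in a per-thread vector and merges them in a critical section, which is a choice made for this illustration only. All names (Match, scanDiagonal, findMatchesParallel) are hypothetical. Without -fopenmp the pragmas are ignored and the code simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: s1[x1..x2] equals s2[y1..y2] (inclusive).
struct Match { long x1, y1, x2, y2; };

// Scan diagonal number d, with d in [0, |s1| + |s2| - 1).
// Diagonals 0..|s2|-1 start on the top row, the rest on the left column.
static void scanDiagonal(const std::string& s1, const std::string& s2,
                         long d, long minLen, std::vector<Match>& out) {
    const long n1 = static_cast<long>(s1.size());
    const long n2 = static_cast<long>(s2.size());
    long x = d < n2 ? 0 : d - n2 + 1;
    long y = d < n2 ? d : 0;
    long maxS = 0;  // current run of equal characters
    for (; x < n1 && y < n2; ++x, ++y) {
        if (s1[x] == s2[y]) { ++maxS; continue; }
        if (maxS >= minLen) out.push_back({x - maxS, y - maxS, x - 1, y - 1});
        maxS = 0;
    }
    if (maxS >= minLen) out.push_back({x - maxS, y - maxS, x - 1, y - 1});
}

std::vector<Match> findMatchesParallel(const std::string& s1,
                                       const std::string& s2, long minLen) {
    assert(minLen > 0);
    std::vector<Match> all;
    const long nDiag = static_cast<long>(s1.size()) +
                       static_cast<long>(s2.size()) - 1;
    #pragma omp parallel
    {
        std::vector<Match> local;  // per-thread results, no locking in the hot loop
        // Chunks of 120 diagonals are handed out on demand, so threads that
        // draw short diagonals near the corners simply fetch more work.
        #pragma omp for schedule(dynamic, 120) nowait
        for (long d = 0; d < nDiag; ++d)
            scanDiagonal(s1, s2, d, minLen, local);
        #pragma omp critical
        all.insert(all.end(), local.begin(), local.end());
    }
    return all;  // order depends on thread timing when OpenMP is enabled
}
```

Dynamic scheduling matters here because diagonal lengths vary from 1 up to min(|s1|, |s2|), so a static split would leave some threads with much less work than others.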

Add SSE:
Next we added SSE4.2, comparing not one pair of characters (str1[X] == str2[Y]) at a time but several {4, 8, 16}, depending on minMatchLength. For example:
__m128i R = _mm_cmpestrm(A, 4, B, 4, CMP_EQUAL_EACH);
memcpy(&ret, &R, 1);
if((ret & 15) == 15)
    maxS += 4;

We kept the inner loop simple and clean, testing only whether the mask was all 1s, and adjusted the match boundaries at a later stage. For example, using SSE with 4 comparisons on “ABCABC”, we would detect “BCAB” and then adjust it to “ABCABC”. The gain was significant because the number of solutions was almost always low compared to the number of checks. We got an almost linear speedup with this method: 16x for SSE_16, 8x for SSE_8…
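
The "detect coarsely, adjust later" step can be illustrated with a portable sketch. Here std::memcmp on a 4-character block stands in for the SSE4.2 comparison, so this is a substitute technique chosen for readability, not the intrinsic-based inner loop. The names Match, kBlock and scanDiagonalBlocked are hypothetical. The sketch also assumes minLen >= 2 * kBlock - 1, so that every sufficiently long match is guaranteed to cover at least one whole block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical result type: s1[x1..x2] equals s2[y1..y2] (inclusive).
struct Match { std::size_t x1, y1, x2, y2; };

constexpr std::size_t kBlock = 4;  // stand-in for the SSE comparison width

// Scan the diagonal starting at (x0, y0) in steps of kBlock characters.
// Preconditions: x0 <= s1.size(), y0 <= s2.size(), minLen >= 2 * kBlock - 1.
void scanDiagonalBlocked(const std::string& s1, const std::string& s2,
                         std::size_t x0, std::size_t y0, std::size_t minLen,
                         std::vector<Match>& out) {
    assert(minLen >= 2 * kBlock - 1);
    const std::size_t len = std::min(s1.size() - x0, s2.size() - y0);
    const char* a = s1.data() + x0;
    const char* b = s2.data() + y0;
    std::size_t k = 0;
    while (k + kBlock <= len) {
        // Cheap test: is the whole block equal ("all 1s" in the mask)?
        if (std::memcmp(a + k, b + k, kBlock) != 0) {
            k += kBlock;
            continue;
        }
        // Coarse hit. Adjust: grow it one character at a time both ways.
        std::size_t lo = k, hi = k + kBlock;  // match is the range [lo, hi)
        while (lo > 0 && a[lo - 1] == b[lo - 1]) --lo;
        while (hi < len && a[hi] == b[hi]) ++hi;
        if (hi - lo >= minLen)
            out.push_back({x0 + lo, y0 + lo, x0 + hi - 1, y0 + hi - 1});
        // Position hi is a mismatch (or the end), so the block containing it
        // cannot hit; resume at the next block boundary.
        k = (hi / kBlock + 1) * kBlock;
    }
}
```

With the strings "zzzABCABCAzz" and "yyyABCABCAyy" on the main diagonal, the block at offsets 4..7 ("BCAB") is the coarse hit, and the adjustment widens it to offsets 3..9 ("ABCABCA"). The expensive character-by-character work runs only around hits, which is why it pays off when solutions are rare compared to checks.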

Other optimisations:
- Reading data in chunks, without checking for {a,c,t,g}: ~+5%
- Filtering (to output correctly) is done in parallel: ~+5-10%
- Use of STL containers/functions: ~+10%
- Use of icpc with -axsse4.2 -O3 -fast: ~+5% (compared to g++)
- If (ret & 1) == 0 on SSE, jump over more elements: ~+5-15%

We noticed that:
- The program scales very well if the sequences are large enough
- Inverting the sequences shows no gain
- Using HT does not improve much (~5%), possibly due to SSE
- g++ and icpc give rather different timings (+-10%); bigger tests go better with icpc

Compilation flags:

$(ICC_PATH)icpc -fopenmp -axsse4.2 -O3 -fast main.cpp utils.cpp process.cpp -o run

Results:

on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-13105960671904317035:
invalid benchmark - TIMEOUT

on a 40-cores HT machine, using 20 worker threads, running benchmark AE12CB-9510636300373457187:
real:17.32 user:313.36 sys:0.74 CPU:1813.51%

on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-9510636300373457187:
real:9.84 user:318.87 sys:0.95 CPU:3250.2%

on a 40-cores HT machine, using 20 worker threads, running benchmark AE12CB-14929312560780125584:
real:9.04 user:165.92 sys:0.68 CPU:1842.92%

on a 40-cores HT machine, using 40 worker threads, running benchmark AE12CB-14929312560780125584:
real:5.22 user:172.38 sys:0.98 CPU:3321.07%

on a 12-cores HT machine, using 6 worker threads, running benchmark AE12CB-14929312560780125584:
real:21.58 user:127.02 sys:0.37 CPU:590.315%

on a 12-cores HT machine, using 12 worker threads, running benchmark AE12CB-14929312560780125584:
real:11.98 user:136.72 sys:0.37 CPU:1144.32%

on a 12-cores HT machine, using 24 worker threads, running benchmark AE12CB-14929312560780125584:
real:9.58 user:212.51 sys:0.56 CPU:2224.11%

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-10353053912364647132:
real:0.4 user:1.29 sys:0 CPU:322.5%

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-10353053912364647132:
real:0.2 user:1.31 sys:0 CPU:655%

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-16325737234926730915:
real:0.2 user:0.25 sys:0 CPU:125%

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915:
real:0.2 user:0.25 sys:0 CPU:125%
