6 Dec 2007 13:43
question on OpenMP usage in LDL decomposition and time estimation of "fork-join" process
Hello,
I tried to use OpenMP to speed up my LDL factorization algorithm, but I got only an 8% speedup (I have an Intel Core2Duo processor and I have only about 75% ). I am solving the matrix equation [A]*{x}={f}, where the matrix [A] is positive definite, banded and symmetric (N is the matrix dimension, r is the band size). My code looks like this:
for (int i=0;i<N;i++)
{
.......
#pragma omp parallel for
for (int j=max(0,i-r);j<i;j++)
{ ... }
.....
#pragma omp parallel for
for (int j=i+1;j<min(N,i+1+r);j++)
{ ......
for (int k=max(0,j-r);k<j;k++)
{ ..... }
} // end for j
......
} // end for i
The main problem is that I cannot parallelize the outer loop, because iteration i+1 uses the results of iteration i, and furthermore r<<N (N can range from 1e6 to 1e10, and N/r from 10000 down to 100). I suspect that in this case the "fork-join" operations are the bottleneck, because they are executed too often.
Can you help me with this problem?
Is there a general approach to such problems?
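To make the question concrete, here is a minimal sketch of one commonly suggested restructuring: open a single parallel region around the outer i loop and use "omp for" and "omp single" inside it, so that the thread team is created once rather than twice per i. This is not my actual code. The function name banded_ldl, the dense row-major storage and the exact loop bodies are assumptions chosen to keep the example short, and the inner k loop stops at i rather than at j. The barriers at the end of each worksharing construct remain, so whether this helps would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: LDL^T factorization of a symmetric positive definite
// matrix A (n x n, dense row-major, half-bandwidth r). Dense storage is an
// assumption made only to keep the sketch short.
// On return, L is unit lower triangular and d holds the diagonal of D.
void banded_ldl(const std::vector<double>& A, int n, int r,
                std::vector<double>& L, std::vector<double>& d)
{
    assert(n > 0 && r >= 0 && A.size() == static_cast<std::size_t>(n) * n);
    L.assign(static_cast<std::size_t>(n) * n, 0.0);
    d.assign(n, 0.0);
    std::vector<double> v(n, 0.0);  // v[j] = L(i,j) * d[j] for the current i
    auto at = [n](int row, int col) {
        return static_cast<std::size_t>(row) * n + col;
    };

    #pragma omp parallel  // the thread team is created once, not once per i
    {
        for (int i = 0; i < n; ++i) {  // every thread runs the whole i loop
            const int lo = std::max(0, i - r);
            const int hi = std::min(n, i + 1 + r);

            #pragma omp for  // iterations are shared out; implicit barrier at the end
            for (int j = lo; j < i; ++j)
                v[j] = L[at(i, j)] * d[j];

            #pragma omp single  // executed by one thread; implicit barrier at the end
            {
                double s = A[at(i, i)];
                for (int j = lo; j < i; ++j)
                    s -= L[at(i, j)] * v[j];
                d[i] = s;
                L[at(i, i)] = 1.0;
            }

            #pragma omp for  // column i of L below the diagonal, inside the band
            for (int j = i + 1; j < hi; ++j) {
                double s = A[at(j, i)];
                for (int k = std::max(0, j - r); k < i; ++k)
                    s -= L[at(j, k)] * v[k];
                L[at(j, i)] = s / d[i];
            }
        }
    }
}
```

Compiled without an OpenMP flag (for example without -fopenmp in GCC) the pragmas are ignored and the function runs serially, which makes it easy to compare the results of the two builds.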
Another issue:
Suppose that I have a function func(a,b), 2 processors (e.g. an Intel Core2Duo), and code, called about 1e20 times, that looks like this:
{double t1=func(a,b);
double t2=func(b,c);
}
I can make this code parallel like this
{double t1=0,t2=0;
#pragma omp parallel sections
{
#pragma omp section
t1=func(a,b);
#pragma omp section
t2=func(b,c);
}
}
but I think that if my function is as simple as this
{ return a+b; }
then I will get no speedup at all, because the processor will spend most of its time on "fork-join" operations rather than on actual work. Is it possible to estimate the time needed for a "fork-join" operation, for later comparison?
Is there any criterion saying that a function is "worth" parallelizing?
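As a rough illustration of what I mean by estimating the cost, the sketch below times many nearly empty parallel regions and divides by their number. The names fork_join_seconds and worth_parallelizing are hypothetical, and the criterion is only an assumed simple model (parallel time is roughly work/threads plus overhead), not an established rule. Each region does one atomic increment so that the compiler cannot discard it, which means the measured figure includes that increment.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock seconds per (nearly) empty
// parallel region, i.e. an estimate of one fork-join cycle.
double fork_join_seconds(int repetitions)
{
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;
    long counter = 0;  // touched in every region so it is not optimized away

    const auto start = clock::now();
    for (int rep = 0; rep < repetitions; ++rep) {
        #pragma omp parallel
        {
            #pragma omp atomic
            ++counter;
        }
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;

    if (counter < repetitions) return -1.0;  // sanity check: every region ran
    return elapsed.count() / repetitions;
}

// Assumed model: serial time = work, parallel time = work/threads + overhead.
// Parallelizing pays off only if work/threads + overhead < work.
bool worth_parallelizing(double work_seconds, double overhead_seconds, int threads)
{
    if (threads <= 1) return false;
    return work_seconds > overhead_seconds * threads / (threads - 1.0);
}
```

For example, one could call fork_join_seconds(100000) once at start-up, time a single serial call of the candidate code in the same way, and pass both numbers to worth_parallelizing together with the number of cores. Built without an OpenMP flag the regions run serially, so the measured overhead is then close to zero and says nothing about threading.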
Thanks in advance
P.S.
Unfortunately I cannot use MKL or a similar package for the LDL decomposition because of the matrix size. I am limited to an ordinary computer (e.g. an Intel Core2Duo with about 2 GB of RAM), and a matrix [A] of dimensions N x r = 1e7 x 1e3 gives 1e10 doubles, about 80 GB (roughly 75 GiB), which can only be stored on the HDD. So I am forced to use my own algorithms, working with matrix blocks loaded from the HDD into RAM. (I have already contacted Intel, and they said they cannot offer me anything useful for this problem.)
--
Best regards,
Vladimir Prokopov
_______________________________________________ Omp mailing list Omp@... http://openmp.org/mailman/listinfo/omp