6 Dec 2007 13:43

question on OpenMP usage in LDL decomposition and time estimation of "fork-join" process

Hello,

I tried to use OpenMP to speed up my LDL factorization algorithm, but I got only about +8% in speed (I have an Intel Core2Duo processor, and CPU utilization reaches only about 75%). I'm solving the matrix equation [A]*{x}={f}, where the matrix [A] is positive definite, banded, and symmetric (N is the matrix dimension, r the band width). My code looks like:

for (int i = 0; i < N; i++)
{
    ...
    #pragma omp parallel for
    for (int j = max(0, i - r); j < i; j++)
    { ... }
    ...
    #pragma omp parallel for
    for (int j = i + 1; j < min(N, i + 1 + r); j++)
    {
        ...
        for (int k = max(0, j - r); k < j; k++)
        { ... }
    } // end for j
    ...
} // end for i

And the main problem here is that I cannot make the outer loop parallel, because each (i+1)-th iteration uses results from the i-th iteration; furthermore r << N (N can be from 1e6 to 1e10, and N/r from 100 to 10000). I suppose that in this case the "fork-join" procedures are the bottleneck, because they are executed too often.
Can you help me with this problem?
And is there some kind of "approach" to such problems?
Another issue:

Suppose that I have a function func1(a,b), 2 processors (like an Intel Core2Duo), and code, called about 1e20 times, that looks like:

{
    double t1 = func1(a, b);
    double t2 = func1(b, c);
}
I can make this code parallel like this:

{
    double t1 = 0, t2 = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        t1 = func1(a, b);
        #pragma omp section
        t2 = func1(b, c);
    }
}

but I think that if my function is like this:

double func1(double a, double b)
{ return a + b; }

then I will get no speedup at all, because the processor will spend most of its time on "fork-join" operations rather than on actual work. Is it possible to estimate the time needed for the "fork-join" operations, for later comparison?
Is there any criterion saying that a function is "worth" parallelizing?

Thanks in advance

P.S.
Unfortunately I cannot use MKL or a similar package for the LDL decomposition, because of the matrix size. I'm limited to an ordinary computer (an Intel Core2Duo with about 2 GB of RAM), and the dimensions of [A], N x r = 1e7 x 1e3, give 1e10 doubles, roughly 75 GB, which can only be stored on the HDD - so I'm forced to use my own algorithms, working with matrix blocks loaded from the HDD into RAM. (I already contacted Intel and they said that they cannot offer me anything useful for this problem.)
--
Best regards,
Vladimir Prokopov
```_______________________________________________
Omp mailing list
Omp@...
http://openmp.org/mailman/listinfo/omp
```
21 Nov 2007 22:00

nested parallelization with dynamically allocated memory

http://openmp.org/pipermail/omp/2007/001107.html
> > #pragma omp parallel private(row,col,i)
> > {
> >   #pragma omp for schedule(static)
> >   for (row=0; row<N; row++)
> >   {
> >     #pragma omp parallel shared(MA,MB,MC,N)
> >     {
> >       #pragma omp for schedule(static)
> >       for (col=0; col<N; col++)
> >         for (i=0; i<N; i++)
> >           MC[row][col] = MA[row][i] * MB[i][col];
> >     }
> >   }
> > }
>
> you were absolutely right.
> Variable "row" had to be declared as "shared".
> Now it seems (on the first glance) to work right.

Is this really correct? The innermost region will already have "row" shared by default. But it will also have "i" shared by default, which would seem to be a problem.

E.g., if I compile with the Sun Studio C compiler -- e.g., http://developers.sun.com/sunstudio/downloads/index.jsp -- using "cc -xopenmp -xvpara" (the first switch turns on OpenMP compilation and the second shows verbose parallelization warnings), I get a message that indicates:

    "a.c", line 47: Warning: inappropriate scoping
        variable 'i' may be scoped inappropriately as 'shared'
        . read at line 51 and write at line 51 may cause data race

(My line numbers may not be the same as yours since I replaced the "..." in the original code to make it compilable.)

Anyhow, the warning points to the innermost "parallel" directive. It notes that 'i' is (implicitly) scoped as shared, which causes problems on the innermost loop. We can see that 'i' should be private.
19 Nov 2007 14:34

Re: nested parallelization with dynamically allocated memory

```Dear Uwe,

you were absolutely right.
Variable "row" had to be declared as "shared".
Now it seems (on the first glance) to work right.
This is the nested parallel section as I use it now:

...
#pragma omp parallel private(row, col, i, N)
{
  #pragma omp for schedule(dynamic)
  for (row=0; row<N; row++)
  {
    #pragma omp parallel shared(N, MA, MB, MC, row)
    {
      #pragma omp for schedule(dynamic)
      for (col=0; col<N; col++)
        for (i=0; i<N; i++)
          MC[row][col] = MA[row][i] * MB[i][col];
    }
  }
}
...

Thank you very much for your fast response, it really helped me
out of a big dilemma =')

Best regards from Stuttgart,

Dan

> -----Original Message-----
> From: <U.Ruetjes@...>
> Sent: 19.11.07 12:20:27
> To: Popovic Dan <danpopovic@...>
> Subject: Re: [Omp] nested parallelization with dynamically allocated memory

>
> Hello all,
>
> I am not quite sure if I'm right on this; I am also new to OMP. But I
> think that the nested parallel region does not know the variable 'row'. In
> my opinion it should be SHARED or at least in a COPYIN clause.
>
> Best regards,
> Uwe.
>
> --
> Dipl.-Ing. Uwe Ruetjes
> Werkzeugmaschinenlabor (WZL)
> RWTH-Aachen
> Steinbachstrasse 19
> D-52056 Aachen
>
> Tel.: +49-(0)241-80-26294
> Fax: +49-(0)241-80-22293
>
>
>
> Popovic Dan <danpopovic@...>
> Sent by: omp-bounces@...
> 19.11.2007 11:12
>
> To
> omp@...
> Cc
>
> Subject
> [Omp] nested parallelization with dynamically allocated memory
>
>
>
>
>
>
> Dear all,
>
> I am relatively new to OpenMP and have a problem with nested
> parallelization of a matrix-matrix multiplication
> when the storage for the matrices (type double **) is allocated dynamically.
>
> The nested parallelization is done with an outer loop (one thread per row
> of the resulting matrix)
> and an inner one for the scalar products which have to be
> calculated in order to get
> one entry in the resulting matrix. (See program snippet below.)
>
> Unfortunately, I get segmentation fault errors in the inner nested
> parallel section.
> The serial version of the program works well, and if I allocate
> storage statically (something like double A[100][100]) instead of
> dynamically, the parallelization also seems to work.
>
> Where is my mistake, what did I understand wrong? Thanks in advance for
> your help.
> Best regards, Dan
>
> /* matrix-matrix-mult. with omp */
> long i, row, col, N; /* matrix indices, N: problem size */
>
> int main(int argc, char* argv[])
> {
>   /* matrices */
>   double **MA, **MB, **MC;
>   scanf("%ld", &N);
>
>   /* allocate storage for dbl. pointers for the rows */
>   MA = (double **) malloc(N*sizeof(double *));
>   MB = (double **) malloc(N*sizeof(double *));
>   MC = (double **) malloc(N*sizeof(double *));
>
>   /* storage for columns of the rows of MA, MB, MC */
>   for (row=0; row<N; row++)
>   {
>     MA[row] = (double *)malloc(N*sizeof(double));
>     MB[row] = (double *)malloc(N*sizeof(double));
>     MC[row] = (double *)malloc(N*sizeof(double));
>   }
>
>   /* initialization of A and B and C */
>   ...
>
>   /* introduce nested parallelization */
>   omp_set_nested(1);
>
>   #pragma omp parallel private(row,col,i)
>   {
>     #pragma omp for schedule(static)
>     for (row=0; row<N; row++)
>     {
>       #pragma omp parallel shared(MA,MB,MC,N)
>       {
>         #pragma omp for schedule(static)
>         for (col=0; col<N; col++)
>           for (i=0; i<N; i++)
>             MC[row][col] = MA[row][i] * MB[i][col];
>       }
>     }
>   }
>
>   /* free memory in reverse order, columns first (!) */
>   ...
>   return (0);
> }
>
> --
> Dan Popovic
> Klausenpfad 22; 69121 Heidelberg
> danpopovic@...; Tel. 06221/7282102 or 01743036428
> http://www.rzuser.uni-heidelberg.de/~dpopovi2/index.html

--
Dan Popovic
Klausenpfad 22; 69121 Heidelberg
danpopovic@...; Tel. 06221/7282102 or 01743036428
http://www.rzuser.uni-heidelberg.de/~dpopovi2/index.html
```
19 Nov 2007 11:12

nested parallelization with dynamically allocated memory

```Dear all,

I am relatively new to OpenMP and have a problem with nested parallelization of a matrix-matrix multiplication
when the storage for the matrices (type double **) is allocated dynamically.

The nested parallelization is done with an outer loop (one thread per row of the resulting matrix)
and an inner one for the scalar products which have to be calculated in order to get
one entry in the resulting matrix. (See program snippet below.)

Unfortunately, I get segmentation fault errors in the inner nested parallel section.
The serial version of the program works well, and if I allocate storage statically
(something like double A[100][100]) instead of dynamically, the parallelization
also seems to work.

Where is my mistake, what did I understand wrong? Thanks in advance for your help.
Best regards, Dan

/* matrix-matrix-mult. with omp */
long i, row, col, N; /* matrix indices, N: problem size */

int main(int argc, char* argv[])
{
/* matrices */
double **MA, **MB, **MC;
scanf("%ld", &N);

/* allocate storage for dbl. pointers for the rows */
MA = (double **) malloc(N*sizeof(double *));
MB = (double **) malloc(N*sizeof(double *));
MC = (double **) malloc(N*sizeof(double *));

/* storage for columns of the rows of MA, MB, MC */
for (row=0; row<N; row++)
{
MA[row] = (double *)malloc(N*sizeof(double));
MB[row] = (double *)malloc(N*sizeof(double));
MC[row] = (double *)malloc(N*sizeof(double));
}

/* initialization of A and B and C */
...

/* introduce nested parallelization */
omp_set_nested(1);

#pragma omp parallel private(row,col,i)
{
#pragma omp for schedule(static)
for (row=0; row<N; row++)
{
#pragma omp parallel shared(MA,MB,MC,N)
{
#pragma omp for schedule(static)
for (col=0; col<N; col++)
for (i=0; i<N; i++)
MC[row][col] = MA[row][i] * MB[i][col];
}
}
}

/* free memory in reverse order, columns first (!) */
...
return (0);
}

--
Dan Popovic
Klausenpfad 22; 69121 Heidelberg
danpopovic@...; Tel. 06221/7282102 or 01743036428
http://www.rzuser.uni-heidelberg.de/~dpopovi2/index.html
```
16 Nov 2007 13:17

writing to a threadprivate global variable

Hi

I have a question for the list members - apologies in advance if this is something trivial I could have found in the documentation:

I have some Fortran code which calls several subroutines that modify the same variable. These subroutine calls are within a do loop (that I want to parallelise). According to my understanding of OpenMP, if I make this variable global and threadprivate, then each thread will have its own copy of the variable and can write to that copy without causing any errors.

Is this understanding correct? If not, what is the rule of thumb for dealing with these situations?

Thanks in advance
Andrew Skeen
17 Nov 2007 10:39

allocatable memory and array in subroutine are not private - segfault issue

```I am having a problem with parallelizing some code. It seems to conform to
the spec but still causes errors.

I have a function, called from within a parallel do loop, which among other
things has several small arrays and some allocated and deallocated memory. The
spec seems to say that all that memory is private, but it seems that different
instances of that function (different threads) stomp on each other's memory.

I haven't managed to make a minimal example, since this apparently depends on
timing, but here are the relevant parts of the code (after cutting out what I
believe to be unrelated). This is run under Linux, shows the problem with both
the Intel compiler and gfortran, and runs on an Intel Core 2 Duo in 64-bit
mode.

There is one omp do loop directive near the start; the problematic function is EigenVecs.
There are two problems when using more than one thread:
1. The call to dsaupd (the ARPACK eigenvalue routine) tells me that the input values are
wrong (I don't know which one, because the error comes from further down). Not on the
first pass of the loop, but a bit later.
2. I get a segmentation fault at the two deallocate statements one line up from the end.
Removing them fixes the segfault, so some double deallocation is going on.

It seems to me that the arrays/allocatable objects are forced to be shared for some reason.

the code:
-----------

integer function PeronaMalikIteration(in, out, pprms, eprms)

real*8, allocatable :: v(:, :), d(:)

!$omp parallel shared(in, out, eprms, pprms) private(ret, col, row, v, d)

allocate(v(eprms%ldv, eprms%ncv), d(eprms%neigs))

!$omp do schedule(dynamic, 1)
do col = 1,n
  do row = 1,n
    ret = EigenVecs(in, v, d, pprms, eprms)
  end do
end do
!$omp end do nowait

deallocate(v, d)

!$omp end parallel

end function PeronaMalikIteration

! ------------------------------------------------------
! calculate eigenvectors
! ------------------------------------------------------
integer function EigenVecs(img, v, d, pprms, eprms)

type(PeronaParams), intent(IN) :: pprms
type(EigParams), intent(IN) :: eprms

real*8, intent(IN) :: img(eprms%windowSz, eprms%windowSz), &
v(eprms%ldv, eprms%ncv), d(eprms%neigs)

! Algorithm parameters
! Relative Tolerance, set to default - machine precision
real*8 :: tol = 0.0d0
real*8 :: sigma = 0.0d0

! Algorithm variables
integer :: ido, lworkl, info
integer, dimension(11) :: iparam, ipntr
real*8, allocatable, dimension(:) :: resid, workd, workl
!DEC$ ATTRIBUTES ALIGN: 16 :: resid, workd, workl
logical :: rvec = .true.
logical, allocatable, dimension(:) :: select
character(len=1) :: B = 'I'
character(len=2) :: which = 'LA'

! Set parameter values
! Default return, no error
EigenVecs = 0
! Initial parameters to the dsaupd routine
ido = 0
info = 0
! (/ ISHIFT, -, MXITER, NB, NCONV, -, MODE, NP, NUMOP, NUMOPB, NUMREO /)
iparam = (/ 1 , 0, 1000*eprms%neigs, 1 , 0 , 0, 1 , 0 , 0 , 0 , 0 /)
! Workspace
lworkl = eprms%ncv*(eprms%ncv + 8)
allocate(resid(eprms%n), workd(3*eprms%n), workl(lworkl))

do
call dsaupd(ido, B, eprms%n, which, eprms%neigs, tol, resid, &
eprms%ncv, v, eprms%ldv, iparam, ipntr, workd, workl, lworkl, info)

select case (ido)
case (-1,1)
! multiply y <- OP*x
! where x = workd(ipntr(1))
! y = workd(ipntr(2))
call PeronaMalikResponse(workd(ipntr(1)), workd(ipntr(2)), img, &
eprms, pprms)
case default
exit
end select
end do

! Error !!!
if (info /= 0) then
EigenVecs = info
goto 99999
end if

allocate(select(eprms%ncv))

call dseupd(rvec, 'A', select, d, v, eprms%ldv, sigma, B , eprms%n, &
which, eprms%neigs, tol, resid, eprms%ncv, v, eprms%ldv, iparam, &
ipntr, workd, workl, lworkl, info)

deallocate(select)
99999 deallocate(resid, workd, workl)
end function EigenVecs
```
12 Nov 2007 21:18

Fortran sequential loop iteration vars

```Hi!

Is

SUBROUTINE foo(a, b, n)
DOUBLE PRECISION a, b
INTEGER*8 i1, i2, i3, n
DIMENSION a(n,n,n), b(n,n,n)
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(I3)
!$OMP DO
!$OMP+LASTPRIVATE(I1,I2)
DO i3 = 2, n-1, 1
DO i2 = 2, n-1, 1
DO i1 = 2, n-1, 1
a(i1, i2, i3) = b(i1, i2, i3);
600    CONTINUE
ENDDO
ENDDO
ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
RETURN
END

valid?  My reading of the standard is that it is not, because both I1
and I2 are sequential loop iteration variables in a parallel construct
and as such should be predetermined private rather than implicitly
determined shared (OpenMP 2.5, 2.8.1.1).  Neither is present in any of
the clauses on the parallel construct that could possibly override this.
The first restriction in 2.8.3.5, about the lastprivate clause, says
that the variables can't be private in the parallel region.
Several other compilers accept this code, though.

In the OpenMP 3.0 draft the wording is even clearer, because there it talks
about the loop iterators being predetermined private in a task region,
and !$omp do doesn't create a new task region.

Or am I wrong with this?

Thanks.

Jakub
```
