May, John | 22 May 2013 19:28

condor_store_cred not updating/adding password on Windows

I have a relatively small Condor pool of about 10 nodes that I use for various processing jobs at work.
For a while now, starting with Condor v7.4.4 and continuing through v7.8.8, I have been unable to get
"condor_store_cred" to reliably add or update a credential on Windows.  It doesn't matter whether it is a new
install or just a user changing their password: sometimes the credential gets updated and
sometimes it doesn't.  When I run the command, it replies "Operation Succeeded", but when I try to submit
a job, it says no credential is stored.  If I then check with "condor_store_cred query", it
also says there is no credential stored.  I can repeat the process over and over; it always says the
"operation has succeeded", but I can never submit anything because there is "no credential stored".
I've tried Windows XP, Windows XP x64 and Windows 7, on numerous workstations, with the same result.
It just seems to be hit or miss whether it works.  I would normally just keep trying until it worked, but some
workstations now refuse to work at all.  I have no idea what's going on, and nothing in the logs indicates a
problem either.

I can post more info if needed.

John M.
Systems Administrator

Michael Di Domenico | 22 May 2013 16:07

job exit hook timeout

i'm running a job exit hook to trap exit codes from my jobs.  the
behavior i see is that the exit hook is run and completes up to a
certain point.

This point seems to be fairly random.  My script does a bunch of
things, but where it tends to get stuck is that i'm using
condor_config_val to change some of the slot level classads on a
machine and then reconfig the startd

however, as i iterate over the slots it might get through slot 1,
slots 1-2, slots 1-5, etc. before stopping, and most of the time it
never gets to the reconfig

my question is, is there a watchdog mechanism that kills the script if
it runs too long?  the script can run for several seconds as it
updates the classads, but it always seems to just stop, as if it's
being killed
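
One way to shorten such a hook, sketched in Python under stated assumptions: queue every per-slot runtime setting first and trigger a single reconfig at the end, instead of reconfiguring once per slot. The macro names (`SLOT<n>_JOB_EXIT_STATE`), the value, and the slot count are hypothetical. The `condor_config_val -startd -rset` and `condor_reconfig -daemon startd` invocations should be checked against the installed version's manual, and runtime settings are assumed to be enabled on the startd. This does not settle whether a watchdog exists; it only reduces the time the hook needs.

```python
import subprocess


def build_commands(settings):
    """Return one condor_config_val call per setting, then a single reconfig."""
    commands = [
        ["condor_config_val", "-startd", "-rset", f"{name} = {value}"]
        for name, value in settings.items()
    ]
    # Reconfigure once, after all settings have been queued.
    commands.append(["condor_reconfig", "-daemon", "startd"])
    return commands


def run_commands(commands, timeout=30):
    """Run each command in order; raise on the first failure or timeout."""
    for cmd in commands:
        subprocess.run(cmd, check=True, timeout=timeout)


if __name__ == "__main__":
    # Hypothetical per-slot macros; substitute the real attribute names.
    settings = {f"SLOT{n}_JOB_EXIT_STATE": '"done"' for n in range(1, 6)}
    # Print the plan instead of executing it, so the sketch is safe to try.
    for cmd in build_commands(settings):
        print(" ".join(cmd))
```

Logging a timestamp before each command inside `run_commands` would also show how far the hook gets before it stops.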
Rabia Bashir | 20 May 2013 17:47

Job scheduling in a Pool


Hi,
I am using Condor 7.8.7, and it is working properly.
I have set up a Condor pool that contains two machines. One is a server, and the other one is a client.
Jobs submitted on each machine run on that machine only; they are never sent to the other node in the pool.
I guess there is a scheduling problem that I don't understand.

What should I do now?

Thanks!
Antonis Sergis | 20 May 2013 13:41

checkpointing the vanilla universe under windows

Dear all,
 
I am coming back to the hot “checkpointing the vanilla universe” issue under Windows. I have a Fortran 90 code whose runs can take a long time. For longer runs, Condor’s throughput drops significantly: jobs get interrupted by users, and since there is no native checkpointing and I cannot use the “standard” universe, the code has to restart from the beginning on a different machine. As a result, jobs seldom manage to finish.

I changed my source code to add a checkpointing feature. The code reads a “flag” file (which is also one of the initial input files) and creates a checkpoint file with all the data required to resume a job from where it left off. The flag file initially contains a “0”. The first checkpoint takes place after one hour of elapsed time, and further checkpoints follow every hour after that. At a checkpoint, the flag file is supposed to be updated to “1” and a “history” file is created holding the checkpoint data. The idea is that when the job gets evicted and restarts, the code will read the flag as “1” and then use the “history” file to load the last checkpoint and resume from where it left off.

This doesn’t seem to be working. I am not sure whether the flag file gets updated and re-read when the job restarts. I am also not sure whether Condor will make the “history” file available to the restarted job, since it was created as an output file and is not in the initial input files list.
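
The flag/history scheme described above can be sketched as follows. This is a Python illustration rather than the Fortran 90 code in question, and it makes several assumptions: the file names `flag` and `history` match the description, the state is a small JSON object, and a step counter stands in for the hourly elapsed-time trigger. Writing the history file before setting the flag means the flag should never point at a missing or half-written checkpoint.

```python
import json
import os

FLAG_FILE = "flag"        # "0" = fresh start, "1" = checkpoint available
HISTORY_FILE = "history"  # holds the most recent checkpoint data


def load_state(workdir="."):
    """Return the saved state if the flag says one exists, else a fresh state."""
    flag_path = os.path.join(workdir, FLAG_FILE)
    history_path = os.path.join(workdir, HISTORY_FILE)
    flag = "0"
    if os.path.exists(flag_path):
        with open(flag_path) as f:
            flag = f.read().strip()
    if flag == "1" and os.path.exists(history_path):
        with open(history_path) as f:
            return json.load(f)
    return {"step": 0}


def save_checkpoint(state, workdir="."):
    """Write the history file first, then set the flag to 1."""
    history_path = os.path.join(workdir, HISTORY_FILE)
    tmp_path = history_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, history_path)  # swap in the complete file in one step
    with open(os.path.join(workdir, FLAG_FILE), "w") as f:
        f.write("1")


def run(total_steps, checkpoint_every, workdir="."):
    """Resume from the last saved step and checkpoint at a fixed interval."""
    state = load_state(workdir)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, workdir)
    return state
```

For this to help after an eviction, both files have to be present in the job's working directory when it restarts, which is exactly the open question about file transfer raised above.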
 
Any ideas?
 
This is the current submit file I am using to accommodate the checkpoint function:
 
************************
************************
Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
initialdir = \\htcondor\htcondorjobs\\****\T2
transfer_input_files = mds.exe, input, flag
Universe = vanilla
Getenv = False
output = Test_cores.out
error = Test_cores.err
log = Test_cores.log
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_release = TRUE
Queue 250
************************
************************
 
Regards
Antonis
Usman Khan | 19 May 2013 19:37

Job Scheduling

Hello all!!
I have set up an HTCondor pool that consists of only two machines: one is the master
and the other is a normal node.
It is working fine; the only problem I'm facing is that jobs are not being
scheduled across the pool. Each node only executes its own jobs, and they
never move to the other node.
I don't understand where this problem comes from or how to solve it.

What should I do now?

Greetings.....
Leo Singer | 19 May 2013 19:21

Mailing list digest mode doesn't work?

Hi,

I am trying to set up my htcondor-users mailing list subscription to receive daily digests. I went to the
'membership configuration' page and selected the 'On' button for the 'Set Digest Mode' preference.
However, I continue to receive individual e-mails. I just visited the configuration page again and found
the 'Off' button active. I have tried this several times now.

Leo Singer
Graduate Student @ LIGO-Caltech
Mostafa.B | 19 May 2013 12:04

Cluster Multithreading

Hi,

I am interested in running a task on my PC and letting it use the threads available in the cluster. That is, from the point of view of the task running on my PC, the CPU cores of the cluster would appear as if they were cores of my PC!
(I don't know whether this is called cluster multithreading or not!)
Is such a thing possible with Condor?
If yes:
1. Is it also available on Windows, or is this another one of those amazing capabilities that is only possible on Linux?
2. How is this accomplished?

Any other suggestions?

Regards

-Mosy
Dan Shea | 17 May 2013 20:21

Jobs do not execute, they sit idle in the queue indefinitely

Hi,

I'm attempting to configure a test condor cluster.  I have 10 machines,
all running CentOS 6.4.
They are not configured with DNS records; they all have /etc/hosts files
that contain the relevant IP addresses for each node in the cluster.

I've configured the stable repo and used that to install the condor
software.
I then modified /etc/condor/condor_config so that the subnet these
machines reside on has write access.

A quick test showed everything was working and jobs would execute as
expected.
However, this was with the following condor_config.local entry on each
of the 10 nodes

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

I am now attempting to configure one node as a gatekeeper
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

And the other 9 nodes as execute-only nodes
DAEMON_LIST = MASTER, STARTD

After restarting services I now no longer see jobs executing. They sit
idle in the queue indefinitely.

[root@node00 condor]# condor_q

-- Submitter: node00 : <10.11.114.220:44213> : node00
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   mfs             5/17 13:41   0+00:00:00 I  0   0.0  myprog Example.2.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

condor_q -analyze is not much help

-- Submitter: node00 : <10.11.114.220:44213> : node00
---
002.000:  Request has not yet been considered by the matchmaker.

I did notice the following warning in the SchedLog

SchedLog:05/17/13 13:41:21 (pid:9037) WARNING: forward resolution of localhost.localdomain doesn't match 10.11.114.220!
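
A warning like this may indicate that the node's own hostname resolves through /etc/hosts to a loopback address rather than to its cluster address. A minimal Python sketch for checking what a name resolves to on a given node follows. The helper name `forward_resolution_matches` is hypothetical, and the injectable `resolver` parameter exists only so the logic can be exercised without touching the real resolver.

```python
import socket


def forward_resolution_matches(hostname, expected_ip,
                               resolver=socket.gethostbyname_ex):
    """Return True if expected_ip is among the addresses hostname resolves to."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        return False  # the name does not resolve at all
    return expected_ip in addresses
```

On node00, calling `forward_resolution_matches(socket.getfqdn(), "10.11.114.220")` would show whether the name the machine reports for itself maps back to the address from the log line.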

I also found this entry, which makes no sense to me since startd is not
set up to run on node00 in the local config.

SchedLog:05/17/13 13:56:21 (pid:9037) Can't find address for startd node00

The test job itself is from the tutorial here:
http://research.cs.wisc.edu/htcondor/tutorials/scotland-admin-tutorial-2003-10-23/scotland-admin-tutorial-2003-10-23.DEMO.html

Any assistance pointing me in the right direction is greatly appreciated.

Regards,
Dan Shea

-- 
Dan Shea - daniel_shea2@...
Senior Systems Administrator, West Quad Computing Group
Harvard Medical School
"Charlie was a chemist, But Charlie is no more. For what he thought was H2O, Was H2SO4."

Zhe Zhang | 17 May 2013 17:23

Dynamically determine the amount of extensible machine resource

Hi,

I know that HTCondor supports extensible machine resources. The post at http://spinningmatt.wordpress.com/2012/11/19/extensible-machine-resources/ explains this.

For example, if you want to advertise GPU as a resource, you can add the following in your config files:

MACHINE_RESOURCE_NAMES = GPU
MACHINE_RESOURCE_GPU = 2

SLOT_TYPE_1 = cpus=100%,auto
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

Now I have another resource, e.g. bandwidth, which I also want to advertise as a machine resource in this way. However, instead of hard-coding a specific bandwidth value on each machine, I want a general solution: on each machine, the bandwidth would be determined dynamically (I have a piece of code that does that) and the result put in the config file. Is there a way I can do that?
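
One possible shape for this, sketched in Python: a small script runs the measurement and prints the corresponding config line, in the same form as the GPU example above. The resource name `BANDWIDTH`, the unit (Mbps), and the placeholder `measure_bandwidth_mbps` are hypothetical. How the output reaches the configuration is also an assumption to verify against the installed version, for example by writing it to a local config file before the startd starts. `MACHINE_RESOURCE_NAMES` is assumed to list the new name in the static config.

```python
def measure_bandwidth_mbps():
    """Placeholder for the site-specific measurement; returns a fixed value."""
    return 1000


def render_config(resource_name, amount):
    """Format the config line that sets the amount of one machine resource."""
    return f"MACHINE_RESOURCE_{resource_name} = {int(amount)}\n"


if __name__ == "__main__":
    # Emit the line on stdout so it can be redirected into a config file.
    print(render_config("BANDWIDTH", measure_bandwidth_mbps()), end="")
```

Keeping the measurement and the formatting separate makes it easy to swap in the real bandwidth probe without touching the output format.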

Thanks.

Zhe

--
Zhe Zhang
Department of Computer Science and Engineering
University of Nebraska-Lincoln
Lincoln, NE, 68588
Israel Casas Lopez | 17 May 2013 14:21

Stork server does not start with condor

Hi Condor Team,

I've been using DAGMan and have now installed Stork.
The problem is that the Stork server simply does not start.
The installation itself went fine.

When I try to start the server by typing stork_server, it does not give any error, but the server never starts.
I checked with ps -ef | grep stork

Is there any reason for that?

I have already tried specifying the port and host, but it made no difference.

I have a pool of Condor machines running Ubuntu. I used apt-get install condor to install on all machines.

Is there any other configuration I'm missing?

Thank you in advance.

Israel

Leo Singer | 17 May 2013 02:45

building HTCondor 7.9.6 on Mac OS fails, _res_9_init undefined

Hi,

I am having some difficulty building HTCondor 7.9.6 on Mac OS. In fact, I am attempting to update the
MacPorts package. The build fails here:

Undefined symbols for architecture x86_64:
  "_res_9_init", referenced from:
      DaemonCore::refreshDNS()       in libcondor_utils_7_9_6.a(daemon_core.cpp.o)
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make[2]: *** [src/condor_shadow.V6.1/condor_shadow] Error 1

I'm no expert on CMake internals, but the file src/condor_shadow.V6.1/CMakeFiles/link.txt, which
seems to contain the linker command, does not contain a '-lresolv' flag. It seems like it should, because
this is where res_9_init is defined:

$ nm /usr/lib/libresolv.dylib | grep res_9_init
000000000000dc4c T _res_9_init

Is this a plausible diagnosis? If so, is there a workaround?
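
The manual check of link.txt described above can be repeated across the whole build tree with a short Python sketch. The function name `targets_missing_flag` is hypothetical. Not every target necessarily needs the flag, so the result is only a list of candidates to inspect, not a confirmed diagnosis.

```python
import os


def targets_missing_flag(build_dir, flag="-lresolv"):
    """Return the link.txt paths under build_dir whose command lacks flag."""
    missing = []
    for root, _dirs, files in os.walk(build_dir):
        if "link.txt" in files:
            path = os.path.join(root, "link.txt")
            with open(path) as f:
                # Compare whole whitespace-separated tokens, not substrings.
                if flag not in f.read().split():
                    missing.append(path)
    return sorted(missing)
```

Running it on the top-level build directory would show whether condor_shadow is the only affected target or whether every binary that links libcondor_utils lacks the flag.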

Thanks,
Leo Singer
Graduate Student @ LIGO-Caltech
