Monday, December 14, 2009

Advanced debugging of job submission to Swarm-Grid

These instructions are for advanced users of Swarm-Grid who want to track their submitted jobs.
(1) Checking the remote cluster
[Step 1] Log in to a machine that has Globus installed.
[Step 2] Get your proxy certificate:
myproxy-logon -s myproxy.teragrid.org -l your_user_name
[Step 3] Submit a request to the remote cluster:
globusrun -o -r grid-co.ncsa.teragrid.org/jobmanager '&(executable=/bin/ls)'

* If this step works, your remote cluster and the jobmanager on the cluster are working correctly.
* If myproxy-logon cannot be found, check that your Globus user environment is set up correctly:
source $GLOBUS_LOCATION/etc/globus-user-env.sh

(2) Checking the MySQL table directly
[Step 1] Log in to MySQL with the Swarm username and password:
mysql -u jobsub -p
[Step 2] Select the database:
use jobsubmission;
[Step 3] Check the debugging record for your job:
select * from SubmitRecord where TicketID="213655869" AND InternalID=200;
* As soon as a job reaches a stage, Swarm records the timestamp in the corresponding field. Therefore, if the value of a field is NULL, the job has not reached that stage yet.
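
If you are not sure which stage fields exist in SubmitRecord, you can list the columns and print your record vertically so NULL fields are easy to spot. A minimal sketch using the standard mysql client options with the same credentials as above:

mysql -u jobsub -p jobsubmission -e 'DESCRIBE SubmitRecord;'
mysql -u jobsub -p jobsubmission -e 'SELECT * FROM SubmitRecord WHERE TicketID="213655869" AND InternalID=200 \G'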

(3) Globus Temporary Files and Locations
your_home
--.globus
----g1
------h6.bigred.teragrid.iu.edu
-------- directories with the actual globus job numbers
-----------remote_io_url
-----------scheduler_loadleveler_job_script:
-----------stdout
-----------x509_up
-- Globus*** : temporary directory that keeps input, output, and other files. This is the directory your script will use as its current working directory.

* The temporary directory is usually deleted after the job is processed. The layout above is for Big Red; other TeraGrid machines have similar structures for temporary and unstaged files.
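
To see which temporary directories a recent job left behind, you can simply browse under ~/.globus. A quick sketch (the hostname directory will differ depending on the gatekeeper you used):

ls -lRt ~/.globus | head -40
find ~/.globus -name stdout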

Tuesday, May 12, 2009

Adding EC2 computing nodes to your Hadoop Cluster

If you want to add EC2 nodes as slave nodes of an existing Hadoop cluster:

Step1] Create an EC2 instance that has Hadoop installed. To find a public Hadoop image:
ec2-describe-images -x all | grep hadoop
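
Once you pick an AMI from that list, launch it with your keypair. A sketch with placeholder values (substitute the real AMI ID and your own keypair name):

ec2-run-instances ami-xxxxxxxx -k your-keypair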

Step2] Make sure that the Hadoop version is identical on all the machines in your cluster.

Step3] Generate a public key on your master machine:

ssh-keygen -t rsa

Your public key is then stored in:

.ssh/id_rsa.pub

Step4] Copy your public key (id_rsa.pub) to .ssh/authorized_keys on your EC2 instance. After that you can ssh to your instance without your EC2 keypair.
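
One way to do the copy, using the EC2 keypair you already use to log in (the keypair file name and instance hostname below are placeholders):

cat ~/.ssh/id_rsa.pub | ssh -i id_rsa-your-ec2-keypair root@ec2-111-222-333-444.compute-1.amazonaws.com 'cat >> ~/.ssh/authorized_keys'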

Step5] Add your new slave node to hadoop_loc/conf/slaves.
Mine looks like:

localhost
ec2-111-222-333-444.compute-1.amazonaws.com

Step6] Synchronize or copy your configuration files to the slave nodes (see the example after this list):
-hadoop-site.xml
-slaves
-master
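
For example, with rsync over ssh (hadoop_loc and the slave hostname are placeholders for your own paths and hosts):

rsync -av hadoop_loc/conf/ root@ec2-111-222-333-444.compute-1.amazonaws.com:hadoop_loc/conf/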

Step7] Format your namenode (if needed).

Step8] Start your cluster!
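
For a stock Hadoop installation of this era, these last two steps are typically run from hadoop_loc on the master node (a sketch; adjust the paths to your own installation):

bin/hadoop namenode -format
bin/start-all.sh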

Wednesday, March 18, 2009

Additional TeraGrid resources for Bioinformatics Software (CAP3)

Last week, I found three additional TeraGrid sites that work with Swarm. Please note that I have previously experienced some problems with ANL while staging out files, and Marlon informed me that Pople doesn't support community certificates.

NCSA-Abe
(1) grid_resource: gt2 grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs
(2) globusrsl: (jobtype=single)(count=1)
(3) Test cap3: success

Pople
(1) grid_resource: gt2 gram.pople.psc.teragrid.org:2119/jobmanager-pbs
(2) globusrsl: (jobtype=single)(count=1)
(3) Test cap3: success


ANL
(1) grid_resource: gt2 tg-grid.uc.teragrid.org:2119/jobmanager-pbs
(2) globusrsl: (jobtype=single)(count=1)
(3) Test cap3: success
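
For reference, these grid_resource and globusrsl values drop straight into a Condor-G submit description file. A minimal sketch for the Abe entry (the executable, arguments, and output file names are placeholders, not the actual Swarm configuration):

universe      = grid
grid_resource = gt2 grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs
globusrsl     = (jobtype=single)(count=1)
executable    = cap3
arguments     = input.fsa -p 95 -o 49 -t 100
output        = cap3.out
error         = cap3.err
log           = cap3.log
queue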

Friday, March 6, 2009

Being a nice client of the Inca service: monitoring Grid resources

If you are considering a monitoring mechanism for the TeraGrid resources, the Inca project from SDSC provides very useful information (http://inca.sdsc.edu/). With access to the Inca service, you can easily use the monitoring information from a lightweight software component. Thanks to the Inca team for hosting the service and for their very helpful support! It was a very pleasant experience.

If you are using TeraGrid resources, you might be interested in Inca's real-time monitoring testbed (http://www.teragridforum.org/mediawiki/index.php?title=Inca_Real-Time_Monitoring_Testbed). The Inca server runs the tests described in that document, and the results are available to clients through RESTful URLs, encoded in either XML or HTML.

However, to access the results, you have to create a query. There are pre-created queries available for particular projects, including the TeraGrid Resource Monitoring page in the TeraGrid user portal. That query includes the monitoring status of remote login to the TeraGrid clusters, the pre-WS gatekeeper, and the WS gatekeeper server. If you want to see the most recent test results encoded in XML:
http://inca.teragrid.org/inca/XML/kit-status-v1/portal
Or, as an HTML page:
http://inca.teragrid.org/inca/HTML/kit-status-v1/portal

For more detailed information:
http://inca.teragrid.org/inca/XML/kit-status-v1/portal/[clusterID]/prews-gram-batch
For example:
http://inca.teragrid.org/inca/XML/kit-status-v1/portal/sdsc-ia64/prews-gram-batch
You will see more detailed information about the SDSC IA64 cluster, such as testing intervals.
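
If you just want to fetch the XML from the command line before wiring it into your own code, any HTTP client will do, for example:

curl -s http://inca.teragrid.org/inca/XML/kit-status-v1/portal/sdsc-ia64/prews-gram-batch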

To create a query for your own purposes, you have to contact the administrator of the Inca service. Inca provides a nice user interface for the administrator to create new queries. After your query is created, you can access the results through:
http://inca.teragrid.org/inca/[XML|HTML]/kit-status-v1//

This output can easily be integrated into your software in a lightweight way, such as through an HTTP client. After downloading the information, we used a conventional Java DOM implementation; with a customized query, the response size is reasonable for DOM parsing.

Please note that if you access the Inca service hosted by SDSC, the testing interval cannot be changed.


Here is my test code:

import java.net.*;
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class Test {
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Fetch and parse the Inca summary feed directly from its RESTful URL.
            Document dom = db.parse("http://inca.teragrid.org/inca/XML/kit-status-v1/portal_summary");
            Element docEle = dom.getDocumentElement();
            NodeList nl = docEle.getElementsByTagName("row");

            System.out.println("There are " + nl.getLength() + " elements.");
            for (int i = 0; i < nl.getLength(); i++) {
                // Get each <row> element and drill down to the hostname inside its reportSummary.
                Element el = (Element) nl.item(i);
                NodeList sub1 = el.getElementsByTagName("reportSummary");
                NodeList sub2 = ((Element) sub1.item(0)).getElementsByTagName("hostname");
                String hostname = ((Element) sub2.item(0)).getFirstChild().getNodeValue();
                System.out.println("HostName " + i + ": " + hostname);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Friday, February 20, 2009

Store and Register your EC2 machine image

To retrieve your machine image later, it must be stored in S3. There are several steps required to do that.

Step 1) Copy your X.509 certificate and private key to your instance.
scp -i id_rsa-gsg-keypair pk-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem cert-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem root@domU-12-34-31-00-00-05.compute-1.amazonaws.com:/mnt

This is done from your local desktop. Your security files will then be stored in the /mnt directory of your instance.

Step 2) Now you have to bundle your instance with the command:
ec2-bundle-vol -d /mnt -k /mnt/pk-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem -c /mnt/cert-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem -u 495219933132 -r i386 -p sampleimage

This takes some time (several minutes). If it succeeds, you will see the bundled files under your /mnt directory: a bunch of files named sampleimage.*.

Step 3) Upload the bundled files to S3 from your instance (fill in your own bucket name and AWS access/secret keys):
ec2-upload-bundle -b your_s3_bucket -m /mnt/sampleimage.manifest.xml -a your_access_key_id -s your_secret_access_key

I had some trouble with this step. Although I have S3 access, the command complained that I didn't. Interestingly, after I created the bucket in S3 manually, it worked.
I use the Firefox extension recommended in an EC2 forum:
http://www.s3fox.net/
It is quite useful, and it would also be a great tool if I wanted to sync part of my disk.

Step 4) Then log out of your instance, and on your desktop run:
ec2-register your_s3_bucket/sampleimage.manifest.xml

Now you can see your machine image with:
ec2-describe-images -o self

Step 5) That's it! Whenever you want to launch your machine image again:
ec2-run-instances ami-5bae4b32

For the longer version of this tutorial, see:
http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/

Tuesday, February 17, 2009

Getting started with EC2

To get started with EC2, you have to create an account on the EC2 site.
After you successfully create your EC2 account on the Amazon site, there are several steps to actually use EC2 instances.

Step1) Create an X.509 certificate
Log in to your Amazon account and go to the AWS access identifiers page. In the X.509 certificate section, click Create New. Then download your certificate and store it safely on your desktop.
mkdir .ec2
cd .ec2
mv ~/Desktop/*.pem .

Step2) Download the EC2 command line tools: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=351&categoryID=88
Then unzip the files under the .ec2 directory.
mv ~/Desktop/ec2-api-tools.zip .
unzip ec2-api-tools.zip

Step3) Modify your shell startup file. In my .bashrc file, I added:
export EC2_HOME=~/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=~/.ec2/pk-YOURKEYNAME.pem
export EC2_CERT=~/.ec2/cert-YOURKEYNAME.pem

Step 4) Generate a keypair to ssh to your instance:
cd .ec2
ec2-add-keypair pstam-keypair > id_rsa-pstam-keypair
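
The private key is written into that file in plain text, and ssh will refuse to use a key file with open permissions, so restrict it before connecting:

chmod 600 id_rsa-pstam-keypair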

Step 5) Now pick an image for your new instance:
ec2-describe-images -o amazon

Step 6) Create an instance
ec2-run-instances ami-6138dd08 -k pstam-keypair

Step 7) Check the description of the instance
Launching an instance takes some time. If you see a valid URL for the instance in its description, it's ready to access.
To check the description:
ec2-describe-instances

Step 8) Open the ports for ssh and http connections
You have to open the ports on the instance that you want to expose to the outside:
ec2-authorize default -p 22
ec2-authorize default -p 80

After this you can ssh to your instance.

Step 9) SSH to your instance
ssh -i id_rsa-pstam-keypair root@ec2-XXX-XXX-XXX-XXX.z-2.compute-1.amazonaws.com

Step 10) Getting a static IP address
For your application, you sometimes need a static IP address for your instance.
First, allocate a static IP and then associate the address with your instance:
ec2-allocate-address
ec2-associate-address -i i-yourinstance XXX.XXX.XXX.XXX

Step 11) Now you can SSH to your instance like any other remote machine:
ssh root@XXX.XXX.XXX.XXX

Step 12) Terminate Instance
ec2-terminate-instances i-yourinstance

Wednesday, February 11, 2009

Recent Change in TG

Cobalt now requires (queue=) in the RSL string; it was optional previously. If you don't specify either
standard or industrial, Globus will throw an exception:
Globus error 37

Condor will also throw an exception:
Exec format error Code 6 Subcode 8
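
For example, an RSL string that used to be (jobtype=single)(count=1) now needs the queue attribute as well, using one of the queue names above, something like:

globusrsl: (jobtype=single)(count=1)(queue=standard)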

Submitting job to Ranger

Gatekeeper : gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge
Batch queue system: SGE
Check the queue status: showq -u


(Test 0) Try the locally installed executable with data files stored in the scratch directory:
./cap3 /scratch/00891/tg459282/2mil/cluster1807.fsa -p 95 -o 49 -t 100

(Test 1) Try a simple job submission through the Globus Toolkit:
globusrun -o -r gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge '&(executable=/bin/ls)'

It works with Swarm!

Monday, February 9, 2009

Submit and Manage Vanilla Condor Job to Swarm

To submit vanilla Condor job(s) to Swarm, first you have to install Swarm and configure the service to process vanilla Condor jobs. The source code is available from:
svn co https://ogce.svn.sourceforge.net/svnroot/ogce/ogce-services-incubator

For general information on installing Swarm, please refer to:
http://decemberstorm.blogspot.com/2008/11/swarm-sever-installation-guide.html

Now, configure your Swarm for the vanilla Condor pool.
Step 1. Modify swarm/TGResource.properties to process only the vanilla Condor pool. These are my properties:
matchmaking=FirstAvailableResource
taskQueueMaxSize = 100
taskQueueScanningInterval = 3000
condorCluster=true
teragridHPC=false
eucalyptus=false
condorCluster_numberOfToken=10
condorRefreshInterval = 2000

Step 2. Start the Swarm server.
Upload swarm/build/jobsub.aar to your Axis2 installation.
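
Deploying the .aar is just a copy into the Axis2 services directory. A sketch assuming a standalone Axis2 layout ($AXIS2_HOME is wherever Axis2 is unpacked; a Tomcat deployment uses webapps/axis2/WEB-INF/services instead):

cp swarm/build/jobsub.aar $AXIS2_HOME/repository/services/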

Step 3. A client example for vanilla Condor jobs is stored at:
swarm/src/core/org/ogce/jobsub/clients/SubmitVanillaJob.java

There are three example methods:
submitJobWithStandardOutput(): job with standard output, without input files
submitJobWithOutputTransfer(): job with standard output and output files
submitJobWithInputOutputTransfer(): job with standard output, output files, and local input files

Step 4. Compile the example:
[swarm]ant clean
[swarm]ant

Step 5. Run the example:
[swarm]./run.sh submit_vanilla_job

Step 6. Check the status:
[swarm]cd ClientKit
[ClientKit]./swarm status http://serverIP:8080/axis2/services/Swarm your_ticket_ID

Step 7. Get the location of the output file:
[ClientKit]./swarm outputURL http://serverIP:8080/axis2/services/Swarm your_ticket_ID
This will return the URL of the output file.