High Performance Computing: Download and prepare data in batch mode

Over time, I have needed to manipulate a lot of data on a Linux cluster. Some of these manipulations actually read and write data, whereas others are essentially file system operations, such as downloading files.
Here I present a list of such operations suitable for HPC, using the PBS job approach whenever possible.
I do not attempt to include all possible methods, only the ones that I find useful and that can be prepared in seconds.
An efficient way to download MODIS-like data on an HPC system is recursive wget with accept and reject filters:
wget -r --no-parent -R "index.html*" --retr-symlinks -A "*.nc" ftp-url
wget -r --no-parent -R "index.html*" -A "MOD17A2.A2000*.hdf" -A "MOD17A2.A2000*.xml" http-url
wget -r --no-parent -R "index.html*" -A "MOD17*.hdf" -A "MOD17*.xml" http-url
By changing the accept pattern (-A), you can set up filters for file type, year, and granule ID.
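
The accept pattern can also be assembled from variables, which makes it easier to switch product, year, or tile. The following is only a minimal sketch: it assumes MODIS-style file names of the form PRODUCT.AYYYYDDD.hHHvVV.VERSION.PRODUCTIONDATE.ext, the function name build_accept_pattern and the tile h08v05 are hypothetical, and http-url is the same placeholder used above. It prints the wget command instead of running it, so the pattern can be checked first.

```shell
#!/bin/bash
# Build a wget accept pattern from product, year, tile, and extension.
# Assumes MODIS-style names: PRODUCT.AYYYYDDD.hHHvVV.VERSION.PRODDATE.ext
build_accept_pattern() {
    local product="$1" year="$2" tile="$3" ext="$4"
    printf '%s\n' "${product}.A${year}*.${tile}.*.${ext}"
}

# Hypothetical values; replace with your own product, year, and tile.
pattern=$(build_accept_pattern MOD17A2 2000 h08v05 hdf)

# Print the command rather than executing it (http-url is a placeholder).
echo wget -r --no-parent -R '"index.html*"' -A "\"$pattern\"" http-url
```

Once the printed command looks right, it can be pasted into a PBS script.
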
A live example:
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l naccesspolicy=singleuser
#PBS -l walltime=40:00:00
#PBS -M your email address
#PBS -m ae
#PBS -N download
#PBS -q standby
wget -r --no-parent -R "index.html*" --retr-symlinks -A "*.tar" ftp://somwhere
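
When the download is split by year, one job script per year can be generated in a loop. This is a rough sketch under several assumptions: the function name write_download_jobs is hypothetical, the MOD17A2 accept pattern and the http-url placeholder are borrowed from the examples above, and the scripts are written to a temporary directory so that the sketch is safe to run anywhere. It only writes the scripts; it does not submit them.

```shell
#!/bin/bash
# Write one PBS download script per year into an output directory.
# Arguments: output directory, then one or more years.
write_download_jobs() {
    local outdir="$1"; shift
    mkdir -p "$outdir"
    local year
    for year in "$@"; do
        # Unquoted EOF: ${year} is expanded now, \$PBS_O_WORKDIR at job run time.
        cat > "$outdir/download_${year}.pbs" <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=40:00:00
#PBS -N download_${year}
#PBS -q standby
cd "\$PBS_O_WORKDIR"
wget -r --no-parent -R "index.html*" -A "MOD17A2.A${year}*.hdf" http-url
EOF
    done
}

# Hypothetical usage: generate scripts for three years, then list them.
jobdir=$(mktemp -d)
write_download_jobs "$jobdir" 2000 2001 2002
ls "$jobdir"
```

After checking the generated files, each one can be submitted with qsub, for example in a loop over "$jobdir"/*.pbs.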

Compress and extract 

# Extract every tar file found in each immediate subdirectory
for dir in */; do
    echo "$dir"
    (
        cd "$dir" || exit
        for archive in *.tar; do
            [ -e "$archive" ] && tar xf "$archive"
        done
    )
done
# Decompress every .gz file in the current directory
for file in *.gz; do
    [ -e "$file" ] || continue
    gunzip "$file"
done
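
If a directory holds a mix of archive types, the two loops above can be folded into one helper that picks the tool from the file extension. This is a sketch rather than a complete solution: the function names extract_any and extract_all are hypothetical, and only .tar, .tar.gz/.tgz, and plain .gz files are handled.

```shell
#!/bin/bash
# Unpack one archive in the current directory, choosing the tool by extension.
extract_any() {
    local file="$1"
    case "$file" in
        *.tar.gz|*.tgz) tar xzf "$file" ;;
        *.tar)          tar xf "$file" ;;
        *.gz)           gunzip "$file" ;;
        *)              echo "skipping $file: unknown archive type" >&2; return 1 ;;
    esac
}

# Unpack every supported archive found directly inside the given directory.
extract_all() {
    local dir="$1" file
    # *.gz also covers *.tar.gz, so each archive is visited once.
    for file in "$dir"/*.tar "$dir"/*.tgz "$dir"/*.gz; do
        [ -e "$file" ] || continue   # skip patterns that matched nothing
        ( cd "$dir" && extract_any "${file##*/}" )
    done
}
```

A call such as extract_all /path/to/downloads (a placeholder path) would then unpack everything in that directory. Note that gunzip removes the .gz file after decompressing it, while tar leaves the archive in place.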


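# Search recursively for a whole-word pattern, printing file names and line numbers
grep -rnw '/path/to/somewhere/' -e "pattern"
# Find entries in the current directory (not deeper) whose names contain a string
find . -maxdepth 1 -name "*string*" -print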

# Save both standard output and standard error of a build to a file (bash syntax)
make &> results.txt
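
The &> operator is a bash shorthand. In a plain sh script the same effect is obtained by redirecting standard output to the file and then pointing standard error at standard output. The sketch below illustrates this with a hypothetical stand-in function, fake_build, in place of make, and a temporary log file instead of results.txt.

```shell
#!/bin/sh
# Stand-in for a build command: writes one line to stdout and one to stderr.
fake_build() {
    echo "compiling main.cpp"
    echo "warning: unused variable" >&2
}

log=$(mktemp)

# Portable form of '&>': send stdout to the file, then point stderr at stdout.
fake_build > "$log" 2>&1

# Both lines end up in the same log file.
cat "$log"
```

With make, the portable form would be make > results.txt 2>&1. To watch the output on screen while also saving it, make 2>&1 | tee results.txt is a common alternative.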


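# Count lines in all .cpp files below the current directory (safe for names with spaces)
find . -name '*.cpp' -print0 | xargs -0 wc -l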


# Request an interactive session with X forwarding: 1 node, 20 cores, 4 hours
qsub -I -X -l nodes=1:ppn=20 -l walltime=04:00:00 -q boss

By combining the bash scripts above and swapping in your own commands, most file-system-related tasks can be handled. I will add more related scripts later.


