Using the Linux Command Line to Find and Copy a Large Number of Files from a Large Archive, Preserving Metadata

One of my recent challenges was to go through an archive on a NAS, find all of the .xlsx files, and copy them to a specified folder while preserving as much of the file metadata (date created, folder tree, etc.) as possible. After the copy, another script will rename the files using that metadata, and an application that relies on the file names will then process them.

The part I want to share here is finding the files and copying them to a folder with the metadata preserved. This is where the power of the find utility comes in handy.

Since this is a huge archive, I want to produce a list of the files first, so that I can break the job into two steps. I start by running a find command on the volume called data, which I have mounted in my Volumes folder, and writing the resulting list to a text file.

find /Volumes/data/archive/2012 -name '*.xlsx' > ~/archive/2012_files.txt
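
The command above matches only the lowercase extension, and it would also match a directory whose name happens to end in .xlsx. As a hedged variation, the sketch below restricts matches to regular files and ignores letter case. The function name list_xlsx is hypothetical, and -iname is a GNU/BSD find extension rather than POSIX, so treat it as an assumption about your platform.

```shell
# Hypothetical helper: list regular files ending in .xlsx (any letter case)
# under the directory given as $1.
# -iname is a GNU/BSD find extension, not part of POSIX.
list_xlsx() {
    find "$1" -type f -iname '*.xlsx'
}
```

It could be used in place of the plain find call, for example: list_xlsx /Volumes/data/archive/2012 > ~/archive/2012_files.txt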

Now that the list is saved in a text file, I want to copy the listed files to my archive folder, preserving the file metadata and path information. The cpio utility reads the paths of the files to copy from stdin, then copies them to my archive folder. In pass-through mode (-p), -d creates leading directories as needed, -m preserves modification times, and -v lists each file as it is copied.

cpio -pvdm ~/archive < ~/archive/2012_files.txt
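
With an archive this large, it may be worth checking afterwards that every listed file actually arrived. The sketch below is one way to do that, not part of the original procedure. The function name report_missing is hypothetical, and it assumes cpio recreated each absolute source path beneath the destination folder.

```shell
# Hypothetical helper: print every path from the list file ($1) that has no
# counterpart under the destination directory ($2).
# Assumes cpio recreated each absolute source path beneath the destination.
report_missing() {
    list_file=$1
    dest_dir=$2
    while IFS= read -r path; do
        # Each path starts with "/", so it can be appended to the destination.
        [ -e "$dest_dir$path" ] || printf '%s\n' "$path"
    done < "$list_file"
}
```

Running report_missing ~/archive/2012_files.txt ~/archive should print nothing if the copy was complete.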

Explicitly Setting log4j Configuration File Location

I recently ran into an issue where an existing log4j.xml configuration file was built into a jar file I was referencing, and I was unable to get Java to recognize another file that I wanted it to use instead. Fortunately, the solution to this problem is fairly straightforward.

I was running a standalone application on Linux via a bash shell script, but this technique can be used in other ways too. You simply add a parameter to the JVM call.

So the syntax is basically:

java -Dlog4j.configuration="file:<full path to file>" -cp <classpath settings> <fully qualified name of the class containing main>

Let's say I have a file named log4j.xml in /opt/tools/myapp/ which I want to use when my application runs, instead of any existing log4j.xml files. This can be done by passing the JVM flag -Dlog4j.configuration to Java.

Here is an example:

java -Dlog4j.configuration="file:/opt/tools/myapp/log4j.xml" -cp "$CLASSPATH" my.standalone.mainClass

With that change, as long as your log4j file is set up properly, your problems should be behind you.
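
Since the launch happens from a shell script, a small guard can catch a mistyped path before the JVM starts. The following is only a sketch. The function name log4j_flag is hypothetical, and it assumes log4j 1.x, which reads the log4j.configuration property (Log4j 2 uses log4j.configurationFile instead). With log4j 1.x, adding -Dlog4j.debug=true should also make log4j print which configuration file it loaded.

```shell
# Hypothetical helper: print the JVM flag for a log4j configuration file,
# or fail if the file cannot be read. $1 must be an absolute path.
log4j_flag() {
    config=$1
    if [ ! -r "$config" ]; then
        echo "log4j config not readable: $config" 1>&2
        return 1
    fi
    printf '%s\n' "-Dlog4j.configuration=file:$config"
}
```

A launcher could then run: flag=$(log4j_flag /opt/tools/myapp/log4j.xml) && java "$flag" -cp "$CLASSPATH" my.standalone.mainClass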

Fixing Performance Problems on Your JBoss Web Apps By Diagnosing Blocked Thread Issues

I was once perplexed by a bizarre performance issue that I encountered at seemingly random intervals in an application I help to maintain. The application kept freezing up without any log messages to use for diagnosis. This was very frustrating, because it meant the application server typically had to be restarted manually to restore service.

After a bit of research, I learned that thread blocking is a potential cause of performance issues. Since I was fairly certain that the database was functioning within acceptable parameters and the server had ample CPU and memory to handle the load, I sought to determine whether thread blocking was the issue.

I started by simply running a twiddle command to dump the threads whenever this performance problem was reported. This showed that BLOCKED threads were indeed the cause.
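
As a rough illustration of that check, the sketch below counts how often the Java thread state BLOCKED appears in a thread dump. It assumes the dump has already been saved to a file, and the function name count_blocked is hypothetical. The -o option of grep is a GNU/BSD extension.

```shell
# Hypothetical helper: count occurrences of the word BLOCKED in a saved
# thread dump ($1). grep -o prints each match on its own line, so several
# matches on one line are all counted.
count_blocked() {
    grep -o 'BLOCKED' "$1" | wc -l | tr -d ' '
}
```

Comparing the count from a dump taken during a freeze with one taken during normal operation gives a quick indication of whether blocking is involved.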

Tar/GZip Files in One Operation, Unattached to the Terminal Session

When you’re trying to move a large block of files, it’s often useful to do so in one command and to be able to close your terminal window (or allow it to time out). If you run a command under normal circumstances, losing the connection can cause your command to terminate prematurely. This is where nohup (No HangUP, a utility which allows a process to continue even after a connection is lost) comes in.

Let’s say we have a large directory to back up, which we want to first tar, then gzip, keeping the command independent of the continuity of the terminal session.
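
A minimal sketch of that idea follows. It is not necessarily the exact command from the full post. The function name backup_detached and its three arguments are hypothetical, and it relies on tar's -z option to gzip in the same step.

```shell
# Hypothetical helper: tar and gzip directory $1 into archive $2 in the
# background, detached from the terminal, logging to $3. Prints the PID.
backup_detached() {
    src_dir=$1
    archive=$2
    log_file=$3
    # -C switches to the parent directory so the archive holds relative paths.
    # Output is redirected so nohup does not write to nohup.out.
    nohup tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" \
        > "$log_file" 2>&1 &
    echo $!
}
```

The printed PID can be checked later with ps to see whether the backup is still running, even from a new terminal session.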

Quick and Easy Regular Expression Command/Script to Run on Files in the Bash Shell

I often find it necessary to run regular expressions on not just one file, but a range of files. There are perhaps dozens of ways this can be done, requiring varying levels of understanding.

The simplest way I have encountered utilizes the following syntax:

perl -pi -e "s/<find string>/<replace with string>/g" <files to replace in>

Here is an example where I replace the IP address in a range of report templates with a different IP address:

perl -pi -e "s/mysql:\/\/<old-ip-address>/mysql:\/\/<new-ip-address>/g" "$reportTemplateLocation"/*.rpt*

Basically, I am looking for every line which contains mysql://<old-ip-address>, which I want to replace with mysql://<new-ip-address>.

Here is a bash script that I wrap around that command to accomplish the same task with more elegance:

#!/bin/bash
# @(#)$Id$
# Point the report templates to a different database IP address.
# Expects reportTemplateLocation to be set in the environment.
arg0=$(basename "$0")

error()
{
    echo "$arg0: $*" 1>&2
    exit 1
}

usage()
{
    echo "Usage $0 -o <old-ip-address> -n <new-ip-address>"
}

help() { usage; }    # assumed body: just prints the usage line
while getopts hvVo:n: flag
do
    case "$flag" in
    (h) help; exit 0;;
    (V) echo "$arg0: version 0.1 8/28/2010"; exit 0;;
    (v) vflag=1;;
    (o) oldip="$OPTARG";;
    (n) newip="$OPTARG";;
    (*) usage; exit 1;;
    esac
done
shift $((OPTIND - 1))
if [ "$oldip" = "" ]; then
    usage; exit 1
fi
if [ "$newip" = "" ]; then
    usage; exit 1
fi
echo "$0: Changing report templates to use the database at $newip from $oldip"
# \Q...\E makes Perl treat the dots in the old address literally.
perl -pi -e "s/mysql:\/\/\Q$oldip\E/mysql:\/\/$newip/g" "$reportTemplateLocation"/*.rpt*

Usage of the script is as simple as the command below. It will change every database reference in the report templates in the directory referenced by the variable reportTemplateLocation to the new value.

./<script-name> -o <old-ip-address> -n <new-ip-address>

A further improvement, which may be useful to some, would be to make the directory a flag which can be set at the command line.
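
One way this might look is sketched below. The -d flag, the function name parse_args, and the dir variable are all hypothetical. When -d is omitted, the sketch falls back to reportTemplateLocation.

```shell
# Hypothetical sketch: parse -d <directory>, -o <old-ip> and -n <new-ip>.
# -d defaults to $reportTemplateLocation when the flag is omitted.
# Returns non-zero if any of the three values ends up empty.
parse_args() {
    dir=${reportTemplateLocation:-}
    oldip=
    newip=
    OPTIND=1    # reset so the function can be called more than once
    while getopts d:o:n: flag; do
        case "$flag" in
        d) dir=$OPTARG;;
        o) oldip=$OPTARG;;
        n) newip=$OPTARG;;
        *) return 1;;
        esac
    done
    [ -n "$dir" ] && [ -n "$oldip" ] && [ -n "$newip" ]
}
```

After a successful parse_args "$@", the perl command would run against "$dir"/*.rpt* instead of the fixed location.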

Monitoring Process Counts and Alerting Via Email

Below is a simple script called monitor_jboss, which checks whether JBoss is running and whether too many instances are currently running. I found a need to write this script because we have some cron scripts which automatically restart JBoss each day, and the JBoss shutdown script itself sometimes fails to shut down properly, causing some quirky behavior.

If it determines that one of the following conditions is true, it sends a short email describing the problem to the address specified in the variable email.

  • JBoss is not running at all
  • JBoss has more than max instances running

This script is then placed in /etc/cron.d/cron.hourly/, where it will check the system once an hour and send an email as appropriate.


#!/bin/bash
# email addresses to send the message to
email="user@example.com"    # placeholder; the original value is not shown

# maximum number of concurrently running instances allowed
max=1                       # placeholder; the original value is not shown

# determine the number of running instances
count_running_jbosses=$(ps aux | grep jboss | grep -v grep | grep -v monitor_jboss | wc -l)

message=""
if [ "$count_running_jbosses" -eq 0 ]; then          # jboss isn't running
        message="JBoss Is Currently Not Running"
fi
if [ "$count_running_jbosses" -gt "$max" ]; then     # too many jboss instances running
        message="JBoss Is Currently Running $count_running_jbosses instances; the maximum is $max"
fi

# only send mail when there is something to report
if [ -n "$message" ]; then
        subject="JBOSS MONITORING ALERT FOR: $(hostname)"
        echo "$message" | /bin/mail -s "$subject" "$email"
fi
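
The two conditions can also be pulled out into a small function, which makes the decision easy to try out without touching ps or mail. This is only a sketch, and the function name jboss_alert_message is hypothetical. Note that inside [ ] a bare > is a redirection rather than a comparison, so the sketch uses -gt.

```shell
# Hypothetical helper: print an alert message for a given instance count ($1)
# and allowed maximum ($2); print nothing when the count is acceptable.
jboss_alert_message() {
    count=$1
    max=$2
    if [ "$count" -eq 0 ]; then
        echo "JBoss Is Currently Not Running"
    elif [ "$count" -gt "$max" ]; then
        echo "JBoss Is Currently Running $count instances; the maximum is $max"
    fi
}
```

The monitoring script would then send mail only when the function prints something.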