
Bash script, execution time problem

  • 08-05-2009 2:15pm
    #1
    Registered Users Posts: 528 ✭✭✭


    Hi Guys,

    I have a script (placed at the end of this post) that does a very simple calculation. The only problem is that it takes far too long to finish. The script deals with files that are quite large (~100,000 lines of data). What the script does for the data analysis is very simple scripting, but as I say, it's just too slow.


    #!/bin/bash

    #Time script was executed
    echo SCRIPT STARTED $(date) >> script_times

    #Remove rows selected line from Mobility.csv
    sed -i '/rows selected/ d' *_03.csv

    #Remove blank lines from SQL query output file
    sed '/^$/d' Mobility_01_03.csv > temp
    mv temp Mobility_temp.csv

    #Places all the MSISDNs into one file
    cat Mobility_temp.csv | cut -d"," -f2 > msisdn_ORIG;

    #Removes all duplicate MSISDNs
    sort -u msisdn_ORIG > msisdn_NO_DUPLICATE;



    ################################################################
    #### Place all the MSISDNs into an array with no duplicates ####
    ################################################################

    declare -a MSISDN_ARRAY
    declare -i i=0

    #open file for reading into the array
    exec 10<msisdn_NO_DUPLICATE

    while read line <&10; do
        MSISDN_ARRAY[$i]=$line
        ((i++))
    done

    echo the number of elements is: ${#MSISDN_ARRAY[@]}
    #echo ${MSISDN_ARRAY[@]}

    #close file
    exec 10>&-

    #####################################################################
    ############## Place all MSISDNs into separate files #################
    #####################################################################


    for ((i=0;i<${#MSISDN_ARRAY[@]};i++)); do

        grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f3 > "${MSISDN_ARRAY[$i]}"_Municipos.csv
        grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f4 > "${MSISDN_ARRAY[$i]}"_Celdas.csv

        sort -u "${MSISDN_ARRAY[$i]}"_Municipos.csv > temp_municipos
        mv temp_municipos "${MSISDN_ARRAY[$i]}"_Municipos.csv

        sort -u "${MSISDN_ARRAY[$i]}"_Celdas.csv > temp_celdas
        mv temp_celdas "${MSISDN_ARRAY[$i]}"_Celdas.csv

        echo "${MSISDN_ARRAY[$i]}",$(wc -l < "${MSISDN_ARRAY[$i]}"_Municipos.csv),$(wc -l < "${MSISDN_ARRAY[$i]}"_Celdas.csv) >> List.csv

        rm "${MSISDN_ARRAY[$i]}"_Municipos.csv "${MSISDN_ARRAY[$i]}"_Celdas.csv

    done

    echo SCRIPT Finished $(date) >> script_times

    exit


    The input file used in the script has the following layout:

    date, 8-digit number, name_of_location, name_of_location


    Any help would be greatly appreciated. Thanks in advance.


Comments

  • Closed Accounts Posts: 198 ✭✭sh_o


    Without stating the obvious, I would echo a lot more timestamps into the script_times file (or a separate file for debugging) so you can see exactly which part is taking so long, and then try to optimise that particular part.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    It's the amount of file I/O that is causing the long execution time. I have looked into associative arrays and hash tables to solve this problem, but I have never done anything like that in bash before and would appreciate it if anybody has good links to some examples of the above.
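For reference, bash 4 and later supports associative arrays via `declare -A`. A minimal, self-contained sketch of counting occurrences per key entirely in memory (the MSISDN values below are made up for illustration):

```shell
#!/bin/bash
# Sketch: count occurrences per key with a bash 4+ associative array.
# Everything stays in RAM; no temp files are written.

declare -A seen

for msisdn in 11111111 22222222 11111111; do
    # ${seen[$msisdn]:-0} defaults to 0 the first time a key appears
    seen[$msisdn]=$(( ${seen[$msisdn]:-0} + 1 ))
done

echo "unique MSISDNs: ${#seen[@]}"    # prints: unique MSISDNs: 2

# iteration order over keys is unspecified
for key in "${!seen[@]}"; do
    echo "$key appeared ${seen[$key]} times"
done
```

The same pattern would let you replace the per-MSISDN temp files in the original script with in-memory counters.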


  • Registered Users Posts: 1,916 ✭✭✭ronivek


    I don't have a whole lot of experience developing bash scripts; however, I have had the misfortune of working with a number of scripts which were resource hogs due to liberal usage of grep and cut.

    Unfortunately I can't help with your bash script directly, but my solution was to write a high-performance Python script to replace the block of bash that cut and grepped like there was no tomorrow. I'm sure there are any number of languages you could use in a similar manner while still driving the task from a bash script of sorts.

    Not what you're looking for, I know, but if it's an option for you it might be easier than attempting to optimise a script in a language you're not too familiar with.


  • Registered Users Posts: 157 ✭✭TeaServer


    It's the amount of file I/O that is causing the long execution time.

    You have pretty much answered your own question here. The script is doing loads of unnecessary file I/O. You should pipe the results of commands directly into sort -u. This will all happen in RAM and will speed things up massively. The simple rule for speed is to do as much as you can in RAM.

    1. Put the 2 sed operations into one sed invocation, using -e for each 'script':
    # sed -e 'script1' -e 'script2' file > output
    2. No need for the msisdn_ORIG file: just pipe directly into sort -u to create the no-duplicate file (this is done in RAM instead of an I/O write/read):
    # cat Mobility_temp.csv | cut -d"," -f2 | sort -u > msisdn_NO_DUPLICATE
    3. Similar thing here: you don't care what is in the files, just how many lines they have, so use command substitution instead of temp files:
    COUNT1=$(grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f3 | sort -u | wc -l)
    COUNT2=$(grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f4 | sort -u | wc -l)
    Then use the counts in the output. You could improve this further by pulling out the lines for each MSISDN once and then checking each field in that smaller file, instead of re-reading the big file every time...
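That last point can be pushed further still: a single awk pass over the CSV can produce the whole List.csv with no per-MSISDN grep at all. A rough sketch, assuming the field layout described in the original post (date, MSISDN, municipo, celda); a tiny made-up sample stands in for the real Mobility_temp.csv:

```shell
#!/bin/bash
# Sketch: build List.csv (msisdn,distinct_municipos,distinct_celdas)
# in one pass over the CSV, instead of grepping it once per MSISDN.
# A tiny made-up sample stands in for the real Mobility_temp.csv.
printf '%s\n' 'd1,11111111,A,X' 'd2,11111111,B,X' \
              'd3,22222222,A,Y' 'd4,11111111,A,Z' > Mobility_temp.csv

awk -F',' '
    !m[$2 FS $3]++ { municipos[$2]++ }  # first sighting of (msisdn,municipo)
    !c[$2 FS $4]++ { celdas[$2]++ }     # first sighting of (msisdn,celda)
    END {
        for (id in municipos)
            print id "," municipos[id] "," celdas[id]+0
    }
' Mobility_temp.csv | sort > List.csv

cat List.csv
```

On the sample above this yields one line per MSISDN, e.g. 11111111 with 2 distinct municipos and 2 distinct celdas. The whole file is read exactly once, which is where the big win over the grep-per-MSISDN loop comes from.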


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    ronivek wrote: »
    but my solution was to write a high performance Python script to replace the block of bash script which cut and grepped like there was no tomorrow.
    +1
    You could probably replace the entire bash script with a Perl or Python script that can handle a lot of the work in RAM.

