
Bash script, execution time problem

  • 08-05-2009 2:15pm
    #1
    Registered Users Posts: 528 ✭✭✭


    Hi Guys,

    I have a script (placed at the end of this post) that does a very simple calculation. The only problem is that it takes far too long to finish. The script deals with files that are quite large (~100,000 lines of data). What the script does for the data analysis is very simple scripting, but as I say, it's just too slow.


    #!/bin/bash

    #Time script was executed
    echo SCRIPT STARTED $(date) >> script_times

    #Remove rows selected line from Mobility.csv
    sed -i '/rows selected/ d' *_03.csv

    #Remove blank lines from SQL query output file
    sed '/^$/d' Mobility_01_03.csv > temp
    mv temp Mobility_temp.csv

    #Places all the MSISDNs into one file
    cat Mobility_temp.csv | cut -d"," -f2 > msisdn_ORIG;

    #Removes all duplicate MSISDNs
    sort -u msisdn_ORIG > msisdn_NO_DUPLICATE;



    ################################################################
    #### Place all the MSISDNs into an array with no duplicates ####
    ################################################################

    declare -a MSISDN_ARRAY
    declare -i i=0

    #open file for reading into the array
    exec 10<msisdn_NO_DUPLICATE

    while read line <&10; do
        MSISDN_ARRAY[$i]=$line
        ((i++))
    done

    echo the number of elements is: ${#MSISDN_ARRAY[@]}
    #echo ${MSISDN_ARRAY[@]}

    #close file
    exec 10>&-

    #####################################################################
    ############## Place all MSISDNs into separate files #################
    #####################################################################


    for ((i=0;i<${#MSISDN_ARRAY[@]};i++)); do

        grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f3 > "${MSISDN_ARRAY[$i]}"_Municipos.csv
        grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f4 > "${MSISDN_ARRAY[$i]}"_Celdas.csv

        sort -u "${MSISDN_ARRAY[$i]}"_Municipos.csv > temp_municipos
        mv temp_municipos "${MSISDN_ARRAY[$i]}"_Municipos.csv

        sort -u "${MSISDN_ARRAY[$i]}"_Celdas.csv > temp_celdas
        mv temp_celdas "${MSISDN_ARRAY[$i]}"_Celdas.csv

        echo "${MSISDN_ARRAY[$i]}",$(wc -l < "${MSISDN_ARRAY[$i]}"_Municipos.csv),$(wc -l < "${MSISDN_ARRAY[$i]}"_Celdas.csv) >> List.csv

        rm "${MSISDN_ARRAY[$i]}"_Municipos.csv "${MSISDN_ARRAY[$i]}"_Celdas.csv

    done

    echo SCRIPT Finished $(date) >> script_times

    exit


    The input file used in the script has the following layout:

    date, 8-digit number, name_of_location, name_of_location


    Any help would be greatly appreciated. Thanks in advance.


Comments

  • Closed Accounts Posts: 198 ✭✭sh_o


    Without stating the obvious, I would echo a lot more timestamps into the script_times file (or a separate file for debugging) so you can see exactly which part is taking so long, and then try to optimise that particular part.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    It's the amount of file I/O that is causing the long execution time. I have looked into associative arrays and hash tables to solve this problem, but I have never done anything like that in bash before and would appreciate it if anybody has good links to some examples of the above.
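For reference, bash 4 and later supports associative arrays via `declare -A`. A minimal, self-contained sketch of counting occurrences per key entirely in memory (the MSISDN values below are made up for illustration):

```shell
#!/bin/bash
# Sketch: count occurrences per key with a bash 4+ associative array.
# Everything stays in RAM; no temp files are written.

declare -A seen

for msisdn in 11111111 22222222 11111111; do
    # ${seen[$msisdn]:-0} defaults to 0 the first time a key appears
    seen[$msisdn]=$(( ${seen[$msisdn]:-0} + 1 ))
done

echo "unique MSISDNs: ${#seen[@]}"    # prints: unique MSISDNs: 2

# iteration order over keys is unspecified
for key in "${!seen[@]}"; do
    echo "$key appeared ${seen[$key]} times"
done
```

The same pattern would let you replace the per-MSISDN temp files in the original script with in-memory counters.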


  • Registered Users Posts: 1,916 ✭✭✭ronivek


    I don't have a whole lot of experience developing bash scripts; however, I have had the misfortune of working with a number of scripts which were resource hogs due to liberal usage of grep and cut.

    Unfortunately I can't help with your bash script directly, but my solution was to write a high-performance Python script to replace the block of bash that cut and grepped like there was no tomorrow. I'm sure there are any number of languages you could use in a similar manner while still driving the task from a bash script of sorts.

    Not what you're looking for, I know, but if it's an option for you it might be easier than attempting to optimise a script in a language you're not too familiar with.


  • Registered Users Posts: 157 ✭✭TeaServer


    It's the amount of file I/O that is causing the long execution time.

    You have pretty much answered your own question here. The script is doing loads of unnecessary file I/O. You should pipe the results of commands directly into sort -u. This will all happen in RAM and will speed things up massively. The simple rule for speed is to do as much as you can in RAM.

    1. Put the 2 sed operations into one sed invocation, using -e for each 'script':
    # sed -e 'script1' -e 'script2' file > output
    2. No need for the msisdn_ORIG file: just pipe directly into sort -u to create the no-duplicate file (this is done in RAM instead of an I/O write/read):
    # cat Mobility_temp.csv | cut -d"," -f2 | sort -u > msisdn_NO_DUPLICATE
    3. Similar thing here: you don't care what is in the files, just how many lines they have, so use command substitution instead of temp files:
    COUNT1=$(grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f3 | sort -u | wc -l)
    COUNT2=$(grep -F "${MSISDN_ARRAY[$i]}" Mobility_temp.csv | cut -d"," -f4 | sort -u | wc -l)
    Then use the counts in the output. You could improve this further by pulling out the lines for each MSISDN once and then checking each field in that smaller file, instead of re-reading the big file every time...
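That last point can be pushed further still: a single awk pass over the CSV can produce the whole List.csv with no per-MSISDN grep at all. A rough sketch, assuming the field layout described in the original post (date, MSISDN, municipo, celda); a tiny made-up sample stands in for the real Mobility_temp.csv:

```shell
#!/bin/bash
# Sketch: build List.csv (msisdn,distinct_municipos,distinct_celdas)
# in one pass over the CSV, instead of grepping it once per MSISDN.
# A tiny made-up sample stands in for the real Mobility_temp.csv.
printf '%s\n' 'd1,11111111,A,X' 'd2,11111111,B,X' \
              'd3,22222222,A,Y' 'd4,11111111,A,Z' > Mobility_temp.csv

awk -F',' '
    !m[$2 FS $3]++ { municipos[$2]++ }  # first sighting of (msisdn,municipo)
    !c[$2 FS $4]++ { celdas[$2]++ }     # first sighting of (msisdn,celda)
    END {
        for (id in municipos)
            print id "," municipos[id] "," celdas[id]+0
    }
' Mobility_temp.csv | sort > List.csv

cat List.csv
```

On the sample above this yields one line per MSISDN, e.g. 11111111 with 2 distinct municipos and 2 distinct celdas. The whole file is read exactly once, which is where the big win over the grep-per-MSISDN loop comes from.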


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    ronivek wrote: »
    but my solution was to write a high performance Python script to replace the block of bash script which cut and grepped like there was no tomorrow.
    +1
    You could probably replace the entire bash script with a Perl or Python script that can handle a lot of the work in RAM.

