Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

Need code to delete rows in huge .CSV

Options
  • 17-01-2013 8:59pm
    #1
    Banned (with Prison Access) Posts: 32,865 ✭✭✭✭


    Hi All,

    Basically, I'm trying to create an Access Database, but my source file is so large that things are a bit sluggish. I have a .CSV file that has over 2 million rows, but the majority of them I don't need. So I need a macro that will say something like "if cell in column I does not = a or b or c or d, or is blank, delete row".

    Naturally, the file is too large to open in excel and run a macro, is there a way to do this in a closed workbook?

    If anyone can offer assistance I'll would be very thankful. :)


Comments

  • Closed Accounts Posts: 2,117 ✭✭✭Defiler Of The Coffin


    Have you any experience in scripting? Have you ever used a language like Python or Perl?


  • Banned (with Prison Access) Posts: 32,865 ✭✭✭✭MagicMarker


    Nope, no experience at all. But I'm starting to learn!


  • Closed Accounts Posts: 2,117 ✭✭✭Defiler Of The Coffin


    I am not an expert on spreadsheets, so aside from suggesting that you open the workbook on a machine with a huge amount of RAM the only solution I have for you involves scripting. If you are badly stuck then PM me and I can help you out.


  • Registered Users Posts: 20,299 ✭✭✭✭MadsL


    Open in a text editor that does good regular expression macros.

    http://en.wikipedia.org/wiki/Comparison_of_text_editors

    Reimport to Excel.


  • Registered Users Posts: 727 ✭✭✭prettygurrly


    R will do that for you. It's a bitch to learn but once you do it just does everything.

    If you're not getting anywhere with previous posters offers get back to me. Can easily subset your sample leaving out rows containing the lines you don't want and it will only take a while to do...depends on the speed of your machine.


  • Advertisement
  • Registered Users Posts: 727 ✭✭✭prettygurrly


    in fact, this is very easy..I have code already written that will do it. Is your file sensitive that you can't send it on?


  • Registered Users Posts: 419 ✭✭Mort5000


    Split the file into a few that are small enough for Excel to handle. Filter. Delete. Rejoin. Done!


  • Closed Accounts Posts: 18,966 ✭✭✭✭syklops


    how big is the file? E.g. less than ~10 MB? If so, and the content is not sensitive, send it to me and I can write a perl or python script which will loop through and reduce it for you.

    I am sure others on the board would offer same.


  • Banned (with Prison Access) Posts: 32,865 ✭✭✭✭MagicMarker


    Thanks for the offers guy, the file is sensitive though so I wouldn't be able to send it on to anyone.
    Mort5000 wrote: »
    Split the file into a few that are small enough for Excel to handle. Filter. Delete. Rejoin. Done!

    2,377,406 rows manually broken down into Excel sheets limited to 65k rows. Shouldn't take long at all! :p


  • Registered Users Posts: 419 ✭✭Mort5000


    Take long?

    Yeah ok, do it manually, that'll be much quicker.


  • Advertisement
  • Banned (with Prison Access) Posts: 32,865 ✭✭✭✭MagicMarker


    Mort5000 wrote: »
    Take long?

    Yeah ok, do it manually, that'll be much quicker.
    Am I missing something? :confused:

    If there's a way to do it automatically then I'm all ears...


  • Closed Accounts Posts: 7,145 ✭✭✭DonkeyStyle \o/


    You could try OpenOffice. I'm still using an old-ish version here (3.3), I can import a 300k line CSV, I don't know how high it goes since I don't have anything bigger handy.


  • Registered Users Posts: 72 ✭✭kevincaomhin


    Am I missing something? :confused:

    Why not import into MS Access directly and create a new table with whatever filter you need?


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    Am I missing something? :confused:

    Why not import into MS Access directly and create a new table with whatever filter you need?
    TBH, I was thinking the same thing. It sounds like the easiest solution is to import the whole lot into your access database and then delete those rows that match your criteria using SQL. Something like:

    DELETE FROM mytable WHERE (column1 NOT LIKE 'a' AND column1 NOT LIKE 'b' AND NOT LIKE 'c' AND NOT LIKE 'd') OR LEN(column1) = 0


  • Banned (with Prison Access) Posts: 32,865 ✭✭✭✭MagicMarker


    Solutions are only easy when you know how.

    I tried deleting 50k blank rows manually in Access and it rendered my PC useless for a lengthy period until I had to force close.
    DELETE FROM mytable WHERE (column1 NOT LIKE 'a' AND column1 NOT LIKE 'b' AND NOT LIKE 'c' AND NOT LIKE 'd') OR LEN(column1) = 0

    Where would I put this?


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    Solutions are only easy when you know how.
    Often we learn how by finding those solutions.
    Where would I put this?
    You probably wouldn't put 'that' anywhere, as it's based upon the criteria in your first post, and presumes the name of the table your data will be assigned to. It's roughly what you would use though.

    Create it as a query, then edit it using View -> SQL View.


  • Banned (with Prison Access) Posts: 32,865 ✭✭✭✭MagicMarker


    Cool, I'll give that a shot, thanks.


  • Banned (with Prison Access) Posts: 32,865 ✭✭✭✭MagicMarker


    Managed to get it sorted.

    In the end I just ran a query for everything I needed, exported that to a txt file and re imported the txt file as a separate database.

    So I've gone from a DB just under 1GB to one just under 100Mb, happy days!

    Thanks all.


  • Registered Users Posts: 7,157 ✭✭✭srsly78


    I am not an expert on spreadsheets, so aside from suggesting that you open the workbook on a machine with a huge amount of RAM the only solution I have for you involves scripting. If you are badly stuck then PM me and I can help you out.

    Heheh have you ever tried 64bit office? It won't even co-exist on machine with 32bit office. So having loads of ram is not really a solution, first you gotta setup a brand new windows box just for 64bit office, then none of your macros/code from 32bit will work with it.


  • Registered Users Posts: 727 ✭✭✭prettygurrly


    well done you!


  • Advertisement
  • Registered Users Posts: 669 ✭✭✭Patrickof


    srsly78 wrote: »
    Heheh have you ever tried 64bit office? It won't even co-exist on machine with 32bit office. So having loads of ram is not really a solution, first you gotta setup a brand new windows box just for 64bit office, then none of your macros/code from 32bit will work with it.

    I have 64 bit office, and still wouldn't fix the OP problem.

    I have a 1.8 million line CSV file that stops in Excel 64 bit at 1.1 million lines.

    Access and a query is the best way.


  • Registered Users Posts: 72 ✭✭kevincaomhin


    Solutions are only easy when you know how.

    I tried deleting 50k blank rows manually in Access and it rendered my PC useless for a lengthy period until I had to force close.



    Where would I put this?


    Good to hear you got sorted.

    Do not use access any more, but my understanding was that if you manually delete rows in access, the "engine" treats as multiple sql queries whereas if you run one query as Corithian suggested it is a lot more efficient in terms of systems resources.


  • Closed Accounts Posts: 2,117 ✭✭✭Defiler Of The Coffin


    Now would be a good time to learn some scripting OP

    Just in case :cool:


  • Registered Users Posts: 27,161 ✭✭✭✭GreeBo


    play around with awk/sed/grep etc etc
    make your life easier next time


Advertisement