
SQL BCP: Importing large volume of data from CSV file

  • 15-10-2004 6:14pm
    #1
    Registered Users Posts: 4,222 ✭✭✭Scruff


    I have a CSV text file that I need to open using an ADODB connection and import all the contents into an MSSQL db table.
    The problem is that I have to be able to handle a CSV file with up to 500,000 rows on a busy system.
    I've been tinkering with the SQL Bulk Copy Program (bcp) but can't get it to work; it keeps throwing an error:
    Starting copy...
    SQLState = S1001, NativeError = 0
    Error = [Microsoft][ODBC SQL Server Driver]Memory allocation failure
    I'm using a format file as the source CSV file does not contain all the columns in the destination table.
    Any ideas what's wrong, or does anyone know any other methods for copying large volumes of data from a CSV file to a SQL db table?
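
    The command I'm running is roughly like this (server, database and file names changed):

    bcp MyDatabase.dbo.ImportTable in C:\data\import.csv -f C:\data\import.fmt -S MYSERVER -T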


Comments

  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    MSSQL (from version 7 onwards) comes with a tool called "Data Transformation Services" (DTS) which is far better suited to what you want to do.

    In the SQL Enterprise Manager, you can even kick off an "Import Wizard" which will help you define everything. After that, you can tool around with options such as dropping/rebuilding indexes, sending mail notifications on success/failure, scheduling, and the rest of it.

    And DTS jobs can be run from the command line (as you'll see if you schedule one), which means that even if this needs to be run from inside a script or whatnot, it's no big issue.
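
    For example, something along these lines will run a package saved on the server (server and package names made up):

    dtsrun /S MYSERVER /E /N CSVImport

    /S is the server, /E uses a trusted connection and /N is the package name.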

    jc


  • Registered Users Posts: 4,222 ✭✭✭Scruff


    You might be right. However, the DTS package will be kicked off from a SQL job, and there are 2 things I need to be able to do before the package is kicked off by the job, and I haven't been able to find out how to do them, or whether they're possible at all:
    1) I need to be able to dynamically set the name and location of the source CSV file from which the data is being imported.
    2) There is an integer column in the destination table that does not exist in the source file. This integer column has a foreign key relationship with a primary key column in another table, so I also need to be able to dynamically set the value that is to be inserted into this column in the destination table.

    If these can't be done then I'm at a loss really. Are there any programming methods I could use instead, say running a standard SELECT query against the source CSV in batches? The problem with that kind of approach is keeping track of the position of the next record in the CSV file that I want to extract the next batch from, and how to read from that position without having to query the entire file all over again.


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Hmmm....

    I would solve the first half of your problem differently... I'd rename the file to a fixed name each time, rather than trying to get DTS to dynamically handle any file. But maybe that's just me.

    As for the second half... I'm not sure quite what you want to do here. Saying that you want to set the value *dynamically* implies, to me, setting it on a per-record basis. To do this, you must have rules (e.g. if fieldA = X and fieldB = Y then use value1, and so on) or something. All of this can be coded if you use an actual transformation to map one table to another, where you can use VBScript which runs on a per-row basis.
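
    A per-row ActiveX Script transformation is only a few lines of VBScript, roughly this shape (the column names and the rule are made up just to show the idea):

    Function Main()
        ' copy across a column that exists in the CSV
        DTSDestination("OrderRef") = DTSSource("Col001")
        ' work out the extra integer column per row, based on whatever your rule is
        If DTSSource("Col002") = "X" Then
            DTSDestination("ParentID") = 1
        Else
            DTSDestination("ParentID") = 2
        End If
        Main = DTSTransformStat_OK
    End Function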

    Alternatively, you can import the table with this field set as NULL, and then run an UPDATE across all rows with a NULL value once the import has finished.
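
    i.e. something along the lines of (table, column and value made up):

    UPDATE dbo.ImportTable
    SET ParentID = 42        -- whatever value the job decides on for this run
    WHERE ParentID IS NULL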

    As for programming methods....I don't see any difficulty if you go that route:
    - Read (next) row from CSV,
    - parse,
    - build INSERT statement,
    - run INSERT statement,
    - repeat to end-of-file.

    I don't see why you'd be querying the entire table at all, ever, unless you're trying to do something other than what you've described. If you want to do it in batches, then do so, but all that would do is (there's a rough script sketch of this after the list):

    - Read row from CSV,
    - parse,
    - build INSERT statement,
    - run INSERT statement inside transaction
    - increment transaction-size-counter variable
    - if transaction-size == desired batch size, commit writes.
    - repeat to end-of-file.
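
    Something like that in VBScript over ADO would look roughly as follows - the connection string, file path, table and column names are all made up, and the parsing is the naive kind that will break on quoted commas:

    Const BATCH_SIZE = 1000

    Dim fso, ts, conn, rowText, fields, rowsInBatch
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile("C:\data\import.csv", 1)    ' 1 = ForReading

    Set conn = CreateObject("ADODB.Connection")
    conn.Open "Provider=SQLOLEDB;Data Source=MYSERVER;" & _
              "Initial Catalog=MyDatabase;Integrated Security=SSPI;"

    rowsInBatch = 0
    conn.BeginTrans

    Do While Not ts.AtEndOfStream
        rowText = ts.ReadLine                   ' read (next) row from CSV
        fields = Split(rowText, ",")            ' parse

        ' build and run the INSERT - 42 stands in for whatever FK value you set per run
        conn.Execute "INSERT INTO ImportTable (Col1, Col2, ParentID) VALUES ('" & _
                     Replace(fields(0), "'", "''") & "', '" & _
                     Replace(fields(1), "'", "''") & "', 42)"

        rowsInBatch = rowsInBatch + 1
        If rowsInBatch = BATCH_SIZE Then        ' commit every BATCH_SIZE rows
            conn.CommitTrans
            conn.BeginTrans
            rowsInBatch = 0
        End If
    Loop

    conn.CommitTrans                            ' commit whatever is left over
    ts.Close
    conn.Close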

    jc


  • Registered Users Posts: 437 ✭✭Spunj


    I used to do a lot of import work using bcp / bulk insert. With huge amounts of data, DTS is not really feasible as it can take a huge amount of time and is always logged (you can find the log files expanding to fill all your disk space before you know it).

    Usually, the fastest way is to drop the indexes before you start, do the import and then recreate the indexes afterwards (this can be 100x faster than logged mode). Bcp can't work in 'fast' mode if there are indexes on the tables, if I remember correctly. You can control the batch size yourself, which can help if you are getting errors. Are you using delimited or fixed-width definitions in your format files (I assume delimited as you said CSV)? I have had trouble with delimiters appearing in my input data, so you may want to double-check this.
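
    The shape of it is roughly this, with made-up index/table/file names - drop the index in SQL, run bcp from the command line (-b sets the batch size), then recreate the index:

    DROP INDEX ImportTable.IX_ImportTable_Code

    bcp MyDatabase.dbo.ImportTable in C:\data\import.csv -f C:\data\import.fmt -b 50000 -S MYSERVER -T

    CREATE INDEX IX_ImportTable_Code ON ImportTable (Code)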

    Note - this is not always the case. You can try it both ways and monitor which runs better. Creating a lot of indexes can also chew up disk space so it depends on key type/complexity etc.

    Loading a table with BCP does not apply rules, triggers, FK constraints etc., so you might have to load the data into an intermediate table, then run your validation, then run inserts. This can sometimes be the best way, as bcp can have problems with defaults; it can do weird things with null values, for instance.

    I'm not sure exactly what the integer column problem is (need more info), but the approach involving an intermediate table may be something to investigate.
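
    Roughly, the intermediate table route would look like this (table and column names made up, and 42 standing in for whatever your integer value needs to be):

    BULK INSERT dbo.ImportStaging
    FROM 'C:\data\import.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 50000, TABLOCK)

    -- validate / clean up the staging rows here, then:
    INSERT INTO dbo.ImportTable (Col1, Col2, ParentID)
    SELECT Col1, Col2, 42
    FROM dbo.ImportStaging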


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    Spunj wrote:
    DTS is not really feasable as it can take a huge amount of time and is always logged (you can find the log files expanding to fill all your disk space before you know it).

    Sorry, but DTS is not always logged. Just like BCP, you need to set a couple of pre-requisites and then it happily runs in non-logged mode.

    The options are:
    1) Make sure the DB has its "Bulk Copy / Insert Into" option turned on (as you need for non-logged BCP copies) - see the snippet below
    2) Make sure the import in DTS has the following two options set: "Use Fast Load" and "Table Lock".
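
    For (1), that's just (database name made up):

    EXEC sp_dboption 'MyDatabase', 'select into/bulkcopy', 'true'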

    jc


  • Registered Users Posts: 4,222 ✭✭✭Scruff


    Thanks lads, haven't had a chance to give yer suggestions a go yet.
    bonkey wrote:
    As for programming methods....I don't see any difficulty if you go that route:
    - Read (next) row from CSV,
    - parse,
    - build INSERT statement,
    - run INSERT statement,
    - repeat to end-of-file.
    jc
    This is the method I'm currently using, but it was only ever processing files with a max of ~64K entries. It took an awful lot of time for 500K. Just thought one of the methods for copying large volumes of data into the DB would be quicker.

