
More efficient way to update my Access database?

  • 31-12-2013 2:16pm
    #1
    Registered Users Posts: 414 ✭✭DeDoc


    Hi

    I have a large database, with several million records over ~ 20 tables - many of which reside outside my control (i.e. I can't change their structure or update them or whatever - just read the contents).
    I'm currently involved in a project where we're looking to do some detailed statistical analysis, a precursor to which involves lots of calculation on the database. What I've been doing/trying to do is querying the database to get the raw data, doing all the processing and then writing the calculated data to a new table. I can then read directly from this table into R (where I'm doing the statistical analysis). I'm happy enough that this is a sensible way to go (even though I could theoretically read directly into R, I think this way is preferable).
    My issue is with the first part - whether there is a more efficient way of creating/updating this table of parameters from my calculation stage.

    At present, to create a batch of parameters I have a query which I open through VBA, reading all the 'raw' data into the recordset. For each record in this recordset, I then run a further query, the results of which are used in the calculations etc. I now write these results, using an update query, into my parameters table and go on to the next record in the recordset. The process is pretty slow - in no small part down to the huge number of records to be processed. What I'm wondering is whether it would be more efficient to somehow store all the results of these calculations in memory and then batch update (with an update query). By observation, the final update query seems to take most of the time in my code.

    A simplified example, which might help understanding.
    Let's say I have a database of customer purchases in a shop. What I'd like to do is see if there is anything in the customer's purchase history that can be used to predict what he'll buy when/if he comes into the shop next. That predictive bit is in the realm of the statistics (which will happen in R, i.e. outside Access), but I need to generate the raw data first.
    So, for example, I might want to know for every purchase (i.e. each record in the purchases table), how many prior purchases that customer has, what his average spend was, what percentage of the time he bought products in each of my product categories. That data will be calculated and stored in a 'parameters' table, which I'll use later in R.
    So, say, I decide to generate the parameters for November. I first run a query to get all the information from the purchases table and other related tables - let's say there are 1000 purchases. Now for each of those thousand purchases I go back and calculate all the things mentioned in the above paragraph. I have a set of numbers (say 12) for that particular purchase. The nub of my question is whether it is more efficient to
    a) run an update query to write those 12 to the parameters table (i.e. updating one record in that table) before moving on to the next record in the purchases recordset (1000 records) etc., or....
    b) store the 12 values in memory, and cycle through the purchases recordset (1000), before using one single update query to write all 12K (12 parameters x 1000 purchases) values to the parameters table (see the sketch below)
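
    As an illustration of option (b), here is a minimal DAO sketch that buffers the writes inside one transaction rather than firing an update query per record. The query name qryNovPurchases, the table tblParameters and the field names are all invented for the example:

        ' Minimal sketch of option (b): one transaction around the whole batch.
        ' Object names below are placeholders, not the OP's real ones.
        Sub BuildParametersBatch()
            Dim db As DAO.Database
            Dim rsIn As DAO.Recordset
            Dim rsOut As DAO.Recordset

            Set db = CurrentDb
            Set rsIn = db.OpenRecordset("qryNovPurchases", dbOpenSnapshot)
            Set rsOut = db.OpenRecordset("tblParameters", dbOpenDynaset)

            DBEngine.BeginTrans                      ' hold all writes in one transaction
            Do While Not rsIn.EOF
                ' ... run the per-purchase calculations here ...
                rsOut.AddNew                         ' or FindFirst + Edit to update in place
                rsOut!PurchaseID = rsIn!PurchaseID
                rsOut!PriorPurchases = 0             ' placeholder for a calculated value
                rsOut.Update
                rsIn.MoveNext
            Loop
            DBEngine.CommitTrans                     ' one commit instead of 1000 separate writes

            rsOut.Close
            rsIn.Close
        End Sub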


Comments

  • Registered Users Posts: 7,157 ✭✭✭srsly78


    Use a proper database; that would give a big boost. Access is simply not designed for heavy-duty use.

    You can still use your hello kitty Access frontend tho, just link the tables to a real database.

    Using SQL Server + stored procedures, for example, would be a LOT faster than what you are doing currently.
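
    For what it's worth, a rough sketch of how an Access frontend can hand the heavy lifting to a stored procedure via ADO (late bound, so no reference needs setting). The connection string and the procedure name usp_BuildParameters are placeholders:

        ' Sketch only: run a SQL Server stored procedure from VBA.
        ' Server, database and procedure names are made up.
        Sub RunServerSideCalc()
            Dim cn As Object, cmd As Object

            Set cn = CreateObject("ADODB.Connection")
            cn.Open "Provider=SQLOLEDB;Data Source=MYSERVER;" & _
                    "Initial Catalog=MyDb;Integrated Security=SSPI;"

            Set cmd = CreateObject("ADODB.Command")
            Set cmd.ActiveConnection = cn
            cmd.CommandType = 4                      ' adCmdStoredProc (late-bound value)
            cmd.CommandText = "usp_BuildParameters"
            cmd.Execute                              ' the row-by-row work stays on the server

            cn.Close
        End Sub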


  • Closed Accounts Posts: 22,649 ✭✭✭✭beauf


    It's vastly more efficient/faster through SQL Server. As srsly78 says, do all the heavy lifting there. You'll break your heart trying to do it in Access.

    Unless of course SQL just isn't an option for some reason?

    http://www.expresstechnology.com/support/knowledge-base/46-sql-database-server/103-what-is-the-advantage-of-microsoft-sql-server-over-access

    I was stuck with a vast Access SQL/VBA/Worksheet (at the file size limit) and it took days if not weeks to run the jobs in it. The same jobs in SQL Server would have taken minutes. The only reason I didn't rewrite it was that there were 10 years of VBA code in there, which I didn't have time (or knowledge of the business processes) to rewrite.


  • Registered Users Posts: 434 ✭✭TheBoffin


    SQL is the way to go with this one, as you are putting huge work on yourself trying to achieve the same in Access, and it would be less work to convert it all into SQL.

    The best way to set out your tables is to link by ID, so on the product purchases, for example, you can run (SELECT * FROM SALE_ITEMS WHERE CUSTOMERID = x)

    Assuming that query returns a sale for an external DVD writer which has a category [DVD01], then an additional query will display other products associated with the category, for example blank DVDs.
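
    Purely as an illustration of that follow-up query from VBA - the PRODUCTS table, CATEGORYID and PRODUCTNAME fields are invented names:

        ' Sketch: list other products in the same category as the purchased item.
        Sub RelatedProductsDemo()
            Dim rs As DAO.Recordset
            Set rs = CurrentDb.OpenRecordset( _
                "SELECT * FROM PRODUCTS WHERE CATEGORYID = 'DVD01'", dbOpenSnapshot)
            Do While Not rs.EOF
                Debug.Print rs!PRODUCTNAME           ' e.g. blank DVDs in category DVD01
                rs.MoveNext
            Loop
            rs.Close
        End Sub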

    The trick is to try to keep the DB as lean as possible and put all the donkey work code in the program or website itself and not on the DB end. The DB should just capture information once and not duplicate existing data.


  • Registered Users Posts: 7,157 ✭✭✭srsly78


    Oh yeah, set the correct primary keys etc. on the tables. Also add other indices as needed; this may depend on the Access version. Without the right keys and indexes, any queries will be brutally slow. This applies to all databases, not just Access.

    This can turn an "over the weekend" query into a 5 minute one.
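
    For local Access tables the indexes can be added with DDL from VBA (linked tables have to be indexed on the back end instead). A sketch with invented table and field names:

        ' Sketch only: Jet DDL run from VBA; names are illustrative.
        Sub AddIndexes()
            With CurrentDb
                .Execute "CREATE UNIQUE INDEX pkParams ON tblParameters (PurchaseID) WITH PRIMARY"
                .Execute "CREATE INDEX idxCustomer ON tblPurchases (CustomerID)"
            End With
        End Sub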


  • Closed Accounts Posts: 22,649 ✭✭✭✭beauf


    TheBoffin wrote: »
    ... put all the donkey work code in the program or website itself and not on the DB end. ...

    That would depend on what kind of donkey work it is.


  • Registered Users Posts: 434 ✭✭TheBoffin


    beauf wrote: »
    That would depend on what kind of donkey work it is.

    You would be amazed at what I have seen in DBs before and the god-awful structure... However, yes, I agree: use common sense on the donkey work.


  • Registered Users Posts: 2,021 ✭✭✭ChRoMe


    TheBoffin wrote: »
    You would be amazed at what I have seen in DB's before and the god awful structure..... However, yes i agree, use common sense on the donkeywork.

    That post is impressive in that it contains absolutely no value to anyone.


  • Closed Accounts Posts: 22,649 ✭✭✭✭beauf


    IMO, if you are doing "lots of calculation on the database" for the purpose of "statistical analysis", I would make a copy of the data on another database, preferably on another machine, so that running and developing the statistics doesn't affect the performance of the live database. You can create a job to sync/replicate the live data to the report server every 24 hours or so. I would guess it doesn't need to be more current than that. Probably a week old would be enough, so do it at the weekend.

    If you move that to a machine with lots of cores and SQL Server, you can take full advantage of multithreading. I would guess that the bottleneck will not be RAM or disk access but the CPU, so that's a reason to go with SQL Server. But you'll have to optimise your queries and move away from record-by-record processing.

    The main issue won't be running the reports when they are live. The issue is the time it takes you to tweak your queries when developing them. That's when the slow queries really bite.

    If you said you've only got Access to work with, then I'd go a different way.


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    I was recently doing some reports whose source data was, primarily but not solely, an Access database. I used Pentaho's open source ETL tool, Spoon, to quickly pull all the data needed from Access into a PostgreSQL database, where I could easily (again using Spoon) reformat/restructure it to a form that made the required querying and reporting straightforward. I had used Spoon before with PostgreSQL, but even so the extracting and loading was so simple. In my case it was pretty much a once-off, but the Spoon tool can be configured to run as automated jobs.

    Now there are probably MS tools that will do the same, but these open source tools are simple to use and cost nothing. I normally work in Linux myself, but in this case I installed Postgres & Spoon on a Windows box and it was straightforward. If you do have access to, and experience of, MS SQL Server then I'm sure it must have its own ETL tools that you might use - though Spoon can just as easily work with MS SQL as PostgreSQL.

    Actually the first project I undertook with Pentaho's ETL tools was something similar to your retail problem.

    One final point since you seem to have the bulk of the work done but this final step is the problem...
    Are you using (database) transactions in your VB code?
    Back in my early days I worked on mainframes - the big updates (the creation of a production plan for a number of factories) that ran at the weekend would take too long with the database (not MS Access :)) transactions enabled. We knew nothing else would be accessing the database while this job was running, so we disabled the transaction logging, backed up the database and ran the planning process - if it failed, we restored the backup and investigated; otherwise we just re-enabled transaction logging. The process took a fraction of the time when run this way. I should point out that in that case we had already exhausted other tuning measures (ensuring correct indices existed, strategic placement of the various database files on the physical disks - all that kind of stuff).
    Anyway, the point is that database transaction logging can add a big overhead, and you mentioned the write-back is where you are hitting problems, so perhaps investigate this... are transactions enabled, is it one big transaction, or is it perhaps the other extreme and every write is a transaction? Maybe you can find a happy medium? If they're used at all.
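
    In DAO terms that happy medium usually means committing every n writes rather than once per write or in one giant transaction. A sketch, with invented names and a dummy loop standing in for the real calculation:

        ' Sketch: commit in chunks of CHUNK rows.
        Sub WriteInChunks()
            Dim rsOut As DAO.Recordset
            Dim i As Long
            Const CHUNK As Long = 1000

            Set rsOut = CurrentDb.OpenRecordset("tblParameters", dbOpenDynaset)
            DBEngine.BeginTrans
            For i = 1 To 12000                       ' stand-in for the real per-record loop
                rsOut.AddNew
                rsOut!SomeValue = i                  ' placeholder write
                rsOut.Update
                If i Mod CHUNK = 0 Then              ' flush every CHUNK rows
                    DBEngine.CommitTrans
                    DBEngine.BeginTrans
                End If
            Next i
            DBEngine.CommitTrans                     ' commit the final partial chunk
            rsOut.Close
        End Sub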


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    DeDoc wrote: »
    The nub of my question is whether it is more efficient to
    a) run an update query to write those 12 to the parameters table (i.e. updating one record in that table) before moving on to the next record in the purchases recordset (1000 records) etc., or....
    b) store the 12 values in memory, and cycle through the purchases recordset (1000), before using one single update query to write all 12K (12 parameters x 1000 purchases) values to the parameters table
    It may depend on which fields are used for the index or indexes. In real terms, a few million rows isn't that big and, as long as the normalisation is done well, it shouldn't be too problematic (unknowns such as table complexity and indices excluded). SQL Server seems like a far better option and Access should easily be able to interface with it. I'm not sure how Access's memory handling works, but most real db software allows the configuration for memory handling (RAM and temporary swaps) to be adjusted. With large tables that cannot be reindexed easily in RAM, the HD will be used as swap space. How much memory does the machine you are using to crunch the data have?

    The Jet Engine tuning tips might be useful (you may already know this stuff though):
    http://msdn.microsoft.com/en-us/library/dd942824%28v=office.12%29.aspx

    Regards...jmcc


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    ChRoMe wrote: »
    That post is impressive in that it contains absolutely no value to anyone.
    Well it made me laugh so it has some value. :)

    Regards...jmcc


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    TheBoffin wrote: »
    SQL is the way to go with this one as you are putting huge work on yourself trying to achieve the same in access and it would be less work to convert it all into SQL.
    Being pedantic here, but doesn't Access use SQL? :)
    The trick is to try keep the DB as lean as possible and put all the donkey work code in the program or website itself and not on the DB end.
    Unfortunately, the DB might be better at some kinds of donkey work. Producing the raw statistical data from a set of queries might be one of these situations.

    Regards...jmcc


  • Registered Users Posts: 414 ✭✭DeDoc


    Many thanks all

    To answer some questions:

    I have been making a local copy to work on. I'm in a data mining phase right now, so having the most current data isn't really important. When I find out the relationships and develop the model etc., I'll be querying a fairly small set from the live database, and I don't anticipate speed being a problem.
    There are major legacy issues here. The raw data is in some old version of FoxPro which I'm reading through ODBC into Access.
    The machine is a quad-core, 4 GB machine with the data and operating system (Windows 7 64-bit) on an SSD. I forget the clock speed - I think it is about 3.5 GHz.

    At this stage I have most of the raw processing done, so there isn't much incentive to move to a different database platform - I guess that decision should possibly have been made with more thought when I first started tinkering with this project some 2-3 years ago. I went with Access because I'd used it before, because I had a copy, and because it interfaced OK with FoxPro. Since then I've done a little bit of work in MySQL, so I suppose I could swap to that in the future if I needed to.

    I've one more major task/headache (which might cause me to revisit the above!), and that is that I need to search one particular field (a text field) for the occurrence of any of a set of key phrases (some 200 or so) - and which one is important. Imagine it as being like searching a set of newspaper articles for mention of anyone in a list of people - needing to know whether any particular person is mentioned in any particular article. I'm wondering about the most efficient way to do this (bearing in mind the overhead of transferring the data into a more suitable database environment etc.):
    1) Am I better stepping through each name and using an update query on all records, e.g. UPDATE X SET Flag_Pauline=TRUE WHERE Content_text LIKE "*Pauline*" (sketched below), or
    2) Select all (relevant) records into a recordset and then step through each in turn, with a single (large) update query string created programmatically by searching for each name in the list?
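
    A rough sketch of what option 1 might look like, driven from a keywords table so the 200-odd phrases aren't hard-coded. tblKeywords, tblArticles and the field names are invented:

        ' Sketch of option 1: one UPDATE ... LIKE per key phrase.
        ' Note: LIKE with a leading * cannot use an index, so each pass scans the whole table.
        Sub FlagKeyPhrases()
            Dim db As DAO.Database
            Dim rsKey As DAO.Recordset
            Dim sql As String

            Set db = CurrentDb
            Set rsKey = db.OpenRecordset( _
                "SELECT KeyPhrase, FlagField FROM tblKeywords", dbOpenSnapshot)

            Do While Not rsKey.EOF
                sql = "UPDATE tblArticles SET [" & rsKey!FlagField & "] = True " & _
                      "WHERE Content_text LIKE '*" & rsKey!KeyPhrase & "*'"
                db.Execute sql, dbFailOnError
                rsKey.MoveNext
            Loop

            rsKey.Close
        End Sub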

    Thanks in advance


  • Registered Users Posts: 1,712 ✭✭✭neil_hosey


    I'm not exactly sure if this is what you want, and this is only a suggested alternative :) ... but you could possibly create a separate SQL Server Express DB and read the raw data from the Access DB into the new SQL Server DB using the sp_addlinkedserver command and DTS or something.

    Then your R app can use the purely MS SQL DB for performing analysis. Any alterations or calculations to the data can be handled by a SQL Server job and stored procedures, maybe.

    The only issue is duplication of data between the Access and MS SQL DBs, but if it's speed of development that is required and zero modification of the current Access app, this might be the way to go.


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    Some reading...
    http://www.programcreek.com/2011/08/lucene-vs-database-index/
    http://lucene.apache.org/
    You seem to work within the MS tech stack; I'm not sure what the equivalent of Lucene is there, but there is sure to be one.


  • Closed Accounts Posts: 22,649 ✭✭✭✭beauf


    You say speed isn't an issue, but wasn't that originally your query? Efficiency and speed?
    DeDoc wrote: »
    ....The process is pretty slow - in no small part down to the huge number of records to be processed. ....it appears that the final update query seems to take most of the time in my code....

    Bear in mind your main bottleneck is probably not RAM or disk I/O but CPU processing,
    DeDoc wrote: »
    ...
    The machine is a quad core 4GB machine with the data and operating system (Windows 7 64 bit) on an SSD disk. I forget the clock speed - think it is about 3.5 GHz....

    ... and Access will use this like a single-core machine.

    But tbh (and I'm not making a criticism, just an observation) I think you are fixated on doing this through Access and VBA, in which case you should stick to the way you are doing it. But it's probably the slowest, least efficient way of doing this.

    But that's only an issue if this is something that becomes time-critical in the future - if, perhaps, they want to increase the frequency of the reports, or you find an error that requires you to rerun it more often. Then it will bite you.

    You say there are major legacy issues here. But there aren't really. Data is data, and modern databases can pull in legacy data with ease, especially with something like SSIS in MS SQL.

    So if you are staying with MS Access, I think you'd be better off asking on one of the Access/VBA-specific forums for the optimum method of searching through text. In general an SQL query is faster than a recordset, but the only way to know for sure is to do it both ways and time it (see the timing sketch after the links below). If you don't know how to do it both ways, then do it the way you know - unless it's going to be run regularly, in which case time it.

    http://www.utteraccess.com/wiki/index.php/Recordsets_for_Beginners
    http://bytes.com/topic/access/answers/197773-speed-sql-vs-recordset
    http://bytes.com/topic/access/answers/193944-efficiency-sql-update-vs-recordset-edit-update
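
    The timing sketch mentioned above - the two stub subs stand in for your own two implementations:

        ' Crude timing harness using VBA's Timer function.
        Sub RecordsetVersion()
            ' ... recordset-based implementation goes here ...
        End Sub

        Sub SqlVersion()
            ' ... single-UPDATE implementation goes here ...
        End Sub

        Sub TimeBothWays()
            Dim t As Single
            t = Timer: RecordsetVersion
            Debug.Print "Recordset loop:", Timer - t, "seconds"
            t = Timer: SqlVersion
            Debug.Print "Single UPDATE:", Timer - t, "seconds"
        End Sub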

    I'll answer a different question for you though. It's better for you to learn the MS SQL (or similar) way of doing this, even though the pressure is for you to do it the way you know, which initially takes the least time.


  • Registered Users Posts: 14,336 ✭✭✭✭jimmycrackcorm


    DeDoc wrote: »
    Many thanks all

    To answer some questions:

    I have been making a local copy to work on. I'm in a data mining phase right now, so having the most current data isn't really important. When I find out the relationships and develop the model etc, then I'll be querying a fairly small set from the live database, and I don't anticipate speed being a problem
    There are major legacy issues here. The raw data is in some old version of FoxPro which I'm reading through ODBC into Access.
    The machine is a quad core 4GB machine with the data and operating system (Windows 7 64 bit) on an SSD disk. I forget the clock speed - think it is about 3.5 GHz.

    At this stage I have most of the raw processing done, so there isn't much incentive to move to a different database platform - I guess that decision should possibly have been made with more thought when I first started tinkering with this project some 2-3 years ago. I went with Access because I'd used it before, because I had a copy and because it interfaced ok with FoxPro. Since then I've done a little bit of work in MySQL, so I could I suppose if I needed to swap to that in the future.

    I've one more major task/headache (which might cause me to revisit the above!), and that is I need to search one particular field (a text field) for the presence of any (and which one is important) occurrence of a set of key phrases (some 200 or so). Imagine it as being like searching a set of newspaper articles for mention of anyone in a list of people - needing to know whether any particular person is mentioned in any particular article. I'm wondering about the most efficient way to do this (bearing in mind the overhead of transferring the data into a more suitable database environment etc)
    1) Am I better stepping through each name and using an update query on all records, e.g. UPDATE X SET Flag_Pauline=TRUE WHERE Content_text LIKE "*Pauline*", or
    2) Select all (relevant) records into a recordset and then step through each in turn, with a single (large) update query string created programmatically by searching for each name in the list

    Thanks in advance

    For the last bit, using SQL Server's full-text indexing would make life easier. Also, depending on the volume of data and transactions, I wonder if using MapReduce would be most efficient. Anyway, doing the operations outside Access is best.


  • Registered Users Posts: 6,153 ✭✭✭Talisman


    DeDoc wrote: »
    I've one more major task/headache (which might cause me to revisit the above!), and that is I need to search one particular field (a text field) for the presence of any (and which one is important) occurrence of a set of key phrases (some 200 or so). Imagine it as being like searching a set of newspaper articles for mention of anyone in a list of people - needing to know whether any particular person is mentioned in any particular article. I'm wondering about the most efficient way to do this (bearing in mind the overhead of transferring the data into a more suitable database environment etc)
    1) Am I better stepping through each name and using an update query on all records, e.g. UPDATE X SET Flag_Pauline=TRUE WHERE Content_text LIKE "*Pauline*", or
    2) Select all (relevant) records into a recordset and then step through each in turn, with a single (large) update query string created programmatically by searching for each name in the list
    In this situation your best options are to implement a full-text search index in your database (not possible in Access), or to create an external index using a search server such as Apache Solr or Sphinx and use that for running your search query. Have a look through the following Slideshare presentation for a numbers comparison:

    Full Text Search Throwdown


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    DeDoc wrote: »
    2) Select all (relevant) records into a recordset and then step through each in turning, with a single (large) update query string created programmatically by searching for each name in the list
    It is generally a good thing to reduce the amount of data to be processed.

    I don't know how efficient Access is with regexps and searching text fields. With Linux/Unix, many of the languages have their own regexp support that can be used to parse text data, and this can sometimes be quicker than using the database approach. I think that Visual Basic has its own regexp.
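
    For example, the VBScript.RegExp object is usable from VBA - the pattern and strings below are just illustrations:

        ' Small sketch: whole-word matching with VBScript.RegExp (late bound).
        Sub RegexpDemo()
            Dim re As Object
            Set re = CreateObject("VBScript.RegExp")
            re.Pattern = "\bPauline\b"               ' whole word, unlike LIKE "*Pauline*"
            re.IgnoreCase = True

            Debug.Print re.Test("Pauline bought two DVDs")   ' True
            Debug.Print re.Test("Paulina bought two DVDs")   ' False
        End Sub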

    Basically the process would be to have the text field as a named file, check the file for the keyword and, if found, write a simple SQL statement (updating the main table) to an SQL file. After the files are processed, the SQL file can be executed as a batch against the table to be updated. It can have scalability issues and it needs to be well planned. It really depends on the size and complexity of the text fields, and it might not be something that would be used for this kind of application.

    The full-text search available in SQL Server, MySQL, Postgres etc. might be a far simpler approach and may not involve an external program or coding.

    Apart from reducing the set of data to be processed, it might be a good thing to look at simplifying queries. Use simple queries rather than complicated queries where possible.

    Regards...jmcc


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    I did post something yesterday that seemed to get lost in the maintenance downtime.
    The short of it was ....
    http://www.slideshare.net/billkarwin/full-text-search-in-postgresql
    http://www.postgresql.org/docs/8.3/static/textsearch.html
    http://linuxgazette.net/164/sephton.html

