
Database table Maintenance - using XML as a means to verify Data Entry

  • 14-03-2006 2:04am
    #1
    Registered Users Posts: 450 ✭✭


    I have received a user request to allow 2 people to enter the same data into a database table, and to compare/verify the 2 sets of data before committing them to the database.

    At the moment, I'm thinking along the lines of saving all the data entered by each user into an XML file, and comparing the two XML files at the end.

    The solution would allow a user to make many inserts and updates, which would all be recorded in that "session's" XML file.

    This seems very simple and straightforward at first glance, but then consider that the user may wish to amend a mistake, or just to go back and double-check the information he has inserted/updated.

    So, before I begin, I'm wondering if there are any standard approaches to this type of solution? If not, do you have any advice on how I should proceed?

    (I hope I have explained my problem clearly, but please let me know if I haven't) :)

    Thanks


Comments

  • Registered Users Posts: 2,781 ✭✭✭amen


    Before implementing anything, I would have to ask why you would want to allow 2 people to enter the same data into a database table and to compare/verify the 2 sets of data before committing it to the database.
    Is there a business reason for this that could be solved another way (outside of programming)? What type of data is involved?

    I think doing this sort of thing leads to data pollution and is going to give someone a lot of grief at some point in the future.


  • Registered Users Posts: 450 ✭✭krinpit


    Yes, there is a strong business case for the requirement, as the data is largely numeric in nature. Users will enter the same set of data at different times - one session each.

    As only one set of reconciled data will actually be committed to the database, I don't see any possibility of data pollution. In fact, quite the contrary :confused:


  • Registered Users Posts: 1,193 ✭✭✭liamo


    Hi Krinpit,

    You could store it in XML or in another table in the database if that's easier. Any temporary storage area will do.

    It seems pretty straightforward, assuming that the records being input can be identified such that a user can view/modify his own data, and so that records in each set can be matched up after inputting. You'll also need to think about how to resolve a situation where two matching records have different data.
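    For example (a rough C#/ADO.NET sketch only - the EntryStaging table and its columns are hypothetical names, not a finished design):

    ```csharp
    using System.Data;
    using System.Data.SqlClient;

    // Sketch only: "EntryStaging" and its columns are hypothetical names.
    // RecordKey is what lets operator A's rows be matched against operator
    // B's rows later; EnteredBy is what lets each operator see only his own.
    public static DataTable LoadOwnRows(string connStr, int batchId, string operatorId)
    {
        const string sql =
            "SELECT RecordKey, Amount, EnteredBy, EnteredAt " +
            "FROM EntryStaging " +
            "WHERE BatchId = @batch AND EnteredBy = @op";

        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@batch", batchId);
            cmd.Parameters.AddWithValue("@op", operatorId);

            DataTable rows = new DataTable("EntryStaging");
            new SqlDataAdapter(cmd).Fill(rows);   // Fill opens/closes the connection
            return rows;
        }
    }
    ```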

    Other than that, I don't see any gotchas with it. I haven't seen this method of data validation for quite some time, but it's certainly an effective way to improve the accuracy of data input, particularly when coupled with some programmatic validation at the point of input.

    Regards,

    Liam


  • Registered Users Posts: 1,393 ✭✭✭Inspector Gadget


    These sorts of things are fine (in theory) right up to the point where you have to decide which data you're going to accept and which you're going to discard (assuming both people enter data into the same fields). Unless there's some hard-and-fast rule for this, or you reject both (though this is rarely approved as an idea, from what I've seen), this will be a headwrecker...

    Gadget


  • Registered Users Posts: 1,193 ✭✭✭liamo


    To my mind, the only way that it can be resolved is for a supervisor to make a decision about which is correct by manually checking the data that has been input against the original records, or for the operators to re-input that record (or edit the original record) until the two records match.

    I would imagine that the data doesn't have to be entered in parallel, and that the batch is not committed until all records entered by operator A agree with the corresponding records entered by operator B.

    If this is the case, you could have the situation where a single operator enters the batch twice which would have the same effect as two operators entering the batch once each.


    Regards,

    Liam


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    krinpit wrote:
    Yes, there is a strong business case for the requirement, as the data is largely numeric in nature. Users will enter the same set of data at different times - one session each.

    Ah yes, the good ol' double-entry system. An employer I used to work for used it on application-form entry.

    I'm not convinced as to its efficacy, but that's neither here nor there...
    At the moment, I'm thinking along the lines of saving all the data entered by each user into an XML file, and comparing the two XML files at the end.
    I can't see the reason behind this. You're effectively suggesting that you use an XML file as a second database. Why not use the database you already have, complete with all the transactionality etc.?
    As only one set of reconciled data will actually be committed to the database,

    I would have said that there should be three sets of data committed to the database to implement this system properly - what the first user entered, what the second user entered, and what the "final" values are. As I pointed out, while you suggest saving the first two sets of this in XML files, you're really saying "I'll save the first two sets in two XML-file-based databases, and the third set in a relational database". Does having separate databases really gain you anything?

    So I'll continue on the assumption that it'll all go into the main db.

    Whether you store those "vertically" in a single table (1-3 rows per entry, with distinguishing flags), horizontally in a single table (1 row with 3 sets of fields...ugly, but sometimes used), or have multiple tables with the same schema...that's up to you.
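    To make the "vertical" option concrete, something along these lines would do (a sketch, assuming SQL Server; the table and column names are illustrative only):

    ```csharp
    using System.Data.SqlClient;

    public static void CreateStagingTable(string connStr)
    {
        // 1-3 rows per logical record, distinguished by a pass flag:
        // 1 = first entry, 2 = second entry, 3 = ultimate value.
        const string ddl = @"
            CREATE TABLE EntryStaging (
                BatchId    int           NOT NULL,
                RecordKey  int           NOT NULL,  -- ties the passes together
                PassNo     tinyint       NOT NULL,  -- the distinguishing flag
                Amount     decimal(18,2) NULL,      -- the value being double-entered
                EnteredBy  varchar(50)   NOT NULL,
                EnteredAt  datetime      NOT NULL,
                PRIMARY KEY (BatchId, RecordKey, PassNo)
            )";

        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(ddl, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
    ```

    A primary key across (BatchId, RecordKey, PassNo) also stops the same pass being saved twice for a record.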

    You can get away with storing less data, but I would say that you want all three for the following reasons:

    1) Because the data is entered at two different times, the "first" copy needs to be saved in order to be compared against the second. So, first-entry is committed to the db.

    2) After the second person enters their data, you *could* compare it to the first (saved) set dynamically, resolve all issues, and overwrite, but I would be inclined to say you should store the "second-entry" values *and* the "ultimate" values at this point. This will be covered in point 3.

    3) Periodically (daily?) you should generate a report of all information where first- and second-entries did not match, and have this data either manually or statistically analysed (see the sketch below). For manual analysis, someone should take the data (or some random subset) which had differing entries and verify that the correct "ultimate" value was chosen (thus giving double-corrected double-entry). Statistically, one should check that it's not the case that the second-entered value is vastly more likely to be chosen as the ultimate value.

    Why? Because as operators get used to the system, what tends to happen is that the person entering the data the second time round gets notified of a discrepancy, and tends to just say "Oh, I know I entered that right" and selects their value to be the correct one without verifying that it is....thus negating the entire point of the double-entry.
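    For what it's worth, the report in point 3 falls out of a simple self-join on a vertical staging table like the sketch above (again, every name here is illustrative):

    ```csharp
    using System.Data;
    using System.Data.SqlClient;

    // Daily mismatch report: every record where pass 1 and pass 2 disagree.
    public static DataTable LoadMismatches(string connStr, int batchId)
    {
        const string sql = @"
            SELECT a.RecordKey,
                   a.Amount AS FirstEntry,
                   b.Amount AS SecondEntry
            FROM EntryStaging a
            JOIN EntryStaging b
              ON a.BatchId = b.BatchId AND a.RecordKey = b.RecordKey
            WHERE a.BatchId = @batch
              AND a.PassNo = 1 AND b.PassNo = 2
              AND (a.Amount <> b.Amount
                   OR (a.Amount IS NULL AND b.Amount IS NOT NULL)
                   OR (a.Amount IS NOT NULL AND b.Amount IS NULL))";

        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@batch", batchId);
            DataTable report = new DataTable("Mismatches");
            new SqlDataAdapter(cmd).Fill(report);
            return report;
        }
    }
    ```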

    If you think about it, a diligent data-enterer will verify each field that they enter themselves. Why don't they? Because it's faster not to, and they "know" they haven't made mistakes. Why will they be any different when told that their entry clashes with someone else's? "Obviously" the someone else made the mistake, because they still "know" that they are correct. This is why I'm dubious about the benefits of implementing the system (and this was the problem experienced where I previously worked), but as long as you're aware of the issue...

    So yeah...for post-entry analysis etc. it is desirable to have both sets of data as originally entered, plus the ultimate value. Let's not forget that User1 could enter my name as 'donley', User2 as 'donkey', and the ultimate value is actually stored as 'bonkey' after User2 noticed that there was something up. Over time, storing all of these values allows us to build up useful statistical information on how well the system is working....and let's face it, if you're gonna implement a system to reduce errors, you have to want metrics on error-rates.
    I don't see any possibility of data pollution. In fact, quite the contrary :confused:
    I don't see an issue with data-pollution either....as long as the physical process is well-defined and understood at development time.

    My bigger concern is that this will add a not-insignificant amount of time to the development cycle, and will also reduce overall data-input speed without offering any long-term benefit in accuracy.

    jc


  • Closed Accounts Posts: 2,046 ✭✭✭democrates


    What Bonkey said.
    A third person could validate, using a page that presents non-matching records and prompts the validator to accept either value or enter an alternative. If there is a pool of people doing the job, they should be rotated around the two roles on a rota.

    The metrics are key - if 80% of errors can be demonstrated as coming from one person, it may be possible to sort it. I once maintained such a system written in Mantis on a mainframe with an IMS db. Bonuses were linked inversely to error rates.


  • Registered Users Posts: 450 ✭✭krinpit


    Wow thanks guys, I really appreciate your advice (bonkey, that must have taken you ages :) )

    I agree that the double-entry method may not be the most efficient use of either my time, or the end-user's time, but it has been requested, so I do have to evaluate whether it will be worth implementing. If not, we can always shift it to a later version of the software. In the meantime, I still personally find it quite an interesting concept, so let's continue....

    I suppose it does appear that I'm over-pushing the XML thing alright. To put it in context, I'm using .NET, which provides a method of storing Datasets in a standard XML format, thus making the XML approach initially appealing. It means that I can temporarily write two separate XML files to a shared location and re-import them into a .NET dataset with minimal coding. I could then leverage the power of the programming language to perform logical comparisons between the two datasets. However, I will certainly consider the pure relational database approach, seeing as SQL has so much to offer in terms of data comparison. As I'm writing this out, I'm thinking of various combinations of SELECT and UNION to provide the matching data and NOT IN to provide the exceptions.
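    To show what I mean, the round-trip really is minimal (a sketch only; the shared path is just a placeholder):

    ```csharp
    using System.Data;

    public static void SaveAndReloadSession(DataSet session, string path)
    {
        // e.g. path = @"\\shared\entry\user1-session.xml" (placeholder location).
        // WriteSchema keeps the column types intact across the round-trip.
        session.WriteXml(path, XmlWriteMode.WriteSchema);

        DataSet reloaded = new DataSet();
        reloaded.ReadXml(path);   // the inline schema is read back too

        // ...from here, compare 'reloaded' against the other user's session.
    }
    ```

    The WriteSchema flag matters - without it, ReadXml has to infer the column types.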

    As for physically resolving the discrepancies, I intend to provide some kind of exceptions report, which would be presented to what you would refer to as "a supervisor", while all valid entries are accepted immediately.

    Let me mull this one over. I'll get back with any further thoughts, and I'll let you know about the final implementation plan (if it goes ahead of course :o ).

    Please do post back with any additional comments.


  • Registered Users Posts: 15,443 ✭✭✭✭bonkey


    krinpit wrote:
    bonkey, that must have taken you ages :)
    Not so long...
    To put it in context, I'm using .NET, which provides a method of storing Datasets in a standard XML format, thus making the XML approach initially appealing. It means that I can temporarily write two separate XML files to a shared location and re-import them into a .NET dataset with minimal coding.
    You can indeed.

    But as I said...there are (good) reasons why you want to keep first-entry & second-entry data as well as the ultimate values. I don't think you're disagreeing with this point, so the question then becomes - should we use a different type of persistence for first- and second-entry data than the persistence used for ultimate values and everything else?

    So far, I can't see much (if any) benefit in storing the data in two separate ways. You have to code the DB store/retrieve stuff anyway for the "ultimate" values, and the change from this to storing "first/second/ultimate" should be tiny - a table-name or a flag-value is pretty much all that will be different.

    Database drivers take care of so much for you. They resolve multi-user access/locking issues. They can be transacted effortlessly. They allow for highly-granular security. All of this can be achieved with files, but generally you have to write it yourself.
    I could then leverage the power of the programming language to perform logical comparisons between the two datasets.
    True, but if you think about it from a slightly different perspective, this is true regardless of where the datasets have been loaded from, so it's not an argument in favour of XML-persistence over DB-persistence. It's an argument for using Datasets over hand-rolled object models or other in-memory storage mechanisms.

    I'd agree with the use of Datasets...I'm just not convinced that persisting to/from XML brings anything to the table.
    However, I will certainly consider the pure relational database approach, seeing as SQL has so much to offer in terms of data comparison. As I'm writing this out, I'm thinking of various combinations of SELECT and UNION to provide the matching data and NOT IN to provide the exceptions.
    Indeedy. Or you can stick with the Dataset functionality if you're more comfortable with that / it's more suitable to what you're doing.
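    If you do stay in Dataset-land, the comparison itself is only a loop. A sketch, assuming both tables carry the illustrative RecordKey/Amount columns from earlier in the thread:

    ```csharp
    using System.Data;

    // Compare two in-memory tables keyed on RecordKey; return the first-entry
    // rows whose Amount disagrees with (or is missing from) the second set.
    public static DataTable FindMismatches(DataTable first, DataTable second)
    {
        second.PrimaryKey = new DataColumn[] { second.Columns["RecordKey"] };

        DataTable mismatches = first.Clone();   // same schema, no rows
        foreach (DataRow row in first.Rows)
        {
            DataRow other = second.Rows.Find(row["RecordKey"]);
            if (other == null || !object.Equals(row["Amount"], other["Amount"]))
                mismatches.ImportRow(row);
        }
        return mismatches;
    }
    ```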

    jc

