
Import/Export from application using SQL Scripts

  • 08-10-2016 8:37pm
    #1
    Registered Users Posts: 8,197 ✭✭✭


    Hi,

    I've recently started a new job and have been asked to look into how the application could use postgres to export the data in the app (held in tables) out to the file system.

    I've never written anything related to databases before so I'm a bit stumped as to what I need to consider or think about.

    At this stage I just want to know what type of things I need to consider.

    The brief I've been given is very sketchy at the minute as it is still at concept stage.

    What do I need to be looking into and is there anything different to be done for large items (BLOBs) or the smaller tables?


    Thanks.


Comments

  • Registered Users Posts: 6,501 ✭✭✭daymobrew


    There are bound to be a number of examples of "export to SQL" for postgres.
    If it was in MySQL I'd be looking at the phpMyAdmin source code.

    For postgres there seems to be an equivalent: phpPgAdmin.
    There are probably more.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Three simple questions:
    What do you have to export?
    In what format is the exported data required?
    What is your skillset (Bash, scripting languages, Python etc)?

    If it is a simple text export, then it should be quite easy. If it requires in-database formatting of the data first, it could be somewhat more complicated.

    There are all sorts of ways that data can be exported from databases. Perhaps if you Googled it, it could help.
    https://www.postgresql.org/docs/9.0/static/sql-copy.html
    http://www.postgresqltutorial.com/export-postgresql-table-to-csv-file/
    https://wiki.postgresql.org/wiki/COPY
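
    As a rough illustration of the COPY approach from those links (the table, database and file names here are just placeholders):

    # COPY ... TO writes a file on the database server;
    # psql's \copy writes to the client machine instead
    psql -d mydb -c "COPY my_table TO '/tmp/my_table.csv' WITH (FORMAT csv, HEADER);"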

    Also what exactly do you mean by "how the application could use postgres to export the data in the app (held in tables) out to the file system"? Are you talking about a separate program to Postgres or a Postgres database? Have you any experience with database work that could be useful?

    "The brief I've been given is very sketchy" This is the most worrying thing. :)

    Regards...jmcc


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Hi,

    At the minute I'm not doing any of the coding for this, just putting together some info for discussion - thankfully.

    My understanding (as a new starter) is that most of the data will be exported as text, but there will be larger data sets which will need to be exported as BLOBs.

    I've been told that they want the data to be extracted as SQL scripts - which I assume means something like the first reply here: Stackoverflow.
    Also what exactly do you mean by "how the application could use postgres to export the data in the app (held in tables) out to the file system"? Are you talking about a separate program to Postgres or a Postgres database? Have you any experience with database work that could be useful?

    That is a direct quote from the boss. The data is currently held in a postgres database and they want a discussion on how best to extract the data from this database into the file system.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Hi,

    At the minute I'm not doing any of the coding for this, just putting together some info for discussion - thankfully.
    This is where most of the problems are created by accident. You really need a good specification of what is to be extracted, the format in which it is to be extracted (csv, tsv, binary file etc) and what the data is to be used for eventually. You've got to get this part right from the start or it will create a lot of problems down the road.
    My understanding (as a new starter) is that most of the data will be exported as text, but there will be larger data sets which will need to be exported as BLOBs.
    Ask as many questions as you can think of in order to narrow the process down.
    I've been told that they want the data to be extracted as SQL scripts - which I assume means something like the first reply here: Stackoverflow
    If it is as SQL scripts, then it is somewhat different to data extraction. What you might be able to do is to create temporary tables and dump these instead. A table dump is, at least with MySQL, a table creation statement followed by a set of INSERT statements. I use MySQL on a daily basis with large datasets (the internet domain name system, domain statistics and IP addresses) and sometimes it is necessary to build SQL queries from within a database using the CONCAT statement and then dump these to an external SQL text file which can be run later or on another database.
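
    In Postgres terms, a rough sketch of both of those ideas (the table, column and database names below are made up): pg_dump's --inserts option produces exactly that CREATE-plus-INSERT style of script, and SQL can be generated from inside the database with format() or || where MySQL would use CONCAT.

    # dump one table as a plain SQL script of CREATE TABLE + INSERT statements
    pg_dump --table=my_table --inserts -f my_table.sql mydb

    # generate INSERT statements from within the database and write them to
    # a text file which can be run later or on another database
    psql -At -d mydb -c "SELECT format('INSERT INTO archive VALUES (%L, %L);', id, name) FROM my_table;" > inserts.sql
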
    That is a direct quote from the boss. The data is currently held in a postgres database and they want a discussion on how best to extract the data from this database into the file system.
    Ask him/her what they intend doing with the data. Extracting data to the file system is rarely an end in itself as the format has to be compatible with future use. The trick is to make the process as simple as possible and as robust as possible.

    Regards...jmcc


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Hi, thanks for the reply. Yes, it definitely is SQL scripts. I've rechecked my notes and it's definitely that method they want.

    The data is to be extracted in order to share it with other instances of the application.


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    Lots of good advice already so I'll just add something nobody has mentioned yet.

    If your SQL skills are not great you might consider using an ETL [Extract Transform Load] tool, which is a common choice when creating interfaces between two DB-based systems. Two popular ETL applications that I've used with Postgres without issues are Pentaho's Spoon and Talend (which is an Eclipse-based application). Both are open source tools (like Postgres), though if you want to spend money then both come in commercial versions too!

    The Pentaho solution is actually a suite of tools: Spoon is what they call the GUI job/task designer/builder application, while Kettle is what you would deploy it to. It sounds a little complicated but actually it's very straightforward!

    As others mentioned, the details of what your goal is seem a little vague yet, but if, when you get to the details, it really is a straightforward dump then these might be unnecessary, though they would still work.

    For Postgres admin I've used pgAdmin for the last 15 years with no problems.


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Yes, it has been informative.

    I'm only at the stage of thinking about things. I'm just trying to note down a few things for everyone to consider at this stage of the design.

    The bit that I'm struggling with is deciphering what to think about here:
    how the application could use postgres to export the data in the app (held in tables) out to the file system.

    To repeat from above, the data is to be exported from the app using postgres. The data is to be exported using SQL scripts. Most data will be text while a minority will be BLOBs. From this, I assume it will be in binary format. The reason for exporting is so that it can be shared with other instances of the application.

    However, what I initially had drafted was methods for storing the data in postgres - not how the app might use postgres to export the table.

    What am I missing here? I'm trying to avoid continuous questioning of the manager, but I'm struggling to decipher this on my own.


  • Registered Users Posts: 6,150 ✭✭✭Talisman


    pg_dump can be used to dump a database into a script file without taking the database offline.

    If you want to extract elements of the database for import into other applications, pgclimb can export data to a number of formats including JSON and XML.
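
    For example, a plain-format dump is itself just a runnable SQL script (the database and file names are placeholders):

    # plain-format dump: the output file is an ordinary SQL script
    pg_dump -Fp -f export.sql mydb

    # load it into another database by running the script
    psql -d other_db -f export.sql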


  • Registered Users Posts: 23,212 ✭✭✭✭Tom Dunne


    To repeat from above, the data is to be exported from the app using postgres. The data is to be exported using SQL scripts. Most data will be text while a minority will be BLOBs. From this, I assume it will be in binary format. The reason for exporting is so that it can be shared with other instances of the application.

    What am I missing here? I'm trying to avoid continuous questioning of the manager, but I'm struggling to decipher this on my own.

    I think what you need to be clear on is whether getting the data out of the tables is an actual piece of functionality within that application, or whether you are required to access the database, independent of the application, and get the data out for another instance of the application.

    In other words, is there a button within a form/screen on the application that says "Export data" or "Export to CSV"? Or are you firing up a command line/GUI and getting the data out to get it into another instance of the application?

    It sounds like you are doing the latter.


  • Registered Users Posts: 10,615 ✭✭✭✭28064212


    I've been told that they want the data to be extracted as SQL scripts
    The data is to be exported using SQL scripts.
    There's a huge difference between "as SQL scripts" and "using SQL scripts":
    • For the former, the final result will be a script that can be run on another SQL database to recreate the data
    • The latter could mean one of two things:

    You really need to figure out what they want, not how to do it. That comes later.


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Sorry - that was a slip. Yes - as SQL scripts.

    It will be activated via a button on the GUI:
    getting the data out of the tables is an actual piece of functionality within that application

    ^This is what I'm looking into.


  • Registered Users Posts: 10,615 ✭✭✭✭28064212


    So:
    1. the user clicks the button
    2. a file dialog opens up, they select a location and filename, and click save
    3. a SQL script is saved to the user's computer

    What's the next step? You say it's to be used with another instance of the application. Does that mean each instance has its own local database? Will the structure of each database be exactly the same? Will the user be importing this file?

    If so, using SQL scripts as a data store is an incredibly bad idea. What if a malicious user creates a sql script with `DROP DATABASE criticalInfo`? Or what if the database structure changes? All previous exports become useless straightaway. Unless there's an extremely good reason that you haven't covered yet, the way to go is to export to a standard file format (XML/JSON/CSV etc).
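
    If a standard format does turn out to be acceptable, Postgres can produce CSV or JSON directly; a rough sketch with placeholder table/database names:

    # CSV export on the client side via psql's \copy
    psql -d mydb -c "\copy my_table TO 'my_table.csv' WITH (FORMAT csv, HEADER)"

    # one JSON object per row using row_to_json
    psql -At -d mydb -c "SELECT row_to_json(t) FROM my_table t;" > my_table.json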


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Next step (eventually) will be
    1. the user clicks a button
    2. a file dialog opens up, they select the location and filename of the SQL scripts, and click upload
    3. the data is injected into the app on the user's computer

    I'm not sure why he wants this format. I've heard them previously talking about XML etc. But he definitely said scripts this time. Not sure why. Back in work today, so will nail this down.


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Update. The data to be exported resides in the postgres database as postgres tables and/or on the file system.
    I need to consider how best to export the mixed bag of data.

    Some items are all in either one or the other and some are held across both.

    I need to think about what is the best way to undertake this export and have it in a state ready for import, when required.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    the data is injected into the app on the user's computer
    Make sure that the app writer is in on this otherwise you will get the bullet when the import process goes wrong. You really need to CYA on this and get the specification signed off first.

    Regards...jmcc


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Thanks. Any tips on what I should be considering here? Lots of folk are off on holiday at the minute.
    The design will be a joint effort; I'm just putting together some points for discussion and for managers.

    To clarify, we have a 'repository' which stores data for a number of subsystems. This data is stored within a combination of databases and file systems. We want to export this data out into a file system as SQL scripts.

    What is being asked at this stage is:
    • An outline of a few possible methods to discuss
    • Risks associated with exporting from two sources (DB + file system => file system)
    • Alternatives to SQL scripts (e.g. XML, JSON)
    • Risks associated with SQL scripts


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Okay, things have moved on. I've now been asked to look at the possibility of storing the export in a zip file, specifically to see whether it is worthwhile. Is there any reliable source of data on compression times versus file size, or similar?

    Also wondering about whether to use checksums to validate the extracted data. I've found a website that says a 2% performance hit can be expected, but I really don't want to quote a random person on the net.

    I also need to estimate the export time for a dataset. How is this best done?
    We will be using pg_dump. Can the tool estimate this?

    I need some error handling to check for deletion or modification of a dataset whilst exporting. I assume I want to lock it whilst exporting. Can this be done, or what are the alternatives?


  • Registered Users Posts: 6,150 ✭✭✭Talisman


    Okay, things have moved on. I've now been asked to look at the possibility of storing the export in a zip file, specifically to see whether it is worthwhile. Is there any reliable source of data on compression times versus file size, or similar?

    Also wondering about whether to use checksums to validate the extracted data. I've found a website that says a 2% performance hit can be expected, but I really don't want to quote a random person on the net.

    I also need to estimate the export time for a dataset. How is this best done?
    We will be using pg_dump. Can the tool estimate this?

    I need some error handling to check for deletion or modification of a dataset whilst exporting. I assume I want to lock it whilst exporting. Can this be done, or what are the alternatives?
    You can compress the output of the pg_dump command by adding the -Z parameter. The minimum level of compression will compress the output significantly while adding about 30 seconds to the export time. The -Fc parameter will allow the data to be restored in parallel jobs.

    pg_dump -Z1 -Fc

    Increasing the compression level can significantly increase the export time.
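
    Roughly what a full invocation might look like (the file and database names are placeholders):

    # custom-format archive with light compression
    pg_dump -Fc -Z1 -f backup.dump mydb

    # custom-format archives can be restored with parallel jobs
    # (the target database must already exist)
    pg_restore -j 4 -d target_db backup.dump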

    There's no way to estimate the export time because it depends on a number of environment factors.

    pg_dump runs in a serializable transaction so it makes a consistent snapshot of the database.

    The best advice I can give you is to perform a number of backups to give yourself some real numbers.
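
    If it helps, the simplest way to collect those numbers is to time a few dumps at different compression levels (the database and file names are placeholders):

    # time dumps at a few compression levels and compare the file sizes
    for level in 0 1 5 9; do
        echo "compression level $level"
        time pg_dump -Fc -Z$level -f "backup_z$level.dump" mydb
        ls -lh "backup_z$level.dump"
    done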


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Can I get pgadmin to give me a time for exporting?
    I can't see how. I'm running v1.20.0 and PostgreSQL 9.4.

    Is there any way to lock a dataset so that other users cannot modify/delete it when it is being exported?

    Is the 30 seconds a fixed amount of time added by the compression, or is that figure based on a certain size of export? Do you have a source for these figures?

    Thanks for the replies. Very much appreciated. I'm learning a lot from this.

    Thanks.


  • Registered Users Posts: 6,150 ✭✭✭Talisman


    No tool can estimate the time it will take to perform the task. It is dependent on factors such as the number of processors, the processor speed, the disk I/O speeds, available RAM, available disk space, the system load and other parameters such as the number of tables, indexes and key constraints. There is no magic formula.

    The 'about 30 seconds' is based upon my own experience of backing up Postgres databases. The largest database I have had to dump was just over 20GB on the disk. It took just under 7 minutes to perform the task. With the minimal compression setting the task completed in around 7 minutes and 30 seconds. Using the maximum compression setting the task completed in 30~35 minutes and the resulting export was slightly over 1GB smaller than that created with the minimal compression setting.

    Have a look at the figures provided in this blog post: Using compression with PostgreSQL’s pg_dump
    command           dump time      file size
    pg_dump            4m22.654s       13G
    pg_dump -Fc       11m12.443s       4.6G
    pg_dump -Z0 -Fc    4m18.742s       14G
    pg_dump -Z1 -Fc    5m0.498s        5.4G
    pg_dump -Z2 -Fc    5m24.376s       5.2G
    pg_dump -Z3 -Fc    6m20.513s       5.1G
    pg_dump -Z4 -Fc    6m46.074s       4.9G
    pg_dump -Z5 -Fc    8m22.397s       4.7G
    pg_dump -Z6 -Fc   11m18.154s       4.6G
    pg_dump -Z7 -Fc   14m21.447s       4.5G
    pg_dump -Z8 -Fc   25m15.000s       4.5G (45MB smaller)
    pg_dump -Z9 -Fc   26m40.550s       4.5G (1.8MB smaller)
    

    In this case, the minimal compression setting added ~40 seconds to the task.

    The pg_dump task is a transaction so the data export will only reflect the changes of transactions that have completed before the task was executed.


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Thanks for that table. Adding the -Z1 option seems to be a worthwhile trade-off. Most of our data ranges from a few KB to a couple of GB.

    Does the shared lock provide protection against someone trying to delete a dataset during an export? From reading the notes it would seem so. Is this correct?


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    Thanks for the feedback. It is very useful stuff for me.


  • Registered Users Posts: 8,197 ✭✭✭funkey_monkey


    I've manually timed some backups from pgadmin to get a rough idea of the times involved.

    I've got a website which states checksums give a 2% performance hit. Would you guys concur with that? I can't find any other evidence to confirm or contradict it.
    The website was: http://blog.endpoint.com/2015/12/postgres-checksum-performance-impact.html

    Can anyone back this up or counter it?


  • Registered Users Posts: 6,150 ✭✭✭Talisman


    Does the shared lock provide protection against someone trying to delete a dataset during an export? From reading the notes it would seem so. Is this correct?
    Provided you're not doing anything crazy outside of the backup job, such as deleting/recreating indexes, then the backup will be a clean snapshot of the database.
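
    As I understand it, pg_dump takes an ACCESS SHARE lock on each table it dumps, so a DROP TABLE (which needs an ACCESS EXCLUSIVE lock) issued during the export will simply wait for the dump to finish rather than pulling the table out from under it. You can see the blocked request while a dump is running (the table and database names are placeholders):

    # issued while a pg_dump of mydb is running - this blocks until the dump finishes
    psql -d mydb -c "DROP TABLE my_table;"

    # from another session, the waiting lock request shows up as not granted
    psql -d mydb -c "SELECT locktype, relation::regclass, mode, granted FROM pg_locks WHERE NOT granted;"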


  • Registered Users Posts: 6,150 ✭✭✭Talisman


    I've manually timed some backups from pgadmin to get a rough idea of the times involved.

    I've got a website which states checksums give a 2% performance hit. Would you guys concur with that? I can't find any other evidence to confirm or contradict it.
    The website was: http://blog.endpoint.com/2015/12/postgres-checksum-performance-impact.html

    Can anyone back this up or counter it?
    Interesting article - once again this is something that is dependent on the environment.

    The effect on inserts should be minimal - the data has to be written to disk anyway. It's a relatively slow operation so streaming the data through a checksum calculator will have little/no effect on a system with an old school spinning disk.

    Databases are designed to read (SELECT) data quickly, adding a checksum calculator to the task is going to slow down the process.

    If the database needs to do lots of SELECTs of data that isn't cached in buffers then you would expect to see a performance hit. A checksum calculation is a mathematical function of the bytes being summed, in this case the data being read from the disk. Creating a checksum for 1MB of data should always take roughly the same amount of time, as there are 8 bits in every byte regardless of their content.

    If the data is cached then a checksum calculation is not required before returning the SELECT results. It stands to reason that if there is a large cache, then fewer checksum calculations will be required and the performance impact is reduced.
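
    That article is about Postgres's internal data-page checksums. If the aim here is just to validate the exported files themselves, a plain file checksum after the dump is cheap and easy to verify on the receiving side (the file names are placeholders):

    # checksum the export once pg_dump completes
    sha256sum backup.dump > backup.dump.sha256

    # verify on the machine doing the import before loading the data
    sha256sum -c backup.dump.sha256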

