Large Table/Single Query vs Many Tables/Queries

jmcc · 17-06-2013 09:08PM #1

I've been working on a new country level web directory that has to be scalable to millions of websites. The theory says that minimising the number of tables and the numbers of queries to produce a page is the best strategy. The ideal theoretical solution is a single table with one query for the data needed to build the web page. However building that large single table means integrating the data from about 21 other tables. Most of the data is simple but the data from three tables requires complex and expensive queries to integrate with the single table. (Effectively a single index.) Would a less than optimal solution of keeping the data which requires the complex queries to integrate in separate tables (Multiple indexes.) and integrate the lot at the presentation level in PHP (where the queries on the complex data tables might be simpler and less expensive) or would it be best to integrate it as a single table and then use this for the main website? Has anyone run across a similar situation before?

Regards...jmcc

colm_c · 17-06-2013 09:42PM

Theory vs real-world can vary when you get specific and scale.

I would run some simple benchmarks, and that will easily verify which approach will work best. Be sure to do this with enough sample data to make the benchmarks as real as possible.

Also depending on how fresh you need to keep things the caching the query results will have big benefits too. Not to mention application and web server caching.

jmcc · 17-06-2013 09:54PM

colm_c wrote: »

Theory vs real-world can vary when you get specific and scale.

I know.

There are some points beyond which things get a bit counter-intuitive.

I would run some simple benchmarks, and that will easily verify which approach will work best. Be sure to do this with enough sample data to make the benchmarks as real as possible.

Well there's more than enough data to run the benchmarks. It will run on MySQL which means that the main issue is the indexes as the tables are just data on the disk. The multi-table approach means more indexes and more RAM.

Also depending on how fresh you need to keep things the caching the query results will have big benefits too. Not to mention application and web server caching.

That's a critical issue. The rebuild of the large table takes some time even on a 300K website test dataset. However using the multi-table model with a limited large table, the build is a lot faster. The frontend is easier because it it is simple to scale horizontally with MySQL/PHP.

Regards...jmcc

Giblet · 17-06-2013 11:04PM

This type of stuff you really need a denormalised view, whether that's a cached view, database view or document representing the aggregation doesn't matter. Scalability is really about not hitting the database often, if at all.

jmcc · 17-06-2013 11:48PM

Giblet wrote: »

This type of stuff you really need a denormalised view, whether that's a cached view, database view or documentinng represent the aggregation doesn't matter.

More like a demoralised view at the moment. The generation time on the single table is not easily scaled and it could be a show stopper for this approach.

Scalability is really about not hitting the database often, if at all.

It is actually more often about hitting the database with fast, inexpensive queries and caching the unchanging/slowly changing results from frequent queries.

Regards...jmcc

Colonel Panic · 18-06-2013 12:03AM

Why would there by multiple queries? Are joins really slow in MySQL?

I would tend to agree with Giblet about trying to not hit the database. I've had a lot of success with memcached for the sort of thing you describe.

jmcc · 18-06-2013 02:14AM

Colonel Panic wrote: »

Why would there by multiple queries? Are joins really slow in MySQL?

On a single table, it would be one query. By using one less large table with the complex data in other tables, it might be possible to keep almost the same speed by using PHP to do some of the processing.

I would tend to agree with Giblet about trying to not hit the database. I've had a lot of success with memcached for the sort of thing you describe.

The other advantage of a single table is that each page would only require one query and the index could be compressed.

Regards...jmcc

croo · 18-06-2013 07:54AM

How often does the underlying data change?

if it were infrequent and the updates do not need to appear immediately, they you could generate the "query" table on a periodic basis... a kind of simplified datamart.

Querying a million records -even from multiple tables- shouldn't be that much of a burden and there are probably a lot of DB tuning activities you could undertake before going down the route of pre-processing the data purely to improve performance.

Colonel Panic · 18-06-2013 09:54AM

jmcc wrote: »

On a single table, it would be one query. By using one less large table with the complex data in other tables, it might be possible to keep almost the same speed by using PHP to do some of the processing.

The other advantage of a single table is that each page would only require one query and the index could be compressed.

Regards...jmcc

But... you can send a single query to MySQL that would let you join multiple tables together. There are arguments for denormalised tables. There are very few arguments for querying multiple tables from PHP and churning through the results to join everything up in code when that's the whole point of SQL.

I guess I'm just thinking, this would be so easy to profile if you just tested with denormalised and normalised data.

What about the memcached suggestion?

stevenmu · 18-06-2013 10:15AM

If you can denormalise your data is there a good argument here for a NoSQL solution?

I don't buy into the idea that NoSQL can completely replace relational databases, but this seems like a case where it may work well.

jmcc · 18-06-2013 03:33PM

croo wrote: »

How often does the underlying data change?

Theoretically, monthly. However given the way that websites are hacked (approx 780 in the Irish webscape at the moment), it must be possible to remove sites quickly. Also thousands of websites will drop in any given month and thousands more will appear as new websites. On the UK set of sites, at least 50K sites drop and another 60K or so appear.

Querying a million records -even from multiple tables- shouldn't be that much of a burden and there are probably a lot of DB tuning activities you could undertake before going down the route of pre-processing the data purely to improve performance.

Millions. And these queries could be occurring on anything from a single query per second up to hundreds a second. It depends on whether users and search engine spiders collide. That's what this schema has to be capable of handling.

Regards...jmcc

Sparks · 18-06-2013 03:47PM

Just as a tangential thought, you could always throw money at the problem (ie. buy more servers (whether physical or virtual), set up a MySQL cluster and shard the denormalised monster table over them)...

jmcc · 18-06-2013 03:50PM

Colonel Panic wrote: »

But... you can send a single query to MySQL that would let you join multiple tables together. There are arguments for denormalised tables. There are very few arguments for querying multiple tables from PHP and churning through the results to join everything up in code when that's the whole point of SQL.

Using highly denormalised tables is a good approach for very large websites. As I said, some things at very large scale levels are counter-intuitive. Updates, loading data and rebuilding indexes all come into play at the very large site level so what might seem like a good solution on a table with a few hundred or even a million rows, becomes almost unusable at the very large site level. (Some of the newer tables on the main site have over 50 million rows and it uses a highly denormalised schema. It also serves approximately 300K pages a day. I tested it and it held up well. It is not deployed yet. ) The main data is the history of domain names and it involves about 365 million current and deleted domains. So an individual domain might only get a few queries per year.

I guess I'm just thinking, this would be so easy to profile if you just tested with denormalised and normalised data.

I guess it is back to A/B testing for the moment. I'm still inclined to go with a denormalised approach.

What about the memcached suggestion?

Definitely worth considering but it would be a bit premature as I haven't finalised the schema.

Regards...jmcc

jmcc · 18-06-2013 03:52PM

stevenmu wrote: »

If you can denormalise your data is there a good argument here for a NoSQL solution?

I don't buy into the idea that NoSQL can completely replace relational databases, but this seems like a case where it may work well.

Perhaps. But it is one problem at a time and getting up to speed with NoSQL would complicate matters. If MySQL doesn't work, then it is an option.

Regards...jmcc

jmcc · 18-06-2013 04:08PM

Sparks wrote: »

Just as a tangential thought, you could always throw money at the problem (ie. buy more servers (whether physical or virtual), set up a MySQL cluster and shard the denormalised monster table over them)...

That's the expensive way and it is not necessary yet.

This solution has to be horizontally scalable in a cheap and cheerful way. (Round robin DNS with multiple servers not necessarily using a MySQL cluster approach.)

Regards...jmcc

Sparks · 18-06-2013 04:33PM

Well, it is always expensive to throw money at a problem

The balancing act is whether you're throwing more money at it through buying resources like more servers; or whether you're throwing more money at it through increased development time and the losses incurred by increased time-to-market. (but you know this

)

Colonel Panic · 18-06-2013 04:36PM

I know where you're coming from re: heavily denormalised tables. I've done it in the past and just thought "this feels like overkill".

I've stopped worrying about it now and model the data instead and make sure I've got that right before I finalise anything DB related. Then I profile and tune the database as required.

Somewhere in between, there is a balance. A series of trade offs.

Ha, this is a pretty handwavey post, but it's just my own experience with databases and potentially premature optimisation.

jmcc · 18-06-2013 04:40PM

Sparks wrote: »

Well, it is always expensive to throw money at a problem
The balancing act is whether you're throwing more money at it through buying resources like more servers; or whether you're throwing more money at it through increased development time and the losses incurred by increased time-to-market...

The balancing act has to do with walking the bottom line of my bank balance.

A good initial design means that it can, largely, be scaled well horizontally but without the heavily interconnected database backend. The same schema should work well for the Irish webscape (approx 350K sites) and the UK (possibly 8 million or more websites). The US might take a bit more work.

Regards...jmcc

jgh_ · 18-06-2013 07:39PM

Another thought might be that you want to actually have two database servers, one for write and one for read, and have them swap back and forth.

Additionally you probably want to think about how the data will be accessed. Caching is probably the first thing you want to look at, and then probably partition the data depending on how it will be accessed. Maybe keep the most commonly accessed data in a memory cache, then move on to SSD storage, and finally onto regular ol' magnetic storage for infrequently accessed stuff.

I've also heard that breaking up large queries into small ones can allow the database engine to make better use of concurrency, but I can't verify that.

disclaimer: I'm not a big data guy, I'm just throwing out some suggestions that may or may not work for you. Ultimately you have to profile to find what works for you.

stevenmu · 18-06-2013 07:58PM

jmcc wrote: »

Perhaps. But it is one problem at a time and getting up to speed with NoSQL would complicate matters. If MySQL doesn't work, then it is an option.

Regards...jmcc

Yeah valid point. Although ...

I was just thinking, if you qualified for Microsoft's Bizspark program, you get a certain amount of hosting and storage, SQL Azure etc for free, and can add to it at reasonable rates. It is a bunch of different technologies if you are expecting to use PHP/MySQL, but it could be very cost effective way to scale.

Giblet · 19-06-2013 08:54AM

jgh_ wrote: »

Another thought might be that you want to actually have two database servers, one for write and one for read, and have them swap back and forth.

Additionally you probably want to think about how the data will be accessed. Caching is probably the first thing you want to look at, and then probably partition the data depending on how it will be accessed. Maybe keep the most commonly accessed data in a memory cache, then move on to SSD storage, and finally onto regular ol' magnetic storage for infrequently accessed stuff.

I've also heard that breaking up large queries into small ones can allow the database engine to make better use of concurrency, but I can't verify that.

disclaimer: I'm not a big data guy, I'm just throwing out some suggestions that may or may not work for you. Ultimately you have to profile to find what works for you.

You're getting into CQRS there, which is very scalable, but might not be what the OP needs.

jmcc · 19-06-2013 02:40PM

jgh_ wrote: »

Another thought might be that you want to actually have two database servers, one for write and one for read, and have them swap back and forth.

Most of the activity will be read-only. That's the good thing about this type of website. The backend and data generation side of things can be done over a few computers but the webserver is generally a very simple LAMP style server. When it is a read only table, updating is often a lot simpler - just load the new table with a different name and then drop the existing table and rename the new one.

Additionally you probably want to think about how the data will be accessed. Caching is probably the first thing you want to look at, and then probably partition the data depending on how it will be accessed. Maybe keep the most commonly accessed data in a memory cache, then move on to SSD storage, and finally onto regular ol' magnetic storage for infrequently accessed stuff.

Definitely things to consider once the schema is nearly finalised. But at this stage, it is all about getting the best schema that scales well and easily.

I've also heard that breaking up large queries into small ones can allow the database engine to make better use of concurrency, but I can't verify that.

It can vary. On a large scale site, the simpler the query, the faster it can (typically) execute. The other aspect is that you have to look at freezing data which is not being updated. You will end up with two sets of data, a frozen one and a current one. It is, potentially, two queries but the benefit is on update speeds because any updates are only applied to a much smaller current table.

Regards...jmcc

jmcc · 19-06-2013 03:02PM

stevenmu wrote: »

Yeah valid point. Although ...

I was just thinking, if you qualified for Microsoft's Bizspark program, you get a certain amount of hosting and storage, SQL Azure etc for free, and can add to it at reasonable rates. It is a bunch of different technologies if you are expecting to use PHP/MySQL, but it could be very cost effective way to scale.

From MySQL to SQL Server and from PHP to ASP isn't that much of a jump. SQL Server has a steep learning curve though. Who knows - it could be the replacement for Bing?

Regards...jmcc

Large Table/Single Query vs Many Tables/Queries

Comments