Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

Large Table/Single Query vs Many Tables/Queries

Options
  • 17-06-2013 9:08pm
    #1
    Registered Users Posts: 7,410 ✭✭✭


    I've been working on a new country level web directory that has to be scalable to millions of websites. The theory says that minimising the number of tables and the numbers of queries to produce a page is the best strategy. The ideal theoretical solution is a single table with one query for the data needed to build the web page. However building that large single table means integrating the data from about 21 other tables. Most of the data is simple but the data from three tables requires complex and expensive queries to integrate with the single table. (Effectively a single index.) Would a less than optimal solution of keeping the data which requires the complex queries to integrate in separate tables (Multiple indexes.) and integrate the lot at the presentation level in PHP (where the queries on the complex data tables might be simpler and less expensive) or would it be best to integrate it as a single table and then use this for the main website? Has anyone run across a similar situation before?

    Regards...jmcc


Comments

  • Registered Users Posts: 2,031 ✭✭✭colm_c


    Theory vs real-world can vary when you get specific and scale.

    I would run some simple benchmarks, and that will easily verify which approach will work best. Be sure to do this with enough sample data to make the benchmarks as real as possible.

    Also depending on how fresh you need to keep things the caching the query results will have big benefits too. Not to mention application and web server caching.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    colm_c wrote: »
    Theory vs real-world can vary when you get specific and scale.
    I know. :) There are some points beyond which things get a bit counter-intuitive.
    I would run some simple benchmarks, and that will easily verify which approach will work best. Be sure to do this with enough sample data to make the benchmarks as real as possible.
    Well there's more than enough data to run the benchmarks. It will run on MySQL which means that the main issue is the indexes as the tables are just data on the disk. The multi-table approach means more indexes and more RAM.
    Also depending on how fresh you need to keep things the caching the query results will have big benefits too. Not to mention application and web server caching.
    That's a critical issue. The rebuild of the large table takes some time even on a 300K website test dataset. However using the multi-table model with a limited large table, the build is a lot faster. The frontend is easier because it it is simple to scale horizontally with MySQL/PHP.

    Regards...jmcc


  • Registered Users Posts: 11,979 ✭✭✭✭Giblet


    This type of stuff you really need a denormalised view, whether that's a cached view, database view or document representing the aggregation doesn't matter. Scalability is really about not hitting the database often, if at all.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Giblet wrote: »
    This type of stuff you really need a denormalised view, whether that's a cached view, database view or documentinng represent the aggregation doesn't matter.
    More like a demoralised view at the moment. The generation time on the single table is not easily scaled and it could be a show stopper for this approach.
    Scalability is really about not hitting the database often, if at all.
    It is actually more often about hitting the database with fast, inexpensive queries and caching the unchanging/slowly changing results from frequent queries.

    Regards...jmcc


  • Registered Users Posts: 2,022 ✭✭✭Colonel Panic


    Why would there by multiple queries? Are joins really slow in MySQL?

    I would tend to agree with Giblet about trying to not hit the database. I've had a lot of success with memcached for the sort of thing you describe.


  • Advertisement
  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Why would there by multiple queries? Are joins really slow in MySQL?
    On a single table, it would be one query. By using one less large table with the complex data in other tables, it might be possible to keep almost the same speed by using PHP to do some of the processing.
    I would tend to agree with Giblet about trying to not hit the database. I've had a lot of success with memcached for the sort of thing you describe.
    The other advantage of a single table is that each page would only require one query and the index could be compressed.

    Regards...jmcc


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    How often does the underlying data change?

    if it were infrequent and the updates do not need to appear immediately, they you could generate the "query" table on a periodic basis... a kind of simplified datamart.

    Querying a million records -even from multiple tables- shouldn't be that much of a burden and there are probably a lot of DB tuning activities you could undertake before going down the route of pre-processing the data purely to improve performance.


  • Registered Users Posts: 2,022 ✭✭✭Colonel Panic


    jmcc wrote: »
    On a single table, it would be one query. By using one less large table with the complex data in other tables, it might be possible to keep almost the same speed by using PHP to do some of the processing.

    The other advantage of a single table is that each page would only require one query and the index could be compressed.

    Regards...jmcc

    But... you can send a single query to MySQL that would let you join multiple tables together. There are arguments for denormalised tables. There are very few arguments for querying multiple tables from PHP and churning through the results to join everything up in code when that's the whole point of SQL.

    I guess I'm just thinking, this would be so easy to profile if you just tested with denormalised and normalised data.

    What about the memcached suggestion?


  • Moderators, Society & Culture Moderators Posts: 9,689 Mod ✭✭✭✭stevenmu


    If you can denormalise your data is there a good argument here for a NoSQL solution?

    I don't buy into the idea that NoSQL can completely replace relational databases, but this seems like a case where it may work well.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    croo wrote: »
    How often does the underlying data change?
    Theoretically, monthly. However given the way that websites are hacked (approx 780 in the Irish webscape at the moment), it must be possible to remove sites quickly. Also thousands of websites will drop in any given month and thousands more will appear as new websites. On the UK set of sites, at least 50K sites drop and another 60K or so appear.
    Querying a million records -even from multiple tables- shouldn't be that much of a burden and there are probably a lot of DB tuning activities you could undertake before going down the route of pre-processing the data purely to improve performance.
    Millions. And these queries could be occurring on anything from a single query per second up to hundreds a second. It depends on whether users and search engine spiders collide. That's what this schema has to be capable of handling.

    Regards...jmcc


  • Advertisement
  • Registered Users Posts: 40,038 ✭✭✭✭Sparks


    Just as a tangential thought, you could always throw money at the problem (ie. buy more servers (whether physical or virtual), set up a MySQL cluster and shard the denormalised monster table over them)...


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    But... you can send a single query to MySQL that would let you join multiple tables together. There are arguments for denormalised tables. There are very few arguments for querying multiple tables from PHP and churning through the results to join everything up in code when that's the whole point of SQL.
    Using highly denormalised tables is a good approach for very large websites. As I said, some things at very large scale levels are counter-intuitive. Updates, loading data and rebuilding indexes all come into play at the very large site level so what might seem like a good solution on a table with a few hundred or even a million rows, becomes almost unusable at the very large site level. (Some of the newer tables on the main site have over 50 million rows and it uses a highly denormalised schema. It also serves approximately 300K pages a day. I tested it and it held up well. It is not deployed yet. ) The main data is the history of domain names and it involves about 365 million current and deleted domains. So an individual domain might only get a few queries per year.
    I guess I'm just thinking, this would be so easy to profile if you just tested with denormalised and normalised data.
    I guess it is back to A/B testing for the moment. I'm still inclined to go with a denormalised approach.
    What about the memcached suggestion?
    Definitely worth considering but it would be a bit premature as I haven't finalised the schema.

    Regards...jmcc


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    stevenmu wrote: »
    If you can denormalise your data is there a good argument here for a NoSQL solution?

    I don't buy into the idea that NoSQL can completely replace relational databases, but this seems like a case where it may work well.
    Perhaps. But it is one problem at a time and getting up to speed with NoSQL would complicate matters. If MySQL doesn't work, then it is an option.

    Regards...jmcc


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Sparks wrote: »
    Just as a tangential thought, you could always throw money at the problem (ie. buy more servers (whether physical or virtual), set up a MySQL cluster and shard the denormalised monster table over them)...
    That's the expensive way and it is not necessary yet. :) This solution has to be horizontally scalable in a cheap and cheerful way. (Round robin DNS with multiple servers not necessarily using a MySQL cluster approach.)

    Regards...jmcc


  • Registered Users Posts: 40,038 ✭✭✭✭Sparks


    Well, it is always expensive to throw money at a problem :D
    The balancing act is whether you're throwing more money at it through buying resources like more servers; or whether you're throwing more money at it through increased development time and the losses incurred by increased time-to-market. (but you know this :D )


  • Registered Users Posts: 2,022 ✭✭✭Colonel Panic


    I know where you're coming from re: heavily denormalised tables. I've done it in the past and just thought "this feels like overkill".

    I've stopped worrying about it now and model the data instead and make sure I've got that right before I finalise anything DB related. Then I profile and tune the database as required.

    Somewhere in between, there is a balance. A series of trade offs.

    Ha, this is a pretty handwavey post, but it's just my own experience with databases and potentially premature optimisation.


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    Sparks wrote: »
    Well, it is always expensive to throw money at a problem :D
    The balancing act is whether you're throwing more money at it through buying resources like more servers; or whether you're throwing more money at it through increased development time and the losses incurred by increased time-to-market...
    The balancing act has to do with walking the bottom line of my bank balance. :) A good initial design means that it can, largely, be scaled well horizontally but without the heavily interconnected database backend. The same schema should work well for the Irish webscape (approx 350K sites) and the UK (possibly 8 million or more websites). The US might take a bit more work.

    Regards...jmcc


  • Registered Users Posts: 92 ✭✭jgh_


    Another thought might be that you want to actually have two database servers, one for write and one for read, and have them swap back and forth.

    Additionally you probably want to think about how the data will be accessed. Caching is probably the first thing you want to look at, and then probably partition the data depending on how it will be accessed. Maybe keep the most commonly accessed data in a memory cache, then move on to SSD storage, and finally onto regular ol' magnetic storage for infrequently accessed stuff.

    I've also heard that breaking up large queries into small ones can allow the database engine to make better use of concurrency, but I can't verify that.

    disclaimer: I'm not a big data guy, I'm just throwing out some suggestions that may or may not work for you. Ultimately you have to profile to find what works for you.


  • Moderators, Society & Culture Moderators Posts: 9,689 Mod ✭✭✭✭stevenmu


    jmcc wrote: »
    Perhaps. But it is one problem at a time and getting up to speed with NoSQL would complicate matters. If MySQL doesn't work, then it is an option.

    Regards...jmcc

    Yeah valid point. Although ... :)

    I was just thinking, if you qualified for Microsoft's Bizspark program, you get a certain amount of hosting and storage, SQL Azure etc for free, and can add to it at reasonable rates. It is a bunch of different technologies if you are expecting to use PHP/MySQL, but it could be very cost effective way to scale.


  • Registered Users Posts: 11,979 ✭✭✭✭Giblet


    jgh_ wrote: »
    Another thought might be that you want to actually have two database servers, one for write and one for read, and have them swap back and forth.

    Additionally you probably want to think about how the data will be accessed. Caching is probably the first thing you want to look at, and then probably partition the data depending on how it will be accessed. Maybe keep the most commonly accessed data in a memory cache, then move on to SSD storage, and finally onto regular ol' magnetic storage for infrequently accessed stuff.

    I've also heard that breaking up large queries into small ones can allow the database engine to make better use of concurrency, but I can't verify that.

    disclaimer: I'm not a big data guy, I'm just throwing out some suggestions that may or may not work for you. Ultimately you have to profile to find what works for you.

    You're getting into CQRS there, which is very scalable, but might not be what the OP needs.


  • Advertisement
  • Registered Users Posts: 7,410 ✭✭✭jmcc


    jgh_ wrote: »
    Another thought might be that you want to actually have two database servers, one for write and one for read, and have them swap back and forth.
    Most of the activity will be read-only. That's the good thing about this type of website. The backend and data generation side of things can be done over a few computers but the webserver is generally a very simple LAMP style server. When it is a read only table, updating is often a lot simpler - just load the new table with a different name and then drop the existing table and rename the new one.
    Additionally you probably want to think about how the data will be accessed. Caching is probably the first thing you want to look at, and then probably partition the data depending on how it will be accessed. Maybe keep the most commonly accessed data in a memory cache, then move on to SSD storage, and finally onto regular ol' magnetic storage for infrequently accessed stuff.
    Definitely things to consider once the schema is nearly finalised. But at this stage, it is all about getting the best schema that scales well and easily.
    I've also heard that breaking up large queries into small ones can allow the database engine to make better use of concurrency, but I can't verify that.
    It can vary. On a large scale site, the simpler the query, the faster it can (typically) execute. The other aspect is that you have to look at freezing data which is not being updated. You will end up with two sets of data, a frozen one and a current one. It is, potentially, two queries but the benefit is on update speeds because any updates are only applied to a much smaller current table.

    Regards...jmcc


  • Registered Users Posts: 7,410 ✭✭✭jmcc


    stevenmu wrote: »
    Yeah valid point. Although ... :)

    I was just thinking, if you qualified for Microsoft's Bizspark program, you get a certain amount of hosting and storage, SQL Azure etc for free, and can add to it at reasonable rates. It is a bunch of different technologies if you are expecting to use PHP/MySQL, but it could be very cost effective way to scale.
    From MySQL to SQL Server and from PHP to ASP isn't that much of a jump. SQL Server has a steep learning curve though. Who knows - it could be the replacement for Bing? :)

    Regards...jmcc


Advertisement