
So what on earth did I do?


  • Closed Accounts Posts: 6,925 ✭✭✭RainyDay


    Dav wrote: »
    A test environment might have helped, or it might not. I do dozens of changes to the site every week, many of which have the facility to kill the site and all on the live environment (but I'll run it through a test machine internally if it's a procedure I've never done before so that I will know how it works and the tech team will have an indicator as to what the impact of the task is on the servers etc). Normally I'm careful about how I do it and tech team will be keeping an eye so they can monitor the servers for any unwanted fruitiness.
    This sounds scary. Obviously, I know nothing about your business, and less about vBulletin. But the idea that a moment's carelessness by any employee can bring an online business down for a day is scary to me. If I were an investor, or an advertiser, I think I'd be expecting better than this. I'm not an investor or an advertiser, so feel free to ignore me.

    Perhaps in your requirements spec for the replacement to vBulletin, you might consider some environmental management functions, to ensure that a dev/test/live model or similar can be utilised for changes like this.
    Dav wrote: »
    The simple fact of the matter is I wasn't careful enough on this occasion. The process has quite obviously been reviewed internally (I didn't explicitly state that because I thought that the fact that we'd carry out a post mortem of the entire incident would have been self-evident). Rainy Day, you have indeed made plenty of solid and helpful suggestions, but with respect, I would say that you do us a disservice to suggest that there was no intention to take steps to prevent such things again just because we didn't explicitly post about them here on this thread. But there's no harm done in your suggestions and certainly no offence taken on my part.
    Just to be clear, I didn't say that there was no intention to take steps to prevent re-occurrence. I said that there was no sign of any intention to take steps to prevent re-occurrence. I fully accept that it was your intention to carry out a post-mortem.

    Good to hear that no offence was taken, because obviously, I wasn't trying to cause offence - it was a genuine constructive suggestion.


  • Registered Users Posts: 35,524 ✭✭✭✭Gordon


    You should read up on vBulletin and its ingrained inability to handle large forums well. Various members of the tech team have struggled for years to move away from vBulletin; it's _that_ ingrained in everything. Just a small change has massive ramifications.


  • Business & Finance Moderators, Entertainment Moderators Posts: 32,387 Mod ✭✭✭✭DeVore


    Also, consider that we run a, what, 17 server setup for production? I dunno.. I honestly have lost track of the number of servers at this point.
    To replicate that in a test environment would be ideal but also impractical. We are not a mega corporation with multi-millions, we really run a financially tight ship and you probably have waaaay too inflated an idea of what HQ looks like simply because the guys do such an awesome job of keeping the ship afloat.

    Not only that but most of our problems arise not so much from the size of the database as the number of concurrent users. Restarting Boards isn't a question of turning the machine off and on again :)
    Imagine if you turn on the fire-hose of users and start bringing up webservers. We'd literally DDOS ourselves. So we use load-balancers but they bring their own complexities... then there are things like caches which are critical to Boards' speed but which take time to build up due to their nature.

    Now a test environment has none of that. So it's not really a test environment then, is it? Cos you aren't replicating the production environment even remotely.


    The plain fact is that Boards has always been done in front of a live audience... and we trust to the quality of our people to get things right. It's kinda like space flight, when it goes wrong it reminds you of just how amazing it is that it generally goes right :)



    Still.. you know... all I'm saying is... like... it wasn't ME.... *cough*.


  • Closed Accounts Posts: 10,076 ✭✭✭✭Czarcasm


    RainyDay wrote: »
    This sounds scary. Obviously, I know nothing about your business, and less about vBulletin. But the idea that a moment's carelessness by any employee can bring an online business down for a day is scary to me. If I were an investor, or an advertiser, I think I'd be expecting better than this. I'm not an investor or an advertiser, so feel free to ignore me.

    Perhaps in your requirements spec for the replacement to vBulletin, you might consider some environmental management functions, to ensure that a dev/test/live model or similar can be utilised for changes like this.


    Even with the dev/test/live model in place, things can still go wrong in between testing and when a system goes live. I'm sure you've heard of Microsoft's "Patch Tuesday", not to mention numerous examples of unintended consequences in software development, most recently my own experience with AIB -

    Czarcasm wrote: »
    Would it be too much for AIB, if they're charging me €28 in fees per quarter, to send out a bloody text message to all customers to let them know that there is an issue with their systems at the moment and they're working on fixing it as soon as possible! They have most customers' mobile numbers on file, and not all of us are on Shìtter (Twitter).

    I didn't know what was going on when my wife called me to say that the ATM wouldn't give out money. I shouldn't have had to call AIB 24hr to find out what the story was, they should be making contact directly with their customers as soon as possible rather than have their customers find out when they go to withdraw funds that they can't.


    Boards.ie provides a free service to me and were able to communicate effectively throughout the outage, and courteously send a personalised PM after the outage to explain the situation, and Dav has explained what happened.

    I can't get that kind of professional courtesy from a service I'm paying for, so when I get it from a service I'm not being asked to pay for, I think the actions of all those involved in managing the site and keeping it running as smoothly as it does are a testament to the dedication of their employees and an example that other companies and service providers would do well to follow.


  • Registered Users Posts: 30,123 ✭✭✭✭Star Lord


    DeVore wrote: »
    The plain fact is that Boards has always been done in front of a live audience... and we trust to the quality of our people to get things right. It's kinda like space flight, when it goes wrong it reminds you of just how amazing it is that it generally goes right :)

    This is one of the key elements to my mind. You can have all the dev/test/live systems you want, with all kinds of virtual safety nets and so on, but ultimately there has to come a point where the human entering the data has to be trusted to do so correctly.

    Human error happens; there is absolutely NO way around that, no matter how many simulated attempts to replicate or test any given scenario. Plus, in my experience, these things can tend to induce errors as people get overly confident in what they are doing. It's all too easy to fill in a form, click check boxes and drop-down boxes, sure in your own mind that you got everything as it should be, only to find that you missed a check box by a pixel or two, or clicked the value below the one you should have selected, or entered a typo into a text field.

    You simply get the best people that you can for the job, and you trust them.

    On this one occasion out of hundreds (thousands?) something went amiss. Dav and the tech guys (there's a band name there!) handled the situation fantastically well in my opinion.


  • Registered Users Posts: 4,767 ✭✭✭cython


    RainyDay wrote: »
    This sounds scary. Obviously, I know nothing about your business, and less about vBulletin. But the idea that a moment's carelessness by any employee can bring an online business down for a day is scary to me. If I were an investor, or an advertiser, I think I'd be expecting better than this. I'm not an investor or an advertiser, so feel free to ignore me.

    Perhaps in your requirements spec for the replacement to vBulletin, you might consider some environmental management functions, to ensure that a dev/test/live model or similar can be utilised for changes like this.

    I can't help but think that you are continuing to miss the nature of the change here, and the fact that it was done via a UI, so this was not an environmental change, at least not in the conventional sense (I would class an environmental change as being something more like upgrading MySQL/Apache/PHP, or changing some piece of the tech stack, etc.) but rather it was an (unfortunately) incorrectly applied data update, entered through a stupidly designed screen.

    In terms of the section in bold, I have worked with something of a legacy version of an enterprise software system in which a user searching with overly broad criteria could actually bring down the Java application server running the application, because the search was badly written and returned an overly large dataset! Ultimately this only required a forced restart to "resolve", but it still cropped up periodically in this legacy version due to users forgetting, new users being unaware, or users simply making a mistake in what they entered. While it was already fixed in later versions, the fix was quite involved, so the alternative fix that the client eventually got until they could upgrade was a JavaScript popup warning/confirmation if their criteria weren't sufficiently specific.

    So while it might be scary to you that human error can have this level of impact/effect, it's more widespread than you might think. Just as it would have been both unreasonable and not foolproof in the above example for users to test all their searches before running them, so too would there be limitations to testing all admin actions on a test system, especially as the test system would have to have a DB the size of boards.ie live, without which the issue might not even have manifested on a test run.


  • Registered Users Posts: 29,509 ✭✭✭✭randylonghorn


    Morag wrote: »
    regisaysoops.jpg
    The only amazing thing is that it took 92 posts for "Regi says ooops!!" to pop up. :(


  • Moderators, Sports Moderators, Regional Abroad Moderators Posts: 2,646 Mod ✭✭✭✭TrueDub


    RainyDay wrote: »
    But the idea that a moment's carelessness by any employee can bring an online business down for a day is scary to me

    Ulster Bank/RBS
    Microsoft.
    Sony.

    Huge companies. One moment's carelessness, major outages.

    It's scary to everyone, but it does happen.


  • Registered Users Posts: 35,524 ✭✭✭✭Gordon


    TrueDub wrote: »
    Ulster Bank/RBS
    Microsoft.
    Sony.
    The great fire of London.


  • Closed Accounts Posts: 12,898 ✭✭✭✭Ken.


    TrueDub wrote: »
    Ulster Bank/RBS
    Microsoft.
    Sony.
    Gordon wrote: »
    The great fire of London.
    Don't worry dear I'll pull out.


  • Closed Accounts Posts: 6,925 ✭✭✭RainyDay


    TrueDub wrote: »
    Ulster Bank/RBS
    Microsoft.
    Sony.

    Huge companies. One moment's carelessness, major outages.

    It's scary to everyone, but it does happen.
    What particular major outages of Microsoft or Sony were caused by a moment's carelessness? And if they were, you can be damn sure that fixes were put in place to prevent similar problems in the future.

    I've worked in web services in large corporates similar to these, and I've never come across a situation where a single individual would regularly be carrying out manual changes in a live environment that are capable of bringing down that live environment.
    cython wrote: »
    I can't help but think that you are continuing to miss the nature of the change here, and the fact that it was done via a UI, so this was not an environmental change, at least not in the conventional sense (I would class an environmental change as being something more like upgrading MySQL/Apache/PHP, or changing some piece of the tech stack, etc.) but rather it was an (unfortunately) incorrectly applied data update, entered through a stupidly designed screen.
    Nope, I’m not missing anything. I understand how the change was applied. At a human level, I understand how human error occurred, and I’d be guilty of similar errors myself from time to time.
    But from a corporate point of view, any outage is unacceptable. And from a customer point of view, how the change is applied is an irrelevant technical detail. The customer (say a Boards advertiser) just wants the service that they’ve paid for to be available. They rely on the techies to make sure this happens.
    cython wrote: »
    So while it might be scary to you that human error can have this level of impact/effect, it's more widespread than you might think. Just as it would have been both unreasonable and not foolproof in the above example for users to test all their searches before running them, so too would there be limitations to testing all admin actions on a test system, especially as the test system would have to have a DB the size of boards.ie live, without which the issue might not even have manifested on a test run.
    No indeed, asking users to test all their searches would not be a good solution. There are other good solutions, like the one that you eventually found.
    Just to be clear, I don’t know enough about boards.ie or vBulletin to say that a dev/test/live structure is the right answer. I’m not saying it is the only answer. The more important general principle here is that regardless of how good your people are, you really don’t want to have techies working in a live environment in a way that risks the stability of that environment.

    DeVore wrote: »
    Also, consider that we run a, what, 17 server setup for production? I dunno.. I honestly have lost track of the number of servers at this point.
    To replicate that in a test environment would be ideal but also impractical. We are not a mega corporation with multi-millions, we really run a financially tight ship and you probably have waaaay too inflated an idea of what HQ looks like simply because the guys do such an awesome job of keeping the ship afloat.

    Not only that but most of our problems arise not so much from the size of the database as the number of concurrent users. Restarting Boards isn't a question of turning the machine off and on again :)
    Imagine if you turn on the fire-hose of users and start bringing up webservers. We'd literally DDOS ourselves. So we use load-balancers but they bring their own complexities... then there are things like caches which are critical to Boards' speed but which take time to build up due to their nature.

    Now a test environment has none of that. So it's not really a test environment then, is it? Cos you aren't replicating the production environment even remotely.
    I’m a few years out of date on best practices in this area, but when I was involved, test environments didn’t try to replicate the scale of live environments. It was more about testing code and code fixes, and config fixes. This doesn’t catch every possible problem situation, but from the sound of Dav’s description of what happened, it would have caught this problem. However, the absence of any structured way to move fixes from dev to test and test to live would militate against the usefulness of a dev/test/live model.
    DeVore wrote: »
    The plain fact is that Boards has always been done in front of a live audience... and we trust to the quality of our people to get things right. It's kinda like space flight, when it goes wrong it reminds you of just how amazing it is that it generally goes right :)
    When you’re looking to improve processes and quality, the answer of ‘We’ve always done it this way’ is never a good answer.
    This is one of the key elements to my mind. You can have all the dev/test/live systems you want, with all kinds of virtual safety nets and so on, but ultimately there has to come a point where the human entering the data has to be trusted to do so correctly.
    Human error happens; there is absolutely NO way around that, no matter how many simulated attempts to replicate or test any given scenario. Plus, in my experience, these things can tend to induce errors as people get overly confident in what they are doing. It's all too easy to fill in a form, click check boxes and drop-down boxes, sure in your own mind that you got everything as it should be, only to find that you missed a check box by a pixel or two, or clicked the value below the one you should have selected, or entered a typo into a text field.
    Not true. There are many ways of using systems or people to MAKE SURE that data is entered correctly. A simple option is to get two people to check the data before a change is made. If that doesn’t work, get three people to check it. If that doesn’t work, build a tool that gets two people to enter the data and get the system to check that it matches. If you have safety critical systems like airport luggage scanners, you impose an image of a gun or a knife from time to time, and make sure the operator recognises it.
    Depending on the importance of the system, there are a whole pile of options to use processes or technology to avoid human error.
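The "two people enter the data and the system checks that it matches" idea above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the function name `double_entry_mismatches` and the field names `source_forum` and `target_forum` are hypothetical, and nothing here reflects how vBulletin or the Boards.ie admin tools actually work.

```python
_MISSING = object()  # sentinel so a missing field never equals a real value


def double_entry_mismatches(first: dict, second: dict) -> list:
    """Return the sorted field names on which two independent entries differ."""
    fields = sorted(set(first) | set(second))
    return [
        name
        for name in fields
        if first.get(name, _MISSING) != second.get(name, _MISSING)
    ]


# Hypothetical example: two admins independently key in the same forum merge.
entry_a = {"source_forum": "airsoft-old", "target_forum": "airsoft"}
entry_b = {"source_forum": "all", "target_forum": "airsoft"}

mismatches = double_entry_mismatches(entry_a, entry_b)
if mismatches:
    print("Refusing to apply change; entries differ on:", ", ".join(mismatches))
else:
    print("Entries match; change may proceed.")
```

A check like this only catches cases where the two entries disagree; it cannot tell whether both people made the same mistake.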
    You simply get the best people that you can for the job, and you trust them.
    Not good enough. You get the best people that you can for the job, and put the best processes in place to ensure quality.
    On this one occasion out of hundreds (thousands?) something went amiss. Dav and the tech guys (there's a band name there!) handled the situation fantastically well in my opinion.
    Fully agree – it was a minor inconvenience, and the solution was clearly communicated.


  • Closed Accounts Posts: 5,797 ✭✭✭KyussBishop


    The only fix for this problem (while waiting to move away from vBulletin) is to fix the obviously broken UI: perhaps add a vBulletin plugin that, at a more advanced level, tries to detect admin/mod actions which will trigger an excessive number of queries and shows a warning prompt when one is found, and in general just add input validity checks and warning prompts all over the place.
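A warning prompt of the kind described above might look something like this minimal Python sketch. It is not a vBulletin plugin and does not use any vBulletin API; the names `guard_bulk_action` and `BulkActionRefused`, the threshold value, and the post counts are all hypothetical, and it assumes the number of affected posts can be estimated before the action runs.

```python
from typing import Callable


class BulkActionRefused(Exception):
    """Raised when a large admin action is not explicitly confirmed."""


def guard_bulk_action(
    affected_posts: int,
    threshold: int,
    confirm: Callable[[str], bool],
) -> None:
    """Ask for confirmation before an action that touches too many posts."""
    if affected_posts < 0:
        raise ValueError("affected_posts must be non-negative")
    if affected_posts <= threshold:
        return  # small change: no prompt needed
    message = (
        f"This action will affect {affected_posts:,} posts "
        f"(warning threshold: {threshold:,}). Continue?"
    )
    if not confirm(message):
        raise BulkActionRefused(message)


# Hypothetical usage: a merge that would copy far more posts than expected.
try:
    guard_bulk_action(40_000_000, threshold=10_000, confirm=lambda msg: False)
except BulkActionRefused as refusal:
    print("Blocked:", refusal)
```

The `confirm` callback is passed in so the same guard could sit behind a web form, a command-line prompt, or a test; here it is a lambda that always declines.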

    Moving away from vBulletin to custom software sounds like a good (but resource-heavy) long-term goal, seeing as vBulletin seems to have a lot of problems, though custom software (like any software) will have its own drawbacks: big changes to forum layouts/frameworks that affect users can negatively affect communities, and new software tends to have a fair number of security vulnerabilities, which I know well as it's part of my work.


  • Registered Users Posts: 26,578 ✭✭✭✭Creamy Goodness


    Boards' vBulletin install has long passed the ability to install plugins; Conor, who worked on the tech team, mentioned this several times 3-4 years ago. They were removing vBulletin then and they're still removing it. It's ****ing huge and there's only so many developers.

    Boards' only real solution to removing vBulletin is to slowly smother it with a nice fluffy pillow and watch for the twitches to stop (but it's a long and laborious process).


  • Registered Users Posts: 1,771 ✭✭✭Dude111


    Dav wrote:
    I was not as careful with the vBulletin tools that manage this as I should have been and I accidentally copied every post from the site into one of the Airsoft forums instead of just the posts from one of the forums that was getting merged.
    I'm sorry you had issues......

    I'm glad the site is ok!!


  • Registered Users Posts: 30,123 ✭✭✭✭Star Lord


    RainyDay wrote: »
    Not true. There are many ways of using systems or people to MAKE SURE that data is entered correctly. A simple option is to get two people to check the data before a change is made. If that doesn’t work, get three people to check it. If that doesn’t work, build a tool that gets two people to enter the data and get the system to check that it matches. If you have safety critical systems like airport luggage scanners, you impose an image of a gun or a knife from time to time, and make sure the operator recognises it.
    Depending on the importance of the system, there are a whole pile of options to use processes or technology to avoid human error.

    Getting multiple people to validate entry is a great idea, and I'm sure it works, in a large corporate environment. Outside of a large corporate environment, that kind of thing just does not work, as you do not have the time, the money nor the manpower to do such things. What you are suggesting would effectively shackle the mods, admins and devs so that every change made would have to be double or triple checked. And that's just never going to work.

    Having some script that checks the data entered is all well and good too, but they can't check input against intent, so will always remain somewhat open to human error.


  • Closed Accounts Posts: 6,925 ✭✭✭RainyDay


    What you are suggesting would effectively shackle the mods, admins and devs so that every change made would have to be double or triple checked. And that's just never going to work.
    Ok, let's take your word that this won't work. Is Boards.ie going to say to advertisers "We're liable to lose the site at any time for 12-24 hours due to human error. We work really hard to avoid this, and we've great people, but sure you know yourself."

    Because that's the alternative option.

    What changes are so frequent, urgent and important that they can't be double-checked by a second pair of eyes?


  • Moderators, Category Moderators, Home & Garden Moderators, Recreation & Hobbies Moderators Posts: 22,380 CMod ✭✭✭✭Pawwed Rig


    RainyDay wrote: »
    Ok, let's take your word that this won't work. Is Boards.ie going to say to advertisers "We're liable to lose the site at any time for 12-24 hours due to human error. We work really hard to avoid this, and we've great people, but sure you know yourself."

    I would agree with you if Boards was going down regularly for prolonged periods because of issues similar to DavGate. However, this is the first time in a long time that I could not access the site for a few hours, so I don't think a mountain should be made out of a molehill here. I am sure that they can tell advertisers that the site is available 99.99% of the time.


  • Registered Users Posts: 33,712 ✭✭✭✭Penn


    RainyDay wrote: »
    Ok, let's take your word that this won't work. Is Boards.ie going to say to advertisers "We're liable to lose the site at any time for 12-24 hours due to human error. We work really hard to avoid this, and we've great people, but sure you know yourself."

    Because that's the alternative option.

    What changes are so frequent, urgent and important that they can't be double-checked by a second pair of eyes?

    Any site is liable to go down at any time for a few hours due to human error. The advertisers' own sites are liable to go down due to human error. That's the problem: it's an error. It's not supposed to happen, but it does.

    Each site has to have in place the safeguards they feel are necessary to make sure it doesn't happen, but also have to make sure they don't go overboard as it could cost more money in the long run.

    Mistakes happen. In Boards' case, they happen very rarely, and any and all effort is always made to resolve them quickly. But let's not go over the top. It was a Sunday afternoon (Note: NONE of the Talk To forum companies are scheduled to be online on Sundays anyway), a mistake was made, the problem was soon corrected.


  • Registered Users Posts: 10,981 ✭✭✭✭dulpit


    This thread is beginning to remind me of work. Some error happened as a one-off, and now people are suggesting wholesale changes to the way everything is done, even though the past number of years has shown this is unlikely to occur again.


  • Business & Finance Moderators, Entertainment Moderators Posts: 32,387 Mod ✭✭✭✭DeVore


    In that we can do better and iron out these situations where human error can creep in, we will. That's already happening in this case, as Dav said. It's an iterative process. It's not a case of "fix everything or leave it all as is". That's a false dichotomy.


    Life on the internet means you can get hacked, ddosed, DNS-jacked or simply take your site offline with a misclick. Given enough time that goes from "can" to "will". GMail has outages, remember :)


    We have a record of 99.9995% uptime in recent months. Yep. 99.9995%
    I'll wager we're the best in the business in Ireland at it.

    There are *banks* that wish they had that uptime record. Lots of them in Ireland :)


  • Registered Users Posts: 11,647 ✭✭✭✭El Weirdo


    Pawwed Rig wrote: »
    ... DavGate ...
    5vQ7PV.jpg


  • Closed Accounts Posts: 31,967 ✭✭✭✭Sarky


    I like the "live audience" analogy. Mostly because I can liken admins to various comedic actors.


  • Registered Users Posts: 8,427 ✭✭✭Morag


    Sarky wrote: »
    I like the "live audience" analogy. Mostly because I can liken admins to various comedic actors.

    Does that make the mods "the bit with the dog" ?


  • Closed Accounts Posts: 12,807 ✭✭✭✭Orion


    qvc8g.png
    :eek: That's the default? This was only a matter of time so.

    We need a new Gathering card for this one :D


  • Moderators, Category Moderators, Entertainment Moderators, Sports Moderators Posts: 22,584 CMod ✭✭✭✭Steve


    Airsoft, 40 mil posts in one day.... up yores AH. :cool:


  • Subscribers Posts: 4,076 ✭✭✭IRLConor


    RainyDay wrote: »
    I don’t know enough about boards.ie or vBulletin to say that a dev/test/live structure is the right answer.

    I do (or at least did). When I initially joined Boards.ie there was no development environment since Ross had been swamped for months and hadn't the capacity to do anything more than firefighting. For the first few months the procedure for us to make a change was to SSH into a machine, fire up vim, edit the code, squint at it to make sure it was right, <esc>:w and then alt-tab to our browser to make sure the site was still up. As you might imagine this was not only scary and error prone but also made development very slow.

    Eventually between the two of us we were able to get our heads above water just long enough to make some development environments for ourselves. This only segregated the application under a development vhost, the rest (load balancer, database, monitoring, etc) was all production stuff. This was enough to insulate the users from the vast majority of our screwups and got us a little bit further down the road.
    RainyDay wrote: »
    I’m a few years out of date on best practices in this area, but when I was involved, test environments didn’t try to replicate the scale of live environments.

    When I was working on boards.ie we couldn't work with a full sized copy of the database since we didn't have any machines that were both spare and beefy enough to deal with the boards.ie database. I tried a few times to work with a cut-down copy of a backup but a) it had very different performance characteristics meaning it was useless for proper testing and b) some parts of the site (the front page in particular IIRC) behaved quite strangely when it didn't have new data.

    At the time, our development environments used the live DB because it was the only way of actually getting anything done. I don't know what they do now, hopefully they've found ways of dealing with it. It's a massive PITA since a lot (>75%) of the challenges with Boards are scale issues which you have to test with the full size database. Just copying and restoring a backup to another machine in the same rack used to take the guts of an hour.
    Boards' vBulletin install has long passed the ability to install plugins; Conor, who worked on the tech team, mentioned this several times 3-4 years ago.

    Killing the plugin system got rid of a huge pile of security holes and maintenance nightmares. It truly was a joy to kill it off.
    It's ****ing huge

    Just so people get a better idea of this, when I started deleting parts of it there were roughly 250k lines of code and the guts of 100k lines of templates (stored in the database! where else should it go! :mad:). Hopefully there's much less there now.

    Oh yeah, and vBulletin doesn't come with any tests so if you want testing you gotta write the tests yourself.


  • Registered Users Posts: 40 HipsterHunter


    True....wannabe soldiers with their little plastic pea shooters:rolleyes:,running thru the woods shouting"gotcha"
    has to be the gayest game ever .....;)


    Love to see you play... I'd shoot the crap outta ya! Show ya what a pea shooter can do :eek:

    Don't knock it till you try it.


  • Registered Users Posts: 6,744 ✭✭✭raze_them_all_


    Love to see you play... I'd shoot the crap outta ya! Show ya what a pea shooter can do :eek:

    Don't knock it till you try it.

    You don't get sarcasm, do you?


  • Registered Users Posts: 15,944 ✭✭✭✭Villain


    DeVore wrote: »
    In that we can do better and iron out these situations where human error can creep in, we will. That's already happening in this case, as Dav said. It's an iterative process. It's not a case of "fix everything or leave it all as is". That's a false dichotomy.


    Life on the internet means you can get hacked, ddosed, DNS-jacked or simply take your site offline with a misclick. Given enough time that goes from "can" to "will". GMail has outages, remember :)


    We have a record of 99.9995% uptime in recent months. Yep. 99.9995%
    I'll wager we're the best in the business in Ireland at it.

    There are *banks* that wish they had that uptime record. Lots of them in Ireland :)
    99.9995% is 13 seconds a month; you might need to take a few nines off that for this month :D
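The "13 seconds a month" figure can be checked with a few lines of Python. This is a rough sketch that assumes a 30-day month; the 12-hour outage used in the second calculation is only an illustrative figure taken from the lower end of the "12-24 hours" mentioned earlier in the thread, not a confirmed duration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: float, days: int = 30) -> float:
    """Downtime budget, in seconds, for a given uptime percentage over `days`."""
    return (1 - uptime_percent / 100) * days * SECONDS_PER_DAY


def uptime_percent(downtime_seconds: float, days: int = 30) -> float:
    """Uptime percentage achieved after `downtime_seconds` of outage over `days`."""
    return 100 * (1 - downtime_seconds / (days * SECONDS_PER_DAY))


# 99.9995% uptime over an assumed 30-day month.
print(round(allowed_downtime_seconds(99.9995), 2))  # → 12.96

# Assumed 12-hour outage in the same 30-day month.
print(round(uptime_percent(12 * 60 * 60), 2))  # → 98.33
```

On those assumptions, a single 12-hour outage in a month leaves one nine of uptime for that month rather than five.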


  • Closed Accounts Posts: 10,076 ✭✭✭✭Czarcasm


    You don't get sarcasm, do you?


    I'm well used to that by now... :rolleyes:

    :D

