Quote by Gufuncouple
This problem is entirely due to slow IP communication between back-end computers, possibly including the Web server. Only the people who set up the network at the site which hosts the Web server can fix this problem.
You are clearly an expert in such matters. I have very little knowledge of any of what you have said, but I think I get the gist of it.
Does this mean that if Swinging Heaven are not the same people who set up the network, we will have to wait for the people who set up the network to address the problem?
If so, it could potentially be a little while - unless Swinging Heaven have arranged some form of out-of-hours support to address such issues.
It depends on the type of datacentre the servers are hosted in. The SH support team have to contact the DC operations staff to identify where the problem is and whether a fix can be applied to restore service. Most DCs are 'secure' environments and require a PTW (Permit to Work) before external engineers can visit and carry out repairs.
8 times out of 10, a 504 error is caused by a faulty patch cable between the server and the network hub. The other 2 times it is usually the NIC (Network Interface Card) developing a fault.
When we architect server farms, we normally specify servers with twin NICs patched to separate hubs, to avoid this type of problem and provide some degree of fail-safe.
DRAC and iLO allow remote access when a server is down, and IP-KVM is another option if the servers don't support those. There are also ATS units and remote PDUs, so for the most part they have the ability to control things remotely. The kit exists.
They are using Varnish/Nginx servers, which support load balancing and will kick a faulty server out of the pool if it stops responding, so even with single NICs this should be under control. There may be LVS (Linux Virtual Server) sitting in front of all the servers too, which will also kick out non-responding servers.
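For anyone wondering what "kicking a server out of the pool" looks like, here is a rough sketch in Python. It is not how Varnish or Nginx actually implement it, and the class name, the server names and the three-failure limit are all made up for illustration. The idea is just that a backend which fails several checks in a row stops receiving traffic until it passes a check again.

```python
class BackendPool:
    """Toy load-balancer pool that skips backends which keep failing."""

    def __init__(self, backends, max_failures=3):
        self.backends = list(backends)
        self.max_failures = max_failures  # assumed limit, not a real default
        self.failures = {b: 0 for b in self.backends}
        self._next = 0

    def record(self, backend, ok):
        # A success resets the counter; a failure increments it.
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def healthy(self):
        # Backends that have not yet reached the failure limit.
        return [b for b in self.backends if self.failures[b] < self.max_failures]

    def pick(self):
        # Rotate over the healthy backends only.
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy backends left in the pool")
        backend = candidates[self._next % len(candidates)]
        self._next += 1
        return backend


pool = BackendPool(["web1", "web2", "web3"])
for _ in range(3):
    pool.record("web2", ok=False)  # web2 stops responding
picks = [pool.pick() for _ in range(4)]  # web2 is never chosen here
```

Once web2 answers a check again, a single record("web2", ok=True) puts it back in the rotation.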
Dual NICs in an Active/Passive setup will not always fail over correctly in certain situations, which can be quite annoying. Round-robin is annoying too, as it can lead to intermittent packet loss if a switch or cable has a problem.
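To show why round-robin gives intermittent rather than total loss, here is a small Python simulation. It assumes two links with made-up names (eth0 and eth1) and a fault on one of them that link monitoring has not detected, so packets keep being sent down the bad path.

```python
from itertools import cycle


def send_round_robin(packet_count, links, broken):
    """Return the sequence numbers of packets sent down a broken link."""
    lost = []
    # Packets are handed to each link in turn, regardless of link health.
    for seq, link in zip(range(packet_count), cycle(links)):
        if link in broken:
            lost.append(seq)
    return lost


# Hypothetical setup: two links, the cable on eth1 is faulty.
lost = send_round_robin(10, ["eth0", "eth1"], broken={"eth1"})
```

With two links and one bad cable, every other packet goes missing, which is exactly the sort of patchy behaviour that is hard to diagnose from the outside.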
They should also be running some form of Nagios or other service monitoring system to make sure that everything is running correctly. If it is not, the system will start texting the admins to sort it out, or take certain actions to kick the servers back into life.
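As a very rough idea of what such monitoring does, here is a Python sketch. It is not Nagios configuration, and the Monitor class, the threshold of three and the alert callback are all assumptions of mine. It checks a service and raises one alert once the check has failed a set number of times in a row.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class Monitor:
    """Counts consecutive failed checks and alerts once at the threshold."""

    def __init__(self, check, alert, threshold=3):
        self.check = check          # callable returning True when healthy
        self.alert = alert          # callable taking a message, e.g. an SMS sender
        self.threshold = threshold  # assumed value, tune to taste
        self.failures = 0

    def poll(self):
        if self.check():
            self.failures = 0
            return True
        self.failures += 1
        # Alert only when the threshold is first reached, to avoid spamming.
        if self.failures == self.threshold:
            self.alert(f"service failed {self.failures} checks in a row")
        return False
```

In practice you would call poll() on a timer, with something like Monitor(lambda: tcp_check("web1.example", 80), send_text), where both the host name and send_text are placeholders. The alert callback could just as easily trigger a restart script as send a text.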
