High availability
Updating a web application to a new version without interrupting users is a big challenge. High availability means more cost and more work.
Think, for example, of the following scenarios:
- What if there are incompatible database changes? Two versions cannot co-exist.
- What if the data structures or objects stored in the user session are incompatible between versions?
- What if the new version introduces a worse problem and you need to roll back?
- What if a critical process is running in the background in the old version?
- What if the app crashes or an OOME (OutOfMemoryError) occurs even when there is no update?
All this and more should be taken into account if you want high availability. So first you need to work out how much availability you really need.
Simple alternative
A simple alternative is a policy that defines a nightly update window (which can be automated) or something similar.
I say this because often the cost of maintaining continuous high availability simply isn’t worth it.
People think it’s simple, but it’s not once you consider all the facets. Every feature of the application needs to be designed to allow uninterrupted operation.
Another alternative policy is that only very urgent fixes that do not affect the database can be deployed during the day. Normal system updates would wait for the maintenance window.
Remember that even giants like Facebook, Google, and Amazon have maintenance windows, and it’s not uncommon to experience some failure due to a migration. So rather than serving clients without any interruption, you can simply aim for a short maintenance window at times that do not harm users (which requires system usage statistics).
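As a rough illustration of how usage statistics could drive that choice, here is a minimal sketch that picks the quietest window from request timestamps. The function name, the two-hour default, and the sample data are my own assumptions; real numbers would come from your access logs or monitoring.

```python
from collections import Counter
from datetime import datetime

def quietest_window(request_times, window_hours=2):
    """Return the starting hour (0-23) of the window with the fewest requests."""
    per_hour = Counter(t.hour for t in request_times)

    def load(start):
        # Total requests in the window, wrapping around midnight.
        return sum(per_hour[(start + i) % 24] for i in range(window_hours))

    # On ties, min() returns the earliest starting hour.
    return min(range(24), key=load)

# Made-up sample: one request in every hour except 03:00 and 04:00.
sample = [datetime(2024, 1, 1, h) for h in range(24) if h not in (3, 4)]
print(quietest_window(sample))  # prints 3
```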
How to solve problems
Updating the database
As the other answer already mentioned, you can use a database versioning and migration tool, such as Liquibase or Flyway. These tools let you specify each database change incrementally via code, SQL or XML (each tool supports different formats). So the system is able to update itself automatically.
However, you need to design the queries and features so that no two consecutive versions are incompatible. This is not simple: it has to be thought through case by case, which adds significant development overhead.
One way to mitigate this is to make each new version work with the previous version of the database and only apply the changes after the previous version has been completely disabled.
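To make the idea of versioned, incremental changes more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the API of Liquibase or Flyway; the `schema_version` table, the `MIGRATIONS` list and the `customer` table are hypothetical. The second migration is purely additive (a nullable column), which is one way to keep two consecutive application versions compatible with the same database.

```python
import sqlite3

# Hypothetical migrations, ordered by version. Each one is additive, so code
# written against the previous schema keeps working after it is applied.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),  # nullable on purpose
]

def current_version(conn):
    """Return the highest applied migration version, or 0 if none."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0

def migrate(conn, target=None):
    """Apply pending migrations in order, up to `target` (default: all)."""
    applied = current_version(conn)
    for version, sql in MIGRATIONS:
        if version <= applied or (target is not None and version > target):
            continue
        with conn:  # commits on success, rolls back the open transaction on error
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    return current_version(conn)
```

Calling `migrate(conn)` a second time does nothing, because every migration at or below the recorded version is skipped.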
Session
Objects placed in the session can cause problems because the class of those objects can change from one version to another, or the new version may expect a given attribute to hold values different from what the previous version stored.
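If some session data has to survive an update, one possible mitigation is to store it as plain data with an explicit schema version instead of serialized objects. The sketch below assumes an invented change (version 1 stored "username", version 2 stores "login") just to show the upgrade step; all names here are hypothetical.

```python
import json

SESSION_SCHEMA = 2  # hypothetical schema version written by the new release

def dump_session(data):
    """Serialize session data together with the schema version."""
    return json.dumps({"v": SESSION_SCHEMA, "data": data})

def load_session(raw):
    """Read a session written by this release or by the previous one."""
    payload = json.loads(raw)
    version, data = payload.get("v", 1), dict(payload["data"])
    if version == 1:
        # Assumed change for illustration: v1 stored "username", v2 stores "login".
        data["login"] = data.pop("username", None)
        version = 2
    if version != SESSION_SCHEMA:
        raise ValueError(f"unsupported session version: {version}")
    return data
```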
Another problem with the session is that it makes it much harder to run the system in a cluster, or to make each cluster node tenantless, that is, independent of the customer.
Finally, in every case I know of where a system needs to scale and be highly available, the session is eliminated whenever possible.
Alternatives for storing user data are distributed caches such as memcached and ehcache. However, it is preferable for the system to be stateless as much as possible, which means that it should not store user data in memory.
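One common way to keep the server stateless is to hand the client a signed token and verify it on every request, so any node can serve the request without shared memory. This is only a sketch under my own assumptions: the `SECRET` value and the token layout are hypothetical, and a real system would also need expiry and key rotation.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"change-me"  # hypothetical shared secret, the same on every node

def issue_token(claims):
    """Encode the claims and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + sig).decode()

def read_token(token):
    """Verify the signature and return the claims, or raise ValueError."""
    body, _, sig = token.encode().rpartition(b".")
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid token signature")
    return json.loads(base64.urlsafe_b64decode(body))
```

Since nothing is kept in server memory, a node can be restarted or replaced during an update without users losing their login.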
Version transition and crash protection
The only reliable way for you to transition between two versions and continue to serve users without interruption in case of a crash is to have more than one server per client.
To achieve this, the best way is to make the system multi-tenant, that is, each instance of the system can serve requests from any client.
That way you can set up a cluster with at least two servers serving the clients.
At upgrade time, usually when there is little traffic, you move all customers to one of the nodes, isolating the other. Then you update the isolated node and run sanity tests against it, for example with WebDriver. Once the new version is confirmed to work, you move all the clients to the updated node and, after some time, update the other node.
Remember that if there are incompatible database changes, the new system version should be able to work with the old database until both nodes are updated.
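The procedure above can be sketched as a small simulation. The `Node` class and the `sanity_check` callback are hypothetical stand-ins for your load balancer and test suite, and the sketch simplifies one detail: a verified node goes straight back into rotation instead of taking all the traffic for a while.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    version: str
    in_rotation: bool = True

def rolling_update(nodes, new_version, sanity_check):
    """Update one node at a time; stop at the first node that fails its check."""
    for node in nodes:
        if not any(n.in_rotation for n in nodes if n is not node):
            raise RuntimeError("no other node left to serve clients")
        node.in_rotation = False      # move clients to the other node(s)
        previous, node.version = node.version, new_version
        if not sanity_check(node):
            node.version = previous   # roll this node back to the old version
            node.in_rotation = True
            return False
        node.in_rotation = True       # verified, put it back in rotation
    return True
```

With a single node the function raises instead of taking the only server offline, which is the reason for having at least two.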
In the end, you have both nodes updated. If one of them starts using too much memory because of some application defect (a memory leak, loading too much data from the database instead of paging, etc.), a monitoring tool can automatically restart that node while customers are served by the other node, which did not have the memory problem at that time.
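The decision a monitoring tool makes here can be sketched as follows. This is only an illustration: `memory_by_node` is a hypothetical mapping of node name to memory in use, which a real tool would collect itself, and the rule of always keeping one node up is my own assumption.

```python
def nodes_to_restart(memory_by_node, limit_bytes):
    """Pick nodes over the memory limit, always leaving at least one node up."""
    over = [name for name, used in memory_by_node.items() if used > limit_bytes]
    if over and len(over) == len(memory_by_node):
        # Every node is over the limit: keep the least loaded one serving.
        keep = min(over, key=memory_by_node.get)
        over.remove(keep)
    return over
```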
Considerations
Everything I’ve written is still superficial considering all the challenges of upgrading a system with high availability.
Other aspects would be:
- Assets versioning: updated scripts, images and styles need to be invalidated for each new version.
- Cache: if caches are necessary, how should they be used correctly in a cluster?
- If the system fails, a proxy may present a user-friendly page instead of a "500 error". In addition, tools need to notify those responsible automatically.
- If the user is finishing a text or complex action on a large screen, how can you guarantee that he will not lose everything he did if any error occurs just at the time he submits the data?
And so on and so on...
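For the assets versioning item, one common technique is to embed a hash of the file content in the file name, so browsers fetch a new URL whenever the content changes. A minimal sketch, with a naming scheme that is just my assumption:

```python
import hashlib
from pathlib import PurePosixPath

def versioned_name(path, content):
    """Embed a short content hash in the file name: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))
```

The same content always produces the same name, so unchanged files stay cached across releases.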
Wouldn’t Jenkins solve this for you?
– Otto