Think of a CDN as another type of cache, one that aims to decrease the time the user waits for resources, usually images, stylesheets, fonts and scripts, but not limited to these.
In general, resources that stay in a CDN (or any other cache) must be static, that is, they do not change, at least for a period of time.
A website or web system may have several cache levels. Let's look at a few:
Serving static content
Suppose you have a WordPress blog or some system that generates pages dynamically according to the URL and parameters. And suppose that, because the content is so rich, your blog takes 10 seconds to load a page.
The first thing you can do is avoid regenerating the page on every request. The effort this takes depends on the type of content.
If parts of the page, such as advertisements, vary with each request, you can isolate them by loading that content asynchronously or placing it in an iframe. The rest of the page can then be stored in memory or in files on disk.
A simple strategy for this is to use a proxy such as Varnish, which sits between the user and the server. On the first request for a resource, it asks the server to generate the page as if it were a normal access, saves the content it receives, and returns it to the client. On subsequent requests for the same resource, it simply returns what was saved before.
This mechanism can increase the load a server supports by several orders of magnitude. It primarily attacks server processing time, which in this case would include running hundreds of PHP scripts and dozens of database accesses.
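To make the idea concrete without a dedicated proxy, here is a minimal sketch of the same full-page caching done at the application level in PHP. The cache directory, the 60-second lifetime and the render_page() function are assumptions for the example, not part of Varnish:

```php
<?php
// Hypothetical file-based full-page cache. render_page() stands in for
// whatever expensive code (WordPress, database queries...) builds the page.
$cacheFile = '/tmp/page-cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl = 60; // seconds a cached copy stays valid (an assumption for the example)

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    // Cache hit: serve the stored copy and skip page generation entirely.
    readfile($cacheFile);
    exit;
}

// Cache miss: generate the page once, save it, then send it to the client.
$html = render_page();   // the slow part we want to avoid repeating
@mkdir(dirname($cacheFile), 0777, true);
file_put_contents($cacheFile, $html);
echo $html;
```

A real proxy such as Varnish does essentially this, but outside your application and far more efficiently.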
Avoiding frequent access to the database
Continuing with the blog example, the problem is that on the first access to a page the user still waits 10 seconds to see the result.
One common cause of delay when generating a page in a web system is that all the information is always loaded from the database again.
The first recommendation to resolve this is to avoid reading the same information from the database more than once. A side effect of modularizing systems is that different classes or scripts often request the same information several times, and each of those requests goes to the database.
The way to solve this is to create a request-scoped cache, where information that has already been read is reused until the end of the request.
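A minimal sketch of such a request-scoped cache in PHP; the fetch_user() function and the users table are hypothetical stand-ins:

```php
<?php
// Hypothetical request-scoped cache: each user row is read from the
// database at most once per request, no matter how many modules ask for it.
function fetch_user(PDO $db, int $id): array
{
    static $cache = [];   // lives until the end of the current request

    if (!isset($cache[$id])) {
        $stmt = $db->prepare('SELECT * FROM users WHERE id = ?');
        $stmt->execute([$id]);
        $cache[$id] = $stmt->fetch(PDO::FETCH_ASSOC);
    }
    return $cache[$id];   // subsequent calls reuse the first result
}
```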
The second thing you can do is extend the request cache so it is reused across requests, for information that changes little. For example, in a system that handles image galleries, categories tend not to change often, so you can cache the category list and invalidate that cache only when a user or administrator changes it.
Database item caching is usually done in memory and, depending on the scale of the deployment, you can use a distributed cache.
This approach reduces the server processing time needed to generate a response.
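As an illustration, here is a sketch of that category-list cache using the APCu extension (assuming it is installed; load_categories_from_db() and the cache key are hypothetical):

```php
<?php
// Hypothetical cross-request cache for data that rarely changes.
function get_categories(): array
{
    $categories = apcu_fetch('categories', $hit);
    if ($hit) {
        return $categories;   // served from shared memory, no database access
    }
    $categories = load_categories_from_db();   // hypothetical slow query
    apcu_store('categories', $categories);     // kept until invalidated
    return $categories;
}

// Call this whenever a user or administrator edits the categories,
// so the next request rebuilds the cache from the database.
function invalidate_categories(): void
{
    apcu_delete('categories');
}
```

With a distributed cache such as Memcached the structure is the same; only the fetch/store calls change.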
Avoiding unnecessary requests
In the two previous approaches, we tried to reduce the server response time as much as possible, both on the first request and on subsequent ones.
However, latency to the server can still be a big problem if a page loads many resources.
Packaging resources efficiently
If you have many scripts and styles, consider minifying them and grouping them into larger files. Small images can also be combined into a single larger image, a sprite, displayed with specific CSS styles.
Most browsers limit concurrent downloads to no more than half a dozen per host, so if you have, say, 15 scripts, 10 stylesheets and another 30 images, all of this will be queued and downloaded in a waterfall.
However, if you join each resource type into larger files, the total download time decreases.
Note that with HTTP/2 this is less true, because it supports multiplexing several resources over a single connection and compresses headers, reducing overhead. So check whether your server supports HTTP/2 before creating bundles of resources.
Making good use of browser cache
Another problem that is usually ignored is failing to make good use of the browser cache. Pages and resources that do not change should be configured to be cached for a long time.
Imagine that we use version 1.7.1 of jQuery on the blog. We can configure its cache to last "forever", that is, never expire. If we want to change the jQuery version, we just serve the library under another URL on the page, for example:
/js/vendor/jquery-1.7.2.js
Therefore, if resources have the version included in the URL and the cache is configured correctly, the user will only need to download the files again when they effectively change.
The same can be done with images if you build the image URLs following some pattern based, perhaps, on their upload or modification date.
Anyway, it is important to think about each resource your page downloads in the browser and define a cache policy for it. The more you can make static and cache "forever", the better.
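Here is a sketch of both halves of the technique in PHP: a helper that embeds the file's modification time in the asset URL, and the far-future Cache-Control header that the script serving such an asset could send (the helper name and the one-year lifetime are conventional choices, not requirements):

```php
<?php
// Hypothetical helper: append the file's modification time to the URL, so
// the URL changes whenever the file changes and never otherwise.
function asset_url(string $path): string
{
    return $path . '?v=' . filemtime($_SERVER['DOCUMENT_ROOT'] . $path);
}
// e.g. asset_url('/js/vendor/jquery-1.7.2.js')
//   => /js/vendor/jquery-1.7.2.js?v=1334601200  (the mtime as a timestamp)

// When serving a versioned asset, tell the browser to cache it "forever";
// one year is the conventional ceiling for max-age.
header('Cache-Control: public, max-age=31536000, immutable');
```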
Using third-party caches
If your page uses common libraries like jQuery, you might consider serving them from a free external CDN.
Google, for example, hosts several common libraries, and by using them you take advantage of files that are already cached because the user visited some other site that uses the same libraries and the same versions.
Also, since the browser sees that these files are on another domain, it can increase the number of simultaneous download connections.
On top of that, you save bandwidth and processing on your server.
I know you mentioned that you didn't want to use free CDNs, but in this case it's not the same as using just any CDN. You will hardly get a higher uptime than a CDN like Google's, for example.
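A common pattern, sketched below, is to load the library from the public CDN and fall back to the local copy if the CDN is unreachable; the Google Hosted Libraries URL follows their documented scheme, and the local path reuses the example above:

```html
<!-- Load jQuery from Google's CDN; the second script injects the local
     copy only if window.jQuery is still undefined (i.e. the CDN failed). -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
<script>
  window.jQuery || document.write('<script src="/js/vendor/jquery-1.7.2.js"><\/script>');
</script>
```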
Getting closer to the user
Assuming we apply all the steps above, the blog should be as fast as a bullet. (I know, because I did all of this on mine.)
However, users making their first request from the other side of the world may still have some difficulty loading the heavier pages; in particular, some large images may take a while to download.
Also, sometimes the user presses F5 and the browser decides to send revalidation requests to check whether the resources have changed, and even that round trip slows the page load.
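For completeness, a sketch of how a PHP script serving a resource could answer those revalidation requests cheaply with 304 Not Modified; the image path is hypothetical:

```php
<?php
// Hypothetical example: answer browser revalidation with 304 when possible.
$file  = '/var/www/images/photo.jpg';   // hypothetical image path
$mtime = filemtime($file);
$etag  = '"' . md5($file . $mtime) . '"';

header('ETag: ' . $etag);
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');

// If the browser already has this exact version, skip the body entirely.
if (($_SERVER['HTTP_IF_NONE_MATCH'] ?? '') === $etag) {
    http_response_code(304);
    exit;
}

header('Content-Type: image/jpeg');
readfile($file);
```

Even so, the round trip itself remains; only putting a server closer to the user shortens it.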
Well, this is where you need to start thinking about geolocation, that is, putting a server closer to your client.
Distributing your system
An approach to decreasing latency in general, not just for static resources, is simply to put a copy of the system close to the client.
The problem with this approach is that it is not always easy to split the database, especially if the content is shared among all users, as in a social network.
In the case of a blog, you could simply create nodes in different locations and synchronize the content of slave nodes from a master node, for example.
However, this approach usually requires changes in the overall system architecture.
Using CDNs
Finally we come to the point.
If you conclude that much of the page load time is spent downloading static resources (or resources that could be static, or that can be versioned by varying the URL), then a good strategy is to distribute those resources close to the user, using servers in different locations, while the main system remains centralized.
Let's imagine that the blog has most of its accesses in Brazil and Japan, with the system hosted in Brazil.
If we think of a CDN as a simple content distribution network, we can then set up a cache in Japan to store the images, so that users there experience the same download speed as users here in Brazil.
Let's split the structure of our CDN into two parts: the network infrastructure and a simple HTTP server.
The network infrastructure could use an Anycast implementation. Basically, this technique routes a request to the nearest node on the network. That way, we could serve a single URL for each image, and the server that receives the request would depend on the client's location.
For the simple HTTP server, we only need an Apache instance and a copy of the images to be served. No mystery.
The problem is how to get the images onto the two HTTP servers, in Brazil and Japan.
The lazy way to do this is to put a script on the Apache server that checks whether the image exists locally on that node. If it exists, just serve it. If it does not, fetch it from the main system and save it locally.
The problem with this approach is that we partially lose the performance gains: the first request will take a long time. The advantage is that an image one user accesses becomes cached in that region for all other users. This can be worthwhile in the case of a blog, but if each user tends to access different images, the gain will be minimal.
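A sketch of that lazy node script in PHP; the origin URL and the local cache directory are assumptions for the example (and file_get_contents() over HTTP requires allow_url_fopen to be enabled):

```php
<?php
// Hypothetical pull-through cache running on a CDN node: serve the local
// copy if we have it, otherwise fetch it once from the origin and keep it.
$image  = basename($_GET['img'] ?? '');                 // e.g. photo.jpg
$local  = '/var/cdn-cache/' . $image;
$origin = 'https://blog.example.com/images/' . $image;  // hypothetical origin

if (!is_file($local)) {
    // First request in this region: pull from the origin (the slow path).
    $data = file_get_contents($origin);
    if ($data === false) {
        http_response_code(502);   // origin unreachable
        exit;
    }
    file_put_contents($local, $data);
}

header('Content-Type: image/jpeg');
readfile($local);   // every later request in this region is a local hit
```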
The alternative is to be aggressive: each time a user uploads an image, you push the content out across your distribution network. The result is the best experience for users in each region when they view the images.
On the other hand, you create a difficulty for those who upload the images. The request now has to wait for synchronization between servers, and it is the content creators who suffer.
You might think of doing this asynchronously, but that also means someone accessing a page at the moment of an upload might see a broken link, because the image has not yet reached the nearest CDN node.
And the problem only grows as you scale your content network to more locations.
A third alternative is to combine the first two: push aggressively, but also synchronize an image individually when the node that receives a request does not find it locally.
A homemade CDN?
Given all of this, I really want to discourage you from building your own CDN solution.
If you have little traffic, a CDN is probably not worth it. If you have a lot of traffic, paying for a good CDN will be one of your smallest concerns.
Think of the time wasted setting up this entire infrastructure, and the time spent handling synchronization issues during uploads.
Considerations
I wrote all this because, most of the time, you can reach acceptable performance simply by tuning your server and/or system.
On the other hand, the CDN is often seen as a kind of "silver bullet", a magical, quick and easy way to gain performance with little effort, and in most cases this is not true.
Of course, the decision to build or hire a CDN, or even not to use one at all, depends on each type of system and the use that is made of it. I hope this information helps readers make better decisions for each situation.
I should be able to flag this question as -> "this question is very good and deserves a reward". +1.
– gato
I agree with @Denercarvalho. There is no answer yet and I have already learned what a CDN is just from the text of the question. I hope you get answers as good as the question; I'll be waiting too.
– user28595
I'm pretty sure the first part is handled by the "nearest" server and not in PHP, but it's something I really don't understand :/ - Great question +1
– Guilherme Nascimento
@Renan, I made an update to the question demonstrating the use of PHP with GeoIP for localization, which is why I added the language tag. But I think the answers are not very tied to the language.
– Papa Charlie
The "hole is lower". Check BGP4 (Border Gateway Protocol). A "real" CDN works at the routing protocol level, even before it reaches the server. According to where the connection comes from, it is set who will meet the request.
– Bacco
DNS plays a crucial role in the implementation of a CDN (https://www.nczonline.net/blog/2011/11/29/how-content-delivery-networks-cdns-work/). This also looks interesting: http://www.fromdev.com/2011/06/create-cdn-content-delivery-network.html
– cantoni
@Cantoni remembering that the route can also be resolved at the IP level.
– Bacco
@Papacharlie I don't understand how it works either, but Bacco's comment seems to be the answer to the first part. I just don't know how it works, so I'll follow along; I've even favorited the question :D
– Guilherme Nascimento
You have an interesting toy here: https://stat.ripe.net/widget/bgplay#w.resource=104.24.99.83 - I picked a Cloudflare IP so you can see how many different paths there are. Compare it with simple hosting. Below is the timeline, where you can track all the route changes at certain times. That, in a simplified way, is a graph of how systems know the paths between themselves for data exchange.
– Bacco
Important concepts to start with: https://en.wikipedia.org/wiki/Autonomous_system_(Internet) https://en.wikipedia.org/wiki/Border_Gateway_Protocol
– Bacco
I hope that with the bounty on the post someone remembers to talk about Anycast, which, although more complicated to implement, works independently of DNS, as large networks like Cloudflare do. For those who don't know what it is: it's the same IP everywhere, and at the routing edge the "nearest" network is chosen. It works even if, for example, someone from Brazil or Portugal is using Google DNS or OpenDNS.
– Bacco
@PapaCharlie https://anuragbhatia.com/networking/different-cdn-technologies-dns-vs-anycast-routing/
– Bacco
@Bacco Thanks for one more reference, I will dig through all the links - there is a ton of material, it should clear things up nicely.
– Papa Charlie