Creating a CDN - content delivery network


Wikipedia

A system of computers networked via the Internet that cooperate transparently to deliver content.

CDNs are usually deployed in multiple locations. Benefits include reduced bandwidth costs, improved page load times, and increased overall content availability.


Baboo

A CDN is a data and information distribution network that lets you deliver content quickly to many users, regardless of where they are. What the system does is distribute the same content across several servers and direct each request according to the proximity of the server to the user.

The service works as a cache of the website's non-dynamic files in datacenters scattered around the world.


(Stack Overflow)

A CDN is simply a network of servers that replicate binary files so they can be served from geographically close locations. A CDN is simply a bunch of "disks" spread all over the world in different datacenters. You send your file to one of these disks and tell the network where your customers come from; it then copies the same file to the disks closest to your clients.


[Image: single-server distribution vs. CDN (Wikipedia)]


That said, let’s get to the questions.

I've been researching CDNs for a long time for a project with heavy demand for static content. I didn't want to resort to any free CDN, nor hire an Akamai. So I decided to dig deeper into the subject to weigh the cost-benefit of building something of my own.

  1. Assuming I choose one DC in Brazil and one in the US for a CDN covering the Americas, is the identification of the nearest DC for a request done at the server/network level, or in PHP?
  2. When an image is uploaded, is it instantly replicated to every DC, or only when there is local demand?
  3. Are Apache and Linux enough, or do I need to install other tools specific to a CDN?

My main domain would have links to www.my-cdn.com, where PHP could use GeoIP to determine the location of the request, but how do I route it to the correct DC? It seems to me that doing it this way makes the process more expensive. Are there dedicated tools for this?
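
To illustrate what I mean, a minimal sketch of the PHP + GeoIP idea, assuming the PECL geoip extension; the per-region hostnames are placeholders:

    <?php
    // Hypothetical PHP-level routing: look up the client's country and
    // redirect to the supposedly nearest DC. Hostnames are placeholders.
    $country = geoip_country_code_by_name($_SERVER['REMOTE_ADDR']);

    $dc = ($country === 'BR')
        ? 'https://br.my-cdn.com'   // DC in Brazil
        : 'https://us.my-cdn.com';  // DC in the US (default)

    header('Location: ' . $dc . $_SERVER['REQUEST_URI'], true, 302);
    exit;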


I'm not expecting a tutorial, just some light on a few points. I've really searched a lot, but everything is very superficial; I haven't found anything that goes deep into building a CDN. Most articles explain how one works, or the differences between a CDN and the cloud.

  • I wish I could flag this question as: "very good and deserves a reward". +1.

  • I agree with @Denercarvalho. There's no answer yet and I've already learned what a CDN is just from the text of the question. I hope you get answers as good as the question; I'll be waiting too.

  • I'm pretty sure the first one is resolved at the "nearest" server and not in PHP, but it's something I honestly don't understand :/ - Great question, +1

  • @Renan, I updated the question to demonstrate the use of PHP with GeoIP for localization, which is why I added the language tag. But I think the answers are mostly language-agnostic.

  • It goes deeper than that. Look into BGP4 (Border Gateway Protocol). A "real" CDN works at the routing-protocol level, before the request even reaches a server. Depending on where the connection comes from, it is decided who will serve the request.

  • DNS plays a crucial role in the implementation of a CDN (https://www.nczonline.net/blog/2011/11/29/how-content-delivery-networks-cdns-work/). This also looks interesting: http://www.fromdev.com/2011/06/create-cdn-content-delivery-network.html

  • @Cantoni bear in mind that the route can also be resolved at the IP level.

  • @Papacharlie I don't understand how it works either, but Bacco's comment seems to be the answer to the first question. I just don't know how it works, so I'll follow along and even favorite the question :D

  • You have an interesting toy here: https://stat.ripe.net/widget/bgplay#w.resource=104.24.99.83 - I picked a Cloudflare IP so you can see how many different paths there are; compare it with a simple host. Below is the timeline, where you can track all the route changes at certain moments. That, in simplified form, is a graph of how systems learn the paths between one another for exchanging data.

  • Important concepts to start with: https://en.wikipedia.org/wiki/Autonomous_system_(Internet) https://en.wikipedia.org/wiki/Border_Gateway_Protocol

  • I hope that with the bounty on the post someone remembers to talk about Anycast, which, although more complicated to implement, works independently of DNS, as large networks like Cloudflare do. For those who don't know what it is: it's the same IP everywhere, and the "nearest" network is chosen at the routing edge. It works even if, for example, someone in Brazil or Portugal is using Google DNS or OpenDNS.

  • @PapaCharlie https://anuragbhatia.com/networking/different-cdn-technologies-dns-vs-anycast-routing/

  • @Bacco Thanks for one more reference; I'll dig through all these links. There's plenty of material here, it should shed some good light.


4 answers

18


Think of a CDN as another type of cache that aims to decrease the time the user waits for a resource, usually images, styles, fonts and scripts, but not limited to these.

In general, resources that stay in a CDN (or any other cache) must be static, that is, they do not change, at least for a period of time.

A website or web system may have several cache levels. Let’s see some:

Serving static content

Suppose you have a WordPress blog or some system that generates pages dynamically according to the URL and parameters. Suppose also that, because the content is large, your blog takes 10 seconds to load a page.

The first thing you can do is avoid having to generate the page again on every request. The effort this takes depends on the type of content.

If parts of the site, such as advertisements, vary with each request, you can isolate them by loading that content asynchronously or by placing it in an iframe. The rest of the page can be stored in memory or in files on disk.

A simple strategy for this is to use a proxy such as Varnish. Basically, it sits between the user and the server. On the first request for a resource, it asks the server to generate the page, as in a normal access. It then saves the content received from the server and returns it to the client. On subsequent requests for the same resource, it simply returns what was saved before.
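
The same idea can be sketched directly in PHP. This is a toy full-page cache, not a substitute for Varnish; the cache directory, TTL and render_page() helper are assumptions:

    <?php
    // Toy full-page cache: serve a saved copy if it is fresh; otherwise
    // generate the page, save it, and send it.
    $cacheDir = __DIR__ . '/cache';          // assumed cache directory
    $file     = $cacheDir . '/' . md5($_SERVER['REQUEST_URI']) . '.html';
    $ttl      = 300;                         // assumed freshness window (seconds)

    if (is_file($file) && time() - filemtime($file) < $ttl) {
        readfile($file);                     // cache hit: no PHP/DB work needed
        exit;
    }

    ob_start();                              // cache miss: render normally
    render_page();                           // hypothetical page-building function
    $html = ob_get_clean();

    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0755, true);
    }
    file_put_contents($file, $html, LOCK_EX);
    echo $html;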

This mechanism can increase the load a server supports by several orders of magnitude. It primarily attacks server processing time, which in this case would include running hundreds of PHP scripts and dozens of database accesses.

Avoiding frequent access to the database

Continuing with the blog example, the problem is that on the first access to a page the user still waits 10 seconds to see the result.

One of the common causes for the delay of generating a page on a web system is that it always needs to load all the information from the database again.

The first recommendation to solve this is to avoid reading the same information from the database more than once. A side effect of modularizing systems is that different classes or scripts often request the same information several times, and each of those requests hits the database.

The way to solve this is to create a request-scoped cache, where information that has already been read is reused until the end of the request.
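
A minimal sketch of such a request-scoped cache, assuming a hypothetical fetch_category_from_db() helper:

    <?php
    // Request-scoped cache: a static array keeps values already read from
    // the database alive until the end of the current request.
    function get_category(int $id): array
    {
        static $cache = [];        // lives for the duration of the request

        if (!isset($cache[$id])) {
            // fetch_category_from_db() is a hypothetical DB helper
            $cache[$id] = fetch_category_from_db($id);
        }
        return $cache[$id];
    }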

The second thing you can do is extend the request cache to be reused between requests in the case of information that changes little. For example, on a system that handles image galleries, categories tend not to change as often, so you can cache the category list and invalidate the cache only when the user or administrator makes changes.

Database item caching is usually done in memory and, depending on the distribution scale, you can use a distributed cache.
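
As an illustration of the distributed case, a minimal sketch using PHP's Memcached extension; the host, key name, TTL and fetch_categories_from_db() helper are assumptions:

    <?php
    // Cross-request cache: category list stored in Memcached and
    // invalidated explicitly whenever an administrator changes it.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);       // assumed cache host

    $categories = $mc->get('category_list');
    if ($categories === false) {              // miss: rebuild from the database
        $categories = fetch_categories_from_db();      // hypothetical helper
        $mc->set('category_list', $categories, 3600);  // cache for 1 hour
    }

    // On change, invalidate so the next request rebuilds the list:
    // $mc->delete('category_list');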

This approach decreases the server processing time to generate an output.

Avoiding unnecessary requests

In the two previous approaches, we tried to decrease the server response time as much as possible, both on the first request and on subsequent ones.

However, latency to the server can be a big problem if there are still many resources loaded on a page.

Packaging resources efficiently

If you have many scripts and styles, consider minifying them and grouping them into larger files. Small images can also be combined into one larger image (a sprite) displayed with specific CSS styles.

Most browsers limit concurrent downloads to about half a dozen, so if you have, say, 15 scripts, 10 styles and 30 images, all of this will be queued and downloaded in sequence.

However, if you join each resource type into larger files, it will decrease the total download time.

Note that with HTTP/2 this is no longer so true, because it supports multiplexed transport of multiple resources over a single connection and compresses headers, decreasing overhead. So check whether your server supports HTTP/2 before creating resource bundles.
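
For the HTTP/1.x case, a minimal sketch of a naive bundling step; the file names are assumptions, and a real project would also minify:

    <?php
    // Naive bundling: concatenate several scripts into a single file so
    // the browser downloads one resource instead of many.
    $scripts = ['js/jquery.js', 'js/menu.js', 'js/gallery.js']; // assumed files

    $bundle = '';
    foreach ($scripts as $path) {
        // trailing ';' guards against files that omit the final semicolon
        $bundle .= file_get_contents($path) . ";\n";
    }
    file_put_contents('js/bundle.js', $bundle, LOCK_EX);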

Making good use of browser cache

Another problem that is often ignored is not making good use of the browser cache. Pages and resources that do not change should be configured to be cached for a long time.

Imagine we use version 1.7.1 of jQuery on the blog. We can configure its cache to last "forever", that is, never expire. If we want to change the jQuery version, we just serve the library from another URL on the page, for example:

/js/vendor/jquery-1.7.2.js

Therefore, if resources have the version included in the URL and the cache is configured correctly, the user will only need to download the files again when they actually change.

The same can be done with images if you build the image URLs following some pattern based, perhaps, on their upload or modification date.
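
A minimal sketch of this kind of URL versioning, here based on the file's modification time; the docroot layout is an assumption:

    <?php
    // Cache busting: append the file's modification time to the URL, so the
    // browser re-downloads a resource only when the file actually changes.
    function asset_url(string $path): string
    {
        $version = filemtime(__DIR__ . '/public/' . $path); // assumed docroot
        return '/' . $path . '?v=' . $version;
    }

    // Echoing asset_url('js/bundle.js') inside a script tag yields
    // something like /js/bundle.js?v=1459357200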

In any case, it is important to think about each resource your page downloads and define a cache policy for it. The more you can make static and cache "forever", the better.

Using a third-party cache

If your page uses common libraries like jQuery, you might consider serving them from a free external CDN.

Google, for example, hosts several common libraries; by using them, you benefit from files that are already cached because the user visited some other site using the same libraries and versions.

Also, since the browser sees these files on another domain, it can increase the number of simultaneous download connections.

As a bonus, you save bandwidth and processing on your server.

I know you mentioned that you didn't want to use free CDNs, but this case is not the same as using just any CDN. You will hardly achieve higher uptime than a CDN like Google's, for example.

Getting closer to the user

Assuming we apply all the steps above, the blog should be as fast as a bullet. (I know, because I did all of this on mine.)

However, users making their first request from the other side of the world will still have some difficulty loading the heavier pages; in particular, some large images may take a while to download.

Also, sometimes the user presses F5 and the browser makes conditional requests to check whether the resources have changed, and the page load is slow because even that round trip takes time.
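
For reference, a minimal sketch of how the server side of such a revalidation can answer cheaply with a 304 when a file is served through PHP; the file path is an assumption:

    <?php
    // Revalidation: if the browser already has the current copy,
    // answer 304 Not Modified instead of resending the whole file.
    $file  = __DIR__ . '/public/images/photo.jpg'; // assumed static file
    $mtime = filemtime($file);

    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');

    $since = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? '';
    if ($since !== '' && strtotime($since) >= $mtime) {
        http_response_code(304);  // nothing to resend
        exit;
    }

    header('Content-Type: image/jpeg');
    readfile($file);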

Well, this is where you need to start thinking about geolocation, that is, putting a server closer to your client.

Distributing your system

One approach to decreasing latency in general, not just for static resources, is simply to put a copy of the system close to the client.

The problem with this approach is that it is not always easy to split the database, especially if the content is shared among all users, as in a social network.

In the case of a blog, you could simply create nodes in different locations and sync the content to slave nodes from a master node, for example.

However, this approach usually requires changes in the overall system architecture.

Using CDNs

Finally we come to the point.

If you conclude that much of the page load time goes into downloading static resources (or resources that could be static, or that can be versioned by varying the URL), then a good strategy is to distribute those resources close to the user, using servers in different locations, while the main system remains centralized.

Let’s imagine that the blog has more accesses in Brazil and Japan, being the system hosted in Brazil.

If we think of a CDN as a simple content distribution network, we can then place a cache in Japan to store the images, so that users there experience the same download speed as users here in Brazil.

Let’s split the structure of our CDN into network infrastructure and a simple HTTP server.

The network infrastructure could use an Anycast implementation. Basically, this technology allows a request to be routed to the nearest node on the network. Thus, we could serve a single URL for each image, and the server that receives the request would depend on the client's location.

On the simple HTTP server, we only need an Apache and a copy of the images that will be served. No mystery.

The problem is how to put the images on the two HTTP servers in Brazil and Japan.

The lazy way to do this is to put a script on the Apache server that checks whether the image exists locally on that node. If it exists, it just serves the image. If it does not, it requests the image from the main system and saves it locally.
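
A minimal sketch of such a pull-through script; the origin hostname, flat file names and image type are assumptions, and the cache/ directory is assumed to exist:

    <?php
    // Lazy pull-through cache on an edge node: serve the local copy if it
    // exists; otherwise fetch it from the origin, save it, then serve it.
    $origin = 'https://origin.example.com';     // hypothetical main system
    $name   = basename($_GET['img'] ?? '');     // flat names only (avoids traversal)
    $local  = __DIR__ . '/cache/' . $name;

    if ($name === '') {
        http_response_code(400);
        exit;
    }

    if (!is_file($local)) {
        $data = file_get_contents($origin . '/images/' . rawurlencode($name));
        if ($data === false) {
            http_response_code(404);            // origin doesn't have it either
            exit;
        }
        file_put_contents($local, $data, LOCK_EX);
    }

    header('Content-Type: image/jpeg');         // assumed image type
    header('Cache-Control: public, max-age=31536000');
    readfile($local);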

The problem with this approach is that we partially give back the performance gain: the first request will take a long time. The advantage is that an image one user accesses stays cached in that region for all other users. This can be advantageous for a blog, but if each user tends to access different images, the gain will be minimal.

The alternative is to be aggressive: each time a user uploads an image, you distribute the content across your content distribution network. The result is the best experience for users in each region when they view the images.
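
A minimal sketch of the aggressive push, assuming the edge nodes expose a hypothetical upload.php endpoint; the node URLs and temp path are placeholders:

    <?php
    // Aggressive replication: after an upload, push the file to every edge
    // node before confirming to the user.
    $nodes = [
        'https://br.cdn.example.com/upload.php',
        'https://jp.cdn.example.com/upload.php',
    ];

    function push_to_node(string $url, string $file): bool
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => ['image' => new CURLFile($file)],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 30,
        ]);
        curl_exec($ch);
        $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        curl_close($ch);
        return $ok;
    }

    foreach ($nodes as $node) {
        if (!push_to_node($node, '/tmp/uploaded.jpg')) { // assumed temp path
            // a real system would queue a retry here instead of failing silently
        }
    }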

On the other hand, you create a difficulty for whoever sends the images. Now the upload request needs to wait for synchronization between servers, and it is the content creators who suffer.

You might think about doing this asynchronously, but it also means that someone accessing a page at the time of an upload might see a broken link because the image hasn’t yet landed on the nearest CDN.

And the problem would only increase by scaling your content network to more locations.

A third alternative is to combine the first two: replicate aggressively, but also fetch an image individually whenever the node that received the request does not find it locally.

A homemade CDN?

For all these reasons, I really want to discourage you from building your own CDN solution.

If you have little traffic, CDN is probably not worth it. If you have a lot of traffic, paying a good CDN will be one of the smallest concerns.

Think of the time spent setting up this entire infrastructure, and the time spent handling synchronization issues during uploads.

Considerations

I wrote all this because, most of the time, you can solve your performance problem and reach acceptable performance simply by tuning your server and/or system.

On the other hand, the CDN is often seen as a kind of "savior of the nation": a magical, quick and easy way to gain performance with little effort. In most cases, that is not true.

Of course, the decision to build or hire a CDN, or even not to use one at all, depends on each type of system and how it is used. I hope this information helps readers make better decisions for each situation.

  • I already use sprites and caching for static files, including some DB items, as you mentioned. I haven't finished reading everything; I'll read and reread it calmly... This answer was a lesson!

5

CDNs are more complex than a mere redirect decided by a PHP server. Especially because, once you account for the latency the server needs just to make that decision, you are already working against the goal.

Real CDNs, like Cloudflare or Akamai, have a much more robust structure, down to the datacenter and routing level; in other words, it's a fight for the big players.

I think the intention of not relying on third parties to serve content is commendable; after all, this is how new ideas arise. The idea of routing via PHP can be an interesting alternative for redundancy or load balancing. We would also need more concrete, real data about what your demand will be, so we can do the math and see what your approach would lack and what it would cost.

Without wanting to discourage you: consider that, depending on the size of the demand, a Cloudflare plan with ready-made resources may be more affordable. If you have more data, we will be happy to help.

5

1 - The choice of server is made even before your application is invoked. The component responsible is a network element (e.g., a reverse proxy) or the DNS itself (see item 3). You can think of the CDN as a layer sitting in front of your application.

2 - It may seem like an empty answer, but: it depends. There are passive and active implementations. The active ones (instant replication) tend to show better results.

3 - You will need to define the request router: the component that knows which server is closest to the requester. Usually this is done with a DNS server that uses an IP geolocation table.

  • +1, just one caveat: depending on DNS is not always effective. The geolocation of the DNS resolver has no direct relationship with the geolocation of the end client, and there are cases where resolution happens in cascade. For example, if someone analyzes my requests, they will mostly see OpenDNS IPs, which are unrelated to the geographic position from which I am requesting the data. DNS serves only as an auxiliary layer. In addition, some providers forward requests to third-party DNS, complicating geographic determination even further. The ideal would be to route the address at the edges, between the ASes.

3

Basic characteristics and requirements of a CDN

A CDN may not only keep a copy of static content; it can also be used to distribute cached pages, which can be kept in main memory to speed up delivery.

For a CDN to perform well, you must take care to keep the file-delivery headers as small as possible, avoiding, for example, the use of cookies on the CDN's domain. One of the side benefits is an improvement in SEO ratings.
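
A minimal sketch of lean, cookie-free delivery of a static file through PHP; the asset path is an assumption:

    <?php
    // Lean delivery: no session, no cookies, only the headers the browser
    // needs to cache and render the file.
    $file = __DIR__ . '/static/logo.png';      // assumed asset path

    header_remove('X-Powered-By');             // drop PHP's default header
    header('Content-Type: image/png');
    header('Content-Length: ' . filesize($file));
    header('Cache-Control: public, max-age=31536000');
    readfile($file);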

It is very important to observe server latency for a given location. For example: if the response time from one server to a client located in Brazil is 200 ms and from another it is 20 ms, the best server is the second. To evaluate server latency from different locations, there are plenty of websites that run these tests, e.g. http://traceroute.org/

The structure using DNS with GeoIP

     [GeoDNS Server]
            | |
            | | --> [CDN Server 1] --|
            | |                      |
Client ---> | | --> [CDN Server 2] --| --> [Web Server]
            | |                      |
            | | --> [CDN Server 3] --|
            | |

The main content node can run a mix of Apache + PHP + MySQL, for example, while the edge nodes (delivery servers) can use Nginx and Varnish for caching.

For the DNS server you can use BIND9 together with a GeoIP database, so the DNS server distributes clients by geolocation. A request, however, may be forwarded several times before it reaches the server and comes back, which can prevent the client's IP from being determined correctly. There are ways around this problem, but it usually does not cause major inconvenience, unless you want city-level CDN routing.

The structure using Anycast DNS

The Anycast-based structure is a little more complex because it sits on a lower network layer. Anycast assigns the same IP prefix to multiple servers, whereas common DNS tries to resolve to the most suitable IP for the response; this results in lower name-resolution latency.

Another advantage of Anycast is that it also acts as load balancing: if the nearest server is down, requests are automatically forwarded to another.

         [Router]
Client --> | | --> [CDN Server 1] --|
                                    |
Client --> | | --> [CDN Server 2] --| --> [Web Server]
                                    |
Client --> | | --> [CDN Server 3] --|

That is, the client always receives the same IP and, when the network receives the request, it determines the best route to serve it. You are not connecting directly to a server, but to a network.


  • Great explanation, with the diagrams and great references; I will read each one calmly.
