Cities and states in XML or Mysql


I am optimizing my database and my PHP application, and I would like your opinion on the data for cities and states. Currently I use MySQL to display the cities belonging to the state the user selects.

I then considered not using the database for this at all, and instead using an XML file with all of this data, in order to save my server's resources. What do you think? Is searching for cities in an XML file lighter than querying MySQL?

  • Much more complex. XML is one of the worst things ever invented, because it is bad at everything. The format only caught on because the market is driven by theorists in comfortable positions at companies, and by academics who do not deal with problems in practice. It does have its niche, but those who use it almost always do so out of ignorance.

  • Other examples of things often used out of "fashion": Base64 (of every 10 uses I see, 9 are absurd), regex, OO in a language where you don't need it, REST, SOAP, MVC, design patterns in general.

  • If you want to simplify without installing MySQL, you would be better off with SQLite, which takes about 300 KB, or even an array included in the code. If MySQL is already installed on the machine, there is no advantage in not using it.

  • Thanks for the clarification.

  • I just wanted to move the subject forward, and take the opportunity to push back against the fashion :) But let's see if someone posts an "official" answer with sound and interesting guidelines. Since the site's format allows multiple answers, different people may present different alternatives.

  • @Bacco Glad I'm not the only one who doesn't like XML. Mainly in the years between 2005 and 2010, it was tiresome dealing with people who thought XML was the silver bullet, the magic solution to every problem, that everything in the world had to be in XML, and that not using XML was practically a crime. When I criticized this, I was called stupid and/or crazy.


2 answers



Three different approaches to locating data

The file idea is actually good, because the list of cities changes very slowly, so updates would be sporadic and occasional.

The reason is that with the database, the server has to:

  1. Receive the request.
  2. Connect to the database.
  3. Make an SQL query.
  4. Obtain the results of the query.
  5. Assemble an answer in HTML, XML, JSON format or the format you prefer and that will have an X size.
  6. Send this X-size response to the browser.

Without the database, it gets faster because the server will:

  1. Receive the request.
  2. Locate the ready-made file.
  3. Open the file and read it in full.
  4. Send the file contents (size X) to the browser.

It gets even faster if you keep the content in memory, like this:

  1. Receive the request.
  2. Send the contents of a global variable, a string constant, or fixed text from memory to the browser.

One way to do this is with a <?php include, since PHP keeps in memory the contents of every PHP file it has accessed (it does not keep the output they generate, but it does keep the PHP source code). By putting fixed content into that PHP file, the source code becomes the output itself, which in the end lets PHP simply copy that memory directly into the request's response with little or no further processing.
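A hedged sketch of that include trick (file path and content are hypothetical): the generated file contains only literal output, and with OPcache enabled PHP keeps its compiled form in memory, so serving it is essentially a memory copy.

```php
// Sketch of the include trick described above; the path and the JSON
// content are hypothetical. The generated file's source code IS the output.

$path = sys_get_temp_dir() . '/cidades.php';

// This would normally be written once by a generator script:
file_put_contents($path, '["Adamantina","Adolfo","Arujá"]');

// At request time, PHP (with OPcache) serves the file from memory:
function serve_cidades(string $path): string {
    ob_start();
    include $path;          // literal content is simply echoed
    return ob_get_clean();
}

echo serve_cidades($path);  // the JSON goes straight into the response
```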

To avoid having to edit by hand the XML (or whatever format it is) containing the city data, you can write a procedure that reads the necessary information from the database and, at the end, generates the content and saves it all into a PHP file already in the desired format. That way, if/when your database changes, you just run the procedure again. You will not need to run it at any other time than when something changes in the database, in the data format to be served, or in the algorithm that generates it.
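A minimal sketch of such a generator, assuming a PDO connection and a cidades table with uf and nome columns (all of these names are assumptions); the database part is left commented since it depends on your schema:

```php
// Builds the body of the generated file from a list of city names.
// Kept as a pure function so the database part stays separate.
function build_cidades_payload(array $cidades): string {
    return json_encode($cidades, JSON_UNESCAPED_UNICODE);
}

// Hypothetical usage -- run only when the data actually changes:
// $pdo   = new PDO('mysql:host=localhost;dbname=app', $user, $pass);
// $stmt  = $pdo->query("SELECT nome FROM cidades WHERE uf = 'SP' ORDER BY nome");
// $nomes = $stmt->fetchAll(PDO::FETCH_COLUMN);
// file_put_contents('cidades-sp.php', build_cidades_payload($nomes));
```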

Caching

In addition to keeping everything in memory, you can take advantage of the browser's caching capabilities, given that the content will change very rarely. To do this, you publish the content somewhere under a URL such as cidades-2016-08-14.php. There, you put the following headers in the response:

Cache-Control: max-age=31536000
ETag: "Cidades20160814"

Then, on the server, before sending the in-memory content to the client, you check for a request header called If-None-Match and whether its value is "Cidades20160814" (with the quotes). If it is, you return status 304 Not Modified with an empty or very short response body. If it is not, you return your data and add those two headers to the response. Note that the values of ETag and If-None-Match always come wrapped in quotes.

The Cache-Control field specifies the expiration. Here I set it to 31536000 seconds, which is 365 days. The ETag field is used for conditional GETs. You pass this value to the browser when it first fetches the resource, and the browser stores it. After the max-age expires, the browser sends the ETag value back in the If-None-Match header to ask whether the content has changed; by responding 304, you signal that it has not. If it had changed, you would return the new content with a different ETag.
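The conditional-GET check described above can be sketched like this (the ETag value is the hypothetical one from the text; the decision is isolated in a small function, with the header-sending part commented):

```php
// Conditional GET sketch using the hypothetical ETag from the text.
const CIDADES_ETAG = '"Cidades20160814"';

// 304 when the client's cached copy is still current, 200 otherwise.
function status_for(?string $ifNoneMatch): int {
    return $ifNoneMatch === CIDADES_ETAG ? 304 : 200;
}

// In the request handler:
// if (status_for($_SERVER['HTTP_IF_NONE_MATCH'] ?? null) === 304) {
//     http_response_code(304);               // empty body
// } else {
//     header('Cache-Control: max-age=31536000');
//     header('ETag: ' . CIDADES_ETAG);
//     echo $conteudoDasCidades;              // the in-memory content
// }
```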

The ETag should change when the content of your resource (in case the city list) changes. Let’s assume that on 01/10/2017 some municipality will be created somewhere. At this point, you could update the content of your resource in PHP and switch to ETag for "Cidades20170110".

This ETag scheme still has a problem: when the database changes, the browser will keep the old data cached for a long time, until the client clears the browser cache, the max-age expires, or something similar happens, which can be problematic. One possible solution would be to reduce the max-age value, but that is not necessary. Instead, you just change the URL to cidades-2017-01-10.php and put the new data there. The day the data changes, the URL changes with it, so the cache of the previous content automatically stops being used, as long as all references in the other PHP and JavaScript files point to the new URL. That is exactly why a very high max-age value can be used safely in this case.

Searches in the database

To search cities or do anything similar, the best approach is to hand the entire content to the browser and implement all searching in JavaScript. That way, the browser can do it on its own without asking the server for anything.

If a server-side search is necessary, then for best performance you search only in the server's own memory, and for each request you generate an ETag value that can be easily determined from the data supplied in the query (e.g., the city name) and from the date the data was last updated. The exact details of the algorithm that generates the ETag, and the exact content of the ETag, do not matter much, as long as these two important rules are respected:

  • The algorithm has to be deterministic. That is, for the same input, same version of your dataset and same query URL, it should generate the same output.

  • The algorithm has to be collision-free. That is, for the same query URL, two different inputs, or identical inputs against different versions of your dataset, must generate different ETags.

It is easier to guarantee these two properties (especially the second) when each version of your dataset has its own URL, because then the problem is reduced to generating the ETag from the input alone. If the input is guaranteed to be a short string, an easy and simple way to generate the ETag is to make its contents be the input itself (perhaps with some escaping/encoding).
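For inputs that are not guaranteed to be short or clean, one deterministic option (my assumption, not prescribed by the answer) is to hash the query string:

```php
// Deterministic, collision-resistant ETag derived from the query input.
// Hashing sidesteps escaping issues with arbitrary input strings.
function search_etag(string $query): string {
    return '"' . md5($query) . '"';   // quotes are part of the ETag syntax
}
```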

Additional reading

The text I used as a source is worth reading: https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching


Regarding performance, any database query consumes more resources than simply reading a file.

If you want a simple and straightforward answer: yes, within the context you described in the question, reading the data from an XML file is the best option.

However, I suggest simplifying even further with the JSON format, as it is even better than XML, consuming fewer resources.

The XML standard has the "advantage" of validation with schemas. With JSON it is also possible to simulate XML's features through workarounds, but whether that is needed depends on each case. In the case you described, I see no need for XML.

From here the subject branches into other topics, so I will stop at that.

Back to the subject: a "small" disadvantage of not using the database appears when you need mirroring. It is "easier" to set up replication and not worry about data in files like TXT, JSON, XML, and so on. With files, you have to take care to maintain the integrity of the mirrors and keep everything synchronized.

If you want to optimize even further, you can opt for a simple text file where each city is separated by a semicolon, comma, tab, line break, whatever.

Alternatively, you can leave the data already in the format of the language you are using. For example, in PHP:

$arr = array(
    'cidade 1',
    'cidade 2',
    'cidade 3',
    'cidade 4'
);

With this, you just include the PHP file.
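A common variant of that include (my addition, not in the original answer) is to have the generated file return the array, so the include expression yields the value directly without touching global scope:

```php
// The generated file returns the array instead of assigning a variable.
// The path here is just for illustration; a generator would write it once.
$path = sys_get_temp_dir() . '/cidades.php';
file_put_contents($path, "<?php return array('cidade 1', 'cidade 2');");

$arr = include $path;   // include yields the file's return value
```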

If you were to use JSON, it would look like this:

$arr = json_decode(file_get_contents('cidades.json'), true);

A cocky rookie would say the second option is better because it apparently has less code, but the truth is that its cost in memory and processing is higher: besides the cost of reading the file, there is the cost of converting the JSON into a PHP array. If the final destination is a PHP array anyway, nothing is more obvious than leaving it in PHP array format.

The above example applies when you don't need to share the data with third-party systems.

What I described above are the most obvious points. Knowing the subject, choose what is appropriate for your case because, in the end, that is what matters.

But, oops! Wait. What if you need a city search system? Searching for cities that contain "ab", for example. Or what if you need to list all the cities with anniversaries in May?

Then the conversation changes, because a database is the better option for relational data and searching, though that still does not rule out caching data in text files (XML, JSON, etc.).

Which database to use depends on the project. Sometimes something like SQLite looks more viable, but it may not be the best option if you need to link the city data to other tables in a DBMS like MySQL.

However, nothing prevents you from optimizing the way you store data or from caching recurring search results. You can have a table of cities in MySQL and, at the same time, a copy of that table in SQLite or in text files.
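A hedged sketch of keeping such a copy in SQLite via PDO (table and column names are assumptions; in practice the rows would come from a query against the MySQL table):

```php
// Copies a list of city names into a SQLite table (a read-only mirror).
function copy_to_sqlite(PDO $sqlite, array $cidades): void {
    $sqlite->exec('CREATE TABLE IF NOT EXISTS cidades (nome TEXT)');
    $stmt = $sqlite->prepare('INSERT INTO cidades (nome) VALUES (?)');
    foreach ($cidades as $nome) {
        $stmt->execute([$nome]);
    }
}

// Usage with an in-memory database (a file path also works):
$db = new PDO('sqlite::memory:');
copy_to_sqlite($db, ['Santos', 'Sorocaba']);
```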

In summary, it is not very advantageous to skip the database when you need relational data or searching, unless you want to develop a new search method "from scratch", new algorithms, or "reinvent the wheel" with no real gain.

  • I don't think your answer is the right or the best one (if it were, I wouldn't have posted mine). But I also don't think it deserves the -1, because you do present useful content with interesting ideas. So keep my +1.
