What is the best way to represent an Address?

Asked

Viewed 13,910 times

37

When modeling addresses in an application, naively followed the standard form when representing in the BD (i.e. separate tables for country, state, city, neighborhood etc. - joins everywhere), and in the user interface (html) I put a separate field for each part of that address (text box with autocomplete or combo box). At first I thought it was good that way, and I didn’t think much about it.

However, after doing a question in the UX.SE (relative to our standard of using "X Street" instead of "X Street") the answers led me to question whether it was worthwhile in practice to require such a level of detail in the representation. In addition to the additional complexity when searching/updating (as exemplified in that "pyramid"), do not know how the performance of the system will be when it contains a high number of addresses.

I’d like to know who you already have experience dealing with a large number of addresses, what practices would be recommended: leave everything normalized, use an open text field, condense some tables into one (eg.: cidade_estado_pais) and separate others, etc. taking into account that:

  • Few users will enter with many addresses, of other people (if it were every user entering with their own address once and that’s it, it wouldn’t justify investing in usability).
  • If a part of the address already exists in the database (e.g., a previously registered street) the autocomplete can be used to speed up data entry; this would be more difficult if the address were an open field.
  • Some data are easier to find and pre-popular (e.g., complete list of Brazilian cities), others are more difficult or more expensive - may be better than the user entering them on demand (but still allowing autocomplete).
  • If a field is opened, it is more subject to duplication (e.g., "Av Foo", "Av. Foo", "Avenida Foo", "A. Foo", "Foo"); but duplication is not necessarily a problem...
  • It is more difficult to aggregate in a denormalized field (e.g., if I want statistics by state, but Agrupei cidade_estado_pais in one field, I’ll have trouble).
  • Note: I know that an "optimal" solution varies from case to case (depending on the application), but I am not simply looking for opinions, in which every answer is valid - I would like to hear from those who have experience dealing with systems that contain a large number of addresses, for example to avoid problems that cannot be anticipated when their base is small, only appear after it has already grown in scale.

  • I don’t understand what you’re asking. It’s about data modeling (database) or usability (user interface)??

  • @The question is about modeling, although the different representation decisions (normalize, not normalize) end up having a great impact on usability. In other words, I found it important to quote usability considerations here to help guide an appropriate response.

  • 1

    https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/

4 answers

23


In my opinion the best address database in Brazil is the e-DNE that has more than 900 thousand records and, in my opinion, works very fast. Follow the diagram of their seat. Diagrama e-DNE correios

When I needed to use this bank I found it very simple easy and fast, I believe it will serve for your purposes and of course this mail bank is very complex because it covers many things but you can reduce it for your purposes.

Here you can download a zip with the bank template and a doc file that explains very well how it works. So I guess you don’t have to reinvent the wheel, let’s mirror the best...

Now in the matter of duplicate address and duplicate data in general, the best way to solve this is a very smart algorithm that will compare the data provided and cross-check with information to try to unify the information.

I recommend reading this Link, and good luck after all is not a simple task, but achieving the expected results is very rewarding.

  • Thanks for the suggestion, I will read the documentation indicated. In fact, if there is a standardized and widely used model there are many advantages in using it, for example in interoperability with other systems that also use it.

  • Boy, I just got a quick read and I’ve already found a serious mistake in mine modeling: was representing "neighborhood" and "patio" in a relationship 1:N, but a street can pass through several neighborhoods! A couple (logradouro, número) Yes belongs to a single neighborhood, but the street belongs to the city and not to the neighborhood. I have seen that this pattern will be very useful and instructive to "lay people" like me (who even if I have an "intuitive" notion of addresses, I can still make such mistakes when it comes to "putting it on paper").

  • good that helped, I realized that the big move of this post office bench is the way it stores the zip code, separating the zip code you can consult very fast exactly where you want.

14

In our case, the way the data are organized is directly linked to the use of them to facilitate their completion, collection and analysis, i.e.:

  1. If we are going to perform surveys or compile statistics on certain information, this information must be separate:

    Statistics on the location of the person, obliges the field localidade be in a column just for him.

    Example: Coimbra

  2. If the field is not relevant to surveys or statistics, the field may be together with other information:

    The case of the streets where the person resides, which may have short or long names but are not subject to analysis, may be in a single column:

    Example: Av. da Liberdade or Avenida da Liberdade is equal. It boils down to text that separated into two columns, one for the street type and the other for the street name will only generate confusion when filling and/or collecting this information.


Having everything separated if there is no benefit is unnecessary. Having everything together and then needing to analyze the data separately also makes no sense.

What we try to do is find out which data is most common and easy to manage separately. Then we cross-check that with the data that we already know will be subject to analysis, taking into account those that may in the future also be subject to analysis.

With the result, we have a good idea of how we should store all the information.

On the other hand, we also have to consider the use of secondary tables, as is the case of countries. It makes no sense for a user to have to write the name of their country if there are X countries worldwide and that number is rare to change, it makes sense to have a secondary table with countries and the user just has to select his or her.

  • I think you’re right, but I’m gonna wait a little longer. I would like to hear from those who have already been through this situation but made a mistake in the trial (i.e. he thought he would not need the analysis in the future, needed it, and had to make adjustments). If it was quiet to migrate the data, if problems arose...

9

Working on a gigantic project, I realized that it is not possible to escape much this way of treating this paradigm, leaving each table separated by street, neighborhood, city...etc. Something important is to always maintain indexes in all tables and optimize their queries, today we also make a filter, so that the user is more specific in the search if it brings many results, my development team is remodeling the system from scratch starting with the BD, a priori this is the experience I had, in case we get to this part of the bank modeling, and something is modified I will be happy to come to this question and demonstrate to have something more concrete for you, I hope that even though little has contributed...

  • Thank you for the answer! In your case, you separate the patio from its type (Street, Avenue, etc) or it is all one field?

  • For now we use everything together only: Street X, Avenue Y...

  • For now it only stays together the type of street and the name of it, the rest everything has its own table.

  • Good to know, because in fact this is the part that brings more problem (in my initial implementation I did it separately too), I’m even inclined to join. I believe that the cases of duplication (Street X, R. X) are quiet to solve, especially when the addresses are informed by trained employees and not by the general public, right? (or you didn’t experience this problem - for example, by using lists of pre-populated streets?)

  • 1

    Dude, if you’re not going to have this problem of anyone entering, I think we highly recommend keeping it together even, because in my case it’s the other way around, I work with an e-commerce platform, that is, users are signing up, it becomes a mess, all duplicated, in your case I think it valid to keep together...

8

I would like to know, from those who already have experience dealing with a large number of addresses, what practices would be recommended: leave everything normalized, use an open text field, condense some tables into one (e.g., city_state) and leave others separate, etc.

In the company where I work I am responsible for a service that consolidates addresses of several other systems. Since you mentioned the question of experience much of what I will talk about here is my experience/opinion.

» Normalize or not?

Normalize. Unless what you’re doing is extremely simple, I recommend normalizing the data. The consistency of the data will not only free you from future problems (it is, for example, more difficult to insert an invalid address) but will also avoid an plastering of these data (a possible report or integration with another system that has such data standardised).

» Aggregating tables?

I see no need. The only possibility to aggregate tables I see is in the example you cited (city-state-country). If correctly modelled an eventual Join or research will not generate a overhead significant.

» Type of patio?

There are systems that dedicate their own column to the type of the address patio (avenue, street, mall, etc). I don’t think it’s important to do it unless it’s really necessary.

Think, for example, how to extract this user-provided information. Or you make a select (drop down list) and forces the user to select a type of correct street (which is not very functional/pleasant) or tries to extract this information from the street field (which can be complicated to do properly or leave the flow stuck).

» How to model after all?

The following is a practical example of the service I mentioned. The real base has hundreds of thousands of addresses.

Table Endereco:

  • CEP: PK table. Field with exactly 8 characters. Always without mask.
  • Logradouro: Including the type of street and the address number.
  • Bairro
  • Localidade: City. Had an index.
  • UF: Field with exactly 2 characters. It is a FK for the table of federative units.
  • Data: Date of registration/address update.

Table Unidade Federativa

  • Sigla: PK table. Field with exactly 2 characters. State acronym.
  • Nome: Full name of the state.

» Initial charges

Recommend only to Federativa drive and city. Unit it’s rare for there to be changes. City it is already more common to have changes but nothing that is not possible to maintain. Any more localized field such as Neighborhood, it’s hard enough to keep a charge.

Obs.: I refer to data that is not entered by users. This data static maintained by the system itself (often manually by the developers themselves). Hence the recommendation to do only with city and Federative unit.

» Search the column? Index!

Remember to create index in the columns you search for them in (e.g., neighborhood autocomplete).

» Each case is a case

The recommendations I leave are for relational databases and addresses in Brazil. Remember to adapt and put your own knowledge and criticism when modeling your system. After all, each case is a case.

  • 1

    "as well as avoid a plastering of this data" there is something that had not thought, thank you for calling attention to this possibility. I also see that you agree with the others that normalization is desirable (and does not cause performance problems if done correctly), except for the patio and type - which can be kept in one field only. I am already strongly inclined in that direction.

  • At second reading, I had a question: is there a reason for not do initial load of cities, neighborhoods, streets, etc? Okay, that changes, but if this data came from the user, wouldn’t they be subject to the same problem? In my case, I have access to a rather outdated base (2005+/-) of streets, but even if a percentage of them is incorrect I believe that correcting them on demand is less costly than imposing on customers (and it is more than one, with independent bases) start everything from scratch - which also opens up the possibility of mistakes. Agree, or something is escaping me?

  • 1

    @mgibsonbr I see no problem. The problem is when you try on your own to keep this data up to date. I’ll edit my answer to clarify what I meant.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.