What is the purpose of using UUID (version 4) as "Primary key"?

Viewed 611 times

7

I have noticed that some projects have replaced the auto-increment ID (whose implementation depends on the DBMS) with a UUID, usually version 4, which is based on pseudo-random generation. The question is not about collisions, which have a minimal chance of occurring with UUID v4, but about the motivation behind this choice. Here are some examples of use:

Entity framework

[DatabaseGenerated(DatabaseGeneratedOption.Identity)]
[Key]
public Guid Id { get; set; }

...

In Entity Framework Core I believe this is not available yet, or maybe it never will be; I can’t confirm, as I haven’t used it yet. So you have to solve it in the application with protected override void OnModelCreating(ModelBuilder modelBuilder) (or by other means) and apply NEWSEQUENTIALID() (or Guid.NewGuid()).

Hibernate

In Hibernate I believe it is done with generators, something like this:

@Id @GeneratedValue(generator="system-uuid")
@GenericGenerator(name="system-uuid", strategy = "uuid")

...

Laravel Eloquent

In Eloquent, the ORM used by the Laravel and Lumen frameworks (or even stand-alone), we can configure this in the models (and it is usually done this way), with a trait or directly in the model of interest, overriding the properties $keyType and $incrementing and registering a creating callback:

use Illuminate\Support\Str;

...

    protected $keyType = 'string';
    public $incrementing = false;

    protected static function bootHasUuid()
    {
        static::creating(function ($model) {
            if (!$model->getKey()) {
                $model->{$model->getKeyName()} = (string) Str::uuid();
            }
        });
    }

I want to point out that in some cases the UUID could even be set as the column default, without needing the application layer, as in PostgreSQL 13, which has a native UUID generator (gen_random_uuid()) instead of requiring an extension or script, and maybe in SQL Server with NEWSEQUENTIALID() (I have tested none of them in practice).

All these application-level solutions (Asp.net-mvc, JPA, php) seem to work for simple things, but I believe they are not very efficient when inserting a lot of data at the same time (I don’t know how the core of these frameworks works; I just assume the UUID is generated in the application layer, so if I am wrong, please comment). Still, my question is not about performance, but about the purpose of using UUID version 4 as the ID instead of INT + auto-increment.

Is it meant to make it difficult to identify data sequentially? For example, if I have ID 23, I can assume there is another ID 24 in the same context, and so on.

Or is it meant to make it difficult to deduce the amount of data?

Or some other reason?

  • An int with auto-increment is unique only within one database and one table, whereas a UUID is universally unique: even if you run select uuid(); on two distinct machines with distinct DBMSs, the values returned will be different. If you adopt UUID as the identifier, migrating data becomes easier. Since an int is unique only in the context of your table or database, you have a high chance of conflicts when importing into another database, while UUID gives you more guarantees in a migration.

  • @cat I agree with that; Piovezan even posted a link in the comments where the author points out that possibility. Personally, though, I don’t find migrating or merging data from different sources into a new or continued structure, where a JOIN occurs, so complicated to solve. If a migration that unifies data does happen, the application layer itself will have to change anyway, since it is a new context. So the idea seems good, but if it is meant to solve a future problem, it is better to learn to solve that future problem without needing it...

  • 1

    There was no need for migration ;D

  • 1

    ... I mean, it’s as if I created a reasonably smart bot for simple purposes, but put a bomb inside it because I’m afraid it will dominate humanity, so that I could destroy it if necessary. If that were even possible, it would mean I had created a bot far too powerful for what should be an average task; the problem would lie elsewhere and should be solved some other way :P. I know that illustrating it like this may sound exaggerated, but that is how the idea of using UUID v4 looks to me so far.

  • 1

    Setting aside the comparison with auto-increment, the motivation is that UUIDs do not require a central authority to manage them, which makes each UUID unique and independent of anything else.
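
The point raised in these comments can be sketched in Python. This is only an illustration, assuming two hypothetical, independent databases modelled as plain dictionaries: auto-increment keys collide when the data is merged, while version 4 UUIDs generated locally need no coordination.

```python
import uuid

def make_autoincrement_table(names):
    # Each database starts its own counter at 1, as an auto-increment column would.
    return {i: name for i, name in enumerate(names, start=1)}

def make_uuid_table(names):
    # Each row gets a random (version 4) UUID generated locally.
    return {uuid.uuid4(): name for name in names}

# Two hypothetical databases that never talked to each other.
int_a = make_autoincrement_table(["alice", "bob"])
int_b = make_autoincrement_table(["carol", "dave"])
uuid_a = make_uuid_table(["alice", "bob"])
uuid_b = make_uuid_table(["carol", "dave"])

# Keys present in both sources would conflict on import.
int_conflicts = set(int_a) & set(int_b)
uuid_conflicts = set(uuid_a) & set(uuid_b)

merged = {**uuid_a, **uuid_b}
print(sorted(int_conflicts))  # keys 1 and 2 exist in both sources
print(len(uuid_conflicts))    # a collision is possible in theory, but extremely unlikely
print(len(merged))
```

With integer keys, the import would need a remapping step for every conflicting row and for every foreign key pointing at it.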

2 answers

4

Thinking of random identifiers in general, regardless of the standard, there are some advantages:

It is a little more secure. As mentioned, with an auto-incremented identifier it is much easier to find other resources of the same kind: if your user’s data is at /users/123, you can guess that there is probably a user at /users/122, /users/121, etc. With a random identifier this is not so simple. Still, nothing prevents an attacker from writing a script that generates random IDs to try to find the desired resource.
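
A small Python sketch of that difference, reusing the /users/... paths from the paragraph above as illustration (`neighbouring_urls` is a hypothetical helper): neighbours of a sequential ID are trivial to derive, while one version 4 UUID says nothing about the others.

```python
import uuid

def neighbouring_urls(user_id, span=2):
    # With sequential integers, nearby resources are one subtraction away.
    return [
        f"/users/{n}"
        for n in range(user_id - span, user_id + span + 1)
        if n != user_id and n > 0
    ]

print(neighbouring_urls(123))
# ['/users/121', '/users/122', '/users/124', '/users/125']

# A version 4 UUID has 122 random bits, so knowing one key does not
# help predict another one.
known = uuid.uuid4()
other = uuid.uuid4()
print(f"/users/{known}")
print(known.version)  # 4
```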

Still, this is just security through obscurity, which is fragile: if there is no authentication and authorization, the flaw still exists, although it is an extra obstacle for the attacker. In an ideal world, systems would have this security, but in reality they do not always have it. You can find systems with authentication only and no authorization check, i.e., once logged in, any user can perform any action, for example when the only check is whether the token is valid.
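
To illustrate why the identifier format is not the actual protection, here is a minimal Python sketch of an authorization check. The `User` and `Profile` types and the `can_delete` rule are hypothetical and not taken from any framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool = False

@dataclass(frozen=True)
class Profile:
    id: int
    owner_id: int

def can_delete(user: User, profile: Profile) -> bool:
    # Authorization: being logged in is not enough; the user must own
    # the profile or be an administrator.
    return user.is_admin or profile.owner_id == user.id

def delete_profile(user: User, profile: Profile) -> str:
    if not can_delete(user, profile):
        return "403 Forbidden"
    return "204 No Content"

attacker = User(id=7)
victim_profile = Profile(id=123, owner_id=123)
print(delete_profile(attacker, victim_profile))  # 403 Forbidden, even though the ID was guessed
```

With a check like this in place, guessing /users/122 from /users/123 reveals nothing the user was not already allowed to reach.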

Random identifiers are practically inexhaustible (a version 4 UUID has 122 random bits), unlike integers, which are limited by the type used. A BIGINT does have a maximum value of 9,223,372,036,854,775,807 (in MySQL), which is extremely high, but in a Big Data scenario at a large company, thinking long term, a random identifier may be preferable.
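
For scale, a quick Python calculation of the two key spaces. The figures are the standard ones: a signed 64-bit integer and the 122 random bits of a version 4 UUID.

```python
# Largest value of a signed 64-bit integer (BIGINT).
bigint_max = 2**63 - 1

# A UUID is 128 bits wide; version 4 fixes 4 version bits and 2 variant
# bits, leaving 122 random bits.
uuid4_space = 2**122

print(f"{bigint_max:,}")          # 9,223,372,036,854,775,807
print(f"{uuid4_space:.3e}")       # 5.317e+36
print(uuid4_space // bigint_max)  # how many times larger the UUID space is
```

Both spaces are finite; the UUID one is simply large enough that exhausting it is not a practical concern.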

Using this type of identifier is much simpler when you have a distributed data storage system, for example a blockchain, and perhaps separate read and write databases (I don’t know, but it seems to fit). Another, more common use case is when you need to uniquely identify a user without requiring registration, as an analytics tool does. If a socket connection is used, this could be done the traditional way, keeping things simple. If not, it would be wasteful to send a request to the server just to get an ID and then save it on the client’s machine; besides, during that time the client could leave the application before the identifier is saved, making it necessary to generate a new one for the same client.

One of the links posted in the comments of the previous answer mentions this advantage: since this identifier is usually (not always) created on the application side, it is possible to identify a resource before it is sent to the database. However, using the identifier beforehand can cause problems, since there is no way to guarantee that the insertion actually worked. Even if the application is well tested, the database server may crash, for example, and the complexity and work of building a system that reacts to this type of failure probably outweigh the small gain.
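
A sketch of that idea with Python’s built-in sqlite3 module, assuming hypothetical orders and order_items tables: the key is generated in the application, used to build related rows before anything is sent, and only treated as persisted once the transaction commits.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE order_items ("
    "order_id TEXT NOT NULL REFERENCES orders(id), sku TEXT NOT NULL)"
)

# The ID exists before the database has seen the row.
order_id = str(uuid.uuid4())
items = [(order_id, "sku-1"), (order_id, "sku-2")]

try:
    # `with conn` commits on success and rolls back on an exception,
    # so the parent and child rows are stored together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer) VALUES (?, ?)", (order_id, "alice")
        )
        conn.executemany(
            "INSERT INTO order_items (order_id, sku) VALUES (?, ?)", items
        )
    saved = True
except sqlite3.Error:
    # Until the commit succeeds, the ID must not be treated as persisted.
    saved = False

item_count = conn.execute(
    "SELECT COUNT(*) FROM order_items WHERE order_id = ?", (order_id,)
).fetchone()[0]
print(saved, item_count)
```

The transaction covers a failed insert, but anything that already used the ID outside the database (for example a message sent to another service) would still need its own handling, which is the complexity the answer warns about.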

As a disadvantage, the larger size of the identifier has been mentioned, which is usually true, but not necessarily. Some of these identifiers are created from the insertion timestamp, so the insertion date can be recovered from them (for example, the MongoDB ObjectId). Since that date is commonly stored anyway, the identifier can play the role of two fields, and the extra space it takes is offset by the field that is no longer needed. In any case, such a small difference should not be a problem either.
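
As an illustration of an identifier that carries its creation time: the first 4 bytes of a MongoDB ObjectId are a big-endian Unix timestamp in seconds. The sketch below decodes it with the Python standard library only; the sample value is just an example ObjectId, and `objectid_created_at` is a hypothetical helper name. Note that this does not apply to version 4 UUIDs, which are purely random.

```python
from datetime import datetime, timezone

def objectid_created_at(object_id_hex: str) -> datetime:
    # An ObjectId is 12 bytes (24 hex characters); the first 4 bytes
    # hold the creation time as seconds since the Unix epoch.
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId has 24 hexadecimal characters")
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = objectid_created_at("507f1f77bcf86cd799439011")
print(created.isoformat())  # 2012-10-17T21:13:27+00:00
```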

2

I’m going to make some assumptions.

In theory, by using a random UUID you avoid exposing the approximate range of valid IDs, from 0 up to the maximum N found. No one knows which IDs exist or how many there are.

The downside is the larger size of the ID field and, especially, the space taken by the index created for that column once the number of records grows large. There is also the difficulty of working with queries involving UUIDs (only by copy and paste).
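
A rough Python comparison of raw key sizes, as a sketch of the storage cost mentioned above. The integer widths are the usual 32-bit and 64-bit ones; actual index sizes depend on the DBMS and column type, which this does not model.

```python
import uuid

key = uuid.uuid4()

int_bytes = 4                       # typical 32-bit INT
bigint_bytes = 8                    # typical 64-bit BIGINT
uuid_binary_bytes = len(key.bytes)  # size when stored in binary form
uuid_text_chars = len(str(key))     # size when stored as hyphenated text

print(int_bytes, bigint_bytes, uuid_binary_bytes, uuid_text_chars)  # 4 8 16 36
```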

  • I understand, that’s what I said in my own question, but clear up one doubt for me: what kind of problem can arise from exposing the ranges and the numerical sequence?

  • I imagine it is a possible security flaw, since these ranges are usually exposed, for example, in the HTTP request URLs of REST web services (example: DELETE api/produto/3125), but I’m speaking without practical experience.

  • I understand, but I really can’t see this as a flaw or as exposing something, since this type of problem is solved within the application by defining what user X is permitted to do. The Stack Overflow site itself has sequential IDs, and this was never a flaw, let alone an exploitable one. For example, without knowing the name I can access the profiles: https://answall.com/users/1, https://answall.com/users/2, https://answall.com/users/3, https://answall.com/users/10, https://pt.stackoverflow.com/users/11, https://answall.com/users/12 ... https://answall.com/users/20

  • If an attacker with access to a valid token (for example) starts deleting profiles, with UUIDs it is harder to deduce what to delete. But again, I’m making assumptions; I have no practical experience to back this up.

  • If the token was exposed, a UUID is not going to solve that; the flaw is already in having an exposed token. Whoever has access to it may find a panel in the system itself for deleting data, profiles, or anything else, without even needing to know the IDs, since managing everything at the admin level of a system is usually possible. So the flaw is still in the application and not in the numeric ID. Do you agree?

  • I agree. Maybe this is what you are looking for: https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
