Using Group By in SAS Guide

Asked

Viewed 3,334 times

1

I’m having trouble using the command GROUP BY, in the SAS. By definition, SAS only accepts the GROUP BY if used in the SELECT a summary function. So far, ok. Running the Proc below, the result comes out correctly:

proc sql;
    create table tabela_nova as
    select  nr_prpt_vcld,
            (max(ult_acl_atv)) as ult_acl_atv format ddmmyy10.
    from    tabela_base
    group by nr_prpt_vcld
    ;
quit;

nr_prpt_vcld ult_acl_atv
23157 02/01/2014
27175 03/07/2014
29350 17/02/2014

However, adding 2 more fields to the SELECT, (as below) to Proc returns the following result:

proc sql;
    create table tabela_nova as
    select  nr_prpt_vcld,
            nr_cvn_srvc,
            cd_cli_srvc,
            (max(ult_acl_atv)) as ult_acl_atv format ddmmyy10.
    from    tabela_base
    group by nr_prpt_vcld
    ;
quit;

nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv
    23157 140594 504217876 02/01/2014
    23157 147914 510003171 02/01/2014
    23157 153120 511284856 02/01/2014
    27175 140676 203940185 03/07/2014
    27175 141805 931007593 03/07/2014
    27175 150291 510955695 03/07/2014
    29350 137242 508122116 17/02/2014
    29350 145502 30139405 17/02/2014

I tried adding the new fields to GROUP BY, use HAVING, but nothing solved it. I still can’t understand how the GROUP BY in the SAS. Can someone help me?

The expected result is as follows:

nr_prpt_vcld   nr_cvn_srvc       cd_cli_srvc ult_acl_atv 
    23157           153120        511284856  02/01/2014 
    27175           150291        510955695  03/07/2014 
    29350           137242        508122116  17/02/2014 

That is, the values nr_cvn_srvc and cd_cli_srvc that are at the most recent date (ult_acl_atv), grouped by nr_prpt_vcld.

  • It seems that he implicitly added the new fields in the group by, instead of returning an error (which, I believe, would be what a normal BD SQL query would do). By the way, that was the result obtained, right? What would be the expected result?

  • Tip: Edit your question by adding the relevant details, don’t try to post as a comment. If you are having difficulty formatting, select the text and use the "Code Sample" button (which looks like this: { }), and it will be formatted as code (in gray background, with monospaced font).

1 answer

2


TL;DR - (Workaround) use Comments:

select  t1.nr_prpt_vcld                               as nr_prpt_vcld,
        (select t2.nrcvn_srvc
             from tabela_base t2
             where t2.nr_prpt_vcld = t1.nr_prpt_vcld
             order by t2.ult_acl_atv desc
             limit 1
        )                                             as nrcvn_srvc,
        (select t2.cd_cli_srvc
             from tabela_base t2
             where t2.nr_prpt_vcld = t1.nr_prpt_vcld
             order by t2.ult_acl_atv desc
             limit 1
        )                                             as cd_cli_srvc,
        (max(t1.ult_acl_atv))                         as ult_acl_atv
    from    tabela_base t1
    group by t1.nr_prpt_vcld

Example in Sqlfiddle.


I have little familiarity with SAS, but from what I understand, it complies with SQL standards. I will respond in a more didactic way, so that you can identify where you might be missing and what to do to fix:

Overview of SQL queries

Abstractly, one can say that a query performs the following steps, in that order:

  1. Discover the table(s) to be consulted by making the applicable junctions;

    Ex.: from tabela_base

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157           140594        504217876  30/12/2013   bar
    23157           147914        510003171  01/01/2014   bar
    23157           153120        511284856  02/01/2014   bar
    27175           140676        203940185  01/07/2014   bar
    27175           141805        931007593  02/07/2014   bar
    27175           150291        510955695  03/07/2014   bar
    29350           137242        508122116  16/02/2014   bar
    29350           145502        30139405   17/02/2014   bar
    99999           999999        999999999  18/02/2014   baz
    
  2. Filters returned lines as per condition where;

    Ex.: where foo = bar

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157           140594        504217876  31/12/2013   bar
    23157           147914        510003171  01/01/2014   bar
    23157           153120        511284856  02/01/2014   bar
    27175           140676        203940185  01/07/2014   bar
    27175           141805        931007593  02/07/2014   bar
    27175           150291        510955695  03/07/2014   bar
    29350           137242        508122116  16/02/2014   bar
    29350           145502        30139405   17/02/2014   bar
    -----           ------        --------   ----------   baz (X)
    
  3. Group according to the criterion of the clause group by;

    Ex.: group by no_prpt_vcld

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo  tupla de agrupamento
    
    23157           140594        504217876  31/12/2013   bar \
    23157           147914        510003171  01/01/2014   bar  (23157)
    23157           153120        511284856  02/01/2014   bar /
    
    27175           140676        203940185  01/07/2014   bar \
    27175           141805        931007593  02/07/2014   bar  (27175)
    27175           150291        510955695  03/07/2014   bar /
    
    29350           137242        508122116  16/02/2014   bar \(29350)
    29350           145502        30139405   17/02/2014   bar /
    
  4. "Flatten" the groups, transforming each one into a single row, adding columns according to the aggregation functions

    Ex.: (max(ult_acl_atv)) as ult_acl_atv

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157          [Achatado]   [Achatado]    02/01/2014  [Achatado]
    27175          [Achatado]   [Achatado]    03/07/2014  [Achatado]
    29350          [Achatado]   [Achatado]    17/02/2014  [Achatado]
    
  5. Filter again, now the "flattened" lines (not the original ones)

    Ex.: having ult_acl_atv < '06/06/2014' (imprecise syntax)

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv foo
    23157          [Achatado]   [Achatado]    02/01/2014  [Achatado]
    -----          ----------   ----------    03/07/2014  ---------- (X)
    29350          [Achatado]   [Achatado]    17/02/2014  [Achatado]
    
  6. Choose which columns to return; you can only use grouped columns (i.e. that were part of the group by or aggregations of other columns (i.e. that had some value defined after "flattening").

    Ex.: select nr_prpt_vcld, (...) as ult_acl_atv

    nr_prpt_vcld   ult_acl_atv
    23157          02/01/2014
    29350          17/02/2014
    

How this applies to your case

As you can see, you can’t add the columns nrcvn_srvc and cd_cli_srvc because they were "flattened" by taking several lines from the table and turning into one. The most you can do is aggregate them as well, if that’s what you want. Ex.:

select  nr_prpt_vcld,
        (min(nrcvn_srvc)) as nrcvn_srvc,
        (sum(cd_cli_srvc)) as cd_cli_srvc,
        (max(ult_acl_atv)) as ult_acl_atv format ddmmyy10.

    nr_prpt_vcld   nrcvn_srvc   cd_cli_srvc   ult_acl_atv
    23157          140594       1525505903    02/01/2014
    29350          137242       538261521     17/02/2014

Or, add one (or more) of them into the group - but by doing so, the grouping will be different. Ex.:

group by nr_prpt_vcld, nrcvn_srvc

    nr_prpt_vcld   nrcvn_srvc     cd_cli_srvc ult_acl_atv  foo           tupla de agrupamento

    23157           140594        [Achatado]  31/12/2013   [Achatado] => (23157,140594)

    23157           147914        [Achatado]  01/01/2014   [Achatado] => (23157,147914)

    23157           153120        [Achatado]  02/01/2014   [Achatado] => (23157,153120)

    27175           140676        [Achatado]  01/07/2014   [Achatado] => (27175,140676)

    27175           141805        [Achatado]  02/07/2014   [Achatado] => (27175,141805)

    27175           150291        [Achatado]  03/07/2014   [Achatado] => (27175,150291)

    29350           137242        [Achatado]  16/02/2014   [Achatado] => (29350,137242)

    29350           145502        [Achatado]  17/02/2014   [Achatado] => (29350,145502)

etc. If you add other fields in the clause group by, he will group by tuple fields - each different combination of the same will generate a grouping, which will later be flattened and may have its aggregate values.

Normally, a query such as the one you made would generate an error in SQL. But apparently, what SAS did was implicitly add the other fields in the condition of grouping, so that its query was made as if the group by were like this:

group by nr_prpt_vcld, nrcvn_srvc, cd_cli_srvc

So that only lines that had these 3 values identical would be part of the same grouping. If there is no more than one line with these conditions, the result is as if it had not grouped anything, simply returned each line separately...

Workaround

I don’t know if what you want to do is possible within the clause itself group by: obtain the value of a column corresponding to the maximum of another column. A workaround would be to use two Queries, one for each column; in each of them, take the value of the correct column corresponding to the maximum value of the date, for a specific code:

select t2.nrcvn_srvc
    from tabela_base t2
    where t2.nr_prpt_vcld = ### um código fixo ###
    order by t2.ult_acl_atv desc
    limit 1

This is done in the section select - when the grouping has already been done and the line is being set up with the results. Rows are sorted by date (so the first will be the most recent), and the number of results is limited to 1, so that only the largest of them will be returned. You take the field you want and use it as the column value:

select  t1.nr_prpt_vcld       as nr_prpt_vcld,
        (select ...)          as nrcvn_srvc,
        (select ...)          as cd_cli_srvc,
        (max(t1.ult_acl_atv)) as ult_acl_atv

Final result (note that I gave a nickname to the original table):

select  t1.nr_prpt_vcld                               as nr_prpt_vcld,
        (select t2.nrcvn_srvc
             from tabela_base t2
             where t2.nr_prpt_vcld = t1.nr_prpt_vcld
             order by t2.ult_acl_atv desc
             limit 1
        )                                             as nrcvn_srvc,
        (select t2.cd_cli_srvc
             from tabela_base t2
             where t2.nr_prpt_vcld = t1.nr_prpt_vcld
             order by t2.ult_acl_atv desc
             limit 1
        )                                             as cd_cli_srvc,
        (max(t1.ult_acl_atv))                         as ult_acl_atv
    from    tabela_base t1
    group by t1.nr_prpt_vcld

Browser other questions tagged

You are not signed in. Login or sign up in order to post.