"PROCV" (VLOOKUP) function with PySpark

I am a beginner and would like to know whether there is a way to correlate two tables through an index (CAUSE_CODE), like PROCV (VLOOKUP) in Excel, but in PySpark.

%pyspark
machine_A_grouped = machine_A.groupBy("CAUSE_CODE").sum("Time").sort("sum(Time)", ascending=False)
machine_A_grouped.show()

(output)

CAUSE_CODE  sum(Time)
  7041        41730
  7031        28076
  7010        11486
    10         3899
   ...         ...

The cause codes are described in another table, which I loaded into a DataFrame called machine_cause.

machine_cause.show()

(output)

CAUSE_CODE     Desc
  7031        Cause A
  7041        Cause B
  7010        Cause C
    10        Cause D
  ...          ... 

I'd like code that produces:

CAUSE_CODE  sum(Time)     Desc
  7041        41730     Cause B
  7031        28076     Cause A
  7010        11486     Cause C
    10         3899     Cause D
   ...         ...        ...

1 answer


Lucas,

This is called a join, and there are several types: left join, right join, inner join, left outer join, right outer join, etc.

Each type of join has its own purpose; I suggest a brief read on them to understand the differences.

In your case it would be something like:

join_result = machine_A_grouped.join(machine_cause, on=["CAUSE_CODE"], how="left")

Then you can inspect the result with:

join_result.limit(10).show()
