"PROCV" (VLOOKUP) function with PySpark

I am a beginner and would like to know whether there is a way to correlate two tables through an index (CAUSE_CODE), like PROCV (VLOOKUP) in Excel, but in PySpark.

%pyspark
machine_A_grouped = machine_A.groupBy("CAUSE_CODE").sum("Time").sort("sum(Time)", ascending=False)
machine_A_grouped.show()

(output)

CAUSE_CODE  sum(Time)
  7041        41730
  7031        28076
  7010        11486
    10         3899
   ...         ...

The cause codes are described in another table, which I loaded into a DataFrame called machine_cause.

machine_cause.show()

(output)

CAUSE_CODE     Desc
  7031        Cause A
  7041        Cause B
  7010        Cause C
    10        Cause D
  ...          ... 

I'd like code that produces:

CAUSE_CODE  sum(Time)     Desc
  7041        41730     Cause B
  7031        28076     Cause A
  7010        11486     Cause C
    10         3899     Cause D
   ...         ...        ...

1 answer


Lucas,

This is called a join, and there are several types: left join, right join, inner join, left outer join, right outer join, etc.

Each type of join has its own purpose; I suggest a brief read on them to understand the differences.

In your case it would be something like:

join_result = machine_A_grouped.join(machine_cause, on=["CAUSE_CODE"], how="left")

Then you can inspect the result with:

join_result.limit(10).show()
