I am a beginner and would like to know whether there is a way to match two tables on a key column (CAUSE_CODE), like PROCV (VLOOKUP in English-language Excel), but in PySpark.
%pyspark
machine_A_grouped = machine_A.groupBy("CAUSE_CODE").sum("Time").sort("sum(Time)",ascending = False)
machine_A_grouped.show()
(output)
CAUSE_CODE sum(Time)
7041 41730
7031 28076
7010 11486
10 3899
... ...
The cause codes are described in another table, which I loaded into a DataFrame called machine_cause.
machine_cause.show()
(output)
CAUSE_CODE Desc
7031 Cause A
7041 Cause B
7010 Cause C
10 Cause D
... ...
I would like code that produces the following result:
CAUSE_CODE sum(Time) Desc
7041 41730 Cause B
7031 28076 Cause A
7010 11486 Cause C
10 3899 Cause D
... ... ...
Lucas, good evening! Could you make the datasets available? Cheers!
– lmonferrari