How to create a UDF using two columns and an if expression

Question

Asked 5 years, 3 months ago

I'm working on a job using PySpark and SQL, and I have this function:

selection = F.udf(lambda x: cumulative_sum._1 if 100*random.random()<= x, FloatType())
cumulative_sum.withColumn('Selection2', selection(cumulative_sum.cumSum)

My table is called cumulative_sum. My goal is to check whether each value in the cumSum column is less than a randomly generated number; if it is, I want to put the corresponding value of column _1 in the new column Selection2. However, when I run this code, I get this error:

SyntaxError: invalid syntax

How can I fix this?

I'm sure you can do that without a UDF, which would be more efficient. Can you add an example input dataset and one with the result you want?

– Dee

2021/02/04 at 12:30

2 answers

by zeh • **101** points · Answer 1 · 2020-08-31T09:22:57+00:00

The conditional expression needs its else branch on the same line as the if. It should look like this:

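# Assumes: import random; from pyspark.sql import functions as F; from pyspark.sql.types import FloatType
# Both columns are passed as arguments: a UDF body receives plain row values, not Column objects (assumes _1 is numeric and non-null).
selection = F.udf(lambda v, x: float(v) if 100*random.random() <= x else 0.0, FloatType())
cumulative_sum.withColumn('Selection2', selection(cumulative_sum._1, cumulative_sum.cumSum))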
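
The rule behind this fix is plain Python rather than anything Spark-specific: a conditional expression always needs an else branch. Here is a minimal sketch of that rule. The helper names `pick` and `is_valid_expression` are made up for illustration, and the random draw is passed in explicitly so the result is predictable:

```python
def pick(value, threshold, draw):
    """Return value as a float when draw <= threshold, otherwise 0.0."""
    return float(value) if draw <= threshold else 0.0


def is_valid_expression(source):
    """Report whether Python can parse the given source as an expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(pick(3, threshold=50.0, draw=20.0))      # 3.0
print(pick(3, threshold=50.0, draw=80.0))      # 0.0
print(is_valid_expression("a if b"))           # False: the else branch is missing
print(is_valid_expression("a if b else 0.0"))  # True
```

Inside a lambda the same expression form applies, which is why the version without else is rejected before Spark ever runs.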

by Dee • **101** points · Answer 2 · 2021-02-04T12:39:06+00:00

I'm not sure I understood the question correctly; I haven't spoken Portuguese for a long time. Is this what you want to do?

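from pyspark.sql import functions as F, Window

# lag() is a window function and needs an ordered window; ordering by cumSum is an assumption.
w = Window.orderBy('cumSum')
df.withColumn('selection2',
              F.when(F.lag('cumSum').over(w) < F.col('cumSum'), F.col('cumSum'))
              .otherwise(0.0)  # or .otherwise(None)
             )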