Most voted "spark" questions
10 questions
- 5 votes, 2 answers, 675 views
  What is RDD (Resilient Distributed Dataset)?
  I’m studying Spark in Python and the acronym RDD keeps appearing, but I can’t understand what this name really refers to. So I’d like to know what a Resilient Distributed Dataset is…
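  For context, a minimal PySpark sketch of what an RDD is in practice; the data and variable names below are illustrative, not taken from the question:

      # Minimal sketch, assuming a local PySpark installation; values are made up.
      from pyspark import SparkContext

      sc = SparkContext.getOrCreate()

      # An RDD is an immutable collection partitioned across the cluster's nodes;
      # "resilient" refers to recomputing lost partitions from lineage on failure.
      numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

      # Transformations such as map() are lazy: nothing executes yet.
      squares = numbers.map(lambda x: x * x)

      # collect() is an action: it triggers the distributed computation.
      print(squares.collect())  # [1, 4, 9, 16, 25]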
- 2 votes, 0 answers, 22 views
  How to use AWS Glue to obtain 'batch' data from a semi-structured database?
  Is it possible to use AWS Glue to retrieve data incrementally from a semi-structured SQL database? The on-premise data that I intend to export to the AWS cloud is in a single table, with a primary key,…
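  A rough sketch of what an incremental Glue job script can look like, assuming the source table is already crawled into the Glue Data Catalog and job bookmarks are enabled; the database, table, and bucket names are placeholders:

      # Hedged sketch: assumes the source is registered in the Glue Data Catalog
      # and that job bookmarks are enabled in the job configuration.
      import sys
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.utils import getResolvedOptions
      from pyspark.context import SparkContext

      args = getResolvedOptions(sys.argv, ["JOB_NAME"])
      glue_context = GlueContext(SparkContext.getOrCreate())
      job = Job(glue_context)
      job.init(args["JOB_NAME"], args)

      # transformation_ctx lets the job bookmark track which rows were already read,
      # so each run only picks up new data (incremental "batch" extraction).
      source = glue_context.create_dynamic_frame.from_catalog(
          database="onpremise_db",        # placeholder catalog database
          table_name="source_table",      # placeholder table
          transformation_ctx="source",
      )

      glue_context.write_dynamic_frame.from_options(
          frame=source,
          connection_type="s3",
          connection_options={"path": "s3://my-bucket/landing/"},  # placeholder bucket
          format="parquet",
          transformation_ctx="sink",
      )

      job.commit()  # commits the bookmark state for the next run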
- 1 vote, 0 answers, 38 views
  MVC routes and controllers architecture with multiple versions using Spark
  I am implementing a REST API in Java using the Spark framework. I need to split the routes into two or more versions: v1 and v2. I know that Spark provides path(), so I believe it is not difficult.…
- 1 vote, 0 answers, 81 views
  Convert string to date in SparkR - Databricks
  I’m preparing a dataframe for the Prophet algorithm at work, and in RStudio I used the following code to convert the Data column, which is stored as string, to Date, because the algorithm needs this…
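  For comparison, the PySpark form of this conversion (SparkR exposes an analogous to_date function); the column name "Data" and the date format are assumptions drawn from the excerpt:

      # Hedged PySpark sketch; SparkR has an equivalent to_date() function.
      # Column name "Data" and format "dd/MM/yyyy" are assumptions.
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("01/03/2021",), ("15/03/2021",)], ["Data"])

      # to_date parses the string column into a DateType column usable by Prophet.
      df = df.withColumn("ds", F.to_date(F.col("Data"), "dd/MM/yyyy"))
      df.printSchema()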
- 1 vote, 0 answers, 116 views
  Python Kafka, Spark Streaming, MongoDB connection
  Hello, I am creating a streaming job, but I get no error message and nothing is recorded in Mongo. I have tried several types of connection. On the command line, the Producer and Consumer can produce…
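  One common shape for such a pipeline is sketched below with Structured Streaming and the MongoDB Spark connector; the broker, topic, URI, and database/collection names are placeholders, and the "mongodb" format assumes connector v10+ is on the classpath:

      # Hedged sketch: Kafka -> Spark Structured Streaming -> MongoDB.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("kafka-to-mongo").getOrCreate()

      events = (
          spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      )

      def write_to_mongo(batch_df, batch_id):
          # Each micro-batch is written with the regular batch writer.
          (batch_df.write
           .format("mongodb")
           .option("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
           .option("spark.mongodb.write.database", "streaming")
           .option("spark.mongodb.write.collection", "events")
           .mode("append")
           .save())

      query = (
          events.writeStream
          .foreachBatch(write_to_mongo)
          .option("checkpointLocation", "/tmp/kafka-mongo-checkpoint")
          .start()
      )
      query.awaitTermination()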
- 0 votes, 1 answer, 82 views
  Cassandra + Spark + R connection
  How do I connect Cassandra to Spark? Cassandra > Spark > R. I’ve already been able to connect R to Spark; now I need to bring the data that is stored in Cassandra into Spark and finally analyze…
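  The Cassandra-to-Spark leg is usually handled by the DataStax spark-cassandra-connector; a PySpark sketch of the read side follows (keyspace, table, and host are placeholders), and the same data-source string can also be used from R, e.g. via sparklyr's spark_read_source, once the connector package is on the session:

      # Hedged sketch: requires the spark-cassandra-connector package, e.g. launched with
      #   pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .appName("cassandra-read")
          .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
          .getOrCreate()
      )

      # Reads a Cassandra table into a Spark DataFrame via the connector's data source.
      df = (
          spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")  # placeholders
          .load()
      )
      df.show(5)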
- 0 votes, 2 answers, 22 views
  How to create a UDF using two columns and an if expression
  I’m doing a job using PySpark and SQL and have this function: selection = F.udf(lambda x: cumulative_sum._1 if 100*random.random()<= x, FloatType()) cumulative_sum.withColumn('Selection2',…
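  A hedged sketch of how a two-column UDF with a conditional can be written; the column names "_1" and "threshold" are assumptions about the excerpt's intent:

      # Hedged sketch: a UDF that receives two column values and applies an if/else.
      import random
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F
      from pyspark.sql.types import FloatType

      spark = SparkSession.builder.getOrCreate()
      cumulative_sum = spark.createDataFrame(
          [(10.0, 80.0), (20.0, 30.0)], ["_1", "threshold"]
      )

      # The lambda takes both column values; the if needs an else branch (here None).
      selection = F.udf(
          lambda value, threshold: value if 100 * random.random() <= threshold else None,
          FloatType(),
      )

      result = cumulative_sum.withColumn(
          "Selection2", selection(F.col("_1"), F.col("threshold"))
      )
      result.show()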
- 0 votes, 1 answer, 67 views
  "PROCV" function with PySpark
  I am a beginner and would like to know if there is any code to correlate two spreadsheets by an index (CAUSE_CODE), like a PROCV (VLOOKUP) in Excel, but in PySpark. %pyspark machine_A_grouped =…
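  The usual PySpark equivalent of a PROCV/VLOOKUP is a join on the key column; a minimal sketch follows (the dataframes and extra columns are illustrative, keeping CAUSE_CODE from the excerpt):

      # Hedged sketch: a left join plays the role of Excel's PROCV/VLOOKUP.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      machine_a = spark.createDataFrame(
          [("C01", 5), ("C02", 3)], ["CAUSE_CODE", "events"]
      )
      lookup = spark.createDataFrame(
          [("C01", "Overheat"), ("C03", "Vibration")], ["CAUSE_CODE", "description"]
      )

      # The left join keeps every row of machine_a and brings in the matching
      # description, leaving null where there is no match (like VLOOKUP's #N/A).
      joined = machine_a.join(lookup, on="CAUSE_CODE", how="left")
      joined.show()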
- 0 votes, 0 answers, 15 views
  My variable is not in the GROUP BY - error
  I am running a query but it is raising an error. In Databricks the following message appears: "AnalysisException: Expression 'Aux1.nom_canal_escritorio' is neither present in the group by, nor is…
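  This AnalysisException generally means a selected column is neither grouped nor aggregated; a small sketch of the two usual fixes (the table and column names below are placeholders, not the asker's query):

      # Hedged sketch of the two usual fixes for this AnalysisException.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame(
          [("escritorio", "A", 10), ("digital", "A", 20)],
          ["nom_canal_escritorio", "segment", "value"],
      )
      df.createOrReplaceTempView("t")

      # This would raise: "AnalysisException: expression 't.nom_canal_escritorio' is
      # neither present in the group by, nor is it an aggregate function..."
      # spark.sql("SELECT nom_canal_escritorio, SUM(value) FROM t GROUP BY segment")

      # Fix 1: add the column to the GROUP BY.
      spark.sql("SELECT segment, nom_canal_escritorio, SUM(value) FROM t "
                "GROUP BY segment, nom_canal_escritorio").show()

      # Fix 2: keep the original grouping and wrap the column in an aggregate.
      spark.sql("SELECT segment, FIRST(nom_canal_escritorio), SUM(value) FROM t "
                "GROUP BY segment").show()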
- 0 votes, 0 answers, 27 views
  YARN AM is killing the job when a remote job is submitted to the cluster
  I have been searching for a lead on a problem without many clues. Any help pointing me in the right direction would be huge and of great value. I’m running a Hadoop cluster using…