Most voted "spark" questions
10 questions
- 5 votes, 2 answers, 675 views
  What is RDD (Resilient Distributed Dataset)?
  I’m studying Spark in Python and the acronym RDD keeps appearing, but I can’t understand what this name really refers to. So I’d like to know what a Resilient Distributed Dataset is…
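  For context, a minimal PySpark sketch of what an RDD is in practice; the data and variable names below are illustrative, not taken from the question:

      # Minimal sketch, assuming a local PySpark installation; values are made up.
      from pyspark import SparkContext

      sc = SparkContext.getOrCreate()

      # An RDD is an immutable collection partitioned across the cluster's nodes;
      # "resilient" refers to recomputing lost partitions from lineage on failure.
      numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

      # Transformations such as map() are lazy: nothing executes yet.
      squares = numbers.map(lambda x: x * x)

      # collect() is an action: it triggers the distributed computation.
      print(squares.collect())  # [1, 4, 9, 16, 25]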
- 2 votes, 0 answers, 22 views
  How to use AWS Glue to obtain 'batch' data from a semi-structured database?
  Is it possible to use AWS Glue to retrieve data incrementally from a semi-structured SQL database? The on-premise data that I intend to export to the AWS cloud is in a single table, with a primary key,…
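  A rough sketch of what an incremental Glue job script can look like, assuming the source table is already crawled into the Glue Data Catalog and job bookmarks are enabled; the database, table, and bucket names are placeholders:

      # Hedged sketch: assumes the source is registered in the Glue Data Catalog
      # and that job bookmarks are enabled in the job configuration.
      import sys
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.utils import getResolvedOptions
      from pyspark.context import SparkContext

      args = getResolvedOptions(sys.argv, ["JOB_NAME"])
      glue_context = GlueContext(SparkContext.getOrCreate())
      job = Job(glue_context)
      job.init(args["JOB_NAME"], args)

      # transformation_ctx lets the job bookmark track which rows were already read,
      # so each run only picks up new data (incremental "batch" extraction).
      source = glue_context.create_dynamic_frame.from_catalog(
          database="onpremise_db",        # placeholder catalog database
          table_name="source_table",      # placeholder table
          transformation_ctx="source",
      )

      glue_context.write_dynamic_frame.from_options(
          frame=source,
          connection_type="s3",
          connection_options={"path": "s3://my-bucket/landing/"},  # placeholder bucket
          format="parquet",
          transformation_ctx="sink",
      )

      job.commit()  # commits the bookmark state for the next run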
- 1 vote, 0 answers, 38 views
  MVC routes and controllers architecture with multiple versions using Spark
  I am implementing a REST API in Java using the Spark framework. I need to split the routes into two or more versions: v1 and v2. I know that Spark provides path(), so I believe it is not difficult.…
- 1 vote, 0 answers, 81 views
  Convert string to date in SparkR - Databricks
  I’m preparing a dataframe for the Prophet algorithm at work, and in RStudio I used the following code to convert the Data column, which is stored as string, to Date, because the algorithm needs this…
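  For comparison, the PySpark form of this conversion (SparkR exposes an analogous to_date function); the column name "Data" and the date format are assumptions drawn from the excerpt:

      # Hedged PySpark sketch; SparkR has an equivalent to_date() function.
      # Column name "Data" and format "dd/MM/yyyy" are assumptions.
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("01/03/2021",), ("15/03/2021",)], ["Data"])

      # to_date parses the string column into a DateType column usable by Prophet.
      df = df.withColumn("ds", F.to_date(F.col("Data"), "dd/MM/yyyy"))
      df.printSchema()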
- 1 vote, 0 answers, 116 views
  Python Kafka, Spark Streaming, MongoDB connection
  Hello, I am creating a streaming job, but I get no error message and nothing is recorded in Mongo. I have tried several types of connection. On the command line, the Producer and Consumer can produce…
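  One common shape for such a pipeline is sketched below with Structured Streaming and the MongoDB Spark connector; the broker, topic, URI, and database/collection names are placeholders, and the "mongodb" format assumes connector v10+ is on the classpath:

      # Hedged sketch: Kafka -> Spark Structured Streaming -> MongoDB.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("kafka-to-mongo").getOrCreate()

      events = (
          spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      )

      def write_to_mongo(batch_df, batch_id):
          # Each micro-batch is written with the regular batch writer.
          (batch_df.write
           .format("mongodb")
           .option("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
           .option("spark.mongodb.write.database", "streaming")
           .option("spark.mongodb.write.collection", "events")
           .mode("append")
           .save())

      query = (
          events.writeStream
          .foreachBatch(write_to_mongo)
          .option("checkpointLocation", "/tmp/kafka-mongo-checkpoint")
          .start()
      )
      query.awaitTermination()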
- 0 votes, 1 answer, 82 views
  Cassandra + Spark + R connection
  How do I connect Cassandra to Spark? Cassandra > Spark > R. I’ve already been able to connect R to Spark; now I need to bring the data that is stored in Cassandra into Spark and finally analyze…
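  The Cassandra-to-Spark leg is usually handled by the DataStax spark-cassandra-connector; a PySpark sketch of the read side follows (keyspace, table, and host are placeholders), and the same data-source string can also be used from R, e.g. via sparklyr's spark_read_source, once the connector package is on the session:

      # Hedged sketch: requires the spark-cassandra-connector package, e.g. launched with
      #   pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .appName("cassandra-read")
          .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
          .getOrCreate()
      )

      # Reads a Cassandra table into a Spark DataFrame via the connector's data source.
      df = (
          spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")  # placeholders
          .load()
      )
      df.show(5)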
- 0 votes, 2 answers, 22 views
  How to create a UDF using two columns and an if expression
  I’m doing a job using PySpark and SQL and have this function: selection = F.udf(lambda x: cumulative_sum._1 if 100*random.random()<= x, FloatType()) cumulative_sum.withColumn('Selection2',…
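  A hedged sketch of how a two-column UDF with a conditional can be written; the column names "_1" and "threshold" are assumptions about the excerpt's intent:

      # Hedged sketch: a UDF that receives two column values and applies an if/else.
      import random
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F
      from pyspark.sql.types import FloatType

      spark = SparkSession.builder.getOrCreate()
      cumulative_sum = spark.createDataFrame(
          [(10.0, 80.0), (20.0, 30.0)], ["_1", "threshold"]
      )

      # The lambda takes both column values; the if needs an else branch (here None).
      selection = F.udf(
          lambda value, threshold: value if 100 * random.random() <= threshold else None,
          FloatType(),
      )

      result = cumulative_sum.withColumn(
          "Selection2", selection(F.col("_1"), F.col("threshold"))
      )
      result.show()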
- 0 votes, 1 answer, 67 views
  "PROCV" function with PySpark
  I am a beginner and would like to know if there is any code to correlate two spreadsheets by an index (CAUSE_CODE), like a PROCV (VLOOKUP) in Excel, but in PySpark. %pyspark machine_A_grouped =…
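  The usual PySpark equivalent of a PROCV/VLOOKUP is a join on the key column; a minimal sketch follows (the dataframes and extra columns are illustrative, keeping CAUSE_CODE from the excerpt):

      # Hedged sketch: a left join plays the role of Excel's PROCV/VLOOKUP.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      machine_a = spark.createDataFrame(
          [("C01", 5), ("C02", 3)], ["CAUSE_CODE", "events"]
      )
      lookup = spark.createDataFrame(
          [("C01", "Overheat"), ("C03", "Vibration")], ["CAUSE_CODE", "description"]
      )

      # The left join keeps every row of machine_a and brings in the matching
      # description, leaving null where there is no match (like VLOOKUP's #N/A).
      joined = machine_a.join(lookup, on="CAUSE_CODE", how="left")
      joined.show()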
- 0 votes, 0 answers, 15 views
  My variable is not in the GROUP BY - error
  I am running a query but it is raising an error. In Databricks the following message appears: "AnalysisException: Expression 'Aux1.nom_canal_escritorio' is neither present in the group by, nor is…
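  This AnalysisException generally means a selected column is neither grouped nor aggregated; a small sketch of the two usual fixes (the table and column names below are placeholders, not the asker's query):

      # Hedged sketch of the two usual fixes for this AnalysisException.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame(
          [("escritorio", "A", 10), ("digital", "A", 20)],
          ["nom_canal_escritorio", "segment", "value"],
      )
      df.createOrReplaceTempView("t")

      # This would raise: "AnalysisException: expression 't.nom_canal_escritorio' is
      # neither present in the group by, nor is it an aggregate function..."
      # spark.sql("SELECT nom_canal_escritorio, SUM(value) FROM t GROUP BY segment")

      # Fix 1: add the column to the GROUP BY.
      spark.sql("SELECT segment, nom_canal_escritorio, SUM(value) FROM t "
                "GROUP BY segment, nom_canal_escritorio").show()

      # Fix 2: keep the original grouping and wrap the column in an aggregate.
      spark.sql("SELECT segment, FIRST(nom_canal_escritorio), SUM(value) FROM t "
                "GROUP BY segment").show()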
- 0 votes, 0 answers, 27 views
  YARN AM is killing the job when a remote job is submitted to the cluster
  I have been searching for a lead on a problem without many clues. Any help pointing me in the right direction would be huge and of great value. I’m running a Hadoop cluster using…