Ian Santillan

Cloud & AI Solution Engineer @ Microsoft

Jul 12, 2021

Spark: Count number of duplicate rows

To count the number of duplicate rows in a PySpark DataFrame, groupBy() all the columns and count(), then select the sum of the counts for the groups where the count is greater than 1:

import pyspark.sql.functions as funcs
df.groupBy(df.col...

ievsantillan.hashnode.dev · 1 min read

#pyspark #data-analysis #data-engineering
