Spark: Count number of duplicate rows
To count the number of duplicate rows in a PySpark DataFrame, groupBy() all the columns and count(), then select the sum of the counts for the groups where the count is greater than 1:
import pyspark.sql.functions as funcs
df.groupBy(df.col...
ievsantillan.hashnode.dev · 1 min read