pyspark.sql.DataFrame.distinct - PySpark 4.1.0 documentation

DataFrame.distinct()

Returns a new DataFrame containing the distinct rows in this DataFrame.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Returns

DataFrame

DataFrame with distinct records.

Examples

Remove duplicate rows from a DataFrame

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (23, "Alice")], ["age", "name"])
>>> df.distinct().show()
+---+-----+
|age| name|
+---+-----+
| 14|  Tom|
| 23|Alice|
+---+-----+

Count the number of distinct rows in a DataFrame

>>> df.distinct().count()
2

Get distinct rows from a DataFrame with multiple columns

>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", "M")],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+

Get distinct values from a specific column in a DataFrame

>>> df.select("name").distinct().show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
+-----+

Count the number of distinct values in a specific column

>>> df.select("name").distinct().count()
2

Get distinct values from multiple columns in a DataFrame

>>> df.select("name", "gender").distinct().show()
+-----+------+
| name|gender|
+-----+------+
|  Tom|     M|
|Alice|     F|
+-----+------+

Get distinct rows from a DataFrame with null values

>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", None)],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
| 14|  Tom|  NULL|
+---+-----+------+

Get distinct rows from a DataFrame, excluding rows with a null gender

>>> df.distinct().filter(df.gender.isNotNull()).show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+