pyspark.sql.DataFrame.freqItems — PySpark 4.1.0 documentation
DataFrame.freqItems(cols, support=None)
Finding frequent items for columns, possibly with false positives, using the frequent element count algorithm described by Karp, Schenker, and Papadimitriou (https://doi.org/10.1145/762471.762473). DataFrame.freqItems() and DataFrameStatFunctions.freqItems() are aliases.
New in version 1.4.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
cols : list or tuple
Names of the columns to calculate frequent items for as a list or tuple of strings.
support : float, optional
The minimum frequency for an item to be considered ‘frequent’. Default is 1% (0.01). The support must be greater than 1e-4.
Returns
DataFrame
A DataFrame with frequent items for the given columns.
Notes
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
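To see where the false positives come from, here is a minimal pure-Python sketch of the one-pass, bounded-memory counting idea behind the Karp, Schenker, and Papadimitriou algorithm. It is illustrative only, not Spark's implementation, and the helper name freq_candidates is made up for this sketch.

def freq_candidates(items, support=0.01):
    """Return a superset of the items whose frequency exceeds support."""
    k = int(1.0 / support)  # keep at most k - 1 counters at any time
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Counters are full: decrement everything and drop zeros.
            # Each such step discards k occurrences (k - 1 counted ones
            # plus the incoming item), so it happens at most n / k times;
            # any item occurring more than n / k times therefore survives,
            # while some rarer items may survive too: the false positives.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

print(freq_candidates([1, 1, 3, 4, 4], support=0.3))  # {1, 4}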
Examples
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
>>> df = df.freqItems(["c1", "c2"])
>>> df.select([sf.sort_array(c).alias(c) for c in df.columns]).show()
+------------+------------+
|c1_freqItems|c2_freqItems|
+------------+------------+
|   [1, 3, 4]| [8, 10, 11]|
+------------+------------+
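The optional support argument tightens or loosens the threshold. A minimal sketch reusing the data above; no output is shown because the returned array is approximate and may include false positives depending on how the data is partitioned:

>>> df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
>>> # Look only for c1 values covering roughly 40% or more of the rows;
>>> # the result is a single-row DataFrame with a c1_freqItems array column.
>>> df.freqItems(["c1"], support=0.4).show()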