pipeasy_spark package¶
Submodules¶
pipeasy_spark.convenience module¶
pipeasy_spark.convenience.build_default_pipeline(dataframe, exclude_columns=())[source]¶

Build a simple transformation pipeline (untrained) for the given dataframe.

By default, numeric columns are processed with a StandardScaler and string columns with a StringIndexer followed by a OneHotEncoderEstimator.

Parameters:
- dataframe (pyspark.sql.DataFrame): only the schema of the dataframe is used, not the actual data.
- exclude_columns (list of str): names of columns to which no transformation should be applied.

Returns:
- pipeline: a pyspark.ml.Pipeline instance (untrained).
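A minimal usage sketch (df stands for any Spark dataframe, and the 'Survived' column is a placeholder borrowed from the examples below):

>>> from pipeasy_spark.convenience import build_default_pipeline
>>> # Only the schema of df is inspected; 'Survived' passes through unchanged.
>>> pipeline = build_default_pipeline(df, exclude_columns=['Survived'])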
pipeasy_spark.convenience.build_pipeline_by_dtypes(dataframe, exclude_columns=(), string_transformers=(), numeric_transformers=())[source]¶

Build a simple transformation pipeline (untrained) for the given dataframe, applying one sequence of transformers to all string columns and another to all numeric columns.

Parameters:
- dataframe (pyspark.sql.DataFrame): only the schema of the dataframe is used, not the actual data.
- exclude_columns (list of str): names of columns to which no transformation should be applied.
- string_transformers (list of transformer instances): the successive transformations applied to string columns. Each element is an instance of a pyspark.ml.feature transformer class.
- numeric_transformers (list of transformer instances): the successive transformations applied to numeric columns. Each element is an instance of a pyspark.ml.feature transformer class.

Returns:
- pipeline: a pyspark.ml.Pipeline instance (untrained).
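For instance (a sketch with a placeholder dataframe df; note that numeric columns typically need a VectorAssembler before a StandardScaler, since the latter only accepts vector input):

>>> from pyspark.ml.feature import (
...     OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler)
>>> from pipeasy_spark.convenience import build_pipeline_by_dtypes
>>> pipeline = build_pipeline_by_dtypes(
...     df,
...     exclude_columns=['Survived'],
...     # applied in order to every string column
...     string_transformers=[StringIndexer(), OneHotEncoderEstimator()],
...     # numeric columns are assembled into vectors, then scaled
...     numeric_transformers=[VectorAssembler(), StandardScaler()])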
pipeasy_spark.core module¶
pipeasy_spark.core.build_pipeline(column_transformers)[source]¶

Create a dataframe transformation pipeline.

The created pipeline can be used to apply successive transformations on a Spark dataframe. The transformations are intended to be applied per column.

>>> df = titanic.select('Survived', 'Sex', 'Age').dropna()
>>> df.show(2)
+--------+------+----+
|Survived|   Sex| Age|
+--------+------+----+
|       0|  male|22.0|
|       1|female|38.0|
+--------+------+----+
>>> pipeline = build_pipeline({
...     # 'Survived': this variable is not modified; it can also be
...     # omitted from the dict
...     'Survived': [],
...     'Sex': [StringIndexer(), OneHotEncoderEstimator(dropLast=False)],
...     # 'Age': a VectorAssembler must be applied before the StandardScaler
...     # as the latter only accepts vectors as input.
...     'Age': [VectorAssembler(), StandardScaler()]
... })
>>> trained_pipeline = pipeline.fit(df)
>>> trained_pipeline.transform(df).show(2)
+--------+-------------+--------------------+
|Survived|          Sex|                 Age|
+--------+-------------+--------------------+
|       0|(2,[0],[1.0])|[1.5054181442954726]|
|       1|(2,[1],[1.0])| [2.600267703783089]|
+--------+-------------+--------------------+

Parameters:
- column_transformers (dict of str -> list): each key (str) is a column name; each value (list) holds transformer instances (typically instances of pyspark.ml.feature transformers).

Returns:
- pipeline: a pyspark.ml.Pipeline instance.
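To run the example, the transformer classes must be imported from pyspark.ml.feature (the titanic dataframe is assumed to be loaded already):

>>> from pyspark.ml.feature import (
...     OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler)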
pipeasy_spark.transformers module¶
class pipeasy_spark.transformers.ColumnDropper(inputCols=None)[source]¶

Bases: pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCols

Transformer to drop several columns from a dataset.
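A minimal sketch of how it might be used (the column names and the df dataframe are placeholders; as a standard pyspark Transformer it returns a new dataframe):

>>> from pipeasy_spark.transformers import ColumnDropper
>>> dropper = ColumnDropper(inputCols=['Cabin', 'Ticket'])
>>> df_reduced = dropper.transform(df)  # df minus the two listed columns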
Module contents¶
Top-level package for pipeasy-spark.
The pipeasy-spark package provides a set of convenience functions that make it easier to map each column of a Spark dataframe (or subsets of columns) to user-specified transformations.
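For example (a sketch with a placeholder dataframe df and a placeholder target column, using the top-level functions documented below):

>>> import pipeasy_spark
>>> pipeline = pipeasy_spark.build_default_pipeline(df, exclude_columns=['Survived'])
>>> trained_pipeline = pipeline.fit(df)
>>> trained_pipeline.transform(df).show(2)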
pipeasy_spark.build_pipeline(column_transformers)[source]¶

Create a dataframe transformation pipeline.

The created pipeline can be used to apply successive transformations on a Spark dataframe. The transformations are intended to be applied per column.

>>> df = titanic.select('Survived', 'Sex', 'Age').dropna()
>>> df.show(2)
+--------+------+----+
|Survived|   Sex| Age|
+--------+------+----+
|       0|  male|22.0|
|       1|female|38.0|
+--------+------+----+
>>> pipeline = build_pipeline({
...     # 'Survived': this variable is not modified; it can also be
...     # omitted from the dict
...     'Survived': [],
...     'Sex': [StringIndexer(), OneHotEncoderEstimator(dropLast=False)],
...     # 'Age': a VectorAssembler must be applied before the StandardScaler
...     # as the latter only accepts vectors as input.
...     'Age': [VectorAssembler(), StandardScaler()]
... })
>>> trained_pipeline = pipeline.fit(df)
>>> trained_pipeline.transform(df).show(2)
+--------+-------------+--------------------+
|Survived|          Sex|                 Age|
+--------+-------------+--------------------+
|       0|(2,[0],[1.0])|[1.5054181442954726]|
|       1|(2,[1],[1.0])| [2.600267703783089]|
+--------+-------------+--------------------+

Parameters:
- column_transformers (dict of str -> list): each key (str) is a column name; each value (list) holds transformer instances (typically instances of pyspark.ml.feature transformers).

Returns:
- pipeline: a pyspark.ml.Pipeline instance.
pipeasy_spark.build_pipeline_by_dtypes(dataframe, exclude_columns=(), string_transformers=(), numeric_transformers=())[source]¶

Build a simple transformation pipeline (untrained) for the given dataframe, applying one sequence of transformers to all string columns and another to all numeric columns.

Parameters:
- dataframe (pyspark.sql.DataFrame): only the schema of the dataframe is used, not the actual data.
- exclude_columns (list of str): names of columns to which no transformation should be applied.
- string_transformers (list of transformer instances): the successive transformations applied to string columns. Each element is an instance of a pyspark.ml.feature transformer class.
- numeric_transformers (list of transformer instances): the successive transformations applied to numeric columns. Each element is an instance of a pyspark.ml.feature transformer class.

Returns:
- pipeline: a pyspark.ml.Pipeline instance (untrained).
pipeasy_spark.build_default_pipeline(dataframe, exclude_columns=())[source]¶

Build a simple transformation pipeline (untrained) for the given dataframe.

By default, numeric columns are processed with a StandardScaler and string columns with a StringIndexer followed by a OneHotEncoderEstimator.

Parameters:
- dataframe (pyspark.sql.DataFrame): only the schema of the dataframe is used, not the actual data.
- exclude_columns (list of str): names of columns to which no transformation should be applied.

Returns:
- pipeline: a pyspark.ml.Pipeline instance (untrained).