fairdo.preprocessing package#
The fairdo.preprocessing package provides methods for pre-processing datasets to achieve fairness.
The base class Preprocessing defines the required methods for a pre-processor,
which are fit and transform. The fit method is used to assign the dataset to an internal variable,
and the transform method is used to apply the pre-processing method to the dataset. The transform method
returns a pre-processed version of the dataset that can be used for training a machine learning model.
The pre-processed dataset is considered fair with respect to the discrimination measure that is given
when initializing the pre-processor. The discrimination measure is a metric from the fairdo.metrics module.
The DefaultPreprocessing object is the default pre-processor. Internally, it uses a genetic algorithm to optimize the fairness of a dataset. It comes with default parameters that are intended to work well for most use cases. The user can also set the population size and number of generations manually when initializing the DefaultPreprocessing object.
The HeuristicWrapper is a pre-processor that is used with a given combinatorial optimization solver to optimize the fairness of a dataset. The user can use any heuristic solver from the fairdo.optimize module. This requires manually setting the parameters of the heuristic solver and is recommended for advanced users.
The MetricOptimizer is a pre-processor that is used with a given optimization algorithm to optimize the fairness of a dataset. This pre-processor is deprecated. Use DefaultPreprocessing instead.
Other pre-processors are implemented in the base submodule. These include Random, which randomly removes data points from the dataset; OriginalData, which returns the original dataset; and Unawareness, which removes all columns of protected attributes from the dataset. These pre-processors are used for comparison purposes.
Submodules#
fairdo.preprocessing.base module#
- class fairdo.preprocessing.base.OriginalData(**kwargs)[source]#
Bases: Preprocessing
This class is used to return the original dataset.
- class fairdo.preprocessing.base.Preprocessing(protected_attribute, label)[source]#
Bases: object
Base class for all preprocessing methods.
- Parameters:
protected_attribute (str) – name of the protected attribute column
label (str) – name of the label column to be predicted
dataset (pandas DataFrame) – original dataset
transformed_data (pandas DataFrame) – dataset after transformation/pre-processing
- fit(dataset)[source]#
Copies the dataset to the class and checks if the dataset is valid, i.e., all columns are numeric.
- Parameters:
dataset (pandas DataFrame) – The dataset to be preprocessed.
- Return type:
self
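The fit contract described above (copy the dataset, validate that it is numeric, return self) can be sketched as follows. This is a minimal, dependency-free illustration and not fairdo's implementation: ToyPreprocessing is a hypothetical class, and a list of dicts stands in for a pandas DataFrame.

```python
import copy
from numbers import Number


class ToyPreprocessing:
    """Hypothetical stand-in for a fit/transform pre-processor."""

    def __init__(self, protected_attribute, label):
        self.protected_attribute = protected_attribute
        self.label = label
        self.dataset = None
        self.transformed_data = None

    def fit(self, dataset):
        # Reject datasets that contain non-numeric values.
        for row in dataset:
            for column, value in row.items():
                if not isinstance(value, Number):
                    raise ValueError(f"column {column!r} is not numeric")
        # Store a copy so later changes to the input do not leak in.
        self.dataset = copy.deepcopy(dataset)
        return self

    def transform(self):
        # Identity transform, comparable in spirit to OriginalData.
        self.transformed_data = copy.deepcopy(self.dataset)
        return self.transformed_data


rows = [{"age": 30, "sex": 0, "hired": 1}, {"age": 45, "sex": 1, "hired": 0}]
pre = ToyPreprocessing(protected_attribute="sex", label="hired")
result = pre.fit(rows).transform()
```

Because fit returns self, the two calls can be chained as in the last line.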
- class fairdo.preprocessing.base.Random(frac=0.8, protected_attribute=None, label=None, random_state=None)[source]#
Bases: Preprocessing
This class is used to return a random subset of the dataset. The size of the subset is determined by the frac parameter.
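As a rough illustration of what a frac-based random baseline does, the sketch below samples rows without replacement. The function random_subset is hypothetical, the rounding rule is an assumption, and a list of dicts stands in for a pandas DataFrame.

```python
import random


def random_subset(rows, frac=0.8, random_state=None):
    """Return a random subset holding about `frac` of the rows."""
    rng = random.Random(random_state)
    k = round(frac * len(rows))  # assumed rounding rule
    # Sample row positions without replacement, then restore original order.
    keep = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in keep]


rows = [{"id": i, "sex": i % 2, "hired": int(i % 3 == 0)} for i in range(10)]
subset = random_subset(rows, frac=0.8, random_state=42)
```

Fixing random_state makes the subset reproducible, which matters when the random baseline is compared against an optimized pre-processor.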
- class fairdo.preprocessing.base.Unawareness(protected_attribute=None, label=None, **kwargs)[source]#
Bases: Preprocessing
Fairness Through Unawareness
Removes all columns of protected attributes from the dataset. This is a simple way to pursue fairness, but it fails to account for indirect discrimination: the protected attribute may be correlated with other features that remain in the dataset.
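The limitation can be made concrete with a toy example. In the sketch below, drop_protected is a hypothetical helper and zip_code is an invented proxy column that mirrors the protected attribute exactly; a list of dicts stands in for a pandas DataFrame.

```python
def drop_protected(rows, protected_attributes):
    """Return copies of the rows without the protected columns."""
    protected = set(protected_attributes)
    return [{k: v for k, v in row.items() if k not in protected} for row in rows]


# Toy data: `zip_code` is a hypothetical proxy that mirrors `sex` exactly.
rows = [
    {"sex": 0, "zip_code": 10, "hired": 1},
    {"sex": 0, "zip_code": 10, "hired": 1},
    {"sex": 1, "zip_code": 20, "hired": 0},
    {"sex": 1, "zip_code": 20, "hired": 1},
]
unaware = drop_protected(rows, ["sex"])

# The protected column is gone, but the proxy still separates the groups.
proxy_to_group = {row["zip_code"]: row["sex"] for row in rows}
recovered = [proxy_to_group[row["zip_code"]] for row in unaware]
```

Here recovered reproduces the removed column, so a model trained on unaware could still discriminate through the proxy.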
fairdo.preprocessing.metricoptimizer module#
- class fairdo.preprocessing.metricoptimizer.MetricOptGenerator(protected_attribute, label, frac=1.25, m=5, eps=0, additions=None, fairness_metric=<function statistical_parity_abs_diff>, random_state=None)[source]#
Bases: Preprocessing
Adds generated samples that reduce the discrimination in the dataset.
- class fairdo.preprocessing.metricoptimizer.MetricOptRemover(protected_attribute, label, frac=0.75, m=5, eps=0, deletions=None, fairness_metric=<function statistical_parity_abs_diff>, random_state=None)[source]#
Bases: Preprocessing
Deletes samples which worsen the discrimination in the dataset.
- class fairdo.preprocessing.metricoptimizer.MetricOptimizer(protected_attribute, label, frac=0.75, m=5, eps=0, additions=None, deletions=None, fairness_metric=<function statistical_parity_abs_diff>, random_state=None)[source]#
Bases: Preprocessing
Deletes samples or adds generated samples to decrease the discrimination/bias in the given dataset.
fairdo.preprocessing.wrapper module#
- class fairdo.preprocessing.wrapper.DefaultPreprocessing(protected_attribute, label, disc_measure=<function statistical_parity_abs_diff_max>, pop_size=100, num_generations=500, **kwargs)[source]#
Bases: SingleWrapper
DefaultPreprocessing is a pre-processing method that can be used out of the box. It uses a genetic algorithm to select a subset of the given dataset that optimizes fairness. It also includes a penalty for missing groups in the protected attribute.
- The default parameters are:
pop_size=100, num_generations=500
Selection: Elitist
Crossover: Uniform
Mutation: Fractional Bit Flip
- func#
The discrimination measure function to be optimized. It is defined within the fit method.
- Type:
callable
- dims#
The number of dimensions of the binary solution vector, i.e., the number of data points (rows) that can be selected. It is defined within the fit method.
- Type:
int
- disc_measure#
The discrimination measure to be optimized. It takes the feature matrix (x), labels (y), and protected attributes (z) and returns a numeric value.
- Type:
callable
- dataset#
The dataset to be preprocessed. It is defined within the fit method.
- Type:
pandas DataFrame
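To give a feel for how a genetic algorithm can select a fair subset, here is a small sketch under stated assumptions. It is not fairdo's implementation: parity_gap and genetic_algorithm are hypothetical names, the elitist selection keeps the better half of the population, plain per-bit flipping stands in for the Fractional Bit Flip mutation, and the population is seeded with the all-ones vector (the unmodified dataset) so that the result is never worse than the baseline.

```python
import random


def parity_gap(mask, labels, groups):
    """Absolute gap in positive-label rates between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        chosen = [y for m, y, z in zip(mask, labels, groups) if m and z == g]
        if not chosen:
            return 1.0  # penalty: a group vanished from the subset
        rates.append(sum(chosen) / len(chosen))
    return abs(rates[0] - rates[1])


def genetic_algorithm(fitness, dims, pop_size=20, num_generations=30,
                      mutation_rate=0.05, seed=0):
    """Minimise `fitness` over binary vectors of length `dims`."""
    rng = random.Random(seed)
    # Seed with the all-ones vector, then fill up with random vectors.
    population = [[1] * dims] + [
        [rng.randint(0, 1) for _ in range(dims)] for _ in range(pop_size - 1)
    ]
    for _ in range(num_generations):
        # Elitist selection: the better half survives and acts as parents.
        population.sort(key=fitness)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from either parent.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Bit-flip mutation: flip each gene with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return best, fitness(best)


labels = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
groups = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
baseline = parity_gap([1] * len(labels), labels, groups)
best_mask, best_gap = genetic_algorithm(
    lambda mask: parity_gap(mask, labels, groups), dims=len(labels)
)
```

Each bit of best_mask marks whether the corresponding data point is kept. Because the best individuals always survive, best_gap cannot exceed baseline.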
- class fairdo.preprocessing.wrapper.MultiWrapper(heuristic, protected_attribute, label, fitness_functions=[<function statistical_parity_abs_diff_max>, <function data_loss>], **kwargs)[source]#
Bases: Preprocessing
A preprocessing wrapper class that applies a given multi-objective optimization method to optimize multiple given objective functions and outputs the Pareto front of the solutions. The solutions are returned as a binary numpy array of shape (n, d), where n is the number of solutions and d is the number of dimensions. The objective functions are given as a list of functions to be optimized. They evaluate properties of the dataset such as fairness and data quality/data loss. The pre-processed dataset is a subset of the original dataset, where the data points (rows) are selected by the multi-objective optimization method.
- heuristic#
The method that optimizes multiple fitness functions. It takes multiple fitness functions, the number of dimensions, and some other parameters. It returns solutions in the Pareto front and their corresponding fitness values. All fronts can be returned if requested. The solution has a shape of (n, dims) where n is the number of solutions and dims is the number of dimensions.
- Type:
callable
- funcs#
List of objective functions to be minimized. Wraps the user-given fitness_functions. It is defined within the fit method.
- Type:
callable
- dims#
The number of dimensions of the binary solution vector, i.e., the number of data points (rows) that can be selected. It is defined within the fit method.
- Type:
int
- fitness_functions#
The list of objective functions to be minimized. They evaluate properties of the dataset such as the fairness and data quality/data loss.
- Type:
list of callable
- dataset#
The dataset to be preprocessed. It is defined within the fit method.
- Type:
pandas DataFrame
- apply_heuristic()[source]#
Applies the heuristic method to the dataset.
- Returns:
self.transformed_data (pandas DataFrame) – The dataset to be masked based on the heuristic method.
masks (np.array of shape (n, dims)) – The binary masks indicating the selected data points (rows). Each mask is one of the n solutions in the Pareto front.
fitness_values (np.array of shape (n, len(fitness_functions))) – The fitness values of the solutions in the Pareto front.
- fit(dataset, synthetic_dataset=None, approach='remove')[source]#
Defines the discrimination measure function and the number of dimensions based on the input dataset.
- Parameters:
dataset (pandas DataFrame) – The dataset to be preprocessed.
synthetic_dataset (pandas DataFrame, optional) – The synthetic dataset to be used for the ‘add’ approach. It is required only if the ‘add’ approach is used.
approach (str) – The approach to be used for the heuristic method. It can be either ‘remove’ or ‘add’.
- Return type:
self
- get_best_fitness(return_baseline)[source]#
Get the best fitness value of the solutions according to the weights set in the transform method.
- Parameters:
return_baseline (bool) – Whether to return the fitness of the original baseline dataset.
- Returns:
The best fitness value.
- Return type:
float
- get_pareto_front(return_baseline=False)[source]#
Get the Pareto front of the solutions. Return the baseline solution if it is requested.
- Parameters:
return_baseline (bool, optional (default=False)) – Whether to return the result of the original baseline dataset.
- Returns:
The Pareto front of the solutions.
- Return type:
np.array of shape (n, len(fitness_functions))
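For readers unfamiliar with the term, a Pareto front under minimization is the set of solutions that no other solution dominates. The sketch below is a generic, dependency-free illustration; pareto_front is a hypothetical helper and the fitness pairs are invented, so it does not reflect how the library computes its front.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming every objective is minimised."""
    def dominates(p, q):
        # p dominates q if it is no worse everywhere and better somewhere.
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (discrimination, data_loss) pairs for five candidate subsets.
candidates = [(0.30, 0.00), (0.10, 0.20), (0.05, 0.50), (0.20, 0.30), (0.10, 0.40)]
front = pareto_front(candidates)
```

In this example (0.20, 0.30) and (0.10, 0.40) are dropped because (0.10, 0.20) is at least as good on both objectives and strictly better on one.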
- plot_results(xaxis=0, yaxis=1, xlabel='Fitness 1', ylabel='Fitness 2', title='Multi-Objective Optimization Results', figsize=(7, 7))[source]#
Plot the results of the multi-objective optimization.
- transform(ideal_solution=array([0, 0]), w=0.5)[source]#
Applies the heuristic method to the dataset and returns the best solution in the Pareto front, that is, the solution closest to the ideal solution.
If fitted before, it is possible to return a different dataset by changing the ideal_solution and w parameters.
- Parameters:
ideal_solution (np.array, optional (default=[0, 0])) – The ideal solution to be used for the optimization. Default is [0, 0].
w (float or np.array, optional (default=0.5)) – The weight to be used for the weighted fitness value.
- Returns:
data_best – The dataset closest to the ideal solution.
- Return type:
pandas DataFrame
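Choosing the solution closest to the ideal solution can be sketched as below. The exact way the library combines w with the fitness values is not stated here, so the weighting is an assumption: w weights the first objective and 1 - w the second inside a weighted Euclidean distance. The helper closest_to_ideal and the front values are hypothetical.

```python
import math


def closest_to_ideal(front, ideal_solution=(0.0, 0.0), w=0.5):
    """Return the index of the front point nearest to the ideal point.

    Assumption: `w` weights the first objective and `1 - w` the second
    in a weighted Euclidean distance. The library may weight differently.
    """
    def distance(point):
        d0 = point[0] - ideal_solution[0]
        d1 = point[1] - ideal_solution[1]
        return math.sqrt(w * d0 ** 2 + (1 - w) * d1 ** 2)

    return min(range(len(front)), key=lambda i: distance(front[i]))


# Hypothetical (discrimination, data_loss) values on a Pareto front.
front = [(0.30, 0.00), (0.10, 0.20), (0.05, 0.50)]
balanced = closest_to_ideal(front, w=0.5)
fairness_first = closest_to_ideal(front, w=0.99)
```

With equal weights the middle trade-off is chosen; pushing w towards 1 favours the solution with the lowest discrimination even though it loses more data. This mirrors why transform can return a different dataset after fitting when ideal_solution or w changes.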
- class fairdo.preprocessing.wrapper.SingleWrapper(heuristic, protected_attribute, label, disc_measure=<function statistical_parity_abs_diff_max>, fitness_functions=None, **kwargs)[source]#
Bases: Preprocessing
A preprocessing wrapper class that applies a given heuristic method to optimize a given discrimination measure and outputs a pre-processed dataset. The pre-processed dataset is a subset of the original dataset, where the data points (rows) are selected by the heuristic method.
- heuristic#
The method that optimizes the discrimination measure. It takes a function and the number of dimensions, and returns a binary numpy array of shape (d,) indicating the selected data points (rows), together with the optimized value of the discrimination measure.
- Type:
callable
- func#
The discrimination measure function to be optimized. It is defined within the fit method.
- Type:
callable
- dims#
The number of dimensions of the binary solution vector, i.e., the number of data points (rows) that can be selected. It is defined within the fit method.
- Type:
int
- disc_measure#
The discrimination measure to be optimized. It takes the feature matrix (x), labels (y), and protected attributes (z) and returns a numeric value.
- Type:
callable
- dataset#
The dataset to be preprocessed. It is defined within the fit method.
- Type:
pandas DataFrame
- fit(dataset, synthetic_dataset=None, approach='remove')[source]#
Defines the discrimination measure function and the number of dimensions based on the input dataset.
- Parameters:
dataset (pandas DataFrame) – The dataset to be preprocessed.
synthetic_dataset (pandas DataFrame, optional (default=None)) – The synthetic dataset to be used for the ‘add’ approach. It is required only if the ‘add’ approach is used.
approach (str, optional (default='remove')) – The approach to be used for the heuristic method. It can be either ‘remove’ or ‘add’.
- Return type:
self
- fairdo.preprocessing.wrapper.f(binary_vector, dataset, label, protected_attributes, approach='remove', synthetic_dataset=None, fitness_function=<function statistical_parity_abs_diff_max>, penalty=None)[source]#
Two different approaches can be used for the heuristic method:
1. ‘remove’: Data points are removed from the given dataset to promote fairness.
2. ‘add’: Additional samples are added to the original data to promote fairness. The added samples can be synthetic data. This approach addresses the question: which of the data points from synthetic_dataset should be added to the original data to prevent discrimination?
- Parameters:
binary_vector (np.array) – Binary vector indicating which data points (rows) to include in the discrimination measure calculation.
dataset (pandas DataFrame) – The data to calculate the discrimination measure on.
label (str) – The column in the dataset to use as the target variable.
protected_attributes (Union[str, List[str]]) – The column or columns in the dataset to consider as protected attributes.
approach (str) – The approach to be used for the heuristic method. It can be either ‘remove’ or ‘add’.
synthetic_dataset (pandas DataFrame, optional) – Extra samples to be added to the original data. Samples can be synthetic data. It is required only if the ‘add’ approach is used.
fitness_function (callable, optional (default=statistical_parity_abs_diff_max)) – A function that takes in x (features), y (labels), and z (protected attributes) and returns a numeric value. Default is statistical_parity_abs_diff_max which is the absolute difference between the maximum and minimum statistical parity values.
penalty (callable, optional (default=None)) – A function that takes a dictionary of keyword arguments and returns a numeric value. This function is used to penalize the discrimination loss. Default is None which means no penalty is applied.
- Returns:
The calculated discrimination measure.
- Return type:
float
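The two approaches can be illustrated with a small, dependency-free sketch. It follows the description of ‘remove’ and ‘add’ above in treating each bit as selecting one data point, but it is not the library's code: statistical_parity_gap and objective are hypothetical functions, the penalty argument is omitted, and a list of dicts stands in for a pandas DataFrame.

```python
def statistical_parity_gap(rows, label, protected_attribute):
    """Max minus min positive-label rate across protected groups."""
    totals, positives = {}, {}
    for row in rows:
        g = row[protected_attribute]
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + row[label]
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def objective(binary_vector, dataset, label, protected_attribute,
              approach="remove", synthetic_dataset=None):
    """Score the dataset that a binary vector selects."""
    if approach == "remove":
        # Keep only the original rows whose bit is 1.
        rows = [r for bit, r in zip(binary_vector, dataset) if bit]
    elif approach == "add":
        # Keep all original rows and append the selected synthetic rows.
        rows = dataset + [
            r for bit, r in zip(binary_vector, synthetic_dataset) if bit
        ]
    else:
        raise ValueError("approach must be 'remove' or 'add'")
    return statistical_parity_gap(rows, label, protected_attribute)


data = [
    {"sex": 0, "hired": 1},
    {"sex": 0, "hired": 1},
    {"sex": 1, "hired": 1},
    {"sex": 1, "hired": 0},
]
synthetic = [{"sex": 1, "hired": 1}, {"sex": 1, "hired": 1}]

original = objective([1, 1, 1, 1], data, "hired", "sex")
after_remove = objective([1, 1, 1, 0], data, "hired", "sex")
after_add = objective([1, 1], data, "hired", "sex",
                      approach="add", synthetic_dataset=synthetic)
```

Note that the binary vector has one entry per original row for ‘remove’ and one entry per synthetic row for ‘add’, which is why the number of dimensions depends on the chosen approach.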