fairdo.utils package
The fairdo.utils package provides a collection of miscellaneous utility functions that are used across the package.
Submodules
fairdo.utils.dataset module
This module contains utility functions to load, preprocess, and synthesize datasets.
- fairdo.utils.dataset.data_generator(data)
Returns a data generator (synthesizer) fitted on the given data, from which the user can sample synthetic data.
- Parameters:
data (pandas DataFrame) – The real data to be used to generate synthetic data
- Returns:
synthesizer – The data generator object
- Return type:
GaussianCopulaSynthesizer object or None
Examples
>>> import pandas as pd
>>> from fairdo.utils.dataset import data_generator
>>> data = pd.DataFrame({'age': [39, 50], 'education': ['Bachelors', 'HS-grad'], 'income': ['<=50K', '<=50K']})
>>> synthesizer = data_generator(data)
>>> synthetic_data = synthesizer.sample(num_rows=2)
>>> print(synthetic_data)
    age  education income
0  39.0  Bachelors  <=50K
1  50.0    HS-grad  <=50K
- fairdo.utils.dataset.dataset_intersectional_column(data, protected_attributes)
Combine the protected attributes into a single column named intersectional_group. This column will be used to identify the intersectional groups.
- Parameters:
data (pandas DataFrame) – DataFrame with protected attributes.
protected_attributes (list of str) – List of protected attributes. Each attribute should be a column in the data.
- Returns:
data (pandas DataFrame) – Returns a DataFrame with an extra column of combined protected attributes.
protected_attribute (str) – The name of the column with the combined protected attributes.
Examples
>>> import pandas as pd
>>> from fairdo.utils.dataset import dataset_intersectional_column
>>> data = pd.DataFrame({'sex': ['male', 'female'], 'race': ['white', 'black']})
>>> pas = ['sex', 'race']
>>> data_new, pa = dataset_intersectional_column(data, pas)
>>> print(data_new)
      sex   race      pa_merged
0    male  white    male_white_
1  female  black  female_black_
>>> print(pa)
intersectional_group
- fairdo.utils.dataset.downcast(data)
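The idea of merging several protected attributes into one group label can be sketched with plain pandas. This is an illustrative sketch, not fairdo's implementation: the helper name `combine_columns`, the `_` separator, and the choice to omit a trailing separator are all assumptions.

```python
import pandas as pd


def combine_columns(data, columns, new_column="intersectional_group", sep="_"):
    """Hypothetical helper: join several columns into one group label per row."""
    out = data.copy()  # leave the caller's DataFrame untouched
    # Cast to str so numeric attributes can be joined as well.
    out[new_column] = out[columns].astype(str).agg(sep.join, axis=1)
    return out, new_column


data = pd.DataFrame({"sex": ["male", "female"], "race": ["white", "black"]})
combined, group_col = combine_columns(data, ["sex", "race"])
print(combined[group_col].tolist())  # ['male_white', 'female_black']
```

Each distinct value of the combined column then identifies one intersectional group.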
Downcast float and integer columns of the given data to save memory.
- Parameters:
data (pandas DataFrame) – DataFrame to downcast.
- Returns:
data
- Return type:
pandas DataFrame
Examples
>>> import pandas as pd
>>> from fairdo.utils.dataset import downcast
>>> data = pd.DataFrame({'a': [1, 2], 'b': [1.0, 2.0]})
>>> data = downcast(data)
>>> print(data.dtypes)
a       int8
b    float32
- fairdo.utils.dataset.generate_data(data, num_rows=100)
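Downcasting of this kind can be approximated with `pandas.to_numeric` and its `downcast` argument. The sketch below assumes that approach; the helper name `downcast_numeric` is hypothetical and fairdo's own function may work differently.

```python
import pandas as pd


def downcast_numeric(data):
    """Hypothetical helper: shrink numeric columns to the smallest fitting dtype."""
    out = data.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # pandas never goes below float32 when downcasting floats.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


frame = pd.DataFrame({"a": [1, 2], "b": [1.0, 2.0]})
small = downcast_numeric(frame)
print(small.dtypes)
```

Non-numeric columns are left unchanged, so the helper is safe to apply to a mixed DataFrame.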
Generate synthetic data using the sdv library. The method used is Gaussian Copula.
- Parameters:
data (pandas DataFrame) – The real data to be used to generate synthetic data
num_rows (int) – The number of rows to generate
- Returns:
synthetic_data – The synthetic data generated
- Return type:
pandas DataFrame or None
Examples
>>> import pandas as pd
>>> from fairdo.utils.dataset import generate_data
>>> data = pd.DataFrame({'age': [39, 50], 'education': ['Bachelors', 'HS-grad'], 'income': ['<=50K', '<=50K']})
>>> generate_data(data, num_rows=2)
    age  education income
0  39.0  Bachelors  <=50K
1  50.0    HS-grad  <=50K
- fairdo.utils.dataset.load_data(dataset_str, multi_protected_attr=False, print_info=True)
Load the dataset and preprocess it. The preprocessing steps are:
Dropping rows with missing values
Label-encoding the protected attributes and the label
One-hot encoding all other categorical variables
Downcasting float and integer columns to save memory
- Parameters:
dataset_str (str) – Name of the dataset to load and preprocess (e.g., ‘adult’, ‘compas’, ‘bank’, ‘german’).
multi_protected_attr (bool) – Whether to use multiple protected attributes or not.
print_info (bool) – Whether to print information about the dataset or not.
- Returns:
df (pandas DataFrame) – Preprocessed DataFrame.
label (str) – Name of the label column.
protected_attributes (list of str) – List of protected attributes.
Examples
>>> from fairdo.utils.dataset import load_data
>>> data, label, protected_attributes = load_data('adult')
>>> print(data.head(2))
   age  education-num  race  ...  relationship_ Wife  sex_ Female  sex_ Male
0   39             13     4  ...                   0            0          1
1   50             13     4  ...                   0            0          1
>>> print(label)
income
>>> print(protected_attributes)
['race']
fairdo.utils.helper module
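The four preprocessing steps listed for load_data can be reproduced on a toy DataFrame with plain pandas. This is a rough sketch under stated assumptions: the toy columns, the use of category codes for label encoding, and `pandas.get_dummies` for one-hot encoding are illustrative choices, not a description of fairdo's internals.

```python
import pandas as pd

# Toy stand-in for a raw dataset; all column names here are illustrative.
raw = pd.DataFrame({
    "age": [39, 50, None, 28],
    "workclass": ["Private", "State-gov", "Private", "Private"],
    "race": ["White", "Black", "White", "Black"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})
label = "income"
protected_attributes = ["race"]
other_categorical = ["workclass"]

# 1. Drop rows with missing values.
df = raw.dropna().copy()

# 2. Label-encode the protected attributes and the label
#    (integer codes follow the sorted order of the categories).
for col in protected_attributes + [label]:
    df[col] = df[col].astype("category").cat.codes

# 3. One-hot encode the remaining categorical columns.
df = pd.get_dummies(df, columns=other_categorical)

# 4. Downcast numeric columns to save memory.
df["age"] = pd.to_numeric(df["age"], downcast="float")

print(df.columns.tolist())
```

Keeping the protected attributes label-encoded rather than one-hot encoded leaves each of them as a single column, which is convenient when group membership has to be looked up later.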
Helper functions for the fairdo package.
- fairdo.utils.helper.generate_pairs(lst)
Generate all possible pairs \((i, j)\) of elements from a list with \(i < j\).
- Parameters:
lst (array_like) – list of elements
- Returns:
list of pairs of elements
- Return type:
list
Examples
>>> from fairdo.utils.helper import generate_pairs
>>> generate_pairs([1, 2, 3])
[(1, 2), (1, 3), (2, 3)]
- fairdo.utils.helper.nunique(a, axis=0)
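For comparison, the same pairs can be produced with the standard library. The sketch below uses `itertools.combinations`, which yields pairs by position in the input, so the first element of each pair always precedes the second in the list. Whether fairdo uses this function internally is not stated in the source.

```python
from itertools import combinations

items = [1, 2, 3]

# Every unordered pair appears exactly once, in input order.
pairs = list(combinations(items, 2))
print(pairs)  # [(1, 2), (1, 3), (2, 3)]

# A list of n elements yields n * (n - 1) / 2 pairs.
print(len(pairs) == len(items) * (len(items) - 1) // 2)  # True
```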
Count the number of unique elements in an array along a given axis.
- Parameters:
a (np.array) – The array to count the number of unique elements.
axis (int, optional) – The axis along which to count the number of unique elements. Default is 0.
- Returns:
The number of unique elements along the given axis.
- Return type:
np.array
Examples
>>> import numpy as np
>>> from fairdo.utils.helper import nunique
>>> nunique(np.array([1, 2, 3, 1, 2, 3]))
array([3])
>>> nunique(np.array([[1, 2, 3], [1, 2, 3]]), axis=1)
array([3, 3])
>>> nunique(np.array([[1, 2, 3], [1, 2, 3]]), axis=0)
array([1, 1, 1])
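One way to count unique values along an axis without a Python loop is to sort along that axis and count the positions where neighbouring values differ. The sketch below follows that idea; the name `count_unique` and the handling of 1-D input as a single column are assumptions made to mirror the outputs shown above, not a copy of fairdo's code.

```python
import numpy as np


def count_unique(a, axis=0):
    """Hypothetical helper: number of distinct values along `axis`."""
    a = np.asarray(a)
    if a.ndim == 1:
        a = a.reshape(-1, 1)  # treat a flat array as one column
    ordered = np.sort(a, axis=axis)
    # After sorting, a new distinct value starts wherever neighbours differ.
    changes = np.diff(ordered, axis=axis) != 0
    return changes.sum(axis=axis) + 1


grid = np.array([[1, 2, 3], [1, 2, 3]])
print(count_unique(np.array([1, 2, 3, 1, 2, 3])))  # [3]
print(count_unique(grid, axis=1))                  # [3 3]
print(count_unique(grid, axis=0))                  # [1 1 1]
```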
fairdo.utils.math module
Miscellaneous math functions used throughout the package.
References
- fairdo.utils.math.conditional_entropy_cat(x: array, y: array) → float
Calculate the conditional entropy [1] of a categorical variable x given another categorical variable y, i.e.,
\[H(X|Y) = H(X, Y) - H(Y)\]
where \(H(X, Y)\) is the joint entropy of the categorical variables x and y, and \(H(Y)\) is the entropy of the variable y.
- Parameters:
x (np.array (n_samples,)) – Array of shape (n_samples,) containing the labels.
y (np.array (n_samples,) or (n_samples, n_variables)) – Array containing the labels. Can represent a single or multiple categorical variables.
- Returns:
The conditional entropy of the label distribution.
- Return type:
float
Examples
>>> import numpy as np
>>> from fairdo.utils.math import conditional_entropy_cat
>>> x = np.array([0, 1, 1, 0, 1, 0, 0, 1])
>>> y = np.array([0, 1, 1, 0, 1, 0, 0, 1])
>>> conditional_entropy_cat(x, y)
0
- fairdo.utils.math.entropy_estimate_cat(x: array, **kwargs) → float
Calculate the entropy [1] of a categorical variable. It is calculated as:
\[H(X) = - \sum_{i=1}^{n} p(X_i) \log_2 p(X_i)\]
where \(n\) is the number of categories and \(p(X_i)\) is the probability of the i-th category. The entropy is a measure of the information/uncertainty of a random variable. Higher values indicate more information/uncertainty.
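The identity \(H(X|Y) = H(X, Y) - H(Y)\) can be checked by hand with a few lines of standard-library Python. This is a minimal sketch with hypothetical helper names (`entropy_from_counts`, `conditional_entropy`); it estimates probabilities from relative frequencies and is not fairdo's implementation.

```python
from collections import Counter
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts)


def conditional_entropy(x, y):
    """Illustrative H(X|Y) = H(X, Y) - H(Y) for two equal-length label sequences."""
    joint = Counter(zip(x, y))   # counts of (x_i, y_i) pairs
    marginal = Counter(y)        # counts of y_i alone
    return entropy_from_counts(joint.values()) - entropy_from_counts(marginal.values())


x = [0, 1, 1, 0, 1, 0, 0, 1]
print(conditional_entropy(x, x))                         # y determines x
print(conditional_entropy(x, [0, 0, 0, 0, 1, 1, 1, 1]))  # y tells nothing about x
```

In the first call knowing y removes all uncertainty about x, so the result is zero. In the second call x is balanced within each value of y, so the full one bit of H(X) remains.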
- Parameters:
x (np.array (n_samples,)) – Array of shape (n_samples,) containing the categorical labels as numerical values.
- Returns:
The entropy of the label distribution.
- Return type:
float
Examples
>>> import numpy as np
>>> from fairdo.utils.math import entropy_estimate_cat
>>> x = np.array([0, 1, 1, 0, 1, 0, 0, 1])
>>> entropy_estimate_cat(x)
1.0
- fairdo.utils.math.joint_entropy_cat(x: array)
Calculate the joint entropy [1] of multiple categorical variables. The joint entropy is a measure of the information/surprise/uncertainty of a set of random variables. Let \(X = (X_1, X_2, \ldots, X_m)\) be a set of categorical variables, i.e., a multivariate random variable. The joint entropy is then calculated as:
\[H(X) = -\sum_{x_1 \in \mathcal{X}_1} \ldots \sum_{x_m \in \mathcal{X}_m} P(x_1, \ldots, x_m) \log_2 P(x_1, \ldots, x_m)\]
- Parameters:
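The entropy formula can be evaluated directly from category frequencies. The following sketch uses only the standard library; the function name `entropy_bits` is hypothetical and the probabilities are plain relative frequencies.

```python
from collections import Counter
from math import log2


def entropy_bits(labels):
    """Illustrative entropy estimate in bits from relative frequencies."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())


print(entropy_bits([0, 1, 1, 0, 1, 0, 0, 1]))  # two equally likely categories
print(entropy_bits(["a", "b", "c", "d"]))      # four equally likely categories
print(entropy_bits([7, 7, 7]))                 # a single category
```

Two equally likely categories give 1 bit and four give 2 bits. A single category gives zero, which floating-point arithmetic may display as -0.0 because the sum is negated.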
x (np.array (n_samples, n_variables)) – Array of shape (n_samples, n_variables) containing the labels as numerical values.
- Returns:
The joint entropy of the categorical variables in the array x.
- Return type:
float
Examples
>>> import numpy as np
>>> from fairdo.utils.math import joint_entropy_cat
>>> x = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
...               [0, 1, 1, 0, 1, 0, 0, 1]])
>>> joint_entropy_cat(x)
-0.0
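Joint entropy over several variables amounts to the entropy of the distribution of whole rows. The sketch below assumes, in line with the documented shape (n_samples, n_variables), that each row is one sample; the name `joint_entropy_rows` is hypothetical and the code is not taken from fairdo.

```python
import numpy as np


def joint_entropy_rows(x):
    """Illustrative joint entropy in bits, treating each row of x as one sample."""
    x = np.asarray(x)
    # Count how often each distinct row (joint outcome) occurs.
    _, counts = np.unique(x, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


samples = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(joint_entropy_rows(samples))  # four equally likely joint outcomes

identical = np.array([[0, 1, 1, 0], [0, 1, 1, 0]])
print(joint_entropy_rows(identical))  # a single joint outcome
```

Under this reading, an array with two identical rows contains only one joint outcome, so its joint entropy is zero.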