fairdo.utils package

The fairdo.utils package provides miscellaneous utility functions that are used across the package.

Submodules

fairdo.utils.dataset module

This module contains utility functions to load, preprocess, and synthesize datasets.

fairdo.utils.dataset.data_generator(data)

Return a data generator (synthesizer) built from the given data, from which the user can sample synthetic data.

Parameters: data (pandas DataFrame) – The real data to be used to generate synthetic data.
Returns: synthesizer – The data generator object.
Return type: GaussianCopulaSynthesizer object or None

Examples

>>> import pandas as pd
>>> from fairdo.utils.dataset import data_generator
>>> data = pd.DataFrame({'age': [39, 50], 'education': ['Bachelors', 'HS-grad'], 'income': ['<=50K', '<=50K']})
>>> synthesizer = data_generator(data)
>>> synthetic_data = synthesizer.sample(num_rows=2)
>>> print(synthetic_data)
     age  education  income
0  39.0  Bachelors  <=50K
1  50.0    HS-grad  <=50K

fairdo.utils.dataset.dataset_intersectional_column(data, protected_attributes)

Combine the protected attributes into a single column named intersectional_group. This column will be used to identify the intersectional groups.

Parameters:

data (pandas DataFrame) – DataFrame with protected attributes.
protected_attributes (list of str) – List of protected attributes. Each attribute should be a column in the data.

Returns:

data (pandas DataFrame) – Returns a DataFrame with an extra column of combined protected attributes.
protected_attribute (str) – The name of the column with the combined protected attributes.

Examples

>>> import pandas as pd
>>> from fairdo.utils.dataset import dataset_intersectional_column
>>> data = pd.DataFrame({'sex': ['male', 'female'], 'race': ['white', 'black']})
>>> pas = ['sex', 'race']
>>> data_new, pa = dataset_intersectional_column(data, pas)
>>> print(data_new)
      sex   race  intersectional_group
0    male  white           male_white_
1  female  black         female_black_
>>> print(pa)
intersectional_group
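
The combination step can be pictured with plain pandas. The following is a minimal sketch, not fairdo's implementation: the helper name `add_intersectional_column`, the `_` separator, and the exact string format are assumptions made for illustration, so its values may differ from the output shown above.

```python
import pandas as pd


def add_intersectional_column(data, protected_attributes, name="intersectional_group"):
    """Hypothetical helper: join the protected attributes row-wise into one column."""
    out = data.copy()  # leave the caller's DataFrame untouched
    # Cast to str so numeric codes and strings can be joined uniformly.
    out[name] = out[protected_attributes].astype(str).agg("_".join, axis=1)
    return out, name


data = pd.DataFrame({"sex": ["male", "female"], "race": ["white", "black"]})
data_new, pa = add_intersectional_column(data, ["sex", "race"])
print(data_new[pa].tolist())  # ['male_white', 'female_black']
```

Each distinct value in the new column identifies one intersectional group, so the column can be passed on wherever a single protected attribute is expected.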

fairdo.utils.dataset.downcast(data)

Downcast float and integer columns of the given data to save memory.

Parameters: data (pandas DataFrame) – DataFrame to downcast.
Returns: data – The DataFrame with downcast numeric columns.
Return type: pandas DataFrame

Examples

>>> import pandas as pd
>>> from fairdo.utils.dataset import downcast
>>> data = pd.DataFrame({'a': [1, 2], 'b': [1.0, 2.0]})
>>> data = downcast(data)
>>> print(data.dtypes)
a       int8
b    float32
dtype: object
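
One way to obtain this behaviour is pandas' `pd.to_numeric` with its `downcast` argument. The sketch below is an assumption about how such a helper could look (`downcast_numeric` is a hypothetical name), not the library's actual code.

```python
import pandas as pd


def downcast_numeric(data):
    """Hypothetical helper: shrink integer and float columns to smaller dtypes."""
    out = data.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer dtype that can hold every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 is the smallest float dtype pandas downcasts to.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


data = pd.DataFrame({"a": [1, 2], "b": [1.0, 2.0], "c": [1, 300]})
small = downcast_numeric(data)
print(small.dtypes.astype(str).to_dict())  # {'a': 'int8', 'b': 'float32', 'c': 'int16'}
```

Column `c` shows that the target dtype depends on the value range: 300 does not fit into int8, so the column becomes int16.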

fairdo.utils.dataset.generate_data(data, num_rows=100)

Generate synthetic data with the sdv library, using its Gaussian copula method.

Parameters:

data (pandas DataFrame) – The real data to be used to generate synthetic data
num_rows (int) – The number of rows to generate

Returns:

synthetic_data – The synthetic data generated

Return type:

pandas DataFrame or None

Examples

>>> import pandas as pd
>>> from fairdo.utils.dataset import generate_data
>>> data = pd.DataFrame({'age': [39, 50], 'education': ['Bachelors', 'HS-grad'], 'income': ['<=50K', '<=50K']})
>>> generate_data(data, num_rows=2)
     age  education  income
0  39.0  Bachelors  <=50K
1  50.0    HS-grad  <=50K

fairdo.utils.dataset.load_data(dataset_str, multi_protected_attr=False, print_info=True)

Load the dataset and preprocess it. The preprocessing steps are:

Drop rows with missing values
Label-encode the protected attributes and the label
One-hot encode all other categorical variables
Downcast float and integer columns to save memory

Parameters:

dataset_str (str) – Name of the dataset to load and preprocess (e.g., ‘adult’, ‘compas’, ‘bank’, ‘german’).
multi_protected_attr (bool) – Whether to use multiple protected attributes or not.
print_info (bool) – Whether to print information about the dataset or not.

Returns:

df (pandas DataFrame) – Preprocessed DataFrame.
label (str) – Name of the label column.
protected_attributes (list of str) – List of protected attributes.

Examples

>>> from fairdo.utils.dataset import load_data
>>> data, label, protected_attributes = load_data('adult')
>>> print(data.head(2))
   age  education-num  race  ...  relationship_ Wife  sex_ Female  sex_ Male
0   39             13     4  ...                   0            0          1
1   50             13     4  ...                   0            0          1
>>> print(label)
income
>>> print(protected_attributes)
['race']
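
The four preprocessing steps listed for load_data can be sketched on a toy DataFrame. This is an illustration under stated assumptions, not the library's code: `preprocess` is a hypothetical helper, the toy rows are made up, and the encoding choices (category codes for label encoding, `pd.get_dummies` for one-hot encoding) are one plausible reading of the description.

```python
import pandas as pd


def preprocess(df, label, protected_attributes):
    """Hypothetical pipeline that mirrors the four documented steps."""
    # 1. Drop rows with missing values.
    df = df.dropna().copy()
    # 2. Label-encode the protected attributes and the label as integer codes.
    for col in [*protected_attributes, label]:
        df[col] = df[col].astype("category").cat.codes
    # 3. One-hot encode the remaining categorical (string) columns.
    df = pd.get_dummies(df, dtype=int)
    # 4. Downcast numeric columns to the smallest dtype that fits.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df


# Made-up rows for illustration; the last row has a missing age and is dropped.
raw = pd.DataFrame({
    "age": [39, 50, 38, None],
    "education": ["Bachelors", "HS-grad", "HS-grad", "Masters"],
    "race": ["White", "Black", "White", "White"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})
clean = preprocess(raw, label="income", protected_attributes=["race"])
print(clean.columns.tolist())
```

Label encoding keeps each protected attribute in a single column, which is why `race` appears as integer codes in the load_data output while other categorical variables are expanded into one column per value.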

fairdo.utils.helper module

Helper functions for the fairdo package.

fairdo.utils.helper.generate_pairs(lst)

Generate all possible pairs \((i, j)\) of elements from a list with \(i < j\).

Parameters: lst (array_like) – list of elements
Returns: list of pairs of elements
Return type: list

Examples

>>> from fairdo.utils.helper import generate_pairs
>>> generate_pairs([1, 2, 3])
[(1, 2), (1, 3), (2, 3)]
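
The same pairs can be produced with the standard library. The sketch below assumes that \(i < j\) refers to positions in the list; `pairs` is a hypothetical stand-in, not the function's actual implementation.

```python
from itertools import combinations


def pairs(lst):
    """Hypothetical stand-in: all pairs whose first element comes earlier in lst."""
    return list(combinations(lst, 2))


print(pairs([1, 2, 3]))    # [(1, 2), (1, 3), (2, 3)]
print(len(pairs("abcd")))  # 6, i.e. n * (n - 1) / 2 pairs for n = 4
```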

fairdo.utils.helper.nunique(a, axis=0)

Count the number of unique elements in an array along a given axis.

Parameters:

a (np.array) – The array whose unique elements are counted.
axis (int, optional) – The axis along which to count the number of unique elements. Default is 0.

Returns:

The number of unique elements along the given axis.

Return type:

np.array

Examples

>>> import numpy as np
>>> from fairdo.utils.helper import nunique
>>> nunique(np.array([1, 2, 3, 1, 2, 3]))
array([3])

>>> nunique(np.array([[1, 2, 3], [1, 2, 3]]), axis=1)
array([3, 3])

>>> nunique(np.array([[1, 2, 3], [1, 2, 3]]), axis=0)
array([1, 1, 1])
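
A common way to count unique values along an axis is to sort along that axis and count the positions where the value changes. The sketch below follows that idea; it is an assumption about the technique, not the library's code, `count_unique` is a hypothetical name, and it expects a non-empty numeric array.

```python
import numpy as np


def count_unique(a, axis=0):
    """Hypothetical helper: number of distinct values along `axis`."""
    a = np.asarray(a)
    s = np.sort(a, axis=axis)
    # After sorting, every change between neighbours marks a new distinct value.
    changes = np.diff(s, axis=axis) != 0
    return np.atleast_1d(changes.sum(axis=axis) + 1)


m = np.array([[1, 2, 3], [1, 2, 3]])
print(count_unique(np.array([1, 2, 3, 1, 2, 3])))  # [3]
print(count_unique(m, axis=1))                     # [3 3]
print(count_unique(m, axis=0))                     # [1 1 1]
```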

fairdo.utils.math module

Miscellaneous math functions used throughout the package.

References

fairdo.utils.math.conditional_entropy_cat(x: array, y: array) → float

Calculate the conditional entropy [1] of a categorical variable x given another categorical variable y, i.e.,

\[H(X|Y) = H(X, Y) - H(Y)\]

where \(H(X, Y)\) is the joint entropy of the categorical variables x and y and \(H(Y)\) is the entropy of the variable y.

Parameters:

x (np.array (n_samples,)) – Array of shape (n_samples,) containing the labels.
y (np.array (n_samples,) or (n_samples, n_variables)) – Array containing the labels. Can represent a single or multiple categorical variables.

Returns:

The conditional entropy \(H(X|Y)\) of x given y.

Return type:

float

Examples

>>> import numpy as np
>>> from fairdo.utils.math import conditional_entropy_cat
>>> x = np.array([0, 1, 1, 0, 1, 0, 0, 1])
>>> y = np.array([0, 1, 1, 0, 1, 0, 0, 1])
>>> conditional_entropy_cat(x, y)
0
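
The identity \(H(X|Y) = H(X, Y) - H(Y)\) can be checked numerically. The following sketch assumes that probabilities are estimated by relative frequencies; `entropy_bits` and `conditional_entropy` are hypothetical helpers written for illustration, not the library's implementation.

```python
import numpy as np


def entropy_bits(rows):
    """Entropy in bits of the empirical distribution over the unique rows."""
    rows = np.asarray(rows)
    if rows.ndim == 1:
        rows = rows.reshape(-1, 1)  # one variable: treat each value as a row
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(x, y):
    """H(X|Y) = H(X, Y) - H(Y), with one sample per row."""
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y)
    if y.ndim == 1:
        y = y.reshape(-1, 1)
    return entropy_bits(np.hstack([x, y])) - entropy_bits(y)


x = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_indep = np.array([0, 0, 1, 1, 0, 0, 1, 1])
print(conditional_entropy(x, x))        # 0.0: y determines x completely
print(conditional_entropy(x, y_indep))  # 1.0: y carries no information about x
```

The two calls mark the extremes: when y equals x, nothing about x remains uncertain, and when y is independent of x, the conditional entropy equals the entropy of x itself.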

fairdo.utils.math.entropy_estimate_cat(x: array, **kwargs) → float

Calculate the entropy [1] of a categorical variable. It is calculated as:

\[H(X) = - \sum_{i=1}^{n} p(X_i) \log_2 p(X_i)\]

where \(p(X_i)\) is the probability of the i-th category and \(n\) is the number of categories. The entropy measures the information/uncertainty of a random variable; higher values indicate more information/uncertainty.

Parameters: x (np.array (n_samples,)) – Array of shape (n_samples,) containing the categorical labels as numerical values.
Returns: The entropy of the label distribution.
Return type: float

Examples

>>> import numpy as np
>>> from fairdo.utils.math import entropy_estimate_cat
>>> x = np.array([0, 1, 1, 0, 1, 0, 0, 1])
>>> entropy_estimate_cat(x)
1.0
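
As a worked check of this result, assume the probabilities are estimated by relative frequencies. The array x contains four 0s and four 1s, so each of the two categories has probability 1/2:

\[H(X) = -\left(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\right) = 1\]

This agrees with the returned value of 1.0, i.e., one bit.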

fairdo.utils.math.joint_entropy_cat(x: array)

Calculate the joint entropy [1] of multiple categorical variables. The joint entropy is a measure of the information/surprise/uncertainty of a set of random variables. Let \(X = (X_1, X_2, \ldots, X_m)\) be a set of categorical variables, i.e., a multivariate random variable, then the joint entropy is calculated as:

\[H(X) = -\sum_{x_1 \in\mathcal X_1} \ldots \sum_{x_m \in\mathcal X_m} P(x_1, ..., x_m) \log_2[P(x_1, ..., x_m)]\]

Parameters: x (np.array (n_samples, n_variables)) – Array of shape (n_samples, n_variables) containing the labels as numerical values.
Returns: The joint entropy of the categorical variables in the array x.
Return type: float

Examples

>>> import numpy as np
>>> from fairdo.utils.math import joint_entropy_cat
>>> x = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
...               [0, 1, 1, 0, 1, 0, 0, 1]])
>>> joint_entropy_cat(x)
-0.0
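
As a worked check of this result, read the array with one sample per row, as the parameter description states, and assume the probabilities are estimated by relative frequencies. Both rows hold the same combination of values, so that combination has probability 1 and the sum reduces to a single term:

\[H(X) = -1 \cdot \log_2 1 = 0\]

The sign in the printed -0.0 is consistent with negating a floating-point zero; the value is still zero.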