All of these functions were run with their relatively “soft” default settings. Many parameters are available, allowing a more restrictive data cleaning where needed. Furthermore, the function klib.mv_col_handling() provides a sophisticated selection mechanism for columns with relatively many missing values. Instead of simply dropping these columns, they are converted into binary features (i.e. empty or not), checked for correlations among each other and with other features, and in a second step for correlations with the label, before a decision on omitting them is made.

Further, klib.pool_duplicate_subsets() can be applied, which ultimately reduces the dataset to only 3.8 MB (from 51 MB originally). This function “pools” columns together based on several settings. Specifically, the pooling is achieved by finding duplicates in subsets of the data and encoding the largest possible subset with sufficient duplicates with integers. These are then added to the original data, which allows dropping the previously identified and now encoded columns. While the encoding itself does not lead to a loss in information, some details might get lost in the aggregation step. While this is unlikely, it is advised to specifically exclude features that provide sufficient informational content by themselves, as well as the target column, by using the “exclude” setting.

As can be seen in cat_plot(), the “carrier” column is made up of a few very frequent values - the top 4 values account for roughly 75% - while in “tailnum” the top 4 values barely make up 2%. This allows us to pool and encode “carrier” and similar columns, while “tailnum” remains in the dataset. Using this procedure, 56006 duplicate rows are identified in the subset, i.e., 56006 rows in 10 columns are encoded into a single column of dtype integer, greatly reducing the memory footprint and the number of columns, which should speed up model training.

Functions for descriptive analytics

cat_plot ( data, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365' ) ¶

Two-dimensional visualization of the number and frequency of categorical features.

Parameters:

data :
    If a Pandas DataFrame is provided, the index/column information is used to label the plots
figsize : Tuple, optional
    Use to control the figure size, by default (18, 18)
top : int, optional
    Show the “top” most frequent values in a column, by default 3
bottom : int, optional
    Show the “bottom” most frequent values in a column, by default 3
bar_color_top : str, optional
    Use to control the color of the bars indicating the most common values, by default “#5ab4ac”
bar_color_bottom : str, optional
    Use to control the color of the bars indicating the least common values, by default “#d8b365”

corr_mat ( data, split: Optional = None, threshold: float = 0, target: Union = None, method: str = 'pearson', colored: bool = True ) → Union ¶

Returns a color-encoded correlation matrix.

Parameters:

data :
    If a Pandas DataFrame is provided, the index/column information is used to label the plots
split : Optional, optional
    Type of split to be performed, by default None

missingval_plot ( data, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE' ) ¶

Two-dimensional visualization of the missing values in a dataset.

Parameters:

data :
    2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
cmap : str, optional
    Any valid colormap can be used. More information can be found in the matplotlib documentation, by default “PuBuGn”
figsize : Tuple, optional
    Use to control the figure size, by default (20, 20)
sort : bool, optional
    Sort columns based on missing values in descending order and drop columns without any missing values, by default False
spine_color : str, optional
    Use to control the color of the spines, by default “#EEEEEE”

Returns:

    Returns the Axes object with the plot for further tweaking.
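The pooling mechanism described above can be illustrated with a small pure-pandas sketch. This is not klib's implementation, only the core idea under simplified assumptions: duplicates are counted within a column subset, and each unique combination in that subset is then replaced by a single integer-encoded column. The column names and toy data are illustrative.

```python
import pandas as pd

# Toy data: ("carrier", "origin") contains many repeated combinations,
# while "tailnum" is mostly unique and therefore not worth pooling.
df = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "AA", "UA", "AA"],
    "origin":  ["JFK", "JFK", "EWR", "JFK", "EWR", "LGA"],
    "tailnum": ["N1", "N2", "N3", "N4", "N5", "N6"],
})

subset = ["carrier", "origin"]

# Count duplicate rows within the subset (rows repeating an earlier combination).
n_duplicates = df.duplicated(subset=subset).sum()

# Encode each unique combination of the subset as one integer code,
# then drop the now-redundant original columns.
codes, _uniques = pd.factorize(df[subset].apply(tuple, axis=1))
pooled = df.drop(columns=subset).assign(pooled_subset=codes)

print(n_duplicates)                       # 3 duplicate rows in the subset
print(list(pooled["pooled_subset"]))      # [0, 0, 1, 0, 1, 2]
```

Note how the high-cardinality “tailnum” column is left untouched, mirroring the behavior described above: only subsets with sufficient duplicates are worth encoding, since a near-unique column would produce almost as many codes as rows and save nothing.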