API Reference¶
This page lists all of the estimators and top-level functions in dask_ml.
Unless otherwise noted, the estimators implemented in dask-ml are
appropriate for parallel and distributed training.
: Model Selection¶
Utilities for hyperparameter optimization.
These estimators will operate in parallel. Their scalability depends on the underlying estimators being used.
Dask-ML has a few cross validation utilities.
Split arrays into random train and test matrices. |
model_selection.train_test_split is a simple helper that
uses model_selection.ShuffleSplit.
Random permutation cross-validator. |
K-Folds cross-validator |
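The two cross-validators above differ in how they pick test indices. The following is a minimal standard-library sketch of the general ideas, not dask-ml's implementation; the function names `shuffle_split` and `k_fold` and their parameters are hypothetical and chosen only for illustration.

```python
import random


def shuffle_split(n_samples, n_splits=3, test_size=0.25, seed=0):
    """Yield (train, test) index lists drawn from independent random permutations."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_size)))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set, the rest the train set.
        yield indices[n_test:], indices[:n_test]


def k_fold(n_samples, n_splits=4):
    """Yield (train, test) index lists where every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(shuffle_split(8, n_splits=2, test_size=0.25))
folds = list(k_fold(10, n_splits=4))
```

With shuffle-split, a sample may appear in several test sets or in none; with K-fold, the test sets partition the data.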
Dask-ML provides drop-in replacements for grid and randomized search. These are appropriate for datasets where the CV splits fit in memory.
Exhaustive search over specified parameter values for an estimator. |
Randomized search on hyper parameters. |
For hyperparameter optimization on larger-than-memory datasets, Dask-ML provides the following:
Incrementally search for hyper-parameters on models that support partial_fit |
Find the best parameters for a particular model with an adaptive cross-validation algorithm. |
Perform the successive halving algorithm. |
Incrementally search for hyper-parameters on models that support partial_fit |
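To make the successive halving idea concrete, here is a minimal standard-library sketch of the general algorithm: score every candidate with a small budget, keep the better half, double the budget, and repeat. It is not dask-ml's implementation. The `evaluate` callback, the `toy_score` function, and the budget schedule are assumptions made for this example.

```python
def successive_halving(candidates, evaluate, initial_budget=1):
    """Return the last surviving candidate and a (budget, survivors) history.

    `evaluate(candidate, budget)` is a hypothetical callback that trains
    `candidate` with `budget` units of work and returns a score where
    higher is better.
    """
    survivors = list(candidates)
    budget = initial_budget
    history = []
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // 2)]  # keep the better half
        history.append((budget, len(survivors)))
        budget *= 2  # survivors get twice the training budget next round
    return survivors[0], history


def toy_score(learning_rate, budget):
    """Toy objective: the score peaks at a learning rate of 0.1."""
    return -abs(learning_rate - 0.1) * (1 + 1 / budget)


best, history = successive_halving(
    [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0, 2.0], toy_score
)
```

Most of the total budget is spent on the few candidates that keep scoring well, which is what makes this family of searches attractive when each training step is expensive.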
: Ensemble Methods¶
Blockwise training and ensemble voting classifier. |
Blockwise training and ensemble voting regressor. |
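As an illustration of blockwise training with ensemble voting, the sketch below fits one independent model per block of data and combines their predictions by majority vote. It uses only the standard library and a toy one-dimensional classifier; the class `ThresholdClassifier` and the helpers `fit_per_block` and `vote` are hypothetical names, not part of dask-ml.

```python
from collections import Counter


class ThresholdClassifier:
    """Toy 1-D classifier: predicts 1 when x exceeds the midpoint of the class means."""

    def fit(self, xs, ys):
        mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
        mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
        self.threshold = (mean0 + mean1) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


def fit_per_block(blocks):
    """Train one independent model on each (xs, ys) block."""
    return [ThresholdClassifier().fit(xs, ys) for xs, ys in blocks]


def vote(models, xs):
    """Hard voting: each sample receives the label predicted by most models."""
    per_model = [model.predict(xs) for model in models]
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*per_model)]


blocks = [
    ([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]),    # learns threshold 5.0
    ([0.0, 3.0, 7.0, 10.0], [0, 0, 1, 1]),   # learns threshold 5.0
    ([2.0, 4.0, 9.0, 11.0], [0, 0, 1, 1]),   # learns threshold 6.5
]
models = fit_per_block(blocks)
predictions = vote(models, [1.5, 6.0, 9.5])
```

Because each block is fitted independently, the fits can run in parallel; a regressor variant would average the per-model predictions instead of voting.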
: Generalized Linear Models¶
The dask_ml.linear_model
module implements linear models for
classification and regression.
Estimator for linear regression. |
Estimator for logistic regression. |
Estimator for Poisson regression. |
: Naive Bayes¶
Fit a naive Bayes model with a Gaussian likelihood. |
: Meta-Estimators¶
dask-ml provides some meta-estimators that help use regular estimators that follow the scikit-learn API. These meta-estimators make the underlying estimator work well with Dask Arrays or DataFrames.
Meta-estimator for parallel predict and transform. |
Meta-estimator for feeding Dask Arrays to an estimator blockwise. |
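The two patterns these meta-estimators describe can be sketched without Dask: feed blocks to an estimator's partial_fit one at a time, then call predict on each block independently and concatenate the results. The following standard-library sketch assumes a toy estimator; `RunningMeanRegressor`, `feed_blocks`, and `predict_blockwise` are hypothetical names and do not reflect how dask-ml implements its wrappers.

```python
from concurrent.futures import ThreadPoolExecutor


class RunningMeanRegressor:
    """Toy estimator with partial_fit: predicts the mean of all targets seen so far."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def partial_fit(self, xs, ys):
        self.count += len(ys)
        self.total += sum(ys)
        return self

    def predict(self, xs):
        return [self.total / self.count for _ in xs]


def feed_blocks(model, blocks):
    """Pass (xs, ys) blocks to partial_fit sequentially, in order."""
    for xs, ys in blocks:
        model.partial_fit(xs, ys)
    return model


def predict_blockwise(model, x_blocks):
    """Run predict on each block independently and concatenate the outputs."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(model.predict, x_blocks))
    return [value for part in parts for value in part]


blocks = [([0, 1], [2.0, 4.0]), ([2, 3, 4], [6.0, 8.0, 10.0])]
model = feed_blocks(RunningMeanRegressor(), blocks)
predictions = predict_blockwise(model, [[5, 6], [7]])
```

Training is inherently sequential here because each partial_fit call updates shared state, while prediction is embarrassingly parallel once the model is fitted.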
: Clustering¶
Unsupervised Clustering Algorithms
Scalable KMeans for clustering |
Apply parallel Spectral Clustering |
: Matrix Decomposition¶
Incremental principal components analysis (IPCA). |
Principal component analysis (PCA) |
Methods |
: Preprocessing Data¶
Utilities for preprocessing data.
Standardize features by removing the mean and scaling to unit variance. |
Scale features using statistics that are robust to outliers. |
Transform features by scaling each feature to a given range. |
Transforms features using quantile information. |
Transform columns of a DataFrame to categorical dtype. |
Dummy (one-hot) encode categorical columns. |
Ordinal (integer) encode categorical columns. |
Encode labels with value between 0 and n_classes-1. |
Generate polynomial and interaction features. |
Construct a transformer from an arbitrary callable. |
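Several of the transformations above reduce to short formulas. The sketch below shows standardization, range scaling, and ordinal encoding on a single column using only the standard library. The function names are hypothetical, the sketch assumes a non-constant numeric column (otherwise it divides by zero), and the actual estimators operate on whole arrays or DataFrames rather than Python lists.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Remove the mean and scale to unit variance (population standard deviation)."""
    mean, std = fmean(column), pstdev(column)
    return [(x - mean) / std for x in column]


def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Linearly map the column so its minimum and maximum hit feature_range."""
    low, high = feature_range
    col_min, col_max = min(column), max(column)
    return [low + (x - col_min) * (high - low) / (col_max - col_min) for x in column]


def ordinal_encode(column):
    """Replace each category with its index among the sorted unique categories."""
    categories = sorted(set(column))
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup[value] for value in column], categories


scaled = standardize([2.0, 4.0, 6.0, 8.0])
ranged = min_max_scale([2.0, 4.0, 6.0, 8.0])
codes, categories = ordinal_encode(["red", "blue", "red", "green"])
```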
: Feature extraction¶
Convert a collection of text documents to a matrix of token counts |
Convert a collection of text documents to a matrix of token occurrences. |
Implements feature hashing, aka the hashing trick. |
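The hashing trick mentioned above maps tokens to a fixed number of columns without storing a vocabulary. Here is a minimal standard-library sketch of the idea. It uses MD5 from hashlib only because it is deterministic across runs; the hash function, the sign rule, and the name `hash_features` are assumptions for this example, not the library's choices.

```python
import hashlib


def hash_features(tokens, n_features=8):
    """Map a list of tokens to a fixed-length vector of signed counts."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        index = value % n_features               # column chosen by the hash
        sign = 1 if (value >> 63) & 1 == 0 else -1  # sign taken from the top bit
        vector[index] += sign
    return vector


doc_vector = hash_features(["dask", "scales", "python", "dask"], n_features=8)
repeat_vector = hash_features(["dask", "dask"], n_features=8)
```

The output width is fixed in advance, so distinct tokens can collide in the same column; the alternating sign is a common way to keep collisions from systematically inflating counts.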
: Composite Estimators¶
Meta-estimators for building composite models with transformers.
Meta-estimators for composing models with multiple transformers.
These estimators are useful for working with heterogeneous tabular data.
Applies transformers to columns of an array or pandas DataFrame. |
Construct a ColumnTransformer from the given transformers. |
: Imputing Missing Data¶
Methods |
: Metrics¶
Score functions, performance metrics, and pairwise distance computations.
Regression Metrics¶
Mean absolute error regression loss. |
Mean absolute percentage error regression loss. |
Mean squared error regression loss. |
Mean squared logarithmic error regression loss. |
\(R^2\) (coefficient of determination) regression score function. |
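For reference, the regression metrics listed above follow the standard textbook definitions. The sketch below computes them on plain Python lists with the standard library only; it is not the dask-ml implementation, which works on Dask arrays, and it omits options such as sample weights.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    # Undefined when a true value is zero; this sketch does not guard against it.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_log_error(y_true, y_pred):
    # Compares log(1 + y), so it requires non-negative values.
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def r2_score(y_true, y_pred):
    # 1 minus the ratio of residual to total sum of squares.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 3.0, 4.0]
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
msle = mean_squared_log_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```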
Classification Metrics¶
Accuracy classification score. |
Log loss, aka logistic loss or cross-entropy loss. |
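The two classification metrics above can be written out in a few lines. This standard-library sketch covers the binary case only and assumes `probabilities` holds the predicted probability of class 1; the `eps` clipping value is an assumption for the example, used to avoid taking the logarithm of zero.

```python
import math


def accuracy_score(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, probabilities, eps=1e-15):
    """Binary cross-entropy averaged over samples."""
    total = 0.0
    for t, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


accuracy = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
uninformed_loss = log_loss([1, 0], [0.5, 0.5])
confident_loss = log_loss([1, 0], [0.9, 0.1])
```

Accuracy only looks at the predicted label, while log loss also rewards well-calibrated probabilities: the confident correct predictions score a lower loss than the uninformed 0.5 guesses.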
: XGBoost¶
Train an XGBoost model on Dask arrays or DataFrames.
This may be used for training an XGBoost model on a cluster. XGBoost
will be set up in distributed mode alongside your existing cluster.
Train an XGBoost model on a Dask Cluster |
Distributed prediction with XGBoost |
: Datasets¶
dask-ml provides some utilities for generating toy datasets.
Generate a dummy dataset for modeling count data. |
Generate isotropic Gaussian blobs for clustering. |
Generate a random regression problem. |
Uses the make_classification function to create a dask dataframe for testing. |