API Reference
This page lists all of the estimators and top-level functions in dask_ml.
Unless otherwise noted, the estimators implemented in dask-ml are appropriate for parallel and distributed training.
dask_ml.model_selection: Model Selection
Utilities for hyperparameter optimization.
These estimators will operate in parallel. Their scalability depends on the underlying estimators being used.
Dask-ML has a few cross validation utilities.
|
Split arrays into random train and test matrices. |
model_selection.train_test_split() is a simple helper that uses model_selection.ShuffleSplit internally.
|
Random permutation cross-validator. |
|
K-Folds cross-validator |
Dask-ML provides drop-in replacements for grid and randomized search. These are appropriate for datasets where the CV splits fit in memory.
|
Exhaustive search over specified parameter values for an estimator. |
|
Randomized search on hyper parameters. |
For hyperparameter optimization on larger-than-memory datasets, Dask-ML provides the following:
|
Incrementally search for hyper-parameters on models that support partial_fit |
|
Find the best parameters for a particular model with an adaptive cross-validation algorithm. |
|
Perform the successive halving algorithm [R424ea1a907b1-1]. |
|
Incrementally search for hyper-parameters on models that support partial_fit |
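The successive halving idea mentioned in the table above can be sketched in plain Python. This is a generic illustration of the algorithm, not Dask-ML's implementation; the names `successive_halving` and `toy_score`, and the `eta` and `initial_budget` parameters, are hypothetical.

```python
def successive_halving(configs, train_and_score, initial_budget=1, eta=3):
    """Return the configuration that survives repeated pruning.

    configs: candidate hyperparameter settings.
    train_and_score: hypothetical callable (config, budget) -> score, higher is better.
    eta: each round keeps the best 1/eta of the candidates and multiplies the budget by eta.
    """
    survivors = list(configs)
    budget = initial_budget
    while len(survivors) > 1:
        # Train every surviving candidate with the current budget and rank by score.
        scored = [(train_and_score(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Keep the top 1/eta (at least one) and give the survivors more budget.
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta
    return survivors[0]


def toy_score(alpha, budget):
    # Stand-in for "train for `budget` steps, then validate": best near alpha = 0.1.
    return -abs(alpha - 0.1) - 1.0 / budget


candidates = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0, 10.0]
best = successive_halving(candidates, toy_score)
print(best)
```

The point of the design is that weak candidates are discarded after a small training budget, so most of the compute goes to the promising ones.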
dask_ml.ensemble: Ensemble Methods
|
Blockwise training and ensemble voting classifier. |
|
Blockwise training and ensemble voting regressor. |
dask_ml.linear_model: Generalized Linear Models
The dask_ml.linear_model module implements linear models for classification and regression.
|
Estimator for linear regression. |
|
Estimator for logistic regression. |
|
Estimator for Poisson regression. |
dask_ml.naive_bayes: Naive Bayes
|
Fit a naive Bayes model with a Gaussian likelihood. |
dask_ml.wrappers: Meta-Estimators
dask-ml provides some meta-estimators that help use regular estimators that follow the scikit-learn API. These meta-estimators make the underlying estimator work well with Dask Arrays or DataFrames.
|
Meta-estimator for parallel predict and transform. |
|
Meta-estimator for feeding Dask Arrays to an estimator blockwise. |
dask_ml.cluster: Clustering
Unsupervised Clustering Algorithms
|
Scalable KMeans for clustering |
|
Apply parallel Spectral Clustering |
dask_ml.decomposition: Matrix Decomposition
|
Incremental principal components analysis (IPCA). |
|
Principal component analysis (PCA) |
|
Methods |
dask_ml.preprocessing: Preprocessing Data
Utilities for preprocessing data.
|
Standardize features by removing the mean and scaling to unit variance. |
|
Scale features using statistics that are robust to outliers. |
|
Transform features by scaling each feature to a given range. |
|
Transforms features using quantile information. |
|
Transform columns of a DataFrame to categorical dtype. |
|
Dummy (one-hot) encode categorical columns. |
|
Ordinal (integer) encode categorical columns. |
|
Encode labels with value between 0 and n_classes-1. |
|
Generate polynomial and interaction features. |
|
Construct a transformer from an arbitrary callable. |
dask_ml.feature_extraction.text: Feature extraction
|
Convert a collection of text documents to a matrix of token counts. |
|
Convert a collection of text documents to a matrix of token occurrences. |
|
Implements feature hashing, aka the hashing trick. |
dask_ml.compose: Composite Estimators
Meta-estimators for building composite models with multiple transformers.
These estimators are useful for working with heterogeneous tabular data.
|
Applies transformers to columns of an array or pandas DataFrame. |
|
Construct a ColumnTransformer from the given transformers. |
dask_ml.impute: Imputing Missing Data
|
Methods |
dask_ml.metrics: Metrics
Score functions, performance metrics, and pairwise distance computations.
Regression Metrics
|
Mean absolute error regression loss. |
|
Mean absolute percentage error regression loss. |
|
Mean squared error regression loss. |
|
Mean squared logarithmic error regression loss. |
|
\(R^2\) (coefficient of determination) regression score function. |
Classification Metrics
|
Accuracy classification score. |
|
Log loss, aka logistic loss or cross-entropy loss. |
dask_ml.xgboost: XGBoost
Train an XGBoost model on dask arrays or dataframes.
This may be used for training an XGBoost model on a cluster. XGBoost will be set up in distributed mode alongside your existing dask.distributed cluster.
|
|
|
|
|
Train an XGBoost model on a Dask Cluster |
|
Distributed prediction with XGBoost |
dask_ml.datasets: Datasets
dask-ml provides some utilities for generating toy datasets.
|
Generate a dummy dataset for modeling count data. |
|
Generate isotropic Gaussian blobs for clustering. |
|
Generate a random regression problem. |
|
|
|
Uses the make_classification function to create a dask dataframe for testing. |