Contributing
Thanks for helping to build dask-ml!
Cloning the Repository
Make a fork of the dask-ml repo and clone the fork:
git clone https://github.com/<your-github-username>/dask-ml
cd dask-ml
You may want to add https://github.com/dask/dask-ml as an upstream remote repository:
git remote add upstream https://github.com/dask/dask-ml
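If you want to confirm the remotes are set up as expected, or pull in the latest upstream changes, the standard git commands apply (a quick check, not a required step):
git remote -v
git fetch upstream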
Creating an environment
We have conda environment YAML files with all the necessary dependencies in the ci directory. Run
conda env create -f ci/environment-latest.yaml --name=dask-ml-dev
to create a conda environment and install all the dependencies.
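Once the environment is created, activate it before working on dask-ml:
conda activate dask-ml-dev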
If you’re using pip, you can view the list of all the required and optional dependencies within setup.py (see the install_requires field for required dependencies and extras_require for optional dependencies).
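For example, to find where the optional dependency groups are declared, you can search the file directly (a plain text search, nothing dask-ml-specific):
grep -n "extras_require" setup.py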
Building dask-ml
Dask-ML is a pure-python repository. Development installation should be as simple as cloning the repository and running the following in the cloned directory:
pip:
python -m pip install -e ".[dev]"
conda:
conda activate dask-ml-dev
python -m pip install --no-deps -e .
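Either way, a quick sanity check that the development install worked is to import the package (this assumes dask_ml exposes a __version__ attribute, which current releases do):
python -c "import dask_ml; print(dask_ml.__version__)"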
If you have any trouble, please open an issue on the dask-ml issue tracker.
Style
Dask-ML uses black for formatting, flake8 for linting, and isort for sorting imports. If you installed dask-ml with python -m pip install -e ".[dev]", these tools will already be installed.
black .
flake8
isort -rc dask_ml tests
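If you only want to report formatting problems without modifying any files (similar to what CI does), black supports a check-only mode, and flake8 already only reports issues:
black --check .
flake8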
You may wish to set up a pre-commit hook to run black when you commit changes (see Pre-Commit Hooks below).
Running tests
Dask-ML uses pytest for testing. You can run tests from the main dask-ml directory as follows:
pytest tests
Alternatively, you may choose to run only a subset of the full test suite. For example, to test only the preprocessing submodule, run:
pytest tests/preprocessing
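All of the usual pytest selection options work as well; for example, -k filters tests by name (the pattern "scaler" below is just an illustration) and -x stops at the first failure:
pytest tests -k scaler -x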
Coverage
If your Pull Request decreases the lines of code covered, the CI may fail. Sometimes this is OK, and a maintainer will merge it anyway. To check the coverage locally, use:
pytest --cov --cov-report=html
You can still use all the usual pytest command-line options in addition to those.
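To keep the run short, coverage reporting can also be combined with a test subset; --cov accepts the package to measure (dask_ml here):
pytest tests/preprocessing --cov=dask_ml --cov-report=html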
Pre-Commit Hooks
Install and set up the pre-commit tool by running
python -m pip install pre-commit
pre-commit install
to install a few plugins like black, isort, and flake8. These tools will automatically be run on each commit. You can skip the checks with git commit --no-verify.
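You can also run all the configured hooks against the whole repository at any time, which is handy before pushing a branch:
pre-commit run --all-files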
Conventions
For the most part, we follow scikit-learn’s API design. If you’re implementing a new estimator, it will ideally pass scikit-learn’s estimator checks.
We have some additional decisions to make in the dask context. Ideally

1. All attributes learned during .fit should be concrete, i.e. they should not be dask collections.
2. To the extent possible, transformers should support
   - numpy.ndarray
   - pandas.DataFrame
   - dask.Array
   - dask.DataFrame
3. If possible, transformers should accept a columns keyword to limit the transformation to just those columns, while passing through other columns untouched. inverse_transform should behave similarly (ignoring other columns) so that inverse_transform(transform(X)) equals X.
4. Methods returning arrays (like .transform, .predict) should return the same type as the input. So if a dask.array is passed in, a dask.array with the same chunks should be returned.
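As a concrete illustration, here is a minimal sketch of a transformer following these conventions. It is not an actual dask-ml estimator; the name ColumnScaler, the centering behaviour, and the assumption of numeric columns are all illustrative.

# Hypothetical example only -- not part of dask-ml. It illustrates the
# conventions above: concrete fitted attributes, a ``columns`` keyword,
# and returning the same collection type that was passed in.
import dask
import dask.dataframe as dd
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnScaler(BaseEstimator, TransformerMixin):
    """Subtract the column means, optionally only for selected columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        if isinstance(X, (pd.DataFrame, dd.DataFrame)):
            cols = list(self.columns) if self.columns is not None else list(X.columns)
            mean = X[cols].mean()
        else:
            mean = X.mean(axis=0)
        # Learned attributes are concrete: dask.compute turns lazy results
        # into NumPy/pandas objects and passes eager inputs through unchanged.
        (self.mean_,) = dask.compute(mean)
        return self

    def transform(self, X):
        if isinstance(X, (pd.DataFrame, dd.DataFrame)):
            cols = list(self.columns) if self.columns is not None else list(X.columns)
            X = X.copy()
            # Only the requested columns are transformed; others pass through.
            for col in cols:
                X[col] = X[col] - self.mean_[col]
            return X
        # Array in, array out: a dask.array input yields a dask.array output.
        return X - self.mean_

    def inverse_transform(self, X):
        if isinstance(X, (pd.DataFrame, dd.DataFrame)):
            cols = list(self.columns) if self.columns is not None else list(X.columns)
            X = X.copy()
            for col in cols:
                X[col] = X[col] + self.mean_[col]
            return X
        return X + self.mean_

Fitting this on a dask.dataframe leaves mean_ as a plain pandas Series, transform returns another dask.dataframe, and inverse_transform(transform(X)) recovers X for the transformed columns.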
Documentation
We use numpydoc for our docstrings.
Building the docs is possible with:
$ conda env create -f ci/environment-docs.yaml --name=dask-ml-dev-docs
$ conda activate dask-ml-dev-docs
$ python -m pip install -e .
$ cd docs
$ make html
Examples are written as Jupyter notebooks with their output stripped, either manually or using nbstripout. We want examples to be runnable on binder so they should be small, but include instructions for how to scale up to larger problems.
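If you use nbstripout, stripping the output from a notebook before committing is a one-liner (the notebook path below is just a placeholder):
$ python -m pip install nbstripout
$ nbstripout docs/source/examples/your-example.ipynb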
The source for most examples is maintained in the dask-examples repository. Updates should be made there, and they’re automatically included as part of the Dask-ML documentation build process.
When adding an example for a new feature that’s only available in main, the notebook should first be included in the Dask-ML repository under docs/source/examples/. These examples will be moved to dask/dask-examples as part of the Dask-ML release process.