distributed-dataset

A distributed data processing framework in pure Haskell. Inspired by Apache Spark.

Packages

distributed-dataset

This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.
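
As a rough sketch, a pipeline over a Dataset could look like the following. The combinator names (dConcatMap, dGroupedAggr, dCount, dToList), the DD monad, and the use of static closures are assumptions based on the Spark-like API described above; check the Haddock documentation for the exact signatures.

    {-# LANGUAGE StaticPointers #-}

    import Control.Distributed.Dataset     -- assumed module name
    import Control.Monad.IO.Class (liftIO)
    import Data.Function ((&))
    import qualified Data.Text as T

    -- A hypothetical word count over a distributed multiset of text lines.
    -- Every combinator below is an assumption about the API, not a verbatim copy.
    wordCount :: Dataset T.Text -> DD ()
    wordCount ds =
      ds
        & dConcatMap (static T.words)        -- split each line into words
        & dGroupedAggr 50 (static id) dCount -- count occurrences per word
        & dToList                            -- collect the results on the driver
        >>= mapM_ (liftIO . print)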

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
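
To illustrate, running a single IO action on an executor might look roughly like this; initDistributedFork, fork, await, and the local-process backend are assumed names here, so treat it as an outline rather than the module's exact API.

    {-# LANGUAGE StaticPointers #-}

    import Control.Distributed.Fork                      -- assumed module name
    import Control.Distributed.Fork.LocalProcessBackend  -- assumed local backend
    import Data.Constraint (Dict (Dict))                 -- from the 'constraints' package

    main :: IO ()
    main = do
      initDistributedFork  -- assumed: set up the remote entry point before forking
      -- Run a closed IO action on an executor; with the AWS backend this
      -- would be a Lambda invocation instead of a local process.
      handle <- fork localProcessBackend (static Dict) (static (pure (21 * 2 :: Int)))
      result <- await handle
      print result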

distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
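
Wiring this backend into a program could look roughly like the sketch below; withLambdaBackend, lambdaBackendOptions, s3ShuffleStore, and runDD are assumed names, and my-s3-bucket is a placeholder.

    {-# LANGUAGE OverloadedStrings #-}

    import Control.Distributed.Dataset      -- assumed: DD, runDD
    import Control.Distributed.Dataset.AWS  -- assumed module name

    -- Sketch only: every name below is an assumption about the API.
    runOnLambda :: DD a -> IO a
    runOnLambda dd =
      withLambdaBackend (lambdaBackendOptions "my-s3-bucket") $ \backend ->
        runDD backend (s3ShuffleStore "my-s3-bucket") dd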

distributed-dataset-opendatasets

Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.
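
Feeding those events into a pipeline might look like the following sketch; ghArchive, dAggr, and dCount are assumed names, used only to show where the open dataset plugs in.

    import Control.Distributed.Dataset                         -- assumed module name
    import Control.Distributed.Dataset.OpenDatasets.GHArchive  -- assumed module name
    import Control.Monad.IO.Class (liftIO)
    import Data.Function ((&))
    import Data.Time.Calendar (fromGregorian)

    -- Count the GitHub events GH Archive recorded over two days.
    -- 'ghArchive' and the aggregation combinators are assumptions.
    eventCount :: DD ()
    eventCount =
      ghArchive (fromGregorian 2019 1 1, fromGregorian 2019 1 2)
        & dAggr dCount
        >>= liftIO . print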

Running the example

  • Clone the repository.

    $ git clone https://github.com/utdemir/distributed-dataset
    $ cd distributed-dataset
  • Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:

    $ aws configure
  • Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:

    $ aws s3api create-bucket --bucket my-s3-bucket
  • Build and run the example:

    • If you use Nix on Linux:

      • (Recommended) Use my binary cache on Cachix to reduce compilation times:

        $ nix-env -i cachix # or your preferred installation method
        $ cachix use utdemir
      • Then:

        $ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
    • If you use stack (requires Docker, works on Linux and MacOS):

      $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket

Stability

Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.

Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

  • In order to develop distributed-dataset, you can use:
    • On Linux: Nix, cabal-install, or stack.
    • On MacOS: stack with Docker.
  • Use ormolu to format source code.

Nix

  • You can use my binary cache on Cachix so that you don't recompile half of Hackage.
  • nix-shell will drop you into a shell with ormolu, cabal-install, and steeloverseer, alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
  • The easiest way to get a development environment is to run sos at the top-level directory inside a nix-shell.

Stack

  • Make sure that you have Docker installed.
  • Use stack as usual; it will automatically use a Docker image.
  • Run ./make.sh stack-build before you send a PR to test different resolvers.

Related Work

Papers

Projects
