forked from indix/ml2npy


vumaasha/ml2npy

 
 

ml2npy - Export Spark ML SparseVectors as a NumPy CSR matrix

The aim of this project is to provide tools that efficiently implement the components required for large-scale text mining.

The idea for this project came from two observations drawn from experience:

  1. Most of the time, data preprocessing is the expensive and demanding part.
  2. Distributed algorithm implementations are still not as effective as multicore or sequential implementations.

This project intends to leverage the best of both worlds. In text mining, a traditional and powerful approach is to use TF-IDF as the numerical representation of a document, which enables a variety of machine learning techniques to be readily applied to the data. Converting a document into TF-IDF or any other numerical format is compute-intensive, but once a numerical representation is available, various algorithms and models can be tried on the preprocessed data.
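To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It is not ml2npy's implementation: the function name `tf_idf`, the whitespace tokenizer, and the unsmoothed weighting (term frequency times `log(N / df)`) are assumptions chosen for illustration, and libraries commonly use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (illustrative variant)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "spark exports sparse vectors",
    "numpy loads sparse matrices",
    "spark and numpy",
]
weights = tf_idf(corpus)
```

Each document ends up with weights only for the terms it contains, which is why the resulting document-by-term matrix is mostly zeros once the vocabulary grows.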

Numerical representations of text tend to be very sparse. Choosing a sparse matrix format to store this data saves memory and disk space. ml2npy provides tools and utilities to load a large corpus of text and save its numerical representation as a CSR matrix in NumPy format.
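As a rough illustration of why CSR (compressed sparse row) is compact, the sketch below builds the three CSR arrays from sparse rows in plain Python. The helper names `to_csr` and `csr_row` are hypothetical, and nothing here describes how ml2npy itself lays the arrays out on disk.

```python
def to_csr(rows, n_cols):
    """Convert a list of {column_index: value} dicts to CSR arrays.

    Returns (data, indices, indptr): the non-zero values, their column
    indices, and the offsets where each row starts in those two lists.
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col in sorted(row):
            if not 0 <= col < n_cols:
                raise ValueError(f"column {col} out of range for {n_cols} columns")
            if row[col] != 0:  # explicit zeros are not stored
                data.append(row[col])
                indices.append(col)
        indptr.append(len(data))  # end of this row == start of the next
    return data, indices, indptr


def csr_row(data, indices, indptr, i):
    """Recover row i as a {column_index: value} dict."""
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))


# Three documents over a four-term vocabulary; the second is empty.
rows = [{0: 1.5, 3: 2.0}, {}, {1: 0.5, 2: 4.0, 3: 1.0}]
data, indices, indptr = to_csr(rows, n_cols=4)
# data    == [1.5, 2.0, 0.5, 4.0, 1.0]
# indices == [0, 3, 1, 2, 3]
# indptr  == [0, 2, 2, 5]
```

Only the five non-zero values are stored instead of all twelve cells. The same `(data, indices, indptr)` triple, together with a shape, is the form accepted by `scipy.sparse.csr_matrix` on the Python side.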

Why Npy format?

The Python and scikit-learn ecosystem has made machine learning much more accessible. Being able to load the data into Python means that many algorithms can be applied easily.
