GSoC 2018 project ideas
A list of ideas for Google Summer of Code 2018: new functionality and projects in Gensim, topic modelling for humans.
Potential mentors:
First of all, please have a look at Gensim's roadmap 2018, which describes our main targets for this year; also check the GSoC 2018 Guide.
You can suggest any project related to NLP and machine learning that, in your opinion, would be a successful addition to Gensim, but taking our wishes into account will improve your chances of being accepted.
Below you will find the directions that we would be very happy to see in Gensim.
Difficulty: Medium; requires excellent C and optimization skills
Background: A package for working with sparse matrices, optimized with Cython: memory-efficient and fast. An improvement on (or replacement for) scipy's recently deprecated sparsetools package, which is single-threaded, memory-hungry (it makes unnecessary copies), and too slow.
Should also include fast (Cythonized) transformations between the "Gensim streamed corpus" and various formats (scipy.sparse, numpy...). Similar to our existing matutils module. Must support fast operations with sparse matrices, such as multiplying a sparse CSC matrix with a dense matrix, multiplying CSC by a random matrix etc.
To do: Develop a new package to replace scipy.sparse in Gensim. It should significantly increase the performance of sparse multiplications by making use of multiple cores.
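To make the target operation concrete, here is a minimal sketch (using today's scipy.sparse, with made-up toy dimensions) of the kind of CSC-times-dense multiplication the new package should accelerate with multiple cores:

```python
import numpy as np
from scipy import sparse

# Toy benchmark of the core operation the new package should speed up:
# multiplying a sparse CSC matrix by a dense matrix. scipy.sparse runs
# this single-threaded; the goal is a multi-core Cython replacement.
rng = np.random.default_rng(42)
csc = sparse.random(1000, 500, density=0.01, format="csc", random_state=42)
dense = rng.standard_normal((500, 20))

result = csc @ dense  # the bottleneck operation: CSC x dense
print(result.shape)   # (1000, 20)
```

A Cython replacement would implement the same operation over the raw CSC arrays (`data`, `indices`, `indptr`), releasing the GIL so column blocks can be processed in parallel.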
Resources:
Difficulty: Medium; requires good API design skills
Background: On powerful machines (>10 cores) we lose the near-linear scalability of our parallelized models. The reason is "worker starvation": on-the-fly reading of the input corpus from its persistent storage (disk, database, generated on the fly) and chunking it into jobs for the parallel workers becomes the bottleneck. The master worker is overwhelmed managing the data reading, chunking and distribution to workers. An example is described here. To fix this, we want to allow our users to supply their input as multiple streams, as opposed to the single stream used now. This new multi-stream API should be supported by all Gensim models (word2vec, LDA, LSI, etc.).
To do:
- Understand the current bottlenecks: how they arise and why
- Develop the API for a "multi-threaded corpus"
- Integrate the new API into all models, e.g. by introducing a new parameter `input_streams` (of which the current `corpus` parameter is a special case: `input_streams = [corpus]`)
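A minimal sketch of what the proposed API could look like (the `input_streams` parameter and the fallback below are illustrative assumptions, not the final design):

```python
from itertools import chain

# Hypothetical multi-stream input: each stream is an independent iterable
# of documents (e.g. one per file shard), so each worker process can read
# its own stream directly, instead of a single master reading everything.
stream1 = [["human", "interface", "computer"]]
stream2 = [["graph", "minors", "survey"], ["eps", "user", "interface"]]

input_streams = [stream1, stream2]

# The current single-stream API is the special case input_streams = [corpus];
# a model could fall back to a flat iterator when it cannot parallelize:
corpus = list(chain.from_iterable(input_streams))
print(len(corpus))  # 3 documents in total
```

The key design point is that each stream can be consumed independently and in parallel, removing the single-reader bottleneck described above.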
Resources:
Difficulty: Medium; requires excellent UX skills and native English
Background: We already have a large number of models; therefore, we want to pay more attention to model quality (documentation and model discovery being the main things here). If we have a great model that users don't know how (or when) to use, they won't use it! For this reason, we want to significantly improve our documentation.
To do:
- [already under way, WIP] Consistent docstrings for all methods and classes in Gensim
- A new "beginner tutorial chain": an API-centric walk through Gensim terminology, design choices, ways of doing things the Gensim way, best practices, and an FAQ
- Use-case-centric user guides for all major models and use-case pipelines (sphinx-gallery), focusing on how to solve a concrete popular task X
- A new, slick project website: the current website https://radimrehurek.com/gensim/ is very popular in terms of visitors, but looks embarrassingly dated.
- Improved UX: analysis of visitor flow, minimizing clicks for common documentation patterns, a logical structure for all documentation, intuitive navigation, and better information discovery for the different visitor types (newbies, API docs, use-case docs, power users…)
Resources:
- Numpy docstring style
- Numpy docstring style from sphinx
- Sphinx-Gallery
- Gensim documentation project
- Daniele Procida: "How documentation works, and how to make it work for your project", video, PyCon 2017
- Daniele Procida: "What nobody tells you about documentation", blogpost
Difficulty: Hard
Background: Non-negative matrix factorization (NMF) is an algorithm similar to Latent Semantic Analysis / Latent Dirichlet Allocation. It belongs to the family of matrix factorization methods and can be phrased as an online learning algorithm, where the model is updated on the fly, without the entire training corpus residing in main memory or requiring random access to individual documents. We want a streamed, on-the-fly, efficient implementation.
To do: Based on an existing online parallel implementation in libmf, implement NNMF in Python/Cython in Gensim and evaluate its quality and performance. Must support multiple cores on the same machine (multi-threading or multi-processing) and online training.
The implementation must accept input data in a streamed format (a sequence of document vectors). It can use NumPy/SciPy as its building blocks, pushing as much number crunching as possible into low-level (ideally, BLAS) routines.
We aim for a robust, industry-strength implementation in Gensim, not flimsy academic code: check corner cases and summarize performance insights into documentation tips and usage examples.
Evaluation can use the Lee corpus of human similarity judgments included with Gensim or evaluate in some other way.
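For orientation, here is a toy batch NMF via the classic Lee & Seung multiplicative updates (illustrative only; the GSoC project targets a streamed, online, multi-core variant, not this in-memory loop):

```python
import numpy as np

# Toy batch NMF: factorize V ≈ W @ H with all factors non-negative.
rng = np.random.default_rng(0)
V = rng.random((6, 4))          # toy term-document matrix
k = 2                           # number of topics
W = rng.random((6, k)) + 0.1    # positive init keeps updates non-negative
H = rng.random((k, 4)) + 0.1

for _ in range(200):            # multiplicative update rules
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(V - W @ H)
print(bool((W >= 0).all() and (H >= 0).all()))  # True: factors stay non-negative
```

The online version would replace the full-matrix passes with incremental updates of `W` and `H` from each incoming chunk of document vectors.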
Resources:
Difficulty: Medium
Background: Gensim stands for "generate similar", and has traditionally focused on working with vector representations for words, sentences and documents, followed by e.g. cosine similarity to assess the similarity between these word/sentence/document vectors: `sim = cossim(vector1, vector2)`. But in fact, representing the texts as vectors is not a required intermediate step; it is only important to assess the similarity: `sim = blackbox(text1, text2)`. Methods that don't use vectors, or methods that use vectors only internally as an ancillary step, are fine too. This problem is also called similarity learning, and neural networks lend themselves to it naturally.
To do: Develop a neural network for this task, from collecting data to publishing benchmark results.
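A small sketch of the interface contrast described above (the `blackbox` body is a toy stand-in, not a proposed method; a real solution would be a trained neural network):

```python
import numpy as np

def cossim(v1, v2):
    # vector-based similarity: texts must first be embedded as vectors
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def blackbox(text1, text2):
    # hypothetical learned similarity: consumes raw texts directly, with no
    # exposed document vectors (toy word-overlap score standing in for a
    # trained neural network)
    s1, s2 = set(text1.split()), set(text2.split())
    return len(s1 & s2) / len(s1 | s2)

print(round(blackbox("graph minors survey", "graph trees survey"), 2))  # 0.5
```

The point is purely the signature: `blackbox(text1, text2)` returns a similarity score without the caller ever handling vectors.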
Resources:
If you'd like to work on any of the topics above, or have your own ideas for a project, get in touch at [email protected].