- Digital Nomad
etl
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Quilt is a data mesh for connecting people with actionable data
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
A Unified Toolkit for Deep Learning Based Document Image Analysis
Source code for my collection of articles on using pandas.
Distributed task queue with full async support
A fast and reliable background task processing library for Python 3.
A Pure Python, React-style Framework for Scaling Your Jupyter and Web Apps
Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
Cluster tools for running Dask on Databricks
Port of Wappalyzer (uncovers technologies used on websites) to automate mass scanning.
This project aims to maintain Wappalyzer technologies
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Bokeh Plotting Backend for Pandas and GeoPandas
Easily create large video datasets from video URLs
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, …
All-in-one infrastructure for search, recommendations, RAG, and analytics offered via API
Efficient data transformation and modeling framework that is backwards compatible with dbt.
⬛️ CLI tool for saving complete web pages as a single HTML file
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
the file filesystem: mount semi-structured data (like JSON) as a Unix filesystem
A lightweight message queue. Like AWS SQS and RSMQ but on Postgres.
High-performance and seamless sharing and modification of Python objects between processes, without the periodic overhead of serialization and deserialization. Provides fast inter-process communica…