Quote ingest using apache stack: arrow / parquet #536
Labels:
- data-layer: real-time and historical data processing and storage
- dependencies: we are the dependent, or are you?
- fsp: financial signal processing
- integration: external stack and/or lib augmentations
- perf: efficiency and latency optimization
- research: probably just a link dump..
In follow up to #486, it'd sure be nice to be able to move away from our current `multiprocessing.shared_memory` approach for real-time quote/tick ingest and possibly leverage an apache standard format such as `arrow` and `parquet`. As part of improving the `.parquet` file based tsdb IO from #486, obviously it'd be ideal to support df appends instead of only full overwrites 😂.
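For reference, a minimal sketch of what an arrow-backed tick batch round-trip might look like; the field names and sample quote layout below are illustrative assumptions, not our actual schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical tick records as they might come off a feed;
# field names here are made up for illustration.
ticks = [
    {'time': 1661011200_000_000_000, 'price': 4128.25, 'size': 3},
    {'time': 1661011200_500_000_000, 'price': 4128.50, 'size': 1},
]

# build an in-memory arrow table from the records; arrow uses the
# same columnar layout parquet serializes, so conversion is cheap.
table = pa.Table.from_pylist(ticks)

# write out a .parquet file; note this is a full-file write,
# not an append (the limitation discussed in this issue).
pq.write_table(table, 'ticks.parquet')

# read it back as a table (or `.to_pandas()` for a df).
assert pq.read_table('ticks.parquet').equals(table)
```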
ToDo content from #486 pertaining to the `StorageClient.write_ohlcv()` write on backfills and rt ingest: rn the write is masked out mostly bc there's some details to work out on when/how frequently the writes to parquet files should happen, particularly whether to "append" to parquet files. Turns out there's options for appending (faster than overwriting i guess?) to parquet, particularly using `fastparquet`; see the below resources (and the sketch after this list):
- for python we can likely use: https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
  - `times` options with the int96 format which embeds nanoseconds B)
  - `custom_metadata`: dict can only be used on overwrite 👀 to update metadata if needed?
- https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file
- https://stackoverflow.com/questions/47191675/pandas-write-dataframe-to-parquet-format-with-append/74209756#74209756
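A rough sketch of the `fastparquet.write()` append flow from the docs linked above; the file name, df columns, and metadata key are hypothetical (note `append=True` requires the file to already exist, hence the two-step write):

```python
import pandas as pd
from fastparquet import write

# hypothetical ohlcv frames; column names are illustrative only.
df1 = pd.DataFrame({
    'time': pd.to_datetime(['2022-08-20 14:00', '2022-08-20 14:01']),
    'open': [4128.25, 4129.00],
    'close': [4129.00, 4128.75],
})
df2 = pd.DataFrame({
    'time': pd.to_datetime(['2022-08-20 14:02']),
    'open': [4128.75],
    'close': [4130.00],
})

# initial (overwrite) write: `times='int96'` stores timestamps in
# the int96 format which embeds nanoseconds, and `custom_metadata`
# can only be set here, on overwrite, not on appends.
write(
    'ohlcv.parquet',
    df1,
    times='int96',
    custom_metadata={'symbol': 'mnq.globex'},
)

# subsequent writes can append new row groups to the existing
# file instead of rewriting the whole thing.
write('ohlcv.parquet', df2, append=True, times='int96')
```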
other langs and spark related: