
[feat] Create a script to split, index and access per block utxo data dump #208

Closed
maciejka opened this issue Sep 19, 2024 · 2 comments · Fixed by #217
Assignees
Labels: advanced, good first issue (Good for newcomers), time sensitive (Blocks another issue or milestone)

Comments

@maciejka
Collaborator

maciejka commented Sep 19, 2024

Context

Script generate_data.py requires information about the previous utxos referenced by tx inputs. Currently this is implemented as a series of getrawtransaction and getblockheader requests to a bitcoin node. This is inefficient: it takes >30s to process a single block, which means that processing the full history would take weeks.

In order to speed this process up, a query against the Google BigQuery public Bitcoin dataset was created:

SELECT 
  inputs.block_number block_number,
  array_agg(
    struct(
      outputs.transaction_hash as txid, 
      outputs.index as vout,
      outputs.value,
      outputs.script_hex as pk_script,
      outputs.block_number as block_height,
      txs.is_coinbase
    )
  ) as outputs
FROM `bigquery-public-data.crypto_bitcoin.inputs` as inputs
JOIN `bigquery-public-data.crypto_bitcoin.outputs` as outputs 
  ON outputs.transaction_hash = inputs.spent_transaction_hash
  AND outputs.index = inputs.spent_output_index
JOIN `bigquery-public-data.crypto_bitcoin.transactions` as txs
  ON txs.hash = inputs.spent_transaction_hash
JOIN `bigquery-public-data.crypto_bitcoin.blocks` as blocks
  ON blocks.number = outputs.block_number
GROUP BY block_number
ORDER BY block_number

This gives us the per-block information required by the script.

The data, one block's data in JSON format per line, was exported to Cloud Storage. There are 772143 rows spread across 1732 files, each smaller than 1 GB. Example file name: 000000000000.json; the last file is 000000001731.json.

The task

Your task is to create a Python script that will:

  • download the files and split them into chunks of manageable size
  • create a block number -> chunk name index that allows locating a chunk quickly
  • provide a Python function that, given a block number, returns the corresponding utxo set

Details

Download and Split

For each data dump file:

  1. download the file from GCS
  2. create a directory with a name corresponding to the name of the file
  3. use the unix split command to break the file into chunks, placing the chunks in the directory created in the previous step; split by number of lines (the number of lines should be an easily changeable parameter), e.g.: split -l 10 utxos_000000000049.json

Processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
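The download-and-split step above might be sketched like this. The chunk naming scheme is an assumption, the GCS download itself is omitted, and a pure-Python line splitter stands in for the unix split command so the chunk size stays an easily changeable parameter:

```python
import os

def split_file(path, lines_per_chunk=10):
    """Split a JSON-lines dump into fixed-size chunks.

    Chunks go into a directory named after the input file,
    mirroring what `split -l <n>` would produce there.
    Returns the list of chunk paths that were written.
    """
    out_dir = os.path.splitext(path)[0]  # e.g. utxos_000000000049
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(path) as f:
        chunk, chunk_no = [], 0
        for line in f:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                chunk_paths.append(_write_chunk(out_dir, chunk_no, chunk))
                chunk, chunk_no = [], chunk_no + 1
        if chunk:  # trailing partial chunk
            chunk_paths.append(_write_chunk(out_dir, chunk_no, chunk))
    return chunk_paths

def _write_chunk(out_dir, chunk_no, lines):
    # chunk_NNNN naming is a hypothetical choice, not part of the task spec
    chunk_path = os.path.join(out_dir, f"chunk_{chunk_no:04d}")
    with open(chunk_path, "w") as out:
        out.writelines(lines)
    return chunk_path
```

In a real run you would first fetch the file, e.g. with gsutil or the google-cloud-storage client, then call split_file on the local copy.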

Index

You need to:

  1. load each chunk created in the previous step
  2. update an in-memory block number -> chunk name map
  3. save the map as a JSON file
  4. add consistency checks (there should be only one chunk per block; add any other checks you can think of)
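The indexing steps above might look like the following sketch. It assumes the directory layout of one subdirectory of chunks per dump file, and that each line carries a top-level block_number field as in the query output shown earlier:

```python
import glob
import json
import os

def build_index(data_dir, index_path="index.json"):
    """Map block_number -> chunk path by scanning every chunk line.

    Raises if two different chunks claim the same block, which is the
    basic consistency check: exactly one chunk per block.
    """
    index = {}
    # one subdirectory per dump file, one chunk file per subdirectory entry
    for chunk_path in sorted(glob.glob(os.path.join(data_dir, "*", "*"))):
        with open(chunk_path) as f:
            for line in f:
                block_number = json.loads(line)["block_number"]
                prev = index.get(block_number)
                if prev is not None and prev != chunk_path:
                    raise ValueError(
                        f"block {block_number} appears in both {prev} and {chunk_path}"
                    )
                index[block_number] = chunk_path
    with open(index_path, "w") as out:
        json.dump(index, out)  # note: JSON object keys become strings
    return index
```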

get_utxo_set function

Given a block number:

  1. use the index file to locate the corresponding chunk; assume the index file is available in the filesystem
  2. if the chunk is not present, execute download and split
  3. locate the corresponding line in the chunk
  4. return the parsed JSON data
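The lookup might be sketched as follows. For brevity this assumes the index file and the referenced chunk already exist; the "download and split if missing" step is left as a stub:

```python
import json

def get_utxo_set(block_number, index_path="index.json"):
    """Return the parsed per-block utxo record for block_number.

    Locates the chunk via the on-disk index, then scans the chunk's
    lines for the matching record. Triggering download-and-split for
    a missing chunk is omitted from this sketch.
    """
    with open(index_path) as f:
        index = json.load(f)
    chunk_path = index[str(block_number)]  # JSON object keys are strings
    with open(chunk_path) as f:
        for line in f:
            record = json.loads(line)
            if record["block_number"] == block_number:
                return record
    raise KeyError(f"block {block_number} not found in {chunk_path}")
```

Since chunks are small (a configurable number of lines), the linear scan inside one chunk is cheap; the index keeps the per-lookup work bounded regardless of the total dump size.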

Time constraints

This task is on the critical path and is blocking other tasks; please do not volunteer if you can't complete it in 2-3 days.

@fishonamos
Contributor

Kindly assign @maciejka. Will love to take it up.

@maciejka maciejka added the time sensitive Blocks another issue or milestone label Sep 19, 2024
@maciejka maciejka changed the title [feat] create a script to split, index and access per block utxo data dump [feat] Create a script to split, index and access per block utxo data dump Sep 19, 2024
@maciejka
Collaborator Author

@fishonamos please note that processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
