
[feat] Create a script to split, index and access per block utxo data dump #208

Closed
maciejka opened this issue Sep 19, 2024 · 2 comments · Fixed by #217
Assignees
Labels: advanced, good first issue (Good for newcomers), time sensitive (Blocks another issue or milestone)

Comments

@maciejka
Collaborator

maciejka commented Sep 19, 2024

Context

Script generate_data.py requires information about the previous utxos referenced by tx inputs. Currently this is implemented as a series of getrawtransaction and getblockheader requests to a bitcoin node. This is inefficient: it takes >30s to process a single block, which means that processing the full history would take weeks.

In order to speed this process up, a query against the Google BigQuery public Bitcoin dataset was created:

SELECT 
  inputs.block_number block_number,
  array_agg(
    struct(
      outputs.transaction_hash as txid, 
      outputs.index as vout,
      outputs.value,
      outputs.script_hex as pk_script,
      outputs.block_number as block_height,
      txs.is_coinbase
    )
  ) as outputs
FROM `bigquery-public-data.crypto_bitcoin.inputs` as inputs
JOIN `bigquery-public-data.crypto_bitcoin.outputs` as outputs 
  ON outputs.transaction_hash = inputs.spent_transaction_hash
  AND outputs.index = inputs.spent_output_index
JOIN `bigquery-public-data.crypto_bitcoin.transactions` as txs
  ON txs.hash = inputs.spent_transaction_hash
JOIN `bigquery-public-data.crypto_bitcoin.blocks` as blocks
  ON blocks.number = outputs.block_number
GROUP BY block_number
ORDER BY block_number

This gives us the per-block information required by the script.

The data, one block's data in JSON format per line, was exported to Cloud Storage. There are 772143 rows spread across 1732 files, each smaller than 1 GB. Example file name: 000000000000.json; the last file is 000000001731.json.

The task

Your task is to create a Python script that will:

  • download the files and split them into chunks of manageable size
  • create a block number -> chunk name index that allows locating a chunk quickly
  • provide a Python function that, given a block number, returns the corresponding utxo set

Details

Download and Split

For each data dump file:

  1. download the file from GCS
  2. create a directory with a name corresponding to the name of the file
  3. use the unix split command to break the file into chunks, placing the chunks in the directory created in the previous step; split by number of lines (the number of lines should be an easily changeable parameter), e.g.: split -l 10 utxos_000000000049.json

Processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
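The download-and-split step above might be sketched like this. The chunk naming scheme is an assumption, the GCS download itself is omitted, and a pure-Python line splitter stands in for the unix split command so the chunk size stays an easily changeable parameter:

```python
import os

def split_file(path, lines_per_chunk=10):
    """Split a JSON-lines dump into fixed-size chunks.

    Chunks go into a directory named after the input file,
    mirroring what `split -l <n>` would produce there.
    Returns the list of chunk paths that were written.
    """
    out_dir = os.path.splitext(path)[0]  # e.g. utxos_000000000049
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(path) as f:
        chunk, chunk_no = [], 0
        for line in f:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                chunk_paths.append(_write_chunk(out_dir, chunk_no, chunk))
                chunk, chunk_no = [], chunk_no + 1
        if chunk:  # trailing partial chunk
            chunk_paths.append(_write_chunk(out_dir, chunk_no, chunk))
    return chunk_paths

def _write_chunk(out_dir, chunk_no, lines):
    # chunk_NNNN naming is a hypothetical choice, not part of the task spec
    chunk_path = os.path.join(out_dir, f"chunk_{chunk_no:04d}")
    with open(chunk_path, "w") as out:
        out.writelines(lines)
    return chunk_path
```

In a real run you would first fetch the file, e.g. with gsutil or the google-cloud-storage client, then call split_file on the local copy.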

Index

You need to:

  1. load each chunk created in the previous step
  2. update an in-memory block number -> chunk name map
  3. save the map as a JSON file
  4. add consistency checks (there should be only one chunk per block; add any other checks you can think of)
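The indexing steps above might look like the following sketch. It assumes the directory layout of one subdirectory of chunks per dump file, and that each line carries a top-level block_number field as in the query output shown earlier:

```python
import glob
import json
import os

def build_index(data_dir, index_path="index.json"):
    """Map block_number -> chunk path by scanning every chunk line.

    Raises if two different chunks claim the same block, which is the
    basic consistency check: exactly one chunk per block.
    """
    index = {}
    # one subdirectory per dump file, one chunk file per subdirectory entry
    for chunk_path in sorted(glob.glob(os.path.join(data_dir, "*", "*"))):
        with open(chunk_path) as f:
            for line in f:
                block_number = json.loads(line)["block_number"]
                prev = index.get(block_number)
                if prev is not None and prev != chunk_path:
                    raise ValueError(
                        f"block {block_number} appears in both {prev} and {chunk_path}"
                    )
                index[block_number] = chunk_path
    with open(index_path, "w") as out:
        json.dump(index, out)  # note: JSON object keys become strings
    return index
```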

get_utxo_set function

Given a block number:

  1. use the index file to locate the corresponding chunk; assume the index file is available in the filesystem
  2. if the chunk is not present, execute download and split
  3. locate the corresponding line in the chunk
  4. return the parsed JSON data
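The lookup might be sketched as follows. For brevity this assumes the index file and the referenced chunk already exist; the "download and split if missing" step is left as a stub:

```python
import json

def get_utxo_set(block_number, index_path="index.json"):
    """Return the parsed per-block utxo record for block_number.

    Locates the chunk via the on-disk index, then scans the chunk's
    lines for the matching record. Triggering download-and-split for
    a missing chunk is omitted from this sketch.
    """
    with open(index_path) as f:
        index = json.load(f)
    chunk_path = index[str(block_number)]  # JSON object keys are strings
    with open(chunk_path) as f:
        for line in f:
            record = json.loads(line)
            if record["block_number"] == block_number:
                return record
    raise KeyError(f"block {block_number} not found in {chunk_path}")
```

Since chunks are small (a configurable number of lines), the linear scan inside one chunk is cheap; the index keeps the per-lookup work bounded regardless of the total dump size.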

Time constraints

This task is on the critical path and is blocking other tasks; please do not volunteer if you can't complete it in 2-3 days.

@fishonamos
Contributor

Kindly assign @maciejka. Will love to take it up.

@maciejka maciejka added the time sensitive Blocks another issue or milestone label Sep 19, 2024
@maciejka maciejka changed the title [feat] create a script to split, index and access per block utxo data dump [feat] Create a script to split, index and access per block utxo data dump Sep 19, 2024
@maciejka
Collaborator Author

@fishonamos please note that processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
