Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with final matrix? #188

Open
16nbb1 opened this issue Jan 11, 2024 · 0 comments
Open

Issue with final matrix? #188

16nbb1 opened this issue Jan 11, 2024 · 0 comments

Comments

@16nbb1
Copy link

16nbb1 commented Jan 11, 2024

Hi! I'm trying to reproduce the prediction matrices in the tutorial.

import os
import json
import subprocess
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pysam
from basenji import dataset, dna_io, seqnn
import tensorflow as tf

I then loaded in your model

model_dir = '/Users/nadejdaboev/Desktop/Sushant/Akita/'
params_file = model_dir+'params_og.json'
model_file  = model_dir+'model_best.h5'
with open(params_file) as params_open:
    params = json.load(params_open)
    params_model = params['model']
    params_train = params['train']
seqnn_model = seqnn.SeqNN(params_model)

I unfortantely could not install and import cooltools. This is my workaround for that issue. I copied the set_diag function directly from cooltools source code

def set_diag(arr, x, i=0, copy=False):
    if copy:
        arr = arr.copy()
    start = max(i, -arr.shape[1] * i)
    stop = max(0, (arr.shape[1] - i)) * arr.shape[1]
    step = arr.shape[1] + 1
    arr.flat[start:stop:step] = x
    return arr

def from_upper_triu(vector_repr, matrix_len, num_diags):
    z = np.zeros((matrix_len,matrix_len))
    triu_tup = np.triu_indices(matrix_len,num_diags)
    z[triu_tup] = vector_repr
    for i in range(-num_diags+1,num_diags):
        set_diag(z, np.nan, i)
    return z + z.T

I initially tried to use my own local reference .fasta file for chromosome 15 and tried to segment a region

fasta_open = pysam.Fastafile('/Users/nadejdaboev/Desktop/Sushant/Genome/chr15.fasta')
#fasta_open.references
seq = fasta_open.fetch( 'NC_000015.10', 63281152, 64329728 ).upper()

This didn't end up working at the prediction stage, so I overwrote this sequence using the fasta provided

fasta_open = pysam.Fastafile('/Users/nadejdaboev/Desktop/Sushant/Genome/hg38.ml.fa')
seq = fasta_open.fetch( 'chr15', 63281152, 64329728 ).upper()
seq_1hot = dna_io.dna_1hot(seq)
test_pred_from_seq = seqnn_model.model.predict(np.expand_dims(seq_1hot,0))

When I print out a bit of the seq I get: CAAAAACAAAAACTCCCTTCTGACCGCTGCCTTACTCAAG.... the resulting seq_1hot aligns with this sequence.

I then used the following, using the values from the json, to set up the reshaping of the array

target_length = 99681
target_length1 = 1048576 // 2048
target_crop = 65536 // 2048
target_length1_cropped = target_length1 - 2*target_crop
hic_diags =  2

This is consistently outputting the following, as I was expecting:
flattened representation length: 99681
symmetrix matrix size: (448,448)

I then tried to visualize

plt.figure(figsize=(8,4))
target_index = 0
vmin=-2
vmax=2
mat = from_upper_triu(test_pred_from_seq[:,:, 0], target_length1_cropped, hic_diags)
plt.subplot(121) 
im = plt.matshow(mat, fignum=False, cmap= 'RdBu_r', vmax=vmax, vmin=vmin)
plt.colorbar(im, fraction=.04, pad = 0.05, ticks=[-2,-1, 0, 1,2]);
plt.title('chr15: 63281152-64329728')

I've attached the resulting plot I get.

Screen Shot 2024-01-11 at 5 50 04 PM

I initially thought it was from the reshaping/cooltools issue, so I checked mat

mat

The output is as follows and is definitely not what I was expecting:

array([[ nan, nan, -0.22503351, ..., -50.06995773,
-50.18245697, -50.29498291],
[ nan, nan, nan, ..., -49.95742035,
-50.06995773, -50.18245697],
[ -0.22503351, nan, nan, ..., -49.84490967,
-49.95742035, -50.06995773],
...,
[-50.06995773, -49.95742035, -49.84490967, ..., nan,
nan, -0.22503351],
[-50.18245697, -50.06995773, -49.95742035, ..., nan,
nan, nan],
[-50.29498291, -50.18245697, -50.06995773, ..., -0.22503351,
nan, nan]])

Then I looked at test_pred_from_seq[:,:, 0] itself

test_pred_from_seq[:,:, 0]

array([[-0.2250335 , -0.33755007, -0.450067 , ..., -0.2250335 ,
-0.33755007, -0.2250335 ]], dtype=float32)

I don't believe this is correct either. Please let me know if I've made any fatal errors. I can't seem to track my mistake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant