
SpikeDataloader wrong data fetch sequence: post_guard and run_post_mgmt are running twice when using LearningDense in process graph #659

Closed
chkadway-tcs opened this issue Apr 5, 2023 · 2 comments · Fixed by #662
Assignees
Labels
1-bug Something isn't working

Comments

@chkadway-tcs

Describe the bug
Please review and confirm whether the bug described below is legitimate.

When using lava.proc.io.dataloader.SpikeDataloader with lava.proc.dense.process.LearningDense:

1. __getitem__ of the dataset object gets called twice at every SpikeDataloader reset-interval timestep.
2. The post_guard method of the SpikeDataloader process model runs twice and returns True twice.
3. The run_post_mgmt method of the SpikeDataloader process model runs twice, because post_guard runs twice.
4. The data-sample fetch is not sequential: the network gets trained on the odd indexes of the dataset first, then on the even ones.

No such issue occurs when using lava.proc.io.dataloader.SpikeDataloader with lava.proc.dense.process.Dense.
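To illustrate the effect, here is a minimal plain-Python sketch of the symptom (not Lava code; the function name and loop structure are mine): if the management phase runs twice per reset interval, the dataset index advances by two each interval, and the second fetch overwrites the first, so the network only ever sees every other sample.

```python
# Sketch: simulates the observed double execution of run_post_mgmt
# at each reset interval (plain Python, no Lava dependency).
n_samples = 5

def presented_samples(mgmt_runs_per_interval):
    """Return the sample indexes the network actually sees, one per interval."""
    sample_id = 0
    presented = []
    for _interval in range(n_samples):
        # The management phase fetches a new sample each time it runs;
        # if it runs twice, the second fetch overwrites the first.
        for _ in range(mgmt_runs_per_interval):
            fetched = sample_id % n_samples
            sample_id += 1
        presented.append(fetched)
    return presented

print(presented_samples(1))  # expected behaviour: [0, 1, 2, 3, 4]
print(presented_samples(2))  # observed with LearningDense: [1, 3, 0, 2, 4]
```

With two management runs per interval, the presented order is the odd indexes first and then the even ones, matching the log below.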

To reproduce current behavior
Steps to reproduce the behavior:

  1. When I run the Python code below...
import os
import math
import time
import typing as ty

import numpy as np

from misc.my_dataset import RandomDataset
from misc.my_dataloader import SpikeDataloader

from lava.proc.lif.process import LIFReset, LearningLIF
from lava.proc.dense.process import Dense, LearningDense

from misc.hyper_params import model_name, n_neurons, stdp_params, lif1_params

results_path = "./results/"

if not os.path.isdir(results_path+model_name):
    os.makedirs(results_path+model_name)  # makedirs also creates ./results/ if it does not exist yet

from lava.proc.learning_rules.stdp_learning_rule import STDPLoihi

stdp = STDPLoihi(learning_rate=stdp_params["learning_rate"], 
                 A_plus=stdp_params["A_plus"], 
                 A_minus=stdp_params["A_minus"], 
                 tau_plus=stdp_params["tau_plus"], 
                 tau_minus=stdp_params["tau_minus"], 
                 t_epoch=stdp_params["t_epoch"])

n_samples = 5
T_per_sample = 20

sim_time = n_samples*T_per_sample

dataset = RandomDataset(n_samples=n_samples, n_dim=n_neurons["input"], spike_len=T_per_sample)

print(f"[TRAIN] Samples: {n_samples}, T: {T_per_sample}, Sim-Time: {sim_time}")

np.random.seed(22)

init_weights = np.random.rand(n_neurons["hidden"], n_neurons["input"])
print(f"[BEFORE training] weights.min(): {init_weights.min()}, weights.max(): {init_weights.max()}, weights.sum(): {init_weights.sum()}")

# Instantiate SpikeGenerator
spike_gen = SpikeDataloader(dataset=dataset, interval=T_per_sample, offset=0)

plastic_dense = LearningDense(weights=init_weights, learning_rule=stdp, name='plastic_dense')

lif1 = LIFReset(shape=(n_neurons["hidden"], ), # Number of units in this process
           vth=lif1_params["vth"], # Membrane threshold; a higher threshold means fewer spikes
           dv=lif1_params["dv"], # Inverse membrane time constant; a smaller value means slower decay
           du=lif1_params["du"], # Inverse synaptic time constant; a smaller value means slower decay
           reset_interval = T_per_sample,
           reset_offset = 3,
           name='lif1')

print("\nHyperparameters: ", model_name, n_neurons, stdp_params, lif1_params)

# Connect spike_gen to dense_input
spike_gen.s_out.connect(plastic_dense.s_in)

# Connect dense_input to LIF1 population
plastic_dense.a_out.connect(lif1.a_in)

lif1.s_out.connect(plastic_dense.s_in_bap)

from lava.magma.core.run_conditions import RunSteps, RunContinuous
from lava.magma.core.run_configs import Loihi1SimCfg, Loihi2SimCfg

tick = time.time()
spike_gen.run(condition=RunSteps(num_steps=sim_time), run_cfg=Loihi1SimCfg(select_tag="floating_pt"))
tock = time.time()

print("\nRUN time:", (tock - tick)/60, " min")

lif1_weights = plastic_dense.weights.get()

spike_gen.stop()

print(f"[AFTER training] weights.min(): {lif1_weights.min()}, weights.max(): {lif1_weights.max()}, weights.sum(): {lif1_weights.sum()}")
  2. I get the output below, where at every timestep the post_guard method of SpikeDataloader is executed twice. As a result, SpikeDataloader loads data in the wrong sequence (if the dataset iterable has 5 samples, the odd indexes are fetched and injected into the network first, then the even samples). The lines below were obtained by adding print statements to the ProcessModel of SpikeDataloader.
[TRAIN] Samples: 5, T: 20, Sim-Time: 100
[BEFORE training] weights.min(): 0.001163202710826039, weights.max(): 0.9998928289029887, weights.sum(): 543.1606742989552
[dataset getitem] sample_indx=0, label=1, spike_count=[ 6 12 15 14 10  9 11  8  9 12  8]

Hyperparameters:  model_intel_lava_issue {'input': 11, 'hidden': 100} {'learning_rate': 0.1, 'A_plus': -0.1, 'A_minus': 0.1, 'tau_plus': 10, 'tau_minus': 10, 't_epoch': 1} {'vth': 1.0, 'dv': 0.0, 'du': 1.0, 'refrac_interval': 5}

[run_spk] t=1, self.sample_time % self.interval.item()=0
[post_guard] t=1, post_guard_bool=[ True]
[post_guard] t=1, post_guard_bool=[ True]

[run_post_mgmt] t=1, sample_id=0, sample_time=1
[dataset getitem] sample_indx=0, label=1, spike_count=[ 6 12 15 14 10  9 11  8  9 12  8]
[post_guard] t=1, post_guard_bool=[ True]
[post_guard] t=1, post_guard_bool=[ True]

[run_post_mgmt] t=1, sample_id=1, sample_time=0
[dataset getitem] sample_indx=1, label=0, spike_count=[11  6 14 11  8 16 11 14 10  6 13]

[run_spk] t=2, self.sample_time % self.interval.item()=0
[post_guard] t=2, post_guard_bool=[False]
[post_guard] t=2, post_guard_bool=[False]

[run_spk] t=3, self.sample_time % self.interval.item()=1
[post_guard] t=3, post_guard_bool=[False]
[post_guard] t=3, post_guard_bool=[False]

[run_spk] t=4, self.sample_time % self.interval.item()=2
[post_guard] t=4, post_guard_bool=[False]
[post_guard] t=4, post_guard_bool=[False]
...
...
[run_spk] t=20, self.sample_time % self.interval.item()=18
[post_guard] t=20, post_guard_bool=[False]
[post_guard] t=20, post_guard_bool=[False]

[run_spk] t=21, self.sample_time % self.interval.item()=19
[post_guard] t=21, post_guard_bool=[ True]
[post_guard] t=21, post_guard_bool=[ True]

[run_post_mgmt] t=21, sample_id=2, sample_time=20
[dataset getitem] sample_indx=2, label=0, spike_count=[10 10 10  8 11 12  9 11  9  9 12]
[post_guard] t=21, post_guard_bool=[ True]
[post_guard] t=21, post_guard_bool=[ True]

[run_post_mgmt] t=21, sample_id=3, sample_time=0
[dataset getitem] sample_indx=3, label=0, spike_count=[ 8 12  8 12 12  7  4 15  8  9 10]

[run_spk] t=22, self.sample_time % self.interval.item()=0
[post_guard] t=22, post_guard_bool=[False]
[post_guard] t=22, post_guard_bool=[False]

[run_spk] t=23, self.sample_time % self.interval.item()=1
[post_guard] t=23, post_guard_bool=[False]
[post_guard] t=23, post_guard_bool=[False]

[run_spk] t=24, self.sample_time % self.interval.item()=2
[post_guard] t=24, post_guard_bool=[False]
[post_guard] t=24, post_guard_bool=[False]
...
...
RUN time: 0.00281527837117513  min
[AFTER training] weights.min(): -6.34409606075274, weights.max(): 7.421761656760727, weights.sum(): 1124.7906023245246

Afterwards I used lava.proc.io.source.RingBuffer to avoid the wrong data-fetch sequence of the SpikeDataloader process model. In addition, I wrote a custom Process and ProcessModel (WeightSnapshot) to read the weight matrix of LearningDense via a RefPort. This custom process also has post_guard and run_post_mgmt methods, and the double execution of post_guard and run_post_mgmt persists there as well.

No such issue occurs when the custom process with a RefPort is used with lava.proc.dense.process.Dense. The bug appears to be specific to having LearningDense in the process graph.
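The log further below shows the same overwrite pattern: the for-loop always reports the second of the two snapshot values taken at the same timestep. A plain-Python sketch of that symptom (the function and example values are illustrative, not Lava code):

```python
# Sketch: if run_post_mgmt runs twice per interval, the weight snapshot is
# taken twice and the second one overwrites the first, so the user-level
# get() only ever returns the second reading.
def take_snapshots(readings_per_interval, values):
    """values: one list of successive readings per interval."""
    seen = []
    for interval_values in values:
        for v in interval_values[:readings_per_interval]:
            snapshot = v  # each management run overwrites the snapshot
        seen.append(snapshot)
    return seen

# Two readings per interval, mirroring the log (values are illustrative):
print(take_snapshots(2, [[618.1, -77.9], [1394.6, 1134.4]]))  # [-77.9, 1134.4]
```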

To reproduce current behavior
Steps to reproduce the behavior:

  1. When I run the Python code below...
import os
import math
import time
import typing as ty

import numpy as np
import pandas as pd

from lava.proc.lif.process import LIFReset
from lava.proc.dense.process import LearningDense

from misc.my_dataset import RandomDataset
from misc.hyper_params import model_name, n_neurons, stdp_params, lif1_params

results_path = "./results/"

if not os.path.isdir(results_path+model_name):
    os.makedirs(results_path+model_name)  # makedirs also creates ./results/ if it does not exist yet

n_samples = 5
T_per_sample = 100

sim_time = n_samples*T_per_sample

dataset = RandomDataset(n_samples=n_samples, n_dim=n_neurons["input"], spike_len=T_per_sample)

print(f"[TRAIN] Samples: {n_samples}, T: {T_per_sample}, Sim-Time: {sim_time}")

from lava.proc.learning_rules.stdp_learning_rule import STDPLoihi

stdp = STDPLoihi(learning_rate=stdp_params["learning_rate"], 
                 A_plus=stdp_params["A_plus"], 
                 A_minus=stdp_params["A_minus"], 
                 tau_plus=stdp_params["tau_plus"], 
                 tau_minus=stdp_params["tau_minus"], 
                 t_epoch=stdp_params["t_epoch"])

np.random.seed(22)

init_weights = np.random.rand(n_neurons["hidden"], n_neurons["input"])
print(f"[BEFORE training] weights.min(): {init_weights.min()}, weights.max(): {init_weights.max()}, weights.sum(): {init_weights.sum()}")

from misc.custom_processes import WeightSnapshot

from lava.proc.io.source import RingBuffer as RingBufferSource
from lava.proc.io.sink import RingBuffer as RingBufferSink

spike_gen = RingBufferSource(data=dataset[0][0])

plastic_dense = LearningDense(weights=init_weights, learning_rule=stdp, name='plastic_dense')

dense_wt_snap = WeightSnapshot(shape=init_weights.shape, snapshot_interval = T_per_sample)

lif1 = LIFReset(shape=(n_neurons["hidden"], ), # Number of units in this process
           vth=lif1_params["vth"], # Membrane threshold; a higher threshold means fewer spikes
           dv=lif1_params["dv"], # Inverse membrane time constant; a smaller value means slower decay
           du=lif1_params["du"], # Inverse synaptic time constant; a smaller value means slower decay
           reset_interval = T_per_sample,
           reset_offset = 3,
           name='lif1')

spk_buffer = RingBufferSink(shape=(n_neurons["hidden"],), buffer=T_per_sample)

print("Hyperparameters: ", model_name, n_neurons, stdp_params, lif1_params)

# Connect spike_gen to dense_input
spike_gen.s_out.connect(plastic_dense.s_in)

# Connect dense_input to LIF1 population
plastic_dense.a_out.connect(lif1.a_in)
dense_wt_snap.wt_ref.connect_var(plastic_dense.weights)

lif1.s_out.connect(plastic_dense.s_in_bap)
lif1.s_out.connect(spk_buffer.a_in)

from lava.magma.core.run_conditions import RunSteps, RunContinuous
from lava.magma.core.run_configs import Loihi1SimCfg, Loihi2SimCfg

spike_gen.run(condition=RunSteps(num_steps=0), run_cfg=Loihi1SimCfg(select_tag="floating_pt"))

for i in range(n_samples):
    
    X_spike, _ = dataset[i]
    
    spike_gen.data.set(X_spike)
    
    spike_gen.run(condition=RunSteps(num_steps=T_per_sample), run_cfg=Loihi1SimCfg(select_tag="floating_pt"))

    lif1_weights = dense_wt_snap.wt_snapshot.get()
    lif1_spk = spk_buffer.data.get()
    
    print(f"[for loop wt snap] Weight Sum: {lif1_weights.sum()}")

spike_gen.run(condition=RunSteps(num_steps=4), run_cfg=Loihi1SimCfg(select_tag="floating_pt"))
spike_gen.stop()

print(f"[AFTER training] weights.min(): {lif1_weights.min()}, weights.max(): {lif1_weights.max()}, weights.sum(): {lif1_weights.sum()}")
  2. I get the output below for the above code. [ref port wt snap] is a print statement placed inside run_post_mgmt of the WeightSnapshot ProcessModel.
[TRAIN] Samples: 5, T: 100, Sim-Time: 500
[BEFORE training] weights.min(): 0.001163202710826039, weights.max(): 0.9998928289029887, weights.sum(): 543.1606742989552

[dataset getitem] sample_indx=0, label=0, spike_count=[57 49 50 59 49 51 50 43 54 50 51]

Hyperparameters:  model_intel_lava_issue {'input': 11, 'hidden': 100} {'learning_rate': 0.1, 'A_plus': -0.1, 'A_minus': 0.1, 'tau_plus': 10, 'tau_minus': 10, 't_epoch': 1} {'vth': 1.0, 'dv': 0.0, 'du': 1.0, 'refrac_interval': 5}

[dataset getitem] sample_indx=0, label=0, spike_count=[57 49 50 59 49 51 50 43 54 50 51]
[ref port wt snap] Weight Sum: 618.1487046149632 at time step: 100
[ref port wt snap] Weight Sum: -77.86055120364368 at time step: 100
[for loop wt snap] Weight Sum: -77.86055120364368

[dataset getitem] sample_indx=1, label=0, spike_count=[54 50 50 51 41 53 45 47 54 51 51]
[ref port wt snap] Weight Sum: 1394.6220756593202 at time step: 200
[ref port wt snap] Weight Sum: 1134.3848719536343 at time step: 200
[for loop wt snap] Weight Sum: 1134.3848719536343

[dataset getitem] sample_indx=2, label=0, spike_count=[50 53 49 47 43 52 52 49 44 51 48]
[ref port wt snap] Weight Sum: 2046.2285016958122 at time step: 300
[ref port wt snap] Weight Sum: 1424.5851358957186 at time step: 300
[for loop wt snap] Weight Sum: 1424.5851358957186

[dataset getitem] sample_indx=3, label=1, spike_count=[48 54 50 53 52 50 42 54 44 43 53]
[ref port wt snap] Weight Sum: -412.9456746640211 at time step: 400
[ref port wt snap] Weight Sum: 45.10908769779778 at time step: 400
[for loop wt snap] Weight Sum: 45.10908769779778

[dataset getitem] sample_indx=4, label=1, spike_count=[60 60 49 45 48 49 57 37 55 46 52]
[ref port wt snap] Weight Sum: -652.6456657718043 at time step: 500
[ref port wt snap] Weight Sum: -151.65821622253867 at time step: 500
[for loop wt snap] Weight Sum: -151.65821622253867

[AFTER training] weights.min(): -11.892270358369256, weights.max(): 9.962492213286719, weights.sum(): -151.65821622253867
  3. I have attached a zip file with these Python scripts. Script 1: intel_dataloader_issue_1.py; script 2: intel_dataloader_issue_2.py
  4. Zip of the Python scripts to reproduce the above bug: intel_lava_dataloader_learningdense_issue_TCS.zip

Expected behavior
When LearningDense is used in the network process graph, all processes with post_guard and run_post_mgmt run those methods twice per timestep. This should not happen. It also breaks the expected behaviour of SpikeDataloader: it should load data samples in a linear sequence, with post_guard running only once per timestep.
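The expected interplay can be sketched in plain Python (a sketch of the symptom only, not Lava internals; the guard condition t % interval == 1 is an assumption inferred from the log above, where run_post_mgmt fires at t=1 and t=21):

```python
# Expected behaviour: post_guard is evaluated once per timestep and triggers
# run_post_mgmt only at reset-interval boundaries, giving a linear sequence.
interval = 20
n_steps = 100
n_samples = 5

sample_id = 0
fetched = []
for t in range(1, n_steps + 1):
    guard = (t % interval) == 1  # assumed boundary condition, per the log
    if guard:
        fetched.append(sample_id % n_samples)  # one fetch per boundary
        sample_id += 1

print(fetched)  # linear sequence: [0, 1, 2, 3, 4]
```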

Environment (please complete the following information):

  • Device: Laptop
  • OS: Ubuntu 20.04 LTS, Python 3.8.10
  • Lava 0.6.0 (installed from lava_nc-0.6.0.tar.gz)

Additional context
Have discussed this bug with Sumedh and Sumit (Intel NCL Team).

@chkadway-tcs chkadway-tcs added the 1-bug Something isn't working label Apr 5, 2023
@github-actions github-actions bot added the 0-needs-review For all new issues label Apr 5, 2023
@gkarray gkarray removed the 0-needs-review For all new issues label Apr 11, 2023
@gkarray gkarray self-assigned this Apr 11, 2023
@gkarray
Contributor

gkarray commented Apr 11, 2023

Hello @chkadway-tcs, thanks for reporting this. It is indeed a legitimate bug.

I investigated it, found the source of the problem and a potential fix.
I will open a PR with the fix shortly.

@chkadway-tcs
Author

Thanks @gkarray, that was quick.
