Machine crashing before run trainer.train() line #5372

Open
VictorGimenez opened this issue Oct 3, 2024 · 1 comment

VictorGimenez commented Oct 3, 2024

I ran a script to run inference on a custom dataset that I made myself, with more than 300 annotations, following this tutorial: https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5 I used the following parameters in the config:

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ["objects" + "_" + "train"]
cfg.DATASETS.TEST = []
cfg.DATALOADER.NUM_WORKERS = 3
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.0025
cfg.SOLVER.MAX_ITER = 400
cfg.SOLVER.STEPS = []
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 200
cfg.MODEL.DEVICE = "cuda"
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.3
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
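One thing worth noting about the config above: cfg.MODEL.WEIGHTS is assigned twice, and the second assignment (the model_final.pth path under cfg.OUTPUT_DIR) replaces the model zoo URL before training starts. Since trainer.resume_or_load(resume=False) loads whatever cfg.MODEL.WEIGHTS points to, it may help to choose the weights explicitly per phase. The following is only a minimal sketch: the helper name weights_for and its "train"/"inference" modes are hypothetical (not part of detectron2), and nothing here is a claim that this is what causes the crash.

```python
import os


def weights_for(mode, output_dir, pretrained_url):
    """Pick the value for cfg.MODEL.WEIGHTS depending on the phase.

    Hypothetical helper, not a detectron2 API.
    - "train": start from the pretrained weights (e.g. the model zoo URL).
    - "inference": use the checkpoint written by a finished training run.
    """
    if mode == "train":
        return pretrained_url
    if mode == "inference":
        checkpoint = os.path.join(output_dir, "model_final.pth")
        # Fail early with a clear message instead of at load time.
        if not os.path.isfile(checkpoint):
            raise FileNotFoundError(f"no trained checkpoint at {checkpoint}")
        return checkpoint
    raise ValueError(f"unknown mode: {mode!r}")
```

With the config from this post, that would mean something like cfg.MODEL.WEIGHTS = weights_for("train", cfg.OUTPUT_DIR, model_zoo.get_checkpoint_url(...)) before building the trainer, and switching to "inference" only after training has produced model_final.pth.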

The problem appeared when I ran this part of the script:

import numpy as np
from threading import Thread
from queue import Queue
import sys
import os
from tqdm import tqdm
import cv2 as cv
import torch
import json
import paho.mqtt.publish as publish
import time

np.set_printoptions(threshold=sys.maxsize)

# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor, DefaultTrainer
from detectron2.checkpoint import DetectionCheckpointer
# from detectron2.utils.video_visualizer import VideoVisualizer
from detectron2.utils.visualizer import ColorMode, Visualizer
from detectron2.modeling import build_model
from detectron2.data import MetadataCatalog, DatasetCatalog
from detectron2.structures import BoxMode, Boxes, Instances


from shapely.geometry import Polygon, Point
...
...
cfg.OUTPUT_DIR="./output"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

Unfortunately, execution never reached the trainer.train() line: my machine crashed on trainer.resume_or_load(resume=False). I watched main memory and swap with htop, and when the script reached this line both bars turned completely red and filled to their limits.
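To narrow down which call actually drives memory up (building the trainer, loading the weights, or something earlier), each step can be wrapped in a small logger. This is a minimal sketch using only the Python standard library; the names peak_rss_mib and log_peak_memory are made up for illustration, the resource module is Unix-only, and the numbers cover only the main Python process, not any data loader worker processes.

```python
import resource
import sys
from contextlib import contextmanager


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


@contextmanager
def log_peak_memory(label, log=print):
    """Report how far the peak RSS moved while the wrapped block ran."""
    before = peak_rss_mib()
    try:
        yield
    finally:
        after = peak_rss_mib()
        log(f"{label}: peak RSS {before:.0f} -> {after:.0f} MiB")


# Intended use with the script from this post (not executed here):
#
# with log_peak_memory("DefaultTrainer(cfg)"):
#     trainer = DefaultTrainer(cfg)
# with log_peak_memory("resume_or_load"):
#     trainer.resume_or_load(resume=False)
```

If the machine freezes before the second message is printed, the last label that did appear shows how far the script got and how much memory the main process had used by then.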

I ran the script directly with python <name_of_my_script.py>.

I would like to know whether anyone has faced the same problem, and how to fix it!

github-actions bot commented Oct 3, 2024

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template.
The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

github-actions bot added and removed the needs-more-info label ("More info is needed to complete the issue") on Oct 3, 2024.