model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? #2676

Closed
wpq3142 opened this issue Nov 1, 2017 · 26 comments

Comments

@wpq3142

wpq3142 commented Nov 1, 2017

System information

  • What is the top-level directory of the model you are using: /home/wpq/workspace/models-master/research
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.4.0-rc1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: cuDNN v7.0.3 (Sept 28, 2017), CUDA 9.0
  • GPU model and memory: GTX 650, 2 GB
  • Exact command to reproduce:
    python3 object_detection/train.py
    --clone_on_cpu true
    --logtostderr
    --pipeline_config_path /home/wpq/data/potato/model/rfcn_resnet101_coco.config
    --train_dir /home/wpq/data/potato/model/train

Describe the problem

I downloaded the new checkpoint archive: faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017.tar.gz

rfcn_resnet101_coco.config:
model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }

Source code / logs

2017-11-01 15:11:40.186072: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/wpq/data/potato/data/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/home/wpq/workspace/models-master/research/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/wpq/workspace/models-master/research/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/wpq/workspace/models-master/research/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/home/wpq/workspace/models-master/research/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 150, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/wpq/data/potato/data/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

Process finished with exit code 1

@wpq3142
Author

wpq3142 commented Nov 1, 2017

The checkpoint file format is inconsistent. See this post:
http://votec.top/2016/12/24/tensorflow-r12-tf-train-Saver/

Change slim.get_or_create_global_step() to tf.train.get_or_create_global_step().

@scotthuang1989

@wpq3142
This exception is raised here:

ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)

I haven't looked into the implementation of this API, but I suppose it expects the new checkpoint format.

@jart
Contributor

jart commented Nov 1, 2017

I'm assuming the model code here would need to be updated to determine which format the checkpoint is written in and then use the correct API? If so, that sounds like a straightforward change and we'd welcome contributions helping to clean up the model.

@tombstone tombstone added the stat:awaiting response Waiting on input from the contributor label Nov 3, 2017
@tombstone
Contributor

tombstone commented Nov 3, 2017

@wpq3142 Can you tell us how you are configuring this particular entry in the config:
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt".

It should look like
fine_tune_checkpoint: "/home/wpq/data/potato/data/model.ckpt"

Moreover, it also looks like you are using rfcn_resnet101_coco.config with a faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017 checkpoint. These two are not compatible. You need to use rfcn_resnet101_coco_11_06_2017.tar.gz with rfcn_resnet101_coco.config.
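
A quick way to sanity-check the fine_tune_checkpoint entry before training is sketched below. This is only an illustration: the helper name check_fine_tune_checkpoint is hypothetical, and it uses a plain regular expression on the config text rather than the protobuf parsing the Object Detection API itself performs.

```python
import re

# Suffixes that name a single checkpoint file rather than the checkpoint prefix.
_SINGLE_FILE = re.compile(r"\.(data-\d+-of-\d+|index|meta)$")


def check_fine_tune_checkpoint(config_text):
    """Return (path, verdict) for the fine_tune_checkpoint entry in a config string.

    Hypothetical helper: a regex scan, not a real pipeline-config parser.
    """
    match = re.search(r'fine_tune_checkpoint\s*:\s*"([^"]*)"', config_text)
    if match is None:
        return None, "no fine_tune_checkpoint entry found"
    path = match.group(1)
    if "PATH_TO_BE_CONFIGURED" in path:
        return path, "placeholder path has not been replaced"
    if _SINGLE_FILE.search(path):
        return path, "points at a single checkpoint file; use the prefix instead"
    return path, "ok"


good = 'fine_tune_checkpoint: "/home/wpq/data/potato/data/model.ckpt"'
bad = 'fine_tune_checkpoint: "/home/wpq/data/potato/data/model.ckpt.data-00000-of-00001"'
print(check_fine_tune_checkpoint(good)[1])
print(check_fine_tune_checkpoint(bad)[1])
```

The check only looks at the path string; it does not verify that the checkpoint matches the model architecture named in the config.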

@wpq3142
Author

wpq3142 commented Nov 3, 2017

@tombstone

I downloaded the latest model and it is working now. The configuration is as follows:
--clone_on_cpu true
--logtostderr
--pipeline_config_path /home/wpq/data/potato/model/faster_rcnn_nas_coco.config
--train_dir /home/wpq/data/potato/model/train

One cause seems to be that I was missing a space between keys and values.

@aselle aselle removed the stat:awaiting response Waiting on input from the contributor label Nov 3, 2017
@paulrich1234

You just need to restore from the .ckpt prefix, not the .ckpt.meta file.
Something like this:
sess = tf.Session()
saver.restore(sess, 'mymodel/model100-500-0.998.ckpt')

@pbashivan

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.
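
As a minimal sketch of that rule, the helper below strips the per-file suffix so that whatever path you were given becomes a prefix. The function name checkpoint_prefix is hypothetical (it is not part of TensorFlow), and the suffix list is an assumption based on the file names shown in this thread.

```python
import re

# Suffixes appended to a checkpoint prefix: data shards, the index, and the meta graph.
_SUFFIX = re.compile(r"\.(data-\d+-of-\d+|index|meta)$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix for a path that may name one checkpoint file."""
    return _SUFFIX.sub("", path)


print(checkpoint_prefix("model.ckpt.data-00000-of-00001"))  # model.ckpt
print(checkpoint_prefix("model.ckpt"))                      # model.ckpt (already a prefix)
```

The result is what you would pass to calls such as saver.restore or tf.train.NewCheckpointReader in place of the full file name.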

@praneethpj

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.

@pbashivan thank you so much

@shellyfung

shellyfung commented Mar 21, 2019

I fixed the issue like this:
replace model.ckpt with model.ckpt-200000,
where 200000 is your checkpoint number.

@codexponent

Solved on #7696

@Rajamohanreddyai

Hello all, just follow the video below and you can export your own model within 10 seconds.

https://youtu.be/w0Ebsbz7HYA

@phosseini

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.

This works. In my case, I used the longest common prefix among my checkpoint-related files, which was model.ckpt-1000000. I had the following three files in my folder:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

I just thought this might be the case for some folks.
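
For what it's worth, the longest-common-prefix idea can be sketched with the standard library alone. The function name common_checkpoint_prefix is hypothetical, and the approach assumes the list contains the files of a single checkpoint only.

```python
import os


def common_checkpoint_prefix(filenames):
    """Return the longest common prefix of one checkpoint's files, without the trailing dot."""
    # os.path.commonprefix compares character by character, not path component by component.
    prefix = os.path.commonprefix(list(filenames))
    return prefix.rstrip(".")


files = [
    "model.ckpt-1000000.data-00000-of-00001",
    "model.ckpt-1000000.index",
    "model.ckpt-1000000.meta",
]
print(common_checkpoint_prefix(files))  # model.ckpt-1000000
```

If the folder holds several checkpoints (for example model.ckpt-100 and model.ckpt-200), the common prefix collapses to "model.ckpt-", which is not a valid prefix, so the file list needs to be filtered to one step first.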

@patspeis

patspeis commented Jun 23, 2019

I was running into this, and the prefix approach worked for me. All I had to do was run the following on my Windows 10 x64 machine:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path ssd_mobilenet_v1_coco.config --trained_checkpoint_prefix models\model.ckpt-1000 --output_directory tuned_model

Instead of:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path ssd_mobilenet_v1_coco.config --trained_checkpoint_prefix models\model.ckpt-1000.data-###-### --output_directory tuned_model

tl;dr Don't reference single files in the --trained_checkpoint_prefix flag. Reference only the shared prefix of those three files.

Hope it helps.

@anjani-dhrangadhariya

@phosseini is correct. The model itself is made up of three different files with three different extensions showing what kind of model data each file stores.

For me too, using the longest shared file name prefix solved the issue.

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

@kamrankausar

tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./model_dir/model.ckpt-1000000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

@snrnsrk06

snrnsrk06 commented Apr 23, 2020

I am trying to get an open project to run properly. The code saves files as model-10.data-0000-of-0001, .index, and .meta.
The part of the code that saves and restores the files is shown below:

saver = tf.train.Saver(max_to_keep=50)

if self.pretrained_model is not None:
    print("Start training with pretrained Model..")
    saver.restore(sess, self.pretrained_model)



if (e + 1) % self.save_every == 0:
    saver.save(sess, self.model_path + 'model', global_step=e + 1)
    print("model-%s saved." % (e + 1))

One of the solutions in this issue is to change the file name:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

How should I change the code in my situation? How do I change the file name? It looks like the save method determines the file name automatically. Or should I change the file name manually?

/////////////////////////////////////////////////////////////////////////////////////////////

The save call can be changed to:

if (e + 1) % self.save_every == 0:
    saver.save(sess, self.model_path + 'model.ckpt', global_step=e + 1)
    print("model-%s saved." % (e + 1))

but that alone is not enough. The restore call also has to receive the checkpoint prefix rather than the name of a single file.

Here cur_model is 'model.ckpt-50.data-0000-of-0001' (next to the .index and .meta files):

cur_model2 = cur_model[0:cur_model.find('-') + cur_model[cur_model.find('-'):].find('.')]
saver.restore(sess, self.model_path + cur_model2)

cur_model2 is then 'model.ckpt-50', so only the prefix is passed to restore.
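
Instead of slicing one file name by hand, the prefix of the newest checkpoint can be derived by scanning the directory. The sketch below is a pure-Python stand-in under the assumption that checkpoints are saved as `<name>-<step>.index`; latest_checkpoint_prefix is a hypothetical helper. TensorFlow's own tf.train.latest_checkpoint(checkpoint_dir) serves a similar purpose, but it relies on the `checkpoint` state file written by the saver.

```python
import os
import re
import tempfile

# Matches e.g. "model.ckpt-50.index" and captures the prefix and the numeric step.
_INDEX = re.compile(r"^(?P<prefix>.+-(?P<step>\d+))\.index$")


def latest_checkpoint_prefix(model_dir):
    """Return the prefix (with directory) of the highest-step checkpoint, or None."""
    best = None
    for name in os.listdir(model_dir):
        match = _INDEX.match(name)
        if match:
            step = int(match.group("step"))  # compare numerically, not as strings
            if best is None or step > best[0]:
                best = (step, match.group("prefix"))
    return None if best is None else os.path.join(model_dir, best[1])


# Small demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model.ckpt-10.index", "model.ckpt-50.index",
                 "model.ckpt-50.data-00000-of-00001", "model.ckpt-50.meta"):
        open(os.path.join(tmp, name), "w").close()
    print(os.path.basename(latest_checkpoint_prefix(tmp)))  # model.ckpt-50
```

The returned string is what would be passed to saver.restore(sess, ...).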

@Rajput245

Rajput245 commented May 22, 2020

none of the above worked.
model.ckpt-1000000
model.ckpt-1000000.index
model.ckpt-1000000.meta
solved this problem for me..

@dome272

dome272 commented Aug 25, 2020

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.

you are a legend

@mikelty

mikelty commented Nov 17, 2020

For some models, it could also be caused by a missing .meta file and/or a missing .index file.
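
A small check for missing companion files might look like the sketch below. The helper name missing_checkpoint_files is hypothetical. Note that the .index and .data-* files hold the variables, while the .meta file stores the graph definition and is only needed when the graph is rebuilt from it.

```python
import glob
import os
import tempfile


def missing_checkpoint_files(prefix):
    """Return which of the data, index, and meta files are absent for a checkpoint prefix."""
    missing = []
    # Data may be split into several shards, so match any shard for this prefix.
    if not glob.glob(glob.escape(prefix) + ".data-*-of-*"):
        missing.append("data")
    for ext in ("index", "meta"):
        if not os.path.exists(prefix + "." + ext):
            missing.append(ext)
    return missing


# Small demonstration with empty placeholder files and no .index file.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "model.ckpt-1000")
    for suffix in (".data-00000-of-00001", ".meta"):
        open(prefix + suffix, "w").close()
    print(missing_checkpoint_files(prefix))  # ['index']
```

An empty list means all three kinds of file are present for that prefix; it says nothing about whether their contents are valid.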

@BassantTolba1234

Hello all,
After training in my TensorFlow session, my files are not named with .ckpt, as in
model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta
but instead
Pretrained.data-00000-of-00001
Pretrained.index
Pretrained.meta
What should I do to solve the data loss error above with these saved files?

@saramsv

saramsv commented Apr 3, 2021

none of the above worked.
model.ckpt-1000000
model.ckpt-1000000.index
model.ckpt-1000000.meta
solved this problem for me..

@Rajput245 I have the same problem. Were you able to fix it?

@joan-yanqiong

joan-yanqiong commented Jan 19, 2022

Hi guys, I don't know if it is still a problem for you, but I had the following files:
model.ckpt-100000.data-00000-of-00001
model.ckpt-100000.index
model.ckpt-100000.meta

The following code worked for me:

import tensorflow.compat.v1 as tf
import tf_slim as slim

# Pass the checkpoint prefix as a string, not the name of a single file.
checkpoint_path = "absolute_path_to/model.ckpt-100000"

init_fn = slim.assign_from_checkpoint_fn(
        checkpoint_path, slim.get_model_variables(model_variables))
sess = tf.Session()
init_fn(sess)

I hope this helps you!

@pinzhi000

In my situation I don't have "ckpt" at all.

I just have the following 2 files:
image

What do I do?

@joan-yanqiong

I would maybe try to just add the ckpt after 'variables'.

@pinzhi000

I just resolved this issue. I saved the model as a .h5 file and that worked.

@yohannesSM

yohannesSM commented Jul 23, 2022

import tensorflow as tf
from tensorflow.python.training import checkpoint_utils as cp

# Use only the model name up to the .ckpt part; do not append the other suffixes
# (such as .data-00000-of-00001).
print(cp.list_variables('path/model_name.ckpt'))
