dataset #2
Thank you for your reply. I downloaded the dataset from your link; it contains 17263 PDB structures in total. However, the previous work (MutCompute) described in the paper provided 19427 structures, and after processing, this paper has only 16569 structures, which differs from 17263. What is the reason? @clauswilke
Different filtering compared to previous works. From the paper:
OK, I understand that point. However, 17263 PDB structures were downloaded from the link, whereas the paper describes training the model on 16569 PDB structures. What is the reason? @clauswilke
Ah, that's a question for @danny305. I would suspect the 17263 are before filtering for similarity to the PSICOV dataset.
Thank you for your reply. I think you may be right.
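To make the gap discussed above concrete, here is a minimal Python sketch. It only uses the two counts quoted in this thread; the helper `removed_ids` and the example IDs are hypothetical and are not taken from the repository. Assuming both the downloaded ID list and the training ID list were available, a set difference would show exactly which structures were dropped.

```python
def removed_ids(downloaded_ids, training_ids):
    """Return the IDs that are in the download but not in the training list.

    IDs are compared case-insensitively with surrounding whitespace removed.
    """
    def normalize(ids):
        return {i.strip().upper() for i in ids if i.strip()}

    return sorted(normalize(downloaded_ids) - normalize(training_ids))


# Counts quoted in this thread.
n_downloaded = 17263
n_training = 16569
n_unaccounted = n_downloaded - n_training  # structures not in the training set

# Hypothetical ID lists, for illustration only.
example_missing = removed_ids(["1abc", "2XYZ ", "3def"], ["1ABC", "3DEF"])
```

If the suspicion about the PSICOV similarity filter is right, the IDs returned by such a comparison would be the structures that filter removed.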
I have reviewed the dataset construction method in your previous work (MutCompute) and found it similar to the one in this paper. Both filter by a resolution of 2.5 Å and a sequence similarity of 50%, but the number of structures after filtering is different. Are there any subtle differences that I didn't notice? For reference, the MutCompute paper states:

"Additionally, deposited crystallographic structures are refined by algorithms of their time which are not necessarily the current state of the art. To improve data set composition and uniformity, we gathered all PDB structures with less than 2.5 Å resolution and at most 50% sequence similarity and drew from structures in the PDB-REDO database, where existing protein structures are refined in a uniform manner.13 These two changes in data consistency resulted in 19 436 structures for training with 300 of these structures held for out-of-sample testing and increased wild-type prediction accuracy to 63%"

Can you provide the PDB IDs of the training set and validation set used in the paper? @akulikova64 @danny305 @clauswilke
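One kind of subtle difference that can change a structure count is whether a cutoff is applied strictly or inclusively. The sketch below is an illustration only: the records and the function `filter_by_resolution` are hypothetical, and nothing here is confirmed to be how either paper built its dataset. The 50% sequence-similarity step is not shown, since it requires a separate clustering tool.

```python
def filter_by_resolution(records, cutoff=2.5, inclusive=False):
    """Keep (pdb_id, resolution_in_angstrom) records that pass the cutoff.

    With inclusive=False a structure must be strictly better (lower) than
    the cutoff; with inclusive=True a structure exactly at the cutoff passes.
    """
    if inclusive:
        return [r for r in records if r[1] <= cutoff]
    return [r for r in records if r[1] < cutoff]


# Hypothetical records, for illustration only.
records = [("1ABC", 1.8), ("2XYZ", 2.5), ("3DEF", 3.0)]

strict = filter_by_resolution(records)                     # "less than 2.5 Å"
inclusive = filter_by_resolution(records, inclusive=True)  # "2.5 Å or better"
```

On this toy input the two conventions differ by one structure, the one sitting exactly at 2.5 Å. Differences in the PDB snapshot date or in the similarity clustering could shift counts in the same way.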