Renumbering residues #51

ajasja · 2018-05-25T10:31:44Z

Hi! Nice library, has a lot of potential.

Is there a way to renumber residues?
Renumbering atoms seems trivial (just assign a range to the atom_number), however renumbering residues would probably require some heavy duty group_by magic and could be built in.
(renumbering atoms could also be built in:)
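
A minimal sketch of the atom renumbering mentioned above, assuming the ATOM records sit in a pandas DataFrame with an `atom_number` column, as in `ppdb.df['ATOM']`. The toy frame and the helper name `renumber_atoms` are illustrative only and are not part of BioPandas.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM']; a real frame has many more columns.
atoms = pd.DataFrame({
    "atom_number": [5, 6, 9, 12],
    "atom_name": ["N", "CA", "C", "O"],
})

def renumber_atoms(df, start=1):
    """Hypothetical helper: return a copy numbered start, start+1, ... in row order."""
    out = df.copy()
    out["atom_number"] = range(start, start + len(out))
    return out

renumbered = renumber_atoms(atoms)
print(renumbered["atom_number"].tolist())
```

Working on a copy leaves the original frame untouched, so the old numbering stays available if other records (for example ANISOU) still refer to it.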

rasbt · 2018-05-25T16:09:44Z

Hi,

It looks like I never implemented anything like that.

There are some easy ways to do that using pandas base functionality. E.g., to decrease the residue numbers by 1 you can simply do

ppdb.df['ATOM']['residue_number'] -= 1

However, if the residue numbers are not in order, or if there are gaps, like (1, 2, 10, 20), which you want to renumber to (1, 2, 3, 4), you would have to do it differently. E.g.,

you could first get all the unique residue numbers in the order they appear:

from collections import OrderedDict

ordered_unique_elements = \
    list(OrderedDict.fromkeys(ppdb.df['ATOM']['residue_number']))

and then map from the old residue numbers to the new, contiguous residue numbers:

mapping_dict = {old: new
                for new, old in enumerate(ordered_unique_elements, start=1)}

ppdb.df['ATOM']['residue_number'] = \
    ppdb.df['ATOM']['residue_number'].map(mapping_dict)
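
The same order-preserving renumbering can also be sketched with `pd.factorize`, which labels values in order of first appearance. This is an alternative to the dictionary approach above, shown on a toy frame that stands in for `ppdb.df['ATOM']`.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] with gaps in the residue numbering.
atoms = pd.DataFrame({"residue_number": [1, 1, 2, 10, 10, 20]})

# factorize returns 0-based integer codes in order of first appearance,
# plus the unique original values in that same order.
codes, uniques = pd.factorize(atoms["residue_number"])
atoms["residue_number"] = codes + 1

print(atoms["residue_number"].tolist())
print(list(uniques))
```

Like the dictionary version, this keys on the residue number alone, so identical numbers in different chains would collapse onto the same new number unless each chain is handled separately.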

I could actually add that as a method to BioPandas, or maybe just explain it in documentation. What do you think?

ajasja · 2018-05-26T16:38:35Z

Wow, that is some elegant python code!
I would recommend adding a renumber_residues method to biopandas. I would expect this is a common enough operation. For completeness I would also add a renumber_atoms.

Do you think both methods should handle renumbering the ANISOU records at the same time? Otherwise the records might go out of sync.
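
One way to keep the records in sync is to build the old-to-new mapping once and apply it to every record type. The sketch below assumes the ANISOU frame carries a `residue_number` column like the ATOM frame; the dict of toy frames stands in for `ppdb.df`, and `renumber_in_sync` is a hypothetical helper, not a BioPandas method.

```python
import pandas as pd

# Toy stand-in for ppdb.df, which maps record names to DataFrames.
records = {
    "ATOM": pd.DataFrame({"residue_number": [1, 1, 10, 20]}),
    "ANISOU": pd.DataFrame({"residue_number": [1, 10, 20]}),
}

def renumber_in_sync(frames, keys=("ATOM", "ANISOU")):
    """Hypothetical helper: derive the mapping from the first key, apply it to all keys."""
    ordered = list(dict.fromkeys(frames[keys[0]]["residue_number"]))
    mapping = {old: new for new, old in enumerate(ordered, start=1)}
    for key in keys:
        # Numbers missing from the mapping would become NaN here.
        frames[key]["residue_number"] = frames[key]["residue_number"].map(mapping)
    return mapping

mapping = renumber_in_sync(records)
print(mapping)
```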

What I'm trying to achieve is to split a PDB by chains, reorder the chains and combine them into a different PDB. I've also looked at pdbtools, but that is more command-line based and I'd like to do this in Python code.
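
A rough sketch of that split, reorder and recombine workflow in plain pandas, assuming a `chain_id` column as in `ppdb.df['ATOM']`. The helper name `reorder_chains` is hypothetical, and HETATM and TER records are ignored here.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] with three chains.
atoms = pd.DataFrame({
    "chain_id": ["A", "A", "B", "C", "C"],
    "residue_number": [1, 2, 1, 1, 2],
    "atom_number": [1, 2, 3, 4, 5],
})

def reorder_chains(df, chain_order):
    """Hypothetical helper: stack the chains in chain_order and renumber the atoms."""
    pieces = {chain: part for chain, part in df.groupby("chain_id", sort=False)}
    reordered = pd.concat([pieces[c] for c in chain_order], ignore_index=True)
    reordered["atom_number"] = range(1, len(reordered) + 1)
    return reordered

result = reorder_chains(atoms, ["C", "A", "B"])
print(result)
```

The result could then be assigned back to `ppdb.df['ATOM']` before writing the structure out.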

wojdyr · 2018-05-31T14:03:53Z

@rasbt when you re-number sequentially you should also consider insertion codes (i.e. get all unique residue numbers + icodes, assign new numbers and remove the insertion codes).
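
A minimal sketch of that idea, assuming `residue_number` and `insertion` columns as in `ppdb.df['ATOM']`: each (number, insertion code) pair is treated as one residue, numbered in order of first appearance, and the insertion codes are then cleared.

```python
import pandas as pd

# Toy frame: residue 52 is followed by an inserted residue 52A.
atoms = pd.DataFrame({
    "residue_number": [52, 52, 52, 53, 53],
    "insertion": ["", "A", "A", "", ""],
})

# One key per atom; missing insertion codes are treated as blank.
keys = list(zip(atoms["residue_number"], atoms["insertion"].fillna("")))

# dict.fromkeys keeps first-appearance order, so this numbers residues 1, 2, 3, ...
mapping = {key: i for i, key in enumerate(dict.fromkeys(keys), start=1)}

atoms["residue_number"] = [mapping[k] for k in keys]
atoms["insertion"] = ""  # insertion codes are no longer needed

print(atoms["residue_number"].tolist())
```

Using tuples as keys avoids gluing the number and the code into a single string, at the cost of a Python-level loop over the atoms.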

rasbt · 2018-05-31T15:55:52Z

Good point. Yeah, with the renumbering, there are so many things to consider, all of which are pretty use-case specific. (Probably why I haven't made such a function/method in the past).

I am still thinking whether a standardized renumbering method should be added vs extending the documentation with easy-to-follow examples that give people more flexibility ...

drewaight · 2020-01-17T00:34:45Z

I would second a renumbering function, especially for antibody sequences. The insertion codes make it pretty difficult.

rasbt · 2020-01-18T04:23:09Z

Sounds good, I agree. I am currently caught up with a pretty long to-do list of other things (and the semester is going to start Tue), so I am not sure yet when I will get to this. If someone wants to take a crack at it, I welcome PRs.

drewaight · 2020-01-19T17:56:22Z

Insertion codes were never much of an issue for me until I started working with antibodies, where they are everything (different programs use different numbering, it's a nightmare!). Anyway, with the help of Stack Overflow I was able to figure this out (https://stackoverflow.com/questions/59804249/mapping-tuple-dictionary-to-multiple-columns-of-a-dataframe). I will make a PR when I'm less embarrassed by my brute-force methods and ugly code. For now, here are my notes.

ppdb.amino3to1 will 'cut out' duplicate residue numbers with insertions. Your sequence needs to be rid of insertion codes (unique 'residue_number') for the sequence to be returned correctly. For an antibody complex, for instance, I split off the heavy and light chain sequences from ppdb.df['ATOM'] into separate dataframes and renumbered them sequentially, without insertion codes, using the following function (inspired by Sebastian's code above):

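def seq_order(df):
    # Renumber residues 1, 2, 3, ... in order of appearance, treating each
    # residue number + insertion code combination as a separate residue.
    from collections import OrderedDict
    df['residue_insertion'] = df['residue_number'].astype(str)+df['insertion'].astype(str)
    ordered_seq = list(OrderedDict.fromkeys(df['residue_insertion']))
    seq_dict = {ordered_seq[i]: i+1 for i in range(0, len(ordered_seq))}
    df['residue_insertion'] = df['residue_insertion'].map(seq_dict)
    df['residue_number'] = df['residue_insertion']
    df.drop(['residue_insertion'], axis=1, inplace=True)
    # Clear the insertion codes; an empty string (rather than None) keeps
    # the column a plain string column.
    df['insertion'] = ''
    return df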

I added the renumbered heavy and light chain dataframes back into ppdb.df['ATOM'] to run ppdb.amino3to1() (I think this function only works on a PandasPdb object and not on subset dataframes).

I got my renumbering script (Anarci) to output a dataframe with the columns 'residue_number', 'insertion', 'new_res' and 'new_ins':

   residue_number           insertion        new_res      new_ins
0               2                                1         
1               3                                2        
2               3                 A              3        
3               5                                4              A

Then I do a left merge back into the corresponding heavy or light chain dataframe (I'm still a little unsure how this works), drop the original residue numbers and rename the new ones. Finally, I merge the whole thing back into the PandasPdb and write it out.
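
The merge step might look roughly like the sketch below, assuming a per-chain frame and a lookup table shaped like the one shown above. Both frames are toy data, and blank insertion codes are assumed to be empty strings in both, since the merge needs them to match exactly.

```python
import pandas as pd

# Toy per-chain frame standing in for a subset of ppdb.df['ATOM'].
chain = pd.DataFrame({
    "residue_number": [2, 3, 3, 3, 5],
    "insertion": ["", "", "A", "A", ""],
    "atom_name": ["CA", "CA", "N", "CA", "CA"],
})

# Lookup table from the renumbering tool: one row per residue.
lookup = pd.DataFrame({
    "residue_number": [2, 3, 3, 5],
    "insertion": ["", "", "A", ""],
    "new_res": [1, 2, 3, 4],
    "new_ins": ["", "", "", "A"],
})

# A left merge keeps every atom row (in its original order) and attaches
# the new numbering; validate guards against duplicate lookup rows.
merged = chain.merge(lookup, on=["residue_number", "insertion"],
                     how="left", validate="many_to_one")

# Drop the old numbering and promote the new columns to the standard names.
merged = (merged.drop(columns=["residue_number", "insertion"])
                .rename(columns={"new_res": "residue_number",
                                 "new_ins": "insertion"}))
print(merged)
```

Atoms whose (residue_number, insertion) pair is missing from the lookup table would end up with NaN in the new columns, so it is worth checking for nulls before writing the file.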

I'm sure there's a more elegant way, but please give me a break, I'm a crystallographer. I love BioPandas, by the way. I hope this helps anyone struggling with the same issue.

luwei0917 · 2022-11-24T09:11:35Z

def seq_order(df):
    from collections import OrderedDict
    df['residue_insertion'] = df['residue_number'].astype(str)+df['insertion'].fillna('')
    ordered_seq = list(OrderedDict.fromkeys(df['residue_insertion']))
    seq_dict = {ordered_seq[i]: i+1 for i in range(0, len(ordered_seq))}
    df['residue_insertion'] = df['residue_insertion'].map(seq_dict)
    df['residue_number'] = df['residue_insertion']
    df.drop(['residue_insertion'], axis=1, inplace = True)

    return df

is slightly better in my opinion.

johnnytam100 · 2023-04-18T04:45:53Z

Same request for such a built-in feature.

rasbt added the enhancement label May 31, 2018
