pypath.utils.mapping.Mapper§

class pypath.utils.mapping.Mapper(ncbi_tax_id=None, cleanup_period=10, lifetime=300, translate_deleted_uniprot=None, keep_invalid_uniprot=None, trembl_swissprot_by_genesymbol=None)[source]§

Bases: Logger

__init__(ncbi_tax_id=None, cleanup_period=10, lifetime=300, translate_deleted_uniprot=None, keep_invalid_uniprot=None, trembl_swissprot_by_genesymbol=None)[source]§
cleanup_period : int

Periodically check and remove unused mapping data. Time in seconds. If None, tables are kept forever.

lifetime : int

If a table has not been used for longer than this period, it is removed at the next cleanup.

translate_deleted_uniprot : bool

Make an extra attempt to translate deleted or obsolete UniProt IDs by retrieving their archived datasheet and using the gene symbol to find the corresponding valid UniProt ID.

keep_invalid_uniprot : bool

If the target ID is UniProt, keep the results if they fit the format for UniProt IDs (we won’t check if they are deleted or from a different taxon). The alternative is to keep only those which are in the list of all UniProt IDs for the given organism.

trembl_swissprot_by_genesymbol : bool

Attempt to translate TrEMBL IDs to SwissProt by translating to gene symbols and then to SwissProt.

Methods

__init__([ncbi_tax_id, cleanup_period, ...])

cleanup_period : int

chain_map(name, id_type, by_id_type, ...[, ...])

Translate IDs which cannot be directly translated in two steps: from id_type to by_id_type and from there to target_id_type.

create_reverse(key)

Creates a mapping table with id_type and target_id_type (i.e. direction of the ID translation) swapped.

deleted_uniprot_genesymbol(uniprot)

get_table_key(id_type, target_id_type[, ...])

Returns a tuple unambiguously identifying a mapping table.

guess_type(name[, entity_type])

From a string, tries to guess the ID type and optionally the entity type.

has_mapping_table(id_type, target_id_type[, ...])

Tells if a mapping table is loaded.

id_from_label(label[, label_id_type, ...])

id_from_label0(label[, label_id_type, ...])

id_types()

A list of all identifier types that can be handled by any of the resources.

identifier(label[, ncbi_tax_id, id_type, ...])

For a label returns the corresponding primary identifier.

identifier0(label[, ncbi_tax_id, id_type, ...])

label(name[, entity_type, id_type, ncbi_tax_id])

For any kind of entity, either protein, miRNA or protein complex, returns the preferred human readable label.

load_genesymbol5([ncbi_tax_id])

Creates a Gene Symbol to UniProt mapping table with the first 5 characters of each Gene Symbol.

load_mapping(resource, **kwargs)

Loads a single mapping table based on input definition in resource.

load_uniprot_static(keys[, ncbi_tax_id])

Loads mapping tables from the huge static mapping file from UniProt.

map_name(name[, id_type, target_id_type, ...])

Translates one instance of one ID type to a different one.

map_name0(name[, id_type, target_id_type, ...])

Translates the name and returns only one of the resulting IDs.

map_names(names[, id_type, target_id_type, ...])

Same as map_name but translates multiple IDs at once.

mapping_tables()

List of mapping tables available to load.

only_uniprot_ac(uniprots)

For one or more strings returns only those which match the format of UniProt accession numbers.

only_valid_uniprots(uniprots[, ncbi_tax_id])

other_organism_uniprot(uniprot[, ncbi_tax_id])

Tells if uniprot is a UniProt ID from an organism other than ncbi_tax_id.

primary_uniprot(uniprots[, ncbi_tax_id])

For an iterable of UniProt IDs returns a set with the secondary IDs changed to the corresponding primary IDs.

reload()

Reload the class from the module level.

remove_expired()

Removes tables last used a longer time ago than their lifetime.

remove_key(key)

Removes the table identified by key, if it exists.

remove_table(id_type, target_id_type, ...)

Removes the table defined by the ID types and organism.

reverse_key(key)

For a mapping table key returns a new key with the identifiers reversed.

reverse_mapping(mapping_table)

Creates an opposite direction MappingTable by swapping the dictionary inside an existing MappingTable object.

swissprots(uniprots[, ncbi_tax_id])

Creates a dict translating a set of potentially secondary and non-reviewed UniProt IDs to primary SwissProt IDs (whenever possible).

translate_deleted_uniprot_by_genesymbol(uniprot)

Because the translation is potentially ambiguous, always returns a set.

translate_deleted_uniprots_by_genesymbol(...)

translation_df(id_type, target_id_type[, ...])

Translation table as a data frame.

translation_dict(id_type, target_id_type[, ...])

Translation table as a dict.

trembl_swissprot(uniprots[, ncbi_tax_id])

For an iterable of TrEMBL and SwissProt IDs, returns a set with only SwissProt, mapping from TrEMBL to gene symbols, and then back to SwissProt.

uniprot_cleanup(uniprots[, ncbi_tax_id])

We use this function as a standard callback when the target ID type is UniProt.

valid_uniprot(uniprot[, ncbi_tax_id])

If the UniProt ID uniprot exists in the proteome of the organism ncbi_tax_id, returns the ID; otherwise returns None.

which_table(id_type, target_id_type[, load, ...])

Returns the table which is suitable to convert an ID of id_type to target_id_type.

Attributes

default_label_types

default_name_types

label_type_to_id_type

chain_map(name, id_type, by_id_type, target_id_type, ncbi_tax_id=None, **kwargs)[source]§

Translate IDs which cannot be directly translated in two steps: from id_type to by_id_type and from there to target_id_type.

Args

name (str): The original name to be converted.

id_type (str): The type of the name.

by_id_type (str): The intermediate name type.

target_id_type (str): The name type to translate to; more or less the same values are available as for id_type.

ncbi_tax_id (int): The NCBI Taxonomy identifier of the organism.

kwargs: Passed to map_name.

Returns

Set of IDs of type target_id_type.
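The two-step translation can be pictured with plain dictionaries. The following is a minimal sketch, not pypath's implementation: chain_translate and both toy tables are hypothetical names, and the IDs in them are placeholders rather than real identifiers.

```python
def chain_translate(name, first_table, second_table):
    """Translate `name` in two steps through an intermediate ID type.

    Both tables are hypothetical stand-ins for mapping tables: dicts
    that map one ID to a set of IDs, since translation can be ambiguous.
    """
    result = set()
    # Step 1: id_type -> by_id_type
    for intermediate in first_table.get(name, set()):
        # Step 2: by_id_type -> target_id_type
        result |= second_table.get(intermediate, set())
    return result


# Toy data (placeholders, not real mappings)
a_to_b = {'a1': {'b1', 'b2'}}
b_to_c = {'b1': {'c1'}, 'b2': {'c2', 'c3'}}

translated = chain_translate('a1', a_to_b, b_to_c)
```

An ID missing from the first table yields an empty set, which mirrors the set return type described above.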

create_reverse(key)[source]§

Creates a mapping table with id_type and target_id_type (i.e. direction of the ID translation) swapped.

get_table_key(id_type, target_id_type, ncbi_tax_id=None)[source]§

Returns a tuple unambiguously identifying a mapping table.

guess_type(name, entity_type=None)[source]§

From a string, tries to guess the ID type and optionally the entity type. Returns a tuple of strings: ID type and entity type.

has_mapping_table(id_type, target_id_type, ncbi_tax_id=None)[source]§

Tells if a mapping table is loaded. If it’s loaded, it resets the expiry timer so the table remains loaded.

Returns

(bool): True if the mapping table is loaded.

classmethod id_types()[source]§

A list of all identifier types that can be handled by any of the resources.

Returns

(list): A list of tuples with the identifier type labels used in pypath and in the original resource. If the latter is None, typically the ID type has no name in the original resource.

identifier(label: str | Iterable[str], ncbi_tax_id: int | None = None, id_type: str | None = None, entity_type: Literal['drug', 'lncrna', 'mirna', 'protein', 'small_molecule'] | None = None) Set[str] | List[Set[str]][source]§

For a label returns the corresponding primary identifier. The type of default identifiers is determined by the settings module. Note that this kind of translation is not always unambiguous: one gene symbol might correspond to multiple UniProt IDs.

label(name, entity_type=None, id_type=None, ncbi_tax_id=None)[source]§

For any kind of entity, either protein, miRNA or protein complex, returns the preferred human readable label. For proteins this means Gene Symbols, for miRNAs miRNA names, for complexes a series of Gene Symbols.

load_genesymbol5(ncbi_tax_id=None)[source]§

Creates a Gene Symbol to UniProt mapping table with the first 5 characters of each Gene Symbol.
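As an illustration of what a table keyed by the first 5 characters looks like, here is a minimal sketch. It is not the library's code: build_prefix_table is a hypothetical helper, and the gene symbols and accessions are made-up placeholders.

```python
from collections import defaultdict


def build_prefix_table(genesymbol_to_uniprot, length=5):
    """Build a table keyed by the first `length` characters of each symbol."""
    table = defaultdict(set)
    for symbol, uniprots in genesymbol_to_uniprot.items():
        # Symbols sharing a prefix are merged under the same key
        table[symbol[:length]] |= set(uniprots)
    return dict(table)


# Placeholder symbols and accessions, for illustration only
toy_table = {
    'ABCDEF1': {'X00001'},
    'ABCDEF2': {'X00002'},
    'XYZ': {'X00003'},
}
prefix_table = build_prefix_table(toy_table)
```

Symbols shorter than five characters keep their full name as the key, and symbols sharing a prefix become ambiguous, which is why such a table is only suitable as a fallback.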

load_mapping(resource, **kwargs)[source]§

Loads a single mapping table based on input definition in resource. **kwargs are passed to MapReader.

load_uniprot_static(keys, ncbi_tax_id=None)[source]§

Loads mapping tables from the huge static mapping file from UniProt. It takes a long time to download and process, and also requires more memory. This is the last thing we try if everything else has failed.

map_name(name, id_type=None, target_id_type=None, ncbi_tax_id=None, strict=False, expand_complexes=True, uniprot_cleanup=True)[source]§

Translates one instance of one ID type to a different one. Returns set of the target ID type.

This function should be used to convert individual IDs. It takes care of everything and ideally you don’t need to think about the details.

How it works: it looks up dictionaries between the original and target ID types; if none is found, it attempts to load one from the predefined inputs. If the original name is a genesymbol, it first looks it up among the preferred gene names from UniProt; if not found, it tries the alternative gene names. If the gene symbol still cannot be found and strict = False, as a last attempt only the first 5 characters of the gene symbol are matched. If the target name type is uniprot, it converts all the ACs to primary. Then, for the TrEMBL IDs it looks up the preferred gene names and finds SwissProt IDs with the same preferred gene name.

Args

name (str): The original name to be converted.

id_type (str): The type of the name. Available by default:

  • genesymbol (gene name)

  • entrez (Entrez Gene ID [#])

  • refseqp (NCBI RefSeq Protein ID [NP_*|XP_*])

  • ensp (Ensembl protein ID [ENSP*])

  • enst (Ensembl transcript ID [ENST*])

  • ensg (Ensembl genomic DNA ID [ENSG*])

  • hgnc (HGNC ID [HGNC:#])

  • gi (GI number [#])

  • embl (DDBJ/EMBL/GenBank CDS accession)

  • embl_id (DDBJ/EMBL/GenBank accession)

And many more, see the code of pypath.internals.input_formats.

target_id_type (str): The name type to translate to; more or less the same values are available as for id_type.

ncbi_tax_id (int): NCBI Taxonomy ID of the organism.

strict (bool): In case a Gene Symbol cannot be translated, try to add number “1” to the end, or try to match only its first five characters. This option is rarely used, but it makes it possible to translate some non-standard gene names typically found in old, unmaintained resources.

expand_complexes (bool): When encountering complexes, translate the IDs of their components and return a set of IDs. The alternative behaviour is to return the Complex objects.

uniprot_cleanup (bool): When the target_id_type is UniProt ID, call the uniprot_cleanup function at the end.
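The gene symbol lookup order described for map_name (preferred names, then alternative names, then the first 5 characters when strict is False) can be sketched with toy dictionaries. This is only an illustration under stated assumptions: lookup_genesymbol and the three tables are hypothetical, the IDs are placeholders, and the real method does considerably more (table loading, complexes, UniProt cleanup).

```python
def lookup_genesymbol(symbol, preferred, alternative, prefix5, strict=False):
    """Look up a gene symbol with progressively looser matching.

    All three tables are hypothetical dicts mapping a key to a set of IDs.
    """
    # 1. Preferred gene names, 2. alternative gene names
    for table in (preferred, alternative):
        if symbol in table:
            return set(table[symbol])
    # 3. Last attempt, only when not strict: match the first 5 characters
    if not strict:
        return set(prefix5.get(symbol[:5], set()))
    return set()


# Placeholder data, for illustration only
preferred = {'GENE1': {'X00001'}}
alternative = {'OLDNAME': {'X00001'}}
prefix5 = {'GENE1': {'X00001'}, 'LONGN': {'X00002'}}

loose = lookup_genesymbol('LONGNAME_V2', preferred, alternative, prefix5)
exact_only = lookup_genesymbol(
    'LONGNAME_V2', preferred, alternative, prefix5, strict=True
)
```

With strict=True the non-standard name is not translated at all, while the default setting falls back to the prefix table.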

map_name0(name, id_type=None, target_id_type=None, ncbi_tax_id=None, strict=False, expand_complexes=None, uniprot_cleanup=None)[source]§

Translates the name and returns only one of the resulting IDs. This means that in case of ambiguous ID translation, a random one of them will be picked and returned. Recommended only if the translation between the given ID types is mostly unambiguous and the loss of information can be ignored. See more details at map_name.
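Picking a single element from a set of translated IDs can be sketched as follows. This is an illustration rather than the method's actual code; pick_one is a hypothetical helper, and returning None for an empty set is an assumption.

```python
def pick_one(ids):
    """Return an arbitrary element of `ids`, or None if it is empty.

    Returning None for an empty input is an assumption of this sketch.
    """
    return next(iter(ids), None)


# Placeholder IDs: which one is returned is not defined
ambiguous = {'X00001', 'X00002'}
chosen = pick_one(ambiguous)
```

Because sets are unordered, the element returned for an ambiguous translation should not be relied upon.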

map_names(names, id_type=None, target_id_type=None, ncbi_tax_id=None, strict=False, expand_complexes=True, uniprot_cleanup=True)[source]§

Same as map_name but translates multiple IDs at once. These two functions could be seamlessly implemented as one; still, I created separate functions to always make it explicit if a set of translated IDs comes from multiple original IDs.

Args

names (iterable): The original names to be converted.

id_type (str): The type of the names. Available by default:

  • genesymbol (gene name)

  • entrez (Entrez Gene ID [#])

  • refseqp (NCBI RefSeq Protein ID [NP_*|XP_*])

  • ensp (Ensembl protein ID [ENSP*])

  • enst (Ensembl transcript ID [ENST*])

  • ensg (Ensembl genomic DNA ID [ENSG*])

  • hgnc (HGNC ID [HGNC:#])

  • gi (GI number [#])

  • embl (DDBJ/EMBL/GenBank CDS accession)

  • embl_id (DDBJ/EMBL/GenBank accession)

And many more, see the code of pypath.internals.input_formats.

target_id_type (str): The name type to translate to; more or less the same values are available as for id_type.

ncbi_tax_id (int): NCBI Taxonomy ID of the organism.

strict (bool): In case a Gene Symbol cannot be translated, try to add number “1” to the end, or try to match only its first five characters. This option is rarely used, but it makes it possible to translate some non-standard gene names typically found in old, unmaintained resources.

expand_complexes (bool): When encountering complexes, translate the IDs of their components and return a set of IDs. The alternative behaviour is to return the Complex objects.

uniprot_cleanup (bool): When the target_id_type is UniProt ID, call the uniprot_cleanup function at the end.
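Conceptually, translating several names at once amounts to pooling the per-name results into one set. A minimal sketch, assuming a hypothetical translate_one callable that returns a set for a single name (translate_many and the toy table are illustrative names, not part of pypath):

```python
def translate_many(names, translate_one):
    """Union of the translations of each name in `names`."""
    result = set()
    for name in names:
        result |= translate_one(name)
    return result


# Placeholder table and single-name translator, for illustration only
toy_table = {'a1': {'b1'}, 'a2': {'b2', 'b3'}}


def translate_one(name):
    return toy_table.get(name, set())


pooled = translate_many(['a1', 'a2', 'missing'], translate_one)
```

The pooled set no longer records which original name produced which ID, which is the loss of provenance the separate function name makes explicit.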

static mapping_tables()[source]§

List of mapping tables available to load.

Returns

(list): A list of tuples, each representing an ID translation table, with the ID types, the data source and the loader class.

only_uniprot_ac(uniprots)[source]§

For one or more strings returns only those which match the format of UniProt accession numbers. The format is defined here: https://www.uniprot.org/help/accession_numbers

If a string is provided, returns the string or None. If an iterable is provided, returns a set (potentially empty if none of the strings are valid).
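The format check can be sketched with the regular expression published on the UniProt help page linked above. filter_uniprot_ac is a hypothetical stand-in written for this illustration; it is not claimed to be identical to the method's code.

```python
import re

# Pattern as published at https://www.uniprot.org/help/accession_numbers
UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def filter_uniprot_ac(uniprots):
    """Keep only strings that look like UniProt accession numbers.

    A string input gives a string or None; an iterable gives a set.
    """
    if isinstance(uniprots, str):
        return uniprots if UNIPROT_AC.fullmatch(uniprots) else None
    return {u for u in uniprots if UNIPROT_AC.fullmatch(u)}


single = filter_uniprot_ac('P00533')
several = filter_uniprot_ac(['P00533', 'A0A024R161', 'EGFR'])
```

Note that this only validates the format: a string that matches may still be a deleted accession or belong to another organism.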

other_organism_uniprot(uniprot, ncbi_tax_id=None)[source]§

Tells if uniprot is a UniProt ID from an organism other than ncbi_tax_id.

primary_uniprot(uniprots, ncbi_tax_id=None)[source]§

For an iterable of UniProt IDs returns a set with the secondary IDs changed to the corresponding primary IDs. Anything that is not a secondary UniProt ID is left intact.

reload()[source]§

Reload the class from the module level.

remove_expired()[source]§

Removes tables last used a longer time ago than their lifetime.
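The interplay of lifetime, last use and cleanup can be illustrated with a small toy cache. This is a sketch under stated assumptions, not the Mapper's implementation: ExpiringTables and its methods are hypothetical, and the now argument exists only to make the example deterministic.

```python
import time


class ExpiringTables:
    """Toy cache that drops tables unused for longer than `lifetime`."""

    def __init__(self, lifetime=300):
        self.lifetime = lifetime  # seconds
        self.tables = {}
        self.last_used = {}

    def add(self, key, table, now=None):
        self.tables[key] = table
        self.last_used[key] = time.time() if now is None else now

    def get(self, key, now=None):
        # Accessing a table resets its expiry timer
        if key in self.tables:
            self.last_used[key] = time.time() if now is None else now
        return self.tables.get(key)

    def remove_expired(self, now=None):
        now = time.time() if now is None else now
        expired = [
            key for key, used in self.last_used.items()
            if now - used > self.lifetime
        ]
        for key in expired:
            del self.tables[key]
            del self.last_used[key]
        return expired


cache = ExpiringTables(lifetime=300)
cache.add('a', {}, now=0)
cache.add('b', {}, now=0)
cache.get('b', now=200)  # 'b' was used recently, 'a' was not
removed = cache.remove_expired(now=400)
```

In this run only the table that was not touched within the lifetime is removed; the recently used one survives until it, too, goes unused for longer than the lifetime.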

remove_key(key)[source]§

Removes the table identified by key, if it exists.

remove_table(id_type, target_id_type, ncbi_tax_id)[source]§

Removes the table defined by the ID types and organism.

reverse_key(key)[source]§

For a mapping table key returns a new key with the identifiers reversed.

Args

key (tuple): A mapping table key.

Returns

A tuple representing a mapping table key, identifiers swapped.
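Swapping the identifiers of a key can be sketched as below. The layout of the key, a tuple of (id_type, target_id_type, ncbi_tax_id), is an assumption inferred from the arguments of get_table_key, and swap_key is a hypothetical helper.

```python
def swap_key(key):
    """Swap the first two elements of a key tuple, keeping the rest.

    Assumes the key starts with (id_type, target_id_type, ...).
    """
    id_type, target_id_type, *rest = key
    return (target_id_type, id_type, *rest)


reversed_key = swap_key(('genesymbol', 'uniprot', 9606))
```

Applying the swap twice gives back the original key.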

static reverse_mapping(mapping_table)[source]§

Creates an opposite direction MappingTable by swapping the dictionary inside an existing MappingTable object.

Args

mapping_table (MappingTable): A MappingTable object.

Returns

A new MappingTable object.
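Reversing a one-to-many translation dictionary can be sketched with plain dicts of sets. This only illustrates the idea of swapping keys and values; invert is a hypothetical function, the IDs are placeholders, and the real method works on MappingTable objects.

```python
from collections import defaultdict


def invert(mapping):
    """Invert a dict of sets: every target becomes a key pointing
    to the set of sources that mapped to it."""
    inverted = defaultdict(set)
    for source, targets in mapping.items():
        for target in targets:
            inverted[target].add(source)
    return dict(inverted)


# Placeholder IDs, for illustration only
forward = {'a1': {'b1', 'b2'}, 'a2': {'b2'}}
backward = invert(forward)
```

A target shared by several sources ends up mapped to all of them, so ambiguity in one direction carries over to the other.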

swissprots(uniprots, ncbi_tax_id=None)[source]§

Creates a dict translating a set of potentially secondary and non-reviewed UniProt IDs to primary SwissProt IDs (whenever possible).

translate_deleted_uniprot_by_genesymbol(uniprot, ncbi_tax_id=None)[source]§

Because the translation is potentially ambiguous, always returns a set.

translation_df(id_type: str, target_id_type: str, ncbi_tax_id: int | None = None) DataFrame | None[source]§

Translation table as a data frame.

translation_dict(id_type: str, target_id_type: str, ncbi_tax_id: int | None = None) MappingTable | None[source]§

Translation table as a dict.

trembl_swissprot(uniprots, ncbi_tax_id=None)[source]§

For an iterable of TrEMBL and SwissProt IDs, returns a set with only SwissProt, mapping from TrEMBL to gene symbols, and then back to SwissProt. If this kind of translation is not successful for an ID, it will be kept in the result even if it’s not a SwissProt ID. If the

uniprot_cleanup(uniprots, ncbi_tax_id=None)[source]§

We use this function as a standard callback when the target ID type is UniProt. It checks if the format of the IDs is correct and if they are part of the organism proteome, and attempts to translate secondary and deleted IDs to their primary, recent counterparts.

Args

uniprots (str,set): One or more UniProt IDs.

ncbi_tax_id (int): The NCBI Taxonomy identifier of the organism.

Returns

Set of checked and potentially translated UniProt IDs. Elements which do not fit the criteria will be discarded.

valid_uniprot(uniprot, ncbi_tax_id=None)[source]§

If the UniProt ID uniprot exists in the proteome of the organism ncbi_tax_id, returns the ID; otherwise returns None.

which_table(id_type, target_id_type, load=True, ncbi_tax_id=None)[source]§

Returns the table which is suitable to convert an ID of id_type to target_id_type. If no such table has been loaded yet, it attempts to load one from UniProt. If all attempts fail, returns None.