pypath.utils.mapping.Mapper
- class pypath.utils.mapping.Mapper(ncbi_tax_id=None, cleanup_period=10, lifetime=300, translate_deleted_uniprot=None, keep_invalid_uniprot=None, trembl_swissprot_by_genesymbol=None)
Bases: Logger
- __init__(ncbi_tax_id=None, cleanup_period=10, lifetime=300, translate_deleted_uniprot=None, keep_invalid_uniprot=None, trembl_swissprot_by_genesymbol=None)
- cleanup_period (int)
Interval, in seconds, at which unused mapping data is checked for and removed. If None, tables are kept forever.
- lifetime (int)
If a table has not been used for longer than this period, it is removed at the next cleanup.
- translate_deleted_uniprot (bool)
Make an extra attempt to translate deleted or obsolete UniProt IDs by retrieving their archived datasheet and using the gene symbol to find the corresponding valid UniProt ID.
- keep_invalid_uniprot (bool)
If the target ID is UniProt, keep the results if they fit the format for UniProt IDs (we won’t check if they are deleted or from a different taxon). The alternative is to keep only those which are in the list of all UniProt IDs for the given organism.
- trembl_swissprot_by_genesymbol (bool)
Attempt to translate TrEMBL IDs to SwissProt by translating to gene symbols and then to SwissProt.
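The interplay of cleanup_period and lifetime can be illustrated with a small, self-contained sketch. This is not pypath's implementation: the class ExpiringTables, its methods and the injectable clock are hypothetical names introduced only to show the idea that each access refreshes a table's timestamp and a periodic cleanup drops tables that were idle longer than lifetime.

```python
import time


class ExpiringTables:
    """Toy table cache (hypothetical, not pypath's Mapper)."""

    def __init__(self, lifetime=300, clock=time.time):
        self.lifetime = lifetime
        self.clock = clock        # injectable so the demo needs no sleeping
        self.tables = {}          # key -> table data
        self.last_used = {}       # key -> timestamp of the last access

    def add(self, key, table):
        self.tables[key] = table
        self.last_used[key] = self.clock()

    def get(self, key):
        table = self.tables[key]
        # Accessing a table resets its expiry timer.
        self.last_used[key] = self.clock()
        return table

    def remove_expired(self):
        # Drop every table that was idle for longer than `lifetime`.
        now = self.clock()
        expired = [
            key for key, used in self.last_used.items()
            if now - used > self.lifetime
        ]
        for key in expired:
            del self.tables[key]
            del self.last_used[key]
        return expired


# Demo with a fake clock instead of real waiting.
fake_now = [0.0]
cache = ExpiringTables(lifetime=300, clock=lambda: fake_now[0])
cache.add(('genesymbol', 'uniprot', 9606), {'TP53': {'P04637'}})
cache.add(('entrez', 'uniprot', 9606), {'7157': {'P04637'}})

fake_now[0] = 200.0
cache.get(('genesymbol', 'uniprot', 9606))   # refreshes this table only

fake_now[0] = 400.0
removed = cache.remove_expired()             # only the idle table is dropped
```

In this sketch the cleanup is triggered by hand; in the class documented here, cleanup_period controls how often such a check happens.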
Methods
- __init__([ncbi_tax_id, cleanup_period, ...])
- chain_map(name, id_type, by_id_type, ...[, ...]): Translate IDs which can not be directly translated in two steps: from id_type to by_id_type and from there to target_id_type.
- create_reverse(key): Creates a mapping table with id_type and target_id_type (i.e. direction of the ID translation) swapped.
- deleted_uniprot_genesymbol(uniprot)
- get_table_key(id_type, target_id_type[, ...]): Returns a tuple unambiguously identifying a mapping table.
- guess_type(name[, entity_type]): From a string, tries to guess the ID type and optionally the entity type.
- has_mapping_table(id_type, target_id_type[, ...]): Tells if a mapping table is loaded.
- id_from_label(label[, label_id_type, ...])
- id_from_label0(label[, label_id_type, ...])
- id_types(): A list of all identifier types that can be handled by any of the resources.
- identifier(label[, ncbi_tax_id, id_type, ...]): For a label returns the corresponding primary identifier.
- identifier0(label[, ncbi_tax_id, id_type, ...])
- label(name[, entity_type, id_type, ncbi_tax_id]): For any kind of entity, either protein, miRNA or protein complex, returns the preferred human readable label.
- load_genesymbol5([ncbi_tax_id]): Creates a Gene Symbol to UniProt mapping table with the first 5 characters of each Gene Symbol.
- load_mapping(resource, **kwargs): Loads a single mapping table based on input definition in resource.
- load_uniprot_static(keys[, ncbi_tax_id]): Loads mapping tables from the huge static mapping file from UniProt.
- map_name(name[, id_type, target_id_type, ...]): Translates one instance of one ID type to a different one.
- map_name0(name[, id_type, target_id_type, ...]): Translates the name and returns only one of the resulting IDs.
- map_names(names[, id_type, target_id_type, ...]): Same as map_name but translates multiple IDs at once.
- mapping_tables(): List of mapping tables available to load.
- only_uniprot_ac(uniprots): For one or more strings returns only those which match the format of UniProt accession numbers.
- only_valid_uniprots(uniprots[, ncbi_tax_id])
- other_organism_uniprot(uniprot[, ncbi_tax_id]): Tells if uniprot is a UniProt ID from some other organism than ncbi_tax_id.
- primary_uniprot(uniprots[, ncbi_tax_id]): For an iterable of UniProt IDs returns a set with the secondary IDs changed to the corresponding primary IDs.
- reload(): Reload the class from the module level.
- Removes tables last used a longer time ago than their lifetime.
- remove_key(key): Removes the table identified by key, if it exists.
- remove_table(id_type, target_id_type, ...): Removes the table defined by the ID types and organism.
- reverse_key(key): For a mapping table key returns a new key with the identifiers reversed.
- reverse_mapping(mapping_table): Creates an opposite direction MappingTable by swapping the dictionary inside an existing MappingTable object.
- swissprots(uniprots[, ncbi_tax_id]): Creates a dict translating a set of potentially secondary and non-reviewed UniProt IDs to primary SwissProt IDs (whenever possible).
- translate_deleted_uniprot_by_genesymbol(uniprot[, ncbi_tax_id]): Due to potentially ambiguous translation always returns set.
- translate_deleted_uniprots_by_genesymbol(...)
- translation_df(id_type, target_id_type[, ...]): Translation table as a data frame.
- translation_dict(id_type, target_id_type[, ...]): Translation table as a dict.
- trembl_swissprot(uniprots[, ncbi_tax_id]): For an iterable of TrEMBL and SwissProt IDs, returns a set with only SwissProt, mapping from TrEMBL to gene symbols, and then back to SwissProt.
- uniprot_cleanup(uniprots[, ncbi_tax_id]): We use this function as a standard callback when the target ID type is UniProt.
- valid_uniprot(uniprot[, ncbi_tax_id]): If the UniProt ID uniprot exists in the proteome of the organism ncbi_tax_id, returns the ID, otherwise returns None.
- which_table(id_type, target_id_type[, load, ...]): Returns the table which is suitable to convert an ID of id_type to target_id_type.
Attributes
default_label_types
default_name_types
label_type_to_id_type
- chain_map(name, id_type, by_id_type, target_id_type, ncbi_tax_id=None, **kwargs)
Translate IDs which can not be directly translated in two steps: from id_type to by_id_type and from there to target_id_type.
- Args
name (str): The original name to be converted.
id_type (str): The type of the name.
by_id_type (str): The intermediate name type.
target_id_type (str): The name type to translate to; more or less the same values are available as for id_type.
ncbi_tax_id (int): The NCBI Taxonomy identifier of the organism.
kwargs: Passed to map_name.
- Returns
Set of IDs of type target_id_type.
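The two-step idea can be sketched with plain dictionaries. The function chain_map_sketch and the two tiny tables below are hypothetical and only illustrate the principle; the real method loads its tables on demand and delegates each step to map_name.

```python
def chain_map_sketch(name, first_step, second_step):
    """Translate `name` through an intermediate ID type.

    Both tables are assumed to be dicts mapping one ID to a set of IDs.
    """
    result = set()
    # Step 1: original ID type -> intermediate ID type.
    for intermediate_id in first_step.get(name, set()):
        # Step 2: intermediate ID type -> target ID type.
        result |= second_step.get(intermediate_id, set())
    return result


# Illustrative tables: gene symbol -> UniProt -> Entrez Gene ID.
genesymbol_to_uniprot = {'TP53': {'P04637'}}
uniprot_to_entrez = {'P04637': {'7157'}}

targets = chain_map_sketch('TP53', genesymbol_to_uniprot, uniprot_to_entrez)
```

An ID that is missing from either step simply yields an empty set, which mirrors the set-valued return described above.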
- create_reverse(key)
Creates a mapping table with id_type and target_id_type (i.e. direction of the ID translation) swapped.
- get_table_key(id_type, target_id_type, ncbi_tax_id=None)
Returns a tuple unambiguously identifying a mapping table.
- guess_type(name, entity_type=None)
From a string, tries to guess the ID type and optionally the entity type. Returns a tuple of strings: ID type and entity type.
- has_mapping_table(id_type, target_id_type, ncbi_tax_id=None)
Tells if a mapping table is loaded. If it’s loaded, it resets the expiry timer so the table remains loaded.
- Returns
(bool): True if the mapping table is loaded.
- classmethod id_types()
A list of all identifier types that can be handled by any of the resources.
- Returns
(list): A list of tuples with the identifier type labels used in pypath and in the original resource. If the latter is None, typically the ID type has no name in the original resource.
- identifier(label: str | Iterable[str], ncbi_tax_id: int | None = None, id_type: str | None = None, entity_type: Literal['drug', 'lncrna', 'mirna', 'protein', 'small_molecule'] | None = None) -> Set[str] | List[Set[str]]
For a label returns the corresponding primary identifier. The type of default identifiers is determined by the settings module. Note that this kind of translation is not always unambiguous: one gene symbol might correspond to multiple UniProt IDs.
- label(name, entity_type=None, id_type=None, ncbi_tax_id=None)
For any kind of entity, either protein, miRNA or protein complex, returns the preferred human readable label. For proteins this means Gene Symbols, for miRNAs miRNA names, for complexes a series of Gene Symbols.
- load_genesymbol5(ncbi_tax_id=None)
Creates a Gene Symbol to UniProt mapping table with the first 5 characters of each Gene Symbol.
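A table keyed by the first five characters can be derived from an ordinary Gene Symbol to UniProt table roughly as follows. The helper genesymbol5_table is a hypothetical name and the input dict is a tiny illustrative sample; this is a sketch of the idea, not the method's actual code.

```python
from collections import defaultdict


def genesymbol5_table(genesymbol_to_uniprot):
    """Build a prefix table: first 5 characters of a symbol -> UniProt IDs."""
    table = defaultdict(set)
    for symbol, uniprots in genesymbol_to_uniprot.items():
        # Symbols shorter than 5 characters are kept whole by the slice.
        table[symbol[:5]] |= uniprots
    return dict(table)


full_table = {
    'CSNK2A1': {'P68400'},
    'CSNK2A2': {'P19784'},
    'TP53': {'P04637'},
}
prefix_table = genesymbol5_table(full_table)
```

Truncation makes the table more ambiguous: both CSNK2A1 and CSNK2A2 end up under the same key, so a lookup by prefix can return several UniProt IDs.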
- load_mapping(resource, **kwargs)
Loads a single mapping table based on input definition in resource. **kwargs passed to MapReader.
- load_uniprot_static(keys, ncbi_tax_id=None)
Loads mapping tables from the huge static mapping file from UniProt. It takes long to download and process, and also requires more memory. This is the last thing we try if everything else has failed.
- map_name(name, id_type=None, target_id_type=None, ncbi_tax_id=None, strict=False, expand_complexes=True, uniprot_cleanup=True)
Translates one instance of one ID type to a different one. Returns a set of IDs of the target ID type.
This function should be used to convert individual IDs. It takes care of everything, and ideally you don’t need to think about the details.
How it works: it looks up dictionaries between the original and target ID type; if none is found, it attempts to load one from the predefined inputs. If the original name is genesymbol, it first looks among the preferred gene names from UniProt; if not found, it tries the alternative gene names. If the gene symbol still cannot be found and strict = False, as a last attempt only the first 5 characters of the gene symbol are matched. If the target name type is uniprot, it converts all the ACs to primary. Then, for the TrEMBL IDs it looks up the preferred gene names and finds SwissProt IDs with the same preferred gene name.
- Args
name (str): The original name to be converted.
id_type (str): The type of the name. Available by default:
- genesymbol (gene name)
- entrez (Entrez Gene ID [#])
- refseqp (NCBI RefSeq Protein ID [NP_*|XP_*])
- ensp (Ensembl protein ID [ENSP*])
- enst (Ensembl transcript ID [ENST*])
- ensg (Ensembl genomic DNA ID [ENSG*])
- hgnc (HGNC ID [HGNC:#])
- gi (GI number [#])
- embl (DDBJ/EMBL/GenBank CDS accession)
- embl_id (DDBJ/EMBL/GenBank accession)
And many more, see the code of pypath.internals.input_formats.
target_id_type (str): The name type to translate to; more or less the same values are available as for id_type.
ncbi_tax_id (int): NCBI Taxonomy ID of the organism.
strict (bool): If False, then in case a Gene Symbol can not be translated, try to add number “1” to the end, or try to match only its first five characters. This fallback is rarely needed, but it makes it possible to translate some non-standard gene names typically found in old, unmaintained resources.
expand_complexes (bool): When encountering complexes, translate the IDs of their components and return a set of IDs. The alternative behaviour is to return the Complex objects.
uniprot_cleanup (bool): When the target_id_type is UniProt ID, call the uniprot_cleanup function at the end.
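The lookup order described for gene symbols (preferred names, then alternative names, then a five-character prefix unless strict is set) can be sketched as below. The function map_genesymbol_sketch and its three table arguments are hypothetical; the sketch leaves out table loading, the "add 1" attempt and the UniProt cleanup step.

```python
def map_genesymbol_sketch(name, preferred, synonyms, first5, strict=False):
    """Fallback cascade for one gene symbol; all tables are dicts of sets."""
    # 1. Preferred gene names, 2. alternative gene names.
    for table in (preferred, synonyms):
        if name in table:
            return set(table[name])
    # 3. Last attempt: match only the first 5 characters,
    #    skipped when strict matching is requested.
    if not strict:
        return set(first5.get(name[:5], set()))
    return set()


preferred = {'TP53': {'P04637'}}
synonyms = {'P53': {'P04637'}}
first5 = {'CSNK2': {'P68400', 'P19784'}}

by_preferred = map_genesymbol_sketch('TP53', preferred, synonyms, first5)
by_synonym = map_genesymbol_sketch('P53', preferred, synonyms, first5)
by_prefix = map_genesymbol_sketch('CSNK2A1x', preferred, synonyms, first5)
strict_miss = map_genesymbol_sketch(
    'CSNK2A1x', preferred, synonyms, first5, strict=True
)
```

The prefix step trades precision for recall, which is why it returns two candidates here and why it can be switched off.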
- map_name0(name, id_type=None, target_id_type=None, ncbi_tax_id=None, strict=False, expand_complexes=None, uniprot_cleanup=None)
Translates the name and returns only one of the resulting IDs. This means that in case of ambiguous ID translation, a random one of them will be picked and returned. Recommended only if the translation between the given ID types is mostly unambiguous and the loss of information can be ignored. See more details at map_name.
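Reducing a set of translated IDs to a single value could look like the sketch below. The helper one_of is a hypothetical name; it only shows why the choice is arbitrary when the set has more than one element, and it does not claim to be how map_name0 is written.

```python
def one_of(ids):
    """Return an arbitrary element of a set, or None if the set is empty."""
    # Sets have no defined order, so which element comes first is arbitrary.
    return next(iter(ids), None)


unambiguous = one_of({'P04637'})
ambiguous = one_of({'P68400', 'P19784'})
nothing = one_of(set())
```

With an ambiguous translation the caller cannot tell which candidate was dropped, so the set-returning variant is the safer default.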
- map_names(names, id_type=None, target_id_type=None, ncbi_tax_id=None, strict=False, expand_complexes=True, uniprot_cleanup=True)
Same as map_name but translates multiple IDs at once. These two functions could be seamlessly implemented as one, still I created separate functions to always make it explicit if a set of translated IDs comes from multiple original IDs.
- Args
names (iterable of str): The original names to be converted.
id_type (str): The type of the names. Available by default:
- genesymbol (gene name)
- entrez (Entrez Gene ID [#])
- refseqp (NCBI RefSeq Protein ID [NP_*|XP_*])
- ensp (Ensembl protein ID [ENSP*])
- enst (Ensembl transcript ID [ENST*])
- ensg (Ensembl genomic DNA ID [ENSG*])
- hgnc (HGNC ID [HGNC:#])
- gi (GI number [#])
- embl (DDBJ/EMBL/GenBank CDS accession)
- embl_id (DDBJ/EMBL/GenBank accession)
And many more, see the code of pypath.internals.input_formats.
target_id_type (str): The name type to translate to; more or less the same values are available as for id_type.
ncbi_tax_id (int): NCBI Taxonomy ID of the organism.
strict (bool): If False, then in case a Gene Symbol can not be translated, try to add number “1” to the end, or try to match only its first five characters. This fallback is rarely needed, but it makes it possible to translate some non-standard gene names typically found in old, unmaintained resources.
expand_complexes (bool): When encountering complexes, translate the IDs of their components and return a set of IDs. The alternative behaviour is to return the Complex objects.
uniprot_cleanup (bool): When the target_id_type is UniProt ID, call the uniprot_cleanup function at the end.
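Translating several IDs at once amounts to taking the union of the per-ID results. The sketch below uses a plain dict in place of a loaded mapping table; map_names_sketch is a hypothetical helper, not the method itself.

```python
def map_names_sketch(names, table):
    """Union of the translations of all `names`; `table` is a dict of sets."""
    return set().union(*(table.get(name, set()) for name in names))


table = {'TP53': {'P04637'}, 'IKBKG': {'Q9Y6K9'}}
names = ['TP53', 'IKBKG', 'UNKNOWN']

merged = map_names_sketch(names, table)

# Keeping the per-name results apart preserves which input gave which output.
per_name = {name: table.get(name, set()) for name in names}
```

The merged set no longer records which original ID produced which result, which is the reason given above for keeping the single-ID and multi-ID functions separate.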
- static mapping_tables()
List of mapping tables available to load.
- Returns
(list): A list of tuples, each representing an ID translation table, with the ID types, the data source and the loader class.
- only_uniprot_ac(uniprots)
For one or more strings returns only those which match the format of UniProt accession numbers. The format is defined here: https://www.uniprot.org/help/accession_numbers
If a string is provided, returns a string or None. If an iterable is provided, returns a set (potentially empty if none of the strings are valid).
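The format check and the two return shapes can be sketched with the regular expression published on the UniProt help page linked above. The function only_uniprot_ac_sketch is a hypothetical stand-in written for illustration; the method's own pattern and edge-case handling may differ.

```python
import re

# Accession number pattern as given on the UniProt help page.
UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def only_uniprot_ac_sketch(uniprots):
    """Keep only strings that look like UniProt accession numbers."""
    if isinstance(uniprots, str):
        # Single string in, string or None out.
        return uniprots if UNIPROT_AC.fullmatch(uniprots) else None
    # Iterable in, set out (possibly empty).
    return {u for u in uniprots if UNIPROT_AC.fullmatch(u)}


single_ok = only_uniprot_ac_sketch('P04637')
single_bad = only_uniprot_ac_sketch('TP53')
several = only_uniprot_ac_sketch(['P04637', 'TP53', 'A0A024R161'])
```

A match only says that the string has the right shape; it does not tell whether the accession exists, is obsolete, or belongs to the organism of interest.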
- other_organism_uniprot(uniprot, ncbi_tax_id=None)
Tells if uniprot is a UniProt ID from some other organism than ncbi_tax_id.
- primary_uniprot(uniprots, ncbi_tax_id=None)
For an iterable of UniProt IDs returns a set with the secondary IDs changed to the corresponding primary IDs. Anything that is not a secondary UniProt ID is left intact.
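The behaviour described here, replacing secondary IDs and passing everything else through, can be sketched with a plain lookup table. Both primary_uniprot_sketch and the secondary-to-primary pair in the example are assumptions made for illustration; the real method uses a mapping table loaded for the given organism.

```python
def primary_uniprot_sketch(uniprots, secondary_to_primary):
    """Replace secondary IDs by their primary IDs, keep the rest unchanged."""
    result = set()
    for uniprot in uniprots:
        # IDs absent from the table are not secondary: leave them intact.
        result |= secondary_to_primary.get(uniprot, {uniprot})
    return result


# Illustrative table: one secondary accession pointing to its primary one.
secondary_to_primary = {'Q15086': {'P04637'}}

primaries = primary_uniprot_sketch(
    ['Q15086', 'Q9Y6K9', 'P04637'],
    secondary_to_primary,
)
```

Because the result is a set, a secondary ID and its primary counterpart collapse into one element.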
- remove_table(id_type, target_id_type, ncbi_tax_id)
Removes the table defined by the ID types and organism.
- reverse_key(key)
For a mapping table key returns a new key with the identifiers reversed.
- Args
key (tuple): A mapping table key.
- Returns
A tuple representing a mapping table key, identifiers swapped.
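Swapping the identifiers of a key can be sketched as follows. The three-field layout of TableKey is an assumption based on the arguments of get_table_key (id_type, target_id_type, ncbi_tax_id); the actual tuple returned by the library may contain different or additional fields.

```python
from typing import NamedTuple


class TableKey(NamedTuple):
    """Hypothetical key layout, assumed for this sketch only."""
    id_type: str
    target_id_type: str
    ncbi_tax_id: int


def reverse_key_sketch(key):
    """Return a new key with source and target ID types swapped."""
    return key._replace(
        id_type=key.target_id_type,
        target_id_type=key.id_type,
    )


forward_key = TableKey('genesymbol', 'uniprot', 9606)
backward_key = reverse_key_sketch(forward_key)
```

Tuples are hashable, so either key can be used directly as a dictionary key for a table cache, and reversing twice gives back the original key.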
- static reverse_mapping(mapping_table)
Creates an opposite direction MappingTable by swapping the dictionary inside an existing MappingTable object.
- Args
mapping_table (MappingTable): A MappingTable object.
- Returns
A new MappingTable object.
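Swapping a translation dictionary means that every target ID becomes a key pointing back to all of its source IDs. The sketch below works on a plain dict of sets and uses the hypothetical helper reverse_dict_of_sets; it does not construct a MappingTable object.

```python
from collections import defaultdict


def reverse_dict_of_sets(forward):
    """Invert a dict of sets: target ID -> set of source IDs."""
    backward = defaultdict(set)
    for source_id, target_ids in forward.items():
        for target_id in target_ids:
            backward[target_id].add(source_id)
    return dict(backward)


genesymbol_to_uniprot = {
    'IKBKG': {'Q9Y6K9'},
    'NEMO': {'Q9Y6K9'},
    'TP53': {'P04637'},
}
uniprot_to_genesymbol = reverse_dict_of_sets(genesymbol_to_uniprot)
```

Many-to-one relations in the forward direction become one-to-many in the reverse direction, as with the two symbols that share one UniProt ID here.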
- swissprots(uniprots, ncbi_tax_id=None)
Creates a dict translating a set of potentially secondary and non-reviewed UniProt IDs to primary SwissProt IDs (whenever possible).
- translate_deleted_uniprot_by_genesymbol(uniprot, ncbi_tax_id=None)
Because the translation is potentially ambiguous, always returns a set.
- translation_df(id_type: str, target_id_type: str, ncbi_tax_id: int | None = None) -> DataFrame | None
Translation table as a data frame.
- translation_dict(id_type: str, target_id_type: str, ncbi_tax_id: int | None = None) -> MappingTable | None
Translation table as a dict.
- trembl_swissprot(uniprots, ncbi_tax_id=None)
For an iterable of TrEMBL and SwissProt IDs, returns a set with only SwissProt, mapping from TrEMBL to gene symbols, and then back to SwissProt. If this kind of translation is not successful for any of the IDs it will be kept in the result, no matter if it’s not a SwissProt ID. If the
- uniprot_cleanup(uniprots, ncbi_tax_id=None)
We use this function as a standard callback when the target ID type is UniProt. It checks if the format of the IDs is correct and if they are part of the organism proteome, and attempts to translate secondary and deleted IDs to their primary, recent counterparts.
- Args
uniprots (str, set): One or more UniProt IDs.
ncbi_tax_id (int): The NCBI Taxonomy identifier of the organism.
- Returns
Set of checked and potentially translated UniProt IDs. Elements which do not fit the criteria will be discarded.