The pypath book§
Contents
1 Introduction
2 Build, load and save databases
2.1 The OmniPath app
2.2 Built-in database definitions
2.3 Networks
2.3.1 Strictly literature curated network
2.3.2 The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions
2.3.3 Transcriptional regulation network from DoRothEA and other resources
2.3.4 Literature curated miRNA post-transcriptional regulation network
2.3.5 Transcriptional regulation of miRNA
2.3.6 lncRNA-mRNA interactions
2.3.7 Small molecule-protein interactions
2.4 Enzyme-substrate relationships
2.5 Protein complexes
2.6 Annotations
2.7 Inter-cellular communication roles
3 Data directly from the original resources
4 Interesting resources
4.1 RaMP
4.1.1 TL;DR
4.2 HMDB (Human Metabolome Database)
4.2.1 Direct access to HMDB data
4.2.2 Higher level access to HMDB data
4.2.3 ID translation with HMDB
4.3 NCBI E-Utils
5 Download management
5.1 Cache management and customization
5.2 Download failures
5.2.1 Corrupted cache content
5.2.2 Network communication issues: look into the curl debug log
5.2.3 Timeouts
5.2.4 Access and inspect the Curl object
5.2.5 Is it failing only for you?
5.2.6 Read the log
5.2.7 TLS (SSL, HTTPS) errors
6 Resources
6.1 Licenses
6.1.1 Example: build a network for commercial use
6.2 Resource information
6.3 Resource definitions for a certain database or dataset
7 Building networks
7.1 Which network datasets are pre-defined in pypath?
7.2 The Network object
7.3 Network in pandas.DataFrame
7.4 Self interactions (loop edges) in the network
7.5 Molecular complexes in the network
8 Translating identifiers
8.1 Pre-defined ID translation tables
8.2 Direct access to ID translation tables
9 Orthology translation
9.1 Orthology translation tables as dictionaries
9.2 Orthology translation data frames
10 Taxonomy
10.1 Translating to NCBI Taxonomy, scientific names and common names
10.2 Organism from UniProt ID
11 UniProt
11.1 The UniProt input module
11.1.1 All UniProt IDs for one organism
11.1.2 UniProt ID format validation
11.1.3 UniProt ID validation
11.1.4 Single UniProt protein datasheet
11.1.5 History of UniProt records
11.1.6 UniProt REST API
11.1.7 Processed UniProt annotations
11.2 The UniProt utils module
11.2.1 Datasheets
11.2.2 Tables
11.3 Sanitizing UniProt IDs
12 Enzyme-substrate interactions
12.1 Enzyme-substrate objects
12.2 Enzyme-substrate data frame
13 Protein sequences
14 Annotations
14.1 Load a single annotation resource
14.2 Load the full annotations database by the database manager
14.3 Load only selected annotations
14.4 Data frames of annotations
15 Inter-cellular signaling roles
15.1 Build an intercellular communication network
15.2 Quantitative overview of intercell annotations
15.3 Intercell database as data frame
15.4 Browse intercell categories
16 Gene Ontology
17 Protein complexes
17.1 Protein complex objects
17.2 Protein complex data frame
18 Saving datasets as pickles
19 Log messages and sessions
19.1 Basic info about the session
19.2 Read the log file
19.3 Logging to the console
19.4 Disable logging
19.5 Write to the log
19.5.1 Sending a single message
19.5.2 Connect a module or class to the pypath logger
20 BEL export
21 CellPhoneDB export
22 The legacy igraph-based network object
22.1 I just want a network quickly and play around with pypath
22.2 How do I build networks from any data with pypath?
22.2.1 Defining input formats
22.2.2 Creating PyPath object and loading the 2 test files
22.3 Structure of the legacy network object
22.3.1 Directions and signs
22.3.2 Accessing nodes in the network
22.4 Querying relationships with or without causality
22.5 Accessing edges by identifiers
22.6 Literature references
22.7 Plotting the network with igraph
Introduction§
OmniPath consists of five main database segments: network (interactions), enzyme-substrate interactions (enz_sub or ptms), protein complexes (complexes), molecular entity annotations (annotations) and intercellular communication roles (intercell). All of these are accessible through the web service at https://omnipathdb.org/ and the R/Bioconductor package OmnipathR; the network and some of the annotations are also available in the Cytoscape app. However, only pypath can build these databases directly from the original sources, with various options for customization, and it provides a rich and versatile API for each database, with all the flexibility of Python. This book attempts to be a guided tour of pypath, but almost all objects, modules and APIs presented here have many more methods, options and features than we can cover. If you think something might be useful for you, don’t hesitate to ask us on GitHub.
This document has been run with the following pypath version:
[1]:
import pypath
pypath.__version__
executed in 0ms, finished 16:49:47 2023-03-09
[1]:
'0.14.36'
Build, load and save databases§
We provide a high level interface in the module pypath.omnipath.app. This is the easiest way to build, manage and access the OmniPath databases, hence this is what we present in the Quick start section. In further sections we show the lower level modules in more detail.
The OmniPath app§
pypath.omnipath is an application which contains a database manager at omnipath.db. This manager is empty by default. It builds and loads the databases on demand.
[2]:
from pypath import omnipath
omnipath.db
executed in 1.34s, finished 14:11:27 2022-12-03
[2]:
<pypath.omnipath.app.DatabaseManager at 0x602fb851cd90>
Built-in database definitions§
The databases presented below are pre-defined in pypath. You can also list them by:
[3]:
from pypath import omnipath
omnipath.db.datasets
executed in 0ms, finished 14:11:32 2022-12-03
[3]:
['omnipath',
'curated',
'complex',
'annotations',
'intercell',
'tf_target',
'dorothea',
'small_molecule',
'tf_mirna',
'mirna_mrna',
'lncrna_mrna',
'enz_sub']
Networks§
OmniPath offers multiple built-in network datasets: the OmniPath PPI network, the stricter literature curated PPI network, the special ligand-receptor PPI network and various other PPI datasets, the transcriptional regulation network from DoRothEA and other resources, the miRNA post-transcriptional regulation network, and the transcriptional regulation network of miRNAs.
Strictly literature curated network§
[4]:
from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu
executed in 16.83s, finished 13:17:13 2022-12-02
[4]:
<Network: 7980 nodes, 35551 interactions>
The OmniPath network with extra activity flow, enzyme-substrate and ligand-receptor interactions§
[5]:
from pypath import omnipath
op = omnipath.db.get_db('omnipath')
op
executed in 1m, finished 13:18:55 2022-12-02
[5]:
<Network: 18558 nodes, 94358 interactions>
Transcriptional regulation network from DoRothEA and other resources§
Note: with the default settings, DoRothEA confidence levels A-D and all original resources are loaded. To load only DoRothEA, use the key "dorothea" instead of "tf_target".
[6]:
from pypath import omnipath
tft = omnipath.db.get_db('tf_target')
tft
executed in 2m 12.72s, finished 13:21:54 2022-12-02
[6]:
<Network: 18986 nodes, 326708 interactions>
Literature curated miRNA post-transcriptional regulation network§
[1]:
from pypath import omnipath
mi = omnipath.db.get_db('mirna_mrna')
mi
executed in 2.28s, finished 13:31:55 2022-12-02
[1]:
<Network: 1264 nodes, 3288 interactions>
Transcriptional regulation of miRNA§
[4]:
from pypath import omnipath
tmi = omnipath.db.get_db('tf_mirna')
tmi
executed in 0ms, finished 13:32:41 2022-12-02
[4]:
<Network: 1032 nodes, 4960 interactions>
lncRNA-mRNA interactions§
[6]:
from pypath import omnipath
lnc = omnipath.db.get_db('lncrna_mrna')
lnc
executed in 0ms, finished 13:33:03 2022-12-02
[6]:
<Network: 243 nodes, 217 interactions>
Small molecule-protein interactions§
These interactions are either ligand-receptor connections, enzyme inhibitions, allosteric regulations or enzyme-metabolite interactions. Currently it is a small, experimental dataset, but it will be greatly extended in the future.
[1]:
from pypath import omnipath
smol = omnipath.db.get_db('small_molecule')
smol
executed in 7.94s, finished 13:57:17 2022-12-02
[1]:
<Network: 1980 nodes, 3147 interactions>
Enzyme-substrate relationships§
[7]:
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')
es
executed in 6.14s, finished 13:33:26 2022-12-02
[7]:
<Enzyme-substrate database: 41426 relationships>
Protein complexes§
[8]:
from pypath import omnipath
co = omnipath.db.get_db('complex')
co
executed in 0ms, finished 13:33:31 2022-12-02
[8]:
<Complex database: 28173 complexes>
Annotations§
The annotations database is huge: building or even loading it takes a long time and requires a considerable amount of memory.
[9]:
from pypath import omnipath
an = omnipath.db.get_db('annotations')
an
executed in 2m 43.60s, finished 13:36:28 2022-12-02
[9]:
<Annotation database: 5490653 records about 50872 entities from 68 resources>
Inter-cellular communication roles§
This database is quick to build, but it requires the annotations database, which is really heavy.
[10]:
from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic
executed in 23.34s, finished 13:37:12 2022-12-02
[10]:
<Intercell annotations: 301527 records about 48570 entities>
Data directly from the original resources§
The pypath.inputs module contains clients for more than 150 molecular biology and biomedical
resources, and overall almost 500 functions that download data directly from these resources.
Maintaining such a large number of clients is troublesome, hence at any time some of them are
broken; you can check them in our daily status report. Each submodule of pypath.inputs is named
after its corresponding resource, all lowercase, e.g. "depod" (DEPOD) or "cytosig" (CytoSig).
Within these modules, each function name starts with the name of the resource and ends with the
kind of data it retrieves. For example, pypath.inputs.signor.signor_interactions downloads
interactions from SIGNOR. Functions with the postfixes "_interactions", "_enz_sub", "_complexes"
and "_annotations" retrieve records intended for the respective databases. However, the records at
this stage are not fully processed yet. Some functions have different postfixes, e.g. "_raw" means
the data is close to the format provided by the resource itself, and "_mapping" means it is
intended for a translation table. The purpose of the input functions is to 1) handle the download;
2) read the raw data; 3) extract the relevant parts; 4) do the resource specific part of the
processing, i.e. bring the data to a state where it is suitable for further processing by the
generic database classes. The output of these functions is not standardized in any way, though you
may observe repeated patterns. The input functions typically return lists or dictionaries. These
are designed ad hoc, with the aim of selecting the relevant fields and providing straightforward,
accessible Python data structures for processing within or outside of pypath.
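The naming convention can be illustrated with a short, self-contained sketch. The helper `split_input_function` below is hypothetical and not part of pypath; it only assumes that a function lives in a submodule named after the resource and that its name follows the `<resource>_<kind>` pattern described above. The function names in the example are the SIGNOR functions discussed in this section.

```python
def split_input_function(module_name, function_name):
    """
    Split an input function name into (resource, kind), assuming the
    `<resource>_<kind>` convention. Hypothetical helper, not in pypath.
    """
    prefix = module_name + '_'
    if not function_name.startswith(prefix):
        raise ValueError(
            '%s does not follow the naming convention' % function_name
        )
    return module_name, function_name[len(prefix):]


# Function names of the pypath.inputs.signor module mentioned in this section
examples = [
    ('signor', 'signor_interactions'),
    ('signor', 'signor_enzyme_substrate'),
    ('signor', 'signor_complexes'),
    ('signor', 'signor_pathway_annotations'),
]

for module_name, function_name in examples:
    resource, kind = split_input_function(module_name, function_name)
    print('%s -> resource: %s, data: %s' % (function_name, resource, kind))
```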
We use SIGNOR as an example because this resource provides data for almost all OmniPath databases.
The signor_complexes function returns a dict of pypath.internals.intera.Complex objects, ready to
be added to the OmniPath complexes database (built by pypath.core.complex.ComplexAggregator).
[2]:
from pypath.inputs import signor
signor.signor_complexes()
executed in 0ms, finished 15:24:43 2022-12-03
[2]:
{'COMPLEX:P23511_P25208_Q13952': Complex NFY: COMPLEX:P23511_P25208_Q13952,
'COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4': Complex mTORC2: COMPLEX:P68104_P85299_Q6R327_Q8TB45_Q9BVC4,
'COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4': Complex mTORC1: COMPLEX:P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4,
'COMPLEX:P63208_Q13616_Q9Y297': Complex SCF-betaTRCP: COMPLEX:P63208_Q13616_Q9Y297,
'COMPLEX:Q09472_Q92793': Complex CBP/p300: COMPLEX:Q09472_Q92793,
'COMPLEX:Q09472_Q92793_Q92831': Complex P300/PCAF: COMPLEX:Q09472_Q92793_Q92831,
'COMPLEX:Q13485_Q15796': Complex SMAD2/SMAD4: COMPLEX:Q13485_Q15796,
'COMPLEX:P84022_Q13485': Complex SMAD3/SMAD4: COMPLEX:P84022_Q13485,
'COMPLEX:P05412_Q13485': Complex SMAD4/JUN: COMPLEX:P05412_Q13485,
'COMPLEX:Q15796_Q9HAU4': Complex SMAD2/SMURF2: COMPLEX:Q15796_Q9HAU4,
'COMPLEX:O15105_Q01094_Q13547': Complex SMAD7/HDAC1/E2F-1: COMPLEX:O15105_Q01094_Q13547,
'COMPLEX:P19838_Q04206': Complex NfKb-p65/p50: COMPLEX:P19838_Q04206,
'COMPLEX:O14920_O15111': Complex IK
The signor_interactions function returns a list of named tuples that represent the most important
properties of SIGNOR interaction records in a human readable way, ready to be processed by the
pypath.core.network.Network object.
[5]:
signor.signor_interactions()[:10]
executed in 0ms, finished 14:11:52 2022-12-03
[5]:
[SignorInteraction(source='O15530', target='O15530', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='unknown', mechanism='phosphorylation', ncbi_tax_id='9606', pubmeds='10455013', direct=True, ptm_type='phosphorylation', ptm_residue='Ser396', ptm_motif='SSSSSSHsLSASDTG'),
SignorInteraction(source='Q9NQ66', target='CHEBI:18035', source_isoform=None, target_isoform=None, source_type='protein', target_type='smallmolecule', effect='up-regulates quantity', mechanism='', ncbi_tax_id='-1', pubmeds='23880553', direct=True, ptm_type='', ptm_residue='Small molecule catalysis', ptm_motif=''),
SignorInteraction(source='P62136', target='O15169', source_isoform=None, target_isoform=None, source_type='protein', target_type='protein', effect='down-regulates activity', mechanism='dephosphorylation', ncbi_tax_id='9606', pubmeds='17318175', direct=True, ptm_type='dephosphorylation', ptm_residue='Ser77', ptm_motif='YEPEGSAsPTPPYLK'),
SignorInteraction(sou
Note that the records above also contain enzyme-PTM data, hence the signor.signor_enzyme_substrate
function only converts them to an intermediate format that is easier to process for
pypath.core.enz_sub.EnzymeSubstrateAggregator.
[4]:
signor.signor_enzyme_substrate()[:2]
executed in 0ms, finished 13:58:20 2022-12-02
[4]:
[{'typ': 'phosphorylation',
'resnum': 396,
'instance': 'SSSSSSHSLSASDTG',
'substrate': 'O15530',
'start': 389,
'end': 403,
'kinase': 'O15530',
'resaa': 'S',
'motif': 'SSSSSSHSLSASDTG',
'enzyme_isoform': None,
'substrate_isoform': None,
'references': {'10455013'}},
{'typ': 'dephosphorylation',
'resnum': 77,
'instance': 'YEPEGSASPTPPYLK',
'substrate': 'O15169',
'start': 70,
'end': 84,
'kinase': 'P62136',
'resaa': 'S',
'motif': 'YEPEGSASPTPPYLK',
'enzyme_isoform': None,
'substrate_isoform': None,
'references': {'17318175'}}]
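As a small sanity check on this intermediate format, the sketch below tests whether the modified residue is consistent with the motif window. The two records are copied from the output above (only the fields needed here). The interpretation of `start` and `end` as the sequence coordinates of the window is an assumption inferred from these two records, and `site_matches_motif` is a hypothetical helper, not a pypath function.

```python
# Two records copied from the output above, reduced to the relevant fields
records = [
    {
        'typ': 'phosphorylation',
        'resnum': 396,
        'instance': 'SSSSSSHSLSASDTG',
        'substrate': 'O15530',
        'start': 389,
        'end': 403,
        'resaa': 'S',
    },
    {
        'typ': 'dephosphorylation',
        'resnum': 77,
        'instance': 'YEPEGSASPTPPYLK',
        'substrate': 'O15169',
        'start': 70,
        'end': 84,
        'resaa': 'S',
    },
]


def site_matches_motif(rec):
    """
    True if the window length agrees with start/end, and the residue at
    position `resnum` within the window equals `resaa`.
    Assumes `start` and `end` are the sequence coordinates of the window.
    """
    window = rec['instance']
    offset = rec['resnum'] - rec['start']
    return (
        len(window) == rec['end'] - rec['start'] + 1 and
        0 <= offset < len(window) and
        window[offset] == rec['resaa']
    )


for rec in records:
    print(rec['substrate'], rec['resaa'] + str(rec['resnum']), site_matches_motif(rec))
```

A check like this can be useful before passing records on, because a mismatch between the residue and the window often points to an isoform or sequence version difference.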
Finally, SIGNOR also assigns proteins to pathways. This information is intended for the OmniPath
annotations database, and it is retrieved by the signor.signor_pathway_annotations function. This
function returns a dict of sets, which is typical for the *_annotations functions. This format
requires practically no further processing.
[5]:
signor.signor_pathway_annotations()['O14733']
executed in 1.48s, finished 13:58:28 2022-12-02
[5]:
{SignorPathway(pathway='TNF alpha'),
SignorPathway(pathway='Toll like receptors')}
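A dict of sets keyed by protein is easy to turn around when the question is "which proteins belong to a pathway?". The sketch below does this with the standard library only. The `SignorPathway` named tuple defined here is a local stand-in for the record type seen in the output above, and the data contains only the single protein shown there.

```python
import collections

# Local stand-in for the record type printed above
SignorPathway = collections.namedtuple('SignorPathway', ['pathway'])

# The annotations of one protein, copied from the output above
annotations = {
    'O14733': {
        SignorPathway(pathway='TNF alpha'),
        SignorPathway(pathway='Toll like receptors'),
    },
}

# Invert the mapping: pathway name -> set of UniProt IDs
by_pathway = collections.defaultdict(set)

for uniprot, annots in annotations.items():
    for annot in annots:
        by_pathway[annot.pathway].add(uniprot)

for pathway in sorted(by_pathway):
    print(pathway, sorted(by_pathway[pathway]))
```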
We haven’t mentioned all functions in the inputs.signor module. The rest of the functions retrieve
additional information needed by the four functions above, and are of limited direct use for users.
For example, signor_protein_families returns a dict with the internal IDs and members of protein
families; this data is used to process the interactions and complexes, but it is not too
interesting on its own.
[6]:
signor.signor_protein_families()['SIGNOR-PF2']
executed in 0ms, finished 13:58:53 2022-12-02
[6]:
['Q9HBW0', 'Q92633', 'Q9UBY5']
Interesting resources§
Here we showcase a few potentially useful features in pypath.inputs.
RaMP§
RaMP is a human metabolite and metabolic network database providing ID translation, annotations and enzymatic reactions of metabolites. Let’s first take a closer look at the full database contents. It is available as a MySQL database; below we list the tables and their column names:
[6]:
from pypath.inputs import ramp
ramp.ramp_list_tables()
executed in 2.20s, finished 16:51:14 2023-03-09
[6]:
{'analyte': ['rampId', 'type'],
'analytehasontology': ['rampCompoundId', 'rampOntologyId'],
'analytehaspathway': ['rampId', 'pathwayRampId', 'pathwaySource'],
'analytesynonym': ['Synonym', 'rampId', 'geneOrCompound', 'source'],
'catalyzed': ['rampCompoundId', 'rampGeneId'],
'chem_props': ['ramp_id',
'chem_data_source',
'chem_source_id',
'iso_smiles',
'inchi_key_prefix',
'inchi_key',
'inchi',
'mw',
'monoisotop_mass',
'common_name',
'mol_formula'],
'db_version': ['ramp_version',
'load_timestamp',
'version_notes',
'met_intersects_json',
'gene_intersects_json',
'met_intersects_json_pw_mapped',
'gene_intersects_json_pw_mapped',
'db_sql_url'],
'entity_status_info': ['status_category',
'entity_source_id',
'entity_source_name',
'entity_count'],
'metabolite_class': ['ramp_id',
'class_source_id',
'class_level_name',
'class_name',
'source'],
'ontology': ['rampOntologyId', 'commonName', 'HMDBOntologyType', 'metCount'],
'pathway': ['pathwayR
Using the ramp_raw function, we can access these tables either as Python dicts, or as
pandas.DataFrames, or loaded into an SQLite database. For further inspection, the data frames are
the most convenient.
Note: The very first time, retrieving these tables takes quite some time, not only due to the large download, but also due to a performance bottleneck in processing the MySQL dumps. Thanks to caching, subsequent loading of the tables is much faster.
Most of the ID translation data is contained in the source table:
[8]:
tables = ramp.ramp_raw(['analytesynonym', 'chem_props', 'source'])
tables['source']
executed in 4.25s, finished 16:54:17 2023-03-09
[8]:
sourceId | rampId | IDtype | geneOrCompound | commonName | priorityHMDBStatus | dataSource | pathwayCount | |
---|---|---|---|---|---|---|---|---|
0 | hmdb:HMDB0000001 | RAMP_C_000000001 | hmdb | compound | 1-Methylhistidine | quantified | hmdb | 2 |
1 | hmdb:HMDB0000479 | RAMP_C_000000001 | hmdb | compound | 3-Methylhistidine | quantified | hmdb | 2 |
2 | chebi:50599 | RAMP_C_000000001 | chebi | compound | 1-Methylhistidine | quantified | hmdb | 2 |
3 | chemspider:83153 | RAMP_C_000000001 | chemspider | compound | 1-Methylhistidine | quantified | hmdb | 2 |
4 | kegg:C01152 | RAMP_C_000000001 | kegg | compound | 1-Methylhistidine | quantified | hmdb_kegg | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
756552 | uniprot:H0YDB7 | RAMP_G_000009307 | uniprot | gene | RAB38 | NULL | wiki | 10 |
756553 | uniprot:A0A024R191 | RAMP_G_000009307 | uniprot | gene | RAB38 | NULL | wiki | 10 |
756554 | uniprot:H0YEA4 | RAMP_G_000009307 | uniprot | gene | RAB38 | NULL | wiki | 10 |
756555 | entrez:23682 | RAMP_G_000009307 | entrez | gene | RAB38 | NULL | wiki | 10 |
756556 | gene_symbol:RAB38 | RAMP_G_000009307 | gene_symbol | gene | RAB38 | NULL | wiki | 10 |
756557 rows × 8 columns
Structural and physicochemical information is available in the chem_props table:
[10]:
tables['chem_props']
executed in 0ms, finished 17:00:46 2023-03-09
[10]:
ramp_id | chem_data_source | chem_source_id | iso_smiles | inchi_key_prefix | inchi_key | inchi | mw | monoisotop_mass | common_name | mol_formula | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | RAMP_C_000000001 | hmdb | hmdb:HMDB0000001 | [H]OC(=O)[C@@]([H])(N([H])[H])C([H])([H])C1=C(... | BRMWTNUJHUMWMS | BRMWTNUJHUMWMS-LURJTMIESA-N | InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11... | 169.181 | 169.085 | 1-Methylhistidine | C7H11N3O2 |
1 | RAMP_C_000000001 | hmdb | hmdb:HMDB0000479 | [H][C@](N)(CC1=CN=CN1C)C(O)=O | JDHILDINMRGULE | JDHILDINMRGULE-LURJTMIESA-N | InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11... | 169.181 | 169.085 | 3-Methylhistidine | C7H11N3O2 |
2 | RAMP_C_000000001 | chebi | chebi:27596 | Cn1cncc1C[C@H](N)C(O)=O | JDHILDINMRGULE | JDHILDINMRGULE-LURJTMIESA-N | InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11... | NULL | 169.085 | N(pros)-methyl-L-histidine | C7H11N3O2 |
3 | RAMP_C_000000001 | chebi | chebi:50599 | Cn1cnc(C[C@H](N)C(O)=O)c1 | BRMWTNUJHUMWMS | BRMWTNUJHUMWMS-LURJTMIESA-N | InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11... | NULL | 169.085 | N(tele)-methyl-L-histidine | C7H11N3O2 |
4 | RAMP_C_000000002 | hmdb | hmdb:HMDB0000002 | NCCCN | XFNJVJPLKCPIBV | XFNJVJPLKCPIBV-UHFFFAOYSA-N | InChI=1S/C3H10N2/c4-2-1-3-5/h1-5H2 | 74.1249 | 74.0844 | 1,3-Diaminopropane | C3H10N2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
275898 | RAMP_C_000258279 | lipidmaps | LIPIDMAPS:LMPK15050003 | C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCCCC)=C(O... | UXLMJHNFDRMGPW | UXLMJHNFDRMGPW-LJQANCHMSA-N | InChI=1S/C24H38O6/c1-4-5-6-7-8-9-10-11-12-13-1... | NULL | 422.267 | 2-hydroxy-5-methoxy-3-(2R-acetoxy-pentadecyl)-... | C24H38O6 |
275899 | RAMP_C_000258280 | lipidmaps | LIPIDMAPS:LMPK15050004 | C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCCCC)=CC(... | CVZNKLNAHBTINT | CVZNKLNAHBTINT-JOCHJYFZSA-N | InChI=1S/C24H38O5/c1-4-5-6-7-8-9-10-11-12-13-1... | NULL | 406.272 | 5-methoxy-3-(2R-acetoxy-pentadecyl)-1,4-benzoq... | C24H38O5 |
275900 | RAMP_C_000226089 | lipidmaps | LIPIDMAPS:LMPK15050005 | C1(OC)C(=O)C(C[C@H](OC(C)=O)CCCCCCCCCCC)=CC(=O... | JIUGZSYPFREDLG | JIUGZSYPFREDLG-HXUWFJFHSA-N | InChI=1S/C22H34O5/c1-4-5-6-7-8-9-10-11-12-13-2... | NULL | 378.241 | 5-methoxy-3-(2R-acetoxy-tridecyl)-1,4-benzoqui... | C22H34O5 |
275901 | RAMP_C_000258283 | lipidmaps | LIPIDMAPS:LMPK15050008 | C1(O)C(=O)C(CCCCCCCCCCCCCCC)=C(O)C(=O)C=1 | GXDURRGUXLDZKN | GXDURRGUXLDZKN-UHFFFAOYSA-N | InChI=1S/C21H34O4/c1-2-3-4-5-6-7-8-9-10-11-12-... | NULL | 350.246 | Suberonone | C21H34O4 |
275902 | RAMP_C_000258284 | lipidmaps | LIPIDMAPS:LMPK15050009 | C1(O)C(=O)C(CCCCCCCCCCCCC)=C(O)C(=O)C=1 | AMKNOBHCKRZHIO | AMKNOBHCKRZHIO-UHFFFAOYSA-N | InChI=1S/C19H30O4/c1-2-3-4-5-6-7-8-9-10-11-12-... | NULL | 322.214 | Rapanone | C19H30O4 |
275903 rows × 11 columns
Raw RaMP data can also be accessed as an SQLite database. The advantage here is the high
performance and flexibility of operations. Conversion to pandas and back is really easy, so you
can always have the result in a data frame. Below, con is a database connection ready to execute
your queries. It is an in-memory database; alternatively, an on-disk database can be used. We use
pypath.formats.sqlite to look into the SQLite database.
[11]:
con = ramp.ramp_raw(['source', 'chem_props', 'analytesynonym'], sqlite = True)
con
executed in 10.56s, finished 17:07:00 2023-03-09
[11]:
<sqlite3.Connection at 0x6fa1e9e4e940>
Now that we have loaded these 3 big tables both as data frames and as SQLite tables, let’s see how much memory they use (normally only one of the two forms, hence about half of this memory, is needed, and the tables should stay in memory only for short periods):
[13]:
from pypath.share import common
common.format_bytes(common.python_memory_usage())
executed in 0ms, finished 17:07:44 2023-03-09
[13]:
'3.7 GB'
Looking into the database, we see the 3 tables loaded, and their column names:
[19]:
from pypath.formats import sqlite
sqlite.list_columns(con)
executed in 0ms, finished 17:13:01 2023-03-09
[19]:
{'source': ['sourceId',
'rampId',
'IDtype',
'geneOrCompound',
'commonName',
'priorityHMDBStatus',
'dataSource',
'pathwayCount'],
'analytesynonym': ['Synonym', 'rampId', 'geneOrCompound', 'source'],
'chem_props': ['ramp_id',
'chem_data_source',
'chem_source_id',
'iso_smiles',
'inchi_key_prefix',
'inchi_key',
'inchi',
'mw',
'monoisotop_mass',
'common_name',
'mol_formula']}
Let’s see how to execute an SQL query and fetch the output into a data frame. This query takes the
source table, selects the records with HMDB and ChEBI IDs in two subqueries, and joins the two by
rampId, in order to obtain an HMDB ←→ ChEBI mapping table:
[22]:
import pandas as pd
# Join the HMDB and ChEBI records of the `source` table by their RaMP ID.
# SQL string literals are single quoted; double quotes denote identifiers.
query = (
    "SELECT DISTINCT a.sourceId AS hmdb, b.sourceId AS chebi "
    "FROM "
    "    (SELECT sourceId, rampId "
    "     FROM source "
    "     WHERE geneOrCompound = 'compound' AND IDtype = 'hmdb') a "
    "JOIN "
    "    (SELECT sourceId, rampId "
    "     FROM source "
    "     WHERE geneOrCompound = 'compound' AND IDtype = 'chebi') b "
    "ON a.rampId = b.rampId;"
)
df = pd.read_sql_query(query, con)
df
executed in 1ms, finished 17:18:37 2023-03-09
[22]:
hmdb | chebi | |
---|---|---|
0 | hmdb:HMDB0000001 | chebi:27596 |
1 | hmdb:HMDB0000001 | chebi:50599 |
2 | hmdb:HMDB0000479 | chebi:27596 |
3 | hmdb:HMDB0000479 | chebi:50599 |
4 | hmdb:HMDB00001 | chebi:27596 |
... | ... | ... |
104129 | hmdb:HMDB0126033 | chebi:25882 |
104130 | hmdb:HMDB0141947 | chebi:180150 |
104131 | hmdb:HMDB0128505 | chebi:7870 |
104132 | hmdb:HMDB0130984 | chebi:8227 |
104133 | hmdb:HMDB0130987 | chebi:8630 |
104134 rows × 2 columns
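To experiment with this kind of self-join without downloading RaMP, here is a minimal sketch that uses only the standard library `sqlite3` module. It builds a toy `source` table from the five rows shown in the table excerpt earlier (four of the eight columns), and generalizes the query to any pair of ID types with bound parameters. The `mapping_pairs` function is hypothetical, written for this illustration. Because the toy table is incomplete, the result is smaller than the real output above.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute(
    'CREATE TABLE source '
    '(sourceId TEXT, rampId TEXT, IDtype TEXT, geneOrCompound TEXT)'
)

# The first five rows of the `source` table shown earlier
rows = [
    ('hmdb:HMDB0000001', 'RAMP_C_000000001', 'hmdb', 'compound'),
    ('hmdb:HMDB0000479', 'RAMP_C_000000001', 'hmdb', 'compound'),
    ('chebi:50599', 'RAMP_C_000000001', 'chebi', 'compound'),
    ('chemspider:83153', 'RAMP_C_000000001', 'chemspider', 'compound'),
    ('kegg:C01152', 'RAMP_C_000000001', 'kegg', 'compound'),
]
con.executemany('INSERT INTO source VALUES (?, ?, ?, ?)', rows)

# Self-join on rampId; the two ID types are passed as bound parameters
QUERY = """
    SELECT DISTINCT a.sourceId AS id_a, b.sourceId AS id_b
    FROM source a
    JOIN source b ON a.rampId = b.rampId
    WHERE
        a.geneOrCompound = 'compound' AND
        b.geneOrCompound = 'compound' AND
        a.IDtype = ? AND
        b.IDtype = ?
    ORDER BY id_a, id_b
"""


def mapping_pairs(con, id_type_a, id_type_b):
    """Return sorted (id_a, id_b) pairs that share a RaMP ID."""
    return con.execute(QUERY, (id_type_a, id_type_b)).fetchall()


hmdb_chebi = mapping_pairs(con, 'hmdb', 'chebi')
hmdb_kegg = mapping_pairs(con, 'hmdb', 'kegg')

for pair in hmdb_chebi + hmdb_kegg:
    print(pair)
```

Bound parameters avoid building SQL strings by hand, which matters once the ID types come from user input.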
Such mapping tables can be easily accessed for any pair of identifier types by the ramp_mapping
function. Before that, let’s see the complete list of supported ID types:
[24]:
ramp.ramp_id_types()
executed in 4.45s, finished 17:23:09 2023-03-09
[24]:
{'CAS',
'EN',
'LIPIDMAPS',
'brenda',
'chebi',
'chemspider',
'ensembl',
'entrez',
'gene_symbol',
'hmdb',
'kegg',
'kegg_glycan',
'lipidbank',
'ncbiprotein',
'plantfa',
'pubchem',
'swisslipids',
'uniprot',
'wikidata'}
[31]:
ramp.ramp_mapping('LIPIDMAPS', 'swisslipids')
executed in 4.94s, finished 17:29:17 2023-03-09
[31]:
{'LMFA00000008': {'SLM:000390048'},
'LMFA01010001': {'SLM:000000510'},
'LMFA01010002': {'SLM:000000449'},
'LMFA01010003': {'SLM:000001194'},
'LMFA01010004': {'SLM:000001195'},
'LMFA01010005': {'SLM:000389552'},
'LMFA01010006': {'SLM:000001196'},
'LMFA01010007': {'SLM:000389947'},
'LMFA01010008': {'SLM:000000853'},
'LMFA01010010': {'SLM:000000855'},
'LMFA01010011': {'SLM:000389946'},
'LMFA01010012': {'SLM:000000719'},
'LMFA01010013': {'SLM:000001198'},
'LMFA01010014': {'SLM:000000825'},
'LMFA01010015': {'SLM:000001199'},
'LMFA01010017': {'SLM:000001095'},
'LMFA01010019': {'SLM:000001205'},
'LMFA01010020': {'SLM:000000829'},
'LMFA01010021': {'SLM:000001207'},
'LMFA01010022': {'SLM:000000827'},
'LMFA01010023': {'SLM:000001128'},
'LMFA01010024': {'SLM:000000414'},
'LMFA01010026': {'SLM:000000539'},
'LMFA01010027': {'SLM:000000980'},
'LMFA01010028': {'SLM:000000540'},
'LMFA01010030': {'SLM:000000543'},
'LMFA01010032': {'SLM:000000544'},
'LMFA01010034': {'SLM:00000
Above we got a dict of sets; alternatively, data frames are available:
[32]:
ramp.ramp_mapping('LIPIDMAPS', 'swisslipids', return_df = True)
executed in 4.63s, finished 17:30:27 2023-03-09
[32]:
id_type_a | id_type_b | |
---|---|---|
0 | LMST02030086 | SLM:000485328 |
1 | LMST02030087 | SLM:000485330 |
2 | LMSP06020013 | SLM:000000534 |
3 | LMST02020001 | SLM:000001055 |
4 | LMST02020001 | SLM:000485315 |
... | ... | ... |
35218 | LMPR0104010007 | SLM:000389242 |
35219 | LMPR0104030005 | SLM:000390232 |
35220 | LMPR0104030006 | SLM:000390227 |
35221 | LMPR01070626 | SLM:000000432 |
35222 | LMPR01090015 | SLM:000389419 |
35223 rows × 2 columns
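The two return formats carry the same information: a dict of sets groups the targets by source ID, while the data frame has one row per pair. The sketch below converts between the two shapes using plain Python, with the first five pairs copied from the data frame above. The function names `pairs_to_dict` and `dict_to_pairs` are made up for this illustration and are not part of pypath.

```python
# The first five rows of the data frame above, as (id_type_a, id_type_b) pairs
pairs = [
    ('LMST02030086', 'SLM:000485328'),
    ('LMST02030087', 'SLM:000485330'),
    ('LMSP06020013', 'SLM:000000534'),
    ('LMST02020001', 'SLM:000001055'),
    ('LMST02020001', 'SLM:000485315'),
]


def pairs_to_dict(pairs):
    """One row per pair -> dict of sets, grouping targets by source ID."""
    result = {}
    for id_a, id_b in pairs:
        result.setdefault(id_a, set()).add(id_b)
    return result


def dict_to_pairs(mapping):
    """Dict of sets -> sorted list of (source, target) pairs."""
    return sorted(
        (id_a, id_b)
        for id_a, ids_b in mapping.items()
        for id_b in ids_b
    )


as_dict = pairs_to_dict(pairs)

# One LIPID MAPS ID may map to several SwissLipids IDs
print(sorted(as_dict['LMST02020001']))
print(len(as_dict), 'source IDs,', len(dict_to_pairs(as_dict)), 'pairs')
```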
RaMP ID translation is also integrated into the higher level APIs in pypath.utils.mapping. Below,
we first look into the available ID types and translation tables:
[34]:
from pypath.utils import mapping
m = mapping.get_mapper()
m.id_types()
executed in 0ms, finished 17:38:25 2023-03-09
[34]:
{IdType(pypath='CAS', original='CAS'),
IdType(pypath='LIPIDMAPS', original='LIPIDMAPS'),
IdType(pypath='MedChemExpress', original='MedChemExpress'),
IdType(pypath='actor', original='actor'),
IdType(pypath='affy', original='affy'),
IdType(pypath='affymetrix', original='affymetrix'),
IdType(pypath='agilent', original='agilent'),
IdType(pypath='alzforum', original='Alzforum_mut'),
IdType(pypath='araport', original='Araport'),
IdType(pypath='atlas', original='atlas'),
IdType(pypath='bindingdb', original='bindingdb'),
IdType(pypath='brenda', original='brenda'),
IdType(pypath='carotenoiddb', original='carotenoiddb'),
IdType(pypath='cas', original='CAS'),
IdType(pypath='cas_id', original='CAS'),
IdType(pypath='cgnc', original='CGNC'),
IdType(pypath='chebi', original='chebi'),
IdType(pypath='chembl', original='chembl'),
IdType(pypath='chemicalbook', original='chemicalbook'),
IdType(pypath='chemspider', original='chemspider'),
IdType(pypath='clinicaltrials', original='clinic
These are the ID types not only from RaMP, but from all the supported resources. In the RaMP
mapping table definitions, id_type_b is always None, because translation between any two ID types
is supported:
[35]:
[t for t in m.mapping_tables() if t.resource == 'ramp']
executed in 0ms, finished 17:46:56 2023-03-09
[35]:
[MappingTableDefinition(id_type_a='kegg_glycan', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='kegg_glycan', resource_id_type_b=None),
MappingTableDefinition(id_type_a='hmdb', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='hmdb', resource_id_type_b=None),
MappingTableDefinition(id_type_a='wikidata', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='wikidata', resource_id_type_b=None),
MappingTableDefinition(id_type_a='LIPIDMAPS', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='LIPIDMAPS', resource_id_type_b=None),
MappingTableDefinition(id_type_a='kegg', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='kegg', resource_id_type_b=None),
MappingTableDefinition(id_type_a='CAS', id_type_b=None, resource='ramp', input_class='RampMapping', resource_id_type_a='CAS', resource_id_type_b=None),
MappingTableDefinition(id_type_a='chebi
TL;DR§
Up to this point, this section offered extra insights, but what 99% of users will do looks like this:
[36]:
from pypath.utils import mapping
mapping.map_name('131431', 'chebi', 'hmdb')
executed in 0ms, finished 17:53:38 2023-03-09
[36]:
{'HMDB0094709'}
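Since the result is a set, code that translates many IDs has to deal with ambiguous and missing translations. The following sketch shows one possible way to organize that. It does not import pypath: `fake_map_name` is a stand-in that mimics the call signature used above and knows only the single example from the output, and `translate_many` is a hypothetical helper. It is an assumption here that an untranslatable ID yields an empty set.

```python
def translate_many(names, id_type, target_id_type, map_name):
    """
    Translate several IDs with a `map_name`-like callable that returns
    a set of target IDs. Returns the successful translations and the
    list of IDs without any translation.
    """
    translated, missing = {}, []
    for name in names:
        targets = map_name(name, id_type, target_id_type)
        if targets:
            translated[name] = targets
        else:
            missing.append(name)
    return translated, missing


def fake_map_name(name, id_type, target_id_type):
    # Stand-in for mapping.map_name; knows only the example shown above.
    # Assumption: unknown IDs give an empty set.
    table = {('131431', 'chebi', 'hmdb'): {'HMDB0094709'}}
    return table.get((name, id_type, target_id_type), set())


translated, missing = translate_many(
    ['131431', 'not-a-chebi-id'],
    'chebi',
    'hmdb',
    fake_map_name,
)

print(translated)
print(missing)
```

With pypath installed, passing `mapping.map_name` in place of `fake_map_name` should work the same way, as long as the assumption about empty results holds.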
HMDB (Human Metabolome Database)§
Direct access to HMDB data§
The inputs.hmdb module processes metabolite and protein data using lxml.etree and some minimal
utilities from formats.xml. The metabolite or protein records are available as
lxml.etree.Element objects, or custom fields can be extracted into dicts or data frames. To
iterate through the XML elements, each representing a metabolite:
[1]:
from pypath.inputs import hmdb
next(hmdb.iter_metabolites())
executed in 1ms, finished 12:23:11 2023-04-24
[1]:
<Element {http://www.hmdb.ca}metabolite at 0x60b1846262c0>
On the Element objects you can directly use the methods of lxml.etree to extract information. An
easier and more flexible way to extract information from these XML records is to define a schema
with instructions for lxml. A full schema for HMDB metabolites is available in
hmdb.METABOLITES_SCHEMA:
[2]:
hmdb.METABOLITES_SCHEMA
executed in 0ms, finished 12:24:03 2023-04-24
[2]:
{'taxonomy': ('taxonomy',
{'description': ('description', None),
'direct_parent': ('direct_parent', None),
'kingdom': ('kingdom', None),
'class': ('class', None),
'sub_class': ('sub_class', None),
'molecular_framework': ('molecular_framework', None),
'alternative_parents': ('alternative_parents',
('alternative_parent', 'findall'),
None),
'substituents': ('substituents', ('substituent', 'findall'), None)}),
'spectra': ('spectra', ('spectrum', 'findall'), {'spectrum_id', 'type'}),
'biological_properties': ('biological_properties',
{'cellular_locations': ('cellular_locations', ('cellular', 'findall'), None),
'biospecimen_locations': ('biospecimen_locations',
('biospecimen', 'findall'),
None),
'tissue_locations': ('tissue_locations', ('tissue', 'findall'), None),
'pathways': ('pathways',
('pathway', 'findall'),
{'kegg_map_id', 'name', 'smpdb_id'})}),
'experimental_properties': ('experimental_properties',
('property', 'findall')
The schema for proteins is different:
[3]:
hmdb.PROTEINS_SCHEMA
executed in 0ms, finished 12:24:52 2023-04-24
[3]:
{'gene_properties': ('gene_properties',
{'chromosome_location': ('chromosome_location', None),
'locus': ('locus', None),
'gene_sequence': ('gene_sequence', None)}),
'protein_properties': ('protein_properties',
{'residue_number': ('residue_number', None),
'molecular_weight': ('molecular_weight', None),
'theoretical_pi': ('theoretical_pi', None),
'polypeptide_sequence': ('polypeptide_sequence', None),
'transmembrane_regions': ('transmembrane_regions',
('region', 'findall'),
None),
'signal_regions': ('signal_regions', ('region', 'findall'), None)}),
'pfams': ('pfams', ('pfam', 'findall'), {'name', 'pfam_id'}),
'metabolite_associations': ('metabolite_associations',
('metabolite', 'findall'),
{'accession', 'name'}),
'go_classifications': ('go_classifications',
('go_class', 'findall'),
{'category', 'description', 'go_id'}),
'pathways': ('pathways',
('pathway', 'findall'),
{'kegg_map_id', 'name', 'smpdb_id'}),
'general_references': ('general_
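To make the schema notation more tangible, here is a minimal sketch of how such a nested schema could drive the extraction from an XML record. It uses only the standard library xml.etree.ElementTree instead of lxml and ignores XML namespaces. The names TOY_XML, TOY_SCHEMA and extract are made up for this illustration. The actual implementation in pypath may work differently, and the third tuple element (apparently a set of subfields) is not handled here.

```python
import xml.etree.ElementTree as ET

# A toy record, loosely following the structure of an HMDB metabolite
TOY_XML = """
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <taxonomy>
    <kingdom>Organic compounds</kingdom>
    <alternative_parents>
      <alternative_parent>Amino acids</alternative_parent>
      <alternative_parent>Aralkylamines</alternative_parent>
    </alternative_parents>
  </taxonomy>
</metabolite>
"""

# Hypothetical schema in the spirit of the ones shown above:
#   (tag, None)                      -> text of the child element
#   (tag, (subtag, 'findall'), None) -> list of texts of the sub-elements
#   (tag, {...})                     -> nested dict, processed recursively
TOY_SCHEMA = {
    'accession': ('accession', None),
    'taxonomy': ('taxonomy', {
        'kingdom': ('kingdom', None),
        'alternative_parents': (
            'alternative_parents',
            ('alternative_parent', 'findall'),
            None,
        ),
    }),
}


def extract(element, schema):
    """Extract the fields defined in `schema` from an XML element."""

    result = {}

    for key, spec in schema.items():

        child = element.find(spec[0])

        if child is None:
            # the field is missing from this record
            result[key] = None
        elif isinstance(spec[1], dict):
            # nested schema: recurse into the child element
            result[key] = extract(child, spec[1])
        elif isinstance(spec[1], tuple):
            # array field: call the named method, e.g. `findall`
            subtag, method = spec[1]
            result[key] = [e.text for e in getattr(child, method)(subtag)]
        else:
            # simple field: the text content of the element
            result[key] = child.text

    return result


record = extract(ET.fromstring(TOY_XML), TOY_SCHEMA)
print(record)
```

Only the fields named in the schema are visited, which is why a smaller schema means less work per record.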
By default, hmdb.metabolites_raw and hmdb.proteins_raw use the full schema, but you can pass a smaller dict with only your fields of interest, which greatly reduces the processing time. Using the head argument, we peek into the first N records of the data:
[4]:
list(hmdb.metabolites_raw(head = 3))
executed in 0ms, finished 12:25:31 2023-04-24
[4]:
[{'taxonomy': {'description': ' belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom.',
'direct_parent': 'Histidine and derivatives',
'kingdom': 'Organic compounds',
'class': 'Carboxylic acids and derivatives',
'sub_class': 'Amino acids, peptides, and analogues',
'molecular_framework': 'Aromatic heteromonocyclic compounds',
'alternative_parents': ['Amino acids',
'Aralkylamines',
'Azacyclic compounds',
'Carbonyl compounds',
'Carboxylic acids',
'Heteroaromatic compounds',
'Hydrocarbon derivatives',
'Imidazolyl carboxylic acids and derivatives',
'L-alpha-amino acids',
'Monoalkylamines',
'Monocarboxylic acids and derivatives',
'N-substituted imidazoles',
'Organic oxides',
The returned nested dict corresponds to the schema. Another example with a schema that extracts only the accession and name fields:
[6]:
list(hmdb.metabolites_raw(
    schema = {
        'accession': hmdb.METABOLITES_SCHEMA['accession'],
        'name': hmdb.METABOLITES_SCHEMA['name'],
    },
    head = 20,
))
executed in 0ms, finished 12:25:55 2023-04-24
[6]:
[{'accession': 'HMDB0000001', 'name': '1-Methylhistidine'},
{'accession': 'HMDB0000002', 'name': '1,3-Diaminopropane'},
{'accession': 'HMDB0000005', 'name': '2-Ketobutyric acid'},
{'accession': 'HMDB0000008', 'name': '2-Hydroxybutyric acid'},
{'accession': 'HMDB0000010', 'name': '2-Methoxyestrone'},
{'accession': 'HMDB0000011', 'name': '3-Hydroxybutyric acid'},
{'accession': 'HMDB0000012', 'name': 'Deoxyuridine'},
{'accession': 'HMDB0000014', 'name': 'Deoxycytidine'},
{'accession': 'HMDB0000015', 'name': 'Cortexolone'},
{'accession': 'HMDB0000016', 'name': 'Deoxycorticosterone'},
{'accession': 'HMDB0000017', 'name': '4-Pyridoxic acid'},
{'accession': 'HMDB0000019', 'name': 'alpha-Ketoisovaleric acid'},
{'accession': 'HMDB0000020', 'name': 'p-Hydroxyphenylacetic acid'},
{'accession': 'HMDB0000021', 'name': 'Iodotyrosine'},
{'accession': 'HMDB0000022', 'name': '3-Methoxytyramine'},
{'accession': 'HMDB0000023', 'name': '(S)-3-Hydroxyisobutyric acid'},
{'accession': 'HMDB00
It works in a similar way for proteins:
[7]:
list(hmdb.proteins_raw(
    schema = {
        'name': hmdb.PROTEINS_SCHEMA['name'],
        'genesymbol': hmdb.PROTEINS_SCHEMA['gene_name'],
    },
    head = 20,
))
executed in 0ms, finished 12:29:23 2023-04-24
[7]:
[{'name': "5'-nucleotidase", 'genesymbol': 'NT5E'},
{'name': 'Deoxycytidylate deaminase', 'genesymbol': 'DCTD'},
{'name': 'UMP-CMP kinase', 'genesymbol': 'CMPK1'},
{'name': "Cytosolic 5'-nucleotidase 1B", 'genesymbol': 'NT5C1B'},
{'name': "Cytosolic 5'-nucleotidase 1A", 'genesymbol': 'NT5C1A'},
{'name': "5'(3')-deoxyribonucleotidase, cytosolic type",
'genesymbol': 'NT5C'},
{'name': 'Deoxycytidine kinase', 'genesymbol': 'DCK'},
{'name': "5'(3')-deoxyribonucleotidase, mitochondrial", 'genesymbol': 'NT5M'},
{'name': 'Hydroxymethylglutaryl-CoA lyase, mitochondrial',
'genesymbol': 'HMGCL'},
{'name': 'ATP-citrate synthase', 'genesymbol': 'ACLY'},
{'name': 'Histone acetyltransferase p300', 'genesymbol': 'EP300'},
{'name': 'Pyruvate dehydrogenase E1 component subunit beta, mitochondrial',
'genesymbol': 'PDHB'},
{'name': 'Acetyl-CoA acetyltransferase, cytosolic', 'genesymbol': 'ACAT2'},
{'name': 'CREB-binding protein', 'genesymbol': 'CREBBP'},
{'name': 'Diamine acetyltransfe
Higher level access to HMDB data§
With the hmdb.metabolites_table and hmdb.proteins_table functions you can process the records into a pandas data frame. These functions accept a list of positional or named arguments using a simple notation (see their documentation). Instead of the simple tuple notation, hmdb.Field objects can be used to define the fields; the arguments for Field follow the same format as the tuples or strings passed directly to hmdb.*_table. Let’s extract a data frame with SMILES, InChIKeys and HMDB accessions:
[8]:
hmdb.metabolites_table('accession', 'smiles', 'inchikey', head = 10)
executed in 0ms, finished 12:32:01 2023-04-24
[8]:
 | accession | smiles | inchikey |
---|---|---|---|
0 | HMDB0000001 | CN1C=NC(C[C@H](N)C(O)=O)=C1 | BRMWTNUJHUMWMS-LURJTMIESA-N |
1 | HMDB0000002 | NCCCN | XFNJVJPLKCPIBV-UHFFFAOYSA-N |
2 | HMDB0000005 | CCC(=O)C(O)=O | TYEYBOSBBBHJIV-UHFFFAOYSA-N |
3 | HMDB0000008 | CC[C@H](O)C(O)=O | AFENDNXGAFYKQO-VKHMYHEASA-N |
4 | HMDB0000010 | [H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[... | WHEUWNKSCXYKBU-QPWUGHHJSA-N |
5 | HMDB0000011 | C[C@@H](O)CC(O)=O | WHBMMWSBFZVSSR-GSVOUGTGSA-N |
6 | HMDB0000012 | OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O | MXHRCPNRJAMMIM-SHYZEUOFSA-N |
7 | HMDB0000014 | NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1 | CKTSBUTUHBMZGZ-SHYZEUOFSA-N |
8 | HMDB0000015 | [H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1(... | WHBHBVVOGNECLV-OBQKJFGGSA-N |
9 | HMDB0000016 | [H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H... | ZESRJSPZRDMNHY-YFWFAHHUSA-N |
10 | HMDB0000017 | CC1=NC=C(CO)C(C(O)=O)=C1O | HXACOUQIXZGNBF-UHFFFAOYSA-N |
The above example is simple, as each field has a single string value. The synonyms field, however, is an array within each record. Below, we first process it as an array column, i.e. each row contains an array:
[9]:
hmdb.metabolites_table('accession', 'name', 'synonyms', head = 10)
executed in 0ms, finished 12:32:13 2023-04-24
[9]:
 | accession | name | synonyms |
---|---|---|---|
0 | HMDB0000001 | 1-Methylhistidine | [(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)pro... |
1 | HMDB0000002 | 1,3-Diaminopropane | [1,3-Propanediamine, 1,3-Propylenediamine, Pro... |
2 | HMDB0000005 | 2-Ketobutyric acid | [2-Ketobutanoic acid, 2-Oxobutyric acid, 3-Met... |
3 | HMDB0000008 | 2-Hydroxybutyric acid | [(S)-2-Hydroxybutanoic acid, 2-Hydroxybutyrate... |
4 | HMDB0000010 | 2-Methoxyestrone | [2-(8S,9S,13S,14S)-3-Hydroxy-2-methoxy-13-meth... |
5 | HMDB0000011 | 3-Hydroxybutyric acid | [(R)-(-)-beta-Hydroxybutyric acid, (R)-3-Hydro... |
6 | HMDB0000012 | Deoxyuridine | [2-Deoxyuridine, dU, 2'-Deoxyuridine, 1-(2-Deo... |
7 | HMDB0000014 | Deoxycytidine | [4-Amino-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymet... |
8 | HMDB0000015 | Cortexolone | [11-Desoxy-17-hydroxycorticosterone, Cortodoxo... |
9 | HMDB0000016 | Deoxycorticosterone | [21-Hydroxy-4-pregnene-3,20-dione, 21-Hydroxyp... |
10 | HMDB0000017 | 4-Pyridoxic acid | [2-Methyl-3-hydroxy-4-carboxy-5-hydroxymethylp... |
Each element in the column is an array:
[10]:
hmdb_synonyms = hmdb.metabolites_table('accession', 'name', 'synonyms', head = 10)
hmdb_synonyms.synonyms[0]
executed in 0ms, finished 12:32:19 2023-04-24
[10]:
['(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoic acid',
'Pi-methylhistidine',
'(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoate',
'1 Methylhistidine',
'1-Methyl histidine',
'1-Methyl-histidine',
'1-Methyl-L-histidine',
'1-MHis',
'1-N-Methyl-L-histidine',
'L-1-Methylhistidine',
'N1-Methyl-L-histidine',
'1-Methylhistidine dihydrochloride',
'1-Methylhistidine']
Using the @ notation, the arrays can be expanded into multiple rows:
[11]:
hmdb.metabolites_table('accession', 'name', ('synonyms', '@'), head = 10)
executed in 0ms, finished 12:32:25 2023-04-24
[11]:
 | accession | name | synonyms |
---|---|---|---|
0 | HMDB0000001 | 1-Methylhistidine | (2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)prop... |
1 | HMDB0000001 | 1-Methylhistidine | Pi-methylhistidine |
2 | HMDB0000001 | 1-Methylhistidine | (2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)prop... |
3 | HMDB0000001 | 1-Methylhistidine | 1 Methylhistidine |
4 | HMDB0000001 | 1-Methylhistidine | 1-Methyl histidine |
... | ... | ... | ... |
291 | HMDB0000017 | 4-Pyridoxic acid | 3-Hydroxy-5-hydroxymethyl-2-methyl-isonicotins... |
292 | HMDB0000017 | 4-Pyridoxic acid | 4 Pyridoxinic acid |
293 | HMDB0000017 | 4-Pyridoxic acid | Pyridoxinecarboxylic acid |
294 | HMDB0000017 | 4-Pyridoxic acid | 4 Pyridoxylic acid |
295 | HMDB0000017 | 4-Pyridoxic acid | 4 Pyridoxic acid |
296 rows × 3 columns
This already resulted in almost 300 rows. Be careful when using @ for multiple columns, as it yields rows in a combinatorial way, and the resulting data frames can easily grow huge. Another notation is *, which extracts all elements of a dict into multiple columns. Below we apply it to the taxonomy column, which is a dict of multiple fields:
[12]:
hmdb.metabolites_table('accession', 'name', ('taxonomy', '*'), head = 10)
executed in 0ms, finished 12:32:30 2023-04-24
[12]:
 | accession | name | taxonomy__alternative_parents | taxonomy__class | taxonomy__description | taxonomy__direct_parent | taxonomy__kingdom | taxonomy__molecular_framework | taxonomy__sub_class | taxonomy__substituents |
---|---|---|---|---|---|---|---|---|---|---|
0 | HMDB0000001 | 1-Methylhistidine | [Amino acids, Aralkylamines, Azacyclic compoun... | Carboxylic acids and derivatives | belongs to the class of organic compounds kno... | Histidine and derivatives | Organic compounds | Aromatic heteromonocyclic compounds | Amino acids, peptides, and analogues | [Alpha-amino acid, Amine, Amino acid, Aralkyla... |
1 | HMDB0000002 | 1,3-Diaminopropane | [Hydrocarbon derivatives, Organopnictogen comp... | Organonitrogen compounds | belongs to the class of organic compounds kno... | Monoalkylamines | Organic compounds | Aliphatic acyclic compounds | Amines | [Aliphatic acyclic compound, Hydrocarbon deriv... |
2 | HMDB0000005 | 2-Ketobutyric acid | [Alpha-hydroxy ketones, Alpha-keto acids and d... | Keto acids and derivatives | belongs to the class of organic compounds kno... | Short-chain keto acids and derivatives | Organic compounds | Aliphatic acyclic compounds | Short-chain keto acids and derivatives | [Aliphatic acyclic compound, Alpha-hydroxy ket... |
3 | HMDB0000008 | 2-Hydroxybutyric acid | [Carbonyl compounds, Carboxylic acids, Fatty a... | Hydroxy acids and derivatives | belongs to the class of organic compounds kno... | Alpha hydroxy acids and derivatives | Organic compounds | Aliphatic acyclic compounds | Alpha hydroxy acids and derivatives | [Alcohol, Aliphatic acyclic compound, Alpha-hy... |
4 | HMDB0000010 | 2-Methoxyestrone | [1-hydroxy-2-unsubstituted benzenoids, 17-oxos... | Steroids and steroid derivatives | belongs to the class of organic compounds kno... | Estrogens and derivatives | Organic compounds | Aromatic homopolycyclic compounds | Estrane steroids | [1-hydroxy-2-unsubstituted benzenoid, 17-oxost... |
5 | HMDB0000011 | 3-Hydroxybutyric acid | [Carbonyl compounds, Carboxylic acids, Fatty a... | Hydroxy acids and derivatives | belongs to the class of organic compounds kno... | Beta hydroxy acids and derivatives | Organic compounds | Aliphatic acyclic compounds | Beta hydroxy acids and derivatives | [Alcohol, Aliphatic acyclic compound, Beta-hyd... |
6 | HMDB0000012 | Deoxyuridine | [Azacyclic compounds, Heteroaromatic compounds... | Pyrimidine nucleosides | belongs to the class of organic compounds kno... | Pyrimidine 2'-deoxyribonucleosides | Organic compounds | Aromatic heteromonocyclic compounds | Pyrimidine 2'-deoxyribonucleosides | [Alcohol, Aromatic heteromonocyclic compound, ... |
7 | HMDB0000014 | Deoxycytidine | [Aminopyrimidines and derivatives, Azacyclic c... | Pyrimidine nucleosides | belongs to the class of organic compounds kno... | Pyrimidine 2'-deoxyribonucleosides | Organic compounds | Aromatic heteromonocyclic compounds | Pyrimidine 2'-deoxyribonucleosides | [Alcohol, Amine, Aminopyrimidine, Aromatic het... |
8 | HMDB0000015 | Cortexolone | [17-hydroxysteroids, 20-oxosteroids, 3-oxo del... | Steroids and steroid derivatives | belongs to the class of organic compounds kno... | 21-hydroxysteroids | Organic compounds | Aliphatic homopolycyclic compounds | Hydroxysteroids | [17-hydroxysteroid, 20-oxosteroid, 21-hydroxys... |
9 | HMDB0000016 | Deoxycorticosterone | [20-oxosteroids, 3-oxo delta-4-steroids, Alpha... | Steroids and steroid derivatives | belongs to the class of organic compounds kno... | 21-hydroxysteroids | Organic compounds | Aliphatic homopolycyclic compounds | Hydroxysteroids | [20-oxosteroid, 21-hydroxysteroid, 3-oxo-delta... |
10 | HMDB0000017 | 4-Pyridoxic acid | [Aromatic alcohols, Azacyclic compounds, Carbo... | Pyridines and derivatives | belongs to the class of organic compounds kno... | Pyridinecarboxylic acids | Organic compounds | Aromatic heteromonocyclic compounds | Pyridinecarboxylic acids and derivatives | [Alcohol, Aromatic alcohol, Aromatic heteromon... |
We see that taxonomy was expanded into 8 columns. If we also expand all the arrays in those columns, we get a data frame of more than 2,000 rows from only the first few records (head = 10):
[13]:
hmdb.metabolites_table('accession', 'name', ('taxonomy', '*', '@'), head = 10)
executed in 0ms, finished 12:32:37 2023-04-24
[13]:
 | accession | name | taxonomy__alternative_parents | taxonomy__class | taxonomy__description | taxonomy__direct_parent | taxonomy__kingdom | taxonomy__molecular_framework | taxonomy__sub_class | taxonomy__substituents |
---|---|---|---|---|---|---|---|---|---|---|
0 | HMDB0000001 | 1-Methylhistidine | Amino acids | Carboxylic acids and derivatives | belongs to the class of organic compounds kno... | Histidine and derivatives | Organic compounds | Aromatic heteromonocyclic compounds | Amino acids, peptides, and analogues | Alpha-amino acid |
1 | HMDB0000001 | 1-Methylhistidine | Amino acids | Carboxylic acids and derivatives | belongs to the class of organic compounds kno... | Histidine and derivatives | Organic compounds | Aromatic heteromonocyclic compounds | Amino acids, peptides, and analogues | Amine |
2 | HMDB0000001 | 1-Methylhistidine | Amino acids | Carboxylic acids and derivatives | belongs to the class of organic compounds kno... | Histidine and derivatives | Organic compounds | Aromatic heteromonocyclic compounds | Amino acids, peptides, and analogues | Amino acid |
3 | HMDB0000001 | 1-Methylhistidine | Amino acids | Carboxylic acids and derivatives | belongs to the class of organic compounds kno... | Histidine and derivatives | Organic compounds | Aromatic heteromonocyclic compounds | Amino acids, peptides, and analogues | Aralkylamine |
4 | HMDB0000001 | 1-Methylhistidine | Amino acids | Carboxylic acids and derivatives | belongs to the class of organic compounds kno... | Histidine and derivatives | Organic compounds | Aromatic heteromonocyclic compounds | Amino acids, peptides, and analogues | Aromatic heteromonocyclic compound |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2235 | HMDB0000017 | 4-Pyridoxic acid | Vinylogous acids | Pyridines and derivatives | belongs to the class of organic compounds kno... | Pyridinecarboxylic acids | Organic compounds | Aromatic heteromonocyclic compounds | Pyridinecarboxylic acids and derivatives | Organooxygen compound |
2236 | HMDB0000017 | 4-Pyridoxic acid | Vinylogous acids | Pyridines and derivatives | belongs to the class of organic compounds kno... | Pyridinecarboxylic acids | Organic compounds | Aromatic heteromonocyclic compounds | Pyridinecarboxylic acids and derivatives | Organopnictogen compound |
2237 | HMDB0000017 | 4-Pyridoxic acid | Vinylogous acids | Pyridines and derivatives | belongs to the class of organic compounds kno... | Pyridinecarboxylic acids | Organic compounds | Aromatic heteromonocyclic compounds | Pyridinecarboxylic acids and derivatives | Primary alcohol |
2238 | HMDB0000017 | 4-Pyridoxic acid | Vinylogous acids | Pyridines and derivatives | belongs to the class of organic compounds kno... | Pyridinecarboxylic acids | Organic compounds | Aromatic heteromonocyclic compounds | Pyridinecarboxylic acids and derivatives | Pyridine carboxylic acid |
2239 | HMDB0000017 | 4-Pyridoxic acid | Vinylogous acids | Pyridines and derivatives | belongs to the class of organic compounds kno... | Pyridinecarboxylic acids | Organic compounds | Aromatic heteromonocyclic compounds | Pyridinecarboxylic acids and derivatives | Vinylogous acid |
2240 rows × 10 columns
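The combinatorial growth can be illustrated without pypath. The sketch below, with a made-up helper expand and a shortened toy record, emits one row for every combination of the elements of the expanded array fields, so the number of rows is the product of the array lengths. It is only meant to show the arithmetic, not how hmdb.metabolites_table is implemented.

```python
from itertools import product

# A shortened toy record with two array fields
record = {
    'accession': 'HMDB0000001',
    'alternative_parents': [
        'Amino acids',
        'Aralkylamines',
        'Azacyclic compounds',
    ],
    'substituents': ['Alpha-amino acid', 'Amine'],
}


def expand(record, array_fields):
    """One row for each combination of the elements of `array_fields`."""

    scalars = {k: v for k, v in record.items() if k not in array_fields}
    combinations = product(*(record[f] for f in array_fields))

    return [
        {**scalars, **dict(zip(array_fields, combination))}
        for combination in combinations
    ]


rows = expand(record, ['alternative_parents', 'substituents'])
print(len(rows))  # 3 * 2 = 6
```

With a dozen alternative parents and two dozen substituents, a single record already yields hundreds of rows, which is consistent with the size of the data frame above.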
The hmdb.metabolites_mapping and hmdb.proteins_mapping functions provide data frames or dicts for translation between a pair of identifier types. For example, to translate KEGG Compound IDs to SMILES (the default output is a dict of sets):
[14]:
hmdb.metabolites_mapping('kegg', 'smiles', head = 10)
executed in 0ms, finished 12:33:27 2023-04-24
[14]:
{'C00109': {'CCC(=O)C(O)=O'},
'C00526': {'OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O'},
'C00847': {'CC1=NC=C(CO)C(C(O)=O)=C1O'},
'C00881': {'NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1'},
'C00986': {'NCCCN'},
'C01089': {'C[C@@H](O)CC(O)=O'},
'C01152': {'CN1C=NC(C[C@H](N)C(O)=O)=C1'},
'C03205': {'[H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H])[C@@]2([H])CCC2=CC(=O)CC[C@]12C'},
'C05299': {'[H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[C@@]21[H])C=C(O)C(OC)=C3'},
'C05488': {'[H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1([H])[C@@]2([H])CCC2=CC(=O)CC[C@]12C'},
'C05984': {'CC[C@H](O)C(O)=O'}}
The same data in a data frame:
[15]:
hmdb.metabolites_mapping('kegg', 'smiles', head = 10, return_df = True)
executed in 0ms, finished 12:33:31 2023-04-24
[15]:
 | id_a | id_b |
---|---|---|
0 | C01152 | CN1C=NC(C[C@H](N)C(O)=O)=C1 |
1 | C00986 | NCCCN |
2 | C00109 | CCC(=O)C(O)=O |
3 | C05984 | CC[C@H](O)C(O)=O |
4 | C05299 | [H][C@@]12CCC(=O)[C@@]1(C)CC[C@]1([H])C3=C(CC[... |
5 | C01089 | C[C@@H](O)CC(O)=O |
6 | C00526 | OC[C@H]1O[C@H](C[C@@H]1O)N1C=CC(=O)NC1=O |
7 | C00881 | NC1=NC(=O)N(C=C1)[C@H]1C[C@H](O)[C@@H](CO)O1 |
8 | C05488 | [H][C@@]12CC[C@](O)(C(=O)CO)[C@@]1(C)CC[C@@]1(... |
9 | C03205 | [H][C@@]12CC[C@H](C(=O)CO)[C@@]1(C)CC[C@@]1([H... |
10 | C00847 | CC1=NC=C(CO)C(C(O)=O)=C1O |
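The dict of sets layout allows one identifier to map to several targets, and it is straightforward to reverse. As a small sketch, the made-up helper invert below turns a few of the KEGG to SMILES pairs shown above into a SMILES to KEGG lookup; it works on plain dicts and does not call pypath.

```python
from collections import defaultdict

# A few pairs copied from the output above
kegg_to_smiles = {
    'C00986': {'NCCCN'},
    'C00109': {'CCC(=O)C(O)=O'},
    'C05984': {'CC[C@H](O)C(O)=O'},
}


def invert(mapping):
    """Reverse a dict of sets: each value becomes a key."""

    inverted = defaultdict(set)

    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)

    return dict(inverted)


smiles_to_kegg = invert(kegg_to_smiles)
print(smiles_to_kegg['NCCCN'])
```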
ID translation with HMDB§
HMDB is also integrated into the ID translation service. Thanks to the multiple levels of caching, only the first call takes a long time; subsequent calls are fast:
[16]:
from pypath.utils import mapping
mapping.map_name('C01152', 'kegg', 'inchi')
executed in 0ms, finished 12:33:39 2023-04-24
[16]:
{'InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1',
'InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1'}
The two InChI strings correspond to the two constitutional isomers included in the KEGG ID: 1- and 3-Methylhistidine. A useful feature of HMDB is that it has many synonyms and IUPAC names, making it possible to parse a large variety of metabolite names:
[17]:
mapping.map_name('C01152', 'kegg', 'hmdb_synonym')
executed in 0ms, finished 12:33:41 2023-04-24
[17]:
{'(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoate',
'(2S)-2-Amino-3-(1-methyl-1H-imidazol-4-yl)propanoic acid',
'(2S)-2-Amino-3-(1-methyl-1H-imidazol-5-yl)propanoate',
'(2S)-2-Amino-3-(1-methyl-1H-imidazol-5-yl)propanoic acid',
'1 Methylhistidine',
'1-MHis',
'1-Methyl histidine',
'1-Methyl-L-histidine',
'1-Methyl-histidine',
'1-Methylhistidine',
'1-Methylhistidine dihydrochloride',
'1-N-Methyl-L-histidine',
'3-Methyl-L-histidine',
'3-Methylhistidine',
'3-Methylhistidine dihydrochloride',
'3-Methylhistidine hydride',
'3-N-Methyl-L-histidine',
'L-1-Methylhistidine',
'L-3-Methylhistidine',
'N Tau-methylhistidine',
'N(Tau)-methylhistidine',
'N(pros)-Methyl-L-histidine',
'N-pros-Methyl-L-histidine',
'N1-Methyl-L-histidine',
'N3-Methyl-L-histidine',
'Pi-methylhistidine',
'Tau-methyl-L-histidine',
'Tau-methylhistidine'}
[18]:
mapping.map_name('N(pros)-Methyl-L-histidine', 'hmdb_synonym', 'inchi')
executed in 1.81s, finished 12:33:46 2023-04-24
[18]:
{'InChI=1S/C7H11N3O2/c1-10-4-9-3-5(10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)/t6-/m0/s1'}
The name provided by HMDB is typically the best human readable name, hence it can be used as a label in figures or tables:
[19]:
mapping.map_name('HMDB0000001', 'hmdb', 'hmdb_name')
executed in 0ms, finished 12:33:47 2023-04-24
[19]:
{'1-Methylhistidine'}
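Because the translation returns a set, a label for a figure needs one element picked from it, and a fallback for identifiers without a name. A minimal sketch under these assumptions: the dict TOY_NAMES stands in for the results of mapping.map_name, and the helper label is hypothetical, not part of pypath.

```python
# Stand-in for translation results: HMDB accession -> set of names
TOY_NAMES = {
    'HMDB0000001': {'1-Methylhistidine'},
    'HMDB0000002': {'1,3-Diaminopropane'},
}


def label(hmdb_id, names = TOY_NAMES):
    """A single, deterministic label; falls back to the ID itself."""

    candidates = names.get(hmdb_id, set())

    # sorting makes the choice reproducible if there are several names
    return sorted(candidates)[0] if candidates else hmdb_id


labels = [label(i) for i in ('HMDB0000001', 'HMDB0000002', 'HMDB9999999')]
print(labels)
```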
SwissLipids§
The pypath.inputs.swisslipids module provides access to the datasets available from SwissLipids for download. Each function returns a csv.DictReader, an iterator that yields rows as dicts:
[5]:
from pypath.inputs import swisslipids
executed in 0ms, finished 19:38:03 2024-10-06
[3]:
tissues = swisslipids.swisslipids_tissues()
tissues
executed in 0ms, finished 19:37:12 2024-10-06
[3]:
<csv.DictReader at 0x6a4f241d0230>
[4]:
next(tissues)
executed in 0ms, finished 19:37:38 2024-10-06
[4]:
{'Lipid ID': 'SLM:000056561',
'Lipid name': 'Phosphatidylcholine (40:6)',
'Tissue/Cell ID': 'UBERON:0001969',
'Tissue/Cell name': 'blood plasma',
'Taxon ID': '9606',
'Taxon scientific name': 'Homo sapiens',
'Evidence tag ID': '6814'}
Alternatively, the datasets can be retrieved as data frames using the return_df argument. The “lipids” and “lipids2uniprot” datasets use a large amount of memory if loaded this way.
[6]:
swisslipids.swisslipids_tissues(return_df = True)
executed in 0ms, finished 19:40:23 2024-10-06
[6]:
 | Lipid ID | Lipid name | Tissue/Cell ID | Tissue/Cell name | Taxon ID | Taxon scientific name | Evidence tag ID |
---|---|---|---|---|---|---|---|
0 | SLM:000056561 | Phosphatidylcholine (40:6) | UBERON:0001969 | blood plasma | 9606 | Homo sapiens | 6814 |
1 | SLM:000056510 | Phosphatidylcholine (34:3) | UBERON:0001969 | blood plasma | 9606 | Homo sapiens | 6806 |
2 | SLM:000056525 | Phosphatidylcholine (36:4) | UBERON:0001969 | blood plasma | 9606 | Homo sapiens | 6809 |
3 | SLM:000056524 | Phosphatidylcholine (36:3) | UBERON:0001969 | blood plasma | 9606 | Homo sapiens | 6808 |
4 | SLM:000056509 | Phosphatidylcholine (34:2) | UBERON:0001969 | blood plasma | 9606 | Homo sapiens | 6805 |
... | ... | ... | ... | ... | ... | ... | ... |
934 | SLM:000098542 | Phosphatidylethanolamine (O-18:0/16:0) | UBERON:0000468 | multi-cellular organism | 6239 | Caenorhabditis elegans | 15918 |
935 | SLM:000098543 | Phosphatidylethanolamine (O-18:0/16:1) | UBERON:0000468 | multi-cellular organism | 6239 | Caenorhabditis elegans | 15917 |
936 | SLM:000098546 | Phosphatidylethanolamine (O-18:0/18:0) | UBERON:0000468 | multi-cellular organism | 6239 | Caenorhabditis elegans | 15916 |
937 | SLM:000098549 | Phosphatidylethanolamine (O-18:0/18:3) | UBERON:0000468 | multi-cellular organism | 6239 | Caenorhabditis elegans | 15913 |
938 | SLM:000098557 | Phosphatidylethanolamine (O-18:0/20:5) | UBERON:0000468 | multi-cellular organism | 6239 | Caenorhabditis elegans | 15910 |
939 rows × 7 columns
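When working with the csv.DictReader directly, keep in mind that it can be consumed only once. The sketch below demonstrates this with the standard library alone, on a tiny in-memory table; the tab separated layout and the reduced set of columns are assumptions made for the illustration.

```python
import csv
import io

# A tiny stand-in for a downloaded table (assumed to be tab separated)
TOY_TABLE = (
    'Lipid ID\tLipid name\tTaxon ID\n'
    'SLM:000056561\tPhosphatidylcholine (40:6)\t9606\n'
    'SLM:000098542\tPhosphatidylethanolamine (O-18:0/16:0)\t6239\n'
)

reader = csv.DictReader(io.StringIO(TOY_TABLE), delimiter = '\t')

# Rows are read lazily, so filtering does not require the whole table in memory
human = [row for row in reader if row['Taxon ID'] == '9606']

# The reader is exhausted after the first pass
leftover = list(reader)

print(human[0]['Lipid name'])
print(len(leftover))  # 0
```

If the rows are needed more than once, collect them into a list, or request a data frame as shown above.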
LIPID MAPS§
LIPID MAPS is an international non-profit consortium that develops and maintains standards and
tools for lipid research. Currently pypath
features a client for its Structure Database, called LMSD. Pypath uses
the SDF format, which includes all fields available in the database.
[7]:
from pypath.inputs import lipidmaps
executed in 0ms, finished 19:47:28 2024-10-06
When the function returns, the file is already downloaded and opened, but not parsed yet, hence the object reports 0 records:
[8]:
lmsd = lipidmaps.lmsd_sdf()
lmsd
executed in 1.29s, finished 19:47:47 2024-10-06
[8]:
<SDF file `structures.sdf`: 0 records>
One option to retrieve the records is to simply iterate over the object:
[12]:
for lipid in lmsd:
    break
lipid
executed in 0ms, finished 19:51:42 2024-10-06
[12]:
{'id': 'LMFA00000001',
'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
'comment': '',
'mol': '',
'name': {'LM_ID': 'LMFA00000001',
'SYSTEMATIC_NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
'FORMULA': 'C40H66O5',
'INCHI_KEY': 'VOGBKCAANIAXCI-UHFFFAOYSA-N',
'INCHI': 'InChI=1S/C40H66O5/c1-7-9-11-23-29-35(3)31-25-19-15-13-17-21-27-33-37(43-5)39(41)45-40(42)38(44-6)34-28-22-18-14-16-20-26-32-36(4)30-24-12-10-8-2/h7-8,35-38H,1-2,9-16,19-20,23-34H2,3-6H3',
'SMILES': 'C(C(OC)CCC#CCCCCCC(C)CCCCC=C)(=O)OC(C(OC)CCC#CCCCCCC(C)CCCCC=C)=O',
'ABBREVIATION': 'FA 40:7;O3',
'SYNONYMS': 'Acetylenic acids',
'PUBCHEM_CID': '10930192',
'CHEBI_ID': '178363'},
'annot': {'NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
'CATEGORY': 'Fatty Acyls [FA]',
'MAIN_CLASS': 'Other Fatty Acyls [FA00]',
'EXACT_MASS': '626.491025'}}
The same object is able to index the SDF file, and retrieve records on demand. The indexing covers all names, synonyms and identifiers used in the database.
[13]:
lmsd.index()
executed in 24.31s, finished 19:54:26 2024-10-06
After indexing, the database shows its size:
[15]:
lmsd
executed in 0ms, finished 19:55:54 2024-10-06
[15]:
<SDF file `structures.sdf`: 48116 records>
[16]:
len(lmsd)
executed in 0ms, finished 19:56:03 2024-10-06
[16]:
48116
The records can be retrieved by any of their names or identifiers:
[14]:
lmsd['LMFA00000001']
executed in 0ms, finished 19:54:52 2024-10-06
[14]:
[({'id': 'LMFA00000001',
'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
'comment': '',
'mol': '',
'name': {'LM_ID': 'LMFA00000001',
'SYSTEMATIC_NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
'FORMULA': 'C40H66O5',
'INCHI_KEY': 'VOGBKCAANIAXCI-UHFFFAOYSA-N',
'INCHI': 'InChI=1S/C40H66O5/c1-7-9-11-23-29-35(3)31-25-19-15-13-17-21-27-33-37(43-5)39(41)45-40(42)38(44-6)34-28-22-18-14-16-20-26-32-36(4)30-24-12-10-8-2/h7-8,35-38H,1-2,9-16,19-20,23-34H2,3-6H3',
'SMILES': 'C(C(OC)CCC#CCCCCCC(C)CCCCC=C)(=O)OC(C(OC)CCC#CCCCCCC(C)CCCCC=C)=O',
'ABBREVIATION': 'FA 40:7;O3',
'SYNONYMS': 'Acetylenic acids',
'PUBCHEM_CID': '10930192',
'CHEBI_ID': '178363'},
'annot': {'NAME': '2-methoxy-12-methyloctadec-17-en-5-ynoyl anhydride',
'CATEGORY': 'Fatty Acyls [FA]',
'MAIN_CLASS': 'Other Fatty Acyls [FA00]',
'EXACT_MASS': '626.491025'}},
0),
({'id': 'LMFA00000001',
'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
'comment': '',
'mol':
It also supports the in operator:
[17]:
'PC(18:1/18:0)' in lmsd
executed in 0ms, finished 19:57:02 2024-10-06
[17]:
True
[18]:
lmsd['PC(18:1/18:0)']
executed in 1ms, finished 19:57:28 2024-10-06
[18]:
[({'id': 'LMGP01010888',
'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
'comment': '',
'mol': '',
'name': {'LM_ID': 'LMGP01010888',
'SYSTEMATIC_NAME': '1-(9Z-octadecenoyl)-2-octadecanoyl-sn-glycero-3-phosphocholine',
'FORMULA': 'C44H86NO8P',
'INCHI_KEY': 'NMJCSTNQFYPVOR-VHONOUADSA-N',
'INCHI': 'InChI=1S/C44H86NO8P/c1-6-8-10-12-14-16-18-20-22-24-26-28-30-32-34-36-43(46)50-40-42(41-52-54(48,49)51-39-38-45(3,4)5)53-44(47)37-35-33-31-29-27-25-23-21-19-17-15-13-11-9-7-2/h20,22,42H,6-19,21,23-41H2,1-5H3/b22-20-/t42-/m1/s1',
'SMILES': '[C@](COP(=O)([O-])OCC[N+](C)(C)C)([H])(OC(CCCCCCCCCCCCCCCCC)=O)COC(CCCCCCC/C=C\\CCCCCCCC)=O',
'ABBREVIATION': 'PC 36:1',
'SYNONYMS': 'Choline phosphate, 3-ester with L-1-oleo-2-stearin; L-1-Oleoyl-2-stearoyl lecithin; L-1-Oleoyl-2-stearoyl-3-phosphatidylcholine; OSPC; PC(18:1/18:0); PC(36:1); PC(18:0_18:1)',
'PUBCHEM_CID': '24778936',
'HMDB_ID': 'HMDB0008102',
'CHEBI_ID': '76073',
'SWISSLIPIDS_ID': 'SLM:000012
Finally, the records can be loaded into memory; in this case their retrieval is faster:
[21]:
lmsd.load()
executed in 0ms, finished 20:02:08 2024-10-06
[23]:
lmsd['PC(18:1/18:0)']
executed in 1ms, finished 20:02:51 2024-10-06
[23]:
[{'id': 'LMGP01010888',
'source': 'LIPID_MAPS_STRUCTURE_DATABASE',
'comment': '',
'mol': '',
'name': {'LM_ID': 'LMGP01010888',
'SYSTEMATIC_NAME': '1-(9Z-octadecenoyl)-2-octadecanoyl-sn-glycero-3-phosphocholine',
'FORMULA': 'C44H86NO8P',
'INCHI_KEY': 'NMJCSTNQFYPVOR-VHONOUADSA-N',
'INCHI': 'InChI=1S/C44H86NO8P/c1-6-8-10-12-14-16-18-20-22-24-26-28-30-32-34-36-43(46)50-40-42(41-52-54(48,49)51-39-38-45(3,4)5)53-44(47)37-35-33-31-29-27-25-23-21-19-17-15-13-11-9-7-2/h20,22,42H,6-19,21,23-41H2,1-5H3/b22-20-/t42-/m1/s1',
'SMILES': '[C@](COP(=O)([O-])OCC[N+](C)(C)C)([H])(OC(CCCCCCCCCCCCCCCCC)=O)COC(CCCCCCC/C=C\\CCCCCCCC)=O',
'ABBREVIATION': 'PC 36:1',
'SYNONYMS': 'L-1-Oleoyl-2-stearoyl-3-phosphatidylcholine;PC(36:1);PC(18:0_18:1);PC(18:1/18:0);Choline phosphate, 3-ester with L-1-oleo-2-stearin;OSPC;L-1-Oleoyl-2-stearoyl lecithin',
'PUBCHEM_CID': '24778936',
'HMDB_ID': 'HMDB0008102',
'CHEBI_ID': '76073',
'SWISSLIPIDS_ID': 'SLM:000012332'},
'annot': {'NA
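The difference between indexing and loading can be sketched with a toy example: an index stores only the position of each record in the file, and the record is read from that position on demand. The code below is an assumption-laden illustration and not the SDF reader of pypath: TOY_SDF is a heavily simplified stand-in for an SDF file (records terminated by a `$$$$` line), and build_index and read_record are made-up helpers.

```python
import io

# Heavily simplified stand-in for an SDF file: two records,
# each starting with its ID and terminated by a `$$$$` line
TOY_SDF = (
    'LMFA00000001\n'
    '> <ABBREVIATION>\n'
    'FA 40:7;O3\n'
    '\n'
    '$$$$\n'
    'LMGP01010888\n'
    '> <ABBREVIATION>\n'
    'PC 36:1\n'
    '\n'
    '$$$$\n'
)


def build_index(fp):
    """Map the first line of each record to its offset in the file."""

    index = {}
    fp.seek(0)
    offset = fp.tell()
    new_record = True

    while True:

        line = fp.readline()

        if not line:
            break

        if new_record:
            index[line.strip()] = offset
            new_record = False

        if line.strip() == '$$$$':
            # the next record starts right after the terminator
            new_record = True
            offset = fp.tell()

    return index


def read_record(fp, offset):
    """Read the lines of one record, starting from `offset`."""

    fp.seek(offset)
    lines = []

    for line in iter(fp.readline, ''):
        if line.strip() == '$$$$':
            break
        lines.append(line.rstrip('\n'))

    return lines


fp = io.StringIO(TOY_SDF)
index = build_index(fp)
record = read_record(fp, index['LMGP01010888'])
print(record)
```

An index like this stays small even for a large file, at the cost of a disk read for each lookup; loading everything into memory trades memory for faster retrieval.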
NCBI E-Utils§
The ESummary endpoint of the NCBI
E-Utils API provides metadata about records in NCBI databases. A client to this API endpoint is
available in the pypath.inputs.eutils
module. The parameter ids
can be one integer, or a list of
integers or strings:
[3]:
from pypath.inputs import eutils
eutils.esummary(ids = 6063, db = 'geoprofiles')
executed in 0ms, finished 22:43:56 2023-11-14
[3]:
{'uids': ['6063'],
'6063': {'uid': '6063',
'gds': '5',
'gpl': '13',
'erank': '8eSiQ',
'evalue': 'joAzE',
'title': 'Diurnal and circadian-regulated genes (I)',
'taxon': 'Arabidopsis thaliana',
'gdstype': 'Expression profiling by array',
'valtype': 'log ratio',
'idref': '6063',
'genename': '',
'genedesc': '',
'ugname': 'AT4G11560',
'ugdesc': 'Bromo-adjacent homology (BAH) domain-containing protein',
'nucdesc': '9366 Lambda-PRL2 Arabidopsis thaliana cDNA clone 135J10T7, mRNA sequence',
'entrez_gene_id': '',
'gbacc': 'T46103',
'ptacc': '',
'cloneid': '135J10T7',
'orf': '',
'spotid': '',
'vmin': '-0.395000',
'vmax': '0.201000',
'groups': 'A1B3C1',
'abscall': '',
'aflag': 20,
'aoutl': '',
'rstd': 31,
'rmean': 50}}
A simple wrapper for PubMed is available in the pypath.inputs.pubmed
module:
[2]:
from pypath.inputs import pubmed
pubmed.get_pubmeds('33209674')
executed in 0ms, finished 22:42:02 2023-11-14
[2]:
{'uids': ['33209674'],
'33209674': {'uid': '33209674',
'pubdate': '2020 Oct',
'epubdate': '',
'source': 'Transl Androl Urol',
'authors': [{'name': 'Kim H', 'authtype': 'Author', 'clusterid': ''},
{'name': 'Lee SH', 'authtype': 'Author', 'clusterid': ''},
{'name': 'Kim DH', 'authtype': 'Author', 'clusterid': ''},
{'name': 'Lee JY', 'authtype': 'Author', 'clusterid': ''},
{'name': 'Hong SH', 'authtype': 'Author', 'clusterid': ''},
{'name': 'Ha US', 'authtype': 'Author', 'clusterid': ''},
{'name': 'Kim IH', 'authtype': 'Author', 'clusterid': ''}],
'lastauthor': 'Kim IH',
'title': 'Gemcitabine maintenance versus observation after first-line chemotherapy in patients with metastatic urothelial carcinoma: a retrospective study.',
'sorttitle': 'gemcitabine maintenance versus observation after first line chemotherapy in patients with metastatic urothelial carcinoma a retrospective study',
'volume': '9',
'issue': '5',
'pages': '2113-2121',
'lang': ['eng']
One last example, querying the Entrez Gene database:
[4]:
from pypath.inputs import eutils
eutils.esummary(ids = 1956, db = 'gene')
executed in 0ms, finished 22:48:09 2023-11-14
[4]:
{'uids': ['1956'],
'1956': {'uid': '1956',
'name': 'EGFR',
'description': 'epidermal growth factor receptor',
'status': '',
'currentid': '',
'chromosome': '7',
'geneticsource': 'genomic',
'maplocation': '7p11.2',
'otheraliases': 'ERBB, ERBB1, ERRP, HER1, NISBD2, PIG61, mENA',
'otherdesignations': 'epidermal growth factor receptor|EGFR vIII|avian erythroblastic leukemia viral (v-erb-b) oncogene homolog|cell growth inhibiting protein 40|cell proliferation-inducing protein 61|epidermal growth factor receptor tyrosine kinase domain|erb-b2 receptor tyrosine kinase 1|proto-oncogene c-ErbB-1|receptor tyrosine-protein kinase erbB-1',
'nomenclaturesymbol': 'EGFR',
'nomenclaturename': 'epidermal growth factor receptor',
'nomenclaturestatus': 'Official',
'mim': ['131550'],
'genomicinfo': [{'chrloc': '7',
'chraccver': 'NC_000007.14',
'chrstart': 55019016,
'chrstop': 55211627,
'exoncount': 32}],
'geneweight': 580393,
'summary': 'The protein encoded b
Download management§
Cache management and customization§
By default, pypath.omnipath.app saves the databases to pickle dumps under the ~/.pypath/pickles/ directory, and after the first build it loads them from there. The very first build of each database might take quite a long time (more than 90 minutes in case of the OmniPath network or annotations) because of the large number of downloads. Subsequent builds will be much faster because pypath stores all the downloaded data in a local cache and downloads again only upon request from the user. Loading the databases from pickle dumps takes only seconds. However, if you want to build with different settings, make sure to set a different pickle file name; otherwise the database built with the old settings will be loaded.
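The build-once, reload-later pattern described above can be sketched with the standard library alone. This is only an illustration, not pypath's own implementation: load_or_build, slow_build and the file name are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_build(pickle_path, build_fn):
    """Load an object from a pickle dump, or build and save it first."""
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as fp:
            return pickle.load(fp)
    obj = build_fn()  # the slow step: downloads and processing
    with open(pickle_path, 'wb') as fp:
        pickle.dump(obj, fp)
    return obj


calls = []


def slow_build():
    calls.append(1)  # count how many times we really build
    return {'nodes': 3, 'interactions': 2}


with tempfile.TemporaryDirectory() as tmpdir:
    # hypothetical file name; use a different one for different settings
    path = os.path.join(tmpdir, 'network_default.pickle')
    first = load_or_build(path, slow_build)   # builds and saves
    second = load_or_build(path, slow_build)  # loads from the dump
```

Because the dump is looked up by file name only, a build with changed settings would silently reuse the old dump unless the file name changes too.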
Download failures§
Issuing hundreds of requests to dozens of servers sooner or later comes with failures. These might happen just by accident, especially on slow networks, so it is always recommended to try again.
Corrupted cache content§
Sometimes a truncated or corrupted file remains in the cache. In this case you can use the context managers in pypath.share.curl to control the cache. E.g. if the download of the DEPOD database failed and keeps failing due to a corrupted file, use the cache_delete_on context:
[7]:
from pypath.share import curl
from pypath.inputs import depod
with curl.cache_delete_on():
    depod = depod.depod_enzyme_substrate()
executed in 5.61s, finished 13:59:07 2022-12-02
The cache_off context forces download even if a cache item is available; the cache_print_on context prints paths to the accessed cache files to the terminal, though the paths can always be found in the log; the dry_run_on context sets up the pypath.share.curl.Curl object and stops just before the actual download.
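Contexts like these typically switch a setting for the duration of a with block and restore it afterwards, even if an error occurs. A minimal sketch of that mechanism follows; the SETTINGS dict, its keys and the override function are hypothetical and are not part of pypath.

```python
import contextlib

# hypothetical settings store; the keys are for illustration only
SETTINGS = {'cache': True, 'timeout': 300}


@contextlib.contextmanager
def override(**kwargs):
    """Temporarily override settings, restoring old values on exit."""
    previous = {key: SETTINGS[key] for key in kwargs}
    SETTINGS.update(kwargs)
    try:
        yield SETTINGS
    finally:
        SETTINGS.update(previous)


seen = []
with override(cache=False):
    seen.append(SETTINGS['cache'])  # inside the block: cache is off
seen.append(SETTINGS['cache'])      # after the block: restored
```

The try/finally guarantees that a failed download inside the block does not leave the setting changed for the rest of the session.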
Network communication issues: look into the curl debug log§
Downloads might also fail due to TLS or HTTP errors, wrong headers or parameters, and many other reasons. In this case a full debug output from curl might be very useful. The debug_on context writes the curl debug output into the log file:
[8]:
from pypath.share import curl
from pypath.inputs import depod
with curl.debug_on():
    depod = depod.depod_enzyme_substrate()
executed in 0ms, finished 13:59:12 2022-12-02
Timeouts§
From the log we can find out if the download fails due to a timeout. In this case, the timeout parameters can be altered by a settings context. Apart from the timeout for the completion of the download (curl_timeout in the example below), there is curl_connect_timeout (the timeout for establishing a connection to the server) and curl_extended_timeout, which is used for servers that are known to be exceptionally slow. Another parameter, curl_retries, is the number of attempts before giving up. By default it’s 3, and that should be more than enough.
[9]:
from pypath.share import settings
from pypath.inputs import depod
with settings.context(curl_timeout = 360):
    depod = depod.depod_enzyme_substrate()
executed in 0ms, finished 13:59:17 2022-12-02
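To make the meaning of "number of attempts before giving up" concrete, here is a generic retry loop. It is a sketch under stated assumptions, not the code pypath uses: fetch_with_retries and flaky_download are hypothetical names, and a plain TimeoutError stands in for whatever error the download layer raises.

```python
def fetch_with_retries(fetch, retries=3):
    """Call `fetch` up to `retries` times, re-raising the last error."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(), attempt
        except TimeoutError as error:
            last_error = error  # remember it and try again
    raise last_error


attempts = []


def flaky_download():
    # fails twice with a timeout, then succeeds
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError('server too slow')
    return 'data'


result, used = fetch_with_retries(flaky_download, retries=3)
```

With three attempts, two transient timeouts are tolerated; a server that times out on every attempt still raises the error to the caller.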
Access and inspect the Curl object§
Often the Curl object is created in a function from the pypath.inputs module, deep in a call stack, hence accessing it for investigation is difficult. Using the preserve_on context, the last Curl instance is kept under the pypath.share.curl.LASTCURL attribute:
[10]:
from pypath.share import curl
from pypath.inputs import depod
with curl.preserve_on():
    depod = depod.depod_enzyme_substrate()
depod_curl = curl.LASTCURL
depod_curl
executed in 0ms, finished 13:59:24 2022-12-02
[10]:
<pypath.share.curl.Curl at 0x6947386dc8b0>
[11]:
depod_curl.url, depod_curl.req_headers, depod_curl.fileobj, depod_curl.status
executed in 0ms, finished 13:59:28 2022-12-02
[11]:
('http://depod.bioss.uni-freiburg.de/download/DEPOD_201405_human_phosphatase-substrate.mitab',
[],
<_io.TextIOWrapper name='/home/denes/.pypath/cache/6a711369ecf9dcff8c5ed88996685b54-DEPOD_201405_human_phosphatase-substrate.mitab' mode='r' encoding='iso-8859-1'>,
0)
Is it failing only for you?§
Okay, this is the one you should check first: we run almost all downloads in pypath daily, and you can always check in the report whether a particular function ran successfully last night on our server. If it also fails in our daily build, it can still be a transient error that disappears within a few days, or it can be a permanent error. In the latter case, we first try to fix the issue in pypath (maybe the behaviour or the address of the third party server has changed). If we have no way to fix it, we start hosting the data on our own server and make pypath download it from there.
Read the log§
We mentioned the pypath log several times above. Here is how to access it; see more details in the section about logging:
[12]:
import pypath
pypath.log()
executed in 0ms, finished 13:59:34 2022-12-02
[2022-12-02 14:57:09] Welcome!
[2022-12-02 14:57:09] Logger started, logging into `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`.
[2022-12-02 14:57:09] Session `s3e92` started.
[2022-12-02 14:57:09] [pypath]
- session ID: `s3e92`
- working directory: `/home/denes/pypath/notebooks`
- logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-s3e92.log`
- pypath version: 0.14.30
[2022-12-02 14:57:09] [curl] Creating Curl object to retrieve data from `https://www.ensembl.org/info/about/species.html`
[2022-12-02 14:57:09] [curl] Cache file path: `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`
[2022-12-02 14:57:09] [curl] Cache file found, no need for download.
[2022-12-02 14:57:09] [curl] Opening plain text file `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html`.
[2022-12-02 14:57:09] [curl] Contents of `/home/denes/.pypath/cache/535b06d53a59e75bb693369bc5fdc556-species.html` has been read and the file has been closed.
[2022-1
TLS (SSL, HTTPS) errors§
Failed to verify certificate; invalid, expired, self-signed or missing certificates: these might be the most common reasons why people open issues for our software. TLS is a method for encrypted, typically HTTP, communication. The server has a certificate and uses it to sign and encrypt the data before sending it to the client. The client trusts the server certificate because it is signed by another certificate. That one is signed by yet another, and so on, until we reach a so-called root certificate that is known and trusted by the client. The number of root certificates used globally is so small that every single computer stores them locally and updates them from time to time from trusted sources, such as the provider of the operating system, web browser or programming language.
Having an up-to-date certificate store and correctly configured TLS clients on your computer is your (or your system admin’s) duty; here we can only give a generic procedure to address these issues. In 97% of the cases the issue is in your computer, but sometimes the server might be responsible. If you experience a TLS issue:
- Check the status of the server: initiate a scan at a free TLS checking service, such as SSL Labs, and look for any issue with the certificate chain, such as missing or expired certificates, old or too new ciphers not supported by your client, etc.
- Identify the server that your client failed to establish a TLS connection to (in case of pypath, look into the log)
- Identify your software that contains the TLS client: in case of pypath, it uses pycurl, a Python module built on libcurl
- Identify the provider of the client software: it can be PyPI, Anaconda, your operating system, etc.
- Find out which certificate store that software uses: most of them use the store from your operating system, but for example Java or Mozilla Firefox come with their own certificates
- Check if the certificate store is up-to-date, update if necessary
- Alternatively, identify the missing root certificate and add it manually to the store; you can also add a non-root certificate if the server has a serious issue and the chain can not be followed until a valid root certificate
Please open TLS related issues for our software only if you
- experience a server side issue with omnipathdb.org
- have a strong reason to think that the cause is in the code written by us, or that it can be easily fixed within our code
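As a starting point for the "which certificate store" step, the snippet below shows where Python's own ssl module looks for CA certificates. Note the assumption: this describes the Python standard library client only. pycurl and libcurl may be built against a different store, so treat the result as a hint rather than a diagnosis of pypath downloads.

```python
import ssl

# Where Python's own `ssl` module looks for CA certificates.
# pycurl/libcurl may be configured to use a different store.
paths = ssl.get_default_verify_paths()
print('CA file:', paths.cafile or paths.openssl_cafile)
print('CA dir: ', paths.capath or paths.openssl_capath)

# Number of CA certificates loaded into a default client context;
# this can be 0 on systems that load certificates lazily from a directory.
context = ssl.create_default_context()
stats = context.cert_store_stats()
print('Loaded CA certs:', stats['x509_ca'])
```

If the reported file or directory is missing or very old, updating the certificate package of your operating system or Python distribution is the usual next step.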
Resources§
[2]:
from pypath import resources
rc = resources.get_controller()
rc
executed in 0ms, finished 14:27:45 2022-12-03
[2]:
<pypath.resources.controller.ResourceController at 0x6cc25e25dcf0>
Licenses§
The license of SIGNOR is CC BY-SA, which allows commercial (for-profit) use:
[3]:
rc.license('SIGNOR'), rc.license('SIGNOR').commercial
executed in 0ms, finished 14:27:47 2022-12-03
[3]:
(<License CC BY-SA 4.0>, True)
Example: build a network for commercial use§
For our users, the most important aspect of licenses is whether they allow for-profit use in companies. In the near future we intend to provide a more convenient interface for license options; until then, see the example below.
[4]:
from pypath.core import network
from pypath import resources
co = resources.get_controller()
pw_academic = co.collect_network('pathway')
pw_commercial = co.collect_network('pathway', license_purpose = 'commercial')
len(pw_academic), len(pw_commercial), set(pw_academic.values()) - set(pw_commercial.values())
executed in 0ms, finished 18:45:22 2023-03-10
[4]:
(24,
19,
{<NetworkResource: Baccin2019 (post_translational, activity_flow)>,
<NetworkResource: Cellinker (post_translational, activity_flow)>,
<NetworkResource: HPMR (post_translational, activity_flow)>,
<NetworkResource: PDZBase (post_translational, activity_flow)>,
<NetworkResource: TRIP (post_translational, activity_flow)>})
Above we see that five resources have been disabled by applying the for-profit licensing restriction. The licenses of those five resources:
[5]:
[r.license for r in set(pw_academic.values()) - set(pw_commercial.values())]
executed in 0ms, finished 18:48:02 2023-03-10
[5]:
[<License CC BY-NC-SA 3.0>,
<License No license>,
<License CC BY-NC 4.0>,
<License CC BY-NC 4.0>,
<License CC BY-NC 4.0>]
The licenses of the resources that allow for-profit use:
[6]:
[r.license for r in pw_commercial.values()]
executed in 0ms, finished 18:50:35 2023-03-10
[6]:
[<License CC BY 4.0>,
<License CC BY-SA 3.0>,
<License CC BY-SA 3.0>,
<License CC BY 4.0>,
<License CC BY-SA 3.0>,
<License CC BY-SA 3.0>,
<License CC BY 4.0>,
<License NAR Open Access>,
<License CC BY-SA 4.0>,
<License CC BY 4.0>,
<License GPLv3>,
<License GPLv3>,
<License GPLv3>,
<License MIT>,
<License GPLv3>,
<License MIT>,
<License MIT>,
<License CC BY 4.0>,
<License GPLv3>]
Taking a closer look at a non-profit license:
[10]:
license = pw_academic['trip'].license
license.purpose, license.purpose.enables('for-profit')
executed in 0ms, finished 18:54:45 2023-03-10
[10]:
(<License purpose: academic>, False)
The collected resources can be used directly to build databases, in this case a network database:
[11]:
net_academic = network.Network(pw_academic)
net_commercial = network.Network(pw_commercial)
net_academic, net_commercial
executed in 1m 2.79s, finished 18:57:02 2023-03-10
[11]:
(<Network: 6833 nodes, 25607 interactions>,
<Network: 6429 nodes, 23288 interactions>)
As we see, the for-profit usable network is smaller by about 400 nodes and 2,300 edges, and it might miss even more of the fine grained details, but it is likely suitable for analysis. We are not legal experts, but here are some thoughts about licenses. Even if you work for a company, you might download and explore data under any license; the restrictions apply once you start to actually use the resource. And even if some resources restrict commercial use, you can always contact the copyright owners and ask them for permission, or ask your company to pay them a licensing fee, so you can legally use their product.
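The filtering by license purpose can be pictured with a toy model. Everything here is a hypothetical simplification (ToyLicense, collect, and the placeholder license name for TRIP); only two facts are taken from the outputs above: SIGNOR allows commercial use and TRIP is academic-only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLicense:
    name: str
    commercial: bool  # True if for-profit use is allowed


# SIGNOR allowing commercial use and TRIP being academic-only follow the
# outputs above; the license name given for TRIP is a placeholder
resources = {
    'signor': ToyLicense('CC BY-SA 4.0', commercial=True),
    'trip': ToyLicense('non-commercial (placeholder)', commercial=False),
}


def collect(resources, purpose='academic'):
    """Keep every resource for academic use, only permissive ones otherwise."""
    return {
        key: lic
        for key, lic in resources.items()
        if purpose == 'academic' or lic.commercial
    }


academic = collect(resources)
commercial = collect(resources, purpose='commercial')
disabled = set(academic) - set(commercial)
```

The set difference at the end mirrors the comparison of pw_academic and pw_commercial above: it lists what the stricter purpose removes.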
Resource information§
[4]:
rc['MatrixDB']
executed in 0ms, finished 14:27:49 2022-12-03
[4]:
{'yearUsedRelease': 2015,
'releases': [2009, 2011, 2015],
'urls': {'articles': ['http://bioinformatics.oxfordjournals.org/content/25/5/690.long',
'http://nar.oxfordjournals.org/content/43/D1/D321.long',
'http://nar.oxfordjournals.org/content/39/suppl_1/D235.long'],
'webpages': ['http://matrixdb.univ-lyon1.fr/'],
'omictools': ['http://omictools.com/matrixdb-tool']},
'pubmeds': [19147664, 20852260, 25378329],
'taxons': ['mammalia'],
'annot': ['experiment'],
'recommend': ['small, literature curated interaction resource; many interactions for',
'receptors and extracellular proteins'],
'descriptions': ['Protein data were imported from the UniProtKB/Swiss-Prot database (Bairoch et',
'al., 2005) and identified by UniProtKB/SwissProt accession numbers. In order to',
'list all the partners of a protein, interactions are associated by default to the',
'accession number of the human protein. The actual source species used in experiments is',
'indicated in the page repor
Resource definitions for a certain database or dataset§
Note: This does not work yet for all databases and datasets, but likely in the near future this will be the preferred method to access resource definitions.
[197]:
rc.collect_enzyme_substrate()
executed in 0ms, finished 20:08:29 2022-12-02
[197]:
[<EnzymeSubstrateResource: phosphoELM>,
<EnzymeSubstrateResource: dbPTM>,
<EnzymeSubstrateResource: SIGNOR>,
<EnzymeSubstrateResource: HPRD>,
<EnzymeSubstrateResource: Li2012>,
<EnzymeSubstrateResource: DEPOD>,
<EnzymeSubstrateResource: PhosphoSite>,
<EnzymeSubstrateResource: PhosphoNetworks>,
<EnzymeSubstrateResource: MIMP>,
<EnzymeSubstrateResource: ProtMapper>,
<EnzymeSubstrateResource: KEA>]
The resource definitions carry all information necessary to load the resource, for example:
[202]:
phosphoelm = rc.collect_enzyme_substrate()[0]
phosphoelm.input_method, phosphoelm.id_type_enzyme
executed in 0ms, finished 20:09:51 2022-12-02
[202]:
('phosphoelm.phosphoelm_enzyme_substrate', 'uniprot')
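An input_method string of the form 'module.function' can be turned into a callable by importing the module by name. The sketch below shows the general technique on the standard library; the resolve helper is hypothetical, and the assumption that pypath resolves such names relative to pypath.inputs is ours, based on the imports used elsewhere in this book.

```python
import importlib


def resolve(dotted_name, package=None):
    """Turn a 'module.function' string into the function object."""
    module_name, func_name = dotted_name.rsplit('.', 1)
    if package:
        module_name = f'{package}.{module_name}'
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# demonstrated on the standard library, so it runs without pypath
dumps = resolve('json.dumps')
encoded = dumps({'id_type_enzyme': 'uniprot'})

# with pypath installed, the analogous call would presumably be
# resolve('phosphoelm.phosphoelm_enzyme_substrate', package='pypath.inputs')
```

Storing the function as a string keeps the resource definitions lightweight: the input module is imported only when the resource is actually loaded.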
Building networks§
For this you will need the Network class from the pypath.core.network module, which takes care of building and querying the network. You also need the pypath.resources.network module, where you find a number of predefined input settings organized in larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc). These input settings tell pypath how to download and process the data.
[13]:
from pypath.core import network
from pypath.resources import network as netres
executed in 0ms, finished 13:59:49 2022-12-02
For example, netres.pathway is a collection of databases which fit into the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:
[14]:
netres.pathway
executed in 0ms, finished 13:59:52 2022-12-02
[14]:
{'trip': <NetworkResource: TRIP (post_translational, activity_flow)>,
'spike': <NetworkResource: SPIKE (post_translational, activity_flow)>,
'signalink3': <NetworkResource: SignaLink3 (post_translational, activity_flow)>,
'guide2pharma': <NetworkResource: Guide2Pharma (post_translational, activity_flow)>,
'ca1': <NetworkResource: CA1 (post_translational, activity_flow)>,
'arn': <NetworkResource: ARN (post_translational, activity_flow)>,
'nrf2': <NetworkResource: NRF2ome (post_translational, activity_flow)>,
'macrophage': <NetworkResource: Macrophage (post_translational, activity_flow)>,
'death': <NetworkResource: DeathDomain (post_translational, activity_flow)>,
'pdz': <NetworkResource: PDZBase (post_translational, activity_flow)>,
'signor': <NetworkResource: SIGNOR (post_translational, activity_flow)>,
'adhesome': <NetworkResource: Adhesome (post_translational, activity_flow)>,
'icellnet': <NetworkResource: ICELLNET (post_translational, activity_flow)>,
'celltalkdb': <Net
You can pass such a dictionary to the load method of the network.Network object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default ~/.pypath/cache in your user’s home directory. For this reason, loading a resource for the first time might take long, but the next time it will be faster, as the data will be fetched from the cache. First create a pypath.core.network.Network object, then build the network:
[15]:
n = network.Network()
n.load(netres.pathway)
executed in 32.90s, finished 14:00:36 2022-12-02
[16]:
n
executed in 0ms, finished 14:02:23 2022-12-02
[16]:
<Network: 6833 nodes, 25607 interactions>
You can add more resource sets in a similar way:
[18]:
n.load(netres.enzyme_substrate)
executed in 30.04s, finished 14:04:29 2022-12-02
[19]:
n
executed in 0ms, finished 14:05:38 2022-12-02
[19]:
<Network: 7979 nodes, 35550 interactions>
To load one single resource, simply pass the NetworkResource directly:
[20]:
n.load(netres.interaction['matrixdb'])
executed in 0ms, finished 14:05:42 2022-12-02
[21]:
n
executed in 0ms, finished 14:05:44 2022-12-02
[21]:
<Network: 8002 nodes, 35748 interactions>
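The growing node and interaction counts above reflect merging: an interaction already in the network gains evidence, and only new pairs add records. A toy version of this idea is sketched below. The load function, the resource names and the pairs are illustrative assumptions, as is the use of one record per unordered pair.

```python
from collections import defaultdict


def load(network, resource_name, pairs):
    """Add the pairs of one resource; shared pairs accumulate evidence."""
    for a, b in pairs:
        key = tuple(sorted((a, b)))  # one record per unordered pair
        network[key].add(resource_name)
    return network


net = defaultdict(set)
# toy input: the pairs below are illustrative, not real resource contents
load(net, 'ResourceA', [('EGF', 'EGFR'), ('TRPC1', 'KCNMA1')])
load(net, 'ResourceB', [('EGFR', 'EGF')])
```

After both calls the toy network holds two records, and the EGF-EGFR record lists both resources as evidence.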
Which network datasets are pre-defined in pypath?§
You can find all the pre-defined datasets in the pypath.resources.network module. This module currently is a wrapper around an older module, pypath.resources.data_formats; the actual definitions are written in the latter. As we already mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath; since then, however, we have added a great variety of other kinds of resource definitions. Here we give an overview of these.
- pypath.resources.network.pathway: activity flow networks with literature references
- pypath.resources.network.activity_flow: synonym for pathway
- pypath.resources.network.pathway_noref: activity flow networks without literature references
- pypath.resources.network.pathway_all: all activity flow data
- pypath.resources.network.ptm: enzyme-substrate interaction networks with literature references
- pypath.resources.network.enzyme_substrate: synonym for ptm
- pypath.resources.network.ptm_noref: enzyme-substrate networks without literature references
- pypath.resources.network.ptm_all: all enzyme-substrate data
- pypath.resources.network.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)
- pypath.resources.network.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)
- pypath.resources.network.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath
- pypath.resources.network.transcription: transcriptional regulation only from the DoRothEA data
- pypath.resources.network.mirna_target: miRNA-mRNA interactions from literature curated resources
- pypath.resources.network.tf_mirna: transcriptional regulation of miRNA from literature curated resources
- pypath.resources.network.lncrna_protein: lncRNA-protein interactions from literature curated datasets
- pypath.resources.network.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources
- pypath.resources.network.pathwaycommons: the PathwayCommons database
- pypath.resources.network.reaction: process description databases; not guaranteed to work at this moment
- pypath.resources.network.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment
- pypath.resources.network.small_molecule_protein: signaling interactions between small molecules and proteins
To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:
[22]:
netres.pathway.keys()
executed in 0ms, finished 14:05:57 2022-12-02
[22]:
dict_keys(['trip', 'spike', 'signalink3', 'guide2pharma', 'ca1', 'arn', 'nrf2', 'macrophage', 'death', 'pdz', 'signor', 'adhesome', 'icellnet', 'celltalkdb', 'cellchatdb', 'connectomedb', 'talklr', 'cellinker', 'scconnect', 'hpmr', 'cellphonedb', 'ramilowski2015', 'lrdb', 'baccin2019'])
[23]:
[resource.name for resource in netres.pathway.values()]
executed in 0ms, finished 14:06:00 2022-12-02
[23]:
['TRIP',
'SPIKE',
'SignaLink3',
'Guide2Pharma',
'CA1',
'ARN',
'NRF2ome',
'Macrophage',
'DeathDomain',
'PDZBase',
'SIGNOR',
'Adhesome',
'ICELLNET',
'CellTalkDB',
'CellChatDB',
'connectomeDB2020',
'talklr',
'Cellinker',
'scConnect',
'HPMR',
'CellPhoneDB',
'Ramilowski2015',
'LRdb',
'Baccin2019']
The resource definitions above carry all the information about how to load the resource: which function to call, how to process the identifiers, references, directions, and all other attributes from the input. E.g. which column from SPIKE corresponds to the source node? Which identifier type is used? It is the second column (index 1, counting from zero), and it has gene symbols in it:
[24]:
netres.pathway['spike'].networkinput.id_col_a, netres.pathway['spike'].networkinput.id_type_a
executed in 0ms, finished 14:06:07 2022-12-02
[24]:
(1, 'genesymbol')
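How a column index from such a definition is applied to a row of input can be shown with a much reduced, hypothetical definition. Only id_col_a = 1 and the 'genesymbol' type come from the SPIKE example above; the InputDef tuple, id_col_b = 2 and the sample row are made up for illustration.

```python
from collections import namedtuple

# hypothetical, much reduced input definition
InputDef = namedtuple('InputDef', ['id_col_a', 'id_col_b', 'id_type_a'])

# id_col_a = 1 and 'genesymbol' follow the SPIKE example above;
# id_col_b = 2 and the sample row are made up for illustration
spike_like = InputDef(id_col_a=1, id_col_b=2, id_type_a='genesymbol')

row = 'rec001\tEGF\tEGFR\tactivation'.split('\t')

source = row[spike_like.id_col_a]  # index 1 is the second column
target = row[spike_like.id_col_b]
```

Since the indices count from zero, id_col_a = 1 selects the second field of each row.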
The Network object§
Once you have built a network, you can use it for various purposes and write your own scripts for further processing or analysis. Below we create a Network object and populate it with the pathway dataset.
Warning: it is recommended to access databases through the database manager. Running the code below takes a long time and neither saves nor reloads the database: it builds a fresh copy each time.
[2]:
from pypath.core import network
from pypath.resources import network as netres
n = network.Network()
n.load(netres.pathway)
n
executed in 36.07s, finished 14:15:48 2022-12-02
[2]:
<Network: 6833 nodes, 25607 interactions>
Almost all data is stored in Network.interactions, a dict with pairs of nodes as keys and Interaction objects as values:
[3]:
n.interactions
executed in 0ms, finished 14:17:02 2022-12-02
[3]:
{(<Entity: TRPC1>,
<Entity: KCNMA1>): <Interaction: TRPC1 ============= KCNMA1 [Evidences: TRIP (2 references)]>,
(<Entity: TRPC1>,
<Entity: PPP3CA>): <Interaction: TRPC1 ============= PPP3CA [Evidences: TRIP (1 references)]>,
(<Entity: CALM2>,
<Entity: TRPC1>): <Interaction: CALM2 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
(<Entity: CALM3>,
<Entity: TRPC1>): <Interaction: CALM3 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
(<Entity: CALM1>,
<Entity: TRPC1>): <Interaction: CALM1 =======(-)==> TRPC1 [Evidences: TRIP (3 references)]>,
(<Entity: CASP1>,
<Entity: TRPC1>): <Interaction: CASP1 ============= TRPC1 [Evidences: TRIP (1 references)]>,
(<Entity: TRPC1>,
<Entity: CASP4>): <Interaction: TRPC1 ============= CASP4 [Evidences: TRIP (1 references)]>,
(<Entity: TRPC1>,
<Entity: CACNA1C>): <Interaction: TRPC1 ============= CACNA1C [Evidences: TRIP (1 references)]>,
(<Entity: TRPC1>,
<Entity: CAV1>): <Interaction: TRPC1 <=(+)======== CAV1 [Ev
The dict under Network.nodes is kept in sync with the interactions, and facilitates node access. Keys are primary identifiers (for proteins UniProt IDs by default), values are Entity objects:
[26]:
n.nodes
executed in 0ms, finished 14:06:21 2022-12-02
[26]:
{'P48995': <Entity: TRPC1>,
'Q12791': <Entity: KCNMA1>,
'Q08209': <Entity: PPP3CA>,
'P0DP24': <Entity: CALM2>,
'P0DP25': <Entity: CALM3>,
'P0DP23': <Entity: CALM1>,
'P29466': <Entity: CASP1>,
'P49662': <Entity: CASP4>,
'Q13936': <Entity: CACNA1C>,
'Q03135': <Entity: CAV1>,
'P56539': <Entity: CAV3>,
'Q14247': <Entity: CTTN>,
'P14416': <Entity: DRD2>,
'P11532': <Entity: DMD>,
'P11362': <Entity: FGFR1>,
'Q02790': <Entity: FKBP4>,
'Q86YM7': <Entity: HOMER1>,
'Q9NSC5': <Entity: HOMER3>,
'Q99750': <Entity: MDFI>,
'Q14571': <Entity: ITPR2>,
'Q14573': <Entity: ITPR3>,
'P29966': <Entity: MARCKS>,
'Q13255': <Entity: GRM1>,
'P20591': <Entity: MX1>,
'P62166': <Entity: NCS1>,
'Q96D31': <Entity: ORAI1>,
'Q96SN7': <Entity: ORAI2>,
'Q9BRQ5': <Entity: ORAI3>,
'P11171': <Entity: EPB41>,
'P61586': <Entity: RHOA>,
'Q9Y225': <Entity: RNF24>,
'P21817': <Entity: RYR1>,
'P16615': <Entity: ATP2A2>,
'Q93084': <Entity: ATP2A3>,
'P60880': <Entity: SNAP25>,
'Q13586': <Entity: STI
An interaction between a pair of entities can be accessed:
[27]:
n.interaction('EGF', 'EGFR')
executed in 0ms, finished 14:06:27 2022-12-02
[27]:
<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>
Similarly, individual nodes can be looked up:
[28]:
n.entity('EGFR')
executed in 0ms, finished 14:06:29 2022-12-02
[28]:
<Entity: EGFR>
Labels (gene symbols for proteins by default), identifiers (such as UniProt IDs) and Entity objects can be used to refer to nodes. Each node carries some basic information:
[29]:
egfr = n.entity('EGFR')
egfr.identifier, egfr.label, egfr.entity_type, egfr.id_type, egfr.taxon
executed in 0ms, finished 14:06:32 2022-12-02
[29]:
('P00533', 'EGFR', 'protein', 'uniprot', 9606)
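The relationship between the interactions dict, the nodes dict and lookup by either label or identifier can be condensed into a toy class. This is a sketch of the data structure idea only: ToyNetwork and its methods are hypothetical and much simpler than the real Network; the two UniProt IDs are taken from the Network.nodes output above.

```python
class ToyNetwork:
    """Interactions keyed by node pairs, with a node index kept in sync."""

    def __init__(self):
        self.interactions = {}
        self.nodes = {}        # identifier -> label
        self.by_label = {}     # label -> identifier

    def add_interaction(self, a, b):
        """`a` and `b` are (identifier, label) tuples."""
        for identifier, label in (a, b):
            self.nodes[identifier] = label
            self.by_label[label] = identifier
        key = tuple(sorted((a[0], b[0])))
        self.interactions.setdefault(key, set())

    def entity(self, name):
        """Accept either an identifier or a label."""
        identifier = name if name in self.nodes else self.by_label[name]
        return identifier, self.nodes[identifier]


toy = ToyNetwork()
toy.add_interaction(('P48995', 'TRPC1'), ('Q12791', 'KCNMA1'))
```

Updating the node index inside add_interaction is what keeps the two dicts in sync: a node can never appear in an interaction without being registered.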
Interactions feature a number of methods to access various information, such as their types, direction, effect, resources, references, etc. The very same methods are also available for the whole network. Below we only show a few examples of these methods.
[30]:
ia = n.interaction('EGF', 'EGFR')
ia
executed in 0ms, finished 14:06:34 2022-12-02
[30]:
<Interaction: EGFR <=(+)======== EGF [Evidences: Baccin2019, CellTalkDB, Fantom5, Guide2Pharma, HPMR, HPRD, ICELLNET, LRdb, Ramilowski2015, SIGNOR, SPIKE, SignaLink3, cellsignal.com, connectomeDB2020 (17 references)]>
[31]:
ia.get_resource_names()
executed in 0ms, finished 14:06:47 2022-12-02
[31]:
{'Baccin2019',
'CellTalkDB',
'HPMR',
'ICELLNET',
'LRdb',
'SIGNOR',
'SPIKE',
'SignaLink3',
'connectomeDB2020'}
[32]:
ia.get_references()
executed in 0ms, finished 14:06:50 2022-12-02
[32]:
{<Reference: 10085134>,
<Reference: 10209155>,
<Reference: 10788520>,
<Reference: 12093292>,
<Reference: 12297050>,
<Reference: 12620237>,
<Reference: 12648462>,
<Reference: 15620700>,
<Reference: 16274239>,
<Reference: 17145710>,
<Reference: 19531499>,
<Reference: 20458382>,
<Reference: 21071413>,
<Reference: 23331499>,
<Reference: 3494473>,
<Reference: 6289330>,
<Reference: 8639530>}
This is a valid direction for this interaction:
[33]:
ia.get_direction(('EGF', 'EGFR'))
executed in 0ms, finished 14:06:53 2022-12-02
[33]:
True
The opposite direction is not supported by any of the resources:
[34]:
ia.get_direction(('EGFR', 'EGF'))
executed in 0ms, finished 14:06:55 2022-12-02
[34]:
False
However, some resources provide no direction information; these are classified as "undirected":
ia.get_direction('undirected')
We can check which resources are those exactly:
[35]:
ia.get_direction('undirected', sources = True)
executed in 0ms, finished 14:07:23 2022-12-02
[35]:
{'HPMR', 'SPIKE'}
Effect signs (stimulation, inhibition) are available in a similar way. The first of the Boolean values means stimulation (activation), the second one inhibition.
[36]:
ia.get_sign(('EGF', 'EGFR'))
executed in 0ms, finished 14:07:25 2022-12-02
[36]:
[True, False]
Which resources support the effect signs:
[37]:
ia.get_sign(('EGF', 'EGFR'), sources = True)
executed in 0ms, finished 14:07:28 2022-12-02
[37]:
[{'SIGNOR', 'SPIKE', 'SignaLink3'}, set()]
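One way to picture the bookkeeping behind these queries is a record that stores the supporting resources per direction, and per direction and sign. The class below is a hypothetical toy and does not reproduce the real Interaction class; the evidence added to it is a subset consistent with the outputs above.

```python
class ToyInteraction:
    """Sources per direction, and per (direction, sign)."""

    def __init__(self):
        self.direction = {}  # direction -> set of sources
        self.sign = {}       # direction -> [stimulation, inhibition] sets

    def add(self, source, direction='undirected', sign=None):
        self.direction.setdefault(direction, set()).add(source)
        if sign in ('positive', 'negative'):
            slots = self.sign.setdefault(direction, [set(), set()])
            slots[0 if sign == 'positive' else 1].add(source)

    def get_direction(self, direction, sources=False):
        found = self.direction.get(direction, set())
        return found if sources else bool(found)

    def get_sign(self, direction):
        return [bool(s) for s in self.sign.get(direction, [set(), set()])]


ia_toy = ToyInteraction()
# a subset of the evidence shown in the outputs above
ia_toy.add('SIGNOR', ('EGF', 'EGFR'), 'positive')
ia_toy.add('SignaLink3', ('EGF', 'EGFR'), 'positive')
ia_toy.add('HPMR')  # no direction information

forward = ia_toy.get_direction(('EGF', 'EGFR'))
reverse = ia_toy.get_direction(('EGFR', 'EGF'))
undirected = ia_toy.get_direction('undirected', sources=True)
signs = ia_toy.get_sign(('EGF', 'EGFR'))
```

Keeping the source sets, rather than plain Booleans, is what allows the same query to answer both "is this direction supported?" and "by which resources?".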
Many method names start with get_..., such as:
[38]:
ia.get_interaction_types()
executed in 0ms, finished 14:07:30 2022-12-02
[38]:
{'post_translational'}
Others are called ..._by_..., these combine two get_... methods:
[39]:
ia.references_by_resource()
executed in 0ms, finished 14:07:32 2022-12-02
[39]:
{'ICELLNET': {<Reference: 8639530>},
'SIGNOR': {<Reference: 12297050>, <Reference: 12648462>},
'SignaLink3': {<Reference: 10085134>,
<Reference: 10209155>,
<Reference: 19531499>,
<Reference: 21071413>,
<Reference: 23331499>},
'Baccin2019': {<Reference: 10788520>,
<Reference: 12093292>,
<Reference: 12297050>,
<Reference: 12620237>,
<Reference: 15620700>,
<Reference: 16274239>,
<Reference: 6289330>},
'LRdb': {<Reference: 10788520>,
<Reference: 12093292>,
<Reference: 12297050>,
<Reference: 12620237>,
<Reference: 15620700>,
<Reference: 16274239>,
<Reference: 6289330>},
'SPIKE': {<Reference: 12297050>,
<Reference: 17145710>,
<Reference: 20458382>,
<Reference: 3494473>},
'CellTalkDB': {<Reference: 12093292>},
'connectomeDB2020': {<Reference: 10788520>,
<Reference: 12093292>,
<Reference: 12297050>,
<Reference: 12620237>,
<Reference: 15620700>,
<Reference: 16274239>,
<Reference: 6289330>},
'HPMR': {<Reference: 6289330>}}
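Conceptually, a ..._by_... query groups one attribute of the evidences by another. A generic sketch of such grouping follows; group_by and the variable names are hypothetical, and the evidence pairs are a small subset of the output above.

```python
from collections import defaultdict

# (resource, reference) evidence pairs, a subset of the output above
evidences = [
    ('ICELLNET', '8639530'),
    ('SIGNOR', '12297050'),
    ('SIGNOR', '12648462'),
    ('HPMR', '6289330'),
]


def group_by(pairs, key_index, value_index):
    """Group one element of each pair by the other."""
    grouped = defaultdict(set)
    for pair in pairs:
        grouped[pair[key_index]].add(pair[value_index])
    return dict(grouped)


refs_by_resource = group_by(evidences, 0, 1)
resources_by_ref = group_by(evidences, 1, 0)
```

Swapping the two indices gives the inverse view, from references back to the resources that cite them.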
And all these methods accept the same filtering parameters. E.g. if you are interested only in certain resources, it’s possible to restrict the query to those. For example, the two resources below provide no positive sign interaction:
[40]:
ia.get_interactions_positive(resources = {'ICELLNET', 'HPMR'})
executed in 0ms, finished 14:07:39 2022-12-02
[40]:
()
While some other resources do:
[41]:
ia.get_interactions_positive(resources = {'SignaLink3'})
executed in 0ms, finished 14:07:42 2022-12-02
[41]:
((<Entity: EGF>, <Entity: EGFR>),)
Or see the references that do or do not provide effect sign:
[42]:
ia.get_references(effect = True), ia.get_references(effect = False)
executed in 0ms, finished 14:07:44 2022-12-02
[42]:
({<Reference: 10085134>,
<Reference: 10209155>,
<Reference: 12297050>,
<Reference: 12648462>,
<Reference: 19531499>,
<Reference: 20458382>,
<Reference: 21071413>,
<Reference: 23331499>},
{<Reference: 10085134>,
<Reference: 10209155>,
<Reference: 10788520>,
<Reference: 12093292>,
<Reference: 12297050>,
<Reference: 12620237>,
<Reference: 12648462>,
<Reference: 15620700>,
<Reference: 16274239>,
<Reference: 17145710>,
<Reference: 19531499>,
<Reference: 20458382>,
<Reference: 21071413>,
<Reference: 23331499>,
<Reference: 3494473>,
<Reference: 6289330>,
<Reference: 8639530>})
Network in pandas.DataFrame§
Contents of a pypath.core.network.Network object can be exported to a pandas.DataFrame:
[1]:
from pypath import omnipath
cu = omnipath.db.get_db('curated')
cu.make_df()
cu.df
executed in 23.41s, finished 15:24:19 2022-12-03
[1]:
 | id_a | id_b | type_a | type_b | directed | effect | type | dmodel | sources | references |
---|---|---|---|---|---|---|---|---|---|---|
0 | P48995 | Q12791 | protein | protein | False | 0 | post_translational | {activity_flow} | {TRIP} | NaN |
1 | P48995 | Q08209 | protein | protein | False | 0 | post_translational | {activity_flow} | {TRIP} | NaN |
2 | P0DP23 | P48995 | protein | protein | True | -1 | post_translational | {activity_flow} | {TRIP} | NaN |
3 | P0DP25 | P48995 | protein | protein | True | -1 | post_translational | {activity_flow} | {TRIP} | NaN |
4 | P0DP24 | P48995 | protein | protein | True | -1 | post_translational | {activity_flow} | {TRIP} | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
44033 | Q14289 | Q9ULZ3 | protein | protein | True | 0 | post_translational | {enzyme_substrate} | {iPTMnet} | NaN |
44034 | P54646 | Q9Y2I7 | protein | protein | True | 0 | post_translational | {enzyme_substrate} | {iPTMnet} | NaN |
44035 | Q9BXM7 | Q9Y2N7 | protein | protein | True | 0 | post_translational | {enzyme_substrate} | {iPTMnet} | NaN |
44036 | P49137 | Q9Y385 | protein | protein | True | 0 | post_translational | {enzyme_substrate} | {iPTMnet} | NaN |
44037 | Q9UHC7 | P04637 | protein | protein | True | 0 | post_translational | {enzyme_substrate} | {iPTMnet} | NaN |
44038 rows × 10 columns
In the pypath.omnipath.export module, independent and more flexible interfaces are available for building network data frames. These are also used for building the tables used by the web server.
[12]:
from pypath import omnipath
from pypath.omnipath import export
cu = omnipath.db.get_db('curated')
e = export.Export(cu)
e.make_df(unique_pairs = False)
e.df
executed in 22.65s, finished 19:20:12 2023-03-10
[12]:
 | source | target | source_genesymbol | target_genesymbol | is_directed | is_stimulation | is_inhibition | consensus_direction | consensus_stimulation | consensus_inhibition | sources | references |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P48995 | Q12791 | TRPC1 | KCNMA1 | 0 | 0 | 0 | 0 | 0 | 0 | TRIP | TRIP:19168436;TRIP:25139746 |
1 | P48995 | Q08209 | TRPC1 | PPP3CA | 0 | 0 | 0 | 0 | 0 | 0 | TRIP | TRIP:23228564 |
2 | P0DP23 | P48995 | CALM1 | TRPC1 | 1 | 0 | 1 | 1 | 0 | 1 | TRIP | TRIP:11290752;TRIP:11983166;TRIP:12601176 |
3 | P0DP25 | P48995 | CALM3 | TRPC1 | 1 | 0 | 1 | 1 | 0 | 1 | TRIP | TRIP:11290752;TRIP:11983166;TRIP:12601176 |
4 | P0DP24 | P48995 | CALM2 | TRPC1 | 1 | 0 | 1 | 1 | 0 | 1 | TRIP | TRIP:11290752;TRIP:11983166;TRIP:12601176 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
36729 | Q14289 | Q9ULZ3 | PTK2B | PYCARD | 1 | 0 | 0 | 0 | 0 | 0 | iPTMnet | iPTMnet:27796369 |
36730 | P54646 | Q9Y2I7 | PRKAA2 | PIKFYVE | 1 | 0 | 0 | 0 | 0 | 0 | iPTMnet | iPTMnet:24070423 |
36731 | Q9BXM7 | Q9Y2N7 | PINK1 | HIF3A | 1 | 0 | 0 | 0 | 0 | 0 | iPTMnet | iPTMnet:27551449 |
36732 | P49137 | Q9Y385 | MAPKAPK2 | UBE2J1 | 1 | 0 | 0 | 0 | 0 | 0 | iPTMnet | iPTMnet:24020373 |
36733 | Q9UHC7 | P04637 | MKRN1 | TP53 | 1 | 0 | 0 | 0 | 0 | 0 | iPTMnet | iPTMnet:19536131 |
36734 rows × 12 columns
The data frame built for the web service includes even more details. Using the extra_node_attrs
and extra_edge_attrs
arguments of the
Export
object, you can
fully customise these data frames.
[13]:
e.webservice_interactions_df()
e.df
executed in 21.99s, finished 19:22:51 2023-03-10
[13]:
source | target | source_genesymbol | target_genesymbol | is_directed | is_stimulation | is_inhibition | consensus_direction | consensus_stimulation | consensus_inhibition | ... | dorothea_tfbs | dorothea_coexp | dorothea_level | type | curation_effort | extra_attrs | ncbi_tax_id_source | entity_type_source | ncbi_tax_id_target | entity_type_target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P48995 | Q12791 | TRPC1 | KCNMA1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 2 | {"TRIP_method":["Co-immunoprecipitation","Co-i... | 9606 | protein | 9606 | protein | |
1 | P48995 | Q08209 | TRPC1 | PPP3CA | 0 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 1 | {"TRIP_method":["Co-immunoprecipitation"]} | 9606 | protein | 9606 | protein | |
2 | P0DP23 | P48995 | CALM1 | TRPC1 | 1 | 0 | 1 | 1 | 0 | 1 | ... | None | None | post_translational | 3 | {"TRIP_method":["Fluorescence probe labeling",... | 9606 | protein | 9606 | protein | |
3 | P0DP25 | P48995 | CALM3 | TRPC1 | 1 | 0 | 1 | 1 | 0 | 1 | ... | None | None | post_translational | 3 | {"TRIP_method":["Fluorescence probe labeling",... | 9606 | protein | 9606 | protein | |
4 | P0DP24 | P48995 | CALM2 | TRPC1 | 1 | 0 | 1 | 1 | 0 | 1 | ... | None | None | post_translational | 3 | {"TRIP_method":["Fluorescence probe labeling",... | 9606 | protein | 9606 | protein | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
36729 | Q14289 | Q9ULZ3 | PTK2B | PYCARD | 1 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 1 | {} | 9606 | protein | 9606 | protein | |
36730 | P54646 | Q9Y2I7 | PRKAA2 | PIKFYVE | 1 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 1 | {} | 9606 | protein | 9606 | protein | |
36731 | Q9BXM7 | Q9Y2N7 | PINK1 | HIF3A | 1 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 1 | {} | 9606 | protein | 9606 | protein | |
36732 | P49137 | Q9Y385 | MAPKAPK2 | UBE2J1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 1 | {} | 9606 | protein | 9606 | protein | |
36733 | Q9UHC7 | P04637 | MKRN1 | TP53 | 1 | 0 | 0 | 0 | 0 | 0 | ... | None | None | post_translational | 1 | {} | 9606 | protein | 9606 | protein |
36734 rows × 34 columns
Self interactions (loop edges) in the network§
Depending on the downstream application, loops might be beneficial or undesired. By default
loops are disabled, but are enabled for OmniPath and the GRN networks among the built-in network
databases. The allow_loops
parameter can be set at the module level or at the instance level.
If set at the module level, it will be valid for all subsequently created instances:
[14]:
from pypath.share import settings
settings.setup(network_allow_loops = True)
executed in 0ms, finished 19:32:52 2023-03-10
If set at the instance level, it will be valid for the instance:
[15]:
from pypath.core import network
n = network.Network(allow_loops = True)
executed in 0ms, finished 19:33:44 2023-03-10
If you want to keep loops only for certain resources, first load the resources where loops should be removed, then remove the loops, and finally load the resources where you wish to keep the loops:
[30]:
from pypath.core import network
from pypath import resources
co = resources.get_controller()
pw = co.collect_network('pathway')
gr = co.collect_network('dorothea', interaction_types = 'transcriptional')
n = network.Network(pw, allow_loops = False)
n.load(gr, allow_loops = True)
n.count_loops()
executed in 2m 24.45s, finished 19:56:41 2023-03-10
[30]:
149
[32]:
n.count_interactions_by_interaction_type()
executed in 16.50s, finished 19:59:10 2023-03-10
[32]:
{'post_translational': 33571, 'transcriptional': 281262}
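The load order matters because loop removal only affects what is already in the network at that point. Below is a minimal sketch of the same idea with plain tuples, independent of pypath; the function `drop_loops` and the toy edges are illustrative assumptions, not part of the pypath API.

```python
# Toy edge lists standing in for two resources (hypothetical data).
pathway_edges = [('A', 'B'), ('B', 'B'), ('C', 'A')]
grn_edges = [('TF1', 'TF1'), ('TF1', 'G1')]


def drop_loops(edges):
    """Return only the edges whose source and target differ."""
    return [(src, tgt) for src, tgt in edges if src != tgt]


# Step 1: load the resource where loops are unwanted and remove its loops.
network_edges = drop_loops(pathway_edges)
# Step 2: load the resource where loops should be kept, without filtering.
network_edges.extend(grn_edges)

# Only the loop from the second resource survives.
n_loops = sum(1 for src, tgt in network_edges if src == tgt)
print(n_loops)
```

Reversing the two steps would remove the loops of both resources, which is why the resource with unwanted loops has to come first.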
Molecular complexes in the network§
Currently pypath
supports protein complexes; other kinds of components, such as small molecules and
nucleic acids, will be supported soon. Complexes are represented by pypath.internals.intera.Complex
objects, and can be network nodes. These objects optionally carry information about the defining
resources, references, stoichiometry and custom attributes. Apart from the components and
resources, none of these is mandatory. For more information, see the Protein complexes
section in this notebook. Here we only show how complexes are included in networks. The
Network
object either
represents each complex as a node (default behaviour), or expands the complex by creating a node
for each of its components and applying all the interactions of the complex to all of its components.
This latter method has adverse effects on network topology, and can be enabled by setting
network_expand_complexes
to True
. Only a few
resources list interactions of protein complexes, for example, SIGNOR, CollecTRI, Guide to
Pharmacology, CellphoneDB, etc. Let’s load such a resource:
[1]:
from pypath.core import network
from pypath.resources import network as netres
n = network.Network(netres.collectri)
executed in 38.12s, finished 20:35:23 2023-03-27
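Before inspecting the loaded network, the difference between the two representations described above can be illustrated without pypath. The sketch below is only a toy model: `expand_complexes` and the data are hypothetical, and the real implementation may differ in details.

```python
# Hypothetical toy data: one complex regulating one target gene.
complex_members = {'FOSL1_JUND': ('FOSL1', 'JUND')}
interactions = [('FOSL1_JUND', 'CAT'), ('MYC', 'TERT')]


def expand_complexes(edges, members):
    """Replace each complex node by its components, copying the edge to each."""
    expanded = []
    for src, tgt in edges:
        for s in members.get(src, (src,)):
            for t in members.get(tgt, (tgt,)):
                expanded.append((s, t))
    return expanded


# Default representation: the complex stays a single node.
as_nodes = list(interactions)
# Expanded representation: one edge per component.
expanded = expand_complexes(interactions, complex_members)
print(expanded)
```

Every interaction of a complex is multiplied by the number of its components, so the partners of large complexes gain many extra edges. This is one way expansion distorts the topology.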
We can retrieve various information about the complexes in the network, e.g. count them:
[2]:
n.count_complexes()
executed in 1.45s, finished 20:37:11 2023-03-27
[2]:
33
Or list them:
[3]:
n.get_complexes()
executed in 1.50s, finished 20:37:34 2023-03-27
[3]:
{<Entity: FOS_JUN>,
<Entity: FOS_JUNB>,
<Entity: FOS_JUND>,
<Entity: JUN>,
<Entity: FOSL1_JUN>,
<Entity: FOSL2_JUN>,
<Entity: JUN_JUNB>,
<Entity: JUN_JUND>,
<Entity: FOSB_JUN>,
<Entity: FOSL1_JUNB>,
<Entity: FOSL1_JUND>,
<Entity: FOSL2_JUNB>,
<Entity: FOSL2_JUND>,
<Entity: JUNB>,
<Entity: JUNB_JUND>,
<Entity: FOSB_JUNB>,
<Entity: JUND>,
<Entity: FOSB_JUND>,
<Entity: NFKB1>,
<Entity: NFKB1_NFKB2>,
<Entity: NFKB1_RELB>,
<Entity: NFKB1_RELA>,
<Entity: NFKB1_REL>,
<Entity: NFKB2>,
<Entity: NFKB2_RELB>,
<Entity: NFKB2_RELA>,
<Entity: NFKB2_REL>,
<Entity: RELB>,
<Entity: RELA_RELB>,
<Entity: REL_RELB>,
<Entity: RELA>,
<Entity: REL_RELA>,
<Entity: REL>}
In the network, these are Entity
objects, and their identifier
attribute is the
Complex
object:
[4]:
cplex_entity = list(n.get_complexes())[0]
cplex_entity
executed in 1.40s, finished 20:39:53 2023-03-27
[4]:
<Entity: REL_RELA>
[6]:
cplex = cplex_entity.identifier
cplex
executed in 0ms, finished 20:40:32 2023-03-27
[6]:
Complex: COMPLEX:Q04206_Q04864
When creating a data frame, the complex objects are placed in the identifier cells, which otherwise contain UniProt IDs for single proteins. The labels are the gene symbols of the components, separated by underscores by default.
[8]:
from pypath.omnipath import export
from pypath.internals import intera
e = export.Export(n)
e.make_df(unique_pairs = False)
e.df[[isinstance(s, intera.Complex) for s in e.df.source]]
executed in 9.65s, finished 20:44:06 2023-03-27
[8]:
source | target | source_genesymbol | target_genesymbol | is_directed | is_stimulation | is_inhibition | consensus_direction | consensus_stimulation | consensus_inhibition | sources | references | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | (P17535, P15407) | P04040 | FOSL1_JUND | CAT | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:10022519;CollecTRI:10329043;CollecTR... |
2 | (P05412, P15408) | P04040 | FOSL2_JUN | CAT | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:10022519;CollecTRI:10329043;CollecTR... |
3 | (P05412, P15407) | P04040 | FOSL1_JUN | CAT | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:10022519;CollecTRI:10329043;CollecTR... |
4 | (P05412, P17275) | P04040 | JUN_JUNB | CAT | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:10022519;CollecTRI:10329043;CollecTR... |
5 | (P17275, P17535) | P04040 | JUNB_JUND | CAT | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:10022519;CollecTRI:10329043;CollecTR... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54980 | (P17535, P01100) | P01270 | FOS_JUND | PTH | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:9989817 |
54981 | (P17275, P15408) | P01270 | FOSL2_JUNB | PTH | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:9989817 |
54982 | (P05412, P53539) | P01270 | FOSB_JUN | PTH | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:9989817 |
54983 | (P17275, P15407) | P01270 | FOSL1_JUNB | PTH | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:9989817 |
54984 | (P17275) | P01270 | JUNB | PTH | 1 | 1 | 0 | 1 | 1 | 0 | CollecTRI;ExTRI_CollecTRI | CollecTRI:9989817 |
23235 rows × 12 columns
For some reason, pandas
shows the Complex
objects as tuples.
[10]:
e.df[[isinstance(s, intera.Complex) for s in e.df.source]].source.iloc[0]
executed in 0ms, finished 20:45:07 2023-03-27
[10]:
Complex: COMPLEX:P15407_P17535
[12]:
e.webservice_interactions_df()
executed in 41.08s, finished 20:48:51 2023-03-27
[13]:
e.df
executed in 0ms, finished 20:50:14 2023-03-27
[13]:
source | target | source_genesymbol | target_genesymbol | is_directed | is_stimulation | is_inhibition | consensus_direction | consensus_stimulation | consensus_inhibition | ... | dorothea_tfbs | dorothea_coexp | dorothea_level | type | curation_effort | extra_attrs | ncbi_tax_id_source | entity_type_source | ncbi_tax_id_target | entity_type_target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P01106 | O14746 | MYC | TERT | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 75 | {} | 9606 | protein | 9606 | protein | |
1 | (P17535, P15407) | P04040 | FOSL1_JUND | CAT | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 14 | {} | 9606 | complex | 9606 | protein | |
2 | (P05412, P15408) | P04040 | FOSL2_JUN | CAT | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 14 | {} | 9606 | complex | 9606 | protein | |
3 | (P05412, P15407) | P04040 | FOSL1_JUN | CAT | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 14 | {} | 9606 | complex | 9606 | protein | |
4 | (P05412, P17275) | P04040 | JUN_JUNB | CAT | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 14 | {} | 9606 | complex | 9606 | protein | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
67945 | Q01196 | Q13094 | RUNX1 | LCP2 | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 1 | {} | 9606 | protein | 9606 | protein | |
67946 | Q01196 | Q6MZQ0 | RUNX1 | PRR5L | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 1 | {} | 9606 | protein | 9606 | protein | |
67947 | Q15672 | P08151 | TWIST1 | GLI1 | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 1 | {} | 9606 | protein | 9606 | protein | |
67948 | P22415 | Q5SRE5 | USF1 | NUP188 | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 1 | {} | 9606 | protein | 9606 | protein | |
67949 | Q9UQR1 | Q5VYX0 | ZNF148 | RNLS | 1 | 1 | 0 | 1 | 1 | 0 | ... | None | None | transcriptional | 1 | {} | 9606 | protein | 9606 | protein |
67950 rows × 34 columns
When we export to CSV, the Complex
objects are converted to the string notation familiar from the OmniPath
web service. See for example COMPLEX:P15407_P17535
below, and its human readable label FOSL1_JUND
in the gene symbols
column:
[15]:
e.df[[ets == 'complex' for ets in e.df.entity_type_source]].to_csv(index = False)[:1000]
executed in 0ms, finished 20:55:26 2023-03-27
[15]:
'source,target,source_genesymbol,target_genesymbol,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,omnipath,kinaseextra,ligrecextra,pathwayextra,mirnatarget,dorothea,tf_target,lncrna_mrna,tf_mirna,small_molecule,dorothea_curated,dorothea_chipseq,dorothea_tfbs,dorothea_coexp,dorothea_level,type,curation_effort,extra_attrs,ncbi_tax_id_source,entity_type_source,ncbi_tax_id_target,entity_type_target\nCOMPLEX:P15407_P17535,P04040,FOSL1_JUND,CAT,1,1,0,1,1,0,CollecTRI;ExTRI_CollecTRI,CollecTRI:10022519;CollecTRI:10329043;CollecTRI:12036993;CollecTRI:12538496;CollecTRI:17935786;CollecTRI:7489329;CollecTRI:7651432;CollecTRI:7818486;CollecTRI:8867782;CollecTRI:9030359;CollecTRI:9136992;CollecTRI:9142914;CollecTRI:9168892;CollecTRI:9687385,False,False,False,False,False,False,False,False,False,False,,,,,,transcriptional,14,{},9606,complex,9606,protein\nCOMPLEX:P05412_P15408,P04040,FOSL2_JUN,CAT,1,1,0,1,1,0,CollecTRI;ExTRI_C
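The string notation is easy to produce and parse outside of pypath. Judging from the examples in this section, the component IDs appear in sorted order after the `COMPLEX:` prefix; the sketch below assumes this, and the helper names are hypothetical.

```python
PREFIX = 'COMPLEX:'


def complex_to_str(components):
    """Build the string form from component UniProt IDs (sorting is an assumption)."""
    return PREFIX + '_'.join(sorted(components))


def str_to_components(complex_str):
    """Split the string form back into a tuple of component IDs."""
    if not complex_str.startswith(PREFIX):
        raise ValueError('not a complex identifier: %r' % complex_str)
    return tuple(complex_str[len(PREFIX):].split('_'))


cplex_str = complex_to_str(('P17535', 'P15407'))
print(cplex_str)                     # COMPLEX:P15407_P17535
print(str_to_components(cplex_str))  # ('P15407', 'P17535')
```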
Translating identifiers§
The pypath.utils.mapping
module handles ID translation. Most of the time you can
simply call the map_name
function:
[1]:
from pypath.utils import mapping
mapping.map_name('P00533', 'uniprot', 'genesymbol')
executed in 1.38s, finished 12:31:45 2023-03-21
[1]:
{'EGFR'}
By default the map_name
function returns a set
because it accounts for ambiguous mappings. However, most often the ID translation is unambiguous, and
you want to retrieve only one ID. The map_name0
function returns a single string; in case of ambiguity, it returns an arbitrary
element of the resulting set:
[5]:
mapping.map_name0('GABARAPL3', 'genesymbol', 'uniprot')
executed in 0ms, finished 14:17:31 2022-12-02
[5]:
'Q9BY60'
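The difference between the two return conventions can be sketched with a toy table. The names `toy_table`, `lookup` and `lookup0` are hypothetical, and returning `None` for unknown names is a choice of this sketch, not a statement about pypath.

```python
# Hypothetical translation table: one name maps to two IDs (ambiguous).
toy_table = {
    'GENE1': {'ID_A', 'ID_B'},
    'GENE2': {'ID_C'},
}


def lookup(name):
    """Set-returning lookup: an empty set if the name is unknown."""
    return toy_table.get(name, set())


def lookup0(name):
    """Single-value lookup: an arbitrary element, or None if nothing maps."""
    return next(iter(lookup(name)), None)


print(lookup('GENE2'))    # a one-element set
print(lookup0('GENE2'))   # a plain string
# For an ambiguous name, which element comes back is not guaranteed:
print(lookup0('GENE1') in {'ID_A', 'ID_B'})
```

If the choice among ambiguous results matters for your analysis, use the set-returning variant and resolve the ambiguity explicitly.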
Molecules have a large variety of identifiers, but in pypath two identifier types are special:
-
The primary identifier defines the molecule category, e.g. if UniProt is the primary identifier for proteins, then a protein is anything that has a UniProt ID
-
The label is a human-readable identifier; for proteins it is the gene symbol
The primary ID and label types are configured for each molecule type (protein, miRNA, drug, etc)
in the module settings. The mapping
module provides shortcuts to translate between these identifiers:
label
and id_from_label
.
[6]:
mapping.label('O75385')
executed in 0ms, finished 14:17:33 2022-12-02
[6]:
'ULK1'
[7]:
mapping.id_from_label('ULK1')
executed in 0ms, finished 14:17:35 2022-12-02
[7]:
{'O75385'}
[8]:
mapping.id_from_label0('ULK1')
executed in 0ms, finished 14:17:37 2022-12-02
[8]:
'O75385'
Multiple IDs can be translated in one call; however, the result is a single set, so it is not possible to tell which output corresponds to which input.
[9]:
mapping.map_names(['ULK1', 'EGFR', 'SMAD2'], 'genesymbol', 'uniprot')
executed in 0ms, finished 14:17:40 2022-12-02
[9]:
{'O75385', 'P00533', 'Q15796'}
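When the correspondence between inputs and outputs matters, one option is to translate the names one by one and keep the results in a dict. The sketch below uses a small stand-in table with values shown in this section; `translate_one` is a hypothetical placeholder for a single-name lookup such as `mapping.map_name`.

```python
# Stand-in for a genesymbol-to-UniProt table, using values shown in this section.
genesymbol_to_uniprot = {
    'ULK1': {'O75385'},
    'EGFR': {'P00533'},
    'SMAD2': {'Q15796'},
}


def translate_one(name):
    """Translate a single name; returns a set, empty if the name is unknown."""
    return genesymbol_to_uniprot.get(name, set())


names = ['ULK1', 'EGFR', 'SMAD2', 'NOT_A_GENE']

# Union of all results: the input-output correspondence is lost.
merged = set().union(*(translate_one(n) for n in names))

# One lookup per input: the correspondence is kept and misses are visible.
by_input = {n: translate_one(n) for n in names}

print(sorted(merged))
print(by_input['EGFR'], by_input['NOT_A_GENE'])
```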
The default organism is defined in the module settings; it is human unless changed. Translating for
other organisms requires the ncbi_tax_id
argument. Most of the functions in pypath
also accept common or Latin
names, but map_name
accepts only numeric taxon IDs for efficiency. Let’s translate a mouse identifier:
[10]:
mapping.map_name('Smad2', 'genesymbol', 'uniprot', ncbi_tax_id = 10090)
executed in 0ms, finished 14:17:44 2022-12-02
[10]:
{'Q62432'}
If no direct translation table is available between two ID types, pypath
will try to translate via an
intermediate ID type.
[11]:
mapping.map_name('8408', 'entrez', 'genesymbol')
executed in 0ms, finished 14:17:46 2022-12-02
[11]:
{'ULK1'}
Behind the scenes the chain_map
function is called:
[12]:
m = mapping.get_mapper()
m.chain_map('8408', id_type = 'entrez', target_id_type = 'genesymbol', by_id_type = 'uniprot')
executed in 0ms, finished 14:17:47 2022-12-02
[12]:
{'ULK1'}
And the procedure corresponds to the following:
[13]:
mapping.map_names(
mapping.map_name('8408', 'entrez', 'uniprot'),
'uniprot',
'genesymbol',
)
executed in 0ms, finished 14:17:49 2022-12-02
[13]:
{'ULK1'}
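The same two-step logic, written out with plain dicts of sets. This is a sketch of the idea only; `chain_translate` is a hypothetical helper, and the toy tables contain just the IDs used in this section.

```python
# Toy tables holding only the entries needed for the example.
entrez_to_uniprot = {'8408': {'O75385'}}
uniprot_to_genesymbol = {'O75385': {'ULK1'}, 'P00533': {'EGFR'}}


def chain_translate(name, first, second):
    """Translate `name` with `first`, then every intermediate ID with `second`."""
    result = set()
    for intermediate in first.get(name, set()):
        result |= second.get(intermediate, set())
    return result


print(chain_translate('8408', entrez_to_uniprot, uniprot_to_genesymbol))  # {'ULK1'}
```

Any ambiguity in the first step propagates to the second, so chained translation can return more results than a direct table would.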
Pre-defined ID translation tables§
A number of mapping tables are pre-defined. These load automatically on demand, and are removed
from memory if not used for some time (5 minutes by default). New mapping tables are saved
directly into pickle files in the cache for a quick reload. Tables are either organism specific
(hence loaded for each organism one-by-one), or non-organism specific, such as drug IDs
(pypath
uses integer
0
in this case in place
of the numeric NCBI Taxonomy ID). The identifier translation data is retrieved from the following
sources:
-
UniProt legacy API (main UniProt API until autumn 2022):
internals.input_formats.UniprotMapping
-
UniProt uploadlists API (also outdated, replaced by the new UniProt API):
internals.input_formats.UniprotListMapping
-
Ensembl Biomart:
internals.input_formats.BiomartMapping
and internals.input_formats.ArrayMapping
(for microarray probes)
-
Protein Ontology Consortium:
internals.input_formats.ProMapping
-
UniChem:
internals.input_formats.UnichemMapping
-
Arbitrary files:
internals.input_formats.FileMapping
(this class is used to process data from miRBase, some files from the UniProt FTP site, and also user defined, custom cases)
-
RaMP:
internals.input_formats.RampMapping
-
HMDB:
internals.input_formats.HmdbMapping
Some of the classes above are instantiated in internals.maps
, but most of the
instances are created on the fly when loading a mapping table in utils.mapping.MapReader
. This latter
class is responsible for taking a table definition and loading a utils.mapping.MappingTable
instance.
The whole process is managed by utils.mapping.Mapper
, this is the object all the ID translation queries are
dispatched to. It has a method to list the defined ID translation tables:
[3]:
mapping.mapping_tables()
executed in 0ms, finished 12:32:06 2023-03-21
[3]:
[MappingTableDefinition(id_type_a='embl', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(embl)', resource_id_type_b='id'),
MappingTableDefinition(id_type_a='genesymbol', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(PREFERRED)', resource_id_type_b='id'),
MappingTableDefinition(id_type_a='genesymbol-syn', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='genes(ALTERNATIVE)', resource_id_type_b='id'),
MappingTableDefinition(id_type_a='entrez', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(geneid)', resource_id_type_b='id'),
MappingTableDefinition(id_type_a='hgnc', id_type_b='uniprot', resource='uniprot', input_class='UniprotMapping', resource_id_type_a='database(HGNC)', resource_id_type_b='id'),
MappingTableDefinition(id_type_a='refseqp', id_type_b='uniprot', resource='uniprot', input_cl
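The on-demand loading and time-based unloading described at the start of this section can be pictured as a small cache with expiry. The class below is a hypothetical illustration of that behaviour, not pypath's implementation; the 300 second lifetime mirrors the 5 minute default mentioned above.

```python
import time


class ExpiringTables:
    """Load tables on demand and drop those unused for `lifetime` seconds."""

    def __init__(self, loader, lifetime=300, clock=time.monotonic):
        self.loader = loader
        self.lifetime = lifetime
        self.clock = clock
        self.tables = {}      # key -> table
        self.last_used = {}   # key -> time of the last access

    def get(self, key):
        """Return the table for `key`, loading it first if necessary."""
        if key not in self.tables:
            self.tables[key] = self.loader(key)
        self.last_used[key] = self.clock()
        return self.tables[key]

    def cleanup(self):
        """Remove every table that has not been used within the lifetime."""
        now = self.clock()
        expired = [k for k, t in self.last_used.items() if now - t > self.lifetime]
        for key in expired:
            del self.tables[key]
            del self.last_used[key]


# Demonstration with a fake clock, so no real waiting is needed.
fake_now = [0.0]
loads = []
cache = ExpiringTables(
    loader=lambda key: loads.append(key) or {'P00533': {'EGFR'}},
    clock=lambda: fake_now[0],
)
key = ('uniprot', 'genesymbol', 9606)
cache.get(key)
fake_now[0] = 301.0
cache.cleanup()
print(len(cache.tables), len(loads))
```

A later `cache.get(key)` triggers a reload; in pypath the reload is fast thanks to the pickle files kept in the cache directory.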
Pypath uses synonyms to refer to ID types: these are intended to be short, clear and lowercase
for ease of use. Most of the synonyms are defined in internals.input_formats
, in the
AC_QUERY
, AC_MAPPING
, BIOMART_MAPPING
, PRO_MAPPING
and ARRAY_MAPPING
dictionaries. UniChem
ID types are used exactly as provided by UniChem. To list all available ID types (below
pypath is the synonym used here, original is the name in the original
resource):
[4]:
mapping.id_types()
executed in 0ms, finished 12:32:14 2023-03-21
[4]:
{IdType(pypath='CAS', original='CAS'),
IdType(pypath='LIPIDMAPS', original='LIPIDMAPS'),
IdType(pypath='MedChemExpress', original='MedChemExpress'),
IdType(pypath='actor', original='actor'),
IdType(pypath='affy', original='affy'),
IdType(pypath='affymetrix', original='affymetrix'),
IdType(pypath='agilent', original='agilent'),
IdType(pypath='alzforum', original='Alzforum_mut'),
IdType(pypath='araport', original='Araport'),
IdType(pypath='atlas', original='atlas'),
IdType(pypath='bigg', original='bigg'),
IdType(pypath='bindingdb', original='bindingdb'),
IdType(pypath='biocyc', original='biocyc'),
IdType(pypath='brenda', original='brenda'),
IdType(pypath='carotenoiddb', original='carotenoiddb'),
IdType(pypath='cas', original='CAS'),
IdType(pypath='cas', original='cas_registry_number'),
IdType(pypath='cas_id', original='CAS'),
IdType(pypath='cgnc', original='CGNC'),
IdType(pypath='chebi', original='chebi'),
IdType(pypath='chembl', original='chembl'),
IdType(pypath='ch
Direct access to ID translation tables§
The Mapper
(or the
mapping
module) is able
to return ID translation tables as dicts or data frames:
[5]:
tbl = mapping.translation_dict('uniprot', 'genesymbol')
tbl
executed in 0ms, finished 12:33:55 2023-03-21
[5]:
<MappingTable from=uniprot, to=genesymbol, taxon=9606 (20243 IDs)>
[7]:
'P00533' in tbl
executed in 0ms, finished 12:34:16 2023-03-21
[7]:
True
[8]:
tbl['P00533']
executed in 0ms, finished 12:34:25 2023-03-21
[8]:
{'EGFR'}
[9]:
'EGFR' in tbl
executed in 0ms, finished 12:34:33 2023-03-21
[9]:
False
[10]:
list(tbl.items())[:10]
executed in 0ms, finished 12:34:50 2023-03-21
[10]:
[('Q00604', {'NDP'}),
('Q9HB19', {'PLEKHA2'}),
('Q16718', {'NDUFA5'}),
('P55769', {'SNU13'}),
('Q92886', {'NEUROG1'}),
('Q6T4R5', {'NHS'}),
('P80188', {'LCN2'}),
('Q86XR2', {'FAM129C'}),
('Q5T2W1', {'PDZK1'}),
('Q9BSH3', {'NICN1'})]
The same table as data frame:
[12]:
mapping.translation_df('uniprot', 'genesymbol')
executed in 0ms, finished 12:35:18 2023-03-21
[12]:
uniprot | genesymbol | |
---|---|---|
0 | Q00604 | NDP |
1 | Q9HB19 | PLEKHA2 |
2 | Q16718 | NDUFA5 |
3 | P55769 | SNU13 |
4 | Q92886 | NEUROG1 |
... | ... | ... |
20375 | Q96L92 | SNX27 |
20376 | Q9UNH6 | SNX7 |
20377 | Q5VWJ9 | SNX30 |
20378 | Q9BZZ2 | SIGLEC1 |
20379 | Q96BD0 | SLCO4A1 |
20380 rows × 2 columns
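The table is directional: its keys are UniProt IDs, which is why `'EGFR' in tbl` is `False` above. In pypath the natural solution is probably to request the table in the other direction; for tables already exported as plain dicts of sets, a reverse lookup can be built by inverting them, as in this sketch (the `invert` helper is hypothetical).

```python
from collections import defaultdict

# A few entries as they appear in the table above.
uniprot_to_genesymbol = {
    'Q00604': {'NDP'},
    'Q9HB19': {'PLEKHA2'},
    'P00533': {'EGFR'},
}


def invert(table):
    """Swap keys and values of a dict of sets."""
    inverted = defaultdict(set)
    for key, values in table.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


genesymbol_to_uniprot = invert(uniprot_to_genesymbol)
print('EGFR' in uniprot_to_genesymbol, 'EGFR' in genesymbol_to_uniprot)  # False True
print(genesymbol_to_uniprot['EGFR'])  # {'P00533'}
```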
Orthology translation§
The utils.orthology
module (formerly utils.homology
) handles translation of data between organisms by orthologous gene
pairs. Its most important function is translate
. The source organism is human by default; the target must be provided.
Below we use mouse (NCBI Taxonomy 10090):
[2]:
from pypath.utils import orthology
orthology.translate('P00533', target = 10090)
executed in 22.33s, finished 18:03:50 2023-09-28
[2]:
{'Q01279'}
ID translation and orthology translation are integrated, hence not only UniProt IDs can be translated:
[3]:
orthology.translate('EGFR', target = 10090, id_type = 'genesymbol')
executed in 22.08s, finished 18:04:16 2023-09-28
[3]:
{'Egfr'}
This module uses data from the Orthologous Matrix (OMA), NCBI HomoloGene and Ensembl. The latter covers more organisms, and accepts some parameters (high confidence, one-to-one vs. one-to-many mapping). The default is to use only OMA, as it is the most comprehensive, up-to-date and easy-to-use resource. These parameters can be controlled by the settings module, or passed to the functions above and below, for example:
[8]:
orthology.translate('P00533', target = 10090, oma = False, homologene = False, ensembl = True, ensembl_hc = False, ensembl_types = 'one2one')
executed in 24.52s, finished 18:07:43 2023-09-28
[8]:
{'Q01279'}
Orthology translation tables as dictionaries§
The translation tables are available as dicts of sets, which are convenient for use outside of pypath:
[9]:
human_mouse_genesymbols = orthology.get_dict(target = 'mouse', id_type = 'genesymbol')
human_mouse_genesymbols['EGFR']
executed in 0ms, finished 18:08:26 2023-09-28
[9]:
{'Egfr'}
The relationship types and confidence levels can be included using the full_records
argument:
[11]:
human_mouse_genesymbols = orthology.get_dict(target = 'mouse', id_type = 'genesymbol', full_records = True)
human_mouse_genesymbols['EGFR']
executed in 0ms, finished 18:10:13 2023-09-28
[11]:
{OmaOrtholog(id='Egfr', rel_type='1:1', score=12704.5703125)}
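With full records, the table can be filtered outside of pypath, for example to keep only unambiguous one-to-one orthologs. In the sketch below, the namedtuple mirrors the fields of the record printed above, the data is a small excerpt, and `one_to_one` is a hypothetical helper.

```python
from collections import namedtuple

# Same field names as in the record printed above.
OmaOrtholog = namedtuple('OmaOrtholog', ['id', 'rel_type', 'score'])

human_mouse = {
    'EGFR': {OmaOrtholog(id='Egfr', rel_type='1:1', score=12704.5703125)},
    'H4C3': {
        OmaOrtholog(id='H4c1', rel_type='m:n', score=1262.050049),
        OmaOrtholog(id='H4c3', rel_type='m:n', score=1262.050049),
    },
}


def one_to_one(table):
    """Keep only genes with a single ortholog of relationship type '1:1'."""
    return {
        source: next(iter(records)).id
        for source, records in table.items()
        if len(records) == 1 and next(iter(records)).rel_type == '1:1'
    }


print(one_to_one(human_mouse))  # {'EGFR': 'Egfr'}
```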
Orthology translation data frames§
Similarly, pandas.DataFrame
s are available:
[13]:
human_mouse_genesymbols = orthology.get_df(target = 'mouse', id_type = 'genesymbol', full_records = True)
human_mouse_genesymbols
executed in 0ms, finished 18:11:16 2023-09-28
[13]:
source | target | rel_type | score | |
---|---|---|---|---|
0 | H4C3 | H4c1 | m:n | 1262.050049 |
1 | H4C3 | H4c3 | m:n | 1262.050049 |
2 | H4C3 | H4c12 | m:n | 1262.050049 |
3 | H4C3 | H4c11 | m:n | 1262.050049 |
4 | H4C3 | H4c9 | m:n | 1262.050049 |
... | ... | ... | ... | ... |
18446 | GDAP2 | Gdap2 | 1:1 | 5553.779785 |
18447 | ITGA8 | Itga8 | 1:1 | 10772.969727 |
18448 | SEMA3F | Sema3f | 1:1 | 9121.080078 |
18449 | EEPD1 | Eepd1 | 1:1 | 5874.350098 |
18450 | DRG2 | Drg2 | 1:1 | 4423.589844 |
18451 rows × 4 columns
Taxonomy§
Organisms matter everywhere in pypath: in the input, output and processing parts alike. For this
reason we created a utility module to deal with translation of organism identifiers. We prefer NCBI
Taxonomy IDs as the primary organism identifier. These are simple numbers: 9606 is human, 10090 is
mouse, etc. Many databases use common English names or Latin (scientific) names. Some databases
use custom codes, such as hsapiens in Ensembl (first letter of genus name + species name,
without space, all lowercase) or hsa in miRBase and KEGG (first letter of genus name, first
two letters of species name). The pypath.utils.taxonomy
module features some convenient functions for handling all
these names.
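The two custom code conventions described above are simple string operations on the Latin name. The sketch below only illustrates those rules; the function names are hypothetical, and real databases may have exceptions that such rules do not cover.

```python
def ensembl_style(latin_name):
    """First letter of the genus + species name, all lowercase."""
    genus, species = latin_name.split()[:2]
    return (genus[0] + species).lower()


def kegg_style(latin_name):
    """First letter of the genus + first two letters of the species, lowercase."""
    genus, species = latin_name.split()[:2]
    return (genus[0] + species[:2]).lower()


for name in ('Homo sapiens', 'Mus musculus'):
    print(name, ensembl_style(name), kegg_style(name))
# Homo sapiens hsapiens hsa
# Mus musculus mmusculus mmu
```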
Translating to NCBI Taxonomy, scientific names and common names§
The most often used is ensure_ncbi_tax_id
, which returns the NCBI Taxonomy ID for any comprehensible
input:
[21]:
from pypath.utils import taxonomy
taxonomy.ensure_ncbi_tax_id('human'), taxonomy.ensure_ncbi_tax_id('H sapiens'), taxonomy.ensure_ncbi_tax_id('hsapiens'), taxonomy.ensure_ncbi_tax_id(9606), taxonomy.ensure_ncbi_tax_id('Homo sapiens')
executed in 0ms, finished 14:18:22 2022-12-02
[21]:
(9606, 9606, 9606, 9606, 9606)
To access scientific names or common names:
[22]:
taxonomy.ensure_latin_name('cow')
executed in 0ms, finished 14:18:25 2022-12-02
[22]:
'Bos taurus'
[23]:
taxonomy.ensure_common_name('Erithacus rubecula')
executed in 0ms, finished 14:18:27 2022-12-02
[23]:
'European robin'
Organism from UniProt ID§
The uniprot_taxid
function returns the taxonomy ID for a SwissProt ID. Unfortunately it does not work for TrEMBL IDs,
as that would require keeping too much data in memory.
[24]:
taxonomy.ensure_latin_name(taxonomy.uniprot_taxid('P53104'))
executed in 1.19s, finished 14:18:30 2022-12-02
[24]:
'Saccharomyces cerevisiae'
UniProt§
UniProt is a huge, diverse resource that is essential for pypath as we use it as a
reference set for proteomes and it provides ID translation data. Its input module pypath.inputs.uniprot
is already more
complex than an average input module. It harbors a little database manager that loads and unloads
tables on demand, ensuring fast and convenient operation. Further services are available in the
pypath.utils.uniprot
module.
The UniProt input module§
All UniProt IDs for one organism§
The complete set of UniProt IDs for an organism is considered to be the proteome of the organism, and it is used in many procedures across pypath. All SwissProt IDs, all TrEMBL IDs or both together can be retrieved:
[119]:
from pypath.inputs import uniprot as iuniprot
(
len(iuniprot.all_uniprots(organism = 10090)),
len(iuniprot.all_swissprots(organism = 10090)),
len(iuniprot.all_trembls(organism = 10090)),
)
executed in 3m 33.99s, finished 16:07:43 2022-12-02
[119]:
(86440, 17131, 69300)
UniProt ID format validation§
UniProt defines a format for its accessions; any string can be checked against this template to tell whether it is possibly a valid ID:
[124]:
from pypath.inputs import uniprot as iuniprot
iuniprot.valid_uniprot('A0A8D0H0C2')
executed in 0ms, finished 16:17:41 2022-12-02
[124]:
True
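For reference, such a format check can be written with the regular expression that UniProt publishes for its accession numbers. Whether pypath uses exactly this pattern internally is an assumption; `looks_like_uniprot` is a hypothetical name.

```python
import re

# Accession pattern as published in the UniProt help pages.
UNIPROT_AC = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|'
    r'[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def looks_like_uniprot(name):
    """True if the whole string matches the accession format."""
    return bool(UNIPROT_AC.fullmatch(name))


print(looks_like_uniprot('A0A8D0H0C2'))  # True
print(looks_like_uniprot('P00533'))      # True
print(looks_like_uniprot('EGFR'))        # False
```

A match only means the string is well-formed; it says nothing about whether the accession exists in the database.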
UniProt ID validation§
Other functions check whether an ID actually exists in UniProt. These functions require loading the list of all UniProt IDs for the organism, hence calling them for the first time might take a few minutes (in case a new download is necessary). Subsequent calls will be much faster.
[125]:
from pypath.inputs import uniprot as iuniprot
iuniprot.is_uniprot('P00533')
executed in 0ms, finished 16:17:44 2022-12-02
[125]:
True
[122]:
iuniprot.is_swissprot('P00533')
executed in 0ms, finished 16:14:14 2022-12-02
[122]:
True
If the organism doesn’t match:
[123]:
iuniprot.is_uniprot('P00533', organism = 10090)
executed in 0ms, finished 16:15:07 2022-12-02
[123]:
False
Single UniProt protein datasheet§
Raw contents of protein datasheets can be retrieved. The structure is a Python list of two-element tuples: the first element is the tag of the line, the second is the line content.
[126]:
from pypath.inputs import uniprot as iuniprot
iuniprot.protein_datasheet('P00533')
executed in 0ms, finished 16:18:06 2022-12-02
[126]:
[('ID', 'EGFR_HUMAN Reviewed; 1210 AA.'),
('AC',
'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
('DT', '01-NOV-1997, sequence version 2.'),
('DT', '12-OCT-2022, entry version 283.'),
('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
('DE', 'EC=2.7.10.1;'),
('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
('DE', 'Flags: Precursor;'),
('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
('OS', 'Homo sapiens (Human).'),
('OC',
'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
('OC',
'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
('OC', 'Homo.'),
('OX', 'NCBI_TaxID=9606;'),
('RN', '[1]'),
('RP',
'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM
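Because the datasheet is a flat list of (tag, content) pairs, it is easy to post-process with standard Python. As a sketch, the lines can be grouped by tag and the accessions collected from the AC lines; the excerpt below copies a few lines from the output above.

```python
from collections import defaultdict

# A few lines of the datasheet shown above.
datasheet = [
    ('ID', 'EGFR_HUMAN Reviewed; 1210 AA.'),
    ('AC', 'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
    ('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
    ('OS', 'Homo sapiens (Human).'),
    ('OX', 'NCBI_TaxID=9606;'),
]

# Group the line contents by tag, preserving their order.
by_tag = defaultdict(list)
for tag, content in datasheet:
    by_tag[tag].append(content)

# Primary and secondary accessions are spread over the AC lines.
accessions = [
    ac.strip()
    for line in by_tag['AC']
    for ac in line.split(';')
    if ac.strip()
]
print(accessions[0], len(accessions))  # P00533 14
```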
History of UniProt records§
[131]:
from pypath.inputs import uniprot as iuniprot
egfr_history = list(iuniprot.uniprot_history('P00533'))
egfr_history
executed in 0ms, finished 16:21:15 2022-12-02
[131]:
[UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by=''),
UniprotRecordHistory(entry_version='282', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_03', date='2022-08-03', replaces='', replaced_by=''),
UniprotRecordHistory(entry_version='281', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_02', date='2022-05-25', replaces='', replaced_by=''),
UniprotRecordHistory(entry_version='280', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_01', date='2022-02-23', replaces='', replaced_by=''),
UniprotRecordHistory(entry_version='279', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2021_04', date='2021-09-29', replaces='', replaced_by=''),
UniprotRecordHistory(entry_version='278', sequence_version='2', entry_name='EGFR_HUMAN', database='
[132]:
iuniprot.uniprot_recent_version('P00533')
executed in 0ms, finished 16:21:57 2022-12-02
[132]:
UniprotRecordHistory(entry_version='283', sequence_version='2', entry_name='EGFR_HUMAN', database='Swiss-Prot', number='2022_04', date='2022-10-12', replaces='', replaced_by='')
[133]:
iuniprot.uniprot_history_recent_datasheet('P00533')
executed in 1ms, finished 16:22:33 2022-12-02
[133]:
[('ID', 'EGFR_HUMAN Reviewed; 1210 AA.'),
('AC',
'P00533; O00688; O00732; P06268; Q14225; Q68GS5; Q92795; Q9BZS2; Q9GZX1;'),
('AC', 'Q9H2C9; Q9H3C9; Q9UMD7; Q9UMD8; Q9UMG5;'),
('DT', '21-JUL-1986, integrated into UniProtKB/Swiss-Prot.'),
('DT', '01-NOV-1997, sequence version 2.'),
('DT', '12-OCT-2022, entry version 283.'),
('DE', 'RecName: Full=Epidermal growth factor receptor {ECO:0000305};'),
('DE', 'EC=2.7.10.1;'),
('DE', 'AltName: Full=Proto-oncogene c-ErbB-1;'),
('DE', 'AltName: Full=Receptor tyrosine-protein kinase erbB-1;'),
('DE', 'Flags: Precursor;'),
('GN', 'Name=EGFR {ECO:0000312|HGNC:HGNC:3236}; Synonyms=ERBB, ERBB1, HER1;'),
('OS', 'Homo sapiens (Human).'),
('OC',
'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;'),
('OC',
'Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;'),
('OC', 'Homo.'),
('OX', 'NCBI_TaxID=9606;'),
('RN', '[1]'),
('RP',
'NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM
The functions above are able to retrieve the latest datasheet of deleted UniProt records. However, they are slow as several queries are performed to process a single protein.
UniProt REST API§
UniProt deployed its new API in the autumn of 2022, and pypath has since fully transitioned to it.
The API is accessed by the inputs.uniprot.uniprot_data and inputs.uniprot.uniprot_query functions,
though for some purposes higher level functions are more convenient. A list of fields can be
passed to these functions. By default only SwissProt is used. The output is a dict of dicts with
fields as top level keys and UniProt IDs as second level keys. The results often contain notes,
additional info in parentheses, and prefixes and postfixes for identifiers, which are not needed in
every situation. Using uniprot_preprocess instead of uniprot_data cleans up some of this clutter.
[1]:
from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_data(fields = ('family', 'keywords', 'transmembrane'))
executed in 28.47s, finished 03:24:10 2023-11-16
[1]:
{'family': {'A0A087X1C5': 'Cytochrome P450 family',
'A0A0B4J2F2': 'Protein kinase superfamily, CAMK Ser/Thr protein kinase family, AMPK subfamily',
'A0A0K2S4Q6': 'CD300 family',
'A0A1B0GTW7': 'Peptidase M8 family',
'A0AV02': 'SLC12A transporter family',
'A0AV96': 'RRM RBM47 family',
'A0AVF1': 'IFT56 family',
'A0AVI4': 'TMEM129 family',
'A0AVK6': 'E2F/DP family',
'A0AVT1': 'Ubiquitin-activating E1 family',
'A0FGR8': 'Extended synaptotagmin family',
'A0FGR9': 'Extended synaptotagmin family',
'A0JLT2': 'Mediator complex subunit 19 family',
'A0JP26': 'POTE family',
'A0MZ66': 'Shootin family',
'A0PJK1': 'Sodium:solute symporter (SSF) (TC 2.A.21) family',
'A0PJY2': 'Krueppel C2H2-type zinc-finger protein family',
'A0PK00': 'TMEM120 family',
'A0PK11': 'Clarin family',
'A1A4Y4': 'TRAFAC class dynamin-like GTPase superfamily, IRG family',
'A1A519': 'FAM170 family',
'A1A5B4': 'Anoctamin family',
'A1A5C7': 'Major facilitator (TC 2.A.1) superfamily, Orga
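The field-first layout of this output is convenient for column-wise work, but sometimes one record per protein is easier to handle. The sketch below regroups such a dict of dicts by UniProt ID. It runs on a small hand-made sample: the 'family' values are copied from the output above, the 'keywords' values are invented placeholders, and the helper by_protein is hypothetical, not part of pypath.

```python
# Toy stand-in for the output of uniprot_data: field -> UniProt ID -> value.
# The 'family' values come from the output above; the 'keywords' values are
# invented placeholders used only to have a second field.
data = {
    'family': {
        'A0A087X1C5': 'Cytochrome P450 family',
        'A0A0K2S4Q6': 'CD300 family',
    },
    'keywords': {
        'A0A087X1C5': 'placeholder keyword 1',
        'A0AV02': 'placeholder keyword 2',
    },
}


def by_protein(field_dict):
    """Regroup field -> ID -> value into ID -> field -> value."""
    records = {}
    for field, values in field_dict.items():
        for uniprot_id, value in values.items():
            # Proteins lacking a field simply have no such key.
            records.setdefault(uniprot_id, {})[field] = value
    return records


records = by_protein(data)
print(records['A0A087X1C5'])
```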
The inputs.uniprot.query_builder function builds queries for the API.
[2]:
from pypath.inputs import uniprot
uniprot.query_builder('kinase', organism_id = 9606)
executed in 0ms, finished 03:30:18 2023-11-16
[2]:
'kinase AND organism_id:9606'
[3]:
uniprot.query_builder(organism = [9606, 10090, 10116])
executed in 0ms, finished 03:30:49 2023-11-16
[3]:
'(organism_id:9606 OR organism_id:10090 OR organism_id:10116)'
[4]:
uniprot.query_builder({'organism_id': 9606, 'reviewed': True})
executed in 0ms, finished 03:31:22 2023-11-16
[4]:
'(organism_id:9606 AND reviewed:true)'
[5]:
uniprot.query_builder({'length': (500,), 'mass': (50000,), 'op': 'OR'})
executed in 0ms, finished 03:31:41 2023-11-16
[5]:
'(length:[500 TO *] OR mass:[50000 TO *])'
[6]:
uniprot.query_builder(lit_author = ['Huang', 'Kovac', '_AND'])
executed in 0ms, finished 03:32:21 2023-11-16
[6]:
'(lit_author:Huang AND lit_author:Kovac)'
[7]:
uniprot.query_builder({'organism_id': [9606, 10090], 'reviewed': True})
executed in 0ms, finished 03:32:41 2023-11-16
[7]:
'((organism_id:9606 OR organism_id:10090) AND reviewed:true)'
[8]:
uniprot.query_builder({'length': (100, None), 'organism_id': 9606})
executed in 0ms, finished 03:33:04 2023-11-16
[8]:
'(length:[100 TO *] AND organism_id:9606)'
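The rules behind these outputs can be read off the examples: a tuple becomes a range, a list becomes an OR group unless its last element is '_AND', booleans are lower-cased, and a dict becomes a parenthesized group joined by its 'op' key (AND if absent). The sketch below is a simplified re-implementation of only these inferred rules. It is not pypath's code, the name sketch_query is hypothetical, and features such as field synonyms are left out.

```python
def _term(field, value):
    """Render one field with a scalar, range (tuple) or multi-value (list)."""
    if isinstance(value, bool):
        # Check bool before anything else: True must become 'true'.
        return f'{field}:{str(value).lower()}'
    if isinstance(value, tuple):
        # A tuple is a range; a missing or None bound is open ('*').
        lower = value[0]
        upper = value[1] if len(value) > 1 else None
        lower = '*' if lower is None else lower
        upper = '*' if upper is None else upper
        return f'{field}:[{lower} TO {upper}]'
    if isinstance(value, list):
        # A list is OR-ed unless its last element names another operator.
        values = list(value)
        op = 'OR'
        if values and values[-1] in ('_AND', '_OR'):
            op = values.pop()[1:]
        return '(' + f' {op} '.join(_term(field, v) for v in values) + ')'
    return f'{field}:{value}'


def _group(params):
    """Render a dict as a parenthesized group; 'op' selects the operator."""
    params = dict(params)
    op = params.pop('op', 'AND')
    return '(' + f' {op} '.join(_term(k, v) for k, v in params.items()) + ')'


def sketch_query(*args, **kwargs):
    """Combine free-text terms, dict groups and keyword fields with AND."""
    terms = [_group(a) if isinstance(a, dict) else str(a) for a in args]
    terms.extend(_term(k, v) for k, v in kwargs.items())
    return ' AND '.join(terms)


print(sketch_query('kinase', organism_id = 9606))
print(sketch_query({'organism_id': [9606, 10090], 'reviewed': True}))
```

The sketch reproduces the outputs of the organism_id, length, mass and lit_author examples above; the organism keyword of example [3] relies on a synonym that is not modelled here.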
The query parameters can be passed the same way to uniprot_data and uniprot_query. For example, to
query records in one proteome:
[10]:
from pypath.inputs import uniprot
uniprot.uniprot_query(proteome = 'UP000004102')[:10]
executed in 0ms, finished 03:36:16 2023-11-16
[10]:
['D1YM56',
'D1YMJ2',
'D1YN32',
'D1YNB3',
'D1YPZ1',
'D1YR07',
'D1YR15',
'D1YR93',
'D1YRB4',
'D1YRB7']
All this functionality is implemented by the pypath.inputs.uniprot.UniprotQuery class.
Processed UniProt annotations§
For a few important fields we have dedicated processing functions that aim to make their format cleaner and more usable. Even these sometimes do an imperfect job: certain fields are badly truncated or contain residual fragments of the stripped labels.
Note: All the data presented below is part of the OmniPath annotations database; the recommended way to access it is through the database manager.
[136]:
from pypath.inputs import uniprot as iuniprot
iuniprot.uniprot_taxonomy()
executed in 1ms, finished 16:40:33 2022-12-02
[136]:
{'P00521': {'Abelson murine leukemia virus'},
'P03333': {'Abelson murine leukemia virus'},
'H8ZM73': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
'H8ZM71': {'Abies balsamea', 'Balsam fir', 'Pinus balsamea'},
'Q9MV51': {'Abies firma', 'Momi fir'},
'O81086': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'O24474': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'O24475': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'O64404': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'O64405': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'Q948Z0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'Q9M7D1': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'Q9M7D0': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'O22340': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'Q9M7C9': {'Abies grandis', 'Grand fir', 'Pinus grandis'},
'Q5K3V1': {'Abies homolepis', 'Nikko fir'},
'P21715': {'Abrothrix jelskii', 'Akodon jelskii', "Jelski's altiplano mouse"},
'P11140': {'Abru
[139]:
iuniprot.uniprot_ncbi_taxids_2()
executed in 0ms, finished 16:42:33 2022-12-02
[139]:
{648330: Taxon(ncbi_id=648330, latin='Aedes albopictus densovirus (isolate Boublik/1994)', english='AalDNV', latin_synonym=None),
10804: Taxon(ncbi_id=10804, latin='Adeno-associated virus 2', english='AAV-2', latin_synonym=None),
648242: Taxon(ncbi_id=648242, latin='Adeno-associated virus 2 (isolate Srivastava/1982)', english='AAV-2', latin_synonym=None),
118452: Taxon(ncbi_id=118452, latin='Abacion magnum', english='Millipede', latin_synonym=None),
72259: Taxon(ncbi_id=72259, latin='Abaeis nicippe', english='Sleepy orange butterfly', latin_synonym='Eurema nicippe'),
102642: Taxon(ncbi_id=102642, latin='Abax parallelepipedus', english='Ground beetle', latin_synonym=None),
392897: Taxon(ncbi_id=392897, latin='Abalistes stellaris', english='Starry triggerfish', latin_synonym='Balistes stellaris'),
75332: Taxon(ncbi_id=75332, latin='Abbottina rivularis', english='Chinese false gudgeon', latin_synonym='Gobio rivularis'),
515833: Taxon(ncbi_id=515833, latin='Abdopus aculeatus', engl
[140]:
iuniprot.uniprot_locations()
executed in 0ms, finished 16:42:50 2022-12-02
[140]:
{'Q96EC8': {UniprotLocation(location='Golgi apparatus membrane', features=('Multi-pass membrane protein',))},
'Q6ZMS4': {UniprotLocation(location='Nucleus', features=None)},
'Q8N8L2': {UniprotLocation(location='Nucleus', features=None)},
'Q15916': {UniprotLocation(location='Nucleus', features=None)},
'Q3MIS6': {UniprotLocation(location='Nucleus', features=None)},
'Q6P280': {UniprotLocation(location='Nucleus', features=None)},
'Q969W1': {UniprotLocation(location='Endoplasmic reticulum membrane', features=('Multi-pass membrane protein',))},
'O14978': {UniprotLocation(location='Nucleus', features=None)},
'Q66K41': {UniprotLocation(location='Nucleus', features=None)},
'Q15937': {UniprotLocation(location='Nucleus', features=None)},
'Q9P2J8': {UniprotLocation(location='Nucleus', features=None)},
'Q8ND82': {UniprotLocation(location='Nucleus', features=None)},
'Q9NP64': {UniprotLocation(location='Nucleolus', features=None),
UniprotLocation(location='Nucleus', features=None)},
'P
[141]:
iuniprot.uniprot_keywords()
executed in 0ms, finished 16:43:06 2022-12-02
[141]:
{'P63120': {UniprotKeyword(keyword='Aspartyl protease'),
UniprotKeyword(keyword='Autocatalytic cleavage'),
UniprotKeyword(keyword='ERV'),
UniprotKeyword(keyword='Hydrolase'),
UniprotKeyword(keyword='Protease'),
UniprotKeyword(keyword='Reference proteome'),
UniprotKeyword(keyword='Ribosomal frameshifting'),
UniprotKeyword(keyword='Transposable element')},
'Q96EC8': {UniprotKeyword(keyword='Acetylation'),
UniprotKeyword(keyword='Alternative splicing'),
UniprotKeyword(keyword='Golgi apparatus'),
UniprotKeyword(keyword='Membrane'),
UniprotKeyword(keyword='Phosphoprotein'),
UniprotKeyword(keyword='Reference proteome'),
UniprotKeyword(keyword='Transmembrane'),
UniprotKeyword(keyword='Transmembrane helix')},
'Q6ZMS4': {UniprotKeyword(keyword='Metal-binding'),
UniprotKeyword(keyword='Nucleus'),
UniprotKeyword(keyword='Phosphoprotein'),
UniprotKeyword(keyword='Reference proteome'),
UniprotKeyword(keyword='Repeat'),
UniprotKeyword(keyword='Zinc'),
Unipro
[142]:
iuniprot.uniprot_families()
executed in 0ms, finished 16:43:22 2022-12-02
[142]:
{'P63120': {UniprotFamily(family='Peptidase A2', subfamily='HERV class-II K(HML-2)')},
'Q96EC8': {UniprotFamily(family='YIP1', subfamily=None)},
'Q6ZMS4': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q8N8L2': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q3MIS6': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q86UK7': {UniprotFamily(family='ZNF598', subfamily=None)},
'Q6P280': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q969W1': {UniprotFamily(family='DHHC palmitoyltransferase', subfamily=None)},
'O14978': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q15937': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q9P2J8': {UniprotFamily(family='Krueppel C2H2-type zinc-finger protein', subfamily=None)},
'Q8IUH4': {UniprotFamily(family='DHHC palmitoyltransferase',
[143]:
iuniprot.uniprot_tissues()
executed in 1.12s, finished 16:43:55 2022-12-02
[143]:
{'Q15916': {UniprotTissue(tissue='Brain', level='high'),
UniprotTissue(tissue='Wide', level='high')},
'Q969W1': {UniprotTissue(tissue='Wide', level='undefined')},
'O14978': {UniprotTissue(tissue='Brain', level='undefined'),
UniprotTissue(tissue='Colon', level='undefined'),
UniprotTissue(tissue='Heart', level='undefined'),
UniprotTissue(tissue='Kidney', level='undefined'),
UniprotTissue(tissue='Leukocyte', level='undefined'),
UniprotTissue(tissue='Liver', level='undefined'),
UniprotTissue(tissue='Lung', level='undefined'),
UniprotTissue(tissue='Ovary', level='undefined'),
UniprotTissue(tissue='Pancreas', level='undefined'),
UniprotTissue(tissue='Placenta', level='undefined'),
UniprotTissue(tissue='Prostate', level='undefined'),
UniprotTissue(tissue='Skeletal muscle', level='undefined'),
UniprotTissue(tissue='Small intestine', level='undefined'),
UniprotTissue(tissue='Spleen', level='undefined'),
UniprotTissue(tissue='Testis', level='undefined'),
Uniprot
[144]:
iuniprot.uniprot_topology()
executed in 0ms, finished 16:44:13 2022-12-02
[144]:
{'Q96EC8': {UniprotTopology(topology='Cytoplasmic', start=2, end=84),
UniprotTopology(topology='Cytoplasmic', start=137, end=146),
UniprotTopology(topology='Cytoplasmic', start=206, end=212),
UniprotTopology(topology='Lumenal', start=106, end=115),
UniprotTopology(topology='Lumenal', start=168, end=184),
UniprotTopology(topology='Lumenal', start=234, end=236),
UniprotTopology(topology='Transmembrane', start=85, end=105),
UniprotTopology(topology='Transmembrane', start=116, end=136),
UniprotTopology(topology='Transmembrane', start=147, end=167),
UniprotTopology(topology='Transmembrane', start=185, end=205),
UniprotTopology(topology='Transmembrane', start=213, end=233)},
'Q969W1': {UniprotTopology(topology='Cytoplasmic', start=1, end=77),
UniprotTopology(topology='Cytoplasmic', start=138, end=198),
UniprotTopology(topology='Cytoplasmic', start=288, end=377),
UniprotTopology(topology='Lumenal', start=99, end=116),
UniprotTopology(topology='Lumenal', start=220,
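The processed records behave like named tuples, so they are easy to filter and summarize with plain Python. The sketch below counts the transmembrane segments of Q96EC8 and the residues they cover. It uses a namedtuple stand-in and a subset of the values shown above instead of calling pypath, it assumes that start and end are inclusive residue positions, and the helper segments is hypothetical.

```python
from collections import namedtuple

# Stand-in for pypath's record type, with the fields seen in the output above.
UniprotTopology = namedtuple('UniprotTopology', ['topology', 'start', 'end'])

# Part of the Q96EC8 records copied from the output above.
q96ec8 = {
    UniprotTopology('Cytoplasmic', 2, 84),
    UniprotTopology('Lumenal', 106, 115),
    UniprotTopology('Transmembrane', 85, 105),
    UniprotTopology('Transmembrane', 116, 136),
    UniprotTopology('Transmembrane', 147, 167),
    UniprotTopology('Transmembrane', 185, 205),
    UniprotTopology('Transmembrane', 213, 233),
}


def segments(records, topology):
    """Return the records of one topology type, ordered along the sequence."""
    return sorted(
        (rec for rec in records if rec.topology == topology),
        key = lambda rec: rec.start,
    )


tm = segments(q96ec8, 'Transmembrane')
# Assumption: both coordinates are inclusive, hence the + 1.
tm_residues = sum(rec.end - rec.start + 1 for rec in tm)
print(len(tm), tm_residues)
```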
The UniProt utils module§
Datasheets§
The pypath.utils.uniprot module is an API around UniProt protein datasheets. It is not suitable
for bulk retrieval: that would work, but it would take a really long time. Calling its bulk
methods with more than a few dozen or a few hundred proteins might take minutes, as it downloads
protein datasheets one by one. To retrieve the full datasheets of one or more proteins, use query:
[153]:
from pypath.utils import uniprot
uniprot.query('P00533', 'O75385', 'Q14457')
executed in 1ms, finished 17:57:18 2022-12-02
[153]:
[<UniProt datasheet P00533 (EGFR)>,
<UniProt datasheet O75385 (ULK1)>,
<UniProt datasheet Q14457 (BECN1)>]
[154]:
ulk1 = uniprot.query('O75385')
ulk1
executed in 0ms, finished 17:57:58 2022-12-02
[154]:
<UniProt datasheet O75385 (ULK1)>
Many attributes are available from the datasheet objects; here are just a few examples:
[156]:
ulk1.weight, ulk1.length, ulk1.subcellular_location, ulk1.sequence
executed in 0ms, finished 17:59:18 2022-12-02
[156]:
(112631,
1050,
'Cytoplasm, cytosol. Preautophagosomal structure. Note=Under starvation conditions, is localized to puncate structures primarily representing the isolation membrane that sequesters a portion of the cytoplasm resulting in the formation of an autophagosome.',
'MEPGRGGTETVGKFEFSRKDLIGHGAFAVVFKGRHREKHDLEVAVKCINKKNLAKSQTLLGKEIKILKELKHENIVALYDFQEMANSVYLVMEYCNGGDLADYLHAMRTLSEDTIRLFLQQIAGAMRLLHSKGIIHRDLKPQNILLSNPAGRRANPNSIRVKIADFGFARYLQSNMMAATLCGSPMYMAPEVIMSQHYDGKADLWSIGTIVYQCLTGKAPFQASSPQDLRLFYEKNKTLVPTIPRETSAPLRQLLLALLQRNHKDRMDFDEFFHHPFLDASPSVRKSPPVPVPSYPSSGSGSSSSSSSTSHLASPPSLGEMQQLQKTLASPADTAGFLHSSRDSGGSKDSSCDTDDFVMVPAQFPGDLVAEAPSAKPPPDSLMCSGSSLVASAGLESHGRTPSPSPPCSSSPSPSGRAGPFSSSRCGASVPIPVPTQVQNYQRIERNLQSPTQFQTPRSSAIRRSGSTSPLGFARASPSPPAHAEHGGVLARKMSLGGGRPYTPSPQVGTIPERPGWSGTPSPQGAEMRGGRSPRPGSSAPEHSPRTSGLGCRLHSAPNLSDLHVVRPKLPKPPTDPLGAVFSPPQASPPQPSHGLQSCRNLRGSPKLPDFLQRNPLPPILGSPTKAVPSFDFPKTPSSQNLLALLARQGVVMTPPRNRTLPDLSEVGPFHGQPLGPGLRPGEDPKGPFGRSFSTSRLTDLLLKAAFGTQAPDPGSTESLQEK
The collect function collects certain features for a set of proteins.
Warning: This is a really inefficient way of retrieving data from UniProt. If you work with more than a handful of proteins, go for pypath.inputs.uniprot.uniprot_data instead.
[158]:
uniprot.collect(['P00533', 'O75385', 'Q14457'], 'weight', 'length')
executed in 0ms, finished 18:02:29 2022-12-02
[158]:
OrderedDict([('ac', ['P00533', 'O75385', 'Q14457']),
('weight', [134277, 112631, 51896]),
('length', [1210, 1050, 450])])
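The result is column-oriented: one list per feature, aligned by position. If a row per protein is more convenient, the columns can be zipped together. The sketch below does this on the values copied from the output above, without calling pypath.

```python
from collections import OrderedDict

# The column-oriented result shown above, copied by hand.
collected = OrderedDict([
    ('ac', ['P00533', 'O75385', 'Q14457']),
    ('weight', [134277, 112631, 51896]),
    ('length', [1210, 1050, 450]),
])

# Transpose the columns: each zipped tuple holds the values of one protein.
rows = [
    dict(zip(collected.keys(), values))
    for values in zip(*collected.values())
]

# Row-wise records make per-protein questions straightforward.
heaviest = max(rows, key = lambda row: row['weight'])
print(heaviest['ac'])
```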
Tables§
UniProt data can be printed to the console in a tabular format:
[159]:
uniprot.print_features(['P00533', 'O75385', 'Q14457'], 'weight', 'length')
executed in 0ms, finished 18:07:18 2022-12-02
╒═══════╤════════╤══════════╤══════════╕
│ No. │ ac │ weight │ length │
╞═══════╪════════╪══════════╪══════════╡
│ 1 │ P00533 │ 134277 │ 1210 │
├───────┼────────┼──────────┼──────────┤
│ 2 │ O75385 │ 112631 │ 1050 │
├───────┼────────┼──────────┼──────────┤
│ 3 │ Q14457 │ 51896 │ 450 │
╘═══════╧════════╧══════════╧══════════╛
There is a shortcut to print an essential characterization of proteins as such a table. The info
function is really useful if you arrive at a set of proteins at some point in your analysis and
want to quickly check what kind they are. To iterate through multiple groups of proteins, use
utils.uniprot.browse. The columns and format of these tables can be customized by kwargs.
[160]:
uniprot.info(['P00533', 'O75385', 'Q14457'])
executed in 0ms, finished 18:09:45 2022-12-02
=====> [3 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│ No. │ ac │ genesymbol │ length │ weight │ full_name │ function_o │ keywords │ subcellula │
│ │ │ │ │ │ │ r_genecard │ │ r_location │
│ │ │ │ │ │ │ s │ │ │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│ 1 │ P00533 │ EGFR │ 1210 │ 134277 │ Epidermal │ Receptor │ 3D- │ Cell │
│ │ │ │ │ │ growth │ tyrosine │ structure, │ membrane; │
│ │ │ │ │ │ factor │ kinase │ Alternativ │ Single- │
│ │ │ │ │ │ receptor │
Sanitizing UniProt IDs§
It is important to know that the ID translation module always does a number of checks when
translating to UniProt IDs, unless the uniprot_cleanup parameter is disabled. It translates
secondary IDs to primary, attempts to map TrEMBL IDs to SwissProt by gene symbols, and removes IDs
that belong to other organisms or have an invalid format. To exploit this behaviour it’s enough to
map from UniProt to UniProt:
[162]:
from pypath.utils import mapping
mapping.map_name('Q9UQ28', 'uniprot', 'uniprot')
executed in 0ms, finished 18:20:02 2022-12-02
[162]:
{'O75385'}
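Two of the clean-up steps described above, the format check and the secondary to primary translation, can be illustrated without pypath. The sketch below is not the mapping module's implementation: the translation table is a one-entry toy built from the example above, the regular expression is the accession pattern published in the UniProt help pages, and the cleanup helper is hypothetical. The TrEMBL to SwissProt and organism checks are left out.

```python
import re

# UniProt accession pattern as published in the UniProt help pages.
UNIPROT_AC = re.compile(
    r'^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$'
)

# Toy translation table; the single pair is taken from the example above.
SECONDARY_TO_PRIMARY = {'Q9UQ28': 'O75385'}


def cleanup(ids, secondary_to_primary = SECONDARY_TO_PRIMARY):
    """Drop malformed accessions and replace secondary ones by primary."""
    cleaned = set()
    for uniprot_id in ids:
        if not UNIPROT_AC.match(uniprot_id):
            # Invalid format: silently discarded in this sketch.
            continue
        # Primary accessions are absent from the table and pass unchanged.
        cleaned.add(secondary_to_primary.get(uniprot_id, uniprot_id))
    return cleaned


print(cleanup(['Q9UQ28', 'O75385', 'not-an-id']))
```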
Enzyme-substrate interactions§
The database is an instance of the pypath.core.enz_sub.EnzymeSubstrateAggregator class. It is built
with the default or current configuration by the core.enz_sub.get_db function.
Warning: it is recommended to access databases through the manager. Running the code below takes a really long time, and it does not save or reload the database: it builds a fresh copy each time.
[25]:
from pypath.core import enz_sub
es = enz_sub.get_db()
executed in 8m 1.81s, finished 14:26:37 2022-12-02
Instead, let’s acquire the database from the manager:
[6]:
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')
executed in 7.27s, finished 15:37:33 2022-12-03
The database itself is stored as a dictionary (EnzymeSubstrateAggregator.enz_sub) with pairs of
proteins as keys and a list of special objects representing enzyme-substrate interactions as
values. These can be accessed by pairs of labels, identifiers or Entity objects, e.g. mTOR
phosphorylates AKT1:
[27]:
es[('MTOR', 'AKT1')]
executed in 0ms, finished 14:40:55 2022-12-02
[27]:
[<MTOR => Residue AKT1-1:S473:phosphorylation [Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)]>,
<MTOR => Residue AKT1-1:T450:phosphorylation [Evidences: HPRD, MIMP, PhosphoSite, ProtMapper, phosphoELM (0 references)]>,
<MTOR => Residue AKT1-1:T308:phosphorylation [Evidences: ProtMapper, Sparser (1 references)]>]
Enzyme-substrate objects§
Let’s take a closer look at one of the enzyme-PTM relationships, which are represented by pypath.internals.intera.DomainMotif
objects. Some of the attributes are shown below:
[28]:
e_ptm = es[('MTOR', 'AKT1')][0]
e_ptm.ptm.protein, e_ptm.ptm.protein.identifier, e_ptm.ptm.isoform, e_ptm.ptm.residue, e_ptm.ptm.residue.name, e_ptm.ptm.residue.number, e_ptm.ptm.typ, e_ptm.domain.protein
executed in 0ms, finished 14:40:57 2022-12-02
[28]:
(<Entity: AKT1>,
'P31749',
1,
<Residue AKT1-1:S473>,
'S',
473,
'phosphorylation',
<Entity: MTOR>)
The resources and references are available in Evidences objects:
[29]:
e_ptm.evidences
executed in 0ms, finished 14:41:00 2022-12-02
[29]:
<Evidences: HPRD, KEA, MIMP, PhosphoSite, ProtMapper, REACH, SIGNOR, Sparser, dbPTM, phosphoELM (15 references)>
[30]:
e_ptm.evidences.get_resource_names()
executed in 0ms, finished 14:41:03 2022-12-02
[30]:
{'KEA', 'MIMP', 'PhosphoSite', 'ProtMapper', 'SIGNOR', 'dbPTM'}
[31]:
e_ptm.evidences.get_references()
executed in 0ms, finished 14:41:04 2022-12-02
[31]:
{<Reference: 14761976>,
<Reference: 15047712>,
<Reference: 15364915>,
<Reference: 15718470>,
<Reference: 15899889>,
<Reference: 16221682>,
<Reference: 17013611>,
<Reference: 19844585>,
<Reference: 20333297>,
<Reference: 20489726>,
<Reference: 21157483>,
<Reference: 21592956>,
<Reference: 23006971>,
<Reference: 8978681>,
<Reference: 9736715>}
Enzyme-substrate data frame§
The database object is able to export its contents into a pandas.DataFrame:
[7]:
es.make_df()
es.df
executed in 1.03s, finished 15:37:39 2022-12-03
[7]:
 | enzyme | enzyme_genesymbol | substrate | substrate_genesymbol | isoforms | residue_type | residue_offset | modification | sources | references | curation_effort |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | P31749 | AKT1 | P63104 | YWHAZ | 1 | S | 58 | phosphorylation | HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite;PhosphoSit... | HPRD:11956222;KEA:11956222;KEA:12861023;KEA:16... | 11 |
1 | P31749 | AKT1 | P63104 | YWHAZ | 1 | S | 184 | phosphorylation | HPRD;HPRD_MIMP;KEA;MIMP;PhosphoSite_MIMP;phosp... | HPRD:11956222;KEA:11956222;KEA:15071501 | 3 |
2 | P45983 | MAPK8 | P63104 | YWHAZ | 1 | S | 184 | phosphorylation | HPRD;HPRD_MIMP;KEA;MIMP;PhosphoNetworks;Phosph... | HPRD:15696159;KEA:11956222;KEA:15071501;KEA:15... | 9 |
3 | P06493 | CDK1 | P11171 | EPB41 | 1 | S | 712 | phosphorylation | HPRD_MIMP;MIMP;PhosphoSite_MIMP;ProtMapper;REA... | ProtMapper:15525677;dbPTM:15525677;dbPTM:18220... | 5 |
4 | P06493 | CDK1 | P11171 | EPB41 | 1;2;5;7 | T | 60 | phosphorylation | MIMP;PhosphoSite;PhosphoSite_MIMP;ProtMapper;R... | ProtMapper:15525677;dbPTM:15525677;dbPTM:2171679 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41421 | P29597 | TYK2 | P51692 | STAT5B | 1 | Y | 699 | phosphorylation | KEA | KEA:10830280;KEA:11751923;KEA:12411494 | 3 |
41422 | Q06418 | TYRO3 | P19174 | PLCG1 | 1;2 | Y | 771 | phosphorylation | KEA | KEA:12601080;KEA:15144186;KEA:15592455;KEA:160... | 8 |
41423 | Q9H4A3 | WNK1 | Q8TAX0 | OSR1 | 1 | T | 185 | phosphorylation | KEA | KEA:18270262 | 1 |
41424 | Q9H4A3 | WNK1 | Q96J92 | WNK4 | 1;3 | S | 335 | phosphorylation | KEA | KEA:15883153 | 1 |
41425 | Q9NYL2 | MAP3K20 | Q92903 | CDS1 | 1 | T | 68 | phosphorylation | KEA | KEA:10973490 | 1 |
41426 rows × 11 columns
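The references column packs resources and literature identifiers into a single string. The sketch below splits such a value, assuming it consists of semicolon-separated resource:reference pairs, as the untruncated value in row 1 suggests. The helper parse_references is hypothetical, and the input string is copied from that row.

```python
from collections import defaultdict

# The complete references value of row 1 in the data frame above.
references = 'HPRD:11956222;KEA:11956222;KEA:15071501'


def parse_references(field):
    """Split 'resource:id;resource:id' into resource -> set of ids."""
    by_resource = defaultdict(set)
    for item in field.split(';'):
        # partition splits at the first colon only.
        resource, _, reference = item.partition(':')
        by_resource[resource].add(reference)
    return dict(by_resource)


refs = parse_references(references)
# Number of distinct resource-reference pairs and of distinct references.
n_pairs = sum(len(ids) for ids in refs.values())
unique_refs = set().union(*refs.values())
print(n_pairs, len(unique_refs))
```

For this row the number of resource-reference pairs is 3, which coincides with the curation_effort value shown above, while only 2 distinct references are involved.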
Protein sequences§
The APIs for sequences are very basic because we’ve never really needed them, but the fundamentals are probably there to build a nice, powerful API. Still, I don’t believe pypath will ever be strong in sequences; it’s just not our main topic.
[186]:
from pypath.utils import homology
seqc = homology.SequenceContainer(preload_seq = [9606])
akt1 = seqc.get_seq('P31749')
akt1.get_region(start = 10, end = 19, isoform = 2)
executed in 0ms, finished 19:40:09 2022-12-02
[186]:
(10, 19, 'TFIIRCLQWT')
[187]:
from pypath.utils import seq
human_proteome = seq.swissprot_seq()
human_proteome
executed in 0ms, finished 19:44:52 2022-12-02
[187]:
{'P63120': <pypath.utils.seq.Seq at 0x689900d45cc0>,
'Q96EC8': <pypath.utils.seq.Seq at 0x689908ea8f70>,
'Q6ZMS4': <pypath.utils.seq.Seq at 0x689908eaa4a0>,
'Q8N8L2': <pypath.utils.seq.Seq at 0x6899223538b0>,
'Q15916': <pypath.utils.seq.Seq at 0x689922353c70>,
'O60384': <pypath.utils.seq.Seq at 0x689922350730>,
'Q3MIS6': <pypath.utils.seq.Seq at 0x689922353310>,
'Q86UK7': <pypath.utils.seq.Seq at 0x689922353760>,
'Q6P280': <pypath.utils.seq.Seq at 0x689922353190>,
'Q969W1': <pypath.utils.seq.Seq at 0x689922350d90>,
'O14978': <pypath.utils.seq.Seq at 0x689922353220>,
'P61129': <pypath.utils.seq.Seq at 0x689922353370>,
'Q66K41': <pypath.utils.seq.Seq at 0x6899223534f0>,
'Q15937': <pypath.utils.seq.Seq at 0x689922350c70>,
'Q9P2J8': <pypath.utils.seq.Seq at 0x689922351450>,
'Q8ND82': <pypath.utils.seq.Seq at 0x689922353910>,
'Q9NP64': <pypath.utils.seq.Seq at 0x6899223502b0>,
'P98182': <pypath.utils.seq.Seq at 0x689922350280>,
'Q8IUH4': <pypath.utils.seq.Seq at 0x68992235
[191]:
list(human_proteome['P00533'].findall('YGCT'))
executed in 0ms, finished 19:48:41 2022-12-02
[191]:
[SeqLookup(isoform=1, offset=625)]
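The coordinate conventions deserve a note. The get_region call above returns ten residues for start = 10 and end = 19, which suggests 1-based, inclusive positions. The sketch below shows region extraction and motif search with that convention on a toy sequence. Both functions are simplified stand-ins rather than pypath's implementation, and reporting motif offsets as 1-based is an assumption.

```python
def get_region(sequence, start, end):
    """Return (start, end, residues) for 1-based, inclusive coordinates."""
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return start, end, sequence[start - 1:end]


def findall(sequence, motif):
    """Yield the 1-based offset of every occurrence of the motif."""
    offset = sequence.find(motif)
    while offset != -1:
        yield offset + 1
        # Continue one position further, so overlapping hits are found too.
        offset = sequence.find(motif, offset + 1)


# Toy sequence, not a real protein.
toy = 'MKTAYGCTLLYGCTA'

print(list(findall(toy, 'YGCT')))
print(get_region(toy, 5, 8))
```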
Annotations§
This database provides various annotations about the function, structure, localization and many
other properties of the proteins and genes. The database is an instance of the
pypath.core.annot.AnnotationTable class. It is built with the default or current configuration by
the core.annot.get_db function.
Warning: it is recommended to access databases through the manager. Running the code below takes a really long time, and it does not save or reload the database: it builds a fresh copy each time.
[38]:
from pypath.core import annot
an = annot.get_db()
an
executed in 1ms, finished 15:07:08 2022-12-02
[38]:
<Annotation database: 3788067 records about 51636 entities from 78 resources>
Load a single annotation resource§
The annotations database is huge: on disk it takes up 1-2 GB of space, and it consists of 60-70
resources. However, these resources are not integrated with each other, so each can be loaded
individually by its dedicated class in the core.annot module. This practice can be recommended
and will be supported better in the future. Let’s load one resource:
[8]:
from pypath.core import annot
cpad = annot.Cpad()
cpad
executed in 48.26s, finished 15:38:57 2022-12-03
[8]:
<CPAD annotations: 2308 records about 1358 entities>
The resulting object is derived from the AnnotationBase class. Its data is stored under the annot
attribute, in a dict where identifiers are keys and sets of annotation records are the values.
The keys of the records are shown by the get_names method:
[35]:
cpad.get_names()
executed in 0ms, finished 15:06:45 2022-12-02
[35]:
('regulator_type',
'effect_on_pathway',
'pathway',
'effect_on_cancer',
'effect_on_cancer_outcome',
'cancer',
'pathway_category')
For each name we can list the possible values:
[36]:
cpad.get_values('cancer')
executed in 0ms, finished 15:06:47 2022-12-02
[36]:
{'Acute lymphoblastic leukemia (ALL) (precursor T lymphoblastic leukemia)',
'Acute myeloid leukemia (AML)',
'Basal cell carcinoma',
'Bladder cancer',
'Breast cancer',
'Cervical cancer',
'Cholangiocarcinoma',
'Choriocarcinoma',
'Chronic lymphocytic leukemia (CLL)',
'Chronic myeloid leukemia (CML)',
'Colorectal cancer',
'Endometrial cancer',
'Esophageal cancer',
"Ewing's sarcoma",
'Gallbladder cancer',
'Gastric cancer',
'Glioma',
'Hepatocellular carcinoma',
'Hodgkin lymphoma',
'Infantile hemangioma',
'Laryngeal cancer',
'Malignant melanoma',
'Malignant pleural mesothelioma',
'Mantle cell lymphoma',
'Multiple myeloma',
'Nasopharyngeal cancer',
'Neuroblastoma',
'Non-small cell lung cancer',
'Oral cancer',
'Osteosarcoma',
'Ovarian cancer',
'Pancreatic cancer',
'Pituitary adenomas',
'Prostate cancer',
'Renal cell carcinoma',
'Small cell lung cancer',
'Squamous cell carcinoma',
'Synovial sarcoma',
'Testicular cancer',
'Thyroid cancer'}
The select method filters the annotated molecules based on their annotations. For example, 78
complexes, miRNAs and proteins are annotated as inhibiting colorectal cancer:
[37]:
cpad.select(cancer = 'Colorectal cancer', effect_on_cancer = 'Inhibiting')
executed in 0ms, finished 15:06:50 2022-12-02
[37]:
{'A6NDV4',
Complex: COMPLEX:O14745,
Complex: COMPLEX:O14862,
Complex: COMPLEX:O15169_P25054,
Complex: COMPLEX:O94813,
Complex: COMPLEX:O94953,
Complex: COMPLEX:P00533,
Complex: COMPLEX:P06733,
Complex Glucose transporter complex 1: COMPLEX:P11166,
Complex: COMPLEX:P25054,
Complex: COMPLEX:P40261,
Complex: COMPLEX:P49327,
Complex: COMPLEX:P54687,
Complex PTEN phosphatase complex: COMPLEX:P60484,
Complex: COMPLEX:Q01973,
Complex: COMPLEX:Q12888,
Complex: COMPLEX:Q13620,
Complex: COMPLEX:Q96CX2,
Complex: COMPLEX:Q99558,
'MIMAT0000069',
'MIMAT0000089',
'MIMAT0000093',
'MIMAT0000262',
'MIMAT0000274',
'MIMAT0000422',
'MIMAT0000427',
'MIMAT0000437',
'MIMAT0000449',
'MIMAT0000455',
'MIMAT0000460',
'MIMAT0000461',
'MIMAT0000617',
'MIMAT0003266',
'MIMAT0003320',
'O14745',
'O14862',
'O15169',
'O75473',
'O75888',
'O76041',
'O94813',
'O94953',
'P00533',
'P06733',
'P06756',
'P11166',
'P13631',
'P22676',
'P25054',
'P25791',
'P40261',
'P49327',
'P546
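The data layout described above (identifiers as keys, sets of records as values) makes this kind of filtering simple to express. The sketch below mimics it on a toy dict. It is not pypath's implementation: the record type, the TOY identifiers and the select helper are hypothetical, and it assumes that all criteria must be satisfied by one and the same record.

```python
from collections import namedtuple

# Toy record type with two of the CPAD fields listed by get_names above.
CpadRecord = namedtuple('CpadRecord', ['cancer', 'effect_on_cancer'])

# Toy annotation dict: identifier -> set of records (identifiers invented).
annot = {
    'TOY_1': {CpadRecord('Colorectal cancer', 'Inhibiting')},
    'TOY_2': {
        CpadRecord('Colorectal cancer', 'Activating'),
        CpadRecord('Glioma', 'Inhibiting'),
    },
    'TOY_3': {CpadRecord('Glioma', 'Inhibiting')},
}


def select(annotations, **criteria):
    """Return entities having at least one record matching all criteria."""
    return {
        entity
        for entity, records in annotations.items()
        if any(
            all(getattr(rec, name) == value for name, value in criteria.items())
            for rec in records
        )
    }


hits = select(annot, cancer = 'Colorectal cancer', effect_on_cancer = 'Inhibiting')
print(hits)
```

Under this assumption TOY_2 is not selected: it has a colorectal cancer record and an inhibiting record, but no single record carries both values.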
Load the full annotations database by the database manager§
Alternatively, the full annotations database can be accessed in the usual way:
[215]:
from pypath import omnipath
an = omnipath.db.get_db('annotations')
an
[215]:
<Annotation database: 5490653 records about 50872 entities from 68 resources>
The AnnotationTable object contains the resource specific annotation objects under the annots attribute:
[40]:
an.annots
executed in 0ms, finished 15:07:39 2022-12-02
[40]:
{'CellTypist': <CellTypist annotations: 927 records about 473 entities>,
'Integrins': <Integrins annotations: 62 records about 62 entities>,
'CellCellInteractions': <CellCellInteractions annotations: 5544 records about 4960 entities>,
'PanglaoDB': <PanglaoDB annotations: 8479 records about 4813 entities>,
'Lambert2018': <Lambert2018 annotations: 3281 records about 3277 entities>,
'CancerSEA': <CancerSEA annotations: 2515 records about 1992 entities>,
'Phobius': <Phobius annotations: 35382 records about 35382 entities>,
'GO_Intercell': <GO_Intercell annotations: 48799 records about 18377 entities>,
'MatrixDB': <MatrixDB annotations: 18127 records about 15903 entities>,
'Surfaceome': <Surfaceome annotations: 3558 records about 3558 entities>,
'Matrisome': <Matrisome annotations: 1514 records about 1514 entities>,
'HPA_secretome': <HPA_secretome annotations: 3568 records about 3568 entities>,
'HPMR': <HPMR annotations: 1748 records about 1695 entities>,
'CPAD': <CPAD annotati
For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values, just like for CPAD above. As another example, let’s take a look at the Matrisome database:
[41]:
matrisome = an.annots['Matrisome']
executed in 0ms, finished 15:07:45 2022-12-02
[42]:
matrisome.get_names()
executed in 0ms, finished 15:07:49 2022-12-02
[42]:
('mainclass', 'subclass', 'subsubclass')
[43]:
matrisome.get_values('subclass')
executed in 0ms, finished 15:07:53 2022-12-02
[43]:
{'Collagens',
'ECM Glycoproteins',
'ECM Regulators',
'ECM-affiliated Proteins',
'Proteoglycans',
'Secreted Factors',
'n/a'}
[44]:
matrisome.get_subset(subclass = 'Collagens')
executed in 0ms, finished 15:07:56 2022-12-02
[44]:
{'A6NMZ7',
'A8TX70',
'B4DZ39',
Complex Collagen type I homotrimer: COMPLEX:P02452,
Complex HT_DM_Cluster278: COMPLEX:P02452_P02462_P08572_P29400_P53420_Q01955_Q02388_Q14031_Q17RW2_Q8NFW1,
Complex Collagen type I trimer: COMPLEX:P02452_P08123,
Complex Collagen type II trimer: COMPLEX:P02458,
Complex Collagen type XI trimer variant 1: COMPLEX:P02458_P12107_P13942,
Complex: COMPLEX:P02458_P20908_P25067,
Complex: COMPLEX:P02458_P20908_P25067_P29400,
Complex: COMPLEX:P02458_P25067_P29400,
Complex Collagen type III trimer: COMPLEX:P02461,
Complex: COMPLEX:P02462,
Complex Collagen type IV trimer variant 1: COMPLEX:P02462_P08572,
Complex Collagen type XI trimer variant 2: COMPLEX:P05997_P12107,
Complex Collagen type XI trimer variant 3: COMPLEX:P05997_P12107_P20908,
Complex Collagen type V trimer variant 1: COMPLEX:P05997_P20908,
Complex Collagen type V trimer variant 2: COMPLEX:P05997_P20908_P25940,
Complex: COMPLEX:P08572,
Complex: COMPLEX:P12109_P12110,
Complex Collagen
Load only selected annotations§
Another option is to load only certain annotation resources into an AnnotationTable object. We
refer to the resources by class names. For example, if you only want to load the pathway
membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the
appropriate classes:
[47]:
pathways = annot.AnnotationTable(
protein_sources = (
'SignalinkPathways',
'KeggPathways',
'NetpathPathways',
'SignorPathways',
),
complex_sources = (),
)
pathways
executed in 12.07s, finished 15:09:48 2022-12-02
[47]:
<Annotation database: 28745 records about 6762 entities from 4 resources>
The AnnotationTable object provides methods to query all resources together, or build a boolean
array out of them. To see all annotations of one protein:
[48]:
pathways.all_annotations('P00533')
executed in 0ms, finished 15:10:17 2022-12-02
[48]:
[SignalinkPathway(pathway='Receptor tyrosine kinase'),
SignalinkPathway(pathway='JAK/STAT'),
KeggPathway(pathway='Proteoglycans in cancer'),
KeggPathway(pathway='Regulation of actin cytoskeleton'),
KeggPathway(pathway='Oxytocin signaling pathway'),
KeggPathway(pathway='Phospholipase D signaling pathway'),
KeggPathway(pathway='Pathways in cancer'),
KeggPathway(pathway='Hepatocellular carcinoma'),
KeggPathway(pathway='Colorectal cancer'),
KeggPathway(pathway='Melanoma'),
KeggPathway(pathway='EGFR tyrosine kinase inhibitor resistance'),
KeggPathway(pathway='Human papillomavirus infection'),
KeggPathway(pathway='Pancreatic cancer'),
KeggPathway(pathway='Non-small cell lung cancer'),
KeggPathway(pathway='Central carbon metabolism in cancer'),
KeggPathway(pathway='Endocytosis'),
KeggPathway(pathway='Endometrial cancer'),
KeggPathway(pathway='Choline metabolism in cancer'),
KeggPathway(pathway='Bladder cancer'),
KeggPathway(pathway='Parathyroid hormone synthesis, secretion
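To make the idea of a boolean array concrete, here is a minimal sketch in plain Python. The resource names and memberships are made up for the illustration, and this is not how AnnotationTable builds its array; it only shows the shape of the result: one row per protein, one boolean per annotation set.

```python
# Toy membership data: annotation set -> UniProt IDs (made up for illustration).
memberships = {
    'resource_a': {'P00533', 'P04626'},
    'resource_b': {'P00533', 'P31749'},
}

# Fixed row and column order, so the array can be interpreted later.
proteins = sorted(set().union(*memberships.values()))
columns = sorted(memberships)

# One row per protein, one boolean per annotation set.
bool_array = [
    [protein in memberships[col] for col in columns]
    for protein in proteins
]

for protein, row in zip(proteins, bool_array):
    print(protein, row)
```

Keeping the `proteins` and `columns` lists next to the array is what makes the bare booleans meaningful.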
Data frames of annotations§
Data from annotation objects can be exported to a pandas.DataFrame
:
[9]:
cpad.make_df()
cpad.df
executed in 0ms, finished 15:40:14 2022-12-03
[9]:
uniprot | genesymbol | entity_type | source | label | value | record_id | |
---|---|---|---|---|---|---|---|
0 | Q16181 | SEPT7 | protein | CPAD | regulator_type | protein | 0 |
1 | Q16181 | SEPT7 | protein | CPAD | effect_on_pathway | Upregulation | 0 |
2 | Q16181 | SEPT7 | protein | CPAD | pathway | Actin cytoskeleton pathway | 0 |
3 | Q16181 | SEPT7 | protein | CPAD | effect_on_cancer | Inhibiting | 0 |
4 | Q16181 | SEPT7 | protein | CPAD | effect_on_cancer_outcome | inhibit glioma cell migration | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
14396 | COMPLEX:P30990 | COMPLEX:NTS | complex | CPAD | cancer | Hepatocellular carcinoma | 2306 |
14397 | COMPLEX:P30990 | COMPLEX:NTS | complex | CPAD | effect_on_pathway | Upregulation | 2307 |
14398 | COMPLEX:P30990 | COMPLEX:NTS | complex | CPAD | pathway | ERK signaling pathway | 2307 |
14399 | COMPLEX:P30990 | COMPLEX:NTS | complex | CPAD | effect_on_cancer | Activating | 2307 |
14400 | COMPLEX:P30990 | COMPLEX:NTS | complex | CPAD | cancer | Gastric cancer | 2307 |
14401 rows × 7 columns
The data frame is in long format. It can be converted to the more conventional wide format with standard pandas operations (in tidyverse you would simply call tidyr::pivot_wider; in pandas it takes a less intuitive sequence of six calls):
[10]:
index_cols = ['record_id', 'uniprot', 'genesymbol', 'label', 'entity_type']
(
cpad.df.drop('source', axis=1).
set_index(index_cols).
unstack('label').
droplevel(axis=1, level=0).
reset_index().
drop('record_id', axis=1)
)
executed in 0ms, finished 15:40:19 2022-12-03
[10]:
label | uniprot | genesymbol | entity_type | cancer | effect_on_cancer | effect_on_cancer_outcome | effect_on_pathway | pathway | pathway_category | regulator_type |
---|---|---|---|---|---|---|---|---|---|---|
0 | Q16181 | SEPT7 | protein | Glioma | Inhibiting | inhibit glioma cell migration | Upregulation | Actin cytoskeleton pathway | Regulation of actin cytoskeleton | protein |
1 | MIMAT0000431 | hsa-miR-140 | mirna | Squamous cell carcinoma | Inhibiting | suppress tumor cell migration and invasion | Upregulation | ADAM10 mediated Notch1 signaling pathway | Notch signaling pathway | mirna |
2 | MIMAT0005886 | hsa-miR-1297 | mirna | Prostate cancer | Inhibiting | inhibit proliferation and invasion | Upregulation | AEG1/Wnt signaling pathway | Wnt signaling pathway | mirna |
3 | Q9UP65 | PLA2G4C | protein | Breast cancer | Inhibiting | inhibit EGF-induced chemotaxis | Downregulation | Akt signaling pathway | PI3K-Akt signaling pathway | protein |
4 | Q92600 | CNOT9 | protein | Breast cancer | Inhibiting | suppress cell proliferation | Downregulation | Akt signaling pathway | PI3K-Akt signaling pathway | protein |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2303 | COMPLEX:P16422 | COMPLEX:EPCAM | complex | Prostate cancer | Inhibiting | NaN | Downregulation | PI3K-Akt-mTOR signaling pathway | NaN | NaN |
2304 | COMPLEX:Q9Y6Y0 | COMPLEX:IVNS1ABP | complex | Prostate cancer | Inhibiting | NaN | Upregulation | Akt signaling pathway | NaN | NaN |
2305 | COMPLEX:Q96CX2 | COMPLEX:KCTD12 | complex | Colorectal cancer | Inhibiting | NaN | Upregulation | ERK signaling pathway | NaN | NaN |
2306 | COMPLEX:P30990 | COMPLEX:NTS | complex | Hepatocellular carcinoma | Activating | NaN | Upregulation | Wnt/beta-catenin signaling pathway | NaN | NaN |
2307 | COMPLEX:P30990 | COMPLEX:NTS | complex | Gastric cancer | Activating | NaN | Upregulation | ERK signaling pathway | NaN | NaN |
2308 rows × 10 columns
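For small tables, a shorter alternative to the chain above is `DataFrame.pivot`. The sketch below uses a made-up four-row frame that only mimics the long format shown above (the values are copied from the example output, the frame itself is not produced by pypath). It assumes pandas 1.1 or newer, where `pivot` accepts a list of index columns.

```python
import pandas as pd

# Toy long-format table: one row per (record, label) pair.
long_df = pd.DataFrame({
    'record_id': [0, 0, 1, 1],
    'uniprot': ['Q16181', 'Q16181', 'Q9UP65', 'Q9UP65'],
    'label': ['pathway', 'cancer', 'pathway', 'cancer'],
    'value': [
        'Actin cytoskeleton pathway', 'Glioma',
        'Akt signaling pathway', 'Breast cancer',
    ],
})

# Each distinct label becomes a column; the index columns identify a record.
wide_df = (
    long_df
    .pivot(index=['record_id', 'uniprot'], columns='label', values='value')
    .reset_index()
    .rename_axis(columns=None)
)

print(wide_df)
```

`pivot` raises an error if the same combination of index and label occurs more than once, so the index columns must identify each record uniquely. This is why `record_id` is part of the index here.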
Inter-cellular signaling roles§
pypath does not combine the annotations in the annot module: exactly what goes in comes out. For example, the WNT pathway annotations from Signor and SignaLink won't be merged automatically. However, anyone can do this with the pypath.core.annot.CustomAnnotation class. For inter-cellular communication categories, the pypath.core.intercell module combines the data from all the relevant resources and creates categories based on a combination of evidence. The database is an instance of the IntercellAnnotation class, and the build is executed by the pypath.core.intercell.get_db function.
Warning: it is recommended to access databases through the database manager. Running the code below takes a very long time and does not save or reload the database; it builds a fresh copy each time.
[53]:
from pypath.core import intercell
ic = intercell.get_db() # this takes quite some time
# unless you load annotations from a pickle cache
ic
executed in 0ms, finished 15:13:03 2022-12-02
[53]:
<Intercell annotations: 310033 records about 43617 entities>
[11]:
from pypath import omnipath
ic = omnipath.db.get_db('intercell')
ic
executed in 2m 55.47s, finished 15:43:27 2022-12-03
[11]:
<Intercell annotations: 301527 records about 48570 entities>
This object stores its data under the classes attribute. Classes are defined in pypath.core.intercell_annot.annot_combined_classes. In addition, we manually revised the more generic classes and excluded some proteins from them; these are listed in pypath.core.intercell_annot.excludes.
Each class has the following properties:
- name: all lowercase, human-readable name, without repeating the parent class (e.g. WNT receptors will be simply wnt, and the parent class will be receptor)
- parent: for a specific class the parent is the generic category it belongs to; for generic classes the name and parent are the same
- resource: the resource the data comes from, or OmniPath for composite classes (combined from multiple resources)
- scope: specific or generic; e.g. TGF ligand is specific, ligand is generic
- aspect: locational (e.g. plasma membrane) or functional (e.g. transporter)
Read more about the design of the intercell database in our paper.
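To make the role of these properties concrete, here is a minimal sketch in plain Python of how classes keyed by name, parent and resource can be filtered. The `ClassKey` tuple, the `toy_classes` dictionary and the `select_classes` helper are hypothetical names invented for this illustration, and the member sets are made up; pypath's own key type and selection logic may work differently.

```python
from collections import namedtuple

# Hypothetical stand-in for the keys of the class dictionary.
ClassKey = namedtuple('ClassKey', ['name', 'parent', 'resource'])

# Toy data: each key maps to a set of made-up member identifiers.
toy_classes = {
    ClassKey('receptor', 'receptor', 'OmniPath'): {'A', 'B', 'C'},
    ClassKey('gaba', 'receptor', 'HGNC'): {'A', 'B'},
    ClassKey('gaba', 'ligand', 'HGNC'): {'D'},
}

def select_classes(classes, name=None, parent=None, resource=None):
    """Return the entries whose key matches every criterion that is not None."""
    wanted = {'name': name, 'parent': parent, 'resource': resource}
    return {
        key: members
        for key, members in classes.items()
        if all(
            value is None or getattr(key, field) == value
            for field, value in wanted.items()
        )
    }

gaba_receptors = select_classes(toy_classes, name='gaba', parent='receptor')
print(gaba_receptors)
```

Because a generic class has the same name and parent, `select_classes(toy_classes, name='receptor', parent='receptor')` picks out the generic receptor class in this toy data.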
[55]:
ic.classes
executed in 0ms, finished 15:16:54 2022-12-02
[55]:
{AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_location'): <AnnotationGroup `transmembrane` from UniProt_location, 5150 elements>,
AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_topology'): <AnnotationGroup `transmembrane` from UniProt_topology, 5760 elements>,
AnnotDefKey(name='transmembrane', parent='transmembrane', resource='UniProt_keyword'): <AnnotationGroup `transmembrane` from UniProt_keyword, 7041 elements>,
AnnotDefKey(name='transmembrane', parent='transmembrane_predicted', resource='Phobius'): <AnnotationGroup `transmembrane` from Phobius, 6444 elements>,
AnnotDefKey(name='transmembrane_phobius', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_phobius` from Almen2009, 2072 elements>,
AnnotDefKey(name='transmembrane_sosui', parent='transmembrane_predicted', resource='Almen2009'): <AnnotationGroup `transmembrane_sosui` from Almen2009, 1663 elements>,
AnnotDefKey(name='trans
An easy way to access the classes is the select method. The AnnotationGroup objects behave as plain Python sets, and besides that, they feature many further attributes and methods.
[56]:
gaba_receptors = ic.select('gaba', parent = 'receptor')
gaba_receptors
executed in 0ms, finished 15:17:00 2022-12-02
[56]:
<AnnotationGroup `gaba` from HGNC, 40 elements>
[57]:
gaba_receptors.members
executed in 0ms, finished 15:17:02 2022-12-02
[57]:
{'A8MPY1',
Complex GABA-A receptor (GABRA1, GABRB2, GABRD): COMPLEX:O14764_P14867_P47870,
Complex GABA-A receptor, alpha-4/beta-3/delta: COMPLEX:O14764_P28472_P48169,
Complex GABA-A receptor, alpha-6/beta-3/delta: COMPLEX:O14764_P28472_Q16445,
Complex GABA-A receptor, alpha-4/beta-2/delta: COMPLEX:O14764_P47870_P48169,
Complex GABA-A receptor, alpha-6/beta-2/delta: COMPLEX:O14764_P47870_Q16445,
Complex GABBR1-GABBR2 complex: COMPLEX:O75899_Q9UBS5,
Complex: COMPLEX:P14867,
Complex GABA-A receptor, alpha-1/beta-3/gamma-2: COMPLEX:P14867_P18507_P28472,
Complex GABA-A receptor (GABRA1, GABRB2, GABRG2): COMPLEX:P14867_P18507_P47870,
Complex GABA-A receptor, alpha-5/beta-3/gamma-2: COMPLEX:P18507_P28472_P31644,
Complex GABA-A receptor, alpha-3/beta-3/gamma-2: COMPLEX:P18507_P28472_P34903,
Complex GABA-A receptor, alpha-2/beta-3/gamma-2: COMPLEX:P18507_P28472_P47869,
Complex GABA-A receptor, alpha-6/beta-3/gamma-2: COMPLEX:P18507_P28472_Q16445,
Complex: COMPLEX:P18507_Q8N1C3,
C
Build an intercellular communication network§
The intercell database can be connected to a Network
object to create an
intercellular communication network:
[58]:
cu = omnipath.db.get_db('curated')
ic.register_network(cu)
executed in 0ms, finished 15:17:08 2022-12-02
Quantitative overview of intercell annotations§
A data frame with basic statistics is available:
[13]:
ic.counts_df()
executed in 0ms, finished 15:45:17 2022-12-03
[13]:
category | parent | database | scope | aspect | source | consensus_score | transmitter | receiver | secreted | plasma_membrane_transmembrane | plasma_membrane_peripheral | n_uniprot | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | 6 | False | False | False | True | False | 5150 |
1 | transmembrane | transmembrane | UniProt_topology | generic | locational | resource_specific | 6 | False | False | False | True | False | 5760 |
2 | transmembrane | transmembrane | UniProt_keyword | generic | locational | resource_specific | 1 | False | False | False | False | False | 7041 |
3 | transmembrane | transmembrane_predicted | Phobius | generic | locational | resource_specific | 1 | False | False | False | False | False | 6444 |
4 | transmembrane_phobius | transmembrane_predicted | Almen2009 | generic | locational | resource_specific | 0 | False | False | False | True | False | 2072 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1120 | parin_adhesion_regulator | intracellular_intercellular_related | HGNC | specific | functional | resource_specific | 0 | True | False | False | False | False | 5 |
1121 | plakophilin_adhesion_regulator | intracellular_intercellular_related | HGNC | specific | functional | resource_specific | 0 | True | False | False | False | False | 3 |
1122 | actin_regulation_adhesome | intracellular_intercellular_related | Adhesome | specific | functional | resource_specific | 0 | True | False | False | False | False | 22 |
1123 | adhesion_cytoskeleton_adaptor | intracellular_intercellular_related | Adhesome | specific | functional | resource_specific | 0 | True | False | False | False | False | 118 |
1124 | intracellular_intercellular_related | intracellular_intercellular_related | OmniPath | generic | functional | composite | 0 | True | False | False | False | False | 291 |
1125 rows × 13 columns
Intercell database as data frame§
Just like the other databases, the object can be exported into a pandas.DataFrame
:
[14]:
ic.make_df()
ic.df[:10]
executed in 22.72s, finished 15:45:46 2022-12-03
[14]:
category | parent | database | scope | aspect | source | uniprot | genesymbol | entity_type | consensus_score | transmitter | receiver | secreted | plasma_membrane_transmembrane | plasma_membrane_peripheral | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q96JP9 | CDHR1 | protein | 6 | False | False | False | True | False |
1 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q9P126 | CLEC1B | protein | 8 | False | False | False | True | False |
2 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q13585 | GPR50 | protein | 6 | False | False | False | True | False |
3 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q8N9I0 | SYT2 | protein | 7 | False | False | False | False | False |
4 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | O43614 | HCRTR2 | protein | 6 | False | False | False | True | False |
5 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | A6NJY1 | SLC9B1P1 | protein | 4 | False | False | False | False | False |
6 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q5RI15 | COX20 | protein | 5 | False | False | False | False | False |
7 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q13948 | CUX1 | protein | 5 | False | False | False | False | False |
8 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q8NGK4 | OR52K1 | protein | 6 | False | False | False | False | False |
9 | transmembrane | transmembrane | UniProt_location | generic | locational | resource_specific | Q8IYS2 | KIAA2013 | protein | 7 | False | False | False | True | False |
Browse intercell categories§
Use the select
method
to access intercell classes:
[72]:
ic.select(definition = 'neurotensin', parent = 'receptor')
executed in 0ms, finished 15:27:15 2022-12-02
[72]:
<AnnotationGroup `neurotensin` from HGNC, 2 elements>
Proteins in each category can be listed with their descriptions from UniProt. Loading the UniProt datasheets for each protein is a slow process, so we don't recommend calling this method on more than a few dozen proteins.
[79]:
ic.show('neurotensin', parent = 'receptor')
executed in 1ms, finished 15:35:58 2022-12-02
=====> [2 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤═════════════╤══════════════╤════════════╤══════════════╕
│ No. │ ac │ genesymbol │ length │ weight │ full_name │ function_o │ keywords │ subcellula │
│ │ │ │ │ │ │ r_genecard │ │ r_location │
│ │ │ │ │ │ │ s │ │ │
╞═══════╪════════╪══════════════╪══════════╪══════════╪═════════════╪══════════════╪════════════╪══════════════╡
│ 1 │ O95665 │ NTSR2 │ 410 │ 45385 │ Neurotensi │ Receptor │ Cell │ Cell │
│ │ │ │ │ │ n receptor │ for the tr │ membrane, │ membrane; │
│ │ │ │ │ │ type 2 │ idecapepti │ Disulfide │ Multi-pass │
│ │ │ │ │ │ │
Gene Ontology§
pypath.utils.go is an almost standalone module for managing the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. terms and their relationships; the latter represents their assignment to gene products. Both provide many versatile methods for querying.
[80]:
from pypath.utils import go
goa = go.GOAnnotation()
executed in 1.26s, finished 15:36:46 2022-12-02
[81]:
goa.ontology # the GeneOntology object
executed in 0ms, finished 15:36:48 2022-12-02
[81]:
<pypath.utils.go.GeneOntology at 0x689946b55570>
[82]:
goa # the GOAnnotation object
executed in 0ms, finished 15:36:50 2022-12-02
[82]:
<pypath.utils.go.GOAnnotation at 0x68991cdc9b40>
Among many others, the most versatile method is select, which picks the annotated gene products by expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.
[83]:
query = """(cell surface OR
external side of plasma membrane OR
extracellular region) AND
(regulation of transmembrane transporter activity OR
channel regulator activity)"""
result = goa.select(query)
print(list(result)[:7])
executed in 0ms, finished 15:36:55 2022-12-02
['P21333', 'P80108', 'P62258', 'Q9NRX4', 'P54710', 'Q8NER1', 'P01303']
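The logic behind such queries maps directly onto set operations: AND is an intersection, OR is a union, and NOT is a difference from the set of all annotated products. The sketch below illustrates this on made-up data. The `toy_annotations` dictionary and its protein labels are hypothetical, and this is not how the select method is implemented in pypath.

```python
# Toy annotation table: GO term name -> set of made-up gene product IDs.
toy_annotations = {
    'cell surface': {'P1', 'P2', 'P3'},
    'extracellular region': {'P3', 'P4'},
    'channel regulator activity': {'P2', 'P4', 'P5'},
}

all_products = set().union(*toy_annotations.values())

# (cell surface OR extracellular region) AND channel regulator activity
location = toy_annotations['cell surface'] | toy_annotations['extracellular region']
result = location & toy_annotations['channel regulator activity']
print(sorted(result))  # ['P2', 'P4']

# NOT cell surface
not_surface = all_products - toy_annotations['cell surface']
print(sorted(not_surface))  # ['P4', 'P5']
```

Parentheses in a query correspond to the order in which these set operations are evaluated.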
[84]:
goa.ontology.get_all_descendants('GO:0005576')
executed in 0ms, finished 15:36:56 2022-12-02
[84]:
{'GO:0001507',
'GO:0001527',
'GO:0003351',
'GO:0003355',
'GO:0005201',
'GO:0005576',
'GO:0005577',
'GO:0005582',
'GO:0005583',
'GO:0005584',
'GO:0005585',
'GO:0005586',
'GO:0005587',
'GO:0005588',
'GO:0005590',
'GO:0005591',
'GO:0005592',
'GO:0005595',
'GO:0005596',
'GO:0005599',
'GO:0005601',
'GO:0005602',
'GO:0005604',
'GO:0005606',
'GO:0005607',
'GO:0005608',
'GO:0005609',
'GO:0005610',
'GO:0005611',
'GO:0005612',
'GO:0005614',
'GO:0005615',
'GO:0005616',
'GO:0006858',
'GO:0006859',
'GO:0006860',
'GO:0009519',
'GO:0010367',
'GO:0016914',
'GO:0016942',
'GO:0020003',
'GO:0020004',
'GO:0020005',
'GO:0020006',
'GO:0030020',
'GO:0030021',
'GO:0030023',
'GO:0030197',
'GO:0030345',
'GO:0030934',
'GO:0030935',
'GO:0030938',
'GO:0031012',
'GO:0031395',
'GO:0032311',
'GO:0032579',
'GO:0033165',
'GO:0033166',
'GO:0034358',
'GO:0034359',
'GO:0034360',
'GO:0034361',
'GO:0034362',
'GO:0034363',
'GO:0034364',
'GO:0034365',
'GO:00343
Protein complexes§
The pypath.core.complex module builds a non-redundant list of complexes from about 12 original resources. Complexes are unique with respect to their set of components, and optionally carry stoichiometry information. Homomultimers are also included, hence some complexes consist of only a single kind of protein. The database is an instance of the pypath.core.complex.ComplexAggregator class and is built by the pypath.core.complex.get_db function.
Warning: it is recommended to access databases through the database manager. Running the code below takes a very long time and does not save or reload the database; it builds a fresh copy each time.
[90]:
from pypath.core import complex
co = complex.get_db()
co.update_index()
co
executed in 0ms, finished 15:39:31 2022-12-02
[90]:
<Complex database: 28173 complexes>
To retrieve all complexes containing a specific protein, here MTOR:
[91]:
co.proteins['P42345']
executed in 0ms, finished 15:39:42 2022-12-02
[91]:
{Complex: COMPLEX:O00141_O15530_O75879_P23443_P34931_P42345_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9H672,
Complex: COMPLEX:O00141_O15530_P07900_P23443_P31749_P31751_P42345_P78527_Q05513_Q05655_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
Complex: COMPLEX:O00141_O15530_P0CG47_P0CG48_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q9BPZ7_Q9BVC4,
Complex: COMPLEX:O00141_O15530_P23443_P42345_Q15118_Q6R327_Q8N122_Q96BR1_Q96J02_Q9BPZ7_Q9BVC4,
Complex: COMPLEX:O00141_O75879_P0CG48_P23443_P34931_P42345_P62753_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
Complex: COMPLEX:O00141_P0CG48_P23443_P36894_P42345_P62942_P68106_Q15427_Q6R327_Q8N122_Q9BPZ7_Q9BVC4,
Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P46781_P62753_Q6R327_Q8N122_Q96KQ7_Q9BPZ7_Q9BVC4_Q9NY26,
Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_P62942_Q6R327_Q8N122_Q9BPZ7_Q9BVC4_Q9NY26,
Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q15172_Q6R327_Q8IW41_Q9BPZ7_Q9BVC4_Q9H672,
Complex: COMPLEX:O00141_P0CG48_P23443_P42345_P62753_Q6R327_Q70Z35_Q8N122_Q8TCU6_Q9BPZ7
Note that some of the complexes have human-readable names; these are preferred when printing, if available from any of the databases. Otherwise the complexes are labelled by COMPLEX:list-of-components.
Protein complex objects§
Take a closer look at one complex object. The hash of the complex object is equivalent to that of the string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.intera.Complex objects:
[97]:
cplex = co.complexes['COMPLEX:Q09472_Q92793']
cplex
executed in 0ms, finished 15:41:36 2022-12-02
[97]:
Complex CBP/p300: COMPLEX:Q09472_Q92793
[98]:
cplex.components # stoichiometry
executed in 0ms, finished 15:41:38 2022-12-02
[98]:
{'Q92793': 1, 'Q09472': 1}
[99]:
cplex.sources # resources
executed in 0ms, finished 15:41:39 2022-12-02
[99]:
{'Signor'}
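The string lookup described above relies on Python's data model: a dict finds an entry by a different key object when the two have the same hash and compare equal. A minimal sketch follows, with a hypothetical `ToyComplex` class written for this illustration. It is not the code of pypath's Complex class, only a demonstration of the mechanism.

```python
class ToyComplex:
    """Hypothetical stand-in: a complex identified by its sorted components."""

    def __init__(self, components):
        # Unique, alphabetically sorted component IDs.
        self.components = tuple(sorted(set(components)))

    def __str__(self):
        return 'COMPLEX:' + '_'.join(self.components)

    def __hash__(self):
        # Same hash as the plain string label.
        return hash(str(self))

    def __eq__(self, other):
        # Equal to another complex or to a string with the same label.
        return str(self) == str(other)


cbp_p300 = ToyComplex(['Q92793', 'Q09472'])
complexes = {cbp_p300: cbp_p300}

# The string key finds the object key: same hash, and they compare equal.
found = complexes['COMPLEX:Q09472_Q92793']
print(found is cbp_p300)  # True
```

Sorting the components in the constructor is what makes the label, and therefore the hash, independent of the order in which the components are given.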
Protein complex data frame§
The database can be exported into a pandas.DataFrame
:
[18]:
co.make_df()
co.df
executed in 3.40s, finished 15:47:16 2022-12-03
[18]:
name | components | components_genesymbols | stoichiometry | sources | references | identifiers | |
---|---|---|---|---|---|---|---|
0 | NFY | P23511_P25208_Q13952 | NFYA_NFYB_NFYC | 1:1:1 | CORUM;Compleat;PDB;Signor;ComplexPortal;hu.MAP... | 15243141;14755292;9372932 | Signor:SIGNOR-C1;CORUM:4478;Compleat:HC1449;in... |
1 | mTORC2 | P68104_P85299_Q6R327_Q8TB45_Q9BVC4 | DEPTOR_EEF1A1_MLST8_PRR5_RICTOR | 0:0:0:0:0 | Signor | Signor:SIGNOR-C2 | |
2 | mTORC1 | P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4 | AKT1S1_DEPTOR_MLST8_MTOR_RPTOR | 0:0:0:0:0 | Signor | Signor:SIGNOR-C3 | |
3 | SCF-betaTRCP | P63208_Q13616_Q9Y297 | BTRC_CUL1_SKP1 | 1:1:1 | CORUM;Compleat;Signor | 9990852 | Signor:SIGNOR-C5;CORUM:227;Compleat:HC757 |
4 | CBP/p300 | Q09472_Q92793 | CREBBP_EP300 | 0:0 | Signor | Signor:SIGNOR-C6 | |
... | ... | ... | ... | ... | ... | ... | ... |
28168 | Npnt complex 2 | Q5SZK8_Q6UXI9_Q86XX4 | FRAS1_FREM2_NPNT | 0:0:0 | CellChatDB | ||
28169 | NRP1_NRP2 | O14786_O60462_Q9Y4D7 | NRP1_NRP2_PLXND1 | 0:0:0 | CellChatDB | ||
28170 | NRP2_PLXNA2 | O60462_O75051 | NRP2_PLXNA2 | 0:0 | CellChatDB | ||
28171 | NRP2_PLXNA4 | O60462_Q9HCM2 | NRP2_PLXNA4 | 0:0 | CellChatDB | ||
28172 | PTCH2_SMO | Q99835_Q9Y6C5 | PTCH2_SMO | 0:0 | CellChatDB |
28173 rows × 7 columns
Saving datasets as pickles§
The large datasets above are compiled from many resources. Even if these are already available in
the cache, the data processing often takes longer than convenient, e.g. from a few minutes up to half
an hour. Most of the data integration objects in pypath
provide methods to save and
load their contents as pickle dumps. In fact, the database manager does this all the time, in a
coordinated way – for this reason, the methods below should be used only with good reason, and
relying on the database manager is preferred.
[ ]:
from pypath.core import annot, complex

# for `pypath.core.annot.AnnotationTable` objects
# (`a` is an already built AnnotationTable):
a.save_to_pickle('myannots.pickle')
a = annot.AnnotationTable(pickle_file = 'myannots.pickle')
# for `pypath.core.complex.ComplexAggregator` objects
# (`complexdb` is an already built ComplexAggregator):
complexdb.save_to_pickle('mycomplexes.pickle')
complexdb = complex.ComplexAggregator(pickle_file = 'mycomplexes.pickle')
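The save-and-reload pattern itself is generic. As an illustration outside of pypath, the sketch below wraps a slow build step so that its result is pickled on the first call and loaded from disk afterwards. `load_or_build`, `slow_build` and the file name are hypothetical; pypath's database manager takes care of this in its own way.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load an object from `path` if it exists, else build and pickle it."""
    if os.path.exists(path):
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    obj = build()
    with open(path, 'wb') as fp:
        pickle.dump(obj, fp)
    return obj

calls = []

def slow_build():
    # Placeholder for an expensive data integration step.
    calls.append(1)
    return {'COMPLEX:Q09472_Q92793': {'Q09472', 'Q92793'}}

with tempfile.TemporaryDirectory() as tmpdir:
    pickle_path = os.path.join(tmpdir, 'mydata.pickle')
    first = load_or_build(pickle_path, slow_build)   # builds and saves
    second = load_or_build(pickle_path, slow_build)  # loads from the pickle

print(first == second, len(calls))  # True 1
```

Pickles should only be loaded from sources you trust, and a pickle written with one version of a library may fail to load with another.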
Log messages and sessions§
In pypath all modules send messages to a log file named, by default, after the session ID (a random string of 5 characters). The default path to the log file is ./pypath_log/pypath-xxxxx.log, where xxxxx is the session ID.
Warning: The logger of pypath is really verbose, and the log files can grow huge: several tens of thousands of lines, a few MB. It is recommended to empty the pypath_log directories from time to time.
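One way to keep these directories in check is a small clean-up script. The sketch below deletes files matching `pypath-*.log` that are older than a given number of days. `prune_logs` is a hypothetical helper, not part of pypath; it assumes the default directory and file naming described above. The demonstration runs in a temporary directory, so no real logs are touched.

```python
import glob
import os
import tempfile
import time

def prune_logs(log_dir='pypath_log', max_age_days=30):
    """Delete pypath-*.log files in `log_dir` older than `max_age_days`."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in glob.glob(os.path.join(log_dir, 'pypath-*.log')):
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(path)
    return removed

# Demonstration with two empty files in a temporary directory.
with tempfile.TemporaryDirectory() as log_dir:
    old_log = os.path.join(log_dir, 'pypath-aaaaa.log')
    new_log = os.path.join(log_dir, 'pypath-bbbbb.log')
    for path in (old_log, new_log):
        open(path, 'w').close()
    # Pretend the first file was last modified 60 days ago.
    two_months_ago = time.time() - 60 * 86400
    os.utime(old_log, (two_months_ago, two_months_ago))
    removed = prune_logs(log_dir, max_age_days=30)
    remaining = sorted(os.listdir(log_dir))

print(len(removed), remaining)  # 1 ['pypath-bbbbb.log']
```

Calling `prune_logs()` without arguments would act on `pypath_log` in the current working directory.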
Basic info about the session§
The info
function
prints the most important information about the current session:
[100]:
import pypath
pypath.info()
executed in 0ms, finished 15:41:55 2022-12-02
[2022-12-02 16:41:55] [pypath]
- session ID: `l0n17`
- working directory: `/home/denes/pypath/notebooks`
- logfile: `/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log`
- pypath version: 0.14.31
Another function prints a disclaimer about licenses. Until recently this message was printed at every import. It is still important, but we removed it from the import because it can be annoying in certain situations.
[101]:
pypath.disclaimer()
executed in 0ms, finished 15:41:59 2022-12-02
=== d i s c l a i m e r ===
All data accessed through this module,
either as redistributed copy or downloaded using the
programmatic interfaces included in the present module,
are free to use at least for academic research or
education purposes.
Please be aware of the licenses of all the datasets
you use in your analysis, and please give appropriate
credits for the original sources when you publish your
results. To find out more about data sources please
look at `pypath/resources/data/resources.json` or
https://omnipathdb.org/info and
`pypath.resources.urls.urls`.
Read the log file§
Calling pypath.log opens the log file in the default console application for paginating text files (on GNU systems typically less):
[ ]:
pypath.log()
executed in 0ms, finished 15:42:08 2022-12-02
The logger and the log file are bound to the session (the 5 random characters are the session ID):
[104]:
pypath.session
executed in 0ms, finished 15:42:27 2022-12-02
[104]:
<Session l0n17>
The logger:
[105]:
pypath.session.log
executed in 0ms, finished 15:42:46 2022-12-02
[105]:
Logger [/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log]
The path to the log file:
[106]:
pypath.session.log.fname
executed in 0ms, finished 15:42:49 2022-12-02
[106]:
'/home/denes/pypath/notebooks/pypath_log/pypath-l0n17.log'
Logging to the console§
Each log message has a numeric priority level, and messages with lower level than a threshold are printed to the console. By default only important warnings are dispatched to the console. To log everything to the console, set the threshold to a large number:
[107]:
pypath.session.log.console_level = 10
from pypath.inputs import signor
si = signor.signor_interactions()
pypath.session.log.console_level = -1
executed in 0ms, finished 15:42:56 2022-12-02
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https://signor.uniroma2.it/download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file path: `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`
[2022-12-02 16:42:55] [curl] Cache file found, no need for download.
[2022-12-02 16:42:55] [curl] Opening plain text file `/home/denes/.pypath/cache/d7b8673e83e43a01c533f9de5a2b04b9-download_complexes.php`.
[2022-12-02 16:42:55] [curl] Creating Curl object to retrieve data from `https
Disable logging§
To avoid creation of a log file (and the directory pypath_log
) set the environment
variable PYPATH_LOG
or
the builtins.PYPATH_LOG
attribute:
[ ]:
# shell:
export PYPATH_LOG="/dev/null"
# then, start Python and use pypath
[108]:
import os
import builtins
builtins.PYPATH_LOG=os.devnull
import pypath
executed in 0ms, finished 15:43:10 2022-12-02
Write to the log§
Sending a single message§
First we change the console level so we can see the log messages. The label is optional. The priority of the message is given by the level. Notice that the second message won't be printed to the console, as its level is higher than 10:
[109]:
pypath.session.log.console_level = 10
pypath.session.log.msg('Greetings from the pypath tutorial notebook! :)', label = 'book')
pypath.session.log.msg('Not important, not shown on console but printed to the logfile.', level = 11)
executed in 0ms, finished 15:43:13 2022-12-02
[2022-12-02 16:43:13] [book] Greetings from the pypath tutorial notebook! :)
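The threshold rule can be summed up in a few lines. The class below is a toy model, not pypath's logger: `ToyLogger`, its attributes and the default message level of 0 are assumptions made for the illustration. Every message is recorded, and only those with a level not higher than `console_level` are also echoed to the console.

```python
class ToyLogger:
    """Toy model of level-based console filtering (not pypath's Logger)."""

    def __init__(self, console_level=-1):
        self.console_level = console_level
        self.file_lines = []      # stands in for the log file
        self.console_lines = []   # stands in for the console

    def msg(self, text, label='toy', level=0):
        line = '[%s] %s' % (label, text)
        # Every message is recorded in the "file".
        self.file_lines.append(line)
        # Only messages at or below the threshold reach the "console".
        if level <= self.console_level:
            self.console_lines.append(line)
            print(line)


log = ToyLogger(console_level=10)
log.msg('Shown on the console.', label='book')
log.msg('Only in the log file.', level=11)
```

With `console_level = -1` the toy logger prints nothing at all, because no message in this model has a level below zero.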
Connect a module or class to the pypath logger§
The preferred way of connecting to the logger is to make a class inherit from the Logger
class. Here the
name
will be the
default label for all messages coming from the instances of this class:
[110]:
from pypath.share import session

class ChildOfLogger(session.Logger):

    def __init__(self):
        session.Logger.__init__(self, name = 'child')

    def say_something(self):
        self._log('Have a nice day! :D')

col = ChildOfLogger()
col.say_something()
executed in 0ms, finished 15:43:17 2022-12-02
[2022-12-02 16:43:17] [child] Have a nice day! :D
Alternatively, a logger can be created anywhere and used from any module or function:
[111]:
from pypath.share import session
_logger = session.Logger(name = 'mylogger')
_log = _logger._log
_log('Message from a stray logger')
executed in 0ms, finished 15:43:20 2022-12-02
[2022-12-02 16:43:20] [mylogger] Message from a stray logger
Finally we just set the console level to a lower value, to avoid flooding the rest of this book with log messages:
[112]:
pypath.session.log.console_level = -1
executed in 0ms, finished 15:43:23 2022-12-02
BEL export§
Warning: This section hasn’t been thoroughly revised for a long time; some parts might be outdated or broken.
Biological Expression Language (BEL, https://bel-commons.scai.fraunhofer.de/) is a versatile description language to capture relationships between various biological entities, spanning a wide range of the levels of biological organization. pypath has a dedicated module to convert the network and the enzyme-substrate interactions to BEL format:
[ ]:
from pypath.legacy import main
from pypath.resources import data_formats
from pypath.omnipath import bel
[ ]:
pa = main.PyPath()
pa.init_network(data_formats.pathway)
You can provide one or more resources to the Bel
class. Supported resources
currently are pypath.main.PyPath
and pypath.ptm.PtmAggregator
.
[ ]:
b = bel.Bel(resource = pa)
From the resources we compile a BELGraph
object which provides a Python interface for various operations and you
can also export the data in BEL format:
[ ]:
b.main()
[ ]:
b.bel_graph
[ ]:
b.bel_graph.summarize()
[ ]:
b.export_relationships('omnipath_pathways.bel')
[ ]:
with open('omnipath_pathways.bel', 'r') as fp:
    bel_str = fp.read()
[ ]:
print(bel_str[:333])
CellPhoneDB export§
Warning: This section hasn’t been thoroughly revised for a long time; some parts might be outdated or broken.
CellPhoneDB is a statistical method and a database for inferring inter-cellular communication
pathways between specific cell types from single-cell data. OmniPath/pypath uses CellPhoneDB as a
resource for interaction, protein complex and annotation data. Apart from this, pypath is able to
export its data in the appropriate format to provide input for the CellPhoneDB Python module. For
this you can use the pypath.omnipath.cellphonedb
module:
[ ]:
from pypath.omnipath import cellphonedb
from pypath.share import settings
settings.setup(network_expand_complexes = False)
Here you can provide parameters for the network or provide an already built network. Also you can provide the datasets as pickles to make them load really fast. Otherwise this step will take quite long.
[ ]:
c = cellphonedb.CellPhoneDB()
You can access each of the CellPhoneDB input files as a pandas.DataFrame, and they have also been exported to csv files. For example, interaction_input.csv contains interactions from all the resources used for building the network (here Signor, SignaLink, etc.):
[ ]:
c.interaction_dataframe[:10]
The proteins and complexes are annotated (transmembrane, peripheral, secreted, etc.) using data
from the pypath.core.intercell
module (identical to the http://omnipathdb.org/intercell query of the web service):
[ ]:
c.protein_dataframe[:10]
The legacy igraph-based network object§
Warning: This section hasn’t been thoroughly revised for a long time; some parts might be outdated or broken.
Until about 2019 (before pypath version 0.9) pypath organized all its data structures around an igraph.Graph object (igraph.org). This legacy API is still present in pypath.legacy.main, however it is not maintained. This section of the book is still here, but will be removed soon, along with the legacy module.
[43]:
from pypath.legacy import main
No module `cairo` available.
Some plotting functionalities won't be accessible.
[ ]:
pa = main.PyPath()
#pa.load_omnipath() # This is commented out because it takes > 1h
# to run it for the first time due to the vast
# amount of data download.
# Once you populated the cache it still takes
# approx. 30 min to build the entire OmniPath
# as the process consists of quite some data
# processing. If you dump it in a pickle, you
# can load the network in < 1 min
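The comment above suggests dumping the built network into a pickle to avoid the slow build. A minimal sketch of that "build once, then reload" pattern with the standard pickle module follows. The helper load_or_build and the stand-in build_network are hypothetical names used only for illustration; whether the legacy PyPath object pickles cleanly, and whether pypath offers its own helper for this, is not covered here.

```python
import os
import pickle
import tempfile


def load_or_build(pickle_path, build):
    """Load an object from `pickle_path` if it exists, else build and save it."""
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as fp:
            return pickle.load(fp)
    obj = build()
    with open(pickle_path, 'wb') as fp:
        pickle.dump(obj, fp)
    return obj


def build_network():
    # Hypothetical stand-in for the slow step, e.g. creating a PyPath
    # object and calling pa.load_omnipath() on it.
    return {'edges': [('EGF', 'EGFR')]}


cache_file = os.path.join(tempfile.mkdtemp(), 'network.pickle')
first = load_or_build(cache_file, build_network)   # builds and writes the pickle
second = load_or_build(cache_file, build_network)  # reads the pickle back
```

Only load pickles that you created yourself, as unpickling executes code stored in the file.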
I just want a network quickly and play around with pypath§
You can find the predefined formats in the pypath.resources.network
module. For
example, to load one resource from there, let’s say SIGNOR:
[ ]:
from pypath.legacy import main
from pypath.resources import network as netres
pa = main.PyPath()
pa.load_resources({'signor': netres.pathway['signor']})
Or to load all activity flow resources with literature references:
[ ]:
from pypath.legacy import main
from pypath.resources import network as netres
[ ]:
pa = main.PyPath()
pa.init_network(netres.pathway)
Or to load all activity flow resources, including the ones without literature references:
[ ]:
pa = main.PyPath()
pa.init_network(netres.pathway_all)
How do I build networks from any data with pypath?§
Here we show how to build a network from your own files. The advantage of building a network with pypath is that you don’t need to worry about merging redundant elements or about different formats and identifiers. Let’s say you have two files with network data:
network1.csv
entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
network2.sif
EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
Note: you need to create these files in order to load them.
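One way to create the two files is a few lines of Python, sketched below. The contents are copied from the listings above; the helper name write_example_files is an assumption made for this example.

```python
from pathlib import Path

# File contents copied from the listings above.
NETWORK1_CSV = """\
entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
"""

NETWORK2_SIF = """\
EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
"""


def write_example_files(directory = '.'):
    """Write network1.csv and network2.sif into `directory`; return their paths."""
    directory = Path(directory)
    directory.mkdir(parents = True, exist_ok = True)
    contents = {'network1.csv': NETWORK1_CSV, 'network2.sif': NETWORK2_SIF}
    paths = {}
    for fname, content in contents.items():
        paths[fname] = directory / fname
        paths[fname].write_text(content)
    return paths


# Writes both files into the current working directory.
paths = write_example_files()
```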
Defining input formats§
[ ]:
import pypath
import pypath.internals.input_formats as input_formats
# network1.csv: comma separated, Entrez Gene IDs, effect in the third column
input1 = input_formats.ReadSettings(
    name = 'egf1',
    input = 'network1.csv',
    header = True,
    separator = ',',
    id_col_a = 0,
    id_col_b = 1,
    id_type_a = 'entrez',
    id_type_b = 'entrez',
    sign = (2, 'stimulation', 'inhibition'),
    ncbi_tax_id = 9606,
)
# network2.sif: space separated, Gene Symbols, sign in the middle column
input2 = input_formats.ReadSettings(
    name = 'egf2',
    input = 'network2.sif',
    separator = ' ',
    id_col_a = 0,
    id_col_b = 2,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
    sign = (1, '+', '-'),
    ncbi_tax_id = 9606,
)
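To make the meaning of these settings more concrete, here is a plain Python sketch that reads the same two formats without pypath. It assumes that the sign tuple means (column index, value for stimulation, value for inhibition), which is how the two definitions above appear to use it. The function read_edges is hypothetical and is not how pypath processes its inputs; in particular, it does no identifier translation.

```python
import csv
import io


def read_edges(text, separator, id_col_a, id_col_b, sign, header = False):
    """
    Parse delimited text into (id_a, id_b, effect) tuples.

    `sign` is assumed to be (column index, stimulation value, inhibition
    value); effect is 1, -1, or 0 if the value is not recognized.
    """
    sign_col, positive, negative = sign
    effects = {positive: 1, negative: -1}
    rows = csv.reader(io.StringIO(text), delimiter = separator)
    if header:
        next(rows)  # skip the column names
    edges = []
    for row in rows:
        if not row:
            continue  # ignore empty lines
        effect = effects.get(row[sign_col], 0)
        edges.append((row[id_col_a], row[id_col_b], effect))
    return edges


# Two lines from each of the example files above.
csv_text = 'entrezA,entrezB,effect\n1950,1956,inhibition\n5290,207,stimulation\n'
sif_text = 'EGF + EGFR\nAKT1 - GSK3B\n'

edges1 = read_edges(
    csv_text, ',', 0, 1, (2, 'stimulation', 'inhibition'), header = True
)
edges2 = read_edges(sif_text, ' ', 0, 2, (1, '+', '-'))
print(edges1)
print(edges2)
```

The two results use different identifier types (Entrez Gene IDs and Gene Symbols), which is why the id_type_a and id_type_b settings are needed before the two inputs can be merged.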
Creating PyPath object and loading the 2 test files§
[ ]:
inputs = {
    'egf1': input1,
    'egf2': input2,
}
pa = main.PyPath()
pa.reload()
pa.init_network(lst = inputs)
Structure of the legacy network object§
[ ]:
from pypath.legacy import main as legacy
pa = legacy.PyPath()
[ ]:
pa.graph
Number of edges and nodes:
[ ]:
pa.ecount, pa.vcount
You can access the edge and vertex sequences in the es and vs attributes of the graph; you can iterate these or index them by integers. Edge and vertex attributes are accessible by string keys. For example, to get the sources of edge 81:
[ ]:
pa.graph.es[81]['sources']
Directions and signs§
By default the igraph object is undirected, but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and create it automatically, but you can also create it manually:
[ ]:
pa.get_directed()
You find the directed network in the pa.dgraph
attribute:
[ ]:
pa.dgraph
Now let’s take a look at the pypath.main.Direction objects, which contain details about directions and signs. First, as an example, select a random edge:
[ ]:
edge = pa.graph.es[3241]
The Direction
object is in the dirs
edge attribute:
[ ]:
d = edge['dirs']
It has a method to print its content in a human-readable way:
[ ]:
print(pa.graph.es[3241]['dirs'])
From this we see that the databases phosphoELM and Signor agree that protein P17252 has an effect on Q15139, and Signor additionally tells us that this effect is stimulatory. In your scripts you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions straight and reverse:
[ ]:
d.straight
[ ]:
d.reverse
It can tell you if one of these directions is supported by any of the network resources:
[ ]:
d.get_dir(d.straight)
Or it can return those resources:
[ ]:
d.get_dir(d.straight, sources = True)
The opposite direction is not supported by any resource:
[ ]:
d.get_dir(d.reverse, sources = True)
The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.
[ ]:
d.get_sign(d.straight)
Or you can ask whether it is inhibition:
[ ]:
d.is_inhibition(d.straight)
Or if the interaction is directed at all:
[ ]:
d.is_directed()
Sometimes resources don’t agree: for example, one says an interaction is an inhibition while according to others it is a stimulation; or one says A affects B and another resource says the other way around. Here we preserve all this potentially contradictory information in the Direction object, and in the end you decide what to do with it depending on your purpose. If you want to get rid of the ambiguity, there is a method that returns a consensus direction and sign, i.e. the attributes that most resources agree on:
[ ]:
d.consensus_edges()
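The idea of keeping all evidence and deriving a consensus only on demand can be illustrated with a small toy class. This is a simplified sketch, not the implementation of the Direction class: the name EdgeEvidence, its methods and the tie-breaking rules are assumptions made for this example, and the resource names are placeholders.

```python
from collections import defaultdict


class EdgeEvidence:
    """Toy stand-in for a Direction-like object; not pypath's implementation."""

    def __init__(self, id_a, id_b):
        self.straight = (id_a, id_b)
        self.reverse = (id_b, id_a)
        # direction -> set of resources supporting that direction
        self.dir_sources = defaultdict(set)
        # (direction, sign) -> set of resources; sign is +1 or -1
        self.sign_sources = defaultdict(set)

    def add(self, direction, resource, sign = None):
        """Record that `resource` supports `direction`, optionally with a sign."""
        self.dir_sources[direction].add(resource)
        if sign is not None:
            self.sign_sources[(direction, sign)].add(resource)

    def consensus(self):
        """
        Return (direction, sign) backed by the most resources, or None if no
        direction is known. Ties between directions go to `straight`; tied
        or missing signs give sign None.
        """
        candidates = [
            d for d in (self.straight, self.reverse) if self.dir_sources[d]
        ]
        if not candidates:
            return None
        direction = max(candidates, key = lambda d: len(self.dir_sources[d]))
        n_pos = len(self.sign_sources[(direction, 1)])
        n_neg = len(self.sign_sources[(direction, -1)])
        sign = None if n_pos == n_neg else (1 if n_pos > n_neg else -1)
        return direction, sign


ev = EdgeEvidence('A', 'B')
ev.add(ev.straight, 'resource1', sign = 1)
ev.add(ev.straight, 'resource2', sign = 1)
ev.add(ev.straight, 'resource3', sign = -1)  # contradicts the sign
ev.add(ev.reverse, 'resource3')              # contradicts the direction
print(ev.consensus())
```

All contradicting records stay in the object; only the consensus call reduces them to a single direction and sign.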
Accessing nodes in the network§
In igraph the vertices are numbered, but this numbering can change during certain operations. Instead, we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.
[ ]:
pa.graph.vs['name'][:5]
[ ]:
pa.graph.vs['label'][:5]
The PyPath
object
offers a number of helper methods to access the nodes by their names. For example, uniprot
or up
returns the igraph.Vertex
for a UniProt
ID:
[ ]:
type(pa.up('P00533'))
Similarly genesymbol
or gs
for Gene Symbols:
[ ]:
type(pa.gs('ESR1'))
Each of these has a “plural” version:
[ ]:
len(list(pa.gss(['MTOR', 'ATG16L2', 'ULK1'])))
And a generic method where you can mix UniProts and Gene Symbols:
[ ]:
len(list(pa.proteins(['MTOR', 'P00533'])))
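How such helpers might resolve mixed identifiers can be sketched with two lookup tables, one keyed by the name attribute (UniProt ID) and one by the label attribute (Gene Symbol). This is an illustration only: the functions resolve and resolve_many are hypothetical, and the toy vertex table is not taken from a pypath network.

```python
# Toy vertex table: vertex index -> (UniProt ID, Gene Symbol)
vertices = [
    ('P00533', 'EGFR'),
    ('P42345', 'MTOR'),
    ('P03372', 'ESR1'),
]

# Lookup tables from identifier to vertex index.
by_name = {uniprot: i for i, (uniprot, symbol) in enumerate(vertices)}
by_label = {symbol: i for i, (uniprot, symbol) in enumerate(vertices)}


def resolve(identifier):
    """Return the vertex index for a UniProt ID or Gene Symbol, or None."""
    if identifier in by_name:
        return by_name[identifier]
    return by_label.get(identifier)


def resolve_many(identifiers):
    """Yield vertex indices for the identifiers found in the table."""
    for identifier in identifiers:
        index = resolve(identifier)
        if index is not None:
            yield index


print(list(resolve_many(['MTOR', 'P00533', 'UNKNOWN'])))
```

Because the lookup goes through stable identifiers, the calling code does not depend on the vertex numbering, as long as the lookup tables are rebuilt after the graph changes.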
Querying relationships with or without causality§
Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by another one, and which others are its regulators. The example below returns the nodes that PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:
[ ]:
pa.gs_stimulated_by('PIK3CA')
It returns a so-called _NamedVertexSeq object, from which you can get a series of igraph.Vertex objects, Gene Symbols or UniProt IDs:
[ ]:
list(pa.gs_stimulated_by('PIK3CA').gs())[:5]
[ ]:
list(pa.gs_stimulated_by('PIK3CA').up())[:5]
Note that the names of these methods can be a bit counterintuitive: for example, gs_stimulates returns the genes stimulated by PIK3CA:
[ ]:
list(pa.gs_stimulates('PIK3CA').gs())[:5]
[ ]:
'PIK3CA' in set(pa.affected_by('AKT1').gs())
There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either kind of identifier; neighbors does not consider the direction. At the end, .gs() converts the result to Gene Symbols, .up() to UniProt IDs and .ids() to vertex IDs; by default the result yields igraph.Vertex objects:
[ ]:
list(pa.neighbors('AKT1').ids())[:5]
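The difference between these query types can be shown on a toy signed, directed edge list, here taken from the network2.sif example earlier in this section. The sketch assumes each row points from the first to the second protein. The functions targets, regulators and neighbors are hypothetical and only mimic the semantics described above; they are not the pypath methods.

```python
# (source, target, sign): +1 is stimulation, -1 is inhibition.
# Assumption: each row of network2.sif is directed from left to right.
EDGES = [
    ('EGF', 'EGFR', 1), ('EGFR', 'PIK3CA', 1), ('EGFR', 'SOS1', 1),
    ('PIK3CA', 'RAC1', 1), ('RAC1', 'MAP3K1', 1), ('SOS1', 'HRAS', 1),
    ('HRAS', 'MAP3K1', 1), ('PIK3CA', 'AKT1', 1), ('AKT1', 'GSK3B', -1),
]


def targets(node, sign = None):
    """Nodes that `node` acts on; with `sign`, only edges of that sign."""
    return sorted(t for s, t, sg in EDGES if s == node and sign in (None, sg))


def regulators(node, sign = None):
    """Nodes acting on `node`; with `sign`, only edges of that sign."""
    return sorted(s for s, t, sg in EDGES if t == node and sign in (None, sg))


def neighbors(node):
    """Adjacent nodes, ignoring both direction and sign."""
    return sorted(set(targets(node)) | set(regulators(node)))


print(targets('PIK3CA', sign = 1))     # like "stimulates"
print(regulators('PIK3CA', sign = 1))  # like "stimulated_by"
print(regulators('GSK3B', sign = -1))  # like "inhibited_by"
print(regulators('AKT1'))              # like "affected_by"
print(neighbors('PIK3CA'))             # like "neighbors"
```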
Finally, the neighborhood methods return the indirect neighborhood within a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):
[ ]:
print(list(pa.neighborhood('ATG3', 1).gs()))
[ ]:
print(list(pa.neighborhood('ATG3', 2).gs()))
[ ]:
len(list(pa.neighborhood('ATG3', 3).gs()))
[ ]:
len(list(pa.neighborhood('ATG3', 4).gs()))
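The growth of the neighborhood with the number of steps can be reproduced with a breadth-first search on any graph. The sketch below uses the small network2.sif example from earlier in this section, ignoring directions. The neighborhood function here is a hypothetical plain Python helper, not the pypath method, and in this sketch the result includes the starting node itself.

```python
from collections import deque

# Undirected pairs from the network2.sif example.
PAIRS = [
    ('EGF', 'EGFR'), ('EGFR', 'PIK3CA'), ('EGFR', 'SOS1'),
    ('PIK3CA', 'RAC1'), ('RAC1', 'MAP3K1'), ('SOS1', 'HRAS'),
    ('HRAS', 'MAP3K1'), ('PIK3CA', 'AKT1'), ('AKT1', 'GSK3B'),
]

# Build an adjacency mapping: node -> set of adjacent nodes.
adjacency = {}
for a, b in PAIRS:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)


def neighborhood(adjacency, start, steps):
    """Nodes reachable from `start` within `steps` edges, including `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == steps:
            continue  # do not expand beyond the requested distance
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen


sizes = [len(neighborhood(adjacency, 'EGF', n)) for n in (1, 2, 3, 4)]
print(sizes)
```

In this 9-node toy network the whole graph is covered after four steps; in a network with thousands of nodes, a few steps are usually enough to reach a large part of the network.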
Accessing edges by identifiers§
Just like nodes, edges can also be accessed by identifiers such as Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.
[ ]:
type(pa.get_edge('EGF', 'EGFR'))
[ ]:
type(pa.get_edge('EGF', 'P00533'))
[ ]:
type(pa.get_edge('EGF', 'AKT1'))
[ ]:
print(pa.get_edge('EGF', 'EGFR')['dirs'])
Literature references§
Select a random edge and in the references
attribute you find a list of references:
[ ]:
edge = pa.get_edge( 'MAP1LC3B', 'SQSTM1')
edge['references']
Each reference has a PubMed ID:
[ ]:
edge['references'][0].pmid
[ ]:
edge['references'][0].open()
These 3 references come from 3 different databases, but there must be 2 overlaps between them:
[ ]:
edge['refs_by_source']
Plotting the network with igraph§
Here we use the network created above, because it has a reasonable size, unlike the networks we could get from most of the network databases. Igraph has excellent plotting abilities built on top of the cairo library.
[ ]:
import igraph
plot = igraph.plot(
    pa.graph,
    target = 'egf_network.png',
    edge_width = 0.3,
    edge_color = '#777777',
    vertex_color = '#97BE73',
    vertex_frame_width = 0,
    vertex_size = 70.0,
    vertex_label_size = 15,
    vertex_label_color = '#FFFFFF',
    # due to a bug in either igraph or IPython,
    # vertex labels are not visible on inline plots:
    inline = False,
    margin = 120,
)
from IPython.display import Image
Image(filename = 'egf_network.png')