Search & lookup terms#
Entities and ontologies can be complex with many different identifiers.
Here we show Bionty’s lookup model for organisms, genes, proteins and cell markers. You’ll see how to
access the reference table via .df()
look up an entity term with autocompletion via .lookup()
search for an entity term with free text via .search()
import bionty as bt
.fields: fields of an ontology reference#
gene_bt = bt.Gene()
gene_bt
Gene
Organism: human
Source: ensembl, release-110
#terms: 75719
📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
✅ Gene.validate(): strictly validate values
🧐 Gene.inspect(): full inspection of values
👽 Gene.standardize(): convert to standardized names
🪜 Gene.diff(): difference between two versions
🔗 Gene.ontology: Pronto.Ontology object
gene_bt.fields
{'biotype',
'description',
'ensembl_gene_id',
'ncbi_gene_id',
'symbol',
'synonyms'}
Fields can be accessed as attributes for autocompletion:
(You can pass them to the field parameter in any bionty function instead of strings.)
gene_bt.ncbi_gene_id
ncbi_gene_id
.df(): reference table#
Data scientists love DataFrames, and every entity has a reference table containing all the fields.
df = gene_bt.df()
df.head()
 | ensembl_gene_id | symbol | ncbi_gene_id | biotype | description | synonyms |
---|---|---|---|---|---|---|
0 | ENSG00000000003 | TSPAN6 | 7105 | protein_coding | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | T245|TSPAN-6|TM4SF6 |
1 | ENSG00000000005 | TNMD | 64102 | protein_coding | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | TEM|CHM1L|BRICD4|MYODULIN|TENDIN |
2 | ENSG00000000419 | DPM1 | 8813 | protein_coding | dolichyl-phosphate mannosyltransferase subunit... | MPDS|CDGIE |
3 | ENSG00000000457 | SCYL3 | 57147 | protein_coding | SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... | PACE-1|PACE1 |
4 | ENSG00000000460 | C1orf112 | 55732 | protein_coding | chromosome 1 open reading frame 112 [Source:HG... | APOLO1|FLIP|FLJ10706 |
To access the information for several gene symbols at once, for example, select the corresponding rows with pandas:
df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
symbol | ensembl_gene_id | ncbi_gene_id | biotype | description | synonyms |
---|---|---|---|---|---|
LMNA | ENSG00000160789 | 4000 | protein_coding | lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636] | MADA|LMNL1|HGPS|CMD1A|PRO1|LMN1|LGMD1B |
TCF7 | ENSG00000081059 | 6932 | protein_coding | transcription factor 7 [Source:HGNC Symbol;Acc... | TCF-1 |
BRCA1 | ENSG00000012048 | 672 | protein_coding | BRCA1 DNA repair associated [Source:HGNC Symbo... | PPP1R53|FANCS|RNF53|BRCC1 |
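The synonyms column packs several aliases into one pipe-delimited string, so resolving an alias back to its official symbol takes one extra step. The sketch below is a plain-Python stand-in that runs without bionty or pandas and uses three rows copied from the tables above. The helper name `build_synonym_index` is hypothetical and not part of Bionty.

```python
# Rows copied from the reference tables above: (symbol, pipe-delimited synonyms).
rows = [
    ("TSPAN6", "T245|TSPAN-6|TM4SF6"),
    ("TNMD", "TEM|CHM1L|BRICD4|MYODULIN|TENDIN"),
    ("TCF7", "TCF-1"),
]


def build_synonym_index(rows):
    """Map every alias, and each symbol itself, to the official symbol.

    Hypothetical helper for illustration; not a Bionty function.
    """
    index = {}
    for symbol, synonyms in rows:
        index[symbol] = symbol
        for alias in synonyms.split("|"):
            # If two genes share an alias, keep the first one encountered.
            index.setdefault(alias, symbol)
    return index


synonym_index = build_synonym_index(rows)
print(synonym_index["TCF-1"])   # TCF7
print(synonym_index["TENDIN"])  # TNMD
```

With a real reference table, the same split could be applied to the synonyms column of the DataFrame returned by .df().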
.lookup(): Lookup terms and records with autocompletion#
Terms can be searched with auto-complete using a lookup object.
lookup = gene_bt.lookup()
A dot accessor is provided for normalized terms (lowercase, containing only alphanumeric characters and underscores):
lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
To look up the exact original strings, convert the lookup object to a dict and use the bracket [] accessor for autocompletion:
lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
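To make the difference between the two accessors concrete, here is a minimal sketch of how normalized keys could be derived. The `normalize_key` function and its exact rule are assumptions inferred from the description above (lowercase, alphanumeric characters and underscores only); Bionty's own normalization may differ in detail.

```python
import re
from types import SimpleNamespace


def normalize_key(term):
    """Lowercase a term and replace each run of other characters with "_".

    Assumed rule for illustration; not Bionty's actual implementation.
    """
    return re.sub(r"[^a-z0-9]+", "_", term.lower()).strip("_")


# Toy records keyed by symbol; the values are the Ensembl IDs shown above.
records = {"TCF7": "ENSG00000081059", "LMNA": "ENSG00000160789"}

# Dot access needs valid attribute names, so keys are normalized first.
lookup = SimpleNamespace(**{normalize_key(k): v for k, v in records.items()})

print(lookup.tcf7)             # ENSG00000081059
print(records["TCF7"])         # the dict keeps the exact original string
print(normalize_key("TCF-1"))  # tcf_1
```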
By default, the name field is used to generate lookup keys (for Gene, which has no name field, the example above is keyed by symbol).
You can specify another field to look up:
lookup = gene_bt.lookup(gene_bt.ncbi_gene_id)
If multiple entries are matched, they are returned as a list:
lookup.bt_100126572
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
lookup_dict = lookup.dict()
lookup_dict["100126572"]
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
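Two details of the example above are easy to miss: the attribute is `bt_100126572` rather than `100126572`, since a Python attribute name cannot start with a digit, and a key that matches several entries yields a list. The following sketch imitates both behaviours with toy data. The `attribute_key` helper is hypothetical, the prefix rule is inferred from the example rather than documented, and the two entries sharing the ID "000" are invented purely to produce a collision.

```python
from collections import defaultdict
from types import SimpleNamespace

# (ncbi_gene_id, symbol) pairs. The first pair comes from the example above;
# the two "000" entries are invented to demonstrate a key collision.
pairs = [
    ("100126572", "GJE1"),
    ("000", "TOY_GENE_A"),
    ("000", "TOY_GENE_B"),
]


def attribute_key(key, prefix="bt_"):
    """Prefix keys that start with a digit so they are valid attribute names.

    The "bt_" prefix mirrors lookup.bt_100126572 above; the rule itself is
    an assumption, not Bionty's documented behaviour.
    """
    return prefix + key if key[:1].isdigit() else key


grouped = defaultdict(list)
for key, symbol in pairs:
    grouped[key].append(symbol)

# A single match is returned as-is; several matches are returned as a list.
lookup_dict = {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}
lookup = SimpleNamespace(**{attribute_key(k): v for k, v in lookup_dict.items()})

print(lookup.bt_100126572)  # GJE1
print(lookup_dict["000"])   # ['TOY_GENE_A', 'TOY_GENE_B']
```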
.search(): Search a term against a field#
celltype_bt = bt.CellType()
Matching scores are stored in the __ratio__ column:
celltype_bt.search("cytotoxic T cells").head(3)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
cytotoxic T cell | CL:0000910 | A Mature T Cell That Differentiated And Acquir... | cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... | [CL:0000911] | 96.969697 |
cell | CL:0000000 | A Material Entity Of Anatomical Origin (Part O... | None | [] | 90.000000 |
T cell | CL:0000084 | A Type Of Lymphocyte Whose Defining Characteri... | T-lymphocyte|T-cell|T lymphocyte | [CL:0000542] | 90.000000 |
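To give a feel for what a similarity score on a 0 to 100 scale means, the sketch below ranks three of the names above using Python's standard-library `difflib`. This is a stand-in chosen so the example runs without extra dependencies; it is not claimed to be the scorer Bionty uses.

```python
from difflib import SequenceMatcher


def ratio(query, candidate):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


names = ["cell", "T cell", "cytotoxic T cell"]
query = "cytotoxic T cells"

# Sort best match first, as in the result table above.
ranked = sorted(((ratio(query, n), n) for n in names), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

For the top hit, 16 of the 33 characters in the two strings pair up, giving 2 × 16 / 33 ≈ 96.97, the same value as in the table. The lower-ranked names score well below the 90.0 shown there, which suggests the actual scorer treats partial matches differently.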
By default, search also matches against each of the synonyms:
celltype_bt.search("P cell").head(3)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
nodal myocyte | CL:0002072 | A Specialized Cardiac Myocyte In The Sinoatria... | myocytus nodalis|P cell|cardiac pacemaker cell | [CL:0002086] | 100.000000 |
pigmented ciliary epithelial cell | CL:0002303 | A Cell That Is Part Of Pigmented Ciliary Epith... | PE cell | [CL:0000529] | 92.307692 |
double-positive, alpha-beta thymocyte | CL:0000809 | A Thymocyte Expressing The Alpha-Beta T Cell R... | DP cell|DP thymocyte|double-positive, alpha-be... | [CL:0000790] | 92.307692 |
You can turn off synonym matching with synonyms_field=None:
celltype_bt.search("P cell", synonyms_field=None).head(3)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0000167, CL:0000164] | 92.307692 |
cell | CL:0000000 | A Material Entity Of Anatomical Origin (Part O... | None | [] | 90.000000 |
pancreatic PP cell | CL:0002275 | A Pp Cell Located In The Islets Of The Pancreas. | PP-cell of pancreatic islet|pancreatic polypep... | [] | 90.000000 |
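The effect of synonym matching on the ranking can be reproduced in a few lines. The sketch below scores the query against the name and, optionally, each pipe-separated synonym, keeping the best score per entry. The `search` helper and its `use_synonyms` flag are hypothetical stand-ins for the synonyms_field parameter, `difflib` again replaces whatever scorer Bionty uses, and the two entries are copied from the tables above.

```python
from difflib import SequenceMatcher

# name -> pipe-delimited synonyms, copied from the tables above.
cell_types = {
    "nodal myocyte": "myocytus nodalis|P cell|cardiac pacemaker cell",
    "PP cell": "type F enteroendocrine cell",
}


def ratio(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query, use_synonyms=True):
    """Rank names by the best score over the name and, optionally, synonyms.

    Hypothetical helper; use_synonyms=False plays the role of
    synonyms_field=None in the call above.
    """
    results = []
    for name, synonyms in cell_types.items():
        candidates = [name]
        if use_synonyms:
            candidates += synonyms.split("|")
        results.append((max(ratio(query, c) for c in candidates), name))
    return sorted(results, reverse=True)


print(search("P cell")[0])                      # nodal myocyte, via its synonym
print(search("P cell", use_synonyms=False)[0])  # PP cell, on the name alone
```

With synonyms enabled, "nodal myocyte" wins because one of its synonyms is an exact match; with names only, "PP cell" comes first, mirroring the two tables above.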
Match against another field (default is “name”):
celltype_bt.search("CD8 postive alpha beta T cells", field=celltype_bt.definition).head(
3
)
definition | ontology_id | name | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor. | CL:0000625 | CD8-positive, alpha-beta T cell | CD8-positive, alpha-beta T lymphocyte|CD8-posi... | [CL:0000791] | 95.081967 |
A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor. | CL:0000624 | CD4-positive, alpha-beta T cell | CD4-positive, alpha-beta T lymphocyte|CD4-posi... | [CL:0000791] | 91.803279 |
A T Cell That Expresses An Alpha-Beta T Cell Receptor Complex. | CL:0000789 | alpha-beta T cell | alpha-beta T-cell|alpha-beta T-lymphocyte|alph... | [CL:0000084] | 90.000000 |