Search & lookup terms

Entities and ontologies can be complex, and a single term often carries many different identifiers.

Here we show Bionty’s lookup model for organisms, genes, proteins, and cell markers. You’ll see how to:

  • access the reference table via .df()

  • look up an entity term via .lookup()

  • search for an entity term via .search()

import bionty as bt

.fields: fields of an ontology reference

gene_bt = bt.Gene()

gene_bt
Gene
Organism: human
Source: ensembl, release-110
#terms: 75719

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
✅ Gene.validate(): strictly validate values
🧐 Gene.inspect(): full inspection of values
👽 Gene.standardize(): convert to standardized names
🪜 Gene.diff(): difference between two versions
🔗 Gene.ontology: Pronto.Ontology object
gene_bt.fields
{'biotype',
 'description',
 'ensembl_gene_id',
 'ncbi_gene_id',
 'symbol',
 'synonyms'}

Fields can be accessed as attributes, which enables autocompletion. You can pass them to the field parameter of any Bionty function instead of strings:

gene_bt.ncbi_gene_id
ncbi_gene_id
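
The attribute access above can be pictured with a small stand-in class. This is only a sketch: FieldNames is a hypothetical name, not part of Bionty, and Bionty's real implementation may work differently. It illustrates why attributes are convenient: a misspelled field name fails right at attribute access instead of being passed along as a wrong string.

```python
class FieldNames:
    """Hypothetical stand-in (not Bionty's API): exposes field names as attributes."""

    def __init__(self, fields):
        self._fields = frozenset(fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_") or name not in self._fields:
            raise AttributeError(f"unknown field: {name!r}")
        return name

    def __dir__(self):
        # Lets interactive shells offer the field names for tab completion.
        return sorted(self._fields)


fields = FieldNames(
    {"biotype", "description", "ensembl_gene_id", "ncbi_gene_id", "symbol", "synonyms"}
)
print(fields.ncbi_gene_id)  # the field name as a plain string
```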

.df(): reference table

Data scientists love DataFrames, so every entity has a reference table, returned by .df() as a pandas DataFrame, that contains all of its fields.

df = gene_bt.df()
df.head()
   ensembl_gene_id  symbol    ncbi_gene_id  biotype         description                                        synonyms
0  ENSG00000000003  TSPAN6    7105          protein_coding  tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]  T245|TSPAN-6|TM4SF6
1  ENSG00000000005  TNMD      64102         protein_coding  tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]    TEM|CHM1L|BRICD4|MYODULIN|TENDIN
2  ENSG00000000419  DPM1      8813          protein_coding  dolichyl-phosphate mannosyltransferase subunit...  MPDS|CDGIE
3  ENSG00000000457  SCYL3     57147         protein_coding  SCY1 like pseudokinase 3 [Source:HGNC Symbol;A...  PACE-1|PACE1
4  ENSG00000000460  C1orf112  55732         protein_coding  chromosome 1 open reading frame 112 [Source:HG...  APOLO1|FLIP|FLJ10706

To access the information for several gene symbols at once, for example, select the corresponding rows with pandas:

df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
        ensembl_gene_id  ncbi_gene_id  biotype         description                                        synonyms
symbol
LMNA    ENSG00000160789  4000          protein_coding  lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636]       MADA|LMNL1|HGPS|CMD1A|PRO1|LMN1|LGMD1B
TCF7    ENSG00000081059  6932          protein_coding  transcription factor 7 [Source:HGNC Symbol;Acc...  TCF-1
BRCA1   ENSG00000012048  672           protein_coding  BRCA1 DNA repair associated [Source:HGNC Symbo...  PPP1R53|FANCS|RNF53|BRCC1

.lookup(): look up terms and records with autocompletion

Terms can be looked up with autocompletion using a lookup object.

lookup = gene_bt.lookup()

We provide a dot accessor for normalized terms (lower case, containing only alphanumeric characters and underscores):

lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
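
To make the normalization rule concrete, here is a minimal sketch of how a term could be turned into an attribute-style key. normalize_key is a hypothetical helper, not a Bionty function, and Bionty's exact rules may differ. The "bt_" prefix for keys that would start with a digit is an assumption modeled on keys such as bt_100126572 that appear further down this page.

```python
import re


def normalize_key(term: str, prefix: str = "bt_") -> str:
    """Hypothetical helper: turn a term into a valid attribute-style key."""
    # Lower-case, then replace every run of other characters with one underscore.
    key = re.sub(r"[^0-9a-z]+", "_", term.lower()).strip("_")
    # Attribute names cannot start with a digit, so prepend a prefix in that case.
    if not key or key[0].isdigit():
        key = prefix + key
    return key


for term in ["TCF7", "cytotoxic T cell", "CD8-positive, alpha-beta T cell"]:
    print(term, "->", normalize_key(term))
```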

To look up the exact original strings, convert the lookup object to a dict and use the bracket [] accessor, which also supports autocompletion:

lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')

By default, lookup keys are generated from the entity’s name field (for Gene this is the symbol, as in the examples above).

You can specify another field to look up:

lookup = gene_bt.lookup(gene_bt.ncbi_gene_id)

If a key matches multiple entries, they are returned as a list; a single match, as in this example, is returned as one record:

lookup.bt_100126572
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
lookup_dict = lookup.dict()
lookup_dict["100126572"]
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
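
The behaviour of keying on a chosen field, with multiple matches collected into a list, can be sketched with plain dictionaries. build_lookup is a hypothetical helper used only for illustration, not Bionty's implementation. The three records are trimmed copies of rows from the reference table shown earlier.

```python
from collections import defaultdict

# Three records copied from the reference table on this page (fields trimmed).
records = [
    {"symbol": "LMNA", "ncbi_gene_id": "4000", "biotype": "protein_coding"},
    {"symbol": "TCF7", "ncbi_gene_id": "6932", "biotype": "protein_coding"},
    {"symbol": "BRCA1", "ncbi_gene_id": "672", "biotype": "protein_coding"},
]


def build_lookup(records, field):
    """Hypothetical helper: map each value of `field` to its record(s)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[field]].append(record)
    # A single match is returned as the record itself, several matches as a list.
    return {key: hits[0] if len(hits) == 1 else hits for key, hits in grouped.items()}


by_id = build_lookup(records, "ncbi_gene_id")
by_biotype = build_lookup(records, "biotype")
print(by_id["6932"]["symbol"])
print([r["symbol"] for r in by_biotype["protein_coding"]])
```

Keying on ncbi_gene_id gives one record per key here, whereas keying on biotype groups all three records under "protein_coding".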

.search(): search a term against a field

celltype_bt = bt.CellType()

Matching scores are stored in the __ratio__ column:

celltype_bt.search("cytotoxic T cells").head(3)
ontology_id definition synonyms parents __ratio__
name
cytotoxic T cell CL:0000910 A Mature T Cell That Differentiated And Acquir... cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... [CL:0000911] 96.969697
cell CL:0000000 A Material Entity Of Anatomical Origin (Part O... None [] 90.000000
T cell CL:0000084 A Type Of Lymphocyte Whose Defining Characteri... T-lymphocyte|T-cell|T lymphocyte [CL:0000542] 90.000000
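
For intuition about what a matching score expresses, the sketch below ranks a few names by string similarity using Python's standard difflib. This is a stand-in chosen for illustration: this page does not say which scorer Bionty uses, so the numbers printed here should not be expected to reproduce the __ratio__ column in general.

```python
from difflib import SequenceMatcher


def ratio(query: str, candidate: str) -> float:
    """Similarity between two strings on a 0-100 scale, ignoring case."""
    return 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


names = ["cytotoxic T cell", "cell", "T cell"]
ranked = sorted(names, key=lambda name: ratio("cytotoxic T cells", name), reverse=True)
for name in ranked:
    print(f"{name}: {ratio('cytotoxic T cells', name):.2f}")
```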

By default, search also matches against each of the synonyms:

celltype_bt.search("P cell").head(3)
ontology_id definition synonyms parents __ratio__
name
nodal myocyte CL:0002072 A Specialized Cardiac Myocyte In The Sinoatria... myocytus nodalis|P cell|cardiac pacemaker cell [CL:0002086] 100.000000
pigmented ciliary epithelial cell CL:0002303 A Cell That Is Part Of Pigmented Ciliary Epith... PE cell [CL:0000529] 92.307692
double-positive, alpha-beta thymocyte CL:0000809 A Thymocyte Expressing The Alpha-Beta T Cell R... DP cell|DP thymocyte|double-positive, alpha-be... [CL:0000790] 92.307692

You can turn off synonym matching with synonyms_field=None:

celltype_bt.search("P cell", synonyms_field=None).head(3)
ontology_id definition synonyms parents __ratio__
name
PP cell CL:0000696 A Cell That Stores And Secretes Pancreatic Pol... type F enteroendocrine cell [CL:0000167, CL:0000164] 92.307692
cell CL:0000000 A Material Entity Of Anatomical Origin (Part O... None [] 90.000000
pancreatic PP cell CL:0002275 A Pp Cell Located In The Islets Of The Pancreas. PP-cell of pancreatic islet|pancreatic polypep... [] 90.000000
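
The effect of synonym matching can be sketched as follows: each record is scored by its best-matching string, taken either from the name alone or from the name plus its pipe-separated synonyms. The search function and its use_synonyms flag are hypothetical names for this illustration, and difflib is again an assumed stand-in scorer rather than Bionty's own. The names and synonyms are copied from the results above.

```python
from difflib import SequenceMatcher

# Names and synonyms copied from the search results on this page.
cell_types = [
    {"name": "nodal myocyte", "synonyms": "myocytus nodalis|P cell|cardiac pacemaker cell"},
    {"name": "PP cell", "synonyms": "type F enteroendocrine cell"},
    {"name": "cell", "synonyms": None},
]


def ratio(a: str, b: str) -> float:
    """Similarity between two strings on a 0-100 scale, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query, records, use_synonyms=True):
    """Hypothetical helper: rank records by their best-matching string."""
    scored = []
    for record in records:
        candidates = [record["name"]]
        if use_synonyms and record["synonyms"]:
            candidates += record["synonyms"].split("|")
        scored.append((max(ratio(query, c) for c in candidates), record["name"]))
    return sorted(scored, reverse=True)


print(search("P cell", cell_types)[0])
print(search("P cell", cell_types, use_synonyms=False)[0])
```

With synonyms enabled, "nodal myocyte" ranks first because one of its synonyms is exactly "P cell". With synonyms disabled, only names are compared and "PP cell" ranks first.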

Use the field parameter to match against another field (the default is “name”):

celltype_bt.search("CD8 postive alpha beta T cells", field=celltype_bt.definition).head(3)
ontology_id name synonyms parents __ratio__
definition
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor. CL:0000625 CD8-positive, alpha-beta T cell CD8-positive, alpha-beta T lymphocyte|CD8-posi... [CL:0000791] 95.081967
A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor. CL:0000624 CD4-positive, alpha-beta T cell CD4-positive, alpha-beta T lymphocyte|CD4-posi... [CL:0000791] 91.803279
A T Cell That Expresses An Alpha-Beta T Cell Receptor Complex. CL:0000789 alpha-beta T cell alpha-beta T-cell|alpha-beta T-lymphocyte|alph... [CL:0000084] 90.000000