Validate, inspect & standardize identifiers#
To make data queryable by an entity identifier, identifiers must comply with a chosen standard.
Bionty enables this by mapping metadata onto versioned ontologies using validate() and inspect().
For terms that cannot be mapped directly, Bionty offers standardize(), lookup() and search() (see also Search & lookup terms).
import bionty as bt
import pandas as pd
Inspect and map synonyms of gene identifiers#
To illustrate, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.
data = {
"gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
"ncbi id": ["29974", "1", "5133", "corrupted"],
"ensembl_gene_id": [
"ENSG00000148584",
"ENSG00000121410",
"ENSG00000188389",
"ENSGcorrupted",
],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
| ensembl_gene_id | gene symbol | ncbi id |
|---|---|---|
| ENSG00000148584 | A1CF | 29974 |
| ENSG00000121410 | A1BG | 1 |
| ENSG00000188389 | FANCD1 | 5133 |
| ENSGcorrupted | corrupted | corrupted |
First, we check which of our values can be validated against the ontology reference.
Tip: available fields are accessible via gene_bt.fields
gene_bt = bt.Gene()
gene_bt
Gene
Organism: human
Source: ensembl, release-110
#terms: 75719
📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
✅ Gene.validate(): strictly validate values
🧐 Gene.inspect(): full inspection of values
👽 Gene.standardize(): convert to standardized names
🪜 Gene.diff(): difference between two versions
🔗 Gene.ontology: Pronto.Ontology object
validated = gene_bt.validate(df_orig.index, gene_bt.ensembl_gene_id)
validated
✅ 3 terms (75.00%) are validated
❗ 1 term (25.00%) is not validated: ENSGcorrupted
array([ True, True, True, False])
# show not validated terms
df_orig.index[~validated]
Index(['ENSGcorrupted'], dtype='object', name='ensembl_gene_id')
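Conceptually, strict validation is an exact membership test against one column of the reference table. The following is a minimal plain-Python sketch of that idea, not Bionty's implementation; `reference_ids` and `validate_sketch` are hypothetical names, and the reference set holds only the three valid IDs from the example above.

```python
# Hypothetical stand-in for the ensembl_gene_id column of the reference table.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}


def validate_sketch(values, reference):
    """Return one boolean per value: True only for an exact match in the reference."""
    return [value in reference for value in values]


ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "ENSGcorrupted"]
flags = validate_sketch(ids, reference_ids)

# Collect the values that failed validation, mirroring df_orig.index[~validated].
not_validated = [value for value, ok in zip(ids, flags) if not ok]
print(flags)
print(not_validated)
```

Because the test is exact, a value that differs only by case or that is a known synonym would also be reported as not validated.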
The same procedure works for the ncbi_gene_id and symbol fields. We validate the NCBI IDs first, then the gene symbols.
gene_bt.validate(df_orig["ncbi id"], gene_bt.ncbi_gene_id)
✅ 3 terms (75.00%) are validated
❗ 1 term (25.00%) is not validated: corrupted
array([ True, True, True, False])
validated_symbols = gene_bt.validate(df_orig["gene symbol"], gene_bt.symbol)
✅ 2 terms (50.00%) are validated
❗ 2 terms (50.00%) are not validated: FANCD1, corrupted
df_orig["gene symbol"][~validated_symbols]
ensembl_gene_id
ENSG00000188389 FANCD1
ENSGcorrupted corrupted
Name: gene symbol, dtype: object
Here, 2 of the gene symbols are not validated. What shall we do? Let's run a full inspection of these symbols:
gene_bt.inspect(df_orig["gene symbol"], gene_bt.symbol);
✅ 2 terms (50.00%) are validated for symbol
❗ 2 terms (50.00%) are not validated for symbol: FANCD1, corrupted
   detected 1 terms with synonym: FANCD1
→ standardize terms via .standardize()
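The log above combines two checks: exact validation, plus a second pass that looks for known synonyms among the values that failed. A rough sketch of that two-step report follows. It is an illustration only, not how Bionty computes it; `inspect_sketch`, `names` and `synonym_to_name` are hypothetical, and the tiny tables contain only the symbols from this example.

```python
def inspect_sketch(values, names, synonym_to_name):
    """Split values into validated / non-validated and flag known synonyms."""
    validated = [v for v in values if v in names]
    non_validated = [v for v in values if v not in names]
    # Second pass: which of the failures are merely synonyms of a standard name?
    with_synonym = [v for v in non_validated if v in synonym_to_name]
    return {
        "validated": validated,
        "non_validated": non_validated,
        "synonyms": with_synonym,
        "frac_validated": len(validated) / len(values) if values else 0.0,
    }


names = {"A1CF", "A1BG", "BRCA2"}      # hypothetical reference symbols
synonym_to_name = {"FANCD1": "BRCA2"}  # hypothetical synonym table
report = inspect_sketch(
    ["A1CF", "A1BG", "FANCD1", "corrupted"], names, synonym_to_name
)
print(f"{report['frac_validated']:.2%} validated")
print("fixable via synonyms:", report["synonyms"])
```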
inspect() detects synonyms and suggests using .standardize():
# mapping synonyms returns a list of standardized terms:
mapped_symbol_synonyms = gene_bt.standardize(df_orig["gene symbol"])
mapped_symbol_synonyms
💡 standardized 3/4 terms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']
Optionally, return only a mapper of {synonym: standardized name}:
gene_bt.standardize(df_orig["gene symbol"], return_mapper=True)
💡 standardized 3/4 terms
{'FANCD1': 'BRCA2'}
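The two return modes shown above (full list versus mapper) can be pictured with a small sketch. This is a simplified illustration under the assumption that each standardized name carries a list of synonyms; `standardize_sketch` and `synonym_table` are hypothetical and do not reflect Bionty's internals.

```python
# Hypothetical table: standardized symbol -> list of known synonyms.
synonym_table = {"A1CF": [], "A1BG": [], "BRCA2": ["FANCD1"]}


def standardize_sketch(values, table, return_mapper=False):
    """Replace known synonyms by their standardized name; leave other values as-is."""
    # Invert the table so each synonym points at its standardized name.
    to_standard = {syn: name for name, syns in table.items() for syn in syns}
    mapper = {v: to_standard[v] for v in values if v in to_standard}
    if return_mapper:
        return mapper
    return [mapper.get(v, v) for v in values]


symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]
standardized = standardize_sketch(symbols, synonym_table)
only_mapper = standardize_sketch(symbols, synonym_table, return_mapper=True)
print(standardized)
print(only_mapper)
```

The mapper form is convenient when the original object should be renamed in place, while the list form preserves input order and length, so it can be assigned directly as a new column or index.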
We can use the standardized symbols as the new standardized index:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated
| | ensembl_gene_id | gene symbol | ncbi id |
|---|---|---|---|
| A1CF | ENSG00000148584 | A1CF | 29974 |
| A1BG | ENSG00000121410 | A1BG | 1 |
| BRCA2 | ENSG00000188389 | FANCD1 | 5133 |
| corrupted | ENSGcorrupted | corrupted | corrupted |
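One caveat when using standardized names as an index: if the input contained both a standardized symbol and one of its synonyms, the two rows end up with the same label. A small sketch of a duplicate check that could be run before assigning the index is shown here; `duplicated_labels` is a hypothetical helper, and the second input is an invented case for illustration.

```python
from collections import Counter


def duplicated_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in counts if counts[label] > 1]


# The standardized symbols from the example above contain no duplicates.
no_dups = duplicated_labels(["A1CF", "A1BG", "BRCA2", "corrupted"])

# Hypothetical input in which BRCA2 and its synonym FANCD1 were both present,
# so that standardization produced the same label twice.
dups = duplicated_labels(["BRCA2", "A1BG", "BRCA2"])
print(no_dups, dups)
```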
Standardize and look up unmapped CellMarker identifiers#
Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.
This section demonstrates how to look up unmatched terms and curate them using CellMarker.
First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.
markers = pd.DataFrame(
index=[
"KI67",
"CCR7",
"CD14",
"CD8",
"CD45RA",
"CD4",
"CD3",
"CD127a",
"PD1",
"Invalid-1",
"Invalid-2",
"CD66b",
"Siglec8",
"Time",
]
)
Let's instantiate the CellMarker ontology with the default database and version.
cellmarker_bt = bt.CellMarker()
cellmarker_bt
CellMarker
Organism: human
Source: cellmarker, 2.0
#terms: 15466
📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of terms
🎯 CellMarker.search(): free text search of terms
✅ CellMarker.validate(): strictly validate values
🧐 CellMarker.inspect(): full inspection of values
👽 CellMarker.standardize(): convert to standardized names
🪜 CellMarker.diff(): difference between two versions
🔗 CellMarker.ontology: Pronto.Ontology object
Now let's check which cell markers in the DataFrame index can be found in the reference:
cellmarker_bt.inspect(markers.index, cellmarker_bt.name);
✅ 6 terms (42.90%) are validated for name
❗ 8 terms (57.10%) are not validated for name: KI67, CCR7, CD14, CD4, CD127a, Invalid-1, Invalid-2, Time
   detected 4 terms with inconsistent casing/synonyms: KI67, CCR7, CD14, CD4
→ standardize terms via .standardize()
Logging suggests we map synonyms:
synonyms_mapper = cellmarker_bt.standardize(markers.index, return_mapper=True)
💡 standardized 10/14 terms
Now we mapped 4 additional terms:
synonyms_mapper
{'KI67': 'Ki67', 'CCR7': 'Ccr7', 'CD14': 'Cd14', 'CD4': 'Cd4'}
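All four mapped terms differ from the reference only by casing. One way such a mapper could be derived is a case-insensitive lookup, sketched below. This is an assumption about the general technique, not a description of Bionty's code; `casing_mapper` and the short `reference_names` list are hypothetical.

```python
# Hypothetical excerpt of reference names, using the casing shown in the mapper above.
reference_names = ["Ki67", "Ccr7", "Cd14", "Cd4", "CD8", "CD3"]


def casing_mapper(values, names):
    """Map values that match a reference name only after lowercasing."""
    by_lower = {name.lower(): name for name in names}
    mapper = {}
    for value in values:
        standard = by_lower.get(value.lower())
        # Skip values without a match and values that are already standardized.
        if standard is not None and standard != value:
            mapper[value] = standard
    return mapper


mapper = casing_mapper(
    ["KI67", "CCR7", "CD14", "CD8", "CD4", "CD3", "Time"], reference_names
)
print(mapper)
```

Values that are already in standard form (CD8, CD3) and values with no counterpart (Time) are left out of the mapper, so applying it only touches the terms that need fixing.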
Let's replace the synonyms with standardized names in the markers DataFrame:
markers.rename(index=synonyms_mapper, inplace=True)
The logging shows that 4 terms were not found in the reference. Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which won't be curated by CellMarker.
cellmarker_bt.inspect(markers.index, cellmarker_bt.name);
✅ 10 terms (71.40%) are validated for name
❗ 4 terms (28.60%) are not validated for name: CD127a, Invalid-1, Invalid-2, Time
CD127a is not found in the reference. Let's check the lookup with auto-completion:
lookup = cellmarker_bt.lookup()
lookup.cd127
CellMarker(name='CD127', synonyms='', gene_symbol='IL7R', ncbi_gene_id='3575', uniprotkb_id='P16871', _5='cd127')
Indeed, the name should be CD127; CD127a was a typo.
Now let's fix the markers so all of them can be linked:
Tip
Using .lookup() instead of passing a string helps eliminate possible typos!
curated_df = markers.rename(index={"CD127a": lookup.cd127.name})
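Why does attribute access guard against typos? A lookup object only exposes names that exist in the reference, so a misspelled attribute fails immediately instead of silently passing through. The sketch below illustrates this with a hypothetical `build_lookup` helper built on `types.SimpleNamespace`; the key normalization (lowercasing, replacing non-word characters with underscores) is an assumption for illustration, and the name list is invented.

```python
import re
from types import SimpleNamespace


def build_lookup(names):
    """Expose each name as an attribute under a lowercased, identifier-safe key."""
    entries = {}
    for name in names:
        key = re.sub(r"\W", "_", name.lower())
        if key and key[0].isdigit():
            key = "_" + key  # attribute names cannot start with a digit
        entries[key] = name
    return SimpleNamespace(**entries)


# Hypothetical excerpt of reference names.
lookup_sketch = build_lookup(["CD127", "CD8", "Ki67", "HLA-DR"])

print(lookup_sketch.cd127)   # the correctly spelled reference name
print(lookup_sketch.hla_dr)  # non-word characters become underscores

# A typo is caught at access time rather than propagating into the data.
has_typo = hasattr(lookup_sketch, "cd127a")
print(has_typo)
```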
Optionally, run a fuzzy match:
cellmarker_bt.search("CD127a").head()
| name | synonyms | gene_symbol | ncbi_gene_id | uniprotkb_id | __agg__ | __ratio__ |
|---|---|---|---|---|---|---|
| CD127 | | IL7R | 3575 | P16871 | cd127 | 90.909091 |
| CD1 | | CD1A | 910 | P29016 | cd1 | 90.000000 |
| CD172a | | None | None | None | cd172a | 83.333333 |
| CD167a | | None | None | None | cd167a | 83.333333 |
| CD121a | | None | None | None | cd121a | 83.333333 |
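To get a feel for fuzzy matching without loading an ontology, the same query can be ranked with Python's standard-library `difflib`. This is a stand-in technique, not the scorer Bionty uses, so the similarity values and the ordering of lower-ranked hits may differ from the table above. The candidate list is copied from that table, and `closest_names` is a hypothetical helper.

```python
from difflib import get_close_matches

# Candidate names taken from the search result above.
candidates = ["CD127", "CD1", "CD172a", "CD167a", "CD121a"]


def closest_names(query, names, n=3, cutoff=0.6):
    """Return up to n names most similar to the query, compared case-insensitively."""
    by_lower = {name.lower(): name for name in names}
    hits = get_close_matches(query.lower(), list(by_lower), n=n, cutoff=cutoff)
    # Translate the lowercased hits back to their original spelling.
    return [by_lower[hit] for hit in hits]


best = closest_names("CD127a", candidates)
print(best[0])
```

Raising `cutoff` makes the match stricter, which helps avoid accepting a near miss as a correction without reviewing it.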
Now we can run inspect() again: all cell markers are linked, and only the three non-marker channels remain unvalidated.
cellmarker_bt.inspect(curated_df.index, cellmarker_bt.name);
✅ 11 terms (78.60%) are validated for name
❗ 3 terms (21.40%) are not validated for name: Invalid-1, Invalid-2, Time
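Since some channels are expected to stay unvalidated, a curation script may want to confirm that nothing else slipped through. A minimal sketch of such a check follows; `expected_non_markers` and `unexpected_leftovers` are hypothetical names, and the list of non-validated terms is typed in by hand from the log above rather than read from Bionty.

```python
# Channels that are known not to be cell markers in this dataset.
expected_non_markers = {"Time", "Invalid-1", "Invalid-2"}


def unexpected_leftovers(not_validated, allowed):
    """Return non-validated terms that are not on the list of expected exceptions."""
    return sorted(set(not_validated) - set(allowed))


# Terms reported as not validated after curation (copied from the log above).
leftovers = unexpected_leftovers(
    ["Invalid-1", "Invalid-2", "Time"], expected_non_markers
)

# Before the CD127a typo was fixed, the same check would have flagged it.
before_fix = unexpected_leftovers(
    ["CD127a", "Invalid-1", "Invalid-2", "Time"], expected_non_markers
)
print(leftovers, before_fix)
```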