Summary of data sources and ETL methodology

WOMBAT

Source data: Version 2011.2, SDF file and activities.tab file. The following filters were applied to WOMBAT source data during ETL. Only data passing all filters was loaded into CARLSBAD.

  • Activities not associated with human, rat or mouse targets were skipped.
  • Activities not associated with a known target were skipped.
  • Activities associated with targets without an associated Swissprot identifier were skipped.
  • Activities from primary screening were skipped.
  • Activities labelled inactive were skipped.
  • Activities with descriptive values (e.g. active, dde) were skipped.
  • Only activities of the following types were loaded: EC50, ED50, IC50, IC80, IC90, Ki, Kb, Kd, Km, A2, D2
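
The filters above can be read as a single predicate applied to each activity record. The sketch below is only an illustration under assumptions: the record layout and the field names (species, target, swissprot_id, primary_screen, inactive, value, type) are hypothetical and do not come from the WOMBAT files or the CARLSBAD loader.

```python
# Hypothetical record layout; field names are illustrative assumptions only.
ALLOWED_SPECIES = {"human", "rat", "mouse"}
ALLOWED_TYPES = {"EC50", "ED50", "IC50", "IC80", "IC90",
                 "Ki", "Kb", "Kd", "Km", "A2", "D2"}


def passes_wombat_filters(activity):
    """Return True only if the activity record passes every filter."""
    if activity.get("species") not in ALLOWED_SPECIES:
        return False
    if not activity.get("target"):          # no known target
        return False
    if not activity.get("swissprot_id"):    # target lacks a Swissprot identifier
        return False
    if activity.get("primary_screen") or activity.get("inactive"):
        return False
    if not isinstance(activity.get("value"), (int, float)):
        return False                        # descriptive value such as "active"
    return activity.get("type") in ALLOWED_TYPES


example = {
    "species": "human", "target": "Dopamine D2 receptor",
    "swissprot_id": "P14416", "primary_screen": False,
    "inactive": False, "value": 7.3, "type": "Ki",
}
loaded = passes_wombat_filters(example)
```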

Additional transformation of WOMBAT data during ETL:

  • As all WOMBAT activities are already -log molar, no further transformations were performed.

IUPHAR

Source data: Automated data extraction was performed on the IUPHAR web site during February 2011, and the extracted data was used to populate a MySQL database, which served as the ETL source. 680 missing structures were added and 283 structures were corrected. The following filters were applied to IUPHAR source data during ETL. Only data passing all filters was loaded into CARLSBAD.

  • Activities not associated with human, rat or mouse targets were skipped.
  • Activities with unknown affinities or units were skipped.
  • Only activities with the following classes were loaded: Agonists, Antagonists, Pore Blockers, Activators, Allosteric Regulators, Gating inhibitors, Channel Blockers.

Additional transformation of IUPHAR data during ETL:

  • Midpoints or medians were used for affinities expressed as ranges.
  • As all IUPHAR activities are already -log molar, no further transformation was performed.
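
As an illustration of the range handling, the sketch below takes the midpoint of an affinity given as a range. The text format of a range ("7.0-8.0" or "7.0 to 8.0") and the function name are assumptions made for this example; the actual representation in the IUPHAR-derived MySQL database is not described here, and only the midpoint case is shown, not the median.

```python
import re

# Assumed range format: two non-negative numbers separated by "-" or "to".
_RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)\s*$")


def affinity_midpoint(text):
    """Return the midpoint of a range, or the value itself if not a range."""
    match = _RANGE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2.0
    return float(text)
```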

ChEMBL

Source data: MySQL dump of ChEMBL v13, 2012-02-21. The following filters were applied to ChEMBL source data during ETL. Only data passing all filters was loaded into CARLSBAD.

  • Only activities from publications were loaded.
  • Activities associated with ADMET assays were skipped.
  • Activities not associated with a protein target were skipped.
  • Activities not associated with human, rat or mouse targets were skipped.
  • Activities without values or units were skipped.
  • Only activities with the following types were loaded: EC50, IC50, pEC50, pIC50, Log EC50, Log IC50, Ki, Kb, Kd, pKi, pKb, pKd, Log Ki, Log Kb, Log Kd, ED50, IC80, IC90, A2, D2, pA2, pD2, Km
  • Only activities with units expressed in molarity were loaded.
  • Only activities with an associated structure were loaded.

Additional transformation of ChEMBL data during ETL:

  • Activity values were converted to M, if necessary.
  • Activity values were converted to -log10, if necessary.
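
A minimal sketch of the two conversions follows. It rests on stated assumptions rather than on the actual ETL code: the unit labels in UNIT_TO_MOLAR are illustrative, types prefixed with "p" are taken to be -log10 molar already, and types prefixed with "Log" are taken to be log10 of a molar concentration, so they are simply negated.

```python
import math

# Assumed unit labels; the spellings used in the source dump may differ.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_neg_log_molar(activity_type, value, unit=None):
    """Convert an activity value to -log10 of its molar concentration."""
    kind = activity_type.strip()
    if kind.startswith("p"):             # pIC50, pKi, ...: already -log10 M
        return float(value)
    if kind.lower().startswith("log"):   # Log IC50, ...: assumed log10 M
        return -float(value)
    molar = float(value) * UNIT_TO_MOLAR[unit]   # step 1: convert to M
    if molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(molar)                    # step 2: convert to -log10
```

For example, an IC50 of 10 nM is 1e-8 M, which gives a value of 8.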

PDSP

Source data: kidb110121, with Uniprot IDs added by UM: PDSP_MP_093011UM.txt. The following filters were applied to PDSP source data during ETL. Only data passing all filters was loaded into CARLSBAD.

  • Activities associated with structures not parseable by OEChem were skipped.
  • Activities with qualified values (e.g. > x) were skipped.

Additional transformation of PDSP data during ETL:

  • Activity values were converted to M.
  • Activity values were converted to -log10.
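
The qualified-value filter can be sketched as a strict numeric parse: anything that is not a plain number, such as "> 10000", is rejected. The function name and the assumption that raw values arrive as text are hypothetical and only serve the illustration.

```python
import re

# A plain non-negative number, optionally in scientific notation.
_PLAIN_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)?(?:[eE][-+]?\d+)?\s*$")


def parse_unqualified(raw):
    """Return the value as a float, or None if it carries a qualifier."""
    if _PLAIN_NUMBER.match(raw):
        return float(raw)
    return None  # e.g. "> 10000" or "<1": the activity would be skipped
```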

PubChem MLP

Source data: The PubChem Assays and Substances to be loaded into CARLSBAD were selected using the Entrez EUtils API to search pcassay with the following filters: MLP[Filter], confirmatory[Filter] and pcassay_protein_target[Filter]. Substance structures were retrieved as SMILES using the PubChem Power User Gateway (PUG). Assay data was loaded from XML and CSV files downloaded from the PubChem FTP site.
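
A search of this kind could be expressed as an ESearch request. The sketch below only builds the request URL and does not contact the server. Combining the three filters with AND and the retmax value are assumptions for illustration; the exact query used for CARLSBAD is not recorded here.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
FILTERS = ["MLP[Filter]", "confirmatory[Filter]",
           "pcassay_protein_target[Filter]"]


def build_pcassay_search_url(retmax=1000):
    """Build an ESearch URL for pcassay (joining filters with AND is assumed)."""
    params = {
        "db": "pcassay",
        "term": " AND ".join(FILTERS),
        "retmax": retmax,   # assumed page size, not taken from the source
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pcassay_search_url()
```
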
The following filters were applied to PubChem-MLP source data during ETL. Only data passing all filters was loaded into CARLSBAD.

  • Only activities associated with human, rat or mouse targets were loaded.
  • Only activities with the following result types were loaded: various versions of EC50, AC50, IC50, Ki, Potency
  • Activities without values or units were skipped.
  • Only activities with units expressed in molarity were loaded.
  • Only activities with an associated structure were loaded.
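
Because result types appear in various versions, the type filter has to match variants of a name rather than exact strings. One possible approach, shown purely as an assumption, is to look for an accepted base type as a standalone token inside the reported name. The example names in the comments are invented, not taken from PubChem.

```python
import re

BASE_TYPES = ["EC50", "AC50", "IC50", "Ki", "Potency"]
_CANONICAL = {name.lower(): name for name in BASE_TYPES}
# Match a base type only when it is not embedded in a longer alphanumeric word.
_TYPE_RE = re.compile(
    r"(?<![A-Za-z0-9])(" + "|".join(BASE_TYPES) + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def base_result_type(name):
    """Return the accepted base type found in a result-type name, or None."""
    match = _TYPE_RE.search(name)
    return _CANONICAL[match.group(1).lower()] if match else None

# Invented examples: "IC50 (uM)" and "EC50_Mean" match; "Kinetic rate" does not.
```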

Additional transformation of PubChem MLP data during ETL:

  • Activity values were converted to M, if necessary.
  • Activity values were converted to -log10, if necessary.

Target Curation

Our goal was to have only one target record in CARLSBAD for each unique protein represented in assays in the data sources. However, targets are named and identified in many different ways across our source databases, making it difficult to determine whether a target from one data source is the same as a target from another (target consolidation). To improve target consolidation, a target curation step was performed after each data source was loaded, in which newly loaded targets were annotated with data from Uniprot. Targets identified in the source data by Swissprot or Uniprot ID were annotated with name, description, sequence, identifier and classifier data from Uniprot. This allowed us to check for target redundancy by sequence and identifiers after each data source was loaded. Despite this step, some degree of target redundancy certainly remains in the CARLSBAD database.
The identifiers used are: NCBI GI, RefSeq, Gene and UniGene IDs; and PDB IDs.
The classifiers used are: InterPro, Pfam, and PROSITE domains; GO terms; NCBI RefSeq and Gene IDs; and Uniprot family.
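
The redundancy check by sequence and identifiers can be pictured as grouping target records that share a sequence or any identifier. The following is a minimal sketch using a union-find structure; the record fields (id, sequence, identifiers) are hypothetical, and this is not the procedure actually used for CARLSBAD. A shared identifier does not prove that two records describe the same protein, so the groups are best treated as candidates for review.

```python
from collections import defaultdict


def group_redundant_targets(targets):
    """Group target ids that share a sequence or any identifier."""
    parent = {t["id"]: t["id"] for t in targets}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # key -> id of the first target carrying that key
    for t in targets:
        keys = [("seq", t["sequence"])] if t.get("sequence") else []
        keys += [("xref", x) for x in t.get("identifiers", [])]
        for key in keys:
            if key in first_seen:
                parent[find(t["id"])] = find(first_seen[key])  # merge groups
            else:
                first_seen[key] = t["id"]

    groups = defaultdict(list)
    for t in targets:
        groups[find(t["id"])].append(t["id"])
    # Report only groups with more than one member.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


targets = [
    {"id": 1, "sequence": "MKT", "identifiers": ["P1"]},
    {"id": 2, "sequence": "MKT", "identifiers": []},
    {"id": 3, "sequence": "AAA", "identifiers": ["P1"]},
    {"id": 4, "sequence": "GGG", "identifiers": ["P9"]},
]
candidates = group_redundant_targets(targets)
```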