Problems with gene association files

Gabriel Berriz gberriz at hms.harvard.edu
Fri Jun 28 11:13:19 PDT 2002


Hi everyone.

When processing GO's gene association files, we often run into problematic 
records.  These fall into 4 categories, roughly in ascending degree of 
seriousness:

1. associations in which the GO attrib is one of the "unknown" attributes 
of the three main branches of the ontology.

2. associations in which the GO attrib is a descendant of one of the 
"obsolete" attributes of the three main branches of the ontology; 
presumably, these associations are obsolete.

3. associations in which the GO attrib is deprecated in favor of another one.

4. assorted invalid records, in which some important field is missing or 
malformed.
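
For anyone who wants to reproduce the first three categories, here is a 
minimal sketch in Python of how an attrib could be classified. It is not 
the script we actually run. The function names, the dict layout, and the 
placeholder keys "bp", "mf" and "obsolete_bp" are all assumptions made for 
illustration; only the numeric IDs and term names come from the listings 
further down in this message.

```python
from collections import Counter


def ancestors(term, parents):
    """Collect every ancestor of `term`; `parents` maps child -> list of parents."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classify(term, parents, names, deprecated):
    """Return the problem category for one GO attrib, or "ok"."""
    if term in deprecated:
        return "deprecated"
    if names.get(term, "").endswith(" unknown"):
        return "unknown"
    # An attrib may have several parents, so every path to a root is checked.
    if any(names.get(a) == "obsolete" for a in ancestors(term, parents)):
        return "descendant-of-obsolete"
    return "ok"


def summarize(attribs, parents, names, deprecated):
    """Count associations per category; `attribs` has one entry per association."""
    return Counter(classify(t, parents, names, deprecated) for t in attribs)


# Toy ontology fragment. "bp", "mf" and "obsolete_bp" are placeholder keys
# (hypothetical), standing in for the real root and "obsolete" nodes.
PARENTS = {
    "obsolete_bp": ["bp"],
    "0006502": ["obsolete_bp"],
    "0006504": ["0006502"],
    "0000004": ["bp"],
    "0008534": ["mf"],
}
NAMES = {
    "bp": "biological_process",
    "mf": "molecular_function",
    "obsolete_bp": "obsolete",
    "0006502": "C-terminal protein prenylation",
    "0006504": "C-terminal protein geranylgeranylation",
    "0000004": "biological_process unknown",
}
DEPRECATED = {"0003907": "0008534"}  # old ID -> current ID
```

Calling summarize() on the list of attrib IDs taken from an association 
file (one ID per association) gives per-category counts of the kind 
summarized below. The "deprecated" test is done first because a 
deprecated ID may not appear in the current ontology at all.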

Here is a summary of the above for the current version of the listed files:

gene_associations.sgd:
2 "descendant-of-obsolete" attribs (2 associations)
2 deprecated attribs (6 associations)
3 "descendant-of-unknown" attribs (6762 associations)

gene_associations.mgi:
24 "descendant-of-obsolete" attribs (429 associations)
3 "descendant-of-unknown" attribs (839 associations)
1 invalid entry

gene_associations.fb:
22 "descendant-of-obsolete" attribs (478 associations)
4 deprecated attribs (21 associations)
6 invalid entries

gene_associations.wb:
22 "descendant-of-obsolete" attribs (178 associations)
3 deprecated attribs (19 associations)
1 "descendant-of-unknown" attrib (96 associations)
29 invalid entries


(For those interested, more detailed information about these problematic 
records is given below.)


I would like to add the following few points to The Wish List:

1. Is it useful to have gene associations to the category of 
unknown?  Wouldn't lack of evidence already imply that?  I propose that all 
these associations be eliminated, or reassigned to a GO attribute other 
than one of the "unknown" ones.

2. It seems reasonable that if an attribute has been classified as obsolete 
(by making it a descendant of one of the "obsolete" attributes), then the 
associations in which this attribute appears should also become obsolete, 
and therefore expunged from the association file, or at least reassigned to 
a non-obsolete attribute.

3. It would be very helpful if the gene association files were in sync 
with the ontology files.  The next best thing would be a clean-up script 
that replaces all occurrences of deprecated attributes in the association 
files with their current versions.

4. The most serious problem is the records that are completely invalid 
because of a missing or malformed field.  It would be nice if the gene 
association files were checked for data integrity before they are 
published.  It would be easy to write a Perl script to automate this checking.
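
To make points 3 and 4 concrete, here is a rough sketch of such a check, 
written in Python rather than Perl. It is only an illustration under 
stated assumptions: the column positions are inferred from the 
tab-delimited records quoted at the end of this message, the set of 
evidence codes is my assumption about what is currently allowed, and the 
function name check_record is made up.

```python
import re

# Assumed column positions (0-based), inferred from the quoted records:
# 0 = database, 1 = gene ID, 4 = GO attrib ID, 6 = evidence code.
GENE_ID, ATTRIB_ID, EVIDENCE = 1, 4, 6
MIN_FIELDS = 9

# Assumed list of valid evidence codes; adjust to the official list.
EVIDENCE_CODES = {"IC", "IDA", "IEA", "IEP", "IGI", "IMP",
                  "IPI", "ISS", "NAS", "ND", "TAS"}

GO_ID = re.compile(r"GO:(\d{7})")


def check_record(line, deprecated=None):
    """Validate one association line.

    Returns (problem, fields). `problem` is None for a clean record.
    If `deprecated` (old ID -> current ID) is given, a deprecated attrib ID
    in a clean record is replaced by its current version in `fields`.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < MIN_FIELDS:
        return "Too few fields", fields
    if not fields[GENE_ID].strip():
        return "Empty gene ID field", fields
    attrib = fields[ATTRIB_ID].strip()
    if attrib in ("", "GO:"):
        return "Empty attribute ID field", fields
    match = GO_ID.fullmatch(attrib)
    if match is None:
        return f"Malformed attribute ID '{attrib}'", fields
    if fields[EVIDENCE] not in EVIDENCE_CODES:
        return f"Invalid evidence type '{fields[EVIDENCE]}'", fields
    if deprecated and match.group(1) in deprecated:
        fields[ATTRIB_ID] = "GO:" + deprecated[match.group(1)]
    return None, fields


if __name__ == "__main__":
    # Hypothetical record: the gene ID and symbol here are made up.
    demo = "FB\tFBgn0000000\texample\t\tGO:0019004\tFB:FBrf0000000\tNAS\t\tF\t\t\tgene\ttaxonID:7227"
    problem, fields = check_record(demo, {"0019004": "0000703"})
    print(problem, "\t".join(fields))
```

Each record gets at most one message, with the checks ordered so that a 
missing gene ID is reported before anything else. Running this over a 
file before publication and refusing to publish if any record yields a 
message would catch the kinds of invalid entries listed at the end of 
this message.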

=============================================================
Gabriel F. Berriz, PhD
Bioinformatics Developer
Roth Lab
Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
Seeley G. Mudd Building 322B
Boston, MA 02115-5701
Telephone: 617.432.3555
Fax: 617.432.3557




gene_associations.sgd

2 "descendant-of-obsolete" attribs (2 associations):

0006502: C-terminal protein prenylation<-obsolete<-biological_process
0006504: C-terminal protein geranylgeranylation<-C-terminal protein 
prenylation<-obsolete<-biological_process


2 deprecated attribs (6 associations):

0003907: purine-specific oxidized base lesion DNA 
N-glycosylase/8-oxoguanine DNA glycosylase/DNA glycosylase/AP-lyase/DNA 
glycosylase/beta-lyase/bifunctional DNA glycosylase/formamidopyrimidine-DNA 
glycosylase (now 0008534)
0019004: pyrimidine-specific oxidized base lesion DNA N-glycosylase/DNA 
glycosylase/AP-lyase/DNA glycosylase/beta-lyase/bifunctional DNA 
glycosylase/endodeoxyribonuclease III/endonuclease III (now 0000703)


3 "descendant-of-unknown" attribs (6762 associations):

0008372: cellular_component unknown<-cellular_component
0005554: molecular_function unknown<-molecular_function
0000004: biological_process unknown<-biological_process


==================================================

gene_associations.mgi

24 "descendant-of-obsolete" attribs (429 associations):

0005558: minor histocompatibility antigen<-obsolete<-molecular_function
0009460: cytochrome b<-cytochrome<-obsolete<-molecular_function
0005211: plasma glycoprotein<-obsolete<-molecular_function
0009461: cytochrome c<-cytochrome<-obsolete<-molecular_function
0016583: nucleosome modeling<-obsolete<-biological_process
0015023: syndecan<-obsolete<-molecular_function
0005206: heparin sulfate proteoglycan<-obsolete<-molecular_function
0005207: extracellular matrix glycoprotein<-obsolete<-molecular_function
0005208: amyloid protein<-obsolete<-molecular_function
0005073: common-partner SMAD protein<-obsolete<-molecular_function
0005074: inhibitory SMAD protein<-obsolete<-molecular_function
0005075: pathway-specific SMAD protein<-obsolete<-molecular_function
0003820: class I major histocompatibility complex 
antigen<-obsolete<-molecular_function; class I major histocompatibility 
complex antigen<-endogenous peptide receptor<-transmembrane 
receptor<-receptor<-signal transducer<-molecular_function
0009487: glutaredoxin<-obsolete<-molecular_function
0006502: C-terminal protein prenylation<-obsolete<-biological_process
0003821: class II major histocompatibility complex 
antigen<-obsolete<-molecular_function; class II major histocompatibility 
complex antigen<-exogenous peptide receptor<-transmembrane 
receptor<-receptor<-signal transducer<-molecular_function
0003822: MHC-interacting protein<-obsolete<-molecular_function
0008222: tumor antigen<-obsolete<-molecular_function
0003819: major histocompatibility complex antigen/MHC 
protein<-obsolete<-molecular_function
0016171: cell surface antigen<-obsolete<-molecular_function
0005189: milk protein<-obsolete<-molecular_function
0005570: small nuclear RNA<-obsolete<-molecular_function
0005555: blood group antigen<-obsolete<-molecular_function
0005557: lymphocyte antigen<-obsolete<-molecular_function


3 "descendant-of-unknown" attribs (839 associations):

0005554: molecular_function unknown<-molecular_function
0008372: cellular_component unknown<-cellular_component
0000004: biological_process unknown<-biological_process


1 invalid entry:

*** Invalid evidence type '05/06/2002
' in
         F       ubiquitin protein ligase E3 component n-recognin 1      E3 
alpha        gene    taxon:10090     05/06/2002

==================================================

gene_associations.fb

22 "descendant-of-obsolete" attribs (478 associations):

0005566: ribosomal RNA<-obsolete<-molecular_function
0008337: selectin<-obsolete<-molecular_function
0005569: small nucleolar RNA<-obsolete<-molecular_function
0008436: heterogeneous nuclear 
ribonucleoprotein/hnRNP<-obsolete<-molecular_function
0009461: cytochrome c<-cytochrome<-obsolete<-molecular_function
0016583: nucleosome modeling<-obsolete<-biological_process
0005206: heparin sulfate proteoglycan<-obsolete<-molecular_function
0008001: fibrinogen<-obsolete<-molecular_function
0005207: extracellular matrix glycoprotein<-obsolete<-molecular_function
0009464: cytochrome b5<-cytochrome b<-cytochrome<-obsolete<-molecular_function
0005208: amyloid protein<-obsolete<-molecular_function
0004600: cyclophilin<-obsolete<-molecular_function
0009477: cytochrome c1<-cytochrome c<-cytochrome<-obsolete<-molecular_function
0009487: glutaredoxin<-obsolete<-molecular_function
0003750: cell cycle regulator<-obsolete<-molecular_function
0003733: ribonucleoprotein<-obsolete<-molecular_function
0003734: small nuclear ribonucleoprotein/snRNP<-obsolete<-molecular_function
0000047: Rieske iron-sulfur protein<-obsolete<-molecular_function
0005188: larval serum protein (sensu 
Insecta)/arylphorin<-obsolete<-molecular_function
0005570: small nuclear RNA<-obsolete<-molecular_function
0005563: transfer RNA<-obsolete<-molecular_function
0005734: box C+D snoRNP protein<-obsolete<-cellular_component


4 deprecated attribs (21 associations):

0008279: cohesin/14S cohesin (now 0008278)
0008620: condensin/13S condensin (now 0005676)
0007408: neuroblast determination/neuroblast identity determination (now 
0007400)
0019004: pyrimidine-specific oxidized base lesion DNA N-glycosylase/DNA 
glycosylase/AP-lyase/DNA glycosylase/beta-lyase/bifunctional DNA 
glycosylase/endodeoxyribonuclease III/endonuclease III (now 0000703)


6 invalid entries:

*** Empty attribute ID field in
FB      FBgn0025720     Ate1            GO:     FB:FBrf0145404  NAS 
     F               gene    taxonID:7227
*** Empty attribute ID field in
FB      FBgn0038659     endoA           GO:     FB:FBrf0145518  NAS 
     F               gene    taxonID:7227
*** Empty attribute ID field in
FB      FBgn0060297     p74S            GO: 
FB:FBrf0128634|PMID:10892653    IDA             F               gene 
taxonID:7227
*** Empty attribute ID field in
FB      FBgn0060298     p69             GO: 
FB:FBrf0128634|PMID:10892653    IDA             F               gene 
taxonID:7227
*** Empty attribute ID field in
FB      FBgn0060299     p145            GO: 
FB:FBrf0128634|PMID:10892653    IDA             F               gene 
taxonID:7227
*** Empty gene ID field in
FB                              GO:0006951      FB:FBrf0105495  NAS 
     P               gene    taxonID:7227
==================================================

gene_associations.wb

22 "descendant-of-obsolete" attribs (178 associations):

0005906: clathrin adaptor/adaptin<-obsolete<-cellular_component
0009406: virulence<-obsolete<-biological_process
0003892: proliferating cell nuclear antigen/PCNA<-obsolete<-molecular_function
0006832: small molecule transport<-obsolete<-biological_process
0009460: cytochrome b<-cytochrome<-obsolete<-molecular_function
0007012: actin cytoskeleton reorganization<-obsolete<-biological_process
0005208: amyloid protein<-obsolete<-molecular_function
0004431: 1-phosphatidylinositol-4-phosphate kinase/class I 
PI3K<-obsolete<-molecular_function
0005065: heterotrimeric G protein<-obsolete<-molecular_function
0009487: glutaredoxin<-obsolete<-molecular_function
0000008: thioredoxin<-obsolete<-molecular_function
0006502: C-terminal protein prenylation<-obsolete<-biological_process
0004429: 1-phosphatidylinositol 3-kinase/class III 
PI3K<-obsolete<-molecular_function
0005620: periplasmic space<-obsolete<-cellular_component
0003750: cell cycle regulator<-obsolete<-molecular_function
0003734: small nuclear ribonucleoprotein/snRNP<-obsolete<-molecular_function
0007048: oncogenesis<-obsolete<-biological_process
0008304: eukaryotic translation initiation factor 4 
complex/eIF-4<-obsolete<-cellular_component
0000047: Rieske iron-sulfur protein<-obsolete<-molecular_function
0005480: vesicle transport<-obsolete<-molecular_function
0008164: organophosphorous resistance<-obsolete<-biological_process
0005468: small-molecule carrier or transporter<-obsolete<-molecular_function


3 deprecated attribs (19 associations):

0008497: phospholipid transporter/phospholipid transporter (now 0005548)
0008433: guanyl-nucleotide exchange factor/GEF/GNRP/guanyl-nucleotide 
releasing factor (now 0005085)
0003710: transcriptional activator/transcription activating factor (now 
0016563)


1 "descendant-of-unknown" attrib (96 associations):

0005554: molecular_function unknown<-molecular_function


29 invalid entries:

*** Empty gene ID field in
WB                              GO:0000003      PMID:11137018   IMP 
WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0000003      PMID:11137018   IMP 
WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0000003      PMID:11231151   IMP 
WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0000003      PMID:11231151   IMP 
WB:SA:yk237h6   P               C33E10.3        protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0000003      PMID:11231151   IMP 
WB:SA:yk237h6   P               C33E10.7        protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0000003      PMID:11231151   IMP 
WB:SA:yk237h6   P               C33E10.9        protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0002119      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0002119      PMID:11231151   IMP 
WB:SA:yk282c1   P               Y39B6B.EE       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0002119      PMID:11231151   IMP 
WB:SA:yk301a11  P               Y102A5C.6       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007276      PMID:11231151   IMP 
WB:SA:yk301a11  P               Y102A5C.6       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11137018   IMP 
WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11137018   IMP 
WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11231151   IMP 
WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11231151   IMP 
WB:SA:yk282c1   P               Y39B6B.EE       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11231151   IMP 
WB:SA:yk301g7   P               F56D12.5        protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007345      PMID:11231151   IMP 
WB:SA:yk342e7   P               Y52B11A.11      protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007397      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007582      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007582      PMID:11231151   IMP 
WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007582      PMID:11231151   IMP 
WB:SA:yk282c1   P               Y39B6B.EE       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007582      PMID:11231151   IMP 
WB:SA:yk301a11  P               Y102A5C.6       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0007626      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0040007      PMID:11231151   IMP 
WB:SA:yk282c1   P               Y39B6B.EE       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0040007      PMID:11231151   IMP 
WB:SA:yk301a11  P               Y102A5C.6       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0040010      PMID:11231151   IMP 
WB:SA:yk282c1   P               Y39B6B.EE       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0040010      PMID:11231151   IMP 
WB:SA:yk301a11  P               Y102A5C.6       protein 
taxon:6239      20020201
*** Empty gene ID field in
WB                              GO:0040011      PMID:11099033   IMP 
WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
================================================== 