Problems with gene association files

Chris Mungall cjm at fruitfly.org
Fri Jun 28 12:00:00 PDT 2002


Hi Gabriel

The gene-associations files are direct submissions from the various model
organism and other groups, so there will probably always be problems
such as this.

You can always get the data either from the XML or from the monthly
database release, which is filtered for these kinds of things, although I
appreciate it's often useful to get the latest data.

We already have various scripts on the SourceForge site for parsing and
exporting this kind of data; another option would be for us to make these
easier to use and encourage the contributing groups to use these prior to
submission, as you suggest.
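
As a rough illustration of the kind of pre-submission check discussed here, the sketch below flags the three kinds of invalid record reported in the quoted message (empty gene ID, empty attribute ID, invalid evidence type). It is not one of the existing scripts. The column positions, the evidence-code set, the "!" comment convention and the function names are all assumptions made for the example.

```python
import re

# Assumed 0-based column positions in a tab-delimited association record.
GENE_ID_COL = 1
GO_ID_COL = 4
EVIDENCE_COL = 6

# Assumed set of valid evidence codes; adjust to the codes actually in use.
EVIDENCE_CODES = {"IC", "IDA", "IEA", "IEP", "IGI", "IMP",
                  "IPI", "ISS", "NAS", "ND", "TAS"}

GO_ID_RE = re.compile(r"GO:\d{7}")


def check_record(line):
    """Return a list of problems found in one record (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= EVIDENCE_COL:
        return ["too few fields"]
    problems = []
    if not fields[GENE_ID_COL].strip():
        problems.append("empty gene ID field")
    go_id = fields[GO_ID_COL].strip()
    if go_id in ("", "GO:"):
        problems.append("empty attribute ID field")
    elif not GO_ID_RE.fullmatch(go_id):
        problems.append(f"malformed attribute ID {go_id!r}")
    evidence = fields[EVIDENCE_COL].strip()
    if evidence not in EVIDENCE_CODES:
        problems.append(f"invalid evidence type {evidence!r}")
    return problems


def check_lines(lines):
    """Yield (line_number, problems) for every problematic record.

    Blank lines and lines starting with '!' (assumed to be comments)
    are skipped.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("!") or not line.strip():
            continue
        problems = check_record(line)
        if problems:
            yield number, problems
```

A contributing group could pass an open file handle to check_lines() and hold back submission whenever it yields anything.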

On Fri, 28 Jun 2002, Gabriel Berriz wrote:

> Hi everyone.
>
> When processing GO's gene association files, we often run into problematic
> records.  These fall into 4 categories, roughly in ascending degree of
> seriousness:
>
> 1. associations in which the GO attrib is one of the "unknown" attributes
> of the three main branches of the ontology.
>
> 2. associations in which the GO attrib is a descendant of one of the
> "obsolete" attributes of the three main branches of the ontology;
> presumably, these associations are obsolete.
>
> 3. associations in which the GO attrib is deprecated in favor of another one.
>
> 4. assorted invalid records, in which some important field is missing or
> malformed.
>
> Here is a summary of the above for the current version of the listed files:
>
> gene_associations.sgd:
> 2 "descendant-of-obsolete" attribs (2 associations)
> 2 deprecated attribs (6 associations)
> 3 "descendant-of-unknown" attribs (6762 associations)
>
> gene_associations.mgi:
> 24 "descendant-of-obsolete" attribs (429 associations)
> 3 "descendant-of-unknown" attribs (839 associations)
> 1 invalid entry
>
> gene_associations.fb:
> 22 "descendant-of-obsolete" attribs (478 associations)
> 4 deprecated attribs (21 associations)
> 6 invalid entries
>
> gene_associations.wb:
> 22 "descendant-of-obsolete" attribs (178 associations)
> 3 deprecated attribs (19 associations)
> 1 "descendant-of-unknown" attrib (96 associations)
> 29 invalid entries
>
>
> (For those interested, more detailed information about these problematic
> records is given below.)
>
>
> I would like to add the following few points to The Wish List:
>
> 1. Is it useful to have gene associations to the category of
> unknown?  Wouldn't lack of evidence already imply that?  I propose that all
> these associations be eliminated, or reassigned to a GO attribute other
> than one of the "unknown" ones.
>
> 2. It seems reasonable that if an attribute has been classified as obsolete
> (by making it a descendant of one of the "obsolete" attributes), then the
> associations in which this attribute appears should also become obsolete,
> and therefore expunged from the association file, or at least reassigned to
> a non-obsolete attribute.
>
> 3. It would be very helpful if the gene association files were in synch
> with the ontology files.  The next best thing would be a clean-up script
> that replaces all occurrences of deprecated attributes in the association
> files with their current versions.
>
> 4. The most serious problem is the records that are completely invalid,
> due to a missing or malformed field.  It would be nice if the gene
> association files were checked for data integrity before they are
> published.  It would be easy to write a Perl script to automate this checking.
>
> =============================================================
> Gabriel F. Berriz, PhD
> Bioinformatics Developer
> Roth Lab
> Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
> Seeley G. Mudd Building 322B
> Boston, MA 02115-5701
> Telephone: 617.432.3555
> Fax: 617.432.3557
>
>
>
>
> gene_associations.sgd
>
> 2 "descendant-of-obsolete" attribs (2 associations):
>
> 0006502: C-terminal protein prenylation<-obsolete<-biological_process
> 0006504: C-terminal protein geranylgeranylation<-C-terminal protein
> prenylation<-obsolete<-biological_process
>
>
> 2 deprecated attribs (6 associations):
>
> 0003907: purine-specific oxidized base lesion DNA
> N-glycosylase/8-oxoguanine DNA glycosylase/DNA glycosylase/AP-lyase/DNA
> glycosylase/beta-lyase/bifunctional DNA glycosylase/formamidopyrimidine-DNA
> glycosylase (now 0008534)
> 0019004: pyrimidine-specific oxidized base lesion DNA N-glycosylase/DNA
> glycosylase/AP-lyase/DNA glycosylase/beta-lyase/bifunctional DNA
> glycosylase/endodeoxyribonuclease III/endonuclease III (now 0000703)
>
>
> 3 "descendant-of-unknown" attribs (6762 associations):
>
> 0008372: cellular_component unknown<-cellular_component
> 0005554: molecular_function unknown<-molecular_function
> 0000004: biological_process unknown<-biological_process
>
>
> ==================================================
>
> gene_associations.mgi
>
> 24 "descendant-of-obsolete" attribs (429 associations):
>
> 0005558: minor histocompatibility antigen<-obsolete<-molecular_function
> 0009460: cytochrome b<-cytochrome<-obsolete<-molecular_function
> 0005211: plasma glycoprotein<-obsolete<-molecular_function
> 0009461: cytochrome c<-cytochrome<-obsolete<-molecular_function
> 0016583: nucleosome modeling<-obsolete<-biological_process
> 0015023: syndecan<-obsolete<-molecular_function
> 0005206: heparin sulfate proteoglycan<-obsolete<-molecular_function
> 0005207: extracellular matrix glycoprotein<-obsolete<-molecular_function
> 0005208: amyloid protein<-obsolete<-molecular_function
> 0005073: common-partner SMAD protein<-obsolete<-molecular_function
> 0005074: inhibitory SMAD protein<-obsolete<-molecular_function
> 0005075: pathway-specific SMAD protein<-obsolete<-molecular_function
> 0003820: class I major histocompatibility complex
> antigen<-obsolete<-molecular_function; class I major histocompatibility
> complex antigen<-endogenous peptide receptor<-transmembrane
> receptor<-receptor<-signal transducer<-molecular_function
> 0009487: glutaredoxin<-obsolete<-molecular_function
> 0006502: C-terminal protein prenylation<-obsolete<-biological_process
> 0003821: class II major histocompatibility complex
> antigen<-obsolete<-molecular_function; class II major histocompatibility
> complex antigen<-exogenous peptide receptor<-transmembrane
> receptor<-receptor<-signal transducer<-molecular_function
> 0003822: MHC-interacting protein<-obsolete<-molecular_function
> 0008222: tumor antigen<-obsolete<-molecular_function
> 0003819: major histocompatibility complex antigen/MHC
> protein<-obsolete<-molecular_function
> 0016171: cell surface antigen<-obsolete<-molecular_function
> 0005189: milk protein<-obsolete<-molecular_function
> 0005570: small nuclear RNA<-obsolete<-molecular_function
> 0005555: blood group antigen<-obsolete<-molecular_function
> 0005557: lymphocyte antigen<-obsolete<-molecular_function
>
>
> 3 "descendant-of-unknown" attribs (839 associations):
>
> 0005554: molecular_function unknown<-molecular_function
> 0008372: cellular_component unknown<-cellular_component
> 0000004: biological_process unknown<-biological_process
>
>
> 1 invalid entry:
>
> *** Invalid evidence type '05/06/2002
> ' in
>          F       ubiquitin protein ligase E3 component n-recognin 1      E3
> alpha        gene    taxon:10090     05/06/2002
>
> ==================================================
>
> gene_associations.fb
>
> 22 "descendant-of-obsolete" attribs (478 associations):
>
> 0005566: ribosomal RNA<-obsolete<-molecular_function
> 0008337: selectin<-obsolete<-molecular_function
> 0005569: small nucleolar RNA<-obsolete<-molecular_function
> 0008436: heterogeneous nuclear
> ribonucleoprotein/hnRNP<-obsolete<-molecular_function
> 0009461: cytochrome c<-cytochrome<-obsolete<-molecular_function
> 0016583: nucleosome modeling<-obsolete<-biological_process
> 0005206: heparin sulfate proteoglycan<-obsolete<-molecular_function
> 0008001: fibrinogen<-obsolete<-molecular_function
> 0005207: extracellular matrix glycoprotein<-obsolete<-molecular_function
> 0009464: cytochrome b5<-cytochrome b<-cytochrome<-obsolete<-molecular_function
> 0005208: amyloid protein<-obsolete<-molecular_function
> 0004600: cyclophilin<-obsolete<-molecular_function
> 0009477: cytochrome c1<-cytochrome c<-cytochrome<-obsolete<-molecular_function
> 0009487: glutaredoxin<-obsolete<-molecular_function
> 0003750: cell cycle regulator<-obsolete<-molecular_function
> 0003733: ribonucleoprotein<-obsolete<-molecular_function
> 0003734: small nuclear ribonucleoprotein/snRNP<-obsolete<-molecular_function
> 0000047: Rieske iron-sulfur protein<-obsolete<-molecular_function
> 0005188: larval serum protein (sensu
> Insecta)/arylphorin<-obsolete<-molecular_function
> 0005570: small nuclear RNA<-obsolete<-molecular_function
> 0005563: transfer RNA<-obsolete<-molecular_function
> 0005734: box C+D snoRNP protein<-obsolete<-cellular_component
>
>
> 4 deprecated attribs (21 associations):
>
> 0008279: cohesin/14S cohesin (now 0008278)
> 0008620: condensin/13S condensin (now 0005676)
> 0007408: neuroblast determination/neuroblast identity determination (now
> 0007400)
> 0019004: pyrimidine-specific oxidized base lesion DNA N-glycosylase/DNA
> glycosylase/AP-lyase/DNA glycosylase/beta-lyase/bifunctional DNA
> glycosylase/endodeoxyribonuclease III/endonuclease III (now 0000703)
>
>
> 6 invalid entries:
>
> *** Empty attribute ID field in
> FB      FBgn0025720     Ate1            GO:     FB:FBrf0145404  NAS
>      F               gene    taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0038659     endoA           GO:     FB:FBrf0145518  NAS
>      F               gene    taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0060297     p74S            GO:
> FB:FBrf0128634|PMID:10892653    IDA             F               gene
> taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0060298     p69             GO:
> FB:FBrf0128634|PMID:10892653    IDA             F               gene
> taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0060299     p145            GO:
> FB:FBrf0128634|PMID:10892653    IDA             F               gene
> taxonID:7227
> *** Empty gene ID field in
> FB                              GO:0006951      FB:FBrf0105495  NAS
>      P               gene    taxonID:7227
> ==================================================
>
> gene_associations.wb
>
> 22 "descendant-of-obsolete" attribs (178 associations):
>
> 0005906: clathrin adaptor/adaptin<-obsolete<-cellular_component
> 0009406: virulence<-obsolete<-biological_process
> 0003892: proliferating cell nuclear antigen/PCNA<-obsolete<-molecular_function
> 0006832: small molecule transport<-obsolete<-biological_process
> 0009460: cytochrome b<-cytochrome<-obsolete<-molecular_function
> 0007012: actin cytoskeleton reorganization<-obsolete<-biological_process
> 0005208: amyloid protein<-obsolete<-molecular_function
> 0004431: 1-phosphatidylinositol-4-phosphate kinase/class I
> PI3K<-obsolete<-molecular_function
> 0005065: heterotrimeric G protein<-obsolete<-molecular_function
> 0009487: glutaredoxin<-obsolete<-molecular_function
> 0000008: thioredoxin<-obsolete<-molecular_function
> 0006502: C-terminal protein prenylation<-obsolete<-biological_process
> 0004429: 1-phosphatidylinositol 3-kinase/class III
> PI3K<-obsolete<-molecular_function
> 0005620: periplasmic space<-obsolete<-cellular_component
> 0003750: cell cycle regulator<-obsolete<-molecular_function
> 0003734: small nuclear ribonucleoprotein/snRNP<-obsolete<-molecular_function
> 0007048: oncogenesis<-obsolete<-biological_process
> 0008304: eukaryotic translation initiation factor 4
> complex/eIF-4<-obsolete<-cellular_component
> 0000047: Rieske iron-sulfur protein<-obsolete<-molecular_function
> 0005480: vesicle transport<-obsolete<-molecular_function
> 0008164: organophosphorous resistance<-obsolete<-biological_process
> 0005468: small-molecule carrier or transporter<-obsolete<-molecular_function
>
>
> 3 deprecated attribs (19 associations):
>
> 0008497: phospholipid transporter/phospholipid transporter (now 0005548)
> 0008433: guanyl-nucleotide exchange factor/GEF/GNRP/guanyl-nucleotide
> releasing factor (now 0005085)
> 0003710: transcriptional activator/transcription activating factor (now
> 0016563)
>
>
> 1 "descendant-of-unknown" attrib (96 associations):
>
> 0005554: molecular_function unknown<-molecular_function
>
>
> 29 invalid entries:
>
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11137018   IMP
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11137018   IMP
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP
> WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP
> WB:SA:yk237h6   P               C33E10.3        protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP
> WB:SA:yk237h6   P               C33E10.7        protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP
> WB:SA:yk237h6   P               C33E10.9        protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0002119      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0002119      PMID:11231151   IMP
> WB:SA:yk282c1   P               Y39B6B.EE       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0002119      PMID:11231151   IMP
> WB:SA:yk301a11  P               Y102A5C.6       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007276      PMID:11231151   IMP
> WB:SA:yk301a11  P               Y102A5C.6       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11137018   IMP
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11137018   IMP
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP
> WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP
> WB:SA:yk282c1   P               Y39B6B.EE       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP
> WB:SA:yk301g7   P               F56D12.5        protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP
> WB:SA:yk342e7   P               Y52B11A.11      protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007397      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11231151   IMP
> WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11231151   IMP
> WB:SA:yk282c1   P               Y39B6B.EE       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11231151   IMP
> WB:SA:yk301a11  P               Y102A5C.6       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007626      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040007      PMID:11231151   IMP
> WB:SA:yk282c1   P               Y39B6B.EE       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040007      PMID:11231151   IMP
> WB:SA:yk301a11  P               Y102A5C.6       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040010      PMID:11231151   IMP
> WB:SA:yk282c1   P               Y39B6B.EE       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040010      PMID:11231151   IMP
> WB:SA:yk301a11  P               Y102A5C.6       protein
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040011      PMID:11099033   IMP
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> ==================================================
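
The clean-up proposed in wish-list items 2 and 3 of the quoted message could be sketched roughly as follows. The deprecated-to-current pairs come from the "(now ...)" notes in the report, with a "GO:" prefix added to match the file records. The toy parent map uses term names rather than IDs, because the report gives IDs only for the annotated terms. The function names and data layout are assumptions for illustration, not an existing GO tool.

```python
# Deprecated -> current ID pairs, taken from the "(now ...)" notes above.
REPLACEMENTS = {
    "GO:0003907": "GO:0008534",
    "GO:0019004": "GO:0000703",
    "GO:0008279": "GO:0008278",
}

# Toy child -> parents map built from lineages shown in the report.
# Term names stand in for IDs (an assumption for this sketch).
PARENTS = {
    "cytochrome b5": ["cytochrome b"],
    "cytochrome b": ["cytochrome"],
    "cytochrome": ["obsolete"],
    "obsolete": ["molecular_function"],
    "molecular_function unknown": ["molecular_function"],
}


def current_id(go_id, replacements=REPLACEMENTS):
    """Follow replacement links until reaching an ID that is not deprecated."""
    seen = set()
    while go_id in replacements and go_id not in seen:
        seen.add(go_id)  # guards against accidental cycles in the map
        go_id = replacements[go_id]
    return go_id


def is_descendant_of(term, ancestor, parents=PARENTS):
    """Return True if `ancestor` is reachable from `term` via parent links."""
    stack = list(parents.get(term, []))
    visited = set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node not in visited:
            visited.add(node)
            stack.extend(parents.get(node, []))
    return False
```

With these two helpers, a clean-up pass could rewrite each record's attribute ID with current_id() and set aside any record whose term satisfies is_descendant_of(term, "obsolete").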


--
This message is from the GOFriends moderated mailing list.  A list of public
announcements and discussion of the Gene Ontology (GO) project.
Problems with the list?           E-mail: owner-gofriends at geneontology.org
Subscribing   send   "subscribe"   to   gofriends-request at geneontology.org
Unsubscribing send   "unsubscribe"  to  gofriends-request at geneontology.org
Web:          http://www.geneontology.org/


