Problems with gene association files

Rebecca Foulger ref26 at gen.cam.ac.uk
Mon Jul 1 07:04:17 PDT 2002


Dear Gabriel,

I've had a look at the FlyBase errors you point out, and I think this is also a timing problem. As Harold pointed out for MGI, when GO terms are made obsolete, we replace any occurrences of them in our FB files with appropriate valid terms as soon as possible. The problem terms you mention have therefore already been updated in our files.

Hope this helps!

Thanks

Becky


--------------------------------------------------------------
Rebecca Foulger.

FlyBase (Cambridge),
Department of Genetics,
University of Cambridge,
Downing Street,                       email: ref26 at gen.cam.ac.uk
Cambridge,  CB2 3EH,                  Ph : 01223-333963
UK.                                   FAX: 01223-333992
-------------------------------------------------------------- 



> Date: Fri, 28 Jun 2002 14:13:19 -0400
> To: gofriends at genome.Stanford.EDU
> From: Gabriel Berriz <gberriz at hms.harvard.edu>
> Subject: Problems with gene association files
> 
> Hi everyone.
> 
> When processing GO's gene association files, we often run into problematic 
> records.  These fall into 4 categories, roughly in ascending degree of 
> seriousness:
> 
> 1. associations in which the GO attrib is one of the "unknown" attributes 
> of the three main branches of the ontology.
> 
> 2. associations in which the GO attrib is a descendant of one of the 
> "obsolete" attributes of the three main branches of the ontology; 
> presumably, these associations are obsolete.
> 
> 3. associations in which the GO attrib is deprecated in favor of another one.
> 
> 4. assorted invalid records, in which some important field is missing or 
> malformed.
> 
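Categories 1 and 2 above can be detected by walking the ontology graph. The following is only a minimal sketch in Python, not the script that produced this report. It assumes the ontology has already been loaded into two hypothetical dictionaries, `parents` (term ID to list of parent IDs) and `names` (term ID to term name), and that the special nodes are named "obsolete" and "<branch> unknown" as they appear in the listings in this message.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` in a DAG given as {child: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def classify(term, parents, names):
    """Return 'unknown', 'descendant-of-obsolete', or 'ok' for one term ID.

    Assumption: the "unknown" terms have names ending in " unknown" and the
    obsolete container nodes are named exactly "obsolete".
    """
    if names.get(term, "").endswith(" unknown"):
        return "unknown"
    if any(names.get(a) == "obsolete" for a in ancestors(term, parents)):
        return "descendant-of-obsolete"
    return "ok"


# Toy data with made-up IDs; real input would come from the ontology files.
toy_parents = {"t_cyt_c": ["t_cyt"], "t_cyt": ["t_obs"], "t_obs": ["t_root"],
               "t_unk": ["t_root"]}
toy_names = {"t_cyt_c": "cytochrome c", "t_cyt": "cytochrome",
             "t_obs": "obsolete", "t_unk": "molecular_function unknown",
             "t_root": "molecular_function"}
print(classify("t_cyt_c", toy_parents, toy_names))
```

Because a term can have more than one parent, the check looks at all ancestors; a term with one path through "obsolete" and another through a live branch (as with 0003820 in the MGI listing) is still flagged.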
> Here is a summary of the above for the current version of the listed files:
> 
> gene_associations.sgd:
> 2 "descendant-of-obsolete" attribs (2 associations)
> 2 deprecated attribs (6 associations)
> 3 "descendant-of-unknown" attribs (6762 associations)
> 
> gene_associations.mgi:
> 24 "descendant-of-obsolete" attribs (429 associations)
> 3 "descendant-of-unknown" attribs (839 associations)
> 1 invalid entry
> 
> gene_associations.fb:
> 22 "descendant-of-obsolete" attribs (478 associations)
> 4 deprecated attribs (21 associations)
> 6 invalid entries
> 
> gene_associations.wb:
> 22 "descendant-of-obsolete" attribs (178 associations)
> 3 deprecated attribs (19 associations)
> 1 "descendant-of-unknown" attrib (96 associations)
> 29 invalid entries
> 
> 
> (For those interested, more detailed information about these problematic 
> records is given below.)
> 
> 
> I would like to add the following few points to The Wish List:
> 
> 1. Is it useful to have gene associations to the category of 
> unknown?  Wouldn't lack of evidence already imply that?  I propose that all 
> these associations be eliminated, or reassigned to a GO attribute other 
> than one of the "unknown" ones.
> 
> 2. It seems reasonable that if an attribute has been classified as obsolete 
> (by making it a descendant of one of the "obsolete" attributes), then the 
> associations in which this attribute appears should also become obsolete, 
> and therefore expunged from the association file, or at least reassigned to 
> a non-obsolete attribute.
> 
> 3. It would be very helpful if the gene association files were in synch 
> with the ontology files.  The next best thing would be a clean-up script 
> that replaces all occurrences of deprecated attributes in the association 
> files with their current versions.
> 
> 4. The most serious problem is the set of records that are completely invalid, 
> due to a missing or malformed field.  It would be nice if the gene 
> association files were checked for data integrity before they are 
> published.  It would be easy to write a Perl script to automate this checking.
> 
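The integrity check proposed in point 4 could look roughly like the sketch below. It is written in Python rather than Perl and is an illustration only. The column positions, the list of evidence codes, and the convention that comment lines start with "!" are assumptions about the tab-delimited association format and should be checked against the current file specification; the function names are hypothetical.

```python
import re

# Assumed 0-based column positions in a tab-delimited association record.
COL_GENE_ID, COL_GO_ID, COL_EVIDENCE = 1, 4, 6
# Assumed set of accepted evidence codes; adjust to the project's actual list.
EVIDENCE_CODES = {"IC", "IDA", "IEA", "IEP", "IGI", "IMP",
                  "IPI", "ISS", "NAS", "ND", "TAS"}
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


def check_record(line):
    """Return a list of problem descriptions for one record (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= COL_EVIDENCE:
        return ["Too few fields"]
    problems = []
    if not fields[COL_GENE_ID].strip():
        problems.append("Empty gene ID field")
    go_id = fields[COL_GO_ID].strip()
    if go_id in ("", "GO:"):
        problems.append("Empty attribute ID field")
    elif not GO_ID_PATTERN.match(go_id):
        problems.append("Malformed attribute ID %r" % go_id)
    evidence = fields[COL_EVIDENCE].strip()
    if evidence not in EVIDENCE_CODES:
        problems.append("Invalid evidence type %r" % evidence)
    return problems


def check_file(lines):
    """Yield (line_number, problem) pairs, skipping blank and '!' comment lines."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("!") or not line.strip():
            continue
        for problem in check_record(line):
            yield number, problem
```

Running `check_file` over an open file handle before each release would flag records with an empty gene ID, an empty or malformed GO ID, or an unrecognised evidence code, which are the kinds of invalid entries listed further down in this message.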
> =============================================================
> Gabriel F. Berriz, PhD
> Bioinformatics Developer
> Roth Lab
> Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
> Seeley G. Mudd Building 322B
> Boston, MA 02115-5701
> Telephone: 617.432.3555
> Fax: 617.432.3557
> 
> 
> 
> 
> gene_associations.sgd
> 
> 2 "descendant-of-obsolete" attribs (2 associations):
> 
> 0006502: C-terminal protein prenylation<-obsolete<-biological_process
> 0006504: C-terminal protein geranylgeranylation<-C-terminal protein 
> prenylation<-obsolete<-biological_process
> 
> 
> 2 deprecated attribs (6 associations):
> 
> 0003907: purine-specific oxidized base lesion DNA 
> N-glycosylase/8-oxoguanine DNA glycosylase/DNA glycosylase/AP-lyase/DNA 
> glycosylase/beta-lyase/bifunctional DNA glycosylase/formamidopyrimidine-DNA 
> glycosylase (now 0008534)
> 0019004: pyrimidine-specific oxidized base lesion DNA N-glycosylase/DNA 
> glycosylase/AP-lyase/DNA glycosylase/beta-lyase/bifunctional DNA 
> glycosylase/endodeoxyribonuclease III/endonuclease III (now 0000703)
> 
> 
> 3 "descendant-of-unknown" attribs (6762 associations):
> 
> 0008372: cellular_component unknown<-cellular_component
> 0005554: molecular_function unknown<-molecular_function
> 0000004: biological_process unknown<-biological_process
> 
> 
> ==================================================
> 
> gene_associations.mgi
> 
> 24 "descendant-of-obsolete" attribs (429 associations):
> 
> 0005558: minor histocompatibility antigen<-obsolete<-molecular_function
> 0009460: cytochrome b<-cytochrome<-obsolete<-molecular_function
> 0005211: plasma glycoprotein<-obsolete<-molecular_function
> 0009461: cytochrome c<-cytochrome<-obsolete<-molecular_function
> 0016583: nucleosome modeling<-obsolete<-biological_process
> 0015023: syndecan<-obsolete<-molecular_function
> 0005206: heparin sulfate proteoglycan<-obsolete<-molecular_function
> 0005207: extracellular matrix glycoprotein<-obsolete<-molecular_function
> 0005208: amyloid protein<-obsolete<-molecular_function
> 0005073: common-partner SMAD protein<-obsolete<-molecular_function
> 0005074: inhibitory SMAD protein<-obsolete<-molecular_function
> 0005075: pathway-specific SMAD protein<-obsolete<-molecular_function
> 0003820: class I major histocompatibility complex 
> antigen<-obsolete<-molecular_function; class I major histocompatibility 
> complex antigen<-endogenous peptide receptor<-transmembrane 
> receptor<-receptor<-signal transducer<-molecular_function
> 0009487: glutaredoxin<-obsolete<-molecular_function
> 0006502: C-terminal protein prenylation<-obsolete<-biological_process
> 0003821: class II major histocompatibility complex 
> antigen<-obsolete<-molecular_function; class II major histocompatibility 
> complex antigen<-exogenous peptide receptor<-transmembrane 
> receptor<-receptor<-signal transducer<-molecular_function
> 0003822: MHC-interacting protein<-obsolete<-molecular_function
> 0008222: tumor antigen<-obsolete<-molecular_function
> 0003819: major histocompatibility complex antigen/MHC 
> protein<-obsolete<-molecular_function
> 0016171: cell surface antigen<-obsolete<-molecular_function
> 0005189: milk protein<-obsolete<-molecular_function
> 0005570: small nuclear RNA<-obsolete<-molecular_function
> 0005555: blood group antigen<-obsolete<-molecular_function
> 0005557: lymphocyte antigen<-obsolete<-molecular_function
> 
> 
> 3 "descendant-of-unknown" attribs (839 associations):
> 
> 0005554: molecular_function unknown<-molecular_function
> 0008372: cellular_component unknown<-cellular_component
> 0000004: biological_process unknown<-biological_process
> 
> 
> 1 invalid entry:
> 
> *** Invalid evidence type '05/06/2002
> ' in
>          F       ubiquitin protein ligase E3 component n-recognin 1      E3 
> alpha        gene    taxon:10090     05/06/2002
> 
> ==================================================
> 
> gene_associations.fb
> 
> 22 "descendant-of-obsolete" attribs (478 associations):
> 
> 0005566: ribosomal RNA<-obsolete<-molecular_function (NOT IN GOGENES)
> 0008337: selectin<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005569: small nucleolar RNA<-obsolete<-molecular_function (NOT IN GOGENES)
> 0008436: heterogeneous nuclear 
> ribonucleoprotein/hnRNP<-obsolete<-molecular_function (NOT IN GOGENES)
> 0009461: cytochrome c<-cytochrome<-obsolete<-molecular_function (NOT IN GOGENES)
> 0016583: nucleosome modeling<-obsolete<-biological_process (NOT IN GOGENES)
> 0005206: heparin sulfate proteoglycan<-obsolete<-molecular_function (WAITING TO BE CHANGED)
> 0008001: fibrinogen<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005207: extracellular matrix glycoprotein<-obsolete<-molecular_function (NOT IN GOGENES)
> 0009464: cytochrome b5<-cytochrome b<-cytochrome<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005208: amyloid protein<-obsolete<-molecular_function (NOT IN GOGENES)
> 0004600: cyclophilin<-obsolete<-molecular_function (NOT IN GOGENES)
> 0009477: cytochrome c1<-cytochrome c<-cytochrome<-obsolete<-molecular_function (NOT IN GOGENES)
> 0009487: glutaredoxin<-obsolete<-molecular_function (NOT IN GOGENES)
> 0003750: cell cycle regulator<-obsolete<-molecular_function (NOT IN GOGENES)
> 0003733: ribonucleoprotein<-obsolete<-molecular_function (NOT IN GOGENES)
> 0003734: small nuclear ribonucleoprotein/snRNP<-obsolete<-molecular_function (NOT IN GOGENES)
> 0000047: Rieske iron-sulfur protein<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005188: larval serum protein (sensu 
> Insecta)/arylphorin<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005570: small nuclear RNA<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005563: transfer RNA<-obsolete<-molecular_function (NOT IN GOGENES)
> 0005734: box C+D snoRNP protein<-obsolete<-cellular_component (NOT IN GOGENES)
> 
> 
> 4 deprecated attribs (21 associations):
> 
> 0008279: cohesin/14S cohesin (now 0008278) (ALL CHANGED ALREADY IN GOGENES)
> 0008620: condensin/13S condensin (now 0005676) (ALL CHANGED ALREADY IN GOGENES)
> 0007408: neuroblast determination/neuroblast identity determination (now 
> 0007400) (ALL CHANGED ALREADY IN GOGENES)
> 0019004: pyrimidine-specific oxidized base lesion DNA N-glycosylase/DNA 
> glycosylase/AP-lyase/DNA glycosylase/beta-lyase/bifunctional DNA 
> glycosylase/endodeoxyribonuclease III/endonuclease III (now 0000703) (ALL CHANGED ALREADY IN GOGENES)
> 
> 
> 6 invalid entries: (updated in our files already)
> 
> *** Empty attribute ID field in
> FB      FBgn0025720     Ate1            GO:     FB:FBrf0145404  NAS 
>      F               gene    taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0038659     endoA           GO:     FB:FBrf0145518  NAS 
>      F               gene    taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0060297     p74S            GO: 
> FB:FBrf0128634|PMID:10892653    IDA             F               gene 
> taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0060298     p69             GO: 
> FB:FBrf0128634|PMID:10892653    IDA             F               gene 
> taxonID:7227
> *** Empty attribute ID field in
> FB      FBgn0060299     p145            GO: 
> FB:FBrf0128634|PMID:10892653    IDA             F               gene 
> taxonID:7227
> *** Empty gene ID field in
> FB                              GO:0006951      FB:FBrf0105495  NAS 
>      P               gene    taxonID:7227
> ==================================================
> 
> gene_associations.wb
> 
> 22 "descendant-of-obsolete" attribs (178 associations):
> 
> 0005906: clathrin adaptor/adaptin<-obsolete<-cellular_component
> 0009406: virulence<-obsolete<-biological_process
> 0003892: proliferating cell nuclear antigen/PCNA<-obsolete<-molecular_function
> 0006832: small molecule transport<-obsolete<-biological_process
> 0009460: cytochrome b<-cytochrome<-obsolete<-molecular_function
> 0007012: actin cytoskeleton reorganization<-obsolete<-biological_process
> 0005208: amyloid protein<-obsolete<-molecular_function
> 0004431: 1-phosphatidylinositol-4-phosphate kinase/class I 
> PI3K<-obsolete<-molecular_function
> 0005065: heterotrimeric G protein<-obsolete<-molecular_function
> 0009487: glutaredoxin<-obsolete<-molecular_function
> 0000008: thioredoxin<-obsolete<-molecular_function
> 0006502: C-terminal protein prenylation<-obsolete<-biological_process
> 0004429: 1-phosphatidylinositol 3-kinase/class III 
> PI3K<-obsolete<-molecular_function
> 0005620: periplasmic space<-obsolete<-cellular_component
> 0003750: cell cycle regulator<-obsolete<-molecular_function
> 0003734: small nuclear ribonucleoprotein/snRNP<-obsolete<-molecular_function
> 0007048: oncogenesis<-obsolete<-biological_process
> 0008304: eukaryotic translation initiation factor 4 
> complex/eIF-4<-obsolete<-cellular_component
> 0000047: Rieske iron-sulfur protein<-obsolete<-molecular_function
> 0005480: vesicle transport<-obsolete<-molecular_function
> 0008164: organophosphorous resistance<-obsolete<-biological_process
> 0005468: small-molecule carrier or transporter<-obsolete<-molecular_function
> 
> 
> 3 deprecated attribs (19 associations):
> 
> 0008497: phospholipid transporter/phospholipid transporter (now 0005548)
> 0008433: guanyl-nucleotide exchange factor/GEF/GNRP/guanyl-nucleotide 
> releasing factor (now 0005085)
> 0003710: transcriptional activator/transcription activating factor (now 
> 0016563)
> 
> 
> 1 "descendant-of-unknown" attrib (96 associations):
> 
> 0005554: molecular_function unknown<-molecular_function
> 
> 
> 29 invalid entries:
> 
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11137018   IMP 
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11137018   IMP 
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP 
> WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP 
> WB:SA:yk237h6   P               C33E10.3        protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP 
> WB:SA:yk237h6   P               C33E10.7        protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0000003      PMID:11231151   IMP 
> WB:SA:yk237h6   P               C33E10.9        protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0002119      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0002119      PMID:11231151   IMP 
> WB:SA:yk282c1   P               Y39B6B.EE       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0002119      PMID:11231151   IMP 
> WB:SA:yk301a11  P               Y102A5C.6       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007276      PMID:11231151   IMP 
> WB:SA:yk301a11  P               Y102A5C.6       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11137018   IMP 
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11137018   IMP 
> WB:KK:B0250.1   P               B0250.3 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP 
> WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP 
> WB:SA:yk282c1   P               Y39B6B.EE       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP 
> WB:SA:yk301g7   P               F56D12.5        protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007345      PMID:11231151   IMP 
> WB:SA:yk342e7   P               Y52B11A.11      protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007397      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11231151   IMP 
> WB:SA:yk190g1   P               B0491.8 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11231151   IMP 
> WB:SA:yk282c1   P               Y39B6B.EE       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007582      PMID:11231151   IMP 
> WB:SA:yk301a11  P               Y102A5C.6       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0007626      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040007      PMID:11231151   IMP 
> WB:SA:yk282c1   P               Y39B6B.EE       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040007      PMID:11231151   IMP 
> WB:SA:yk301a11  P               Y102A5C.6       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040010      PMID:11231151   IMP 
> WB:SA:yk282c1   P               Y39B6B.EE       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040010      PMID:11231151   IMP 
> WB:SA:yk301a11  P               Y102A5C.6       protein 
> taxon:6239      20020201
> *** Empty gene ID field in
> WB                              GO:0040011      PMID:11099033   IMP 
> WB:JA:F53B8.1   P               F53B8.1 protein taxon:6239      20020201
> ================================================== 
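The clean-up script suggested in wish-list point 3 could be sketched as below, in Python. This is an illustration under stated assumptions, not an existing GO tool: the function name `remap_deprecated` is hypothetical, the GO ID is assumed to sit in the fifth tab-delimited column, and lines starting with "!" are assumed to be comments. The two example mappings are taken from the "(now ...)" entries in the FlyBase listing above.

```python
def remap_deprecated(lines, replacements, go_column=4):
    """Yield records with deprecated GO IDs replaced by their current IDs.

    `replacements` maps a deprecated ID to its current ID. `go_column` is the
    assumed 0-based position of the GO ID field. Each data line is re-emitted
    with a single trailing newline; comment lines pass through unchanged.
    """
    for line in lines:
        if line.startswith("!"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > go_column:
            old_id = fields[go_column]
            fields[go_column] = replacements.get(old_id, old_id)
        yield "\t".join(fields) + "\n"


# Mappings as reported in this message (deprecated ID -> current ID).
REPLACEMENTS = {
    "GO:0008279": "GO:0008278",
    "GO:0008620": "GO:0005676",
}
```

Only the GO ID column is rewritten, so an ID that happens to appear in another field is left untouched. The mapping table itself would have to be regenerated from the ontology files each time they change.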

--
This message is from the GOFriends moderated mailing list.  A list of public
announcements and discussion of the Gene Ontology (GO) project.
Problems with the list?           E-mail: owner-gofriends at geneontology.org
Subscribing   send   "subscribe"   to   gofriends-request at geneontology.org
Unsubscribing send   "unsubscribe"  to  gofriends-request at geneontology.org
Web:          http://www.geneontology.org/


