[bioontology-support] Any suggestions for improving concept search results from the new API?

Lee M Surprenant lmsurpre at us.ibm.com
Tue Mar 18 15:48:13 PDT 2014


Ray, please excuse my bumping this old thread...

To me, subtree search is not usable in its current state.  For instance,
consider the Melanoma example in the documentation:


If you perform this search one level higher, at Melanocytic Neoplasm
(C7058), it takes almost 30 seconds.
http://data.bioontology.org/search?q=melanoma&ontology=NCIT&subtree_id=http%3a%2f%2fncicb.nci.nih.gov%2fxml%2fowl%2fEVS%2fThesaurus.owl%23C7058
If you perform it two levels higher, at Neoplasm by Morphology (C4741), it
returns a 502 Bad Gateway (presumably due to a timeout).
http://data.bioontology.org/search?q=melanoma&ontology=NCIT&subtree_id=http%3a%2f%2fncicb.nci.nih.gov%2fxml%2fowl%2fEVS%2fThesaurus.owl%23C4741


Regarding the AND vs OR stuff, my main concern was whether hits in the
DEFINITION field (via 'include_properties') are being given any weight
since the concepts which match BOTH terms in the definition field were
returned below the concepts that matched a single search term in the
prefLabel/synonym fields.
That said, I may have obsessed on this single example too much...need to
see how often this is really an issue now that we'll be using it more.
Will follow up again if I find more evidence of this causing issues.

thanks,

Lee Surprenant
IBM Emerging Technologies | jStart Team



From:	Ray Fergerson <ray.fergerson at stanford.edu>
To:	Lee M Surprenant/Raleigh/IBM at IBMUS
Cc:	"support" <support at bioontology.org>
Date:	01/10/2014 09:21 PM
Subject:	RE: [bioontology-support] Any suggestions for improving concept
            search results from the new API?



Lee,

Yes, we did change the search behavior as a result of your earlier
suggestions. Currently we use the default Lucene behavior for search. Thus
we return good matches first and then increasingly bad ones (by Lucene’s
judgement). This means that a match on all words in the string will come
first but eventually you will get matches that are only good on one of the
words. It seems to us that this is good behavior. If you, for example,
misspell a word you will get results from one word but not the other. This
should help to locate the problem. This seems better than returning no
matches in the event of one misspelled word.
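
The ordering described above can be illustrated with a toy sketch. It is
not Lucene's actual scoring: it only counts how many query words occur in
each label, and the labels and function names are made up for illustration.

```python
def count_matched_terms(query, text):
    """Count how many distinct query terms occur as whole words in text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words)


def search_or(query, labels):
    """Keep labels matching at least one term, best matches first."""
    scored = [(count_matched_terms(query, label), label) for label in labels]
    hits = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so labels with equal scores keep their original order.
    ranked = sorted(hits, key=lambda pair: pair[0], reverse=True)
    return [label for _, label in ranked]


def search_and(query, labels):
    """Keep only labels that contain every query term."""
    needed = len(set(query.lower().split()))
    return [label for label in labels if count_matched_terms(query, label) == needed]


# Made-up labels for illustration; these are not real search results.
LABELS = ["tretinoin", "cytarabine and tretinoin regimen", "cytarabine"]

both_first = search_or("tretinoin cytarabine", LABELS)
typo_or = search_or("tretinoin cytarabin", LABELS)    # second word misspelled
typo_and = search_and("tretinoin cytarabin", LABELS)  # strict AND finds nothing
```

With the misspelled query, the OR-style search still returns the labels
that contain "tretinoin", while the strict AND search returns nothing.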

Ray

From: Lee M Surprenant [mailto:lmsurpre at us.ibm.com]
Sent: Monday, January 6, 2014 6:36 AM
To: Ray Fergerson
Cc: support
Subject: RE: [bioontology-support] Any suggestions for improving concept
search results from the new API?



Ray,

It looks like the search mechanism was updated in December?

The new results are much better.  My sample search, "Tretinoin Cytarabine",
now returns 149 results, with the top 7 being the ones containing both
terms.


However, I'm still not getting the desired results from my other sample:
"history and physical".  What I was hoping for is results similar to
the v1 api, which surfaced strong matches like "Work-up" (C85833) and
"Review of Systems" (C95618). Here is what I'm seeing instead:


|----------------------+-------------------------------------------+----------------------------+----------------------------------------|
| search query         | options                                   | number of hits             | comment                                |
|----------------------+-------------------------------------------+----------------------------+----------------------------------------|
| history and physical | ontologies=NCIT                           | 58 pages * 50/page = 2900  | shouldn't you be using stopwords to    |
|                      |                                           |                            | prevent matching the word 'and'?       |
|----------------------+-------------------------------------------+----------------------------+----------------------------------------|
| history physical     | ontologies=NCIT                           | 5 pages * 50/page = 250    | highest results include only "history" |
|                      |                                           |                            | OR "physical", but not both            |
|----------------------+-------------------------------------------+----------------------------+----------------------------------------|
| history physical     | added include_properties=true             | 41 pages * 50/page = 2050  | still no sign of the "good matches"    |
|                      |                                           |                            | which include both terms               |
|----------------------+-------------------------------------------+----------------------------+----------------------------------------|
| history physical     | added subtree search - look only in       | INTERNAL SERVER ERROR      | took long time to respond. maybe a     |
|                      | Activity subtree. Same result with or     |                            | performance/timeout issue?             |
|                      | without include_properties, so I think it |                            |                                        |
|                      | has more to do with size of subtree than  |                            |                                        |
|                      | with number of search results?            |                            |                                        |
|----------------------+-------------------------------------------+----------------------------+----------------------------------------|





So, it seems like the search now performs a simple "OR" on the search
terms, but in this case I'd much prefer an AND.  "OR" would be OK if the
results were well-ordered (like for "Tretinoin Cytarabine"), but in this
case none of the top results contain both search terms.  Maybe it is
related to the fact the matches are in the description
(include_properties=true) and not the concept name?

Here are the latter two queries for testing (and reproducing that subtree
search error):
http://data.bioontology.org/search?q=history%20physical&ontologies=NCIT&include_properties=true


http://data.bioontology.org/search?q=history%20physical&ontology=NCIT&subtree_id=http%3A%2F%2Fncicb.nci.nih.gov%2Fxml%2Fowl%2FEVS%2FThesaurus.owl%23C43431
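
For anyone rebuilding these two test URLs in code, here is a minimal Python
sketch. The endpoint and parameter names are copied from the URLs above;
the helper name build_search_url is hypothetical, and any credentials the
service may require are left out, as they are in the URLs quoted here.

```python
from urllib.parse import quote, urlencode

# Endpoint and concept prefix as they appear in the URLs quoted in this thread.
SEARCH_ENDPOINT = "http://data.bioontology.org/search"
NCIT_PREFIX = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"


def build_search_url(params):
    """Return the search URL with every parameter value percent-encoded.

    quote_via=quote encodes spaces as %20 rather than '+', and urlencode's
    default safe='' also encodes ':', '/' and '#' inside the concept IRI.
    """
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


properties_url = build_search_url({
    "q": "history physical",
    "ontologies": "NCIT",
    "include_properties": "true",
})

subtree_url = build_search_url({
    "q": "history physical",
    "ontology": "NCIT",
    "subtree_id": NCIT_PREFIX + "C43431",  # concept ID used in the subtree query above
})

print(properties_url)
print(subtree_url)
```

Encoding the whole concept IRI matters because an unescaped '#' would be
treated as a URL fragment and never reach the server.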



PS. Number of hits would have been easier for me to calculate if the
documentation page indicated that the default pagesize is 50.
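
The hit counts in the table are simply page count times page size. A small
sketch of that arithmetic, assuming the page size of 50 mentioned above
(the function name is hypothetical); the result is an upper bound, since
the last page may be only partly filled:

```python
DEFAULT_PAGE_SIZE = 50  # page size as stated in this thread


def max_hits(page_count, page_size=DEFAULT_PAGE_SIZE):
    """Upper bound on total hits for a paged result set."""
    return page_count * page_size


for pages in (58, 5, 41):
    print(pages, "pages ->", max_hits(pages), "hits at most")
```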

-Lee


From: Ray Fergerson <ray.fergerson at stanford.edu>
To: Lee M Surprenant/Raleigh/IBM at IBMUS, "support" <support at bioontology.org>
Date: 12/13/2013 08:27 PM
Subject: RE: [bioontology-support] Any suggestions for improving concept
search results from the new API?




Lee,

Sorry for the non-response on this. The search is working as designed but
I think that you bring up a fair point. We will investigate changing this
behavior.

Ray

From: bioontology-support-bounces at lists.stanford.edu [
mailto:bioontology-support-bounces at lists.stanford.edu] On Behalf Of Lee M
Surprenant
Sent: Tuesday, November 26, 2013 5:44 AM
To: support
Subject: [bioontology-support] Any suggestions for improving concept search
results from the new API?



*bump*
Even if it's just clarification on whether search is working as expected,
can someone reply to this?

-Lee
----- Forwarded by Lee M Surprenant/Raleigh/IBM on 11/26/2013 08:39 AM
-----

From: Lee M Surprenant/Raleigh/IBM at IBMUS
To: "support at bioontology.org Support" <support at bioontology.org>
Date: 11/19/2013 06:30 PM
Subject: [bioontology-support] Any suggestions for improving concept search
results from the new API?
Sent by: bioontology-support-bounces at lists.stanford.edu





In the past, I was seeing good search results (5 hits) when searching NCI
Thesaurus (including Properties) with terms like the following: "history
and physical".
With the new backend, this search now returns 0 results.

A few other searches I've tried also offer fewer matches than before.  For
example "Tretinoin Cytarabine" used to return 7 results, now returns 0.

It seems to me like, in the past, it did a simple AND. And now maybe it
only matches contiguous text (e.g., like doing a Google search with " "
around the phrase)?
Is there a new syntax to match two non-contiguous words (ignoring order)
when they appear in the same concept name/description?
Any other ideas for me?

thanks,

Lee Surprenant
IBM Emerging Technologies | jStart Team
lmsurpre at us.ibm.com | (919) 543-8919
_______________________________________________
bioontology-support mailing list
bioontology-support at lists.stanford.edu
https://mailman.stanford.edu/mailman/listinfo/bioontology-support