relevancy "buckets" and secondary searching

2007-02-05 - By Erick Erickson

Am I missing anything obvious here and/or what would folks suggest...

Conceptually, I want to normalize the scores of my documents into 5 discrete
values (say 0.1, 0.3, 0.5, 0.7, 0.9) during a search BUT BEFORE SORTING, and
then apply a secondary sort when two documents have the same score. Applying
the secondary sort is easy; it's massaging the scores that has me stumped.

We have a bunch of documents (30K). Books actually. We only display to the
user 5 different "relevance" scores, with 5 being the most relevant. So far,
so good.

Within each quintile, we want to sort by title. So, suppose the following
three books score a hit:

relevance      title
0.98              zzzzz
0.94              ccccc
0.79              aaaaa

The proper display would be

5           ccccc
5           zzzzz
4           aaaaa


It's easy enough to do a secondary sort on top of the raw scores, but that
would not give me what I want. In this case, I'd get...

5       zzzzz
5       ccccc
4       aaaaa

because the secondary sort only matters when the primary sort values are
equal, and the raw scores 0.98 and 0.94 are not. The user is left scratching
her head asking "why did two books with the same relevancy have the titles
out of order?".
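
To make the desired ordering concrete, here is a minimal sketch in plain Java
of bucketing first and sorting second. It assumes the scores are already
normalized to the range 0 to 1. BucketSortDemo, Book, bucket and
sortByBucketThenTitle are hypothetical names for illustration only, not
Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from Lucene.
final class BucketSortDemo {

    // A hit reduced to the two fields that matter here.
    static final class Book {
        final double score;   // assumed already normalized to [0, 1]
        final String title;

        Book(double score, String title) {
            this.score = score;
            this.title = title;
        }
    }

    // Map a normalized score to a quintile 1..5 (5 = most relevant).
    static int bucket(double score) {
        int b = (int) (score * 5) + 1;      // [0.8, 1.0) -> 5, [0.6, 0.8) -> 4, ...
        return Math.max(1, Math.min(5, b)); // clamp so that 1.0 also lands in 5
    }

    // Order by bucket (highest first), then by title within a bucket.
    static List<Book> sortByBucketThenTitle(List<Book> hits) {
        List<Book> sorted = new ArrayList<>(hits);   // leave the input untouched
        sorted.sort((a, b) -> {
            int byBucket = Integer.compare(bucket(b.score), bucket(a.score)); // descending
            return byBucket != 0 ? byBucket : a.title.compareTo(b.title);
        });
        return sorted;
    }
}
```

Fed the three books above, this puts ccccc before zzzzz (both in bucket 5)
and aaaaa last (bucket 4), because the comparison sees only the bucket and
never the raw score.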

If I could massage my scores *before* sorts are done, things would be
hunky-dory, but I'm not seeing how to do that. One problem is that until the
top N documents have been collected, I don't know what the maximum relevance
is, therefore I don't know how to normalize raw scores. I followed Hoss's
thread where he talks about FakeNorms, but don't see how that applies to my
problem.
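
The dependency on the maximum can be sketched as follows. This is a rough
illustration in plain Java, assuming the raw scores of the collected top N
are available as a float array; ScoreNormalizer and toBuckets are
hypothetical names, not Lucene API. Dividing by the maximum is one possible
normalization, and it means the top hit always lands in bucket 5.

```java
// Hypothetical helper; the names are illustrative, not Lucene API.
final class ScoreNormalizer {

    // Scale raw scores by the largest one, then map each to a quintile 1..5.
    // The maximum is only known once every collected score has been seen,
    // which is why this runs after collection rather than during it.
    static int[] toBuckets(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        int[] buckets = new int[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            // Guard against an all-zero result set to avoid dividing by zero.
            double normalized = max > 0f ? rawScores[i] / (double) max : 0.0;
            int b = (int) (normalized * 5) + 1;
            buckets[i] = Math.max(1, Math.min(5, b)); // the top hit normalizes to 1.0 -> 5
        }
        return buckets;
    }
}
```

For example, raw scores 2.0, 1.5 and 0.5 normalize to 1.0, 0.75 and 0.25,
which fall into buckets 5, 4 and 2.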

My result sets are strictly limited to < 500, so it's not unreasonable to
just get the TopDocs back and aggregate my buckets at that point and sort
them. But of course I only care about this when I am using relevancy as my
primary sort. For sorting on any other fields, I would just let Lucene take
care of it all. So post-sorting myself leads to really ugly stuff like

if (it's my special relevancy sort) do one thing
else don't do that thing.

repeated wherever I have to sort. Yuck.....
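
One way to keep that branch from spreading is to put it behind a single entry
point. The following is only a sketch in plain Java under assumptions:
ResultOrdering, Hit, order and the field name "relevance" are all
hypothetical, and the bucket is assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper; Hit and the field name "relevance" are illustrative.
final class ResultOrdering {

    static final class Hit {
        final int bucket;     // relevance bucket 1..5, assumed computed elsewhere
        final String title;

        Hit(int bucket, String title) {
            this.bucket = bucket;
            this.title = title;
        }
    }

    // Single entry point: callers pass whatever the search returned plus the
    // primary sort field, and never test for the special case themselves.
    static List<Hit> order(List<Hit> hits, String primarySortField) {
        if (!"relevance".equals(primarySortField)) {
            return hits;   // any other sort: keep the order the search produced
        }
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byBucket = Integer.compare(b.bucket, a.bucket);   // highest bucket first
            return byBucket != 0 ? byBucket : a.title.compareTo(b.title);
        });
        return sorted;
    }
}
```

The if/else still exists, but only in one method; every call site just calls
order(hits, field) regardless of which sort is in effect.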


And since I'm talking about 500 docs, I don't want to wait until after I
have a Hits object, because I'll have to re-query several times, and that is
on an 8G index (and growing).


This almost looks like a HitCollector, but not quite.
This almost looks like a custom Similarity, but not quite since I want to
just let Lucene compute relevance and put that into a bucket.
This almost looks like FakeNorms, but not quite.
This almost looks like about 8 things I tried to make work, but not quite
<G>....

So, somebody out there needs to tell me what part of the manual I overlooked
<G>...

Thanks
Erick