[OPEN-ILS-GENERAL] Activity metric for relevance

Mike Rylander mrylander at gmail.com
Fri Jun 14 17:09:47 EDT 2013


All,

I don't have much time to go into detail right now (meetings solid
this afternoon), but in order to get the basics out there so Thomas
and others can start looking at it I'm going paste some SQL below, and
then give some basic narrative on how it could be generalized for use
in a precalculated age scaling modification to various possible
"popularity" metrics.

The original is based on a direct "rating" table, where things to be
rated are rated by users with values between, say, 1 and 5, and the
time of each rating is recorded.  More recent ratings are considered
more important, and the age scaling horizon defines a linear decay of
importance such that, given an age scaling horizon of 30 days, a
rating provided today is 30 times more important than a rating
provided 30 days ago (or any older than that).  In this scheme, all
ratings are counted for all time, but cutting them off at some
secondary, older age would be trivial.
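
The weighting math can be sketched in a few lines of Python (my own
illustration of the scheme, not project code); the SQL below gets the same
effect by duplicating each rating row `weight` times before averaging:

```python
# Linear age-scaling weight as described above (illustrative sketch only).
# With a 30-day horizon, a rating made today counts 30 times; a rating
# made 30 or more days ago counts once.

def age_weight(days_old: int, horizon: int = 30) -> int:
    """Linear decay: horizon - age for recent ratings, with a floor of 1."""
    return max(horizon - days_old, 1)

def weighted_average(ratings):
    """ratings: iterable of (rating, days_old) pairs."""
    pairs = [(r, age_weight(d)) for r, d in ratings]
    return sum(r * w for r, w in pairs) / sum(w for _, w in pairs)

# A fresh 5-star rating dominates a 60-day-old 1-star one:
print(weighted_average([(5, 0), (1, 60)]))
```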

-----------------------------------------------------------------

CREATE TABLE rating (
    id          SERIAL  PRIMARY KEY,
    usr         INT     NOT NULL REFERENCES actor.usr (id),
    thing       INT     NOT NULL REFERENCES thing (id), -- perhaps biblio.record_entry ...
    rating      INT     NOT NULL,
    created     TIMESTAMP   NOT NULL DEFAULT NOW(),
    comments    TEXT
);

CREATE VIEW derived_rating AS
    WITH setting(max_age) AS (
        SELECT  COALESCE(NULLIF(value,'0'),'1')::INT AS max_age
          FROM  config_flag
          WHERE name = 'rating.new_rating_bump.days'
                AND enabled
          LIMIT 1
    ), duplicate_list(id,dup_count) AS (
        SELECT  id,
                setting.max_age - DATE_PART('day', NOW() - created)::INT AS dup_count
          FROM  rating, setting
          WHERE created > NOW() - (setting.max_age || ' days')::INTERVAL
            UNION ALL
        SELECT  id,
                1 AS dup_count
          FROM  rating, setting
          WHERE created <= NOW() - (setting.max_age || ' days')::INTERVAL
    ), sized_arrays(thing,rating_array) AS (
        SELECT  thing,
                ARRAY_FILL( rating, ARRAY_APPEND(NULL::INT[], duplicate_list.dup_count) ) AS rating_array
          FROM  rating
                JOIN duplicate_list USING (id)
    ), flattened_duplicated_ratings(thing,rating) AS (
        SELECT  thing, UNNEST( rating_array ) AS rating
          FROM  sized_arrays
    )
    SELECT  thing.id AS thing, AVG( r.rating ) AS rating
      FROM  thing
            LEFT JOIN flattened_duplicated_ratings AS r ON thing.id = r.thing
      GROUP BY 1;
------------------------------------------------------

This obviously does not work directly for things like holds or circs,
where there is no inherent "quality" but simply existence.  However,
if we decided on a granularity -- days or weeks for holds, and weeks
or months for circs, perhaps -- we can transform existence into
quality.  Consider hold count as a percentage of total holds per
granularity interval as a normalizing factor:

---------------------------------------------

CREATE VIEW daily_hold_popularity AS
  WITH
    bib_count_by_date(thing, count, created) AS (
      SELECT rhrr.bib_record AS thing,
             COUNT(ahr.id) AS count,
             DATE(ahr.request_time) AS created
        FROM reporter.hold_request_record AS rhrr
             JOIN action.hold_request AS ahr ON (ahr.id = rhrr.id)
        GROUP BY 1, 3
    ),
    total_by_date(count, created) AS (
      SELECT COUNT(ahr.id) AS count, DATE(ahr.request_time) AS created
        FROM action.hold_request AS ahr
        GROUP BY 2
    ),
    scaling_factor(value) AS ( SELECT 10000 AS value )
  SELECT bib_count_by_date.thing AS id,
         (( bib_count_by_date.count::DECIMAL / total_by_date.count ) * scaling_factor.value) AS rating,
         bib_count_by_date.created
    FROM bib_count_by_date
         JOIN total_by_date USING (created),
         scaling_factor;

---------------------------------------------

The scaling factor would allow one to change the range of possible
values that holds contribute (the above would be 0-10,000), and it
could be pulled from a global flag.  Use this view in the query above
instead of the "rating" table, and hold percentage by bib id would now
be your "rating value".
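
The per-(bib, day) arithmetic the view performs is just a scaled share of
that day's holds; a quick sketch with hypothetical numbers (my own
illustration, not project code):

```python
# Normalized hold "rating" as computed by the view above (illustrative
# sketch): a bib's fraction of a day's total holds, times a configurable
# scaling factor.

def hold_rating(bib_holds: int, total_holds: int, scale: int = 10000) -> float:
    """Fraction of the day's holds this bib received, scaled into 0..scale."""
    return (bib_holds / total_holds) * scale

# A bib with 5 of a day's 200 holds scores 250 on the 0-10,000 scale.
print(hold_rating(5, 200))
```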

Circs could be used in a similar fashion, as could acquisition date of
copies (perhaps normalized to week-of-year, or month).

This is, again, a pretty simple, linear age scaling algorithm, but
exponential or even quadratic functions are possible, and possibly
useful.
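
For instance, an exponential variant might halve an event's weight every N
days rather than decaying linearly (a sketch of my own; the half-life
parameter is an assumption, not something from the proposal):

```python
# Hypothetical exponential age scaling (not part of the proposal above):
# the weight halves every `half_life` days, so old events never quite
# reach zero but fade much faster than under the linear scheme.

def exp_weight(days_old: float, half_life: float = 30.0) -> float:
    return 0.5 ** (days_old / half_life)

print(exp_weight(0.0))    # full weight
print(exp_weight(30.0))   # half weight
```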

Comments? Thoughts?

--miker


On Fri, Mar 15, 2013 at 1:41 PM, Mike Rylander <mrylander at gmail.com> wrote:
>
> On Fri, Mar 15, 2013 at 9:01 AM, Thomas Berezansky <tsbere at mvlc.org> wrote:
>>
>> The current plan would not take into account how recent the circs (or
>> holds) were, just that they were within a configurable time period of the
>> time the cronjob that counts them last ran (default will likely be to
>> include those from within the last 6 to 12 months). If you have an algorithm
>> you think would work well and are willing to share we would gladly include
>> that as an option when doing the work, though.
>>
>
> I do, and I am.  As time permits over the next few weeks I'll get back to
> this thread.
>
>>
>> We would not, however, be able to make it a per-bump option with the way
>> we currently plan on storing the circ and hold counts, so instead it would
>> function as an overall modifier to the circ/hold count numbers. Though even
>> as I type this email I have thoughts on how we could change that if the
>> feeling is that it should be at least partially bump-to-bump configurable.
>>
>
> I think it's really only useful for some bump types in any case.  The ratio
> bumps are really point-in-time values -- they represent "right this very
> moment" (or "late last night," I guess). Threshold bumps don't attempt to
> take scale into account, just that some line was crossed.  For circs this
> year or holds this month, or similar, age scaling (probably a better term
> than just "aging") of each event's relevance should be useful.
>
> --miker
>
>>
>> Thomas Berezansky
>> Merrimack Valley Library Consortium
>>
>>
>> Quoting Mike Rylander <mrylander at gmail.com>:
>>
>>> Kathy,
>>>
>>> Have you considered allowing an aging parameter for some bumps, so that
>>> newer data toward the near end of the horizon is considered more
>>> important?
>>> For instance, spikes in circulation might have a larger short term effect
>>> on relevance, but over time, while still being factored into relevance,
>>> would be less important though still considered in the bump logic.  I ask
>>> because I have a simple algorithm I'm using in another project, to be
>>> debuted at the conference, that may be portable to this work.
>>>
>>> --miker
>>>
>>>
>>>
>>> On Thu, Mar 14, 2013 at 3:53 PM, Kathy Lussier <klussier at masslnc.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> MassLNC is working with our partners at MVLC to develop an activity
>>>> metric
>>>> (aka popularity metric) that will allow sites to rank more popular items
>>>> a
>>>> little higher in search results than items that don't see as much
>>>> activity.
>>>> I've raised this idea on the list before. Although Evergreen allows
>>>> sites
>>>> to adjust relevancy based on the appearance of keywords in certain
>>>> fields,
>>>> which is highly useful, our hope is that this additional functionality
>>>> will
>>>> lead to further improvement when ranking results by relevance.
>>>>
>>>> As an example, if a user were conducting a keyword search on "abraham
>>>> lincoln,"  there are many titles in most US libraries where the words
>>>> "abraham lincoln" show up in the title. There would be no way to tease
>>>> out
>>>> the titles that are getting the most attention by readers. In fact, a
>>>> title
>>>> like "Team of Rivals" ranks very low in our search results even though
>>>> there is a high likelihood it is the title the patron is seeking.  By
>>>> applying a metric based on activity, we might be able to see those
>>>> more-recently popular titles floating higher in the search results list.
>>>>
>>>> I would like to share MVLC's proposal outlining the details for
>>>> implementing this project. The proposal is available at
>>>>
http://masslnc.cwmars.org/node/2757.
>>>>
>>>> It provides a lot of flexibility in allowing sites to define what "high
>>>> activity" means to them. Circulation activity, holds activity, total
>>>> copies, and publication age/bib record age can all be used as an
>>>> activity
>>>> metric.
>>>>
>>>> If you have any feedback or questions, feel free to let us know.
>>>>
>>>> Kathy
>>>>
>>>> --
>>>> Kathy Lussier
>>>> Project Coordinator
>>>> Massachusetts Library Network Cooperative
>>>> (508) 343-0128
>>>> klussier at masslnc.org
>>>> Twitter:
http://www.twitter.com/kmlussier
>>>>
>>>>
>>>
>>>
>>> --
>>> Mike Rylander
>>>  | Director of Research and Development
>>>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>>>  | phone:  1-877-OPEN-ILS (673-6457)
>>>  | email:  miker at esilibrary.com
>>>  | web:  http://www.esilibrary.com
>>>
>>
>>
>
>
>
> --
> Mike Rylander
>  | Director of Research and Development
>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com



-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

