<div dir="ltr"><div dir="ltr">On Mon, Feb 19, 2024 at 12:18 PM Blake Graham-Henderson <<a href="mailto:blake@mobiusconsortium.org">blake@mobiusconsortium.org</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>

  <div>

    Mike,<br>

    <br>

    A new column is what I was thinking. I figured that I'd break

    something by converting the string column into an integer. Though, I

    didn't think it would have necessitated a new metabib field

    definition, because the results of the existing definition could be

    converted to numeric closer to the "Simple Record Extracts" end of

    the chain? Perhaps introduce the new column in one of the views up

    the DB view chain somewhere? Still referencing the original

    extraction def?<br>

    <br></div></blockquote><div><br></div><div>+1 to the new column, we're definitely on the same page there.</div><div><br></div><div>As for whether we should reuse a metabib field or create a new one: wherever we can, we should use the right tool for the job, and the tool designed for this is the record attribute infrastructure, not the metabib field infrastructure.  Whether we should change an existing record attribute definition (date1 or pubdate), or create a new one that uses the MODS 3.3 transform to pull the mods:originInfo/mods:dateIssued value, is a decision that we can make.  The record attr pubdate is really close, and we could use it if we are OK with changing the sort data so that decade-granular date1 fields (like 201u becoming 2010 when it's some time in the 2010s but the cataloger wasn't sure) become sortable rather than showing up as NULL (and therefore sorting to the end of the list).  I think that would be a net improvement, but I don't want to assume.</div><div><br></div><div>I don't think we should be creating new special-purpose /logic/ code when the existing functionality can be combined and applied to get where we want to go.  That's why the record attribute infrastructure was built in a generic and reusable way.  Put another way, Evergreen already knows how to extract a single value for use by other business processes in the form of non-multi record attribute definitions (the "other business process" being reporting in this case) and to normalize arbitrary values in just the way we want using the index normalizer map (even though we're not "indexing" this, per se), so we should use that.</div><div><br></div><div>Any reporting-specific "view" (materialized or otherwise) is fair game for growing new columns, so we can definitely mess with things close to the end of the chain.  
We should make sure that that step is just about pulling in an expected value and /not/ about normalizing or otherwise cleaning up the data, which other code can already do for us.</div><div> </div><div>Hopefully that all makes sense...</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>

    New column name: "Publication Year (numeric normalized)"<br>

    pubyear_int<br>

    <br></div></blockquote><div><br></div><div>+1, IMO that's a good capsule definition for the goal.<br></div><div><br></div><div>Thanks,<br></div><div>--Mike</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>

    <pre cols="72">-Blake-

Conducting Magic

Will consume any data format

MOBIUS

</pre>

    <div>On 2/19/2024 11:06 AM, Mike Rylander

      wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">Hrm... I traced back the date1 record attribute

        definition, actually, rather than the pubdate metabib field. 

        It's important to note that record attributes and metabib fields

        have /very/ different use cases, ingest performance profiles,

        and configuration shapes.  What's most important here is that

        metabib fields are primarily meant to support search, and record

        attributes are primarily meant to support discrete value display

        and sorting.  We should try to use a single value (multi=false

        in the config table) record attribute here, rather than a

        metabib field.

        <div><br>

        </div>

        <div>The drawback with Date1 (as in, the data coming from the

          008) is that if you have really thin records the 008 may not

          exist. However I don't think the risk is really high there --

          the record attribute version of pubdate comes from the 008 as

          well, and that is what we use as the data for the publication

          date sort axis.  Oh! And, looking closer, the pubdate

          attribute uses the "Number or NULL Normalize" index normalizer

          (id=18), which is the second half of what I described before

          -- I'd just forgotten it existed.  Adding index normalizer 19

          in a position before number-or-null, and then setting up the

          view stack to use that record attribute, could be all that's

          needed.</div>
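        <div>A minimal sketch of that ordering, written in Python purely for illustration. The function names are hypothetical stand-ins for the two normalizers, not Evergreen's actual database functions, and the sketch assumes that index normalizer 19 is the step that turns a value like 201u into 2010 (unknown digits marked u become 0) before number-or-null runs:</div>
<pre>
```python
import re
from typing import Optional


def approximate_low_date(value: str) -> str:
    """Hypothetical stand-in: replace 'unknown digit' markers (u) with 0."""
    return re.sub(r"[uU]", "0", value)


def number_or_null(value: str) -> Optional[int]:
    """Hypothetical stand-in: an int if the value is all digits, else None (NULL)."""
    cleaned = value.strip()
    return int(cleaned) if re.fullmatch(r"[0-9]+", cleaned) else None


def normalize_pubyear(raw: str) -> Optional[int]:
    # Order matters: fill in unknown digits first, then apply number-or-null.
    return number_or_null(approximate_low_date(raw))
```
</pre>
        <div>With this ordering, 201u normalizes to 2010 and 19uu to 1900, while a blank or otherwise non-numeric value still comes out as None, the stand-in for NULL. Run on its own, number-or-null would turn 201u into None.</div>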

        <div><br>

        </div>

        <div>So, I think the record attribute version of pubdate is

          actually the best data source.</div>

        <div><br>

        </div>

        <div>One thing to consider is existing uses of whatever extant

          field we end up wanting to make use of.  So, the Real Plan,

          IMO should do all that -^ as a /new/ record attribute rather

          than hijacking an existing one, nor should it use an existing

          metabib field (recall, those are about searching rather than

          exposing data for other things to use), and have it land in a

          completely new column on the Simple Record Extracts

          materialized view.  Then there's no chance of breaking

          existing reports with a column datatype change.</div>

        <div><br>

        </div>

        <div>Thoughts on that?</div>

        <div><br clear="all">

          <div>

            <div dir="ltr" class="gmail_signature">

              <div dir="ltr">--<br>

                Mike Rylander<br>

                Research and Development Manager<br>

                Equinox Open Library Initiative<br>

                1-877-OPEN-ILS (673-6457)<br>

                <a href="mailto:miker@equinoxOLI.org" target="_blank">miker@equinoxOLI.org</a><br>

                <a href="https://equinoxOLI.org" target="_blank">https://equinoxOLI.org</a><br>

              </div>

            </div>

          </div>

          <br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Fri, Feb 16, 2024 at

          2:23 PM Blake Graham-Henderson <<a href="mailto:blake@mobiusconsortium.org" target="_blank">blake@mobiusconsortium.org</a>>

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">All,<br>

          <br>

          Thanks for your considerate responses. What Mike said is the

          conclusion <br>

          I had come to, and I was wondering if anyone else needs the

          publication <br>

          year to be an actual number so that the reporter can do things

          like <br>

          average, min, max, etc. From the sounds of it, no one is

          currently using <br>

          the Evergreen reporter to produce such a thing (I don't see

          how you <br>

          could). I suppose no one is using an external program to make

          it happen <br>

          (to meet collection reporting needs from the higher-ups)?<br>

          <br>

          I agree with Mike, in that the best place to get the

          publication year <br>

          (right now) is the Simple Record Extracts, because it hunts it

          down from <br>

          several places in the bib record. Walking it backwards:<br>

          <br>

          reporter.materialized_simple_record ->

          reporter.old_super_simple_record <br>

          -> metabib.wide_display_entry ->

          metabib.compressed_display_entry -> <br>

          metabib.flat_display_entry -> metabib.display_entry<br>

          <br>

          metabib.display_entry is a trigger-created table based upon the index

          definition found <br>

          in config.metabib_field.<br>

          <br>

          One of those views is hardcoded to expect "pubdate" to exist

          in the <br>

          metabib_field definitions, which it does in stock Evergreen. The <br>

          stock definition is:<br>

          <br>

"//mods33:mods/mods33:originInfo//mods33:dateIssued[@encoding="marc"]|//mods33:mods/mods33:originInfo//mods33:dateIssued[1]"<br>

          <br>

          Decoding that is fun. Suffice it to say: the pubyear can come

          from <br>

          several places in the record, and I like that better than only

          looking <br>

          in one place.<br>

          <br>

          So, in conclusion, if a patch were written, I think it would

          be smart to <br>

          piggy back on this logic. It might be fairly straightforward

          to get the <br>

          first occurrence from the JSON string and cast it to an

          integer <br>

          (stripping out non-numeric characters first). That's where my

          thoughts <br>

          are right now. I don't think we're going to be writing the

          patch anytime <br>

          soon, just thinking through it with everyone.<br>
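          <div>A rough sketch of the "first occurrence, strip, cast" idea, in Python for illustration only. The function name is hypothetical, and it assumes the display value arrives as either a JSON string or a JSON array of strings; the real shape of the column should be checked before relying on this:</div>
<pre>
```python
import json
import re
from typing import Optional


def first_pubyear_int(display_json: str) -> Optional[int]:
    """Take the first entry of a JSON string or array, keep only digits, cast to int."""
    parsed = json.loads(display_json)
    # Assumption: the value is a JSON string or a JSON array of strings.
    first = parsed[0] if isinstance(parsed, list) and parsed else parsed
    if not isinstance(first, str):
        return None
    digits = re.sub(r"[^0-9]", "", first)
    return int(digits) if digits else None
```
</pre>
          <div>For example, an array whose first entry is "c2005." yields 2005, and a value with no digits yields None. Stripping alone turns 201u into 201, so partially known dates need separate handling.</div>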

          <br>

          If everyone agrees that this is something that Evergreen

          should have, <br>

          and we agree on the method, I might champion the bug and patch

          for <br>

          future meetings and releases!<br>

          <br>

          -Blake-<br>

          Conducting Magic<br>

          Will consume any data format<br>

          MOBIUS<br>

          <br>

        </blockquote>

      </div>

    </blockquote>

    <br>

  </div>

</blockquote></div></div>