[open-ils-commits] [GIT] Evergreen ILS branch master updated. f956df518e07abed2f153e3c33621000401a6ac5

Evergreen Git git at git.evergreen-ils.org
Thu Jul 10 15:30:48 EDT 2014


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "Evergreen ILS".

The branch, master has been updated
       via  f956df518e07abed2f153e3c33621000401a6ac5 (commit)
       via  8af7e4263cd2e97e0c54dde61e78c473bcb8b64e (commit)
      from  fb0366d23241dca92b6346d06b09679317a2a0d7 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit f956df518e07abed2f153e3c33621000401a6ac5
Author: Dan Scott <dscott at laurentian.ca>
Date:   Mon Jul 7 09:58:01 2014 -0400

    LP# 1330784: Release notes for sitemap builder
    
    More like documentation than release notes, but more is probably better
    than less.
    
    Signed-off-by: Dan Scott <dscott at laurentian.ca>
    Signed-off-by: Ben Shum <bshum at biblio.org>

diff --git a/docs/RELEASE_NOTES_NEXT/OPAC/sitemap_builder.txt b/docs/RELEASE_NOTES_NEXT/OPAC/sitemap_builder.txt
new file mode 100644
index 0000000..527a5ed
--- /dev/null
+++ b/docs/RELEASE_NOTES_NEXT/OPAC/sitemap_builder.txt
@@ -0,0 +1,51 @@
+Sitemap generator
+^^^^^^^^^^^^^^^^^
+A http://www.sitemaps.org[sitemap] directs search engines to the pages of
+interest in a web site so that the search engines can intelligently crawl
+your site. In the case of Evergreen, the primary pages of interest are the
+bibliographic record detail pages.
+
+The sitemap generator script creates sitemaps that adhere to the
+http://sitemaps.org specification, including:
+
+* limiting the number of URLs per sitemap file to no more than 50,000 URLs;
+* providing the date that the bibliographic record was last edited, so
+  that once a search engine has crawled all of your site's record detail pages,
+  it only has to reindex those pages that are new or have changed since the last
+  crawl;
+* generating a sitemap index file that points to each of the sitemap files.
+
+Running the sitemap generator
++++++++++++++++++++++++++++++
+The `sitemap_generator` script must be invoked with the following argument:
+
+* `--lib-hostname`: specifies the base URL for the catalog (for example,
+  `--lib-hostname https://catalog.example.com`); every generated URL is
+  built by appending a path to this value
+
+The following optional arguments are useful for generating multiple sitemaps
+per Evergreen instance:
+
+* `--lib-shortname`: limit the list of record URLs to those which have copies
+  owned by the designated library or any of its children;
+* `--prefix`: provides a prefix for the sitemap and sitemap index file names
+
+Other options enable you to override the OpenSRF configuration file and the
+database connection credentials, but the default settings are generally fine.
+
+Note that on very large Evergreen instances, sitemaps can consume hundreds of
+megabytes of disk space, so ensure that your Evergreen instance has enough room
+before running the script.
+
+Scheduling
+++++++++++
+To enable search engines to maintain a fresh index of your bibliographic
+records, you may want to include the script in your cron jobs on a nightly or
+weekly basis.
+
+Sitemap files are generated in the same directory from which the script is
+invoked, so a cron entry will look something like:
+
+------------------------------------------------------------------------
+12 2 * * * cd /openils/var/web && /openils/bin/sitemap_generator --lib-hostname https://catalog.example.com
+------------------------------------------------------------------------

commit 8af7e4263cd2e97e0c54dde61e78c473bcb8b64e
Author: Dan Scott <dscott at laurentian.ca>
Date:   Thu Jun 19 15:52:42 2014 -0400

    LP#1330784 Add a sitemap generator for Evergreen
    
    Following the requirements at sitemaps.org, generate a
    set of sitemaps that reflect the bib record's last edit
    date, with 50,000 records per sitemap file.
    
    Users can run this script targeting different libraries
    and generating different output filenames using the
    documented options in the script.
    
    Signed-off-by: Dan Scott <dscott at laurentian.ca>
    Signed-off-by: Ben Shum <bshum at biblio.org>

diff --git a/Open-ILS/src/Makefile.am b/Open-ILS/src/Makefile.am
index 5d159de..330f5ab 100644
--- a/Open-ILS/src/Makefile.am
+++ b/Open-ILS/src/Makefile.am
@@ -72,6 +72,7 @@ core_scripts =   $(examples)/oils_ctl.sh \
 		 $(supportscr)/long-overdue-status-update.pl \
 		 $(supportscr)/purge_holds.srfsh \
 		 $(supportscr)/purge_circulations.srfsh \
+		 $(supportscr)/sitemap_generator \
 		 $(srcdir)/extras/eg_config \
 		 $(srcdir)/extras/openurl_map.pl \
 		 $(srcdir)/extras/import/marc_add_ids
diff --git a/Open-ILS/src/support-scripts/sitemap_generator b/Open-ILS/src/support-scripts/sitemap_generator
new file mode 100755
index 0000000..2758971
--- /dev/null
+++ b/Open-ILS/src/support-scripts/sitemap_generator
@@ -0,0 +1,232 @@
+#!/usr/bin/perl
+# Copyright (C) 2014 Laurentian University
+# Author: Dan Scott <dscott at laurentian.ca>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+use strict; use warnings;
+use XML::LibXML;
+use File::Copy;
+use Getopt::Long;
+use File::Spec;
+use File::Basename;
+use DBI qw(:sql_types);
+
+my ($dbhost, $dbport, $dbname, $dbuser, $dbpw, $help);
+my $config_file = '';
+my $sysconfdir = '';
+
+=item create_sitemaps() - Write the sitemap files
+
+With a maximum of 50,000 URLs per sitemap, this method
+automatically increments the sitemap file numbers and
+generates a corresponding sitemap index that lists all
+of the individual sitemap files.
+
+See http://www.sitemaps.org/ for the specification
+
+=cut
+sub create_sitemaps {
+    my ($settings, $bibs, $aou_id) = @_;
+
+    my $f_cnt = 1;
+    my $r_cnt = 0;
+    my @sitemaps;
+    my $fn = $settings->{'prefix'} . "sitemap$f_cnt.xml";
+    push(@sitemaps, $fn);
+    open(FH, '>', $fn) or die "Could not write sitemap $f_cnt\n";
+    print FH '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
+    print FH '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
+
+    foreach my $bib (@$bibs) {
+        print FH "<url><loc>" . $settings->{'lib-hostname'} . "/eg/opac/record/" . $bib->[0];
+        if ($aou_id) {
+            print FH "?locg=$aou_id";
+        }
+        print FH "</loc><lastmod>" . $bib->[1] . "</lastmod></url>\n";
+        $r_cnt++;
+        if ($r_cnt % 50000 == 0) {
+            $f_cnt++;
+            print FH "</urlset>\n";
+            close(FH);
+            my $fn = $settings->{'prefix'} . "sitemap$f_cnt.xml";
+            push(@sitemaps, $fn);
+            open(FH, '>', $fn) or die "Could not write bibs\n";
+            print FH '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
+            print FH '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
+        }
+    }
+    print FH "</urlset>\n";
+    close(FH);
+
+    open(INDEXFH, '>', $settings->{'prefix'} . "sitemapindex.xml") or die "Could not write sitemap index\n";
+    print INDEXFH '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
+    print INDEXFH '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
+    foreach my $fn (@sitemaps) {
+        print INDEXFH "<sitemap><loc>" . $settings->{'lib-hostname'} . "/$fn</loc></sitemap>\n";
+    }
+    print INDEXFH "</sitemapindex>\n";
+    close(INDEXFH);
+    
+
+}
+
+=item get_settings() - Extracts database settings from opensrf.xml
+=cut
+sub get_settings {
+    my $settings = shift;
+
+    my $host = "/opensrf/default/apps/open-ils.reporter-store/app_settings/database/host/text()";
+    my $port = "/opensrf/default/apps/open-ils.reporter-store/app_settings/database/port/text()";
+    my $dbname = "/opensrf/default/apps/open-ils.reporter-store/app_settings/database/db/text()";
+    my $user = "/opensrf/default/apps/open-ils.reporter-store/app_settings/database/user/text()";
+    my $pw = "/opensrf/default/apps/open-ils.reporter-store/app_settings/database/pw/text()";
+
+    my $parser = XML::LibXML->new();
+    my $opensrf_config = $parser->parse_file($config_file);
+
+    # If the user passed in settings at the command line,
+    # we don't want to override them
+    $settings->{host} = $settings->{host} || $opensrf_config->findnodes($host);
+    $settings->{port} = $settings->{port} || $opensrf_config->findnodes($port);
+    $settings->{db} = $settings->{db} || $opensrf_config->findnodes($dbname);
+    $settings->{user} = $settings->{user} || $opensrf_config->findnodes($user);
+    $settings->{pw} = $settings->{pw} || $opensrf_config->findnodes($pw);
+}
+
+=item get_record_ids() - Gets a list of record IDs
+=cut
+sub get_record_ids {
+    my $settings = shift;
+    my $aou_id;
+
+    my $dbh = DBI->connect('dbi:Pg:dbname=' . $settings->{db} . 
+        ';host=' . $settings->{host} . ';port=' . $settings->{port} . ';',
+         $settings->{user} . "", $settings->{pw} . "", {AutoCommit => 1}
+    );
+    if (!$dbh) {
+        print STDERR "Could not connect to database. ";
+        print STDERR "Error was " . $DBI::errstr . "\n";
+        return;
+    }
+
+    if ($settings->{'lib-shortname'}) {
+        my $stmt = $dbh->prepare("SELECT id FROM actor.org_unit WHERE shortname = ?");
+        $stmt->execute(($settings->{'lib-shortname'}));
+        my $rv = $stmt->bind_columns(\$aou_id);
+        $stmt->fetch();
+    }
+
+    my $q = "
+        SELECT DISTINCT bre.id, edit_date::date AS edit_date
+        FROM biblio.record_entry bre
+            INNER JOIN asset.opac_visible_copies aovc ON bre.id = aovc.record
+    ";
+    if ($aou_id) {
+        $q .= " WHERE circ_lib IN (SELECT id FROM actor.org_unit WHERE id = ? OR parent_ou = ?)";
+    }
+    $q .= " ORDER BY edit_date DESC";
+    my $stmt = $dbh->prepare($q);
+    if ($aou_id) {
+        $stmt->bind_param(1, $aou_id, { TYPE => SQL_INTEGER });
+        $stmt->bind_param(2, $aou_id, { TYPE => SQL_INTEGER });
+        $stmt->execute();
+    } else {
+        $stmt->execute();
+    }
+
+    my $bibs = $stmt->fetchall_arrayref([0, 1]);
+
+    if ($dbh->err) {
+        print STDERR "Error was " . $dbh->errstr . "\n";
+        return;
+    }
+    return ($bibs, $aou_id);
+}
+
+my $hostname;
+my $aou_shortname;
+my %settings = (
+    prefix => ''
+);
+
+GetOptions(
+        "lib-hostname=s" => \$settings{'lib-hostname'},
+        "lib-shortname=s" => \$settings{'lib-shortname'},
+        "prefix=s" => \$settings{'prefix'},
+        "config-file=s" => \$config_file,
+        "user=s" => \$settings{'user'},
+        "password=s" => \$settings{'pw'},
+        "database=s" => \$settings{'db'},
+        "hostname=s" => \$settings{'host'},
+        "port=i" => \$settings{'port'}, 
+        "help" => \$help
+);
+
+if (!$config_file) { 
+    my @temp = `eg_config --sysconfdir`;
+    chomp $temp[0];
+    $sysconfdir = $temp[0];
+    $config_file = File::Spec->catfile($sysconfdir, "opensrf.xml");
+}
+
+unless (-e $config_file) { die "Error: $config_file does not exist. \n"; }
+
+if ($settings{'lib-hostname'}) {
+    # Get additional settings from the config file
+    get_settings(\%settings);
+
+    my ($bibs, $aou_id) = get_record_ids(\%settings);
+    create_sitemaps(\%settings, $bibs, $aou_id);
+} else {
+    $help = 1;
+}
+
+if ($help) {
+    print <<HERE;
+
+SYNOPSIS
+    sitemap_generator [OPTION] ... [COMMAND] ... [CONFIG OPTIONS]
+
+DESCRIPTION
+    Creates a set of sitemaps for enabling web crawlers to crawl
+    freshly changed bibliographic records.
+
+OPTIONS
+    --config-file
+        specifies the opensrf.xml file
+
+    --lib-hostname
+        REQUIRED: hostname for the catalog (e.g., "https://example.com")
+
+    --prefix
+        filename to add as a prefix to the generated set of sitemap files
+
+    --lib-shortname
+        include all records for the specified library and its children;
+        defaults to all records
+
+EXAMPLES
+   This script will normally be run as a cron job by the opensrf user from
+   the web root directory.
+
+   sitemap_generator --lib-hostname https://example.com --lib-shortname BR1 \
+      --prefix example_
+
+   This generates a set of sitemap files like so:
+     * example_sitemapindex.xml
+     * example_sitemap1.xml
+     * example_sitemap2.xml
+     * ...
+
+HERE
+}
+

-----------------------------------------------------------------------
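
Note on the 50,000-URL split in create_sitemaps(): the loop above opens the
next sitemap file as soon as the record count reaches a multiple of 50,000,
so when the total is an exact multiple the last file appears to be written
with no <url> entries. The following is a minimal sketch of one alternative,
written in Python rather than the script's Perl purely for illustration; the
names chunk_records and render_sitemap are hypothetical and not part of the
commit. It batches the records first, so an empty trailing file is never
produced.

```python
from itertools import islice
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file limit from the sitemaps.org protocol


def chunk_records(records, size=MAX_URLS):
    """Yield lists of at most `size` records; never yields an empty list."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def render_sitemap(base_url, batch, locg=None):
    """Render one sitemap file for (record id, edit date) pairs.

    The URL layout mirrors the one used by the committed script:
    <base_url>/eg/opac/record/<id>[?locg=<org unit id>]
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for bib_id, edit_date in batch:
        loc = f"{base_url}/eg/opac/record/{bib_id}"
        if locg:
            loc += f"?locg={locg}"
        # escape() guards against '&', '<' and '>' in the URL
        lines.append(
            f"<url><loc>{escape(loc)}</loc><lastmod>{edit_date}</lastmod></url>"
        )
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```

Each batch from chunk_records() would correspond to one numbered sitemap
file, and the index would list only the files that were actually written.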

Summary of changes:
 Open-ILS/src/Makefile.am                         |    1 +
 Open-ILS/src/support-scripts/sitemap_generator   |  232 ++++++++++++++++++++++
 docs/RELEASE_NOTES_NEXT/OPAC/sitemap_builder.txt |   51 +++++
 3 files changed, 284 insertions(+), 0 deletions(-)
 create mode 100755 Open-ILS/src/support-scripts/sitemap_generator
 create mode 100644 docs/RELEASE_NOTES_NEXT/OPAC/sitemap_builder.txt
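
For anyone trying the new script, a quick way to sanity-check its output is
to count the entries in a generated file. This is a rough sketch only, in
Python and using just the standard library; summarize_sitemap is a
hypothetical helper, not something shipped with Evergreen. It assumes the
file follows the sitemaps.org 0.9 namespace that the script writes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000  # per-file limit from the sitemaps.org protocol


def summarize_sitemap(xml_bytes: bytes) -> dict:
    """Count <url> entries and report the most recent <lastmod> date."""
    root = ET.fromstring(xml_bytes)
    urls = root.findall("sm:url", SITEMAP_NS)
    lastmods = [
        u.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        for u in urls
    ]
    return {
        "url_count": len(urls),
        "within_limit": len(urls) <= MAX_URLS,
        # YYYY-MM-DD strings sort chronologically, so max() finds the newest
        "newest_lastmod": max(lastmods, default=""),
    }
```

To check a file on disk, pass its contents as bytes, for example the bytes
read from example_sitemap1.xml in the directory where the script was run.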


hooks/post-receive
-- 
Evergreen ILS

