/~bencollver/log/2025-04-30-z3950-library-science-for-dummies on tilde.pink

	View source

	# 2025-04-30 - Z39.50 Library Science For Dummies

	I enjoyed reading these blog posts from Wolfram Scneider titled
	Z39.50 For Dummies, originally posted in 2009 and 2010. It
	introduces Unix tools to work with z39.50 and MARC records. It
	describes how to download the Library of Congress card catalog for a
	local offline searchable database.

	# Contents

	* Z39.50 for Dummies
	* Z39.50 for Dummies - Part 1
	* Z39.50 for Dummies - Part 2
	* Z39.50 for Dummies - Part 3
	* Z39.50 for Dummies - Part 4
	* Z39.50 for Dummies - Part 5

	# Z39.50 for Dummies

	by Wolfram Schneider on 2009-08-27

	One of the things Index Data is known for is the YAZ toolkit--an open
	source programmers' toolkit supporting the development of
	Z39.50/SRW/SRU clients and servers. The first release was in 1995 and
	I've been using it for my own metasearch engine ZACK Gateway since
	1998, long before I joined Index Data.

	YAZ toolkit

	ZACK Gateway (defunct)

	Z39.50 is a client-server protocol for searching and retrieving
	information from remote computer databases. It is a mature low level
	protocol like HTTP and FTP. You don't implement Z39.50 yourself, you
	use the YAZ utilities and the libraries and frameworks for in other
	languages (C++, PHP, Perl, etc.).

	There are many people who thinks that Z39.50 is a dead standard, and
	hard to understand. That is not true. Z39.50 is still growing in use,
	stable and very fast. It is the only widely available protocol for
	metasearch.

	Using Z39.50 is not harder than using FTP. I think that the overhead
	for learning Z39.50 is less than a half day for an experienced
	programmer. Every problem which you have later is not related to the
	Z39.50 protocol itself, it is related to underlying system behind the
	Z39.50 server. Keep in mind that Z39.50 is an API to access
	(bibliographic) databases. It does not define how the data is
	structured and indexed in the database.

	# Z39.50 for Dummies Series - Part 1

	I will now start a Z39.50 for Dummies series and show some example
	how to access a remote database.

	I'm using in the following demos the zoomsh program from the
	YAZ toolkit

	zoomsh

	Let's start with a simple question: does the Library of Congress have
	the book "library mashups"? (I strongly recommend you buy this
	book--I wrote chapter 19):

	$ zoomsh "connect z3950.loc.gov:7090/voyager" \
	'search "library mashups"' quit

	z3950.loc.gov:7090/voyager: 2 hits

	That's all! Only one line on the command line. A SRU or SOAP request
	would not be shorter.

	Now, retrieve the record:

	$ zoomsh "connect z3950.loc.gov:7090/voyager" \
	'search "library mashups"' "show 0 1" "quit"

	z3950.loc.gov:7090/voyager: 2 hits
	0 database=VOYAGER syntax=USmarc schema=unknown
	02438cam 22003018a 4500
	001 15804854
	005 20090710141909.0
	008 090706s2009 nju b 001 0 eng
	906 $a 7 $b cbc $c orignew $d 1 $e ecip $f 20 $g y-gencatlg
	925 0 $a acquire $b 2 shelf copies $x policy default
	955 $b rg11 2009-07-06 $i rg11 2009-07-06 $a rg11 2009-07-08 to \
	Policy (CLED/SHED)
	$a td04 2009-07-09 to Dewey $w rd14 2009-07-10
	010 $a 2009025999
	020 $a 9781573873727
	040 $a DLC $c DLC
	050 00 $a Z674.75.W67 $b L52 2009
	082 00 $a 020.285/4678 $2 22
	245 00 $a Library mashups : $b exploring new ways to deliver library
	data / $c edited by Nicole C. Engard.
	260 $a Medford, N.J. : $b Information Today, Inc., $c c2009.
	263 $a 0908
	300 $a p. cm.
	504 $a Includes bibliographical references and index.
	505 0 $a What is a mashup? / Darlene Fichter -- Behind the scenes \
	: some technical details on mashups / Bonaria Biancu -- Making \
	your data available to be mashed up / Ross Singer -- Mashing up \
	with librarian knowledge / Thomas Brevik -- Information in \
	context / Brian Herzog -- Mashing up the library website / \
	Lichen Rancourt -- Piping out library data / Nicole C. Engard -- \
	Mashups @ Libraries interact / Corey Wallis -- Library catalog \
	mashup : using Blacklight to expose collections / Bess Sadler, \
	Joseph Gilbert, and Matt Mitchell -- Breaking into the OPAC / \
	Tim Spalding -- Mashing up open data with biblios.net Web \
	services / Joshua Ferraro -- SOPAC 2.0 : the thrashable, \
	mashable catalog / John Blyberg -- Mashups with the WorldCat \
	Affiliate Services / Karen A. Coombs -- Flickr and digital image \
	collections / Mark Dahl and Jeremy McWilliams -- Blip.tv and \
	digital video collections in the library / Jason A. Clark -- \
	Where's the nearest computer lab? : mapping up campus / Derik A. \
	Badman -- The repository mashup map / Stuart Lewis -- \
	The LibraryThing API and libraries / Robin Hastings -- ZACK \
	bookmaps / Wolfram Schneider -- Federated database search mashup \
	/ Stephen Hedges, Laura Solomon, and Karl Jendretzky -- \
	Electronic dissertation mashups using SRU / Michael C. Witt.
	650 0 $a Mashups (World Wide Web) $x Library applications.
	650 0 $a Libraries and the Internet.
	650 0 $a Library Web sites $x Design.
	650 0 $a Web site development.
	700 1 $a Engard, Nicole C., $d 1979-
	963 $a Amy Reeve; phone: 609-654-6266; email: areeve @ \
	infotoday.com; bc: nellor @ infotoday.com

	The default exchange format for bibliographic records in Z39.50 is
	MARC21. This is maybe not what you want to parse yourself.

	Ok, now let's download the record in XML format:

	$ zoomsh "connect z3950.loc.gov:7090/voyager" \
	'search "library mashups"' "show 0 1 xml" "quit"

	z3950.loc.gov:7090/voyager: 2 hits
	0 database=VOYAGER syntax=USmarc schema=unknown
	<record xmlns="http://www.loc.gov/MARC21/slim">
	<leader>02438cam a22003018a 4500</leader>
	<controlfield tag="001">15804854</controlfield>
	<controlfield tag="005">20090710141909.0</controlfield>
	<controlfield tag="008">090706s2009 nju b 001 0 eng \
	</controlfield>
	<datafield tag="906" ind1=" " ind2=" ">
	<subfield code="a">7</subfield>
	<subfield code="b">cbc</subfield>
	<subfield code="c">orignew</subfield>
	<subfield code="d">1</subfield>
	<subfield code="e">ecip</subfield>
	<subfield code="f">20</subfield>
	<subfield code="g">y-gencatlg</subfield>
	</datafield>

	[large XML output...]
	</record>

	You can parse the XML output with your favorite tools, usually an
	XSLT style sheet.

	Next time I will show you how to run a meta search in one line.

	-Wolfram

	UPDATE: The latest release of YAZ, inspired by this blog post,
	supports client-side mapping of MARC to MARCXML, so you can dump XML
	records even from targets that do not support XML.

	# Z39.50 for Dummies - Part 2

	In the last blog post Z39.50 for Dummies I gave an introduction on
	how to use the zoomsh program to access the Z39.50 Server of the
	Library of Congress.

	Today I will show you how to run a simple metasearch on the command
	line. You want to know which library has the book with the ISBN
	0-13-949876-1 (UNIX network programming / W. Richard Stevens)? You
	can run the zoomsh in a shell loop.

	Put the list of databases (zURL's) line by line in the text file
	zurl.txt:

	z3950.loc.gov:7090/voyager
	melvyl.cdlib.org:210/CDL90
	library.ox.ac.uk:210/ADVANCE
	z3950.library.wisc.edu:210/madison

	and run a little loop in a shell script:

	$ for zurl in `cat zurl.txt`
	do
	zoomsh "connect $zurl" "search @attr 1=7 0-13-949876-1" "quit"
	done

	z3950.loc.gov:7090/voyager: 0 hits
	melvyl.cdlib.org:210/CDL90: 1 hits
	library.ox.ac.uk:210/ADVANCE: 1 hits
	z3950.library.wisc.edu:210/madison: 0 hits

	Of course it takes time to run one search request after another. How
	about a parallel search? Modern xargs(1) commands on BSD based
	Operating Systems (MacOS, FreeBSD) and the GNU xargs supports to run
	several processes at a time.

	This example runs up to 2 search request at a time and is 2 times
	faster than the shell script above:

	$ xargs -n1 -P2 perl -e 'exec "zoomsh", "connect $ARGV[0]", \
	"search \@attr 1=7 0-13-949876-1", "quit"' < zurl.txt

	melvyl.cdlib.org:210/CDL90: 1 hits
	library.ox.ac.uk:210/ADVANCE: 1 hits
	z3950.loc.gov:7090/voyager: 0 hits
	z3950.library.wisc.edu:210/madison: 0 hits

	You see here that the order of responses is different, the fastest
	databases wins and displayed first.

	I think it is safe to run up to 20 searches in parallel on modern
	hardware. Note that there is a lot of process overhead here, for each
	request 2 processes will be executed. If a connection hangs you must
	wait until you hit the time out.

	This was an example how easy it is to run your own metasearch on the
	command line. If you want setup a real metasearch for your
	organization I recommend to try out our metasearch middleware
	pazpar2, featuring merging, relevance ranking, record sorting, and
	faceted results. In a nutshell, pazpar2 is a web-oriented Z39.50
	client. It will search a lot of targets in parallel and provide
	on-the-fly integration of the results. The interface is entirely
	webservice-based, and you can use it from any development
	environment.

	pazpar2 home page

	# Z39.50 for Dummies Series - Part 3

	This is part 3 of the Z39.50 series for dummies. In the first part I
	explained what Z39.50 is and how to run a simple search. In the
	second part I showed how to run a simple meta search on the command
	line.

	I searched for the book: UNIX network programming /
	W. Richard Stevens, ISBN 0-13-949876-1 in four large libraries:

	$ for zurl in `cat zurl.txt`
	do
	zoomsh "connect $zurl" "search @attr 1=7 0-13-949876-1" "quit"
	done

	z3950.loc.gov:7090/voyager: 0 hits
	melvyl.cdlib.org:210/CDL90: 1 hits
	library.ox.ac.uk:210/ADVANCE: 1 hits
	z3950.library.wisc.edu:210/madison: 0 hits

	Only 2 out of 4 libraries own this must-have book. Can this be true?
	Well, lets modify the ISBN and search without dashes ('-')

	$ for zurl in `cat zurl.txt`
	do
	zoomsh "connect $zurl" "search @attr 1=7 0139498761" "quit"
	done

	z3950.loc.gov:7090/voyager: 1 hits
	melvyl.cdlib.org:210/CDL90: 1 hits
	library.ox.ac.uk:210/ADVANCE: 1 hits
	z3950.library.wisc.edu:210/madison: 1 hits

	Bingo--every library has a copy of UNIX network programming by
	W. Richard Stevens!

	Z39.50 defines the syntax to search in a database. It does not define
	the semantic of a search, how an ISBN is structured.

	If you build a search engine on top of Z39.50 you need an additional
	layer to handle the semantic of a search for each database. (You need
	this layer too to add workaround for broken implementations.)

	In this example above we must remove the dashes in an ISBN search for
	the Library of Congress and University of Wisconsin-Madinson Libraries.

	Another thing which you must be aware: libraries use for historical
	reasons different character sets: utf-8, iso8859-1, iso5426, and
	marc8. You must convert your search query to the right character set
	for each library, for searching and retrieving the records.

	In this article I described the challenges to run a meta search on
	top of Z39.50. All these problems are due the underlying databases
	and not Z39.50--you will have the same problems if you use a web
	based XML services such as SRU or a proprietary, vendor-based API.
	The truth is that running a metasearch is not a trivial task.

	# Z39.50 for Dummies - Part 4

	Libraries store and exchange bibliographic data in MARC records. A
	MARC record is a MAchine-Readable Cataloging record. It was developed
	at the Library of Congress (LoC) beginning in the 1960s.

	MAchine-Readable Cataloging record

	Library of Congress

	A dump of the LoC catalog (and other libraries) is available at the
	Internet Archive in the collection marcrecords. The LoC catalog dump
	is split into 29 files, part01.dat to part29.dat. Each file is
	roughly 200MB large.

	LoC catalog dump

	The great news is that the data from LoC is public domain (already
	paid by the US taxpayers, thank you!) and you can use the data for
	your own system.

	MARC Open-Access (2016)

	MDSConnect datasets (2020)

	Before you can import data, you must validate, convert, or fix the
	bibliographic data. I will show now how you can do this with the
	Index Data YAZ toolkit. The YAZ toolkit contains the program
	yaz-marcdump to dump MARC records.

	yaz-marcdump

	yaz-marcdump called without an option will print the records in line
	format:

	$ yaz-marcdump part01.dat \| more

	00720cam 22002051 4500
	001 00000002
	003 DLC
	005 20040505165105.0
	008 800108s1899 ilu 000 0 eng
	010 $a 00000002
	035 $a (OCoLC)5853149
	040 $a DLC $c DSI $d DLC
	050 00 $a RX671 $b .A92
	100 1 $a Aurand, Samuel Herbert, $d 1854-
	245 10 $a Botanical materia medica and pharmacology; $b drugs \
	considered from a botanical, pharmaceutical, physiological, \
	therapeutical and toxicological standpoint. $c By S. H. Aurand.
	260 $a Chicago, $b P. H. Mallen Company, $c 1899.
	300 $a 406 p. $c 24 cm.
	500 $a Homeopathic formulae.
	650 0 $a Botany, Medical.
	650 0 $a Homeopathy $x Materia medica and therapeutics.
	[...]

	First converts the MARC21 records in MARC-8 encoding to MARC21 in
	UTF-8 encoding:

	$ yaz-marcdump -f marc-8 -t utf-8 -o marc part01.dat > part.mrc

	For MARC21, the leader offset 9 tells whether it is really MARC8
	(almost always the case) or whether it's UTF-8. A MARC21 must have
	position 9='a' (value 97). For this reason, the option -l for
	yaz-marcdump may come in handy:

	$ yaz-marcdump -f marc-8 -t utf-8 -o marc -l 9=97 part01.dat \
	> part.mrc

	If you prefer MARCXML instead MARC21 records you may convert the
	records:

	$ yaz-marcdump -o marcxml -f MARC-8 -t UTF-8 part01.dat \
	> part.marcxml

	<collection xmlns="http://www.loc.gov/MARC21/slim">
	<record>
	<leader>00720cam a22002051 4500</leader>
	<controlfield tag="001"> 00000002 </controlfield>
	<controlfield tag="003">DLC</controlfield>
	<controlfield tag="005">20040505165105.0</controlfield>
	<controlfield tag="008">800108s1899 ilu 000 0 eng
	</controlfield>
	<datafield tag="010" ind1=" " ind2=" ">
	<subfield code="a"> 00000002 </subfield>
	</datafield>
	<datafield tag="035" ind1=" " ind2=" ">
	<subfield code="a">(OCoLC)5853149</subfield>
	</datafield>
	[...]

	The Library of Congress has over 7 million records. That's huge data,
	total 5.6GB raw data. If you compress that data it is only 1.7GB.

	To convert compressed data, run yaz-marcdump in a UNIX pipe:

	$ zcat part01.dat.gz \| yaz-marcdump -f MARC-8 -t UTF-8 \
	-o marcxml /dev/stdin > part01.marcxml

	You can search a marc dump with the UNIX grep tool:

	$ yaz-marcdump -f marc-8 -t utf-8 part01.dat \| grep Sausalito

	260 $a Sausalito, Calif. : $b University Science Books, $c 2000.
	260 $a Sausalito, Calif. : $b Math Solutions Publications, \
	$c c2000.
	260 $a Sausalito, Calif. : $b Post-Apollo Press, $c c2000.
	260 $a Sausalito, Calif. : $b University Science Books, \
	$c c2002.
	260 $a Sausalito, Calif. : $b Post-Apollo Press, $c c2000.
	260 $a Sausalito, CA : $b Toland Communications, $c c2000.
	260 $a Sausalito, CA : $b In Between Books, $c 2001.
	[...]

	The yaz-marcdump tool supports the character sets UTF-8, MARC-8,
	ISO8859-1, ISO5426 and some other encodings. For more information,
	see the yaz-iconv manual pages.

	yaz-iconv

	In this article I showed how to validate, convert, or fix
	bibliographic data dumped in MARC format. Next time I will show some
	advanced examples how to analyze MARC records on modern standard PC
	hardware.

	# Z39.50 for Dummies - Part 5

	In this article I will show you how to analyze MARC data on a modern
	PC hardware. PC are very fast now and incredibly cheap. You can rent
	a quad-core Intel machine with 8GB RAM and unlimited traffic for
	40 Euro/month (+VAT) in a data center.

	If the computer is fast enough, you don't have to spend too much time
	on complex algorithms. You can use the raw power of your computer and
	do a brute force approach.

	In the following example I will use the 7 million records from a dump
	of the Library of Congress (LoC) catalog. For details, please read
	the previous article Z39.50 for Dummies - Part 4.

	$ for i in *.dat; do
	yaz-marcdump -f marc-8 -t utf-8 -o line
	done > loc.txt

	$ du -hs loc.txt
	4.9G

	The line dump of the LoC is 4.9GB large and fits into main
	memory--great!

	# count for the last name "Calaminus"
	$ egrep -c Calaminus loc.txt

	4 hits, the search took 4 seconds real time

	# count records with <span class="caps">ISBN</span> number
	$ egrep -c ^020 loc.txt
	3999863

	There are nearly 4 million ISBN numbers (out of 7 million records).
	The search took 11 seconds.

	# count <span class="caps">URL</span>s
	$ egrep -c http:// loc.txt
	265540

	There are 265,540 URLs in the LoC records.

	# check for subject headings for the city of
	# Sausalito, California using regular expression
	$ egrep -c '^[67][0-<span class="caps">9[0</span>-9].*Sausalito' \
	loc.txt
	19

	There are 19 subject headings for Sausalito

	# search with a typo in name (a => o)
	$ egrep Sausolito loc.txt

	No hits due a typo in the name, try it with agrep, a grep program
	with approximate matching capabilities:

	$ agrep -c -1 Sausolito loc.txt
	282

	282 hits, the search took 8 seconds

	agrep

	The examples above are for software developers and experienced
	librarians. They are helpful for a quick check of your bibliographic
	records, for data mining, analyzing or to double-check if your
	indexer works correctly.

	If you want setup a public system for end-users you need of course a
	real full text engine [such] as our zebra software.

	zebra

	From: https://www.datamercs.net/posts/2020-08-15-z3950-for-dummies/

	tags: article,technical,unix

	# Tags

	article
	technical
	unix