Metadata Extraction with JHOVE
JHOVE (JSTOR/Harvard Object Validation Environment) is a tool for format-specific identification, validation, and characterization of digital objects.
Introduction
- Format identification is the process of determining the format to which a digital object conforms; in other words, it answers the question: “I have a digital object; what format is it?”
- Format validation is the process of determining the level of compliance of a digital object to the specification for its purported format, e.g.: “I have an object purportedly of format F; is it?”
- Format validation conformance is determined at two levels: well-formedness and validity.
  - A digital object is well-formed if it meets the purely syntactic requirements for its format.
  - An object is valid if it is well-formed and it meets additional semantic-level requirements.
For example, a TIFF object is well-formed if it starts with an 8-byte header followed by a sequence of Image File Directories (IFDs), each composed of a 2-byte entry count and a series of 12-byte tagged entries. The object is valid if it also meets certain semantic-level rules, such as the rule that an RGB file must have at least three sample values per pixel.
- Format characterization is the process of determining the format-specific significant properties of an object of a given format, e.g.: “I have an object of format F; what are its salient properties?”
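The difference between well-formedness and validity in the TIFF example can be sketched in a few lines of Python. This is only an illustration, not JHOVE's TIFF module: it assumes the TIFF 6.0 layout (in which each IFD entry occupies 12 bytes), inspects only the first IFD, and applies a single semantic rule. The names classify_tiff and tiny_tiff are hypothetical helpers introduced here.

```python
import struct

PHOTOMETRIC = 262        # PhotometricInterpretation tag
SAMPLES_PER_PIXEL = 277  # SamplesPerPixel tag
RGB = 2                  # PhotometricInterpretation value for RGB


def classify_tiff(data: bytes) -> str:
    """Toy classifier: syntactic checks first, then one semantic rule."""
    # Well-formedness: 8-byte header = byte order, magic number 42, IFD offset.
    if len(data) < 8:
        return "not well-formed"
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        return "not well-formed"
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42 or ifd_offset < 8 or ifd_offset + 2 > len(data):
        return "not well-formed"
    # First IFD: 2-byte entry count, 12-byte entries, 4-byte next-IFD offset.
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    if ifd_offset + 2 + 12 * count + 4 > len(data):
        return "not well-formed"
    shorts = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, n = struct.unpack(order + "HHI", data[start:start + 8])
        if field_type == 3 and n == 1:  # a single SHORT value is stored inline
            (shorts[tag],) = struct.unpack(order + "H", data[start + 8:start + 10])
    # Validity: an RGB image needs at least three samples per pixel.
    if shorts.get(PHOTOMETRIC) == RGB and shorts.get(SAMPLES_PER_PIXEL, 1) < 3:
        return "well-formed, but not valid"
    return "well-formed and valid"


def tiny_tiff(samples_per_pixel: int) -> bytes:
    """Build a minimal little-endian TIFF skeleton with two SHORT entries."""
    header = struct.pack("<2sHI", b"II", 42, 8)
    entries = [
        struct.pack("<HHIHH", PHOTOMETRIC, 3, 1, RGB, 0),
        struct.pack("<HHIHH", SAMPLES_PER_PIXEL, 3, 1, samples_per_pixel, 0),
    ]
    return header + struct.pack("<H", len(entries)) + b"".join(entries) + struct.pack("<I", 0)


print(classify_tiff(tiny_tiff(3)))
print(classify_tiff(tiny_tiff(1)))
print(classify_tiff(b"not a tiff"))
```

A real validator checks many more rules and follows every IFD in the chain; the point here is only that the syntactic checks run first and the semantic rule is applied to an object that already parsed.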
The set of characteristics reported by JHOVE about a digital object is known as the object’s representation information, a concept introduced by the Open Archival Information System (OAIS) reference model [ISO/IEC 14721]. The standard representation information reported by JHOVE includes: file pathname or URI, last modification date, byte size, format, format version, MIME type, format profiles, and optionally, CRC32, MD5, and SHA-1 checksums [CRC32, MD5, SHA-1]. Additional media type-specific representation information is consistent with the NISO Z39.87 Data Dictionary for digital still images and the draft AES metadata standard for digital audio.
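To make the optional checksums concrete, the following Python sketch computes CRC32, MD5, and SHA-1 values for a byte string with the standard library. It is not how JHOVE computes them internally; the function name checksums and the lowercase hexadecimal formatting are choices made for this example.

```python
import hashlib
import zlib


def checksums(data: bytes) -> dict:
    """Return CRC32, MD5, and SHA-1 checksums of data as lowercase hex strings."""
    return {
        # Mask to 32 bits so the CRC is always rendered as an unsigned value.
        "CRC32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA-1": hashlib.sha1(data).hexdigest(),
    }


for name, value in checksums(b"abc").items():
    print(f"{name}: {value}")
```

For large files, the same values can be obtained by feeding the file to the hash objects in chunks instead of reading it into memory at once.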
Identification, validation, and characterization actions are frequently necessary during routine operation of digital repositories and for digital preservation activities. These actions are performed by modules, and the output from JHOVE is controlled by output handlers. JHOVE uses an extensible plug-in architecture; it can be configured at invocation time to include whichever format modules and output handlers are desired. The initial release of JHOVE includes modules for arbitrary byte streams; ASCII and UTF-8 encoded text; GIF, JPEG, JPEG 2000, and TIFF images; AIFF and WAVE audio; PDF; HTML; and XML. It also includes text and XML output handlers.
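The split between modules and output handlers can be pictured with a small Python sketch. It is a conceptual illustration only and does not reflect JHOVE's actual Java classes or configuration format: the registries MODULES and HANDLERS, the function characterize, and the module and handler names used here are all hypothetical.

```python
from xml.etree import ElementTree as ET

MODULES = {}   # module name -> function(bytes) returning a dict, or None if rejected
HANDLERS = {}  # handler name -> function(dict) returning a string


def register(registry, name):
    """Decorator that plugs a function into one of the registries."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register(MODULES, "ascii")
def ascii_module(data):
    # Accept the object only if every byte is in the 7-bit ASCII range.
    if any(byte > 0x7F for byte in data):
        return None
    return {"Format": "ASCII", "Size": len(data)}


@register(MODULES, "bytestream")
def bytestream_module(data):
    # Fallback module: any sequence of bytes is an acceptable byte stream.
    return {"Format": "bytestream", "Size": len(data)}


@register(HANDLERS, "text")
def text_handler(info):
    return "\n".join(f"{key}: {value}" for key, value in info.items())


@register(HANDLERS, "xml")
def xml_handler(info):
    root = ET.Element("repInfo")
    for key, value in info.items():
        ET.SubElement(root, "property", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def characterize(data, module_names, handler_name):
    """Try the configured modules in order; format the first accepted result."""
    for name in module_names:
        info = MODULES[name](data)
        if info is not None:
            return HANDLERS[handler_name](info)
    raise ValueError("no configured module accepted the object")


print(characterize(b"hello", ["ascii", "bytestream"], "text"))
print(characterize(b"\xff\x00", ["ascii", "bytestream"], "xml"))
```

Because modules and handlers are looked up by name at call time, adding a new format or a new output style means registering one more function, without touching the dispatch logic.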
Implementation
1. Requirements
The tool is written in Java, so a Java Runtime Environment is required to run JHOVE. It should run on any operating system with a Java runtime, such as UNIX or Windows.
2. Installation
Download the tool from https://sourceforge.net/projects/jhove/files/ or install it with apt-get install jhove.
3. Invoke JHOVE
The JHOVE command-line interface is invoked by the Bourne shell script “jhove” (under Unix) or the DOS shell script “jhove.bat” (under Windows) in the JHOVE installation directory. This script properly sets the Java CLASSPATH and executes the Jhove class with the Java interpreter.
In the invocation syntax below, brackets [ and ] enclose optional arguments. In addition to the syntax specified in subsequent sections, any of the following standard options can also be used:
... [-c config] [-h handler] [-e encoding] [-o output] [-x sax-class] [-t directory] [-b buffer] [-l loglevel]...
where
-c config
specifies the pathname of the configuration file;
-h handler
specifies the output handler (defaults to TEXT, the standard Text handler);
-e encoding
specifies the character encoding used by the output handler (defaults to UTF-8);
-o output
specifies the output file pathname (defaults to standard output);
-x sax-class
specifies the SAX parser class name (defaults to the J2SE 1.4 default);
-t directory
specifies the pathname of the directory in which temporary files are created (defaults to the current working directory);
-b buffer
specifies the buffer size used for buffered I/O (defaults to the J2SE 1.4 default); and
-l loglevel
specifies the logging level (defaults to SEVERE).
Note that the temporary directory, buffer size, and logging level can also be specified in the configuration file.
4.1 Format Identification
The following syntax is used to discover, or identify, the format of a digital object.
jhove ... [-ks] file-or-uri1 ... file-or-uriN
where the first ellipsis … is a placeholder for any of the optional standard options defined above.
The digital object(s) can be specified as a file or directory pathname or as a URI. If a directory is specified, JHOVE recursively walks through the directory. The optional -s flag specifies that identification should be performed solely on the basis of the internal signatures (e.g., magic numbers) associated with the formats, rather than by a complete parsing of the object. After the object’s format has been identified, its representation information is displayed. The optional -k flag specifies that object checksum values should be calculated and displayed as part of the representation information.
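When JHOVE is called from a script, it can help to assemble the command line programmatically. The Python sketch below builds an argument list from the options described above (-h, -o, -k, -s). The function name jhove_command, the file names, and the handler name XML are assumptions for illustration, and the sketch assumes the jhove script is on the PATH and accepts -k and -s as separate arguments. It only prints the command and does not execute it.

```python
import shlex


def jhove_command(targets, checksums=False, signature_only=False,
                  handler=None, output=None, executable="jhove"):
    """Build a jhove argument list for identifying one or more objects."""
    cmd = [executable]
    if handler:
        cmd += ["-h", handler]   # output handler (the text handler is the default)
    if output:
        cmd += ["-o", output]    # output file (standard output is the default)
    if checksums:
        cmd.append("-k")         # calculate and display checksums
    if signature_only:
        cmd.append("-s")         # identify by internal signature only
    cmd.extend(targets)          # files, directories, or URIs
    return cmd


cmd = jhove_command(["report.pdf", "scans/"], checksums=True,
                    handler="XML", output="report.xml")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with pathnames that contain spaces.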
Result
For a PDF file, the output looks like the following:
ReportingModule: PDF-hul, Rel. 1.8 (2009-05-22)
LastModified: 2013-04-29 16:04:17 CEST
Size: 14138
Format: PDF
Version: 1.5
Status: Well-Formed and valid
SignatureMatches:
 PDF-hul
MIMEtype: application/pdf
PDFMetadata:
 Objects: 14
 FreeObjects: 1
 IncrementalUpdates: 0
 DocumentCatalog:
  PageLayout: SinglePage
  PageMode: UseNone
 Info:
  Creator: cairo 1.12.14 (http://cairographics.org
  Producer: cairo 1.12.14 (http://cairographics.org
 Filters:
  FilterPipeline: FlateDecode
 Fonts:
  TrueType:
   Font:
    BaseFont: NHOAPS+DejaVuSans
    FontSubset: true
    FirstChar: 32
    LastChar: 121
    FontDescriptor:
     FontName: NHOAPS+DejaVuSans
     Flags: Nonsymbolic
     FontBBox: -1020, -415, 1680, 1166
     FontFile2: true
    Encoding: WinAnsiEncoding
    ToUnicode: true
 Pages:
  Page:
   Sequence: 1
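Output like the report above is easy to post-process. The following Python sketch collects the "Key: value" lines of a text report into a dictionary so that fields such as Status or Format can be checked by a script. It is a simplification that ignores nesting and keeps only the first occurrence of each key; parse_report is a hypothetical helper, and the sample text is an excerpt of the report shown above.

```python
def parse_report(text: str) -> dict:
    """Collect 'Key: value' lines of a JHOVE text report into a flat dict."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values may contain colons themselves.
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip lines without a value (section headers) and repeated keys.
        if sep and value and key not in fields:
            fields[key] = value
    return fields


sample = """\
ReportingModule: PDF-hul, Rel. 1.8 (2009-05-22)
LastModified: 2013-04-29 16:04:17 CEST
Size: 14138
Format: PDF
Version: 1.5
Status: Well-Formed and valid
SignatureMatches:
 PDF-hul
MIMEtype: application/pdf
"""

report = parse_report(sample)
print(report["Format"], report["Version"], report["Status"])
```

For anything beyond simple checks, an output handler that produces structured output, such as the XML handler, is likely a better basis for parsing than the text report.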