The ArchivesSpace Indexer

info

If you have navigated to this article to learn about the effects of re-indexing, you can navigate directly to that section. However, Atlas recommends that all users read this article in its entirety to better understand the indexer and its role in ArchivesSpace.

warning

This article was written by an archivist for other archivists. Any technical simplifications are the fault of the author.

We generally think of ArchivesSpace as one application the same way we think of Microsoft Word or Mozilla Firefox as one application, but behind the scenes, ArchivesSpace is actually multiple applications working together:

  1. The database where your data is stored (MySQL at Atlas)
  2. The Staff User Interface (SUI)
  3. The Public User Interface (PUI)
  4. The Application Programming Interface (API)
  5. The search storage engine, or indexer (Solr)

When troubleshooting, it's helpful to understand that different errors can occur when one or more of these applications is suffering an issue, and even better to know how those issues can cascade from one application to another. This article focuses on the search storage engine, or indexer: its role in the bigger picture, understanding when it fails, and what occurs during a re-indexing.

Indexer Basics

The search platform (Solr) stores data for efficient searching

The role of the index in ASpace (or any database application) is similar to that in a physical book: a fast way for the interface to find what you're looking for. Its primary role is to support the search and retrieval of information in an efficient way.

To find references to something in a physical book, you can either read the entire book from cover to cover or use the index to look up references and navigate directly to your keyword in the text.

When you type the keyword "California" into a search box in ArchivesSpace, the application has the equivalent of these two options: it can start searching the entire database table by table and row by row to find the keyword, or it can search the data stored by its search storage engine (the index). Searching indexed data is significantly faster.
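To make the two options concrete, here is a minimal sketch in Python. It is a toy model and not how ArchivesSpace or Solr is actually implemented: the `records` dictionary stands in for the database, and `scan_search`, `build_index`, and `index_search` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Toy "database": record ID -> title. Illustrative data only.
records = {
    1: "California gold rush letters",
    2: "Oregon trail diaries",
    3: "Maps of northern California",
}

def scan_search(records, keyword):
    """Option 1: read every record, like reading the book cover to cover."""
    keyword = keyword.lower()
    return sorted(rid for rid, title in records.items()
                  if keyword in title.lower().split())

def build_index(records):
    """Build a keyword -> record IDs lookup, like the index at the back of a book."""
    index = defaultdict(set)
    for rid, title in records.items():
        for word in title.lower().split():
            index[word].add(rid)
    return index

def index_search(index, keyword):
    """Option 2: look the keyword up directly in the index."""
    return sorted(index.get(keyword.lower(), set()))

index = build_index(records)
print(scan_search(records, "California"))   # [1, 3]
print(index_search(index, "California"))    # [1, 3]
```

Both functions return the same records. The difference is that the scan touches every record on every search, while the index lookup goes straight to the keyword. Note also that deleting `index` would leave `records` untouched.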

Another key point is that the index of a physical book could disappear, and while that would be inconvenient, the book itself would still be whole and viable. The same is true for ArchivesSpace; when the indexer is malfunctioning, rest assured your actual data is unaffected.

This is a highly simplistic comparison, but it is meant to emphasize that:

  • indexing is related to searching and display
  • index data is a copy of, or reference to, the data in your database
  • the index cannot be confused with the data itself

How the indexer works (and doesn't)

The indexer works by making a reference copy of the data in your instance of ArchivesSpace that it then uses for search and display. This is a simplistic explanation: it is not a perfect copy and it is not a copy of everything. The significant part to understand is that it is a copy as opposed to being your actual data stored in the actual database. When these two things (copy and actual data) match, the index is healthy; when they don't match, problems appear.

The index to a book stays static once that book is published, but the indexer in ArchivesSpace has to keep itself updated as you make changes. Changes include adding records, editing records, publishing records, transferring records, and deleting records. ArchivesSpace checks for changes as part of its ongoing operations (see Indexing Lag), but indexing can and will fail.

The indexer failing means that when ArchivesSpace goes to find something, its indexed data is either not there, or it is there but doesn't match recent changes. An inaccurate index becomes disorienting very quickly, especially when you know something should be there because you just created it.
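The two failure modes described above (indexed data that is missing, and indexed data that is out of date) can be sketched in a few lines of Python. This is an illustrative toy model only: the `database` and `index` dictionaries, the numeric `modified` values, the titles, and the `check_index` function are all assumptions made for the example, not ArchivesSpace internals.

```python
# Toy model: each record carries the time it was last modified.
database = {
    "/repositories/2/resources/455": {"title": "Smith family papers", "modified": 105},
    "/repositories/2/resources/456": {"title": "Jones letters", "modified": 120},
    "/repositories/2/resources/457": {"title": "Lee photographs", "modified": 130},
}

# The index holds a copy of some of that data, made at an earlier time.
index = {
    "/repositories/2/resources/455": {"title": "Smith family papers", "modified": 105},
    "/repositories/2/resources/456": {"title": "Jones correspondence", "modified": 90},
}

def check_index(database, index):
    """Sort each database record into healthy, stale, or missing from the index."""
    report = {"healthy": [], "stale": [], "missing": []}
    for uri, record in database.items():
        copy = index.get(uri)
        if copy is None:
            report["missing"].append(uri)      # not in the index at all
        elif copy["modified"] < record["modified"]:
            report["stale"].append(uri)        # indexed, but predates the last edit
        else:
            report["healthy"].append(uri)      # copy matches the database
    return report

report = check_index(database, index)
for status, uris in report.items():
    print(status, uris)
```

In this sketch, record 456 would display its old title in any interface that reads from the index, and record 457 would not appear at all, even though both are correct and present in the database.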

The indexer tends to fail when there is a lot going on, such as many records being added, transferred, deleted, or edited at once or in quick succession. It can also fail simply over time, as ASpace accumulates changes and eventually cannot keep the index updated. The question when working in ArchivesSpace is not whether the indexer will fail, but when.

"Fail" here means that ArchivesSpace just can't make those ongoing updates anymore. This affects the Public User Interface the most, since the entire PUI is sourced from indexer data. Remember that indexing failures are related only to the index; your actual data (stored in a MySQL database) is not affected by indexing failures.

There are two indexes

Both the Staff User Interface (SUI) and the Public User Interface (PUI) have an index, but this distinction is rarely made. Most documentation refers to these together as "the index" or "Solr." The distinction is not very significant to you the user except when troubleshooting indexing issues. The most common symptom of indexing issues is seeing information in the SUI but not the PUI; that is because the two interfaces get their display data from two different indexes.

Atlas usually takes action on both indexes at once (see Re-indexing).

Atlas disables the PUI indexer for sites that do not use it, so if your institution does not use the PUI, you do not need to worry about the PUI's index.

While the comparison to a physical book index makes sense for search, it is less intuitive that the indexer plays a role in displaying information to the various interfaces (SUI, PUI), and that's because the interface often searches for you when displaying relevant information. The most notable example is the PUI. The data displayed to users in the PUI is not directly sourced from the database; instead, the indexer creates a set of information based on the database that displays to the PUI. While the end result is the same (you see data) the fact that the indexer is involved in display is important for understanding some of the key troubleshooting observations when the indexing is failing or lagging.

Another example can be found on the staff interface. Say you are looking for all the records associated with a single personal agent. When you navigate to the View of a personal agent record in ASpace, the staff interface displays all the associated records. In that moment, ASpace did a search: it searched for all the relevant records, and then displayed them. You the user may not feel like you just requested a search because you never put anything into the search bar, but the search was part of your request to see all records.

Notable times when searching is happening without you realizing:

  • When browsing. A Browse is a "search of everything"
  • When generating EAD or a PDF. The interface first searches all the components that need to go into the EAD/PDF, and then generates it.
  • When counting records. Any time ASpace says you have X records, like 56 Resources, that count is informed by a search.
  • When displaying linked records. Records linked to a record (like Agents linked to Resources) must first be searched for before displaying. Any time ASpace needs to display linked records, a search occurs.

Common indexing issues and symptoms

Indexing issues are a simple explanation for a lot of problems that look complicated but are not. The issues can present as catastrophic (for example, the record count of your holdings is suddenly half of what it was yesterday) but are actually benign.

Since index data is a copy of your data and is only about search and display, it is notably not about the storage of your actual data. Your actual data is stored in a MySQL database that is completely separate from the index. Your index could catastrophically fail but your data would be safe. To repeat the book comparison, you can lose the index to a book and still have the book.

The most important part of identifying indexing problems is that you are isolating an easy-to-solve problem as distinct from other, more substantial issues.

The following are some common indexer problem symptoms.

Indexing Lag

Even when the indexer is working fine, there will always be a minimum of indexer lag. Indexer lag is the time it takes for the indexer to catch up to changes as you are making them. The most common example is creating or editing published records in the staff interface and then expecting them to appear in the PUI.

By default, ArchivesSpace checks every 30 seconds to see if any edits have been made to your data and indexes any changes found. It first indexes the staff index and then the PUI. However, lag is proportional to the number of changes you make within a given timeframe. One or two changes at a time will appear within 30 seconds or so, but for large operations or edits, ArchivesSpace will need additional indexing time on top of the 30-second refresh rate.
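The polling cycle described above can be sketched as follows. This is a simplified simulation under stated assumptions, not the actual ArchivesSpace indexer: time is represented by plain numbers rather than a real clock, and `run_indexing_pass`, `POLL_SECONDS`, and the record fields are hypothetical names used only for illustration.

```python
POLL_SECONDS = 30  # the default check interval described above

def run_indexing_pass(database, staff_index, pui_index, last_pass):
    """Copy records changed since the previous pass: staff index first, then PUI."""
    changed = {uri: rec for uri, rec in database.items()
               if rec["modified"] > last_pass}
    for uri, rec in changed.items():
        staff_index[uri] = dict(rec)
    for uri, rec in changed.items():
        if rec["published"]:
            pui_index[uri] = dict(rec)
        else:
            pui_index.pop(uri, None)  # unpublished records leave the public index
    return len(changed)

database = {
    "/repositories/2/resources/455": {"modified": 10, "published": True},
}
staff_index, pui_index = {}, {}

# The pass at t=30 picks up the record saved at t=10.
run_indexing_pass(database, staff_index, pui_index, last_pass=0)

# A new record is saved at t=40. It is in the database immediately...
new_uri = "/repositories/2/resources/456"
database[new_uri] = {"modified": 40, "published": True}
visible_before = new_uri in pui_index   # False: this gap is indexing lag

# ...but only reaches the indexes at the next pass (t=60).
run_indexing_pass(database, staff_index, pui_index, last_pass=POLL_SECONDS)
visible_after = new_uri in pui_index    # True
```

The more records a single pass has to copy, the longer that pass takes, which is why bulk operations lag well beyond the 30-second interval.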

Long lag times occur after creating or editing records in bulk. Some specific examples of that are:

  • spreadsheet/CSV ingests that create or edit more than 50 records at a time
  • creating or updating locations in bulk
  • barcoding in bulk
  • merging resource records or top containers

The PUI fails first and most often

The entire Public User Interface is sourced from indexed data. This is a crucial, fundamental fact of working in ArchivesSpace. For this reason, the PUI is usually the first and most obvious way to detect that ArchivesSpace is having indexing issues that are not related to normal lag times (see Indexing Lag). If you have been directed to this article because of a problem with the PUI, then the chances are very high that Atlas is in the process of re-indexing the affected server. In most cases, this resolves display issues with the PUI, or at the very least is the most important first step in troubleshooting PUI display issues.

Merging Controlled Values

Not all users have the ability to merge controlled values (navigate to System then Manage Controlled Value Lists), but this is when one value from a drop-down menu (for, say, a container type) is merged with another. It has been observed that the newly merged controlled values consistently fail to index after the merge operation is complete. Atlas recommends re-indexing when this occurs.

Publishing a repository for the first time

Atlas staff have consistently seen issues displaying or searching for records on the PUI after publishing a repository for the first time, even if that repository has only a few records. While we have no explanation for this behavior, it is very consistent. The solution is to re-index ArchivesSpace.

Transferring records from one repository to another

Atlas staff have consistently seen issues displaying or searching for records after transferring resources between repositories. This usually only affects the records that were transferred and not the repository as a whole. This has been observed in both the staff indexer and the PUI indexer. While we have no explanation for this behavior, it is very consistent. The solution is to re-index ArchivesSpace. You can also try re-saving individual Resource records if you can determine how to navigate to them directly on the staff interface (which is usually just a matter of experimenting with URLs until you find the record; this can be very inefficient but is worth mentioning for some cases).

Records "Disappearing" from the PUI

An uncommon indexer error is when records (usually resources) seemingly disappear from the Public User Interface. This problem assumes that other variables related to display on the PUI have been eliminated, mainly Publish status of the record and whether it has been Suppressed. Once those variables are eliminated, a display issue with the PUI is usually an indexing issue. Atlas recommends re-indexing when this occurs.
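The elimination process above can be expressed as a short check: a record is an indexing suspect only if it is published, not suppressed, and still absent from the PUI. The sketch below is illustrative only. The two dictionaries stand in for the staff and PUI indexes, and `missing_from_pui` is a hypothetical helper, not an ArchivesSpace function.

```python
def missing_from_pui(staff_index, pui_index):
    """List records that should be public but are absent from the PUI index."""
    suspects = []
    for uri, rec in sorted(staff_index.items()):
        should_be_public = rec["published"] and not rec["suppressed"]
        if should_be_public and uri not in pui_index:
            suspects.append(uri)
    return suspects

# Illustrative data only.
staff_index = {
    "/repositories/2/resources/455": {"published": True, "suppressed": False},
    "/repositories/2/resources/456": {"published": False, "suppressed": False},
    "/repositories/2/resources/457": {"published": True, "suppressed": True},
    "/repositories/2/resources/458": {"published": True, "suppressed": False},
}
pui_index = {
    "/repositories/2/resources/455": {"published": True, "suppressed": False},
}

suspects = missing_from_pui(staff_index, pui_index)
print(suspects)  # ['/repositories/2/resources/458']
```

Records 456 and 457 are missing from the PUI for legitimate reasons (unpublished and suppressed, respectively), so only 458 points to an indexing problem.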

Generalized Indexing Failure

If you catch yourself confused by the way your data looks on the screen, then ask yourself: is this an indexing problem? The main symptom of any generalized indexing failure is that something isn't quite right. Perhaps your holdings count is too low, a repository appears empty, or a link to a record on the PUI that should work is broken. The most common way this manifests is that you can see something on the SUI that you should also see on the PUI, but you can't (see Pro-Tip: A Simple Test below).

If you are in doubt about the health of your indexer, submit a support ticket to support@atlas-sys.com with an example of the suspected behavior, and Atlas will investigate and re-index if necessary.

Pro-Tip: A Simple Test

When troubleshooting indexer problems on the PUI, you can try this simple test.

ArchivesSpace automatically reindexes a record when it detects a change. One way to trigger that change is to hit the Save button on any record you are trying to diagnose as having an indexing problem. This small action helps to test whether indexing is currently working. After hitting Save, navigate to the PUI and see if the record appears. Remember to allow time for indexing lag.

This doesn't always work, and may not apply to the situation you are troubleshooting, but this pro-tip can be handy.

Re-indexing

To re-index ArchivesSpace means that Atlas staff delete the indexer "state" data, which triggers the Solr application to re-create the index with fresh data.

Re-indexing is a time-intensive process and the time it takes is directly related to the size of your holdings. For most customers this process takes a few hours (or overnight).

Effects during re-indexing

danger

Since index data underpins all search and display, clearing that data all at once will dramatically but temporarily affect the display and search functions of both the PUI and the Staff User Interface. Your collections may be inaccessible during re-indexing in that they cannot reliably be found via Search or Browse (direct navigation to all records is still possible).

warning

It is normal that repositories may appear empty or missing records after an update; your records will reappear after a period of time. Re-indexing can take a few hours to a few days depending on the size of the database.

info

It is safe to view and create data in ArchivesSpace while it is re-indexing.

As detailed in other sections in this article, the index is a copy of your data used for search and display; it is not the data itself and it is not involved in the creation of data. With these facts in mind, both the SUI and the PUI are safe to use while re-indexing is taking place, including staff creating new data. However, due to the effects of re-indexing it is important to let staff and patrons know that the application isn't going to search and display as expected until re-indexing is complete.

Re-indexing rebuilds all the records in the index one at a time for a thorough and beneficial refresh. Since this happens one record at a time, it is a process, and certain repositories and records will re-index before others. This can be confusing to experience if you are not aware it is happening.

Some facts about this process:

  • Indexing is done per repository, so one repository will be complete before another begins.
  • Within each repository, the PUI re-indexes before the SUI.
  • Re-indexing proceeds by record ID within each repository, so records with lower ID numbers will be re-indexed before higher IDs. The ID referenced here is the ID provided by the database, not an ID provided by an archivist. You can see ID numbers in URIs. In the following example, 455 is the database ID of this resource: /repositories/2/resources/455.
  • Any server restart that happens in the midst of re-indexing will start the process over again for whatever repository was being indexed at the time. This is rare, but worth noting when Atlas is troubleshooting your server.
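The ordering rules in the list above can be sketched as a small function. This is a rough illustration of the described behavior, not the indexer's real code. `parse_uri` and `reindex_order` are hypothetical names, and processing repositories in ascending ID order is an assumption made for the example; the list above only states that one repository finishes before the next begins.

```python
import re

URI_PATTERN = re.compile(r"^/repositories/(\d+)/resources/(\d+)$")

def parse_uri(uri):
    """Return (repository ID, database ID) from a resource URI."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a resource URI: {uri}")
    return int(match.group(1)), int(match.group(2))

def reindex_order(uris):
    """Order the work repository by repository, PUI before SUI within each
    repository, and lowest database ID first."""
    by_repo = {}
    for uri in uris:
        repo_id, record_id = parse_uri(uri)
        by_repo.setdefault(repo_id, []).append((record_id, uri))
    steps = []
    for repo_id in sorted(by_repo):  # assumption: ascending repository ID
        ordered = [uri for _, uri in sorted(by_repo[repo_id])]
        for target in ("PUI", "SUI"):
            steps.extend((target, uri) for uri in ordered)
    return steps

uris = [
    "/repositories/3/resources/12",
    "/repositories/2/resources/455",
    "/repositories/2/resources/9",
]
steps = reindex_order(uris)
for target, uri in steps:
    print(target, uri)
```

In this example, resource 9 in repository 2 is searchable again well before resource 12 in repository 3, which is why counts and search results look uneven partway through a re-index.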

These variables will make search, display, and record counts inconsistent and confusing until indexing is complete, especially for users who are not familiar with the process.

For some hosted customers, the lack of reliable search and display while ArchivesSpace rebuilds the indexes means they consider re-indexing time as downtime, even though the application is still up and running. Other customers elect to proceed as normal during re-indexing. All sites should inform staff when re-indexing occurs, to prevent confusion.

Re-indexing during an update

The benefits of semi-regular re-indexing are so significant that Atlas re-indexes your server(s) upon every update, even if it is not required by the update itself. Experience as hosting providers has informed this decision on your behalf. We have observed that there are fewer overall inconveniences by undertaking this important step on a semi-regular basis. Atlas Systems always re-indexes during an update when it is required by the release notes for that version.

Sites can opt out of this procedure if the re-index was otherwise optional for the update, but we highly recommend re-indexing with updates. If you'd like to opt out of re-indexing during your next update, please let us know when you contact us for your update.

Re-indexing as a solution

Re-indexing corrects most problems with the indexer. Any indexer issues that persist after a re-index are likely significant and will require additional investigation, but simply put, re-indexing almost always works to correct issues both large and small.

Other facts

A collection of FYIs related to indexing. This list will be updated over time.

  • Starting in v3.5.0, the PUI indexer no longer indexes unpublished repositories.
  • Atlas disables the PUI indexer for servers that do not use the PUI.
  • The only way to know when re-indexing is complete is to check the server logs.
  • Database integrity errors can cause indexing loops, where ASpace never actually finishes indexing. Atlas has experience with the most common cause of these errors, which is related to the overnight time change with Daylight Saving Time. Starting in v3.5.0, these errors have been resolved, but they may still exist in your data if you were using ASpace before v3.5.0. When and if Atlas detects this problem, we will communicate it to you.