Pomfort Basics

Managing Data (Part Two): Completeness of Data with Manifest Files

As part of a two-part blog series, this article continues our exploration of data management. In the first article, we established that successful data management has two dimensions: “One of the most important goals […] is to maintain the integrity and completeness of all data recorded and created during production.” While the first article looked at how to maintain the integrity of single files by calculating and comparing hash values, we now move on to the aspect of completeness, including manifest files.

Completeness of Data

When talking about “completeness”, we can distinguish two levels: First, we need to make sure that each single data management process or activity (e.g., the backup of one camera card) is complete[1]. But sometimes, such activities are grouped together, and we need to take care of the completeness of that entire batch of activities or processes (e.g., ensure completeness of the backups of all camera cards during a shooting day). As we will see, the two levels pose different challenges. So, let’s start with the first one.

Single-Activity Completeness

Let’s assume a batch of files needs to be backed up. We start that process by pointing a piece of software to the folder that contains the files we want to copy. Then, the process starts, copying file by file to the new destination. If everything goes well, completeness is hopefully covered by the copy process, i.e. the software only stops when all files have been written successfully to the destination.

If the process is aborted or fails without notice, a glance at the destination won’t tell us whether one or more files are missing in one of the subfolders. That requires a more thorough check, which typically includes a comparison of destination and source. Even in the least managed workflows, you might compare overall folder sizes on source and destination, or compare the number of files in a folder, to find such undetected incomplete processes.
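Such a comparison of source and destination can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular offload tool works; the function names inventory and compare_folders are made up for this example. It compares relative paths and file sizes, so it finds missing or truncated files but says nothing about file contents.

```python
from pathlib import Path


def inventory(root):
    """Map each file's path (relative to root) to its size in bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_folders(source, destination):
    """Return (missing, size_mismatch) for the files expected from source."""
    src = inventory(source)
    dst = inventory(destination)
    # Files that exist on the source but not on the destination.
    missing = sorted(set(src) - set(dst))
    # Files that exist on both sides but differ in size.
    size_mismatch = sorted(
        path for path in src if path in dst and src[path] != dst[path]
    )
    return missing, size_mismatch
```

Note that the check starts from the source inventory: extra files on the destination are ignored, while anything the source has and the destination lacks is reported.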

Multi-Activity Completeness

Now let’s look at a scenario where multiple activities are grouped together to create one new “package” of files. A typical example can be found in almost every shooting day:

Over the course of a shooting day, multiple camera cards are offloaded to a travel drive. Additional files, such as audio cards, reports, set photos, transcoded dailies, etc., are also added to that travel drive. So, even if the single offloads or copies end up complete, how do you ensure that there isn’t an entire folder missing on that travel drive? Even if each individual card was copied successfully, you also need to keep track of the numbers somewhere. Should there be eight or nine cards on that travel drive?

“Why Would Anything Go Wrong” – And How To Spot It Anyways

The first level of completeness (completeness of one offload or backup process) is the most important one. If we cannot check if one process has finished successfully and all copied files are still there, we will not be able to talk about grouping such processes together.

If you do detect an issue, the reasons for incomplete processes can be manifold:

  • Power issues: Not only the computer but maybe one of the external devices that need a separate power supply can lose power (think tripped cables, loose plugs through vibrations, etc.). Such devices can be card reader systems or the external destination drives.
  • Read errors on the source: Read errors can occur when faulty blocks are read on a card. That can make the filesystem hang or time out.
  • Write errors on one of the destinations: Same issue with faulty blocks, but sometimes also because of such basic reasons as “not enough space on the volume left”. Even if the software checks at the start of the process if there is enough space, other systems can write to the same destination while the process is running (e.g., a transcoder is writing to a destination while a copy is running).
  • Errors from consistency checks: Depending on how the software works or is set up, the process may stop in case of a hash error.
  • Other resource issues: Although a copy process does not require much RAM, if none is left, there is little the software can do but abort. For example, if other RAM-hungry processes are running, our data management process may not be able to load data from the source files into memory anymore.

To detect such issues, the data management software needs to make it easy to find out later whether a process was paused, aborted, or failed somewhere in the middle. Relying on the software’s result status alone is not always sufficient: in some of the cases above (e.g., running out of RAM), the software can no longer work reliably and may quit without being able to write errors to a log.

The solution is to create a “receipt” at the end of the process that includes a list of all files the process was meant to handle. If the receipt is missing, we know that something went wrong. If it’s there, we can use its list of files to check whether the result of the process is still complete.

Incompleteness of the second level happens when the result of one or more individual data management activities is missing in the package. Usually, the reasons for this are organizational rather than technical:

  • Device mix-up: The wrong placement or label of a camera card can lead to situations where the camera card is simply not copied. 
  • Process not started: Sometimes, everything is prepared, but the process is simply not started. This can happen when the user gets distracted, forgets to start the offload, leaves, and comes back thinking it’s already done.
  • Wrong destination: A copy needs to go to the right destination; otherwise, something was copied, but the result is missing where you expect it. This can happen when dragging and dropping folders onto other folders in Finder while the intended destination is not highlighted at the moment the mouse button is released.
  • Communication error: When working together in stressful situations, things can get misunderstood (“Did you start that?” – “Yes!” …but that means something different to each person).

These issues are outside of the scope of software, but good management can still help spot those situations (e.g., by comparing travel drive contents against other information sources such as a camera report and cross-checking cards twice or more before formatting them).

With all these things considered, there is still one issue that can be prevented, even though it arises outside the scope of the person who groups the files together at the end of the shooting day:

  • Lost folder in future data management processes: Even if you made sure that the travel drive leaving the film set is complete, the later process that copies the contents of that travel drive to a file server in a facility can also fail. But in the facility, there are no stickers on camera cards or other source files to compare to. An inventory list of everything that was originally copied solves that problem, as it gives colleagues something to compare against whenever they need to spot incompleteness in the future.

To sum up, we saw that ensuring completeness is not a one-shot task, especially not in media workflows of film productions. Files and folders are copied, moved, backed up, and archived so many times that completeness goes far beyond that one copy on the film set.

To ensure completeness in all these subsequent activities, we need to document what should be there from the very moment something is expected to be there. This means completeness is always defined from the perspective of the source. Let’s take a closer look at what this means for our data management activities.

Completeness With Lists

For detecting an incomplete activity, providing a simple list of files might already be a good solution. We list all files that should be in a folder, and that list can easily be checked at any time.

If we also think about the topic of integrity, that file list would also be the perfect place to store the hash value for each file. Chances are high that your operating system of choice already comes with tools for creating such a list. For example, the md5 command on macOS outputs the file name and an MD5 hash value for the file.

An example looks like this:

% md5 A001C006_141024_R2EC.mov
MD5 (A001C006_141024_R2EC.mov) = 52d29e6b6fe711e08effb93588c2cee6

For multiple files the result of this command could already be our “file list with hashes”:

% md5 *
MD5 (A001C006_141024_R2EC.mov) = 52d29e6b6fe711e08effb93588c2cee6
MD5 (A001C019_141024_R2EC.mov) = 91684b6ccbd27ac14712dcb11b6095e6
MD5 (A001C024_141024_R2EC.mov) = 4900c8b11b22b328b21e396f3a95759c

In the early days of digital cinematography, such lists were in fact often used as the listing of files in a folder, documenting completeness (the list of files in the folder) and consistency (with a file hash value for each file). It sounds like the md5 command covers all our needs, right? Let’s take a closer look, and we’ll learn the limitations and additional requirements for typical applications. First, let’s point the md5 command at a folder of files:

% md5 A001R2EC/
md5: A001R2EC: Is a directory

Bummer, that doesn’t work with folder structures directly[2]. Usually, we will need to deal with more complex folder structures, so that should definitely be covered.

Also, with this approach, we don’t know what should be there: If you abort the md5 command after two files, the process stops, and the list looks like there are only two files. Nothing indicates that the source had ten files.
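The first limitation, folder structures, can be worked around with a short script. The following Python sketch walks a folder recursively and returns relative paths with MD5 hashes; the function names md5_of_file and hash_list are made up for this example. Like the md5 command, it still only documents what it finds, not what should have been there.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files need not fit into RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_list(root):
    """Return sorted (relative_path, md5) pairs for every file below root."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [(p.relative_to(root).as_posix(), md5_of_file(p)) for p in files]
```

Storing paths relative to the listed folder keeps the list valid after the folder has been copied to another drive.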

Manifest Files

So, good data management software should know what was on the source and make that the basis for any file list. Then, if a process was aborted or failed, the fact that something is missing can be easily spotted.

In Pomfort’s Silverstack and Offload Manager applications, that is done with so-called manifest files. These files are similar to the simple list we saw above but use a more structured syntax and contain more information tailored to media workflows. These manifest files are typically in a format called “MHL” (short for “media hash list”), or in the newer “ASC MHL” format recently specified by the ASC (American Society of Cinematographers).

Both formats similarly contain a list of files in the form of relative paths and file hashes. The software writes these manifest files after the copies of all files are made, so no incomplete, intermediate state of the manifest can remain during an aborted or failed copy that would indicate a wrong scope of completeness.
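The principle of writing the manifest only after all copies have succeeded can be illustrated with a minimal Python sketch. It is not how Silverstack or Offload Manager are implemented, and it writes a made-up JSON file instead of an MHL file; the function name copy_with_manifest and the file name manifest.json are assumptions for this example.

```python
import json
import os
import shutil
from pathlib import Path


def copy_with_manifest(source, destination, manifest_name="manifest.json"):
    """Copy all files, then write the manifest as the very last step."""
    source, destination = Path(source), Path(destination)
    # Take the inventory from the source before copying anything.
    expected = sorted(
        p.relative_to(source).as_posix() for p in source.rglob("*") if p.is_file()
    )
    destination.mkdir(parents=True, exist_ok=True)
    for relative_path in expected:
        target = destination / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # copy2 raises on failure, so the manifest below is never reached.
        shutil.copy2(source / relative_path, target)

    # Write under a temporary name and rename afterwards, so that a
    # half-written manifest can never be mistaken for a finished one.
    manifest = destination / manifest_name
    temporary = destination / (manifest_name + ".tmp")
    temporary.write_text(json.dumps({"files": expected}, indent=2))
    os.replace(temporary, manifest)
    return manifest
```

With this ordering, a missing manifest on the destination is itself the signal that the copy did not finish.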

An excerpt of the ASC MHL manifest file made for the files above would look like this:

...
<hashes>
    <hash>
      <path size="44900638" lastmodificationdate="2016-02-09T11:38:41+01:00">
        A001C006_141024_R2EC.mov
      </path>
      <xxh128 action="original" hashdate="2023-01-23T09:18:40.616865+01:00">
        c90b79d2e682e9f8dd2715add65e5913
      </xxh128>
    </hash>
    <hash>
      <path size="44394794" lastmodificationdate="2016-02-09T11:38:53+01:00">
        A001C019_141024_R2EC.mov
      </path>
      <xxh128 action="original" hashdate="2023-01-23T09:18:40.632174+01:00">
        e25ad55e74d8665142b58f7ee1f2de96
      </xxh128>
    </hash>
...

Besides the file paths and the hash values, we see a bit more information per file: the file size, the last modification date of the file, and the date when the hash was calculated. ASC MHL manifests also get hashed themselves, so a changed or incomplete manifest can also be detected reliably.
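To show how such a structured list can be used for a completeness check, here is a small Python sketch that reads a <hashes> fragment like the excerpt above and compares it with a folder. It is deliberately simplified: a complete manifest contains further elements and may declare an XML namespace, which this sketch ignores, and it checks only presence and file size. Verifying the xxh128 values would require a third-party hashing library. The function name check_against_manifest is made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_against_manifest(hashes_xml, folder):
    """Compare a <hashes> fragment with the files found in folder."""
    folder = Path(folder)
    problems = []
    for entry in ET.fromstring(hashes_xml).iter("hash"):
        path_element = entry.find("path")
        # The path is stored as element text, possibly surrounded by whitespace.
        relative_path = path_element.text.strip()
        expected_size = int(path_element.get("size"))
        candidate = folder / relative_path
        if not candidate.is_file():
            problems.append((relative_path, "missing"))
        elif candidate.stat().st_size != expected_size:
            problems.append((relative_path, "size differs"))
    return problems
```

An empty result means that every listed file is present with the expected size; it does not replace a hash verification with a proper tool.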

The MHL and ASC MHL manifest files include additional information about the software that did the offload (including version numbers) and can even include the contact information of the person in charge of the backup process. All this information makes it easier to investigate when something seems to be, or in fact is, wrong with the files.

“Sealing”

As shown above, manifest files already document the completeness of one data management activity well. How about the completeness of multiple backups?

In fact, we still have a problem when we add a manifest file to each copied folder and collect all such folders on a travel drive. What if one folder is missing? As an example, we imagine a situation where the travel drive is copied onto a fileserver in the facility, and one folder is not copied. How do you know? A completeness check of the remaining folders will find no issues, as all the manifests for their respective folders will check successfully.

Without a “super manifest”, we cannot know whether something should be there that has left no trace. Such a super manifest doesn’t have to be another list of all the files; it can be a list of the manifests that should be there and should be checked.

In Silverstack, there is the concept of “sealing” that creates such a “list of lists” on the destination drive. Starting with that seal, you can then easily check if a travel drive is still complete in all aspects at any time later.

The new ASC MHL standard offers something similar and allows so-called “nesting” of manifests. That means you can use ASC MHL not only to list files in certain folders but also to reference other ASC MHL manifest files. That way, you can group manifests together and also provide a single starting point for checking completeness.
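The “list of lists” idea can be illustrated independently of any specific format. The following Python sketch is neither Silverstack’s sealing nor ASC MHL nesting; it assumes that every copied folder contains a manifest with the hypothetical name manifest.json and records all of them, together with a hash of each manifest, in a hypothetical seal.json file at the top of the drive.

```python
import hashlib
import json
from pathlib import Path


def write_seal(drive_root, manifest_name="manifest.json", seal_name="seal.json"):
    """Record every manifest found on the drive together with its hash."""
    drive_root = Path(drive_root)
    entries = {
        p.relative_to(drive_root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(drive_root.rglob(manifest_name))
    }
    (drive_root / seal_name).write_text(json.dumps(entries, indent=2))
    return entries


def check_seal(drive_root, seal_name="seal.json"):
    """Return manifests that are listed in the seal but missing or changed."""
    drive_root = Path(drive_root)
    entries = json.loads((drive_root / seal_name).read_text())
    problems = []
    for relative_path, expected_hash in entries.items():
        manifest = drive_root / relative_path
        if not manifest.is_file():
            problems.append((relative_path, "missing"))
        elif hashlib.sha256(manifest.read_bytes()).hexdigest() != expected_hash:
            problems.append((relative_path, "changed"))
    return problems
```

If a whole folder is lost in a later copy, its manifest is reported as missing even though all remaining folders still verify against their own manifests.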

Best Practices

Unfortunately, the manifest files don’t check anything themselves. Hence, users need to find the right places in the workflow where the manifest is used to check available data for completeness.

So if you are the person creating manifest files together with file copies, make the person receiving this data aware of the manifest files and how they can be checked. For Silverstack’s seals and MHL files, there is a free application for macOS and Windows called “Pomfort SealVerify”. As of today (January 2023), ASC MHL is still quite new, so not all data management tools can write or verify such manifests. The open source reference implementation includes a command line tool that can be used to verify drives or folders with ASC MHL files.

Motivate the receiver of the data package to check completeness with the given manifest files and the mentioned tools. The earlier an issue is spotted, the higher the probability that it can be easily resolved. Also, test and check the data you are handing over yourself once in a while, so that you are familiar with the tools and know that they work.

Please note: ASC MHL has been available in a beta of Silverstack and will be officially released with Silverstack v8.4 in early February 2023. 

Summary

In this article, we discussed how manifest files containing file lists, hashes, and context information are a powerful tool to document what data has been processed and what should have been received after a data transfer. With the content of such manifest files, you can check the consistency of every single file (by comparing hash values) and the completeness of entire data sets (by comparing file lists). With regular checks in critical moments and sufficient backups for recovery in case of an issue, you are well-prepared for a successful data management workflow.

__________________________

[1] You could argue that there is a level even before that: Making sure that each individual file is complete, but that’s covered by checking “integrity” with hashes as seen in the previous article.

[2] Of course, you can get a bit more sophisticated by combining md5 with the find command, e.g. by doing something like find A001R2EC/ -type f -exec md5 {} \;. But now you made the first step into becoming the developer of data management software, and that goes beyond the scope of this article.

About the Author
Patrick is head of products for Pomfort’s on-set applications. He combines a technical background in software engineering with practical experience in digital film productions – and a preference for things that actually work.