One of the most important goals of data management on set is to maintain the integrity and completeness of all data recorded and created during production. Since this is your responsibility as the Digital Imaging Technician or Data Manager involved, it’s useful to familiarize yourself with a few core concepts.
This article is the first of a two-part series providing a basic introduction to the technical aspects of data management. We will take a closer look at what data integrity means, some potential issues that might threaten it, and how checksums and hash algorithms help protect it.
In a future article, we will then move on to the second pillar of successful data management by discussing manifest files and how they are used to maintain the completeness of data. But, first things first – let’s talk data integrity!
Maintaining integrity (or “data integrity”) means ensuring that the data is “correct” over its entire lifetime. For recorded media files, this means a file has not been altered unintentionally and contains the same content as it did when it was recorded in camera.
What sounds quite simple in theory can become tricky in reality. Just think about the long list of hardware and software components that need to work together when transferring a file from a camera card to an external hard drive:
- different devices connected through pluggable cables,
- a card reader with its own controller and connectors,
- SSD or disk controller components and caches,
- USB or Thunderbolt interface components,
- possibly a RAID system (hardware or software),
- several file systems (possibly of different types), sometimes a virtual file system (e.g., the Codex VFS) that creates file data for a volume on demand,
- an operating system with file access routines for reading and writing files, an access rights management system, file caching mechanisms in RAM, and multithreading support,
- … and the software application that performs the data transfer, such as Pomfort Silverstack or the Offload Manager.
The good news: unwanted and unexpected changes to a file don’t occur often. However, they’re not impossible either. The sheer number of physical and interacting components, with their connectors and cable connections, independent power supplies, and different vendors’ firmware and software versions, increases the chance of something going wrong in certain situations. So, what are potential consequences? Let’s look at a few examples of issues that can arise during the transfer or storage of files:
- Empty file: This can happen during copy, when a file is created but writing its content was not successful. Possible reasons: full media, insufficient access rights, or a process aborted after the file was created.
- Shortened file: This can happen during copy due to an incomplete write process. Possible reasons: an aborted copy, full media, or a power or connection failure with no resume of the process.
- Wrong content of file: There are several possible reasons for this. It can happen when blocks get out of order during writing, e.g., due to an issue or bug with multithreading. It can also happen when a data block is allocated on a volume but never written; old content written before the card was formatted can then “shine through” and appear as wrong content. Another possibility is bit errors during transfer or storage, for example, with faulty or unreliable components or storage/memory. The content of a file can also be entirely wrong or corrupt, for example, when the file system structure is corrupt.
- File modification through editing: A file can also be edited, for example, when a software application overwrites the entire file or parts of it upon opening, which can happen unintentionally or go undetected by the user.
How to achieve data integrity
Alright, enough with the horror scenarios. You’re now probably asking yourself what is done to help maintain data integrity. In the following, we’ll introduce you to a core concept that helps detect the above issues: The calculation and verification of checksums for individual files.
First, let’s look at what checksums are, how hash algorithms are used to create checksums, and how we can use them to check the integrity of file content.
In order to detect errors in files during transmission or storage, it’s common to create checksums. The idea here is to use an algorithm (the checksum function) that creates a small piece of data (the checksum) from an arbitrarily sized block of data (for example, the file). A good checksum function has a very (very, very) high probability that the checksum for a given file will become different when any part of the file’s content changes. The checksum function is also deterministic, so you can re-calculate the checksum at any time.
In other words:
When you calculate a file’s checksum after the file was transferred or copied, you can ensure with a very (very, very) high probability that its content remained the same by comparing the calculated checksum with a checksum that was previously calculated for that file (e.g., on the source volume). If the checksums are not equal, you can be positive that the content changed. If the checksums are equal, you can assume an unmodified file.
This is the basic principle behind all file integrity checks that are implemented in software when using checksums to detect changes in files.
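This principle can be sketched in a few lines of code. The following is a minimal example using Python’s standard-library `hashlib` and `shutil` modules; the file name and content are made up for illustration, and MD5 is used as one of the hash types mentioned later in this article:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute an MD5 checksum by reading the file in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: copy a file and verify that source and destination match.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "clip0001.mov"          # hypothetical camera file
    dst = Path(tmp) / "copy" / "clip0001.mov"
    src.write_bytes(b"example camera file content")

    source_sum = file_checksum(src)           # checksum on the source volume
    dst.parent.mkdir()
    shutil.copyfile(src, dst)                 # the transfer
    dest_sum = file_checksum(dst)             # checksum after the copy

    # Equal checksums: we can assume an unmodified file.
    assert source_sum == dest_sum, "content changed during transfer!"
```

Reading the file in fixed-size chunks (rather than all at once) keeps memory usage constant even for large media files, which is how real-world offload tools typically operate.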
There is a broad range of checksum methods, some for very small pieces of data, some even capable of error correction. But the best-suited and most commonly used kind of checksum function for ensuring file data integrity is the hash algorithm.
A hash function maps an arbitrarily sized block of data (i.e., the content of a file) to a short, fixed-size value. This matches exactly our requirement for a checksum function. The value created by the hash function is called the “hash value”, and the name of the hash algorithm used to create it is sometimes also called the “hash type” (for example, a file might have the hash value f5b96775f6c2d310d585bfa0d2ff633c of the hash type MD5).
According to Wikipedia, “[a] good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize duplication of output values (collisions)”.
A “collision” occurs when two different blocks of data result in the same hash value. The chance of this happening should be as small as possible. Instead, every hash value in the output range should be generated with roughly the same probability. With that, the hash algorithm is well prepared for our purpose – it results in a different hash value whenever the given data is different.
Here are a few examples of hash algorithms typically used in media management processes:
- MD5 (128 bit),
- xxhash64 (64 bit),
- C4 (512 bit).
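Note that xxhash64 and C4 are not part of Python’s standard library (they require third-party packages), so the following sketch uses stdlib algorithms of comparable lengths to illustrate how the hash type determines the length of the hash value for the very same input data:

```python
import hashlib

data = b"the same file content, hashed with different hash types"

md5_value    = hashlib.md5(data).hexdigest()       # 128 bit -> 32 hex chars
sha512_value = hashlib.sha512(data).hexdigest()    # 512 bit -> 128 hex chars
# BLAKE2b with an 8-byte digest stands in for a 64-bit hash like xxhash64:
blake2_value = hashlib.blake2b(data, digest_size=8).hexdigest()  # 16 hex chars

for name, value in [("MD5", md5_value),
                    ("SHA-512", sha512_value),
                    ("BLAKE2b-64", blake2_value)]:
    print(f"{name:>10}: {len(value) * 4} bit  {value}")
```

Each hex character encodes 4 bits, so a 128-bit MD5 value is always 32 characters long, regardless of how large the input file is.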
Let’s consider a hash algorithm that produces fully uniformly distributed values. Then the chance of a collision between two file contents (i.e., that a modified file results in the same hash value as the original) is determined by the length of the hash as follows:

P(collision) = 1 / 2^l

where l is the length of the hash in bits.
So for a hash algorithm with hash values of 64-bit length (for example, xxhash64), this probability is 1 / 2^64, which is about 1/18,446,744,073,709,551,616 or 5.42101086 × 10^-20. Put differently, you would need to try about 185,395,973,344,368,338 (185 quadrillion) random changes of the same file until the overall collision probability of all your comparisons exceeds 1%. That’s ~5,878,867 (5.9 million) years of trying 1,000 changes of a given file per second – with a 99% chance that still no collision has occurred.
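These numbers can be reproduced with a few lines of arithmetic. The derivation below assumes an ideal, uniformly distributed 64-bit hash; the number of attempts n until the cumulative collision probability exceeds 1% follows from 1 − (1 − p)^n ≥ 0.01:

```python
import math

l = 64             # hash length in bits (e.g. xxhash64)
p = 2.0 ** -l      # collision probability for a single comparison

# 1 - (1 - p)**n >= 0.01  =>  n >= ln(0.99) / ln(1 - p) ≈ -ln(0.99) / p
n = -math.log(0.99) / p
print(f"{n:.6e}")  # ≈ 1.85e17 attempts

# At 1000 attempts per second (using the average Julian year of 365.25 days):
years = n / 1000 / (365.25 * 24 * 3600)
print(f"{years:.2e}")  # ≈ 5.9 million years
```

The approximation ln(1 − p) ≈ −p is used because p = 2^-64 is far below floating-point precision when subtracted from 1; for such tiny probabilities the approximation is essentially exact.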
Even given that hash algorithms do not distribute values perfectly uniformly, you can imagine that 64 bit is already quite a good size for a hash value to detect arbitrary changes in a file.
The hash algorithm can be the limiting factor of data transfer since we always want to create hashes during the transfer process. Hence, in addition to transferring the actual data, the hash value for that data also needs to be computed – which takes additional CPU time, of course. The website for the xxhash family of hash algorithms provides an overview of the speeds of different hash algorithms.
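Creating the hash during the transfer can be sketched as a combined copy-and-hash loop: each chunk is hashed as it passes from source to destination, so the data only needs to be read once. This is a minimal illustration with stdlib modules and a made-up dummy file, not a description of how any particular offload tool is implemented:

```python
import hashlib
import tempfile
from pathlib import Path

def copy_with_checksum(src_path, dst_path, chunk_size: int = 1 << 20) -> str:
    """Copy src to dst chunk by chunk, hashing each chunk in passing."""
    h = hashlib.md5()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            h.update(chunk)   # CPU time for hashing is spent during the copy
            dst.write(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    dst = Path(tmp) / "dest.bin"
    src.write_bytes(b"\x00" * 4_000_000)   # 4 MB dummy "media file"

    transfer_sum = copy_with_checksum(src, dst)
    # Re-reading the destination yields the same checksum:
    assert transfer_sum == hashlib.md5(dst.read_bytes()).hexdigest()
```

If the hash computation for a chunk is slower than reading and writing it, hashing becomes the bottleneck of the whole transfer – which is exactly why the choice of hash algorithm matters for offload speed.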
A first takeaway is that the length (i.e., the number of bits the hash values have) does not necessarily correspond to the speed of the algorithms. Another takeaway is that the differences in maximum speed can be substantial (e.g., up to a factor of ~50x between XXH3 and MD5 on the same computer hardware). This stems from the fact that hashes must be calculated sequentially over the data, byte by byte – they are not easy to parallelize. This means that the speed of one CPU core limits the process of calculating hashes – and switching that one hashing process to a CPU with more cores will not make hashing faster (your software might, of course, hash multiple files at once independently to improve overall throughput).
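The per-file parallelism mentioned above can be sketched as follows: each individual hash is computed sequentially, but independent files are handed to a thread pool. This is an illustrative sketch with in-memory dummy data and invented clip names; in CPython, `hashlib` releases the GIL for large inputs, so threads can genuinely hash in parallel:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of(data: bytes) -> str:
    """Sequentially hash one file's content."""
    return hashlib.md5(data).hexdigest()

# Four dummy "clips" of 2 MB each, standing in for camera files.
files = {f"clip{i:04d}.mov": bytes([i]) * 2_000_000 for i in range(4)}

# One hash per file, computed concurrently across worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    checksums = dict(zip(files, pool.map(md5_of, files.values())))

for name, value in checksums.items():
    print(name, value)
```

Note that this improves overall throughput across many files; the time to hash any single large file is still bound by one core’s speed.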
Conclusion and outlook
In this article, we discussed how the creation of checksums helps to detect issues with the integrity of data. We talked about the use of hash algorithms and the use of hash values as checksums, and showed how a software application can find out if data has changed during its lifetime and warn the user accordingly.
In an upcoming article, we will take a look at the aspect of completeness and what measures are taken in data management processes to ensure that no file is forgotten. So, stay tuned and keep an eye out for part two of this series!