Comparing Folders with Python

I recently wrote a small folder comparison tool for Windows. It’s a thin graphical interface on top of the basic feature set of filecmp, a module in Python’s standard library.

Here’s a download for the standalone .exe. You can find the source code on GitHub.

It looks like this: [screenshot of the folder comparison tool]

Why not Bash?

The users who requested the tool needed the ability to compare the contents of two folders, quickly and at will, thousands of times over the next few months. This seems like a perfect fit for the diff command, run from a Bash shell.

Using the test data in the Git repo above, the command would be:

$ diff -r tests/control_data_1 tests/control_data_2

Which produces the output:

Only in tests/control_data_1/Data Folder/test_folder: test_image.bmp
Only in tests/control_data_1/Data Folder: test_text.txt
Only in tests/control_data_1/Data Folder: test_word.docx
Binary files tests/control_data_1/Data Folder/test_zip.zip and tests/control_data_2/Data Folder/test_zip.zip differ

Saving the output to a text file is as easy as adding > output.txt to the end of the command. If any other options are needed, they are probably already built in and documented on the man page.

But most of the end users run Windows and are not comfortable with the command line, so unfortunately Bash isn’t an option.

(I personally use and highly recommend Git Bash for technical or open-minded Windows users in this situation.)

Why Python?

I looked into similar commands using the native Windows command prompt, but that proved to be a nightmare. The Windows FC command comes nowhere close to the feature set of diff, and has no built-in support for subdirectories.

Stack Overflow provides the cleanest code I could find for a .bat file that met my needs:

@echo off &setlocal
set "folderA=D:\NONMEM7.3beta7.0"
set "folderB=D:\NONMEM7.3beta7.0Renamed"
for %%a in ("%folderA%\*.f90") do if not exist "%folderB%\%%~na_recoded%%~xa" echo %%~na_recoded%%~xa not found in %folderB%.
for %%a in ("%folderB%\*.f90") do for /f "delims=_" %%b in ("%%~na") do if not exist "%folderA%\%%~b%%~xa" echo %%~b%%~xa not found in %folderA%.

It might work, but I certainly wouldn’t want to maintain it.

So command line tools and batch files are out. In that case, a minimal graphical interface is probably best, ideally as a standalone .exe program.

And so our final program needs:

Some version of the diff feature set
A graphical interface
An installer to turn it into a standalone Windows program

Python has a standard library module for the first two, and PyInstaller can take care of the packaging. I also think it’s the most fun to code in.

Comparing files with the filecmp module

The filecmp module is included in the standard library for both Python 2 and Python 3. This means that, like the diff command, it should be battle-tested and reliable. Hopefully it also means that it is well documented, and any quirks that do exist have already been raised and answered on Stack Overflow.

For the most part, I’ve found this to be true. It takes three lines of Python to achieve results similar to diff:

import filecmp
comparison = filecmp.dircmp('./tests/control_data_1', './tests/control_data_2')
comparison.report_full_closure()

Which, when using the test data in the Git repo, produces the output:

diff ./tests/control_data_1 ./tests/control_data_2
Common subdirectories : ['Data Folder']

diff ./tests/control_data_1/Data Folder ./tests/control_data_2/Data Folder
Only in ./tests/control_data_1/Data Folder : ['test_empty_folder', 'test_text.txt', 'test_word.docx']
Identical files : ['test_excel.xlsx', 'test_powerpoint.pptx']
Differing files : ['test_zip.zip']
Common subdirectories : ['test_folder']

diff ./tests/control_data_1/Data Folder/test_folder ./tests/control_data_2/Data Folder/test_folder
Only in ./tests/control_data_1/Data Folder/test_folder : ['test_image.bmp']
Identical files : ['test_folder_text.txt']

That’s pretty good! More granular properties of the comparison object like comparison.left_only and comparison.common_dirs are also available.
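As a minimal sketch of those properties, the example below builds two throwaway folders (every folder and file name in it is invented for illustration) and reads left_only, right_only, and common_dirs from the comparison object.

```python
import filecmp
import os
import shutil
import tempfile

# Build two small throwaway folders so the example is self-contained.
# All folder and file names below are made up for illustration.
root = tempfile.mkdtemp()
left = os.path.join(root, 'left')
right = os.path.join(root, 'right')
for folder in (left, right):
    os.makedirs(os.path.join(folder, 'shared_folder'))

with open(os.path.join(left, 'only_left.txt'), 'w') as f:
    f.write('left')
with open(os.path.join(right, 'only_right.txt'), 'w') as f:
    f.write('right')

comparison = filecmp.dircmp(left, right)
print(comparison.left_only)    # names found only in the first folder
print(comparison.right_only)   # names found only in the second folder
print(comparison.common_dirs)  # sub-folders present in both

# The attributes are computed on first access, so clean up afterwards
shutil.rmtree(root)
```

Note that these lists hold bare names for one folder level only; sub-folders are not descended into until you ask for them.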

Problems with content diffs

Originally I wanted to have two groups for files with matching filepaths – same and diff – based on a binary content match similar to the diff command. Unfortunately, the matching was found to be unreliable during testing, so I decided to go without it. The end users were only concerned with name-based file existence anyway. No sense burning time on unwanted, bug-prone features.

Below is the failing test. In my runs, roughly 75% of comparisons were false positives: the two files were reported as identical even though their contents differ.

import filecmp
import os
import shutil

def test_diff_files():
    """Create two files with different contents and compare results."""

    folder1 = 'results1'
    folder2 = 'results2'
    os.mkdir(folder1)
    os.mkdir(folder2)

    file1 = os.path.join(folder1, 'hello_world.txt')
    with open(file1, 'w') as file:
        file.write('foo')

    file2 = os.path.join(folder2, 'hello_world.txt')
    with open(file2, 'w') as file:
        file.write('bar')

    comparison = filecmp.dircmp(folder1, folder2)

    try:
        assert comparison.diff_files == ['hello_world.txt']
        assert comparison.same_files == []
    finally:
        # Always remove the test folders, even when an assertion fails
        shutil.rmtree(folder1)
        shutil.rmtree(folder2)

# Run the test 100 times to get percent accuracy
failures = 0
for _ in range(100):
    try:
        test_diff_files()
    except AssertionError:
        failures += 1

print("%i 'same files' out of 100 expected 'diff files'" % failures)
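A likely explanation, which the test above does not investigate, is that dircmp compares files shallowly by default: two files count as identical when their os.stat() signatures (file type, size, and modification time) match. 'foo' and 'bar' are the same size and are written moments apart, so the signatures can collide depending on the filesystem's timestamp resolution. The sketch below forces a byte-by-byte comparison with filecmp.cmpfiles and shallow=False. The helper name deep_diff_files is hypothetical and is not part of the tool.

```python
import filecmp
import os
import shutil
import tempfile


def deep_diff_files(folder1, folder2):
    """Return (same, diff) lists for common files, comparing contents.

    Hypothetical helper: it uses dircmp only to find the common file
    names, then re-compares them byte by byte with shallow=False.
    """
    comparison = filecmp.dircmp(folder1, folder2)
    same, diff, errors = filecmp.cmpfiles(
        folder1, folder2, comparison.common_files, shallow=False)
    # Files that could not be compared are reported as differing here
    return same, diff + errors


# Recreate the scenario from the failing test inside a temp directory
root = tempfile.mkdtemp()
folder1 = os.path.join(root, 'results1')
folder2 = os.path.join(root, 'results2')
os.mkdir(folder1)
os.mkdir(folder2)
for folder, text in ((folder1, 'foo'), (folder2, 'bar')):
    with open(os.path.join(folder, 'hello_world.txt'), 'w') as f:
        f.write(text)

same_files, diff_files = deep_diff_files(folder1, folder2)
shutil.rmtree(root)
print(same_files, diff_files)  # → [] ['hello_world.txt']
```

The trade-off is speed: a deep comparison reads both files in full, which matters for large folders.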

Using recursion to summarize data

The comparison results will be summarized in a dictionary with the keys left, right, and both. Each entry is reported as a path relative to the root folders being compared (denoted as .), using / as the path separator.

# prefix is the report path of the current folder pair ('.' for the roots)
comparison = filecmp.dircmp(folder1, folder2)

data = {
    'left': ['{}/{}'.format(prefix, i) for i in comparison.left_only],
    'right': ['{}/{}'.format(prefix, i) for i in comparison.right_only],
    'both': ['{}/{}'.format(prefix, i) for i in comparison.common_files],
}

Then, each sub-folder shared by the original folders needs to be compared. After each additional comparison is completed, the values from the sub-report are appended to the lists in the main report.

for folder in comparison.common_dirs:
    # Build a separate prefix for this sub-folder; changing prefix in
    # place would leak one sibling's name into the next sibling's paths
    sub_prefix = prefix + '/' + folder

    # Compare sub-folder and add results to the report
    sub_folder1 = os.path.join(folder1, folder)
    sub_folder2 = os.path.join(folder2, folder)
    sub_report = _recursive_dircmp(sub_folder1, sub_folder2, sub_prefix)

    # Add results from sub_report to main report
    for key, value in sub_report.items():
        data[key] += value

This will run as many times as it needs, until all sub-folders are exhausted. Any sub-folder that exists only in one location will not be searched further – it will just be included in either the left or right list.
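Putting the two pieces together, the whole function might look like the sketch below. The default prefix of '.' and the return statement are assumptions, and the published source may differ. The demo folders and file names are invented, with two sibling sub-folders so that the prefix handling is exercised.

```python
import filecmp
import os
import shutil
import tempfile


def _recursive_dircmp(folder1, folder2, prefix='.'):
    """Return a dict of relative paths under the keys left, right, both."""
    comparison = filecmp.dircmp(folder1, folder2)

    data = {
        'left': ['{}/{}'.format(prefix, i) for i in comparison.left_only],
        'right': ['{}/{}'.format(prefix, i) for i in comparison.right_only],
        'both': ['{}/{}'.format(prefix, i) for i in comparison.common_files],
    }

    for folder in comparison.common_dirs:
        # A fresh prefix per sub-folder keeps sibling folders independent
        sub_prefix = prefix + '/' + folder
        sub_report = _recursive_dircmp(
            os.path.join(folder1, folder),
            os.path.join(folder2, folder),
            sub_prefix,
        )
        for key, value in sub_report.items():
            data[key] += value

    return data


def touch(*parts):
    """Create a small file at the joined path (demo helper)."""
    with open(os.path.join(*parts), 'w') as f:
        f.write('x')


# Demo tree: both roots contain the sibling sub-folders 'a' and 'b'
root = tempfile.mkdtemp()
left, right = os.path.join(root, 'left'), os.path.join(root, 'right')
for side in (left, right):
    for sub in ('a', 'b'):
        os.makedirs(os.path.join(side, sub))

touch(left, 'a', 'x.txt')   # only on the left, inside 'a'
touch(left, 'b', 'y.txt')   # on both sides, inside 'b'
touch(right, 'b', 'y.txt')
touch(right, 'z.txt')       # only on the right, at the root

report = _recursive_dircmp(left, right)
shutil.rmtree(root)
print(report)
```

The file in the second sub-folder is reported as ./b/y.txt, which only holds if each recursive call receives its own prefix.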

Output to .txt and .csv

Outputting the summarized data to a .txt file is easy, and can be done without any imports. This was the earliest version of the report and is deliberately not DRY – it keeps the raw text report easy to change and format as requested by the end users.

filename = 'output.txt'
with open(filename, 'w') as file:
    file.write('COMPARISON OF FILES BETWEEN FOLDERS:\n')
    file.write('\tFOLDER 1: {}\n'.format(folder1))
    file.write('\tFOLDER 2: {}\n'.format(folder2))
    file.write('\n\n')

    file.write('FILES ONLY IN: {}\n'.format(folder1))
    for item in report['left']:
        file.write('\t' + item + '\n')
    if not report['left']:
        file.write('\tNone\n')
    file.write('\n\n')

    ...

Although the data isn’t strictly tabular, the end users requested the report be output to Excel, since that’s where further reconciliation would occur. Instead of getting the full-featured xlsxwriter from PyPI, I kept it simple and used the csv module from the standard library.

To start, a csv writer is created with the excel dialect.

import csv

filename = 'output.csv'
# newline='' prevents blank rows between records on Windows (Python 3)
with open(filename, 'w', newline='') as file:
    csv_writer = csv.writer(file, dialect='excel')

The headers are written to the first row, and the data is ordered to match.

headers = (
    "Files only in folder '{}'".format(folder1),
    "Files only in folder '{}'".format(folder2),
    "Files in both folders",
)
csv_writer.writerow(headers)

data = (
    report['left'],
    report['right'],
    report['both'],
)

Then, the data is written row by row to the CSV. The lists differ in length, so indexing past the end of a shorter one raises an IndexError. Catching it and writing None instead keeps the columns in the CSV aligned.

row_max = max(len(column) for column in data)
for row_index in range(row_max):
    values = []
    for column in data:
        # Use data from column if it exists, otherwise use None
        try:
            values.append(column[row_index])
        except IndexError:
            values.append(None)

    csv_writer.writerow(values)
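The same padding can also be expressed with itertools.zip_longest from the standard library, which fills the shorter columns with None for you. This is an alternative sketch, not the code the tool uses. The report contents are invented, and an in-memory buffer stands in for the open CSV file.

```python
import csv
import io
from itertools import zip_longest

# Stand-in report data; the paths are invented for illustration
report = {
    'left': ['./a.txt', './b.txt', './c.txt'],
    'right': ['./d.txt'],
    'both': ['./e.txt', './f.txt'],
}
data = (report['left'], report['right'], report['both'])

buffer = io.StringIO()  # in-memory stand-in for the open CSV file
csv_writer = csv.writer(buffer, dialect='excel')

# zip_longest yields one tuple per row and pads missing cells with None,
# which the csv module writes as an empty field
for values in zip_longest(*data):
    csv_writer.writerow(values)

print(buffer.getvalue())
```

The explicit loop with try/except is easier to step through, while zip_longest is shorter; both produce the same rows.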

Testing the results

Now the program only needs two directories as inputs, and outputs a recursive comparison report to .txt or .csv. The end users reviewed a sample report generated with test data that they provided, and were happy with the results. After a few formatting tweaks, of course.

Some of these tests were formalized using the built-in unittest module. This module is a little more verbose than pytest, nose, or raw assert statements, but these constraints often force me to build better tests.

For this module, three integration tests were created to cover the inputs and outputs agreed upon with the end users. A couple of unit tests were also created for the core _recursive_dircmp function to deal with edge cases like shared subdirectories and binary file comparison. When an integration test fails, these unit tests help identify whether the problem lies in the folder comparison function or in the .txt and .csv writing functions.
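For illustration, a unit test for the shared-subdirectory case might look like the sketch below. It is not taken from the project's test suite. The _recursive_dircmp defined here is a compact stand-in so the example runs on its own, and the test names, folder names, and file names are all hypothetical.

```python
import filecmp
import os
import shutil
import tempfile
import unittest


def _recursive_dircmp(folder1, folder2, prefix='.'):
    """Compact stand-in for the function under test (an assumption)."""
    comparison = filecmp.dircmp(folder1, folder2)
    data = {
        'left': ['{}/{}'.format(prefix, i) for i in comparison.left_only],
        'right': ['{}/{}'.format(prefix, i) for i in comparison.right_only],
        'both': ['{}/{}'.format(prefix, i) for i in comparison.common_files],
    }
    for folder in comparison.common_dirs:
        sub_report = _recursive_dircmp(
            os.path.join(folder1, folder),
            os.path.join(folder2, folder),
            prefix + '/' + folder,
        )
        for key, value in sub_report.items():
            data[key] += value
    return data


class TestRecursiveDircmp(unittest.TestCase):
    def setUp(self):
        # Fresh folders per test; addCleanup removes them even on failure
        self.root = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.root)
        self.left = os.path.join(self.root, 'left')
        self.right = os.path.join(self.root, 'right')
        for side in (self.left, self.right):
            os.makedirs(os.path.join(side, 'shared'))

    def _write(self, *parts):
        with open(os.path.join(*parts), 'w') as f:
            f.write('data')

    def test_shared_subdirectory_is_searched(self):
        self._write(self.left, 'shared', 'only_left.txt')
        self._write(self.left, 'shared', 'common.txt')
        self._write(self.right, 'shared', 'common.txt')
        report = _recursive_dircmp(self.left, self.right)
        self.assertEqual(report['left'], ['./shared/only_left.txt'])
        self.assertEqual(report['right'], [])
        self.assertEqual(report['both'], ['./shared/common.txt'])

    def test_unshared_subdirectory_is_not_searched(self):
        os.makedirs(os.path.join(self.left, 'extra'))
        self._write(self.left, 'extra', 'hidden.txt')
        report = _recursive_dircmp(self.left, self.right)
        # Only the folder itself is listed, not the file inside it
        self.assertEqual(report['left'], ['./extra'])
```

Saved as a module, this can be run with python -m unittest followed by the module name.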

Next steps

Now the program can create the comparison reports requested, and can run on any platform with a default Python 3.5 installation. But it still requires editing a text file and opening a terminal to run, which our end users would prefer not to do. To solve this, I used the tkinter module, which has a great cross-language tutorial on its official website. If you want to see the code I used in my interface, it’s all available on GitHub.

Thanks for reading! If you have any comments or suggestions for improvement, please send me an email or issue a pull request.