Comparing Folders with Python
I recently wrote a small folder comparison tool for Windows. It’s a thin graphical interface on top of the basic feature set of filecmp, a module in Python’s standard library.
Here’s a download for the standalone .exe. You can find the source code on GitHub.
It looks like this:
Why not Bash?
The users that requested the tool needed the ability to compare the contents of two folders, quickly and at will, thousands of times over the next few months. This seems like a perfect fit for the diff command available in Bash.
Using the test data in the Git repo above, the command would be:
$ diff -r tests/control_data_1 tests/control_data_2
Which produces the output:
Only in tests/control_data_1/Data Folder/test_folder: test_image.bmp
Only in tests/control_data_1/Data Folder: test_text.txt
Only in tests/control_data_1/Data Folder: test_word.docx
Binary files tests/control_data_1/Data Folder/test_zip.zip and tests/control_data_2/Data Folder/test_zip.zip differ
Saving the output to a text file is as easy as adding > output.txt to the end of the command. If any other options were needed, they are probably already built in and documented on the man page.
But most of the end users run Windows and are not comfortable with the command line, so unfortunately Bash isn’t an option.
(I personally use and highly recommend Git Bash for technical or open-minded Windows users in this situation.)
Why Python?
I looked into similar commands using the native Windows command prompt, but that proved to be a nightmare. The Windows FC command comes nowhere close to the feature set of diff, and has no built-in support for subdirectories.
Stack Overflow provides the cleanest code I could find for a .bat file that met my needs:
@echo off &setlocal
set "folderA=D:\NONMEM7.3beta7.0"
set "folderB=D:\NONMEM7.3beta7.0Renamed"
for %%a in ("%folderA%\*.f90") do if not exist "%folderB%\%%~na_recoded%%~xa" echo %%~na_recoded%%~xa not found in %folderB%.
for %%a in ("%folderB%\*.f90") do for /f "delims=_" %%b in ("%%~na") do if not exist "%folderA%\%%~b%%~xa" echo %%~b%%~xa not found in %folderA%.
It might work, but I certainly wouldn’t want to maintain it.
So command line tools and batch files are out. In that case, a minimal graphical interface is probably best, ideally as a standalone .exe program.
And so our final program needs:
- Some version of the diff feature set
- A graphical interface
- An installer to turn it into a standalone Windows program
Python has a standard library module for the first two, and PyInstaller can take care of the packaging. I also think it’s the most fun to code in.
Comparing files with the filecmp module
The filecmp module is included in the standard library for both Python 2 and Python 3. This means that, like the diff command, it should be battle-tested and reliable. Hopefully it also means that it is well documented, and any quirks that do exist have already been raised and answered on Stack Overflow.
For the most part, I’ve found this to be true. It takes 3 lines of Python to achieve results similar to diff:
import filecmp
comparison = filecmp.dircmp('./tests/control_data_1', './tests/control_data_2')
comparison.report_full_closure()
Which, when using the test data in the Git repo, produces the output:
diff ./tests/control_data_1 ./tests/control_data_2
Common subdirectories : ['Data Folder']
diff ./tests/control_data_1/Data Folder ./tests/control_data_2/Data Folder
Only in ./tests/control_data_1/Data Folder : ['test_empty_folder', 'test_text.txt', 'test_word.docx']
Identical files : ['test_excel.xlsx', 'test_powerpoint.pptx']
Differing files : ['test_zip.zip']
Common subdirectories : ['test_folder']
diff ./tests/control_data_1/Data Folder/test_folder ./tests/control_data_2/Data Folder/test_folder
Only in ./tests/control_data_1/Data Folder/test_folder : ['test_image.bmp']
Identical files : ['test_folder_text.txt']
That’s pretty good! More granular properties of the comparison object, like comparison.left_only and comparison.common_dirs, are also available.
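As a quick illustration of those attributes, here is a minimal sketch that builds two throwaway folders and inspects the comparison object. The helper make_tree and all file and folder names are made up for the example; they are not part of the tool.

```python
import filecmp
import os
import tempfile


def make_tree(root, files, dirs=()):
    """Create empty sub-folders and small text files under root (illustrative helper)."""
    for name in dirs:
        os.mkdir(os.path.join(root, name))
    for name in files:
        with open(os.path.join(root, name), 'w') as handle:
            handle.write(name)


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    make_tree(left, files=['a.txt', 'shared.txt'], dirs=['sub'])
    make_tree(right, files=['b.txt', 'shared.txt'], dirs=['sub'])

    comparison = filecmp.dircmp(left, right)
    # The attributes are computed lazily, so read them while the folders still exist
    left_only = sorted(comparison.left_only)        # names found only in the first folder
    right_only = sorted(comparison.right_only)      # names found only in the second folder
    common_files = sorted(comparison.common_files)  # file names present in both folders
    common_dirs = sorted(comparison.common_dirs)    # sub-folder names present in both folders

print(left_only, right_only, common_files, common_dirs)
```

Note that these attributes only describe the top level of each folder; nested folders need their own comparison.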
Problems with content diffs
Originally I wanted to have two groups for files with matching filepaths – same and diff – based on a binary content match similar to the diff command. Unfortunately, the matching was found to be unreliable during testing, so I decided to go without it. The end users were only concerned with name-based file existence anyway. No sense burning time on unwanted, bug-prone features.
Below is the failing test. In my runs, roughly 75% of the comparisons wrongly reported the two differing files as the same.
import filecmp
import os
import shutil


def test_diff_files():
    """Create two files with different contents and compare results."""
    folder1 = 'results1'
    folder2 = 'results2'
    os.mkdir(folder1)
    os.mkdir(folder2)
    file1 = os.path.join(folder1, 'hello_world.txt')
    with open(file1, 'w') as file:
        file.write('foo')
    file2 = os.path.join(folder2, 'hello_world.txt')
    with open(file2, 'w') as file:
        file.write('bar')
    comparison = filecmp.dircmp(folder1, folder2)
    try:
        assert comparison.diff_files == ['hello_world.txt']
        assert comparison.same_files == []
    except AssertionError:
        raise
    finally:
        shutil.rmtree(folder1)
        shutil.rmtree(folder2)


# Run the test 100 times to get percent accuracy
failures = 0
for _ in range(100):
    try:
        test_diff_files()
    except AssertionError:
        failures += 1
print("%i 'same files' out of 100 expected 'diff files'" % failures)
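A likely explanation, going by how the standard library documents filecmp, is that dircmp compares files in "shallow" mode by default: two files whose os.stat() signatures (file type, size, and modification time) match are taken to be equal without their contents being read. 'foo' and 'bar' are the same size and are written moments apart, so their signatures can easily coincide. The sketch below is not what the tool does; it only shows the difference between the default check and filecmp.cmp with shallow=False. The file names are made up, and the os.utime calls are there purely to make the effect reproducible.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    file1 = os.path.join(root, 'first.txt')
    file2 = os.path.join(root, 'second.txt')
    with open(file1, 'w') as handle:
        handle.write('foo')
    with open(file2, 'w') as handle:
        handle.write('bar')

    # Force identical access/modification times so both stat signatures match
    os.utime(file1, (1000000000, 1000000000))
    os.utime(file2, (1000000000, 1000000000))

    filecmp.clear_cache()
    shallow_result = filecmp.cmp(file1, file2)              # shallow=True is the default
    deep_result = filecmp.cmp(file1, file2, shallow=False)  # reads and compares the contents

print(shallow_result, deep_result)
```

With matching signatures the shallow check reports the files as equal, while the content check reports them as different. For a whole folder, filecmp.cmpfiles(folder1, folder2, comparison.common_files, shallow=False) would be one way to re-check the shared files by content.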
Using recursion to summarize data
The comparison results will be summarized in a dictionary with the keys left, right, and both. Each file analyzed will be formatted with a full filepath relative to the root folders compared (denoted as .) and the / path separator.
# prefix is the path of the current folder relative to the roots ('.' at the top level)
comparison = filecmp.dircmp(folder1, folder2)
data = {
    'left': [r'{}/{}'.format(prefix, i) for i in comparison.left_only],
    'right': [r'{}/{}'.format(prefix, i) for i in comparison.right_only],
    'both': [r'{}/{}'.format(prefix, i) for i in comparison.common_files],
}
Then, each sub-folder shared by the original folders needs to be compared. After each additional comparison is completed, the values from the sub-report are appended to the lists in the main report.
if comparison.common_dirs:
    for folder in comparison.common_dirs:
        # Build the prefix for this sub-folder without modifying the parent
        # prefix, so that sibling sub-folders don't inherit each other's names
        sub_prefix = prefix + '/' + folder
        # Compare sub-folder and add results to the report
        sub_folder1 = os.path.join(folder1, folder)
        sub_folder2 = os.path.join(folder2, folder)
        sub_report = _recursive_dircmp(sub_folder1, sub_folder2, sub_prefix)
        # Add results from sub_report to main report
        for key, value in sub_report.items():
            data[key] += value
This will run as many times as it needs, until all sub-folders are exhausted. Any sub-folder that exists only in one location will not be searched further – it will just be included in either the left or right list.
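Putting the two fragments together, the whole function might look roughly like the sketch below. The default prefix of '.' and the small write_file helper are assumptions for the example, as are the folder and file names. The sketch builds a new prefix for each sub-folder instead of appending to a shared one, so sibling folders do not pick up each other's names.

```python
import filecmp
import os
import tempfile


def _recursive_dircmp(folder1, folder2, prefix='.'):
    """Return paths found only in folder1 ('left'), only in folder2 ('right'), or in 'both'."""
    comparison = filecmp.dircmp(folder1, folder2)
    data = {
        'left': ['{}/{}'.format(prefix, i) for i in comparison.left_only],
        'right': ['{}/{}'.format(prefix, i) for i in comparison.right_only],
        'both': ['{}/{}'.format(prefix, i) for i in comparison.common_files],
    }
    for folder in comparison.common_dirs:
        # A fresh prefix per sub-folder keeps sibling paths independent
        sub_prefix = prefix + '/' + folder
        sub_report = _recursive_dircmp(
            os.path.join(folder1, folder),
            os.path.join(folder2, folder),
            sub_prefix,
        )
        for key, value in sub_report.items():
            data[key] += value
    return data


def write_file(*parts):
    """Create a small text file, making parent folders as needed (illustrative helper)."""
    path = os.path.join(*parts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as handle:
        handle.write('example')


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    write_file(left, 'a', 'x.txt')
    write_file(left, 'b', 'y.txt')
    os.mkdir(os.path.join(right, 'a'))
    write_file(right, 'b', 'y.txt')
    write_file(right, 'b', 'z.txt')
    report = _recursive_dircmp(left, right)
    # Sort each list so the printed result does not depend on directory listing order
    report = {key: sorted(value) for key, value in report.items()}

print(report)
```

Here x.txt should land in the left list, z.txt in the right list, and y.txt in both, each with its sub-folder in the path.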
Output to .txt and .csv
Outputting the summarized data to a .txt file is easy, and can be done without any imports. This was the earliest version of the report and is deliberately not DRY – it keeps the raw text report easy to change and format as requested by the end users.
filename = 'output.txt'
with open(filename, 'w') as file:
    file.write('COMPARISON OF FILES BETWEEN FOLDERS:\n')
    file.write('\tFOLDER 1: {}\n'.format(folder1))
    file.write('\tFOLDER 2: {}\n'.format(folder2))
    file.write('\n\n')
    file.write('FILES ONLY IN: {}\n'.format(folder1))
    for item in report['left']:
        file.write('\t' + item + '\n')
    if not report['left']:
        file.write('\tNone\n')
    file.write('\n\n')
    ...
Although the data isn’t strictly tabular, the end users requested the report be output to Excel, since that’s where further reconciliation would occur. Instead of getting the full-featured xlsxwriter from PyPI, I kept it simple and used the csv module from the standard library.
To start, a csv writer is created with the excel dialect.
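import csv

filename = 'output.csv'
# In Python 3, newline='' stops blank rows from appearing between records on Windows
with open(filename, 'w', newline='') as file:
    csv_writer = csv.writer(file, dialect='excel')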
The headers are written to the first row, and the data is ordered to match.
    # Still inside the with block, so the file is open
    headers = (
        "Files only in folder '{}'".format(folder1),
        "Files only in folder '{}'".format(folder2),
        "Files in both folders",
    )
    csv_writer.writerow(headers)
    data = (
        report['left'],
        report['right'],
        report['both'],
    )
Then, the data is written row by row to the CSV. The three lists usually differ in length, so once a shorter list runs out of items, indexing it raises an IndexError. Just catch these and substitute None, which the csv writer outputs as an empty cell, to keep the columns in the CSV aligned.
    # Still inside the with block, so the file is open
    row_index = 0
    row_max = max(len(column) for column in data)
    while row_index < row_max:
        values = []
        for column in data:
            # Use data from column if it exists, otherwise use None
            try:
                values += [column[row_index]]
            except IndexError:
                values += [None]
        csv_writer.writerow(values)
        row_index += 1
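An alternative to catching IndexError is itertools.zip_longest, which pads the shorter columns with a fill value. This is not what the tool uses; it is a minimal sketch with a made-up report dictionary and shortened headers, writing to an in-memory buffer instead of a file so it can be run anywhere.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical report data; the real report comes from the folder comparison
report = {
    'left': ['./a.txt', './b.txt', './c.txt'],
    'right': ['./d.txt'],
    'both': ['./e.txt', './f.txt'],
}

buffer = io.StringIO(newline='')
csv_writer = csv.writer(buffer, dialect='excel')
csv_writer.writerow(('left', 'right', 'both'))

columns = (report['left'], report['right'], report['both'])
# zip_longest yields one tuple per row, padding exhausted columns with None,
# which the csv writer emits as an empty cell
for row in zip_longest(*columns, fillvalue=None):
    csv_writer.writerow(row)

rows = buffer.getvalue().splitlines()
print(rows)
```

The result has one header row plus as many data rows as the longest list, with empty cells where a shorter list ran out.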
Testing the results
Now the program only needs two directories as inputs, and outputs a recursive comparison report to .txt or .csv. The end users reviewed a sample report generated with test data that they provided, and were happy with the results. After a few formatting tweaks, of course.
Some of these tests were formalized using the built-in unittest module. This module is a little more verbose than pytest, nose, or raw assert statements, but these constraints often force me to build better tests.
For this module, three integration tests were created to cover the inputs and outputs agreed upon with the end users. A couple of unit tests were also created for the core _recursive_dircmp function to deal with edge cases like shared subdirectories and binary file comparison. When an integration test fails, these unit tests help identify whether the problem lies in the folder comparison function or in the .txt and .csv writing functions.
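To give a feel for what such a unit test can look like, here is a minimal sketch. It is not the project's actual test suite: the test names, the _touch helper, and the compact stand-in for _recursive_dircmp (included only so the example runs on its own) are all assumptions.

```python
import filecmp
import os
import shutil
import tempfile
import unittest


def _recursive_dircmp(folder1, folder2, prefix='.'):
    """Compact stand-in for the function under test."""
    comparison = filecmp.dircmp(folder1, folder2)
    data = {
        'left': ['{}/{}'.format(prefix, i) for i in comparison.left_only],
        'right': ['{}/{}'.format(prefix, i) for i in comparison.right_only],
        'both': ['{}/{}'.format(prefix, i) for i in comparison.common_files],
    }
    for folder in comparison.common_dirs:
        sub_report = _recursive_dircmp(
            os.path.join(folder1, folder),
            os.path.join(folder2, folder),
            prefix + '/' + folder,
        )
        for key, value in sub_report.items():
            data[key] += value
    return data


class TestRecursiveDircmp(unittest.TestCase):
    def setUp(self):
        # Fresh temporary folders for every test, removed again afterwards
        self.folder1 = tempfile.mkdtemp()
        self.folder2 = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.folder1)
        self.addCleanup(shutil.rmtree, self.folder2)

    def _touch(self, root, *parts):
        """Create a small text file, making parent folders as needed."""
        path = os.path.join(root, *parts)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'w') as handle:
            handle.write('example')

    def test_shared_subdirectory(self):
        self._touch(self.folder1, 'shared', 'only_left.txt')
        self._touch(self.folder2, 'shared', 'only_right.txt')
        report = _recursive_dircmp(self.folder1, self.folder2)
        self.assertEqual(report['left'], ['./shared/only_left.txt'])
        self.assertEqual(report['right'], ['./shared/only_right.txt'])
        self.assertEqual(report['both'], [])

    def test_unshared_subdirectory_is_not_searched(self):
        self._touch(self.folder1, 'left_only_dir', 'nested.txt')
        report = _recursive_dircmp(self.folder1, self.folder2)
        # Only the folder itself is listed, not the file nested inside it
        self.assertEqual(report['left'], ['./left_only_dir'])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRecursiveDircmp)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real test module the last two lines would normally be replaced by the usual unittest.main() call under an if __name__ == '__main__' guard.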
Next steps
Now the program can create the comparison reports requested, and can run on any platform with a default Python 3.5 installation. But it still requires editing a text file and opening a terminal to run, which our end users would prefer not to do. To solve this, I used the tkinter module, which has a great cross-language tutorial on its official website. If you want to see the code I used in my interface, it’s all available on GitHub.
Thanks for reading! If you have any comments or suggestions for improvement, please send me an email or issue a pull request.