I recently wrote a small folder comparison tool for Windows. It’s a small graphical interface on top of the basic feature set of filecmp, a module in Python’s standard library.
It looks like this:
Why not Bash?
The users who requested the tool needed the ability to compare the contents of two folders, quickly and at will, thousands of times over the next few months. This seems like a perfect fit for the Bash diff command.
Using the test data in the Git repo above, the command would be:
Which produces the output:
Saving the output to a text file would be as easy as adding > output.txt to the end of the command. If any other options were needed, they are probably already built in and documented on the man page.
But most of the end users run Windows and are not comfortable with the command line, so unfortunately Bash isn’t an option.
(I personally use and highly recommend Git Bash for technical or open-minded Windows users in this situation.)
I looked into similar commands using the native Windows command prompt, but that proved to be a nightmare. The Windows FC command comes nowhere close to the feature set of diff, and has no built-in support for subdirectories.
Stack Overflow provides the cleanest code I could find for a .bat file that met my needs:
@echo off &setlocal
set "folderA=D:\NONMEM7.3beta7.0"
set "folderB=D:\NONMEM7.3beta7.0Renamed"
for %%a in ("%folderA%\*.f90") do if not exist "%folderB%\%%~na_recoded%%~xa" echo %%~na_recoded%%~xa not found in %folderB%.
for %%a in ("%folderB%\*.f90") do for /f "delims=_" %%b in ("%%~na") do if not exist "%folderA%\%%~b%%~xa" echo %%~b%%~xa not found in %folderA%.
It might work, but I certainly wouldn’t want to maintain it.
So command line tools and batch files are out. In that case, a minimal graphical interface is probably best, ideally as a standalone Windows program.
And so our final program needs:
- Some version of the diff command
- A graphical interface
- An installer to turn it into a standalone Windows program
Python has a standard library module for the first two, and PyInstaller can take care of the packaging. I also think it’s the most fun to code in.
Comparing files with the filecmp module
The filecmp module is included in the standard library for both Python 2 and Python 3. This means that, like the diff command, it should be battle-tested and reliable. Hopefully it also means that it is well documented, and any quirks that do exist have already been raised and answered on Stack Overflow.
For the most part, I’ve found this to be true. It takes three lines of Python to achieve results similar to diff:
Which, when using the test data in the Git repo, produces the output:
That’s pretty good! More granular properties of the comparison object, like comparison.common_dirs, are also available.
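As a rough sketch of the idea above, the following uses filecmp.dircmp for both the printed report and the more granular attributes. The function names print_full_report and top_level_summary are hypothetical helpers made up for this illustration, and the folder paths are whatever the caller passes in.

```python
import filecmp


def print_full_report(left, right):
    """Print a diff-like report for two folders and every shared sub-folder."""
    comparison = filecmp.dircmp(left, right)
    comparison.report_full_closure()


def top_level_summary(left, right):
    """Return the granular attributes of a single, non-recursive comparison.

    The dictionary keys are illustrative; they simply mirror the dircmp
    attribute names.
    """
    comparison = filecmp.dircmp(left, right)
    return {
        "left_only": comparison.left_only,        # names found only in `left`
        "right_only": comparison.right_only,      # names found only in `right`
        "common_files": comparison.common_files,  # files present in both
        "common_dirs": comparison.common_dirs,    # sub-folders present in both
    }
```

Note that left_only and right_only contain both files and folders, while common_dirs lists only the sub-folders that exist on both sides.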
Problems with content diffs
Originally I wanted to have two groups for files with matching filepaths, including a diff group, based on a binary content match similar to the diff command. Unfortunately, the matching proved unreliable during testing, so I decided to go without it. The end users were only concerned with name-based file existence anyway. No sense burning time on unwanted, bug-prone features.
Below is the failing test – my runs show a false-positive rate around 75%.
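One possible contributor, offered here as an assumption rather than something verified against the test above, is that filecmp.cmp defaults to shallow=True, and dircmp uses shallow comparisons by default. A shallow comparison treats two regular files as equal when their type, size and modification time match, without reading the contents. The sketch below builds two such look-alike files; the helper names and file names are hypothetical.

```python
import filecmp
import os


def make_lookalike_files(folder):
    """Create two files with equal size and mtime but different contents."""
    path_a = os.path.join(folder, "a.bin")
    path_b = os.path.join(folder, "b.bin")
    with open(path_a, "wb") as f:
        f.write(b"AAAA")
    with open(path_b, "wb") as f:
        f.write(b"BBBB")
    # Force identical access and modification times on both files.
    for path in (path_a, path_b):
        os.utime(path, (1000000000, 1000000000))
    return path_a, path_b


def compare_both_ways(path_a, path_b):
    """Return (shallow_result, deep_result) for the same pair of files."""
    shallow = filecmp.cmp(path_a, path_b)               # default: shallow=True
    deep = filecmp.cmp(path_a, path_b, shallow=False)   # compares the bytes
    return shallow, deep
```

If this is the cause, calling filecmp.cmp with shallow=False on each shared file is one way to get a content-based answer, at the cost of reading every file.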
Using recursion to summarize data
The comparison results will be summarized in a dictionary of lists whose keys include both. Each file analyzed will be formatted with a full filepath relative to the root folders compared (denoted as .), using the / path separator.
Then, each sub-folder shared by the original folders needs to be compared. After each additional comparison is completed, the values from the sub-report are appended to the lists in the main report.
This repeats as many times as needed, until all sub-folders are exhausted. Any sub-folder that exists in only one location will not be searched further; it will just be included in the list for the side where it exists.
Output to .txt and .csv
Outputting the summarized data to a .txt file is easy, and can be done without any imports. This was the earliest version of the report and is deliberately not DRY, which keeps the raw text report easy to change and format as requested by the end users.
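A minimal sketch of such a text report is shown below. The function name write_txt_report, the section titles and the left and right keys are all assumptions for illustration; the sections are written out one by one on purpose, so each can be reworded independently.

```python
def write_txt_report(report, path):
    """Write a plain-text comparison report using only built-ins."""
    with open(path, "w") as f:
        f.write("COMPARISON REPORT\n")

        f.write("\nFiles only in the first folder:\n")
        for item in report["left"]:
            f.write(item + "\n")

        f.write("\nFiles only in the second folder:\n")
        for item in report["right"]:
            f.write(item + "\n")

        f.write("\nFiles in both folders:\n")
        for item in report["both"]:
            f.write(item + "\n")
```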
Although the data isn’t strictly tabular, the end users requested the report be output to Excel, since that’s where further reconciliation would occur. Instead of getting the full-featured xlsxwriter from PyPI, I kept it simple and used the csv module from the standard library.
To start, a csv writer is created. The headers are written to the first row, and the data is ordered to match.
Then, the data is written row by row to the CSV. Once a list runs out of items, indexing it raises an IndexError. Just catch these and replace them with None to keep the columns in the CSV aligned.
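The steps above might look something like this sketch. The function name write_csv_report and the default header names are assumptions for illustration; csv.writer writes None as an empty cell, which is what keeps the columns lined up.

```python
import csv


def write_csv_report(report, path, headers=("left", "right", "both")):
    """Write the report lists side by side as CSV columns."""
    # Order the data to match the header row.
    columns = [report[key] for key in headers]
    longest = max((len(column) for column in columns), default=0)

    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for i in range(longest):
            row = []
            for column in columns:
                try:
                    row.append(column[i])
                except IndexError:
                    # This list is exhausted; pad so later columns stay aligned.
                    row.append(None)
            writer.writerow(row)
```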
Testing the results
Now the program only needs two directories as inputs, and outputs a recursive comparison report to .csv. The end users reviewed a sample report generated with test data that they provided, and were happy with the results. After a few formatting tweaks, of course.
Some of these tests were formalized using the built-in unittest module. This module is a little more verbose than nose or raw assert statements, but these constraints often force me to build better tests.
For this module, three integration tests were created to cover the inputs and outputs agreed upon with the end users. A couple of unit tests were also created for the core _recursive_dircmp function to deal with edge cases like shared subdirectories and binary file comparison. When an integration test fails, these unit tests help identify whether the problem lies in the folder comparison function or in the .csv writing functions.
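To give a flavour of the unit-test style, here is a small sketch built on unittest and temporary folders. It exercises filecmp.dircmp directly rather than the project's own functions, and the class name, test names and folder names are hypothetical.

```python
import filecmp
import os
import tempfile
import unittest


class SharedSubdirTest(unittest.TestCase):
    """Edge cases around sub-folders, checked against filecmp.dircmp."""

    def setUp(self):
        # Build a throwaway folder tree that is removed after each test.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.left = os.path.join(self._tmp.name, "left")
        self.right = os.path.join(self._tmp.name, "right")
        for root in (self.left, self.right):
            os.makedirs(os.path.join(root, "shared"))
        os.makedirs(os.path.join(self.left, "left_only_dir"))

    def test_shared_subdirectory_is_detected(self):
        comparison = filecmp.dircmp(self.left, self.right)
        self.assertEqual(comparison.common_dirs, ["shared"])

    def test_one_sided_subdirectory_is_reported(self):
        comparison = filecmp.dircmp(self.left, self.right)
        self.assertEqual(comparison.left_only, ["left_only_dir"])
        self.assertEqual(comparison.right_only, [])
```

Saved in a test file, this can be run with python -m unittest.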
Now the program can create the comparison reports requested, and can run on any platform with a default Python 3.5 installation. But it still requires editing a text file and opening a terminal to run, which our end users would prefer not to do. To solve this, I used the tkinter module, which has a great cross-language tutorial on its official website. If you want to see the code I used in my interface, it’s all available on GitHub.
Thanks for reading! If you have any comments or suggestions for improvement, please send me an email or issue a pull request.