Haiku News Bot

This week I released my first bot on Twitter – HaikuNewsBot

It’s built in Python, and gets news from over 50 English sources using the free News API

HaikuNewsBot tweet

It’s not 100% accurate or super useful but it’s a cute proof of concept for future bots and parsing programs. I’ll add small updates over time as I receive feedback and notice bugs.

Parsing a Haiku

I settled on making this an English-only news bot, so I could use consistent (and familiar) language parsing techniques for every article title.

Most of the syllable counting logic is a combination of the CMU Pronouncing Dictionary from the nltk package, and the textstat package.

I wrote a custom syllable parser for numbers, so that ‘16,000’ is properly counted as four syllables (six-teen-thou-sand). I also wrote a parser for acronyms, discovering that the translation of letters to syllables is wonderfully simple: 3 if letter == 'w' else 1.

I’ll wait while you go through the alphabet in your head now.
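That letter rule fits in one line. Here it is as a minimal sketch (the function name is mine, for illustration):

```python
def acronym_syllables(acronym: str) -> int:
    """Count spoken syllables in an acronym, letter by letter.

    Every English letter name is one syllable except 'w'
    ("dou-ble-u"), which is three.
    """
    return sum(3 if letter == 'w' else 1 for letter in acronym.lower())

print(acronym_syllables('FBI'))  # F-B-I -> 3
print(acronym_syllables('WHO'))  # W-H-O -> 5
```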

Before running words through the parsers, I .split() the headline on all whitespace, stripped each element of non-alphanumeric characters, and analyzed each element by itself.

If an element is three characters or fewer and uppercase (e.g. FBI), assume it’s an acronym and use the character parsing logic above.

If an element is composed of letters and numbers (e.g. G20 or patio11), then split the alpha part(s) from the numeric part(s), and parse each of those elements separately.
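The splitting step can be sketched with a regex (this is an illustrative approach, not necessarily the exact code in the bot):

```python
import re

def split_alpha_numeric(element: str) -> list:
    """Split a mixed token into runs of letters and runs of digits."""
    return re.findall(r'[A-Za-z]+|[0-9]+', element)

print(split_alpha_numeric('G20'))      # ['G', '20']
print(split_alpha_numeric('patio11'))  # ['patio', '11']
```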

Collecting and Posting Tweets

Each time the bot runs, it gets about 15 new haikus from the API. It stores these in a local SQLite database, ranking them by haiku likelihood (it has about a 70% accuracy rate).

Whenever the bot is told to post a haiku, it chooses the highest-rated one from within the last day from the database. It tries to post it to Twitter, and does nothing if it fails (at the moment). Eventually I’d like to set it to email me if it fails with the error message.
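The selection query looks roughly like this (the table and column names are simplified stand-ins, with an in-memory database and sample rows for illustration):

```python
import sqlite3

# Hypothetical schema -- the real table and column names may differ.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE haikus (text TEXT, likelihood REAL, created_at TEXT)')
conn.executemany(
    "INSERT INTO haikus VALUES (?, ?, datetime('now'))",
    [('old pond / frog jumps in / splash', 0.9),
     ('some headline / almost but not / quite a haiku', 0.7)],
)

# Pick the highest-rated haiku from within the last day.
row = conn.execute(
    """SELECT text FROM haikus
       WHERE created_at >= datetime('now', '-1 day')
       ORDER BY likelihood DESC
       LIMIT 1"""
).fetchone()
print(row[0])
```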

Hosting the Service

The bot is hosted on a $5 Linode VPS with some other toy programs, and runs once an hour as a cron job (0 * * * *).



Where to Store API Keys

A common security issue that comes up when developing networked applications is where to store security credentials like API keys.

Basics

One of the first steps to securing API keys is to remember to never check them into source code repositories. If you ever commit and push code that includes a hard-coded API key, it is compromised, and you’ll need to revoke it and create a new one.

One option is to read the keys from files not synced to the source code repo. The problem arises when changing machines or working with multiple developers: it’s hard to pin down a consistent location for these key files.

The next logical step is to store them in Environment Variables, which can be used consistently on any machine / operating system (e.g. using Python’s os.environ). This is what I’ve been doing until now, generally storing the keys themselves in Keepass and typing them into the shell manually using export.
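Reading a key from the environment takes one line (the variable name below is just an example):

```python
import os

# The variable name is an example; set it beforehand in your shell with:
#   export EXAMPLE_NEWS_API_KEY=...
api_key = os.environ.get('EXAMPLE_NEWS_API_KEY', 'demo-key-for-illustration')
print(api_key)
```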

Using a .env File

Instead of manually typing these keys each time, I found a good convention from Epicodus that I’m going to start using.

Create a .env file in the Git repo of your application, and be sure to include .env in your .gitignore file. On each line of this file, add an export KEY=<VALUE> entry to declare each needed environment variable.

Now, at the beginning of each development session, instead of manually typing export for each specific key, just use source .env to load all your API keys.
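For reference, a .env file is just a list of export lines (the key names and values here are placeholders):

```shell
# .env -- listed in .gitignore, never committed
export NEWS_API_KEY="replace-with-real-key"
export TWITTER_API_SECRET="replace-with-real-secret"
```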

Considerations

Please note that this method only secures against accidental key exposure in source code and version control; it will not protect against someone with root access to your box.



Using a Router as a Wireless Access Point

The wireless signal in parts of my house was weak, and I wanted to improve it. I already had an existing router and modem, and figured I could improve the signal by adding one more router to the current setup.

It took a few hours to research everything and get it set up properly, so I’ll document the key steps here for when I need to set it up again.

Research your network

Before touching the new router, learn about your current network setup. Navigate to your current router’s admin page (e.g. 192.168.1.1) and log in with your username and password.

If you don’t remember your username / password, try the defaults located on a sticker on the bottom of the router. If those don’t work, factory reset the router by holding a paper clip on the reset button for 10 seconds.

Navigate to Advanced Settings > LAN > DHCP Server, and see what the ‘Starting Address’ and ‘Ending Address’ values are for the router’s IP pool (e.g. 192.168.1.2 to 192.168.1.254). This is the range of IP addresses on the subnet (e.g. 192.168.1.*) that your router can use for clients.

Make sure there is room on the subnet for your new router. If you want the new router to have address 192.168.1.2, then make sure to set the ‘Starting Address’ of the original router’s DHCP IP pool to 192.168.1.3 or higher.

Restart the original router if you made any changes.

Set up the new router

Unpack the new router, and cover the single WAN port with tape. Routers configured as access points do all input and output through the 4 LAN ports instead.

DO NOT plug the new router into the original router until the end.

Plug the new router into a power outlet. Plug an ethernet cable into any of the 4 LAN ports of the router, and plug the other end of the cable into your computer.

Configure new router as an access point

Look on the bottom of the new router for the default IP address, username, and password. Navigate to that address (e.g. 192.168.0.1) and log in.

On the admin page, navigate to Advanced > Network > LAN, and change the IP address to an open address in your original router’s subnet (e.g. 192.168.1.2). Save this change, and restart the router.

Wait a couple of minutes, then reconnect to the admin page using the new IP address.

Navigate to Advanced > Network > DHCP Server, and disable it entirely by unchecking ‘Enable DHCP Server’. Ignore all the other settings, since it’s disabled.

Configure wireless network names

Most modern devices, when presented with multiple networks using the same name and password, will juggle between those connections seamlessly based on which signal is strongest at the time.

Unless you have a device that can’t handle this, keep it simple and use a single network identity for all signals.

On both the old and new router, navigate to Basic > Wireless, and set the network name / password for both the 2.4GHz and 5GHz networks to the same values.

Test and celebrate

You should now have 4 signals, broadcasting from 2 devices, that appear to most computers as 1 wireless connection. Use something like Speedtest.net to measure your connection speed and confirm it has improved.

Assuming everything works, clean up your mess and organize your cables, then don’t think about your wireless network again until you move.



Ubuntu 17.04 on ZenBook Pro

After a little headache and some arcane magic, I managed to get an encrypted version of Ubuntu 17.04 installed on my laptop. The laptop is an ASUS ZenBook Pro UX501VW, which is like a 15” MacBook Pro, but half the price and without some crazy touch bar.

Most of my success is thanks to Peter van der Zee (qFox), who logged his battle to install Ubuntu 16.04 on the same laptop in 2016.

Since he did most of the work already, this post is mostly a collection of links with some minor addenda in case I need to do this again.

Set up the flash drive

First problem I had was that the flash drive I wanted to use to install the ISO wasn’t properly formatted when I started this process. I ended up using the GParted utility (sudo apt install gparted) to reformat it completely. Other, more direct, commands did nothing but cause more headache.

Once the flash drive is ready, download the Ubuntu 64-bit ISO, and install it using the Startup Disk Creator utility.

Installing

Basically, I just followed qFox’s 16.04 guide, but used the Ubuntu Gnome 17.04 image. Unlike previous versions of Ubuntu, I was able to encrypt the installation without hassle just by following the wizard (previous versions were unable to locate the partition after installing).

Every time you reboot to do some part of this process, you’ll need to hit esc to get to the boot select menu, then e to set grub parameters. Change the quiet splash on the GRUB_CMDLINE_LINUX_DEFAULT line to i915.preliminary_hw_support=1 nogpumanager.

If the installer still hangs after changing quiet splash in the grub parameters, drop into the root shell, and download the proprietary Nvidia drivers before installing. Use whichever ones are the newest if the above are outdated.

After fiddling with grub parameters and getting to the installation wizard, the process should be relatively easy. Install, encrypt, and update as normal.

Once you’re happy with your installation, update grub permanently using qFox’s settings:

sudo nano /etc/default/grub
# replace the "quiet splash" line with
# GRUB_CMDLINE_LINUX_DEFAULT="i915.preliminary_hw_support=1 acpi_osi= acpi_backlight=native"
sudo update-grub

High DPI (4k) resolution should work out of the box after a full update and reboot. You can fiddle with the scaling settings in Gnome to get it working better with certain apps (e.g. Keepass, Skype).

I’ll update this post in the future when I go through the process again with future releases.



Making a Huge Google Map

I recently made a 25000 x 25000 pixel Google Map to hang on my wall and track all the places I’ve explored in the city.

Final map hanging on wall

The final product took 30 minutes for my laptop to generate, 45 minutes to print, and cost about $50 in materials.

All of the code I used is available on GitHub. I cleaned up the script a bit from the original, but don’t expect it to work on your computer without slight modifications.

Creating the Image

The program currently requires Python 3.5+ because I love type hints. It uses the Selenium library for browser automation, and the Pillow library for image manipulation.

Before running, the program needs a few things from the user:

  • Figure out what percent to crop from each side of a screenshot to hide all unwanted, non-map elements (taskbars, extra displays, GUI elements)
  • Figure out a good starting point (top-left coordinate) for the map, and how many rows / columns should be used to capture the desired area

Once provided with that information, the program does the following:

  • Figure out how much to adjust the latitude and longitude for each screenshot so that the final grid of images lines up perfectly
  • Visit Google Maps, wait for it to load, and take each screenshot
  • Loop through the grid of saved images and combine them into one huge image
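The stitching step in the last bullet can be sketched with Pillow. This is a simplified version: the tile sizes are example values, and the solid-color in-memory tiles stand in for the saved screenshots.

```python
from PIL import Image

ROWS, COLS = 2, 2         # grid dimensions (example values)
TILE_W, TILE_H = 40, 30   # cropped screenshot size, in pixels

# Stand-in tiles; the real program pastes cropped browser screenshots.
tiles = {(r, c): Image.new('RGB', (TILE_W, TILE_H), (r * 100, c * 100, 50))
         for r in range(ROWS) for c in range(COLS)}

# Paste each tile at its grid position in one big canvas.
combined = Image.new('RGB', (COLS * TILE_W, ROWS * TILE_H))
for (row, col), tile in tiles.items():
    combined.paste(tile, (col * TILE_W, row * TILE_H))

print(combined.size)  # (80, 60)
```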

If you want to check it out for yourself, the main script is only 132 lines long, 40 of which are comments.

Printing and building

The build was fairly straightforward, and was supposed to be a trial run. However, the results came out so well that I decided to just keep this version.

Materials and costs:

  • 4 square feet of white vinyl = $8
  • 4 square feet of ink = $20
  • 3.5’ x 4.5’ sheet of foam board = $12
  • 500 assorted map tacks = $7

The build:

  • Map printing in plotter
  • Map on drafting table
  • Map at super close 1-inch block
  • Map with dog
  • Duct taping card stock together
  • Hanging on the wall

Future considerations

If I continue using this program for more than a one-off project, I need to consider the following changes:

  • Instead of using rows and columns to determine the size of the map, pass a latitude_end and longitude_end and have the program figure out the rows and columns.
  • Replace pyscreenshot.grab with a hand-rolled image cropping function, since Selenium already provides basic screenshot functionality.
  • Allow for adjustable zoom levels, instead of hard-coding it to 18z.
  • Allow for Google Earth images in addition to Google Maps.


Snake Rave 1.1

Snake Rave version 1.1 is now live. Changes since the last post include:

  • Added ‘Loading’, ‘How to Play’, ‘About’, and ‘Settings’ screens
  • Added option to Settings for wrap-around to be turned on or off
  • Added option to Settings to change the current speed of the snake
  • Added option to Settings to toggle whether snake speeds up with each point
  • Made game pause and reset automatically on window resize / rotation
  • Improved ‘Game Over’ screen to show score, time, and speed
  • Increased speed of snake as game progresses (0.02x increase per point)

Play now

Screenshots:

  • Snake Rave Menu
  • Snake Rave Playing
  • Snake Rave How to Play





Let's Encrypt Auto Renewal

The bad news is that my server was unreachable yesterday due to a bug in my SSL certificate renewal script. Sorry about that. The good news is, while debugging the issue, I discovered that the functionality for auto-renewal is now built into the letsencrypt client.

This feature, as simple as typing ./letsencrypt-auto renew, was added to Let’s Encrypt back in February. Follow the Digital Ocean guide if you need help setting up a cron job, and your certificates will never get outdated again.
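For the record, my crontab entry looks roughly like this (the path and schedule are illustrative):

```shell
# Attempt renewal twice daily; the client only renews certs close to expiry
30 2,14 * * * /opt/letsencrypt/letsencrypt-auto renew >> /var/log/le-renewal.log 2>&1
```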

Since 2015, before this feature was available, I had been using a script dubbed le-renew.sh provided by Erika Heidi at Digital Ocean. Unfortunately, when the auto-renewal was supposed to kick off this month, the script ran into a parsing error:

PluginError((‘There has been an error in parsing the file (%s): %s’, u’/etc/apache2/sites-enabled/roche.io.conf’, u’Syntax error’),)

I only found one other mention of this error through search, which suggested that my whitespace was inconsistent. Regardless, when I checked the guide from Digital Ocean that originally provided the script, they had replaced it with the now standard ./letsencrypt-auto renew command.

I deleted the old shell script, replaced it with that command, and my free SSL cert was working perfectly again. No other changes necessary.

My thanks to the Let’s Encrypt team. Fewer dependencies, fewer problems.



Comparing Folders with Python

I recently wrote a small folder comparison tool for Windows. It’s a small graphic interface on top of the basic feature-set of Python’s standard library package filecmp.

Here’s a download for the standalone .exe. You can find the source code on Github.

It looks like this:

Folder comparison tool interface


Why not Bash?

The users that requested the tool needed the ability to compare the contents of two folders, quickly and at will, thousands of times over the next few months. This seems like a perfect fit for the Bash diff command.

Using the test data in the Git repo above, the command would be:

$ diff -r tests/control_data_1 tests/control_data_2

Which produces the output:

Only in tests/control_data_1/Data Folder/test_folder: test_image.bmp
Only in tests/control_data_1/Data Folder: test_text.txt
Only in tests/control_data_1/Data Folder: test_word.docx
Binary files tests/control_data_1/Data Folder/test_zip.zip and tests/control_data_2/Data Folder/test_zip.zip differ

Saving the output to a text file is as easy as adding > output.txt to the end of the command. If any other options were needed, they are probably already built-in and documented on the man page.

But most of the end users run Windows and are not comfortable with the command line, so unfortunately Bash isn’t an option.

(I personally use and highly recommend Git Bash for technical or open-minded Windows users in this situation.)

Why Python?

I looked into similar commands using the native Windows command prompt, but that proved to be a nightmare. The Windows FC command comes nowhere close to the feature set of diff, and has no built-in support for subdirectories.

Stack Overflow provides the cleanest code I could find for a .bat file that met my needs:

@echo off &setlocal
set "folderA=D:\NONMEM7.3beta7.0"
set "folderB=D:\NONMEM7.3beta7.0Renamed"
for %%a in ("%folderA%\*.f90") do if not exist "%folderB%\%%~na_recoded%%~xa" echo %%~na_recoded%%~xa not found in %folderB%.
for %%a in ("%folderB%\*.f90") do for /f "delims=_" %%b in ("%%~na") do if not exist "%folderA%\%%~b%%~xa" echo %%~b%%~xa not found in %folderA%.

It might work, but I certainly wouldn’t want to maintain it.

So command line tools and batch files are out. In that case, a minimal graphic interface is probably best, ideally as a standalone .exe program.

And so our final program needs:

  • Some version of the diff feature set
  • A graphic interface
  • An installer to turn it into a standalone Windows program

Python has a standard library module for the first two, and PyInstaller can take care of the packaging. I also think it’s the most fun to code in.

Comparing files with the filecmp module

The filecmp module is included in the standard library for both Python 2 and Python 3. This means that, like the diff command, it should be battle-tested and reliable. Hopefully it also means that it is well documented, and any quirks that do exist have already been raised and answered on Stack Overflow.

For the most part, I’ve found this to be true. It takes 3 lines of Python to achieve results similar to diff:

import filecmp
comparison = filecmp.dircmp('./tests/control_data_1', './tests/control_data_2')
comparison.report_full_closure()

Which, when using the test data in the Git repo, produces the output:

diff ./tests/control_data_1 ./tests/control_data_2
Common subdirectories : ['Data Folder']

diff ./tests/control_data_1/Data Folder ./tests/control_data_2/Data Folder
Only in ./tests/control_data_1/Data Folder : ['test_empty_folder', 'test_text.txt', 'test_word.docx']
Identical files : ['test_excel.xlsx', 'test_powerpoint.pptx']
Differing files : ['test_zip.zip']
Common subdirectories : ['test_folder']

diff ./tests/control_data_1/Data Folder/test_folder ./tests/control_data_2/Data Folder/test_folder
Only in ./tests/control_data_1/Data Folder/test_folder : ['test_image.bmp']
Identical files : ['test_folder_text.txt']

That’s pretty good! More granular properties of the comparison object like comparison.left_only and comparison.common_dirs are also available.

Problems with content diffs

Originally I wanted to have two groups for files with matching filepaths – same and diff – based on a binary content match similar to the diff command. Unfortunately, the matching proved unreliable during testing, so I decided to go without it. (In hindsight, that’s because dircmp compares files shallowly by default: two files with the same size and modification time are reported as identical without their contents ever being read.) The end users were only concerned with name-based file existence anyway. No sense burning time on unwanted, bug-prone features.

Below is the failing test – my runs show a false-positive rate around 75%.

import filecmp
import os
import shutil

def test_diff_files():
    """Create two files with different contents and compare results."""

    folder1 = 'results1'
    folder2 = 'results2'
    os.mkdir(folder1)
    os.mkdir(folder2)

    file1 = os.path.join(folder1, 'hello_world.txt')
    with open(file1, 'w') as file:
        file.write('foo')

    file2 = os.path.join(folder2, 'hello_world.txt')
    with open(file2, 'w') as file:
        file.write('bar')

    comparison = filecmp.dircmp(folder1, folder2)

    try:
        assert comparison.diff_files == ['hello_world.txt']
        assert comparison.same_files == []
    except AssertionError:
        raise
    finally:
        shutil.rmtree(folder1)
        shutil.rmtree(folder2)

# Run the test 100 times to get percent accuracy
failures = 0
for _ in range(100):
    try:
        test_diff_files()
    except AssertionError:
        failures += 1

print("%i 'same files' out of 100 expected 'diff files'" % failures)

Using recursion to summarize data

The comparison results will be summarized in a dictionary with the keys left, right, and both. Each file analyzed will be formatted with a full filepath relative to the root folders compared (denoted as .) and the / path separator.

comparison = filecmp.dircmp(folder1, folder2)

# 'prefix' starts as '.' and grows as the recursion descends into sub-folders
data = {
    'left': ['{}/{}'.format(prefix, i) for i in comparison.left_only],
    'right': ['{}/{}'.format(prefix, i) for i in comparison.right_only],
    'both': ['{}/{}'.format(prefix, i) for i in comparison.common_files],
}

Then, each sub-folder shared by the original folders needs to be compared. After each additional comparison is completed, the values from the sub-report are appended to the lists in the main report.

if comparison.common_dirs:
    for folder in comparison.common_dirs:
        # Build this sub-folder's reported prefix without mutating 'prefix',
        # otherwise sibling folders would inherit each other's paths
        sub_prefix = prefix + '/' + folder

        # Compare sub-folder and add results to the report
        sub_folder1 = os.path.join(folder1, folder)
        sub_folder2 = os.path.join(folder2, folder)
        sub_report = _recursive_dircmp(sub_folder1, sub_folder2, sub_prefix)

        # Add results from sub_report to main report
        for key, value in sub_report.items():
            data[key] += value

This will run as many times as it needs, until all sub-folders are exhausted. Any sub-folder that exists only in one location will not be searched further – it will just be included in either the left or right list.

Output to .txt and .csv

Outputting the summarized data to a .txt file is easy, and can be done without any imports. This was the earliest version of the report and is deliberately not DRY – it keeps the raw text report easy to change and format as requested by the end users.

filename = 'output.txt'
with open(filename, 'w') as file:
    file.write('COMPARISON OF FILES BETWEEN FOLDERS:\n')
    file.write('\tFOLDER 1: {}\n'.format(folder1))
    file.write('\tFOLDER 2: {}\n'.format(folder2))
    file.write('\n\n')

    file.write('FILES ONLY IN: {}\n'.format(folder1))
    for item in report['left']:
        file.write('\t' + item + '\n')
    if not report['left']:
        file.write('\tNone\n')
    file.write('\n\n')

    ...

Although the data isn’t strictly tabular, the end users requested the report be output to Excel, since that’s where further reconciliation would occur. Instead of getting the full-featured xlsxwriter from PyPI, I kept it simple and used the csv module from the standard library.

To start, a csv writer is created with the excel dialect.

import csv

filename = 'output.csv'
with open(filename, 'w', newline='') as file:
    csv_writer = csv.writer(file, dialect='excel')

The headers are written to the first row, and the data is ordered to match.

headers = (
    "Files only in folder '{}'".format(folder1),
    "Files only in folder '{}'".format(folder2),
    "Files in both folders",
)
csv_writer.writerow(headers)

data = (
    report['left'],
    report['right'],
    report['both'],
)

Then, the data is written row by row to the CSV. Once a column’s list runs out of items, indexing into it raises an IndexError. Just catch these and substitute None to keep the columns in the CSV aligned.

row_index = 0
row_max = max(len(column) for column in data)
while row_index < row_max:
    values = []
    for column in data:
        # Use data from column if it exists, otherwise use None
        try:
            values += [column[row_index]]
        except IndexError:
            values += [None]

    csv_writer.writerow(values)
    row_index += 1
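The same column alignment can also be done with itertools.zip_longest, which pads the shorter columns with a fill value automatically (shown here with toy data):

```python
from itertools import zip_longest

# Toy report data: left-only, right-only, and shared files
data = (
    ['a.txt', 'b.txt'],
    ['c.txt'],
    [],
)

# Transpose columns into rows, padding short columns with None
rows = list(zip_longest(*data, fillvalue=None))
print(rows)  # [('a.txt', 'c.txt', None), ('b.txt', None, None)]
```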

Testing the results

Now the program only needs two directories as inputs, and outputs a recursive comparison report to .txt or .csv. The end users reviewed a sample report generated with test data that they provided, and were happy with the results. After a few formatting tweaks, of course.

Some of these tests were formalized using the built-in unittest module. This module is a little more verbose than pytest, nose, or raw assert statements, but these constraints often force me to build better tests.

For this module, three integration tests were created to cover the inputs and outputs agreed upon with the end users. A couple unit tests were also created for the core _recursive_dircmp function to deal with edge cases like shared subdirectories and binary file comparison. These unit tests help identify, when an integration test fails, whether the problem is related to the folder comparison function or the .txt and .csv writing functions.

Next steps

Now the program can create the comparison reports requested, and can run on any platform with a default Python 3.5 installation. But it still requires editing a text file and opening a terminal to run, which our end users would prefer not to do. To solve this, I used the tkinter module which has a great cross-language tutorial on their official website. If you want to see the code I used in my interface, it’s all available on Github.

Thanks for reading! If you have any comments or suggestions for improvement, please send me an email or issue a pull request.



Garden Path Sentences

The horse raced past the barn fell.

Yes, the above sentence is grammatically correct. After a couple of reads, it becomes clear that it was the horse that fell, after it was raced past the barn.

A garden path sentence is a linguistic pattern in which the start of a sentence leads the reader to believe it will continue in one manner, but then ends in another, preventing it from being parsed correctly. This often leaves the reader feeling tricked and confused, in a delightful manner.

I collected some garden path sentences from around the internet and categorized them below. I left out any sentences that were just missing commas after introductory clauses (e.g. “When Fred eats food gets thrown”).

They’re not as much fun if you know the trick beforehand, so I mixed them up and included a footnote for each linking it to its category. Enjoy!

Examples of Garden Path Sentences

  • The horse raced past the barn fell. [4]
  • The complex houses married and single soldiers and their families. [1]
  • The man who hunts ducks out on weekends. [2]
  • Fat people eat accumulates. [3]
  • The man pushed through the door swung. [4]
  • The old man the boats. [1]
  • The cotton clothing is usually made of grows in Mississippi. [3]
  • The girl told the story cried. [4]
  • The woman that whistles tunes pianos. [2]
  • We painted the wall with cracks. [5]
  • I convinced her children are noisy. [3]
  • The raft floated down the river sank. [4]
  • Have the students who failed the exam retake the class. [6]
  • The sour drink from the ocean. [1]
  • The florist sent the flowers was pleased. [4]

Types of Garden Path Sentences

  1. Subjects you think are adjectives with verbs you think are subjects
  2. Verbs you think are prepositional objects
  3. Objects you think are adjectives
  4. Past participles you think are verbs
  5. Prepositions you think belong to the action that actually belong to the object
  6. Commands you think are questions (¡Not a problem in Spanish!)

