

The Hunt Index, its history and where it is going [Jan. 29th, 2016|12:56 pm]
devjoe

I said I was done blogging this year's MIT Mystery Hunt (and I might still renege on that and write some more), but except for one detail, this post is not really about this year's hunt.

The Origin of the Index

Quite a few years ago, perhaps as early as 2006, I started compiling an index of all Mystery Hunt puzzles, categorizing each under appropriate keywords. Because I only worked on it occasionally, it took a while - years - to get through the whole archive, in which time more hunts happened and the task got longer. At the time, I started with the 2000 hunt, because it was the first of the modern hunts fully present in the Mystery Hunt archive, and also the first I attended. It was sometime in 2009, in between writing puzzles for the 2010 hunt, that I actually caught up with the present. However, I decided at the time not to release the index, because having that come from a member of the writing team would make people think it was somehow related to the 2010 Hunt, and with the 2010 Hunt actually involving (mostly fictionalized) hunt history, it was also borderline spoily, or at least would appear so.

Generations 1 & 2

So I held back the release of the index until after the completion of the 2010 Mystery Hunt. I released it shortly afterward, to the admiration and awe of many solvers. The first version of the index was a single very long HTML file built in whatever Firefox's HTML composer was called at the time. It was full of glitches - some links to puzzles didn't go to the right place, in some places a puzzle link had been inserted inside another, some puzzles were missing links, etc. I knew I needed a better system, and it was overhauled twice in short succession. The first revision involved writing Python scripts to parse the existing index into keywords and the puzzles linked to each keyword, along with the link for each puzzle and any extra data attached to the keyword or puzzle. All of this was written into three big CSV files: one which listed all the puzzles and their data, one which listed the keywords and their data, and one which contained all the links between puzzles and keywords. Another Python script regenerated the big HTML file from these files. I also had it generate individual keyword files from this.
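For readers who want a picture of that three-file layout, here is a minimal sketch of how such a regeneration script might look. The column names, the sample rows and the function names are all assumptions made for illustration; the post does not describe the actual file format or script.

```python
import csv
import io
from collections import defaultdict
from html import escape

# Hypothetical file contents; the real column layout is not described in the post.
PUZZLES_CSV = (
    "puzzle_id,title,url\n"
    "p1,Example Crossword,http://example.com/p1\n"
    "p2,Example Sudoku,http://example.com/p2\n"
)
KEYWORDS_CSV = "keyword_id,name\nk1,crossword puzzles\nk2,sudoku\n"
LINKS_CSV = "keyword_id,puzzle_id\nk1,p1\nk2,p2\nk2,p1\n"


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def build_index_html(puzzles_csv, keywords_csv, links_csv):
    """Join the three tables and render one keyword-sorted HTML index."""
    puzzles = {row["puzzle_id"]: row for row in read_rows(puzzles_csv)}
    keywords = {row["keyword_id"]: row for row in read_rows(keywords_csv)}

    # The links table is the many-to-many join between keywords and puzzles.
    by_keyword = defaultdict(list)
    for link in read_rows(links_csv):
        by_keyword[link["keyword_id"]].append(puzzles[link["puzzle_id"]])

    parts = ["<html><body>"]
    for kw_id, kw in sorted(keywords.items(), key=lambda item: item[1]["name"]):
        parts.append("<h2>%s</h2><ul>" % escape(kw["name"]))
        for puzzle in sorted(by_keyword[kw_id], key=lambda p: p["title"]):
            parts.append(
                '<li><a href="%s">%s</a></li>'
                % (escape(puzzle["url"], quote=True), escape(puzzle["title"]))
            )
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_index_html(PUZZLES_CSV, KEYWORDS_CSV, LINKS_CSV))
```

Keeping the links in their own file means a puzzle's title or URL is stored once, however many keywords point at it, which is the kind of reference matching the single hand-edited HTML file could not guarantee.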

Generation 3

But this was a pain to maintain, and especially to keep the references matching everywhere. It really seemed like I wanted some sort of database. In August 2011 Verizon's personal web space imploded (they shut off FTP access, leaving the clunky and buggy web interface as the only way to edit your web site), so I moved my personal site (and the Hunt Index with it) to Google App Engine, initially just putting up a static site, but knowing I could store data in a database and put Python code up on the site to generate pages dynamically from the database. With a few exceptions on a MUCH smaller scale, I had never done this sort of thing, and I wanted to try it. And I did. I wrote thousands of lines of Python to make forms for creating and editing keywords, puzzles, and the newly created categories of keywords, as well as links between them, and for generating all the pages at all the various URLs I chose for these pages. And I added solution links for every puzzle, since I now had a page for every puzzle from which I could provide the multiple links. And authors on every puzzle. Except for the forms for creating new pages, this was all integrated into the pages themselves; when I was logged in (via a hidden login page), the edit forms appeared right there on the main pages of the site. It used Google account authentication, and theoretically anybody who knew the hidden login page could have logged in, but only a list of accounts I authorized could see the edit forms or submit any changes using them. It was very easy to write an authentication function and check it before rendering any of the pages or processing any form submissions. By Hunt 2012, generation 3 of the site was live.
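The allow-list check described above can be sketched roughly as follows. This is not the site's code: the account address, the function names and the placeholder form are invented for illustration, and the real site got the logged-in account from Google account authentication rather than from a plain argument.

```python
# Hypothetical allow-list; the real list of authorized accounts is not public.
AUTHORIZED_EDITORS = {"editor@example.com"}


def is_admin(email):
    """True only for logged-in accounts that appear on the allow-list."""
    return email is not None and email.lower() in AUTHORIZED_EDITORS


def render_page(body_html, email=None):
    """Render a page, appending the edit form only for authorized accounts."""
    parts = [body_html]
    if is_admin(email):
        parts.append('<form method="post">(edit fields go here)</form>')
    return "\n".join(parts)


def handle_edit(email, apply_change):
    """Run a form submission only if the sender is authorized."""
    if not is_admin(email):
        return False
    apply_change()
    return True
```

Checking the same function on both the rendering path and the submission path matters: hiding the form alone would not stop someone from posting to the form's URL directly.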

I realized even before this was finished that this also probably wasn't the right way to run the site. Google App Engine had reasonable free quotas for storage and bandwidth for a site of my size, but database access was very limited. In order to make this site work at all within the free limits, I needed to use memcache to cache things - and cache everything. So there was a whole backend layer; every piece of data for every page was retrieved using a function which looked for the data in memcache and reloaded it from the database if it had fallen out of cache. And every editing form refreshed all affected pages in the cache after making changes. And if somebody tried to load one of the big pages after the whole site had fallen out of the cache, it took a while to regenerate it and that web user experienced a very slow page load. Still, I worried about this limit. The daily database quota was 50,000 reads or writes, but every item, every field in a keyword or puzzle and every link between them was a read. Generating the full index was over 15,000 reads and growing. This was working only because in times of high use (like during hunt), the cache would stay alive and it would never hit the database.
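The caching layer described here is the usual cache-aside pattern. Below is a minimal sketch of it, with plain dicts standing in for both memcache and the datastore (an assumption, so the example runs anywhere) and a counter showing why the quota mattered: at over 15,000 reads per full index, a 50,000-a-day quota covers only about three uncached regenerations.

```python
class CachedStore:
    """Cache-aside reads over a quota-limited store, with a read counter."""

    def __init__(self, database):
        self.database = database  # stand-in for the quota-limited datastore
        self.cache = {}           # stand-in for memcache
        self.db_reads = 0

    def get(self, key):
        """Return the cached value, reloading from the database on a miss."""
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value

    def put(self, key, value):
        """Write to the database and refresh the cached copy."""
        self.database[key] = value
        self.cache[key] = value

    def evict_all(self):
        """Simulate the whole site falling out of the cache."""
        self.cache.clear()


if __name__ == "__main__":
    store = CachedStore({"keyword:sudoku": "Sudoku"})
    store.get("keyword:sudoku")
    store.get("keyword:sudoku")
    print(store.db_reads)  # only the first get touched the database
```

Only a miss costs a database read, so a busy day that keeps the cache warm is cheaper against the quota than a quiet one in which everything has been evicted.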

Generation 4

What I really wanted to do here was have a static site, but have the editing forms within the site modify the static pages, or kick off a job that would do that. Only when I was editing the site would the database get used at all. But Google App Engine discourages this type of coding, and in the long run, if I added the features and additional hunts I wanted to add, I would push the database past 50k items and have trouble even regenerating the site from the database. So the solution was to go back to offline editing of the site. But I wasn't in any way intending to return to generation 2 of the web site, with manual editing of a list of links. What I needed for generation 4 was a set of editing forms for offline use, similar to the web forms I had before. I knew how to write Tkinter GUIs in Python, and I had written a few, so that was what I proposed to do.

In 2015, we won the hunt. This put the hunt index, for the first time, in the position of being a semi-official thing as the output of a current hunt author. Many solvers who used my site would probably expect there to be a puzzle involving it, somehow, and I figured I would give it to them. But this was exactly the sort of thing that would drive the GAE-database-driven site past its limits. Hunt has a budget, so I could certainly pay GAE's fees, but that was the wrong solution to the problem. The right solution was what I had already decided generation 4 would be. So early in 2015 I started doing this.

All the form code had to be rewritten from the ground up, but the old forms provided prototypes for what needed to go on the new forms. The biggest difference is that they would be untangled from the web page generation code. All the ifadmin() code in the old web page generation was stripped out, but the remaining code provided a good basis for the code that would generate pages. In addition to this, I wanted to fix what was a growing flaw of the old editing forms. In those forms, when I wanted to link a keyword to a puzzle or vice versa, there was an enormous drop-down list with all the keywords or puzzles in it. This was a pain. The lists were way too long, and sometimes it was hard for me to find a keyword I was looking for. More than once I had created an essentially duplicate keyword under another name due to not finding the old one, and later discovered this and merged them. For the new system, I wrote a search system. I could enter a search string and generate a short list of matching puzzles or keywords instead of searching through the whole big list. This would also help me avoid those duplicates by letting me find keywords where the word I thought to search on wasn't at the start of the entered keyword.
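The search step can be sketched in a few lines. This is only an illustration of substring matching, not the editor's real code; the function name, the ranking rule and the result limit are assumptions.

```python
def search_names(names, query, limit=20):
    """Return up to `limit` names containing `query`, case-insensitively."""
    needle = query.strip().lower()
    if not needle:
        return []
    matches = [name for name in names if needle in name.lower()]
    # Names that start with the query sort first, then alphabetical order.
    matches.sort(key=lambda name: (not name.lower().startswith(needle), name.lower()))
    return matches[:limit]


if __name__ == "__main__":
    keywords = [
        "paint by numbers (logic puzzle)",
        "cryptic crossword",
        "crossword puzzles",
        "sudoku",
    ]
    print(search_names(keywords, "crossword"))
```

Because the match is anywhere in the name rather than only at the start, a search for "crossword" also surfaces "cryptic crossword", which is exactly the kind of entry that is easy to miss when scanning an alphabetical drop-down.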

Linking Update, and the Puzzle

I did all this, and tested my work by entering the entire 2015 hunt into the index using the new interface. I fixed a couple glitches along the way. But in the meantime, the proposed 2016 puzzle involving the index had taken the form of making a whole "fishy" copy of the site which would have some differences. There were some issues with that. In many places within the site, links to pages used root-relative links, starting with /huntindex. I could replace all those with /fishyindex in the copy. But for the purpose of testing this puzzle, I wanted to provide an offline version of the site (or, as it turned out, a zipped version which could be expanded in an arbitrary path on another web site), rather than have the site live and potentially findable by teams months before the hunt. (One issue was that if search engines somehow found the site, they'd index the whole thing unless I put it into robots.txt, but if I did that, anybody expecting something there and actively checking the site might find it.) Those root-relative links frustrated my offline/other URL strategy, because for a local version of the site, they would send the user to the root of his filesystem or Windows drive, and within another web site, they'd send the user to a URL that did not exist.

The solution to this was to get rid of root-relative links. The problem with that was the reason I had root-relative links in the first place. I had pages both directly under /huntindex and in subfolders under /huntindex, and both sets of pages had links to pages for individual puzzles, etc. and in addition, the header of every page had a Home link. Some of these links are in the descriptions of specific keywords, which are part of the data, not the code, of the site, and so can't be easily rewritten by the code. The fix for this was to put all the pages except for the home page under subfolders. /huntindex/keywords would become /huntindex/index/keywords while /huntindex/keyword/crosswordpuzzles would stay where it was. But now ../ would take you from any other page on the site back to the home page, and ../keyword/crosswordpuzzles would take you from any page to the page about crossword puzzles. All the links would work from anywhere, any place the site was posted. But to do this, I had to move a few of the old pages to different URLs. To help train search engines to pick up the new URLs, I put in static pages at the old URLs which had meta-tag redirects to the new ones, as well as to help anybody who had bookmarked, say, the list of keywords. I eventually plan to remove these and let the old URLs die, but as a transition plan, this seemed sound. The links within the descriptions of various keywords were searched for and fixed. And I made the changes on the live version of the index, transparently to most people who do not look at their address bar, but they worked. I clicked around through all the kinds of links on my site to confirm.
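Here is a small sketch of the two mechanical pieces of that change: rewriting a root-relative link into a `../` link, and producing a redirect stub for a moved page. It assumes, as described above, that every page except the home page sits exactly one folder below the site root. The function names and the exact stub markup are mine, not the site's.

```python
from html import escape

SITE_ROOT = "/huntindex"


def to_relative(url):
    """Rewrite a root-relative link for use on any one-folder-deep page.

    Links that do not start at the site root are returned unchanged.
    """
    if url in (SITE_ROOT, SITE_ROOT + "/"):
        return "../"
    if url.startswith(SITE_ROOT + "/"):
        return "../" + url[len(SITE_ROOT) + 1:]
    return url


def redirect_page(new_url):
    """Build a static stub page whose meta tag sends browsers to new_url."""
    target = escape(new_url, quote=True)
    return (
        "<html><head>"
        '<meta http-equiv="refresh" content="0; url=%s">'
        '</head><body><a href="%s">This page has moved.</a></body></html>'
        % (target, target)
    )


if __name__ == "__main__":
    print(to_relative("/huntindex/keyword/crosswordpuzzles"))
    # Stub left at the old /huntindex/keywords URL, pointing at index/keywords.
    print(redirect_page("index/keywords"))
```

The stub's target is itself relative (`index/keywords`, resolved against the old `/huntindex/keywords` URL), so the redirects survive a move to another path just as the rewritten links do.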

To build the puzzle (originally called Something Fishy, but renamed to Haddock Walk due to requirements of the dream fragment in its round), I made a copy of the editing program and the then-current database under other names, using a different path to write web files. Then I just added in the links I needed in the new copy of the site like I would have done on any other update. This copy got zipped up and was served from puzzletron for testing. It worked, people solved the puzzle, and it was all ready to go. And nothing broke related to this puzzle during the actual hunt. On Saturday when most teams reached this puzzle, the web traffic on my site was 10 times normal, but still only a third of my daily free bandwidth quota, the only limit which now mattered, and this was in the range I expected.

And one of my friends emailed me to point out a wrong keyword link on the site the day before hunt. I just decided to ignore this, because I didn't want to have any additional differences between the real and fishy sites, I didn't want to regenerate the fishy site, and this just wasn't where teams would be looking for this puzzle. It's still broken, but it is on my list of things to fix, soon.

The Future (and a little more of the past)

By the end of August 2011 I had a numbered list of 15 tasks that needed to be done for my GAE site. A few of these were in my personal site outside of the hunt index, but they included the entire original plan for the database-driven generation 3 of the site. Most of these were done when I made this version public on January 1, 2012 and all but two were done by the end of 2012. After that I slipped into a pause in which I made only very minor changes to the site for two years, except for adding new hunts. Then I did all the work I described in 2015, and I hope to do a bunch more in 2016. Here is what I hope to do (numbers are from that list, which is now up to 22 tasks):

19. This task has a few small improvements I want to make to the Python data-entry GUI before I get into the new features. It's not very interesting, but it is the first item on my list.

21. Provide a feedback system. It's always been possible for people to email me, but I'm inspired by the whole Google form thing that was part of what saved the 2016 hunt. I had thought of some other way to do this, but what I am now thinking is to set up a Google form much like the one used during hunt, and pre-populate a field that indicates which page they called it from, and link to it next to the Home link on every page. This will make it easier for users of the site to submit feedback and improve the site.

12. Make pages for complete hunts. When you are on the page for a puzzle, you'll be able to click the name of the hunt to see all the puzzles in that hunt.

18. Make pages for individual authors. Right now, authors are just text attributes of the puzzles. This change will make them be links to an author object, which will provide a link to an author page, which would list all puzzles by the same author. I had two requests from teammates for exactly this after the 2016 hunt and I told them it was already in my plans.

I will do 12 and 18 together, since they involve a lot of similar tasks. I considered at one point making pages for teams, but this gets too weird. People move around from one team to another, and teams come and go. I could not definitively link an author to any one team, and the team pages would also show people who are no longer with the team. I could mark when people joined and left teams, but that gets way too much into internal team matters which I don't know, have no definitive source to look up, and some may consider none of my business. So at most, I'll put a text attribute in the hunt page which specifies the name of the team that ran that hunt, and maybe one for the winner as well. For similar reasons, I don't want to put a whole bunch of biographical information about authors on their individual pages. For 2016 I expect author pages will only exist to show links to things in the index. Once this is up, if your name is spelled wrong, or if you are wrongly credited, let me know and I'll fix it (and if this stems from errors in the hunt archive, if you let me know before we turn over the site to Setec, I may be able to get them fixed there too), but for other kinds of requests the answer is no for now and I have no real plans for the future.

15. Add more puzzle data to the site. This is the task that never ends, or we hope it doesn't, anyway. As long as there are more hunts. If I ever find the time I might start indexing some of those other hunts, too.

22. Add synonyms for keywords. These would appear in the list of all keywords and the full index, but they would just have a cross-reference to the primary version of that keyword I am using. This would allow people to find puzzles under sudoku and number place, under cross sums and kakuro, under paint by numbers (logic puzzle) and half a dozen other names it is called, and likewise for other things besides puzzle types that there are multiple ways of naming.
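As a rough sketch of how the synonym cross-references in item 22 might work (nothing here is implemented yet, and which name counts as primary in each pair is an assumption made for the example):

```python
# Hypothetical data: synonym -> primary keyword.
SYNONYMS = {"number place": "sudoku", "cross sums": "kakuro"}
PRIMARY_KEYWORDS = ["kakuro", "sudoku"]


def keyword_listing(primary_keywords, synonyms):
    """Return one alphabetical listing; synonyms show a cross-reference only."""
    entries = [(name, None) for name in primary_keywords]
    entries += [(synonym, primary) for synonym, primary in synonyms.items()]
    entries.sort(key=lambda entry: entry[0].lower())
    return [
        name if target is None else "%s: see %s" % (name, target)
        for name, target in entries
    ]


def resolve(name, synonyms):
    """Map a synonym to its primary keyword; other names pass through."""
    return synonyms.get(name, name)


if __name__ == "__main__":
    for line in keyword_listing(PRIMARY_KEYWORDS, SYNONYMS):
        print(line)
```

Storing a synonym as a pointer instead of a second keyword keeps each puzzle linked in one place only, so the two names cannot drift apart the way accidental duplicate keywords did.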

These are all the items on my list for now. For those who are curious, 16 was a different plan to deal with the limitations of the database which was replaced by making the new static generation 4 version, 17 was to deal with Google App Engine's impending removal of the older database structure that I was using, an item which turned into simply "get rid of the database" after my new plan for 16, and 20 was to add viewport tags to make the site mobile-friendly.