A Day in the Life of a Wordnik Dev: Russell Horton

Continuing our series on the lives of Wordnik developers, today we talk to Russell Horton, aka @ngr_am, Computational Linguist. Russ lets us in on his favorite tools, languages, and a spicy hobby.

What’s your favorite coding editor/IDE? Why?

Sublime Text 2, because of batch edits, ⌘-P, and the package ecosystem.

What’s your beverage of choice?

Green tea.

The best lunch within five blocks of the office?

Hella Vegan burrito at Curry Up Now.

What’s your favorite music to listen to while working? And your sound system of choice?

Depending on the task, I like to listen to electronica, hip-hop or NPR, on the Cambridge Model Twelve.

What are your favorite languages?

I love English. Scala and Python are pretty nice, too.

Which language do you think is terrible?

I’m not much of a language snob. I even kinda miss Perl. But this just ain’t right:

if ("almond milk lattes" == 0) {
    echo "WTHF?";
}

What was your first language? When did you learn it?

Basic, in 1987, on our Gateway 386DX 25 MHz (with Turbo!). Thanks Mom and Dad!

Where do you go for your tech news?

Twitter, the folks in the office.

Where do you go for help?

The folks in the office, Stack Overflow, IRC.

What’s the best thing you’ve read about coding lately?

The Dispatch documentation is pleasant.

What’s the worst thing you’ve read about coding lately?

I try not to read awful things.

What’s your favorite book?

I don’t have favorites, but I recently read The Moon is a Harsh Mistress by Robert A. Heinlein and really enjoyed it.

If you had a time machine, what would you go back and tell your younger self?

Buy Apple stock when I was advised to in 1998. To be fair, I was doing support on LCIIs, Performas and Quadras at the time, so it didn’t seem too obvious.

If you weren’t a dev, what would you be?

If I weren’t a computational linguist, I would be a non-computational linguist.

What do you like to do when you’re not coding?

Grow and eat the world’s hottest peppers.

What’s your strategy for hitting the bell in Wordnik’s Friday contest?

Make the plane of your body parallel to the flight path of the dart. Extend your arm as straight and as far as possible, and sight along the barrel. Have a light grip, and squeeze the trigger with the tip of your index finger. Disclosure: I have never won. But I did stick a suction dart to the bell, twice.


Tactful Trademark Defense: An Example

[Image: d is for dachshund]

A trademark is a little bit like a pet. Once you get one, you have to take care of it for life. But instead of food, water, and regular walks, taking care of a trademark involves making sure that you are the only person using it.

Early in December here at Wordnik we got a nice email from one of our loyal users, letting us know that there was a word game in the app store using the name “Wordnik.” It didn’t have our heart logo (or even our “gearheart”), so our correspondent wasn’t sure it was ours … and it wasn’t.

So we took a look at the game, and it was called Wordnik. And there was contact info for the developer …

Now, this is the point where it’s pretty easy to get all het up and angsty about your trademark. It’s YOURS, by golly! You spent a lot of time thinking it up, paid your lawyers good money to register it, and you’ve put a lot of sweat into building the brand behind it, and … and … and … somebody else has just waltzed in and slapped it on something ELSE? This is the point where lots of people send (or have their lawyers send) an angry, red-in-the-face cease-and-desist email.

[This is the place where having a background in dictionary work made me want to take a different approach. When I was a traditional dictionary editor, I got a lot of C&D trademark defense emails from lawyers, all asking us to take trademarked terms out of the dictionary. Basically, the lawyers know that it's perfectly okay to have trademarks in the dictionary; a dictionary entry is not a competing product or service, it's a statement of fact, documenting the existence of the use of a word, trademark or no (and if dictionary editors know the trademark status of a word, that fact is included, too). Those letters were just a bit of trademark theatre to show that the holders are actively defending their trademark, in case they ever have to go to court to defend it. But nobody ever likes getting a C&D letter, no matter how theatrical. They make people feel bad and waste money and time calling in their own lawyers. So I wanted to try something different.]

Our thought process, instead, went something like this: “There’s no evidence that this app is trying to confuse Wordnik users. And their logo (although not as shiny as ours) is cute. Also, Wordnik is a pretty awesome name (if we say so ourselves), so it’s not unlikely that someone else would think so, too. And this developer seems cool. So what’s the harm in sending a nice note, bringing our trademark status to his attention? We can always call in the big guns later.”

So this is what we sent [NOTE: all emails here are published with the consent of the developer]:

One of the users of our website, Wordnik.com, pointed out to us that your iOS app is also using the name “Wordnik”.

You may not be aware that we have applied for a US trademark for the name “Wordnik” and our application has been approved for registration.

Since the Wordnik API powers many word games on the web and on mobile devices, our trademark filing for the name “Wordnik” also includes its use in combination with computer games.

I’d rather not drag our lawyers into this (expensive for both of us) — but given our trademark status, you probably want to consider renaming your app (and maybe even using our API, check it out at developer.wordnik.com).

How about:


This list of English suffixes may help, too: http://en.wiktionary.org/wiki/Category:English_suffixes

I hope to hear back from you by Dec 31, 2012.

The developer, Michael Nathanson, was (as we had hoped) awesome. We got this reply really quickly:

Yeah, I realized only after the fact. I don’t know how missed that. Will try to get it changed up soon. Thanks for the name suggestions.

So since he was being so nice, it was easy to be nice back:

Thanks Mike! Really appreciate the reply.

And we tested a *LOT* of names before settling on Wordnik, I’m happy to give you feedback on your new name if you want.

Mike got back to us with a status update, too:

Know that I’m determined to get a new name but it just may take a couple of weeks before I get it all sorted …

Were the name suggestions that you put forth in your first email ones that you had considered using yourself before landing on Wordnik? Got any other runner ups that you passed up?

Thanks so much, and appreciate the friendly nature of the email.

He was so friendly and responsive, in fact, that we were happy to reply (and we even put out feelers for hiring him — too bad he’s embedded out East!):

We basically added every possible suffix in English (and some that were “unpossible”) to the word “word” while looking for Wordnik. We had the best response to “sciencey” suffixes (wordology) and “persony” suffixes (wordist, wordette).

PS where are you based? We’re always looking to hire good iOS devs :-) And then maybe you could make an in-house app for us. :-)

Mike also investigated using our API for his game, newly named Wordogram:

I’m still trying to imagine the best way to use your service to power a hint type in my game. Basically, the goal of Wordogram is to find the ‘Secret Word’ by narrowing down the letters by guessing other words and being told how many letters are shared between each guess and the Secret Word. I have various basic hints like ‘Reveal a letter in the Secret Word’ or ‘Remove 5 letters not in the Secret Word’, but at the end you may be left with a set of letters which make up several anagrams. There is no hint currently available that will help you once you’re at that stage. This can be frustrating particularly when there are a ton of anagrams, or if your brain just can’t see any possible words with the given letters.

By the way, Wordnik was only on the app store for less than 2 days at the time you contacted me and I did 0 marketing as it was more of a test release, so changing the name had no impact on brand association.

Oh, and we sent Mike a Wordnik T-shirt, too, and a few days later we got another update:

Wordogram is officially on the app store, so no more conflict. Thanks again for being so cool about it, and for the shirt as well.

Is this kind of approach going to work every time? In cases where there’s obvious intent to infringe, probably not. But with a name like Wordnik, it’s likely that anyone else who likes it is also going to be a friendly, word-loving sort. So why not assume goodwill to start with, and go from there?

Even though we love our trademark lawyers (no, really, they’re the best, if you have any trademark needs at all, call them asap), we prefer to use their valuable time helping us register new stuff, not policing guys who didn’t mean any harm.

From our point of view, this was the best possible outcome. We defended our trademark; we met a cool, kindred-spirit developer and had a fun conversation; and we found a new word game to play (and possibly gained another API client). And it’s likely none of this would have happened if we’d sent a pissy email, guns blazing.

[Photo: CC BY-NC 2.0 by dorisnight]


APIhub Has Swagger (And a Whole Lot More)

Last week, a new community built around APIs – APIhub – was launched by MuleSoft. Given the activity in the API space, and the number of services popping up around the world of APIs, what differentiates APIhub, other than those sexy Swagger-powered interactive docs?

First visit to the site shows an impressive number – 13,000+ APIs. That’s a lot of APIs, and a seemingly impossible number to actually navigate. As some have pointed out, there are 10 to 20 times that many public APIs out there, and the growth is accelerating. So rises the need for API discovery, and more importantly, collaboration around APIs.

Just as source code has moved to GitHub and Q&A to Stack Overflow, APIhub hopes to be the center of API collaboration. Applications are becoming more and more centered around external services, and your quality of service can be affected by issues outside your direct control.

What sort of discussions would make APIhub succeed?

“This API is running in a single EC2 availability zone, and it’s US-east” might raise your eyebrow and make you think twice about relying on it for your app’s authentication.

“I toyed with 10 different client libraries for this service, and XXX was by far the best.”

“The documentation says the input param is a GET but it’s really a POST”. (No really, this happens all the time.)

This isn’t just for the developers either. Maybe a provider doesn’t know that there’s a huge issue in their driver? Maybe their syntax sucks? Or maybe they don’t know that a bunch of users want a driver written for the Go language?

Supporting the Swagger specification shows that openness is intended in the offering – it’s pretty awesome to see interactive documentation against the Facebook Graph API or GitHub. Now exposed via Swagger JSON, these APIs can take advantage of the growing ecosystem around Swagger, which means visualizing requests, responses, models, and generating client code that follows the consumer, not necessarily just the service provider.

Hopefully APIhub can bring collaboration to APIs the way GitHub has to code. There’s an audience of both producers and consumers who can move their products forward much faster with a solid community.


A Day in the Life of a Wordnik Dev: Will Fitzgerald

Continuing our series delving into the lives of Wordnik developers, today we talk to Will Fitzgerald, aka @willf, Lead of the Analytics Platform.

Will fills us in on his love of music, his hatred of beets, and his conspiracy theory about winning (or not winning) the Friday Nerf gun bell shoot-off.

What’s your favorite coding editor/IDE? Why?

If I’m working on code for Wordnik.com, I use Sublime Text. It loads quickly and it’s easy on the eyes. If I’m writing Scala code, I use IntelliJ. It’s ugly and loads slowly, but it does pretty well for refactoring and finding where things are defined and where they are used (much of my programming these days is spent in code archeology).

Mostly, I just want my editors to get out of my way, and let me get into the flow of testing and coding. It’s sacrilege in a Unix shop to say, but I miss Visual Studio.

What’s your beverage of choice?

I recently stopped drinking coffee, and have switched to tea, and not your fancy tea, either. A pot of milky Red Rose or Lipton’s gets me through the day. That, and several glasses of Adam’s ale.

The best lunch within five blocks of the office?

Right across the street is The Tofu House. But it’s really a general Korean restaurant, and I love their bulgogi and bibimbap, and their side dishes. Plus, they give you a stick of melon gum on your way out.

What’s your favorite music to listen to while working?

I don’t listen to a lot of music when I work, but when I do it’s usually a cappella or folk. I listen to home made Sacred Harp recordings, Irish sean nós, and groups and singers such as Anonymous 4, Blind Willie Johnson, and Iris Dement.

What are your favorite languages?

I hate them all and love them all. I do most of my coding in Ruby or Scala: Ruby because Wordnik.com is a Rails app on the front end, and for data scripting; Scala because it is our company standard for back-end coding. I’m glad it exists, and that we use it. I hate it much less than Java, and really like certain aspects, like type safety and the focus on functions and immutability.

Sometimes I use R for data analysis, or CoffeeScript when I need to write JavaScript. I run a weblog, lispjobs.wordpress.com, which provides a community service for people looking for Common Lisp, Scheme, or Clojure jobs. Does that hint at which languages I hate the least?

Which language do you think is terrible?

Most programmers cut their teeth on Java, C, or C++. I think these tend to engrain bad habits of mutability, verbosity, and a weird kind of focus on efficiency. I’m not against efficiency, of course, but, forced to choose, I’d rather have a correct and tested program than a fast one.

What was your first language? When did you learn it?

Good ol’ Basic on an Apple II+, with a whopping 32 KB of main memory. I was so glad when I got an expansion card that allowed me to display 80 columns and lowercase letters. I’d type in programs from BASIC Computer Games by David Ahl, and twiddle with them until they worked.

The first computer language I ever wrote was a replacement for .bat files on PC computers, which was widely used at the company I worked for (the Upjohn Company, now part of Pfizer). I wrote this in Turbo Pascal, which was a fine language once they fixed floating point arithmetic.

Where do you go for your tech news?

I rely mostly on my Twitter feed these days; but I dip into Hacker News, too, even though the cool kids disdain it.

Where do you go for help?

I usually do a general search on Bing, and, if that doesn’t work, Google. As someone who worked on Bing, I never know whether to be glad or disappointed when a failed Bing search also fails on Google.

What’s the best thing you’ve read about coding lately?

Dave Fayram, a former colleague of mine for whom I have a lot of respect, wrote an essay, “FizzBuzz, A Deep Navel to Gaze Into,” which has a sweetly fine introduction to why one might care about category theory, even for silly and useless programs like FizzBuzz. (Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.)

I admire how he took this very ridiculous thing, which is sometimes used as a basic screening test for programmers, to introduce monoids and the option type.
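[For readers who haven't met the idea, here is a minimal Python sketch of that style of FizzBuzz. It is not the code from the essay; the helper names `combine` and `word_if` are made up for illustration. An optional string (`None` or a word) is combined monoid-style, with `None` acting as the identity, and the number is only used as a fallback.]

```python
from typing import Optional


def combine(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Monoid-style append for optional strings; None is the identity."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def word_if(n: int, divisor: int, word: str) -> Optional[str]:
    """Return the word when n is a multiple of divisor, otherwise None."""
    return word if n % divisor == 0 else None


def fizzbuzz(n: int) -> str:
    # Combine the optional "Fizz" and "Buzz"; fall back to the number
    # itself only when neither rule fired.
    result = combine(word_if(n, 3, "Fizz"), word_if(n, 5, "Buzz"))
    return result if result is not None else str(n)
```

[Printing `fizzbuzz(i)` for `i` in `range(1, 101)` gives the classic output. Adding a third rule is one more `word_if` passed through `combine`, with no new branches.]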

What’s the worst thing you’ve read about coding lately?

Well, it’s not about coding, but Einstein’s list of demands on his wife, Mileva Marić, was pretty bad.

What’s your favorite book?

I’m one of those people to whom others say, “Have you read all those?” when they come into our house (and they haven’t seen the upstairs library or the boxes in the third floor storage room). So it’s hard to choose a favorite book.

But when I have to read a novel, and it’s gotta be good, and it’s gotta be now, I’ll pick up any of Dorothy L Sayers’s Peter Wimsey books, especially The Nine Tailors, and read the dickens out of it (again). Hint: if you haven’t read these, you might want to read them in order, and don’t bother with the ones that Sayers herself didn’t write.

If you had a time machine, what would you go back and tell your younger self?

My experiences formed me into what I am today; would I really want to go back and turn myself into a different person? It’s a hard question. But I think I’d go back to my undergraduate self and have a talk: I’d dropped a planned major in statistics to do a major in linguistics, and I’d encourage me to do both. But, as Blind Willie Johnson just sang as I wrote this paragraph, “Ain’t nobody’s fault but mine.”

Also, I’d go back and tell my first grade self not to eat beets when I had the flu; I can’t stand beets to this day because of Sauce-Bearnaise Syndrome.

If you weren’t a dev, what would you be?

I don’t think of myself primarily as a “dev,” but I recently read someone’s tweet about how if the emerging field of “digital humanities” had been a thing in their early days, that’s what they’d do. It’s mostly something that academics get to do, so I’m glad to be working in the intersection of words and numbers.

What do you like to do when you’re not coding?

I really like to sing, especially shape note singing with others from a book called The Sacred Harp. It’s musically and spiritually raw, and provides a communal and noisy outlet in my generally solitary and quiet life.

What’s your strategy for hitting the bell in Wordnik’s Friday contest?

Basically, I just take a dollar out of my pocket and put it in the pot, with no hope of winning. It’s clearly a plot by certain people at Wordnik (I will not name names) to swindle us out of dozens of dollars a year.


Tech Perspective on Wordnik Related Content

Today Wordnik announced the public beta of Related Content – an application which identifies connections between related pieces of content and recommends the most relevant content to users. You can read all about it on TechCrunch and see it across the web – it’s the first of many new products to use the technology that we’ve been working on over the last 4 years.

But how does this work? Let’s dive into what makes this product special from a technology point of view.

To make Related Content easy for publishers, and of course for developers, to integrate, we’ve taken a pure HTML/JavaScript approach. There are three important steps in this process.

First, publishers use HTML5 microdata to annotate their articles. Why? To give control over metadata and help find the content. You don’t want to crawl ads and comments in an article when trying to understand the body, do you? Crawling is hard and a never-ending job (folks like Diffbot are doing an exceptional job at crawling as a service). But more importantly, the publisher wants to have some say in how content metadata is defined. Don’t leave it up to the crawler, that’s our motto.
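To give a feel for why annotations beat crawling, here is a small Python sketch that pulls only the annotated parts out of a page. The markup, the schema.org `Article` vocabulary, and the property names `headline` and `articleBody` are assumptions for illustration; they are not necessarily the properties Related Content expects.

```python
from html.parser import HTMLParser

# Hypothetical article markup; the vocabulary and property names are
# assumptions, not Wordnik's documented schema.
ARTICLE_HTML = """
<article itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="headline">Growing the World's Hottest Peppers</h1>
  <div class="ad">Buy more stuff!</div>
  <div itemprop="articleBody">Start the seeds indoors, and wear gloves.</div>
  <div class="comments">First!</div>
</article>
"""


class MicrodataExtractor(HTMLParser):
    """Collect the text inside elements that carry an itemprop attribute.

    Sketch only: it does not handle void elements such as <br> or
    nested itemprops.
    """

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop currently being read, if any
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            self._depth += 1
            return
        prop = dict(attrs).get("itemprop")
        if prop:
            self._current = prop
            self._depth = 1
            self.properties[prop] = ""

    def handle_endtag(self, tag):
        if self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data


def extract_properties(html):
    """Return a dict of itemprop name -> stripped text content."""
    parser = MicrodataExtractor()
    parser.feed(html)
    return {name: text.strip() for name, text in parser.properties.items()}
```

Calling `extract_properties(ARTICLE_HTML)` returns the headline and the body and skips the ad and the comments, because the publisher never marked those up.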

HTML also helps us understand where to put the recommendations, and is an awesome way to give control to the developers who integrate Related Content. That’s really it for integration – HTML5 and JavaScript. You don’t need to run a special server or integrate Related Content into your templating system (we provide a WordPress plugin to automate the HTML integration steps) so we won’t be adding tables to your database, requiring a bigger server, etc. In fact, folks successfully run Related Content on their site with no templating system, only serving statically via Apache. I won’t use the “cloud” term, but effectively, this is providing a cloud-based integration between your content and our recommendation system.

Publishers who want to override the styles can do so via simple CSS (you can see nice examples at Harvard Book Store). That takes much of the customization out of our hands and gives it directly to the publisher. No nasty control panels for choosing your font, etc.; publishers can simply inherit the CSS from their site or override it with their own magic. It’s a win for developers and publishers alike.

Next, the communications between Wordnik’s servers and your site. The last thing we want to do is slow down your site by serving up content synchronously. Synchronous calls on the web are a problem waiting for an opportunity. Even via AJAX, we don’t want to block those precious HTTP connections, slow down loading the DOM, etc. So what do we do?

If you’ve followed Wordnik’s technology for a bit, you’ll have seen that we sponsor the Atmosphere WebSocket framework. The connection between your site and Wordnik over WebSockets is entirely non-blocking (with the exception of a quick HTTP upgrade request, which returns a 101 status). So your DOM loads, your images load (even your ads) without interference from us.

What else does it do? It lessens the load on our servers as well, as WebSockets can dramatically reduce thread contention by effectively “parking” threads while they wait for a response. Of course, we want to process the responses as quickly as possible, but with heavy computation, heavy loads, and complex content, there’s a chance that we may need to take more time than desirable. No worries though! WebSockets and the Atmosphere framework make it easy for everyone.

So when we compute recommendations, they are injected directly into the page as HTML5 fragments. Huh?

Providing pre-computed HTML from our servers accomplishes a number of goals. First, the overhead from our JavaScript is minimal – in fact, our dependency-free JavaScript is only 27 KB! No crazy rendering logic or expensive calls to our server. We deliver a pre-computed chunk of HTML which gets injected right where you want it.

Finally, the server. There is a ton going on here, as you would expect. Is Wordnik doing search? A big Lucene index? Hive? You’ve probably seen our support for MongoDB – is this just a big, fancy query against Mongo?

Not really. With the help of our linguists and NLP engineers, and the Scala programming language, we have developed an awesome NLP pipeline which takes text, cleans it and extracts concepts, featurizes and optimizes them, then runs them through a matching algorithm to find the best possible matches. This isn’t keyword search but a whole new level of understanding of text.

Lots of plumbing behind the scenes

There is an enormous amount of plumbing, storage, and compute power required to run a system like this, and in the process of developing it, we’ve created tools like Swagger and learned to optimize the heck out of EC2 and MongoDB. Most recently, with @casualjim joining the Wordnik team, the Scalatra framework has been integrated into our technology stack, and is efficiently serving up functionality through our micro-service framework, inspired by the awesome folks at Netflix.

Requests for recommendations run through a series of services across our Swagger-enabled framework. A centralized service (named Caprica) keeps track of all the servers running in the cluster and tells each service who to talk to. When load goes up, a new server is launched and added to the cluster. Caprica tells the other servers how to talk to it and it goes off to work. When load decreases, Caprica will take the server out of the loop, then safely shut it down. This is the promise of elastic computing, and Caprica helps us keep things in order. We hope to release Caprica to open source when possible.
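Caprica itself isn't public, so the following is only a toy Python sketch of the registry idea described above: one central object tracks which servers provide each service, adds servers as they launch, and removes them before shutdown. The class, method names, service name, and addresses are all hypothetical.

```python
class ServiceRegistry:
    """Toy stand-in for a central registry such as the one described above."""

    def __init__(self):
        # service name -> list of server addresses currently in the cluster
        self._servers = {}

    def register(self, service, address):
        """Add a newly launched server to the cluster."""
        servers = self._servers.setdefault(service, [])
        if address not in servers:
            servers.append(address)

    def deregister(self, service, address):
        """Take a server out of the loop so it can be shut down safely."""
        servers = self._servers.get(service, [])
        if address in servers:
            servers.remove(address)

    def lookup(self, service):
        """Tell a caller which servers currently provide a service."""
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._servers.get(service, []))


# Load goes up: a second server joins. Load drops: the first one leaves.
registry = ServiceRegistry()
registry.register("recommendations", "10.0.0.1:8080")
registry.register("recommendations", "10.0.0.2:8080")
registry.deregister("recommendations", "10.0.0.1:8080")
```

A real registry would also need health checks and a way to drain in-flight requests before a server is removed; the sketch leaves both out.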

You can find out more about our technology on this blog and in our GitHub repository at http://github.com/wordnik. Stay tuned for more details on how Wordnik technology works!


Swagger-core 1.2 released! Here’s what it means for you

When we first architected Swagger, the goal was to programmatically describe our APIs with the purpose of keeping our documentation always up-to-date. The specification grew in importance internally, and helped us auto-generate a sandbox UI, client SDKs, server skeletons, and even document legacy APIs.

Now about 15 months after being open-sourced, we have integrated support across nearly all popular server languages, including PHP, Python, Java, Scala, Ruby, and JavaScript. Swagger is right in the middle of huge API infrastructures such as Klout and IGN, and is helping make tools for developer-focused SDKs from the likes of Singly and the 3Scale API management platform.

We’ve received a ton of feedback and contributions to both our framework and the specification from bright and ambitious developers around the world. It’s been a great experience!

Today we are announcing swagger-core 1.2 which facilitates a key request – the integration of the resource and API listings with the server framework. Here is a brief background.

Swagger supports a Resource Listing – that is, an inventory of all APIs on a particular system – similar to a site map for the API. The resource listing contains pointers to API Declarations, which are used to describe the actual API functionality and the models either produced or consumed.

The original resource listing was auto-generated by the Swagger framework under the path `resources.{format}` (where {format} is either .xml or .json). The framework also encouraged placing the API Descriptions under the *root* of each API path. So /pet.json would contain the API Description for the Pet API, etc. The problem is, many people already have dedicated uses for the root resource on their APIs.

With that in mind, we’ve changed the default behavior of the swagger-core framework to produce the Resource Listing and API Descriptions as follows:

  • The Resource Listing defaults to the GET method on /api-docs.{format}
  • Each API Description defaults to the GET method on /api-docs.{format}/{api-name} where {api-name} is the name of each API.

As trivial as this change sounds, it drastically improves the integration path. Now each API does not need to extend Swagger code, and the `/resources.{format}` path is no longer “reserved” (we were surprised to see how many people already had a /resources API!).
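As a quick illustration of the new defaults, this Python sketch builds the two paths from the patterns above. The function names are made up for the example and are not part of swagger-core; it also assumes `{format}` is given without the leading dot.

```python
SUPPORTED_FORMATS = ("json", "xml")


def _check_format(fmt):
    """Reject anything other than the two formats mentioned above."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")


def resource_listing_path(fmt="json"):
    """Default path of the Resource Listing: /api-docs.{format}"""
    _check_format(fmt)
    return f"/api-docs.{fmt}"


def api_description_path(api_name, fmt="json"):
    """Default path of one API's description: /api-docs.{format}/{api-name}"""
    _check_format(fmt)
    return f"/api-docs.{fmt}/{api_name}"
```

For the Pet API, that gives `/api-docs.json` for the listing and `/api-docs.json/pet` for the description, leaving both `/pet` and `/resources` free for your own use.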

You can see the latest code on GitHub at https://github.com/wordnik/swagger-core. All samples are updated and the compiled binaries have been deployed to Maven Central. In addition, there is updated support for the following frameworks:

  • JAX-RS
  • Play2
  • RESTEasy
  • Grails

Updates to other frameworks will be following soon.

Finally, remember Swagger is free, open-source software, distributed under the Apache 2 license.

Feedback welcome! See our updated sample app.

The Wordnik Engineering Team



A Day in the Life of a Wordnik Dev: Robert Voyer

Have you found yourself wondering, Who are those guys who work at Wordnik? Well, you’re in luck. Today we’re kicking off a new series, A Day in the Life of a Wordnik Dev, and first up is Robert Voyer, Computational Linguist.

Robert answers some questions about his tech set up, his favorite (and less favorite) programming languages, and why he could eat Indian fast food every day.

What’s your favorite coding editor/IDE? Why?

Emacs (and the ENhanced Scala Interaction Mode for Emacs). I suppose it’s considered old-school, but I like it because it feels more lightweight than a lot of the standard IDEs for Java.

What’s your beverage of choice?

Coffee in the morning (with cream), and Racer 5 in the evening. I recently switched to lattes but wouldn’t call myself a latte guy. That’s too froufrou.

The best lunch within five blocks of the office?

Curry Up Now, because you can’t really beat Indian food in a burrito.

What’s your favorite music to listen to while working?

I subscribe to Rdio and listen to lots of different stuff. Typically I listen to instrumental music. It’s less distracting. I recently discovered Tycho, and have also been listening to a band called Oval (their album “O”) and another band called How to count one to ten (“Blue Building Blocks”).

What are your favorite languages?

Python and Scala.

Python was a first favorite because it felt elegant and easy to read. Scala is a great way to move into statically typed languages because it actually feels like you’re coding in a dynamic language like Python, but you get additional performance improvements and compile-time guarantees.

Which language do you think is terrible?

I don’t like to be a language Nazi, but my least favorite language is PHP. It’s syntactically ugly and feels like a weird mishmash of other languages.

What was your first language? When did you learn it?

My first language was Perl and I learned it in 2002. I was writing simple Perl scripts for pushing around text.

Where do you go for your tech news?

Twitter and Prismatic.

Where do you go for help?

Stack Overflow and Ivan (@casualjim).

What’s your favorite movie/book/website?

My favorite literary book is The History of Love by Nicole Krauss. My favorite tech book is Foundations of Statistical Natural Language Processing. This has been a Bible for me in terms of statistical NLP. My favorite movie is Rushmore, and my favorite website is FiveThirtyEight.

If you had a time machine, what would you go back and tell your younger self?

If I could go back to college and never miss a class and take advantage of all the opportunities there, I would. Being in school is such a privilege and so much fun.

If you weren’t a dev, what would you be?

I don’t know. Maybe I’d still be in school as a student or a teacher.

What do you like to do when you’re not coding?

I love to hang out with my wife and daughter. My daughter is almost 8 months old and is really starting to develop her own crazy personality. Otherwise, I’m a big board game fan. I also enjoy playing Ultimate Frisbee and writing and recording music. My main instruments are the piano and the guitar.

What’s your strategy for hitting the bell in Wordnik’s Friday contest?

I try not to think too hard about it, and I shoot a few times during the day to get a sense of the bias of the gun*.

*A Nerf gun of course.


Data Modeling with MongoDB / QConSF Recap

Last week a three-day tech conference, QCon, came to San Francisco, and I was lucky enough to be asked to present lessons learned from Wordnik’s 2.5 years of data modeling experience with MongoDB.

First, briefly about tech conferences. You can spend every waking moment attending tech conferences – on the surface, there’s no shortage of them. The difference is really in the content. You have vendors pitching their products, “non-technical” tech discussions for managers, etc. Sometimes it’s hard to tell the difference without looking at the speaker list.

Friday at QCon started off with a keynote from John Hughes about testing race conditions. During the keynote John was programming C and running automatically-generated test suites in Erlang. When I saw this (and yes, seeing C code does bring me back to a happy point in life), I knew this would be a really technical conference. Coding in C during a keynote! Great start.

I also saw talks from Kenny Gorman (@kennygorman) about scaling NoSQL, Gil Tene (@giltene) about JDK performance optimization, and Adrian Cockcroft (@adrianco) about Netflix’s amazing AWS infrastructure.

My talk was a recap of what Wordnik has done with NoSQL over the last few years. We’ve come a long way in terms of how we use MongoDB, thanks to heavy profiling, analysis, trial and error, as well as migration to the Scala programming language.

The first thing about migrating to MongoDB is as follows:

Seriously – the point of a technology shift like non-relational databases is disruption. You have to change a lot of code and practice to really take advantage of it. We have data to back that up.

When we first migrated to MongoDB, the whole NoSQL movement was fairly young, and we honestly weren’t ready to bet the entire company on it (see this slide deck), so we made the datastore “configurable” at runtime. This helped us migrate live, without downtime, and with the ability to switch back if we needed to (we never exercised the switchback, though). It came with an efficiency cost, however.

Internally, our software was using the same models, identifiers, and structure for both MySQL and MongoDB. Sure, it worked, but it didn’t take advantage of the hierarchical, document nature of MongoDB. That cost us in terms of both disk space and performance. Ultimately we changed our data models, which brought huge speed increases and more flexibility in querying.
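
To make the difference concrete, here is a minimal sketch of a flat, relational-style model next to a nested, document-style one. The case classes, field names, and sample data are invented for illustration and are not Wordnik's actual schema; the sketch uses plain Scala and no MongoDB driver.

```scala
// Hypothetical models, invented for illustration; not Wordnik's actual schema.
object DataModelSketch {
  // Relational style: each row stands alone and is joined by id at query time.
  case class WordRow(id: Long, word: String)
  case class DefinitionRow(id: Long, wordId: Long, text: String)

  // Document style: definitions are embedded in the word they belong to,
  // so a single read returns the whole hierarchy.
  case class Definition(text: String)
  case class WordDoc(word: String, definitions: List[Definition])

  // Fold the flat rows into nested documents (the "join" happens once, up front).
  def toDocs(words: List[WordRow], defs: List[DefinitionRow]): List[WordDoc] = {
    val byWord = defs.groupBy(_.wordId)
    words.map { w =>
      WordDoc(w.word, byWord.getOrElse(w.id, Nil).map(d => Definition(d.text)))
    }
  }

  val words = List(WordRow(1L, "cat"), WordRow(2L, "dog"))
  val defs = List(
    DefinitionRow(10L, 1L, "a small domesticated feline"),
    DefinitionRow(11L, 1L, "a jazz enthusiast"),
    DefinitionRow(12L, 2L, "a domesticated canine")
  )
  val docs: List[WordDoc] = toDocs(words, defs)
}
```

Keeping the flat shape in a document store means paying for the extra ids and lookups without getting the single-read benefit of the nested shape.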

You can see the complete slideshare here, with more details about optimizing indices, query syntax, and object marshaling:

Thanks to the QCon team for a great conference, and to the attendees for asking excellent questions.

Json4s: One AST to rule them all

Every web framework in Scala, of which we have plenty, seems to insist on writing its own JSON library. But various tools rely on the AST from lift-json. This is both good and bad, as there are a number of gaps in the lift-json version, the most notable being that it supports no type other than Double for representing decimal numbers.

A second problem is that because of the large number of dependencies in the Lift project, it’s typically a bottleneck for upgrading to scala-2.10. Third, it just seems an odd place for a nice library like that to live. I’m hopeful that more libraries like Play, Spray, etc., will contribute to this project so that instead of a fragmented JSON landscape, we get a homogeneous one. All of them have a JSON AST with the same types defined in it.

So I’ve set out to set lift-json free from the Lift project and to add some improvements along the way. The first improvement I’ve made is that you now have the choice between using BigDecimal or Double for representing decimal numbers, so you can use the library also to represent invoices etc.
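
A quick illustration of why the choice matters for things like invoices. This sketch uses only the Scala standard library, not the json4s API, and the object and value names are made up for the example.

```scala
object DecimalSketch {
  // Binary floating point cannot represent 0.1 or 0.2 exactly,
  // so the sum drifts away from 0.3.
  val asDouble: Double = 0.1 + 0.2

  // A BigDecimal built from the decimal text keeps every digit.
  val asBigDecimal: BigDecimal = BigDecimal("0.1") + BigDecimal("0.2")

  val doubleIsExact: Boolean = asDouble == 0.3                        // false
  val bigDecimalIsExact: Boolean = asBigDecimal == BigDecimal("0.3")  // true
}
```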

The second change I made is to add several backends for parsing. The original lift-json parser is still available in the native package but you can now also use Jackson as a parsing backend to the JSON library. It’s really easy to add more backends so Smarter JSON, spray-json, etc., are all in the cards.

I looked at what Play2 has in their JSON support, and their main thing seems to be a type class based reader/writer story instead of the formats based one from lift-json. So for good measure I also added that system to this library. In general I like typeclasses for this type of stuff, but in this case I actually think that lift-json has a nicer approach by assembling all the context into a single formats object instead of requiring many type classes to be in scope at any given time when you want to parse or write JSON. There are a few more convenience methods added on the JValue type that allow you to camelize or underscore keys, remove the nulls etc.
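
Roughly, the two styles differ as in the sketch below. The toy AST, the `Reader` type class, and the `Formats` case class are all invented for this example; they are not the Play or json4s types, only an approximation of the two shapes under that assumption.

```scala
object ReaderStyles {
  // Toy AST, invented for this sketch; it is not the json4s AST.
  sealed trait Json
  case class JStr(value: String) extends Json
  case class JNum(value: BigDecimal) extends Json

  // Style 1: one type class instance per target type, resolved implicitly.
  trait Reader[A] { def read(json: Json): Option[A] }

  implicit val stringReader: Reader[String] = new Reader[String] {
    def read(json: Json): Option[String] = json match {
      case JStr(s) => Some(s)
      case _       => None
    }
  }

  implicit val intReader: Reader[Int] = new Reader[Int] {
    def read(json: Json): Option[Int] = json match {
      case JNum(n) if n.isValidInt => Some(n.toInt)
      case _                       => None
    }
  }

  def as[A](json: Json)(implicit reader: Reader[A]): Option[A] = reader.read(json)

  // Style 2: a single context object carries every conversion rule.
  case class Formats(trimStrings: Boolean)

  def readString(json: Json)(implicit formats: Formats): Option[String] = json match {
    case JStr(s) => Some(if (formats.trimStrings) s.trim else s)
    case _       => None
  }
}
```

With the first style, every type you read needs its own instance in scope; with the second, one implicit value is enough, which is the convenience described above.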

I spoke with Joni Freeman about the general idea of this library, and he showed me what he did on a branch for lift-json 3.0 so I incorporated his work into json4s too. It basically means that a key/value pair used to represent a JObject is no longer a valid JSON AST node and there are some extra methods to work with those fields. All of this is explained in the README of the json4s project.
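
As a rough picture of that change, here is a cut-down AST in which a field is just a pair and not a node. The types mirror the idea only; the README has the real definitions, and the `mapFields` helper is a hypothetical name for this sketch.

```scala
object FieldSketch {
  // Toy AST, invented for this sketch; see the json4s README for the real one.
  sealed trait JValue
  case class JString(s: String) extends JValue
  case class JInt(num: BigInt) extends JValue

  // A field is a plain tuple, not a JValue, so it cannot appear where a
  // standalone JSON value is expected.
  type JField = (String, JValue)
  case class JObject(obj: List[JField]) extends JValue

  // Field-level helper: transform the pairs without treating them as nodes.
  def mapFields(o: JObject)(f: JField => JField): JObject = JObject(o.obj.map(f))

  val joe = JObject(List("name" -> JString("joe"), "age" -> JInt(35)))
  val upper = mapFields(joe) { case (k, v) => (k.toUpperCase, v) }
}
```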

I’ve also added support for json4s to the dispatch reboot project so you can use it there just like you can with lift-json. Furthermore, Rose Toomey let me know that Salat is now using json4s as its JSON library instead of lift-json.

Some of the improvements I still want to make are for Scala 2.10. I want to use the Mirror API to be able to reflect over more stuff than just case classes. For some of the use cases we have at Wordnik, it makes sense to be able to use a few annotations (yes annotations unfortunately) to have certain keys be ignored and so on. I’ll probably steal some of that from the Salat project so that there is still some degree of consistency between our libraries.

I also want to figure out how we can possibly make the AST based approach useful with huge data structures, as it’s not inconceivable to want to send 100MB or 10GB JSON docs over the wire. At that moment a lazy approach actually makes a lot of sense, but I’m open to suggestions on how this could be achieved efficiently without breaking the AST model.

So if you’re using JSON in Scala, consider using or contributing to the json4s project.

As a bonus, here’s an example of how to deserialize a case class from JSON:

scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> implicit val formats = DefaultFormats // Brings in default date formats etc.
scala> case class Child(name: String, age: Int, birthdate: Option[java.util.Date])
scala> case class Address(street: String, city: String)
scala> case class Person(name: String, address: Address, children: List[Child])
scala> val json = parse("""
         { "name": "joe",
           "address": {
             "street": "Bulevard",
             "city": "Helsinki"
           },
           "children": [
             {
               "name": "Mary",
               "age": 5,
               "birthdate": "2004-09-04T18:06:22Z"
             },
             {
               "name": "Mazy",
               "age": 3
             }
           ]
         }
       """)

scala> json.extract[Person]
res0: Person = Person(joe,Address(Bulevard,Helsinki),List(Child(Mary,5,Some(Sat Sep 04 18:06:22 EEST 2004)), Child(Mazy,3,None)))

Using Crowdsourcing to make relevance judgments: A customer story

CrowdConf, the annual conference on crowdsourcing, had a number of interesting sessions on both microtasking (that is, paying people to do very small tasks, such as systems built on top of Amazon’s Mechanical Turk) and on crowdfunding (that is, asking many people to pay relatively small amounts of money, such as Kickstarter and Indiegogo).

We presented Wordnik’s story of how we use crowdsourcing to improve our recommendations for Related Content, which is available as a WordPress plugin and as a widget for any web page (including Tumblr and Blogger).

Because our recommendations are based on machine learning and personalization algorithms, it’s important that we evaluate the recommendations our software makes. If we want to add a new feature to our machine learning model, we need to test whether the new model is better than the current one.

The gold standard for doing this is to ask human judges whether a particular recommendation is relevant (and to rate how relevant it is). When we were creating our first models, we did these evaluations ourselves, and then we hired a part-time worker to do them. Our worker was great at sussing out detailed issues the early recommendation system had, but one person doesn’t scale. It takes too long to get a decent representative sample.

Fortunately, microtasking sites like Mechanical Turk make it possible to do on-demand evaluations quickly and at scale. Working with CrowdFlower, we have set up a workflow that allows us to quickly:

  1. Create sets of treatment and control recommendations
  2. Send recommendations that haven’t been previously evaluated to CrowdFlower
  3. Send the evaluations to our internal experiment tracking system
  4. Review and report the comparisons between treatment and control
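
Steps 2 and 4 of the list above can be sketched in a few lines. Everything here is hypothetical: the case classes, the "treatment"/"control" labels as strings, and the integer relevance scale are assumptions for illustration, not our experiment tracking system or the CrowdFlower API.

```scala
object RelevanceSketch {
  // All names and scales here are hypothetical, chosen only for illustration.
  case class Recommendation(id: String, arm: String) // arm: "treatment" or "control"
  case class Judgment(recId: String, relevance: Int) // assumed scale: higher is more relevant

  // Step 2: only recommendations without a prior judgment need to be sent out.
  def unevaluated(recs: List[Recommendation], seen: Set[String]): List[Recommendation] =
    recs.filterNot(r => seen.contains(r.id))

  // Step 4: mean relevance per arm, from whatever judgments have come back.
  def meanByArm(recs: List[Recommendation], judgments: List[Judgment]): Map[String, Double] = {
    val armOf = recs.map(r => r.id -> r.arm).toMap
    judgments
      .filter(j => armOf.contains(j.recId))
      .groupBy(j => armOf(j.recId))
      .map { case (arm, js) => arm -> js.map(_.relevance).sum.toDouble / js.size }
  }
}
```

A comparison of means like this is only a starting point; a real report would also need enough judgments per arm to say whether the difference is meaningful.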

If you’re interested in helping make our recommendation and analytics systems even better, peruse our jobs and send me a note (will/at/wordnik.com) — or write if you have any other questions.

The slide deck attached is a slightly modified version of what we presented at CrowdConf: Using crowdsourcing to make relevance judgments: A customer story
