Turbocharging Solr Index Replication with BitTorrent
Many of you probably use BitTorrent to download your favorite ebooks, MP3s, and movies. At Etsy, we use BitTorrent in our production systems for search replication.
Search at Etsy
Search at Etsy has grown significantly over the years. In January of 2009 we started using Solr for search. We used the standard master-slave configuration for our search servers with replication.
All of the changes to the search index are written to the master server. The slaves are read-only copies of the master which serve production traffic. The search index is replicated by copying files from the master server to the slave servers. The slave servers poll the master server for updates, and when there are changes to the search index the slave servers download the changes via HTTP. Our search indexes have grown from 2 GB to over 28 GB over the past 2 years, and copying the index from the master to the slave nodes became a problem.
The Index Replication Issue
To keep all of the searches on our site working fast we optimize our indexes nightly. Index optimization creates a completely new copy of the index. As we added new boxes we started to notice a disturbing trend: Solr’s HTTP replication was taking longer and longer to replicate after our nightly index optimization.
After some benchmarking we determined that Solr’s HTTP replication was only allowing us to transfer between 2 MB and 8 MB of data per second. We tried various tweaks to HTTP replication, adjusting compression and chunk size, but nothing helped. This problem was only going to get worse as we scaled search. We experienced similar issues when deploying a new slave server: pulling all of our indexes at once at only 8 MB per second could take over 4 hours, with our 3 large indexes consuming most of the transfer time.
Our 4 GB optimized listings index was taking over an hour to replicate to 11 search slaves. Even if we made HTTP replication go faster, we would still be bound by our server’s network interface card. We tested netcat from the master to a slave server and the results were as expected: the network interface was flooded. Since the network itself could move data far faster than 8 MB per second, the problem had to be related to Solr’s HTTP replication.
The fundamental limitation with HTTP replication is that replication time increases linearly with the number of slaves. The master must talk to each slave separately, instead of all at once. If 10 boxes take 4 hours, scaling to 40 boxes would take over half a day!
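To make the linear scaling concrete, here is a back-of-the-envelope sketch. It assumes the 8 MB per second figure is the aggregate rate out of the master (the text does not say whether the rate is per slave or in total), and the function name is ours, not anything in Solr.

```python
def serial_replication_hours(index_gb: float, slaves: int, master_mb_per_s: float) -> float:
    """Hours to push one full index copy to every slave through a single master uplink.

    Assumption: master_mb_per_s is the total rate the master can sustain,
    shared by all slaves, so every extra slave adds one more full copy.
    """
    total_mb = index_gb * 1024 * slaves
    return total_mb / master_mb_per_s / 3600


# Figures from the text: a 4 GB index, 11 slaves, at most 8 MB/s.
print(f"11 slaves: {serial_replication_hours(4, 11, 8):.2f} h")
# Doubling the slave count doubles the time when the master is the bottleneck.
print(f"22 slaves: {serial_replication_hours(4, 22, 8):.2f} h")
```

Under that assumption the 11-slave case comes out at a little over an hour and a half, which is at least consistent with the "over an hour" we observed.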
We started looking around for a better way to get bits across our network.
Multicast Rsync?
If we need to get the same bits to all of the boxes, why not send the index via multicast to the slaves? It sure would be nice to only send the data once. We found an implementation of rsync which used multicast UDP to transfer the bits. The mrsync tool looked very promising: we could transfer the entire index in our development environment in under 3 minutes. So we thought we would give it a shot in production.
[15:25] <gio> patrick: i'm gonna test multi-rsyncing some indexes
from host1 to host2 and host3 in prod. I'll be watching the
graphs and what not, but let me know if you see anything
funky with the network
[15:26] <patrick> ok
....
[15:31] <keyur> is the site down?
Multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.
BitTorrent?
For those folks who have never heard of BitTorrent, it’s a peer-to-peer file sharing protocol used for transferring data across the Internet. BitTorrent is a very popular protocol for transferring large files. It’s been estimated that 43% to 70% of all Internet traffic is BitTorrent peer-to-peer sharing.
Our Ops team started experimenting with a BitTorrent package called herd, which sits on top of BitTornado. Using herd they transferred our largest search index in 15 minutes. They spent 8 hours tweaking all the variables and making the transfer faster and faster. Using pigz for compression and herd for transfer, they cut the replication time for the biggest index from 60 minutes to just 6 minutes!
Our Ops experiments were great for the one time each day when we need to get the index out to all the slave servers, but it would also require coordination with Solr’s HTTP replication. We would need to stop replication, stop indexing, and run an external process to push the index out to the boxes.
BitTorrent and Solr Together
By integrating BitTorrent protocol into Solr we could replace HTTP replication. BitTorrent supports updating and continuation of downloads, which works well for incremental index updates. When we use BitTorrent for replication, all of the slave servers seed index files allowing us to bring up new slaves (or update stale slaves) very quickly.
Selecting a BitTorrent Library
We looked into various Java implementations of the BitTorrent protocol and unfortunately none of these fit our needs:
- The BitTorrent component of Vuze was very hard to extract from their code base
- torrent4j was largely incomplete and not usable
- Snark is old, and unfortunately unstable
- bitext was also unstable, and extremely slow
Eventually we came upon ttorrent which fit most of the requirements that we had for integrating BitTorrent into the Solr stack.
We needed to make a few changes to ttorrent to handle Solr indexes. We added support for multi-file torrents, which allowed us to hash and replicate the index files in place. We also fixed some issues with large file (> 2 GB) support. All of these changes can be found in our fork of the ttorrent code; most of these changes have already been merged back to the main project.
How it Works
BitTorrent replication relies on Lucene to give us the names of the files that need to be replicated.
When a commit occurs the steps taken on the master server are as follows:
- All index files are hashed, a Torrent file is created and written to disk.
- The Torrent is loaded into the BitTorrent tracker on the master Solr server.
- Any other Torrents being tracked are stopped to ensure that we only replicate the latest version of the index.
- All of the slaves are then notified that a new version of the index is available.
- The master server then launches a BitTorrent client locally which seeds the index.
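The hashing step on the master can be sketched as follows. This is a simplified illustration of how BitTorrent hashes a multi-file torrent (the files are treated as one concatenated byte stream, cut into fixed-size pieces, and each piece gets a SHA-1), not the code in our ttorrent fork. The piece size and function name are assumptions, and a real Torrent file also carries file names, lengths, and tracker details.

```python
import hashlib
from typing import Iterable, List

PIECE_LENGTH = 256 * 1024  # assumed piece size; each torrent chooses its own


def piece_hashes(chunks: Iterable[bytes], piece_length: int = PIECE_LENGTH) -> List[bytes]:
    """SHA-1 each fixed-size piece of the files' concatenated bytes.

    `chunks` yields the contents of the index files in torrent order.
    A piece may span a file boundary, as in a multi-file torrent.
    """
    hashes = []
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while len(buffer) >= piece_length:
            hashes.append(hashlib.sha1(buffer[:piece_length]).digest())
            buffer = buffer[piece_length:]
    if buffer:  # the final piece is usually shorter than piece_length
        hashes.append(hashlib.sha1(buffer).digest())
    return hashes


# Two tiny "files" and a 4-byte piece size, just to show the piece boundaries.
print(len(piece_hashes([b"aaaaa", b"bbbbb"], piece_length=4)))
```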
Once a slave server has been notified of a new version of the index, or the slave polls the master server and finds a newer version of the index, the steps taken on the slave servers are as follows:
- The slave server requests the latest version number from the master server.
- The Torrent file for the latest index is downloaded from master over HTTP.
- All of the current index files are hash verified based on the contents of the Torrent file.
- The missing parts of the index are downloaded using the BitTorrent protocol.
- The slave server then issues a commit to bring the new index online.
When new files need to be downloaded, partial (“.part”) files are created. This allows for us to continue downloading if replication gets interrupted. After downloading is completed the slave server continues to seed the index via BitTorrent. This is great for bringing on new servers, or updating servers that have been offline for a period of time.
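The verify-then-download step on a slave boils down to comparing local piece hashes with the ones in the Torrent file. A minimal sketch, using in-memory bytes in place of index files and a hypothetical helper name:

```python
import hashlib
from typing import List


def missing_pieces(local: bytes, expected: List[bytes], piece_length: int) -> List[int]:
    """Return the indexes of pieces whose local SHA-1 does not match the torrent's."""
    missing = []
    for i, want in enumerate(expected):
        piece = local[i * piece_length:(i + 1) * piece_length]
        if hashlib.sha1(piece).digest() != want:
            missing.append(i)
    return missing


# Toy data: the master's index is three pieces long.
piece_length = 4
master = b"aaaabbbbcc"
expected = [hashlib.sha1(master[i:i + piece_length]).digest()
            for i in range(0, len(master), piece_length)]

# A slave whose download was interrupted before the last piece arrived.
slave = b"aaaabbbb"
print(missing_pieces(slave, expected, piece_length))  # [2]
```

Only the pieces in that list need to be fetched from peers, which is why an interrupted or stale slave can catch up without pulling the whole index again.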
HTTP replication doesn’t allow for the transfer of older versions of a given index. This causes issues with some of our static indexes. When we bring up new slaves, Solr creates a blank index whose version is greater than the static index. We either have to optimize the static indexes or force a commit before replication will take place.
With BitTorrent replication all index files are hash verified, ensuring slave indexes are consistent with the master index. It also ensures the index version on the slave servers matches the master server, fixing the static index issue.
User Interface
The HTTP replication UI is very clunky: you must visit each slave to understand which version of the index it has. Its transfer progress is pretty simple, and towards the end of the transfer is misleading because the index is actually being warmed, but the transfer rate keeps changing. Wouldn’t it be nice to look in one place and understand what’s happening with replication?
With BitTorrent replication the master server keeps a list of slaves in memory. The list of slaves is populated by the slaves polling master for the index version. By keeping this list we can create an overview of replication across all of the slaves. Not to mention the juicy BitTorrent transfer details and a fancy progress bar to keep you occupied while waiting for bits to flow through the network.
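A rough sketch of such an in-memory list, assuming each slave reports its own index version when it polls (the class and field names are illustrative, not the ones in our Solr patch):

```python
import time
from typing import Dict, Optional


class SlaveRegistry:
    """In-memory view of the slaves, updated whenever a slave polls for the index version."""

    def __init__(self) -> None:
        self._slaves: Dict[str, dict] = {}

    def record_poll(self, host: str, slave_version: int, now: Optional[float] = None) -> None:
        # Remember what the slave has and when we last heard from it.
        self._slaves[host] = {
            "version": slave_version,
            "last_seen": time.time() if now is None else now,
        }

    def overview(self, master_version: int) -> Dict[str, str]:
        # One place to see which slaves are up to date.
        return {
            host: "current" if info["version"] == master_version else "stale"
            for host, info in sorted(self._slaves.items())
        }


registry = SlaveRegistry()
registry.record_poll("search01", 42)
registry.record_poll("search02", 41)
print(registry.overview(master_version=42))
```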
The Results
Pictures are worth a few thousand words. Let’s look again at the picture from the start of this post, where we had 11 slave servers pull 4 GB of index.
Today we have 23 slave servers pulling 9 GB of indexes.
You can see it no longer takes over an hour to get the index out to the slaves despite more than doubling the number of slaves and the index size. The second largest triangle on the graph represents our incremental indexer playing catch up after the index optimization.
This shows the slaves are helping to share the index as well. The last few red blobs are indexes that haven’t been switched to BitTorrent replication.
Drawbacks
One of the BitTorrent features is hash verification of the bits on disk. This creates a side effect when dealing with large indexes. The master server must hash all of the index files to generate the Torrent file. Once the Torrent file is generated all of the slave servers must compare the hashes to the current set of index files. When hashing 9 GB of index it can take roughly 60 seconds to perform the SHA1 calculations. Java’s SHA1 implementation is not thread safe making it impossible to do this process in parallel. This means there is a 2 minute lag before the BitTorrent transfer begins.
To get around this issue we created a thread safe version of SHA1 and a DigestPool interface to allow for parallel hashing. This allows us to tune the lag time before the transfer begins, at the expense of increased CPU usage. It’s possible to hash the entire 9 GB in 16 seconds when running in parallel, making the lag to transfer around 32 seconds total.
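Because each piece is hashed independently, the work splits cleanly across threads. The sketch below shows the idea in Python rather than Java; it is an analogue of the approach, not our DigestPool code. Each task builds its own hash object, so no digest state is shared between threads.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def sha1_piece(piece: bytes) -> bytes:
    # A fresh hash object per piece: nothing is shared across threads.
    return hashlib.sha1(piece).digest()


def parallel_piece_hashes(data: bytes, piece_length: int, workers: int = 4) -> List[bytes]:
    """Hash every piece of `data` using a pool of worker threads."""
    pieces = [data[i:i + piece_length] for i in range(0, len(data), piece_length)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in piece order, whatever order they finish in.
        return list(pool.map(sha1_piece, pieces))


data = bytes(range(256)) * 4096  # 1 MiB of sample data
print(len(parallel_piece_hashes(data, piece_length=64 * 1024)))
```

The `workers` parameter is the tuning knob: more workers shorten the lag before the transfer starts, at the cost of more CPU during hashing.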
Improvements
To better deal with the transfer lag we are looking at creating a Torrent file per index segment. Lucene indexes are made up of various segments. Each commit creates an index segment. By creating a new Torrent file per segment we can reduce the lag before transfer to milliseconds, because new segments are generally small.
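The saving comes from hashing only files that no earlier Torrent already covers. A minimal sketch, assuming segment files are not modified once written; the file names are made up for illustration:

```python
from typing import Set


def files_needing_hash(already_hashed: Set[str], current_files: Set[str]) -> Set[str]:
    """Files in the current commit that no earlier torrent covers."""
    return current_files - already_hashed


# Hypothetical file names: one old segment, one segment added by the latest commit.
previous = {"_0.cfs"}
current = {"_0.cfs", "_1.cfs"}
print(files_needing_hash(previous, current))  # {'_1.cfs'}
```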
We are also going to be adding support for transfer of arbitrary files via replication. We use external file fields and custom index time stamp files for keeping track of incremental indexing. It makes sense to have Solr manage replication of these files. We will follow HTTP replication’s lead on confFiles, adding dataFiles and indexFiles to handle the rest of the index related files.
Conclusion
Our search infrastructure is mission critical at Etsy. Integrating BitTorrent into Solr allows us to scale search without adding lag, keeping our sellers happy!
The Product Hacking Ecosystem
Most product ideas are shitty, yet we spend the majority of our lives working on them.
As a product hacker, you’ll be working on a constant stream of ideas that excite you to the point of obsession; staying up late writing code, thinking about it every waking and non-waking minute. We’ve all admitted that a minority of our ideas will turn into something that will have the impact we dream of, but we don’t let that truth prevent us from being excited that this next thing might be the one. Some have admitted this and accepted that they’re a junky who’s only going to get that fix from a great feature once in a long while. Although I admit that I’m a junky, I haven’t yet become a fatalist.
Web Operations people speak about measuring their work by the Mean Time Between Failures (MTBF). For product hackers, we should be thinking in terms of minimizing Mean Time Between Wins (MTBW). Because it’s difficult to know which ideas are going to blossom into that great feature, a nice proxy for MTBW is Mean Time to Bad Idea Detection (MTTBID).
By building out an ecosystem for you and your team that allows bad ideas to be detected quickly, you can spend your time iterating on the great ideas and shipping your wins quickly while the shitty ideas die a meaningless death somewhere in a pile of other shitty ideas.
The best hackers I know are impatient. As soon as you get an exciting result, you’re going to be talking about it with whoever will listen. An ecosystem of tools that are just there providing a source of truth that everyone can understand and agree with is like having a posse of hardened thugs at your back at all times. Instead of excitement going sour when people who haven’t seen the light are doubting you, you can all agree on what’s actually going on. If the numbers you care about are getting better, then great. If your product isn’t something that can be measured easily, or is a long term bet, you can show that the numbers you care about aren’t getting worse and show that it’s safe to push on into the wilderness.
Here are some things we’ve learned about how to build that ecosystem.
Make Tools for Failing Fast
Ideas can fail at any level of scrutiny. Some ideas don’t pan out when looked at under a microscope. Others don’t work out when talking about it over a drink. If it survives to the point of being shown to users, it can fail when you’re looking at it through a telescope and you’re just not seeing the response you hoped for. We spent some time trying to improve the quality and performance of our relevance sorting algorithm for search results before we made relevance-ordering the site-wide default. During the four month period where we did this work, we were able to get thirty experiments completed. Of those, eleven were real wins that made it into the final product.
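Plugging those numbers into the metrics from earlier gives a feel for the pace. This is plain arithmetic on the figures above, approximating the four months as 120 days:

```python
DAYS = 4 * 30        # "four month period", approximated as 120 days
EXPERIMENTS = 30
WINS = 11

win_rate = WINS / EXPERIMENTS              # share of experiments that shipped
days_per_experiment = DAYS / EXPERIMENTS   # how quickly an idea got its verdict
mean_days_between_wins = DAYS / WINS       # a rough MTBW for this project

print(f"win rate: {win_rate:.0%}")
print(f"days per experiment: {days_per_experiment:.1f}")
print(f"mean days between wins: {mean_days_between_wins:.1f}")
```

Roughly one experiment in three was a win, and a verdict arrived about every four days, which is what keeps the time between wins down to a week or two.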
At Etsy, the birth of every idea is the simplest possible implementation that permits experimentation. To give ourselves immediate feedback on the effects of search algorithm changes we created a tool that let us see the new ranking and all of the information we need to understand why a listing is ranked the way it is. The tool let us see this new ranking the moment our search server finished compiling, allowing for rapid iteration on tricky edge-cases, and the ability to quickly detect and kill bad components.
We created a tool that runs a sample of popular and long-tail queries through a new algorithm and displays as much information as can be determined without real people being involved: an estimated percent of changed search results over the universe of all queries, a list of the most strongly affected queries, a list of the most strongly affected Etsy shops, etc.
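The core of a tool like that is a diff between two rankings over a query sample. Here is a small sketch of the idea; the ranker interface, the top-N cutoff, and the position-by-position comparison are all assumptions made for illustration, not a description of our actual tool.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Ranker = Callable[[str], Sequence[str]]  # query -> ordered listing ids


def compare_rankers(queries: List[str], old: Ranker, new: Ranker,
                    top_n: int = 10) -> Tuple[float, List[Tuple[str, int]]]:
    """Return the fraction of queries whose top-N changed, plus the
    queries sorted by how many top-N positions differ."""
    changes: Dict[str, int] = {}
    for query in queries:
        before = list(old(query))[:top_n]
        after = list(new(query))[:top_n]
        differing = sum(1 for a, b in zip(before, after) if a != b)
        differing += abs(len(before) - len(after))
        changes[query] = differing
    changed = sum(1 for n in changes.values() if n > 0)
    fraction = changed / len(queries) if queries else 0.0
    most_affected = sorted(changes.items(), key=lambda kv: kv[1], reverse=True)
    return fraction, most_affected


# Toy rankers backed by dictionaries; real ones would call the search server.
old_results = {"ring": ["a", "b", "c"], "scarf": ["x", "y"]}
new_results = {"ring": ["b", "a", "c"], "scarf": ["x", "y"]}
fraction, affected = compare_rankers(["ring", "scarf"], old_results.get, new_results.get)
print(fraction, affected[0])
```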
We created tooling for running side-by-side studies where real users were asked to rate which set of search results they preferred for a given query. When a feature was ready to be launched as an A-B test, we were able to see a set of visualizations explaining how our change was performing relative to the standard algorithm.
[Figures: What a Search AB Test Looks Like (left) | What a site-wide AB test looks like (right)]
The best part is that we don’t think about these tools while building new products and running experiments. We come up with ideas, implement them, and if they do well we ship them. Our conversations are about the product, the code we write is for the product and our shitty ideas are executed on the spot and sloppily buried in shallow graves, as they deserve and as is our wont.
Make Tools that Make Process Disappear
Edward Tufte introduced the concept of “chart junk”: the distracting stuff on a visualization that isn’t saying anything about the data. Marshall McLuhan made a compelling case that “The medium is the message”, implying that the vehicle through which you perceive something impacts your understanding of it. Just because your paying clients won’t see your internal tooling doesn’t give you license to slap together an ill-considered tool. The medium is the message, and your tools are your medium. Working memory is limited and people are busy. Decisions are worse when getting the answer to a question about your product requires that you lose track of what you asked or why it’s important. Decisions are even worse if you never get a chance to ask questions and get answers. Products designed with fewer poor decisions are less shitty than products designed with more poor decisions. GNU wouldn’t exist without GDB.
[Figures: Our Non-Shitty Search Query Analysis Tool (left) | Solr’s Shitty Query Analysis Tool (right)]
It’s really important to our business that we return great results when people are doing searches on Etsy. It turns out we’re super lazy and if there are any barriers in the way of us asking “why is this item showing up for this query”, we’re just not going to ask the question and it’s not going to get fixed. Our query analysis tool (pictured on the left) helps reduce that barrier to getting an answer.
The best information about your product is going to come from real users. Unfortunately, it’s often painful to get your products out into the real world. Having completed an iteration of a product, you’re filled with excitement and fear. You’re hoping you got it all right, but if you didn’t, you’re ready to fix it because you know every intimate detail of your new creation. This state of excitement and readiness is the last thing you want to let go of. Continuous deployment, the practice of pushing your code live the moment it’s ready, is absolutely essential for product hackers.
If you need to wait any non-trivial amount of time between completing something and seeing how well it’s performing, you’re not going to be working on that project by the time you get your answer. When you do get your answer, you’re not only going to have to refresh your memory on what you had been working on, but you’re going to have to do the same on whatever else you had started working on. Asking your team to work with patience and discipline has never worked and never will work. Build an ecosystem where doing the right thing is the easiest thing. Build an ecosystem where making great decisions is the easiest thing. Build an ecosystem where the lazy, excitable and impatient really shine.
Translation Memory
By: Diego Alonso
As we mentioned in Teaching Etsy to Speak a Second Language, developers need to tag English content so it can be extracted and then translated. Since we are a company with a continuous deployment development process, we do this on a daily basis and as a result get a significant number of new messages to be translated along with changes or deletions of existing ones that have already been translated. Therefore we needed some kind of recollection system to easily reuse or follow the style of existing translations.
A translation memory is an organized collection of text extracted from a source language with one or more matching translations. A translation memory system stores this data and makes it easily accessible to human translators in order to assist with their tasks. There’s a variety of translation memory systems and related standards in the language industry. Yet, the nature of our extracted messages (containing relevant PHP, Smarty, and JavaScript placeholders) and our desire to maintain a translation style curated by a human language manager made us develop an in-house solution.
In short, we needed a system to suggest translations for the extracted messages. Etsy’s Search Team has integrated Lucene/Solr into our deployment infrastructure allowing for Solr configuration, Java-based indexers, and query parsing logic to go to production code in minutes. We decided to take advantage of Lucene’s MoreLikeThis functionality to index “similar” documents, in this case similar English messages with existing translations. The process turned out to be pretty straightforward: we query the requested English message using a ContentStream to the MoreLikeThisHandler and get as a result similar messages with scores. This is done through our Translator’s UI via Thrift. We’ve determined a threshold to filter the messages by score in order to only provide relevant translations after getting similar English messages from the query results.
It’s worth mentioning that we need to use a ContentStream to send the source message because most of the time we’ll be requesting translation suggestions for new messages. In other words, messages without translations are not present in our index to match as documents. When sending a ContentStream to the MoreLikeThisHandler, it will extract the “interesting” terms to perform the similarity search.
Here’s a simple diagram of the main parts of this process:
We could easily test and optimize our results on the search environment through Solr queries before wiring the service in the Translator’s UI. As you can see in the following query we send the content (stream.body) of the English message, play with the minimum document frequency (mindf) and term frequency (mintf) of the terms and even filter the query (fq) for translations in a certain language.
http://localhost:8393/solr/translationmemory/mlt?stream.body=Join%20Now
&mlt.fl=content&mlt.mindf=1&mlt.mintf=1&mlt.interestingTerms=list
&fl=id,md5,content,type,score&fq=language:de
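The same query can be assembled in code, followed by the score filter described earlier. This is a sketch only: the parameters are the ones in the URL above, while the function names, the threshold value, and the shape of the result documents (dictionaries with a `score` key) are assumptions.

```python
from typing import Dict, List
from urllib.parse import urlencode

# Host, port, and core name are taken from the example query above.
MLT_ENDPOINT = "http://localhost:8393/solr/translationmemory/mlt"


def build_mlt_url(message: str, language: str) -> str:
    """Build a MoreLikeThis request for one English message."""
    params = [
        ("stream.body", message),          # the message text, sent as a content stream
        ("mlt.fl", "content"),             # field to compare on
        ("mlt.mindf", 1),                  # minimum document frequency
        ("mlt.mintf", 1),                  # minimum term frequency
        ("mlt.interestingTerms", "list"),
        ("fl", "id,md5,content,type,score"),
        ("fq", f"language:{language}"),    # only translations in this language
    ]
    return f"{MLT_ENDPOINT}?{urlencode(params)}"


def relevant_matches(docs: List[Dict], min_score: float) -> List[Dict]:
    """Keep only the similar messages whose score clears the threshold."""
    return [doc for doc in docs if doc["score"] >= min_score]


print(build_mlt_url("Join Now", "de"))
# Hypothetical results and threshold, to show the filtering step.
print(relevant_matches([{"md5": "a", "score": 0.9}, {"md5": "b", "score": 0.2}], 0.5))
```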
And since we know you love to read some code, here’s how we defined our translation memory data types and service interface in Thrift:
struct TranslationMemoryResult {
  1: string md5,
  2: double score
}
struct TranslationMemorySearchResults {
  1: i32 count,
  2: list<TranslationMemoryResult> matchedMessages
}
service TranslationMemorySearch extends fb303.FacebookService {
  /**
   * Search for translation memory
   *
   * @param content of the message to match
   * @param language code of the existing translations
   * @return a TranslationMemorySearchResults instance - never "null"
   */
  TranslationMemorySearchResults search(1: string content,
                                        2: i32 type,
                                        3: string language)
}
Let’s look at some common use cases where translation memory comes in handy.
It’s pretty common that a new feature is released where we want to attract new members by adding some kind of registration button. In this case the extracted English message has the following data:
Description: A call to action to join etsy.com
Content: Join Now
When displaying this message in our Translator’s UI we get the following results after looking for its content in our Translation Memory.
Match Source / Translation
100% Join Now / Jetzt teilnehmen
80% Join now. It's free! / Jetzt anmelden. Kostenlos!
Another case is when we have a whole feature translated on the site, but we try some different copy in our English version. In the example below, the translators can base their translations on the suggestions shown.
Description: Text for when an experimental feature has no requirements
Content: This prototype has no specific requirements. Welcome!
Match Source / Translation
86% This team has no specific requirements. Welcome! /
Dieses Team stellt keine speziellen Bedingungen. Willkommen!
86% This experiment has no specific requirements. Welcome! /
Für dieses Experiment gibt es keine speziellen Bedingungen. Willkommen!
Here’s a screenshot from our Translator’s UI in action:
Having a translation memory system like this has proven to be really useful for our translators who stumble upon new, edited, and deleted messages each day. We also update our index of extracted messages every few minutes with translations, providing fresh suggestions.
In addition, we have created a translation glossary manager to maintain a common style when translating. When viewing an English message, we stem the content of the message and match the terms with our glossary. A few examples from our German version of the site are “Search” into “Suche”, “Circles” into “Zirkel”, and “Shop” into –surprisingly, the English word– “Shop”.
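The glossary lookup can be sketched like this. The glossary entries are the German examples above; the stemmer is a deliberately crude stand-in (lowercase, drop a trailing "s"), not the stemmer we actually use, and the function names are illustrative.

```python
import re
from typing import Dict

# Glossary keyed by stem, using the German examples from the text.
GLOSSARY = {"search": "Suche", "circle": "Zirkel", "shop": "Shop"}


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: lowercase and drop a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def glossary_matches(message: str, glossary: Dict[str, str] = GLOSSARY) -> Dict[str, str]:
    """Map each word in the message whose stem is in the glossary to its translation."""
    matches = {}
    for word in re.findall(r"[A-Za-z]+", message):
        stem = toy_stem(word)
        if stem in glossary:
            matches[word] = glossary[stem]
    return matches


print(glossary_matches("Search Circles in your Shop"))
```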
So that’s a glimpse of how we deal with translations at Etsy. Check back soon for more posts on how we handle internationalization at Etsy.
Multilingual User Generated Content and SEO
By: Lacy Rhoades & David Bernal
Etsy offers a lot of items from a lot of sellers worldwide. Now since we’ve started to better support our international members, this means Etsy also comes in a variety of languages. We rely heavily on search engines to bring people looking for something unique to our Sellers and their Etsy Shops. As luck would have it, search engines speak a variety of languages too.
Who speaks our language?
A persistent cookie, initially set through our language detection logic, determines what language we show users on Etsy. This makes for the most ideal experience by allowing us to present what we refer to as a “unified marketplace” and avoid a segmented set of items for separate geographic regions. All content on Etsy, all Etsy Shops, Listings and features (no matter what language) are found at Etsy.com and there is no segmentation. Piece of cake. Right?
What does this mean for search results?
Interesting question: What does a single unified multilingual marketplace mean for search providers like Google, Bing or Yahoo? As we came to find out, it means that when said providers come along to crawl and index our amazing content, they only browse the pages of Etsy in English. This is a problem!
We ask our users, “Hey! What language would you like to see?”
Unfortunately, search crawlers don’t care to answer such inquiries. Search engine bots, in their desire to see all content without prejudice, don’t send a browser accept language, aren’t signed-in users with profile locations, and it hardly makes sense to use geo-IP lookups to determine their location. Because of this, all content shows up in English for automated search crawlers, which means our foreign language search results and page ranks were extinct long before they could ever be created.
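To see why a crawler always lands on English, here is a simplified sketch of cookie-then-header language selection. It leaves out the profile location and geo-IP signals, and the supported language list, default, and function name are assumptions rather than our production logic.

```python
from typing import Optional, Sequence

SUPPORTED = ("en", "de", "fr", "it", "nl")  # assumed list of site languages
DEFAULT = "en"


def pick_language(cookie_lang: Optional[str], accept_language: Optional[str],
                  supported: Sequence[str] = SUPPORTED) -> str:
    """Prefer the persistent cookie, then the Accept-Language header, then English."""
    if cookie_lang in supported:
        return cookie_lang
    if accept_language:
        prefs = []
        for part in accept_language.split(","):
            tag, _, params = part.strip().partition(";")
            quality = 1.0  # a language with no q-value has full weight
            params = params.strip()
            if params.startswith("q="):
                try:
                    quality = float(params[2:])
                except ValueError:
                    quality = 0.0
            prefs.append((quality, tag.split("-")[0].lower()))
        for _, lang in sorted(prefs, key=lambda p: p[0], reverse=True):
            if lang in supported:
                return lang
    return DEFAULT


print(pick_language("de", "fr"))                         # a returning user with a cookie
print(pick_language(None, "fr-CH, fr;q=0.9, en;q=0.8"))  # a first-time browser visit
print(pick_language(None, None))                         # a crawler: no cookie, no header
```

A crawler supplies neither input, so it falls straight through to the default.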
Search providers must first establish what region or language the searcher is interested in, but they must also establish what region or language the search results are in. Our goal is to make it as easy as possible for search bots to do just that.
To simulate the way search engines crawl Etsy you can use curl to retrieve the contents of an Etsy URL. At this point in our work with multilingual content, there was no way to simply “curl” a URL on Etsy and ever receive anything but English content.
A simple response to this is to publish our content at different addresses for different languages. The simple unified Etsy.com would remain as the marketplace for real users, but for search crawlers we would publish simple multilingual domains like de.etsy.com where all content would appear in German, or it.etsy.com where content would appear in Italian.
Search engines start saying, “Hey! What’s the big idea!?”
The majority of our content published across these multiple domain names is still only in English and as it turns out, search engine providers really do not like it when you multiplex copies of identical content across multiple addresses and multiple domain names.
Thankfully this problem (like any other issue) has been seen before and there’s a well thought out, convenient solution. The major search providers have established a way to say (in your site’s HTML code) that some content is identical in part or in whole to other content elsewhere. This code also allows you to specify if content is intended for segments of the international audience speaking certain languages.
Now, who here speaks HTML? <rel alternate=”…”/>
Given a situation where an Etsy Listing is not translated into several languages, we would append this HTML code to the English Listing page (inside the <head> tags). This alleviates any confusion as to what language-speakers this content is intended for. It assists search providers in ignoring duplicate content elsewhere.
<link rel="canonical" href="http://www.etsy.com/listing/83154191/etsy-stickers-keeping-it-real-set-of-5" />
<link rel="alternate" hreflang="fr" href="http://fr.etsy.com/listing/83154191/etsy-stickers-keeping-it-real-set-of-5" />
<link rel="alternate" hreflang="nl" href="http://nl.etsy.com/listing/83154191/etsy-stickers-keeping-it-real-set-of-5" />
There are two things at play here:
- We specify a “canonical” URL, which is the one true URL to view this content in whatever language it appears in on the page. This way, when search engines see it elsewhere, they know to give its pagerank to the canonical URL, avoiding dilution.
- We specify “alternate” URLs, which tell search engines that there are similar-looking pages around that have the same content but different chrome: the content is the same, but the site navigation and boilerplate are in the language specified by the hreflang attribute.
Given the Etsy Listing has been translated to German, this code would be appended to the German version of that same Etsy Listing:
<link rel="canonical" href="http://de.etsy.com/listing/83154191/etsy-autoaufkleber-5-stuck" />
Notice there are no alternates listed for the German content. That content can’t be found anywhere else. There is no ambiguity and so we have no worries of misrepresenting it as duplicate content.
Note that rel=”alternate” is only used for pages with the same content, but different languages for the navigation and boilerplate. Search engines use this information to send users to the version of the site they’re most-likely to be able to move around in.
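Both cases can be produced by one small helper. This is an illustrative sketch, not our template code: the function name and its arguments are assumptions, and it simply reproduces the tag shapes shown above.

```python
from html import escape
from typing import Iterable, List


def head_links(canonical_host: str, path: str,
               alternate_langs: Iterable[str] = ()) -> List[str]:
    """Build the canonical tag, plus one alternate tag per language subdomain."""
    safe_path = escape(path, quote=True)
    tags = [f'<link rel="canonical" href="http://{canonical_host}{safe_path}" />']
    for lang in alternate_langs:
        tags.append(f'<link rel="alternate" hreflang="{lang}" '
                    f'href="http://{lang}.etsy.com{safe_path}" />')
    return tags


# An untranslated listing: canonical on www, alternates for the other chromes.
for tag in head_links("www.etsy.com",
                      "/listing/83154191/etsy-stickers-keeping-it-real-set-of-5",
                      ["fr", "nl"]):
    print(tag)

# A listing translated to German: canonical only, no alternates.
print(head_links("de.etsy.com", "/listing/83154191/etsy-autoaufkleber-5-stuck"))
```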
We know how to say, “Stay out!” in every language.
Another angle to be sure we don’t publish mislabeled English-only content is to make use of robots.txt and “robot” type meta tags. Using combinations of this technique we can suggest that search providers not index multilanguage content that will only be available in the future.
This might mean that robots.txt will list “disallow” directives for chunks of the site. Like this:
Disallow: /listings/*
Also we may say that if an Etsy Listing is not translated into French, viewing that Listing from fr.etsy.com will result in rendering meta tags like this:
<meta name="robots" content="noindex">
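The decision behind that tag is a single check. A minimal sketch, with a hypothetical function name and a set of language codes standing in for whatever records a listing's translations:

```python
from typing import Set


def robots_meta(listing_languages: Set[str], subdomain_language: str) -> str:
    """Return a noindex meta tag when the listing isn't translated for this subdomain."""
    if subdomain_language in listing_languages:
        return ""  # translated: let crawlers index the page
    return '<meta name="robots" content="noindex">'


print(robots_meta({"en"}, "fr"))        # English-only listing viewed on fr.etsy.com
print(robots_meta({"en", "fr"}, "fr"))  # translated listing: no tag emitted
```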
Always know how to ask for directions!
One of the best ways to help yourself out with SEO is to provide a map for your search providers. This standard is known as the Sitemap protocol. This sitemap file is specific to the international subdomain or top level domain the request came from.
An Etsy Listing will show up in the sitemap.xml of any subdomain corresponding to the languages the listing is available in. For example, a listing available in Italian will show up in the sitemap.xml at it.etsy.com, while a listing only available in English would not. Any Etsy Listing is available to any user on Etsy, but targeting the sitemap.xml in this way allows us to indicate to search providers what the region and language are for a particular Etsy Listing.
All the data contained inside the sitemap.xml and robots.txt file is generated dynamically, querying out for only content translated into that specific region and language. You can check out our SEO-related files below. (Be sure to clear any region- or language-specific cookies you might have set already.)
http://www.etsy.com/robots.txt
http://www.etsy.com/etsymap_listing_50m_sitemap_index.xml
http://de.etsy.com/robots.txt
http://de.etsy.com/etsymap_listing_de_sitemap.xml.gz
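As a rough illustration of per-language sitemap generation, the sketch below filters listings by language and emits a Sitemap protocol `urlset`. The listing paths and the in-memory dictionary are made up for the example; our real sitemaps are built from database queries and split across index files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_for(language: str, listings: Dict[str, Set[str]]) -> str:
    """Build a sitemap for one language subdomain.

    `listings` maps a listing path to the set of languages it is available in.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path, languages in sorted(listings.items()):
        if language in languages:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = f"http://{language}.etsy.com{path}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical listings: one available in Italian, one English-only.
listings = {
    "/listing/1/anello": {"en", "it"},
    "/listing/2/ring": {"en"},
}
print(sitemap_for("it", listings))
```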
Keep a log of your journey!
At Etsy we use Splunk, a data management interface, to aggregate all of our web server logs. We can use it to run periodic reports and keep an eye on the results for us. We also run automated tasks daily, essentially mimicking the role of an Etsy user. The tasks execute a search query for Etsy content using a normal search engine, and then compare the results to what we’d expect to see for such a query.
Using this system we can keep an eye on:
- What regional / language content is being crawled by search crawlers?
- What multilingual content is being returned to users searching for Etsy Shop and Listing content?
- Roughly how much traffic is coming in organically from search results? We can base the reports on the request’s referrer.
- Are these numbers (total numbers of shops and organic traffic) in sync roughly with what we would expect?
If we see a surge in regional traffic without a commensurate rise in the data feeding that region or language, we can tell that search indexes are likely tainted and need attention.
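The check described above could be sketched roughly as follows. This is not our Splunk report; the function name, the use of fractional growth rates, and the tolerance value are all assumptions chosen to show the comparison.

```python
from typing import Dict, List


def tainted_regions(
    traffic_growth: Dict[str, float],
    content_growth: Dict[str, float],
    tolerance: float = 0.5,
) -> List[str]:
    """Flag regions whose organic traffic grew much faster than their content.

    Growth values are fractional changes over some period (0.1 means +10%).
    The tolerance of 0.5 is an arbitrary threshold for this sketch.
    """
    flagged = []
    for region, traffic in sorted(traffic_growth.items()):
        content = content_growth.get(region, 0.0)
        if traffic - content > tolerance:
            flagged.append(region)
    return flagged


# German traffic more than doubled while German content grew only 10%.
print(tainted_regions({"de": 1.2, "fr": 0.1}, {"de": 0.1, "fr": 0.08}))
```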
Now don’t wander off anywhere. Check back soon for more updates about our adventures in multilingual content.
Moving from SVN to Git in 1,000 easy steps!
This past summer we completed a project that spanned several months of planning and preparation: moving our source control from Subversion to Git. The code that runs our search engine, front-end web stack, support/admin tools, API, configuration management, and more is now stored in and deployed from Git. We thought some of you might find our approach to migrating an 80-100 person engineering team interesting and possibly instructive.
We went through three phases:
- Preparation
- Execution
- Follow through
Preparation was the longest and most difficult phase. We had to figure out when and how to move, how to educate our team, and how to make the transition smooth for everyone. The execution phase had to be done quickly, because at the rate we commit and release changes it would be counter-productive for the cutover to take more than a few hours. We spent a lot of the preparation phase making sure that was possible. The follow through phase refers to supporting our team from the point we cut over and into the future.
This post isn’t prescriptive, but before getting into details here is our only piece of advice:
If you can deal with your current source code system, do not go through this pain. Seriously. This was a long, painful process for us. Over the years, many tools, systems, and processes had become deeply intertwined with our subversion installation. That said, if your team is small, or your source control system isn’t tied into anything, go for it! Just do it as soon as possible – the only time better than today was yesterday.
Preparation
Moving to Git is something we’ve been talking about for at least 2 years. It’s also something we put off for a lot of very good reasons. Around that time we had been introducing the culture of continuous deployment, which included the mind shift of moving away from long lived branches, and instead branching in code with feature flags, making small frequent deploys, and using percentage rampups to slowly roll out features. At the time we didn’t want to introduce any other road blocks to instilling this into our engineering culture.
What we found happening more and more was that new engineers were coming in already familiar with Git, using things like git-svn and writing tools to make SVN act more like Git. While evaluating the options, it was clear that for our team, Git was a better fit than SVN (and a better fit than any other distributed version control system as well). One of the biggest reasons is github.com and its popularity for open source collaboration. Not only do we put our open source contributions on there, but so do Twitter, Facebook and many others.
Though we did not move to Git for its branching capability, our tools weren’t capturing some of the work we were already doing with patches and pushing changes directly between team members for review and testing. We also felt that re-examining and adding new tools to the mix seemed like a healthy trait to have in our culture, and felt the switch to Git would increase engineer happiness.
After we committed to the decision, we handled the move to Git slowly and delicately for a few reasons. One is that we deploy around 30 times a day across an engineering organization that was about 80 people at the time. We didn’t want to lose any of that velocity (we knew we might lose some in the beginning, but wanted it to be as seamless as possible). Another was that Git familiarity varied widely across the team, from Git experts to people who had never touched it. Education played a huge part in our successful transition. It was also important for us to continue flagging code on/off and to keep a continuously deployed trunk mentality even after the switch to Git.
Prep
The first few months of prep consisted of slowly reorganizing our SVN repositories to be more in line with how we were working. Our code had become spread across many different repositories, and we wanted to make sure that when you were ready to work on the main website, that you only needed to clone one repository. This was a good thing for us even if we hadn’t moved to Git, because it introduced a logical organization that was more in line with how our site was laid out.
We were also starting to decide which tools we could use around Git. We wanted a front end tool with a good UI, so naturally we contacted GitHub for a trial version of github enterprise. At the same time we tried out some tools such as gitweb with something like gitosis or gitolite underneath. We also took a look at gitorious which is probably the next closest thing to GitHub if you are looking for a free solution.
Education
We started to look around for training, and it became obvious that github’s training program is the best out there. We wanted to make sure everyone was well prepared, and they offer a training that you can go through online, or have the instructors come to you for a more hands on experience. We also wanted to examine our workflow and integration issues, and found that bringing an instructor on site was the best use of our time.
Since we were training people with various skill levels, we decided to split the training into separate sessions based on experience. We surveyed our team, and grouped people into beginner and intermediate buckets. We broke about 3/4 of our team across two beginner days, and the rest in a more advanced session on the last day. We also spent some time after the training each day to discuss our workflow and the integration with our current tooling.
Our instructor, Matthew McCullough, couldn’t have been better in explaining git in a sensible way to the team. As a bonus, since we already had github enterprise installed, we were able to use that for the hands on training to get people acclimated to using it, and by the end of the sessions people were creating and hacking on their own repositories in our private installation. It quickly became apparent that this was going to be a great tool for collaboration, with a fantastic UI and all the benefits of the public GitHub, while maintaining the privacy of our codebase that we required.
Pre Migration
After the training, we had to attempt to move fairly quickly so that all that was learned in training was not forgotten. The best way to learn a tool is to use it, and we had to plan how to carry out the actual migration.
There were a few key things we did at this stage. First, we created an Engineering organization within our GitHub, and created a repository that held our web code. We then created a cron that mirrored our SVN commits into the GitHub repo. We were able to use that to update our deploy and testing tools in the background without affecting our current flow.
We also created documents in our wiki that described our workflow, including an explanation of how to do similar tasks that one would do in SVN, with Git. We made it clear with a few weeks notice of our plans for the move so that everyone was mentally prepared for the switch, and even did a few in house training sessions specifically on our new workflow. In the end, we kept our workflow similar to SVN to ease our transition. We still don’t use branches (most of the time), we still deploy from trunk (…well, master).
Execution
The next step was actually flipping the switch. At this point we’d done so much preparation that we were just ready to make this happen and get it over with. We had a code freeze (no commits) one evening in late June, and migrated our deployment and testing tools to use the new Git repo. Our commits were already mirrored into Git, so the new repository was up to date. But we also had to be sure the Git repository was getting chef‘d out to each developer’s VM, with the web configuration in place to have engineers sit down the next day and be ready to code and deploy. We had to make sure that our hooks were working, and that our commit emails and IRC notifications were uninterrupted. All in all the code freeze lasted about 12 hours, and we were ready to go for the next day.
Follow Through
As part of our preparation, we made sure to identify some of the members across the team who were key in helping assist others with the transition to this new tool. It certainly helps to have a few people on hand that are familiar with Git and its distributed model to help people get acclimated. We set up a #git IRC channel (we use IRC across the entire company for communication) and we also had our documentation to point to, which people were able to add to if they encountered any new problems or needed clarification with the new workflow.
The first day on Git our velocity was above average. We wanted to make sure everyone was comfortable and able to work in this new system, and the migration didn’t end up slowing us down at all. We stated that everyone had to clone-pull-add-commit-push on that first day so that there was no getting lost for weeks. In our opinion this was one of the most successful aspects of the move. Just like we have people deploy on their first day here, overcoming the fear is a big part of adapting to a new process.
Summary
Overall we can say the Git migration was a success. It turned out to be an immense task with a maze of dependencies, but in the end we’re on a current version control system that should last for years to come. It opens up many new workflow possibilities and solves some of our existing problems, not to mention it’s blazing fast. If you’re more interested in the technical side of migrating from SVN to Git than the social side, I wrote a post a few years ago on my personal blog that you may be interested in, and there are also a couple of pointers over on GitHub on how to make the conversion.
Etsy at LISA Conference
A few of us at Etsy will be speaking at the LISA Conference next week which runs from December 4–9, in Boston, MA. Avleen Vig is speaking on Dec. 7th about the operational impact of continuous deployment. Erik Kastner and John Goulah will be talking on Dec. 8th about the tools and culture around our deployment process. Check here for a full schedule of technical sessions. If you happen to be there, please come say hello!
Engineering Social Commerce
This holiday season we launched a redesigned version of a product we call “Gift Ideas for Your Friends”. The product works by connecting with your Facebook account, analyzing thousands or more of your friends’ likes and interests, and then making recommendations across millions of items in Etsy’s marketplace. Social commerce has been somewhat of a hot topic lately, and the gift recommender is a social commerce feature in that it provides a new and unique shopping experience to buy gifts for your friends and family. In this post we explore some of the engineering challenges we faced in building a social commerce feature like the gift recommender.
The gift recommender is social in that it brings your friends to Etsy. We all know shopping for friends and family is hard, particularly around the holidays when shopping lists grow quite large. Building a responsive experience here that allows navigation across your friends and their recommendations requires a tight coupling between client and server components as well as with Facebook’s API. On the commerce side of things, the product is powered by data mining algorithms that analyze contexts in both Facebook’s social setting and Etsy’s marketplace to make relevant recommendations. While developing these algorithms is a challenge in itself, the coupling between these algorithms and the end design and user interaction is equally critical.
The end result is a product that requires integration among components across our entire stack, including: frontend html, css, and javascript, middle tier application logic and libraries, backend database interfaces and job queues, and hadoop driven recommendations. Let’s dive into some of the application’s core components, related system couplings, and some of the challenges we faced in building the product.
Tight integration with Facebook
The gift ideas product works by analyzing each of your friends. For each friend, we request various attributes, including name, education history, likes, interests, and activities. Facebook has a limit of 5,000 friends, but having friend counts above 1,000 is not uncommon. Furthermore, it is also not uncommon for people to have upwards of thousands of likes and interests. So, for a typical Facebook power user (read: your average graduate college student) requesting upwards of 100,000 attributes is not uncommon.
So how do we pull this amount of data back from Facebook? First, you may have noticed that each friend is featured in a separate UI component, which allows us to compute recommendations independently. When creating recommendations, we split up friends into groups of 50 and use our asynchronous job queueing system (powered by Gearman) to create recommendations in parallel. Each Facebook request is constructed using a series of fairly complex Facebook Query Language (FQL) queries; FQL is a SQL-like language supported by Facebook’s API. Some of these queries are extremely complex. For example, the query to fetch a user’s page likes looks something like this:
select page_id, name, type
from page
where page_id in (
  select page_id
  from page_fan
  where uid in (
    select uid2 from friend where uid1 = me() limit 50
  )
)
Requesting data from Facebook is the slowest component of the recommendation creation process: some of our larger Facebook queries take multiple seconds to respond.
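The grouping of friends into batches of 50 can be sketched in a few lines of Python. This only illustrates the batching step; the function name is made up for this example, and each resulting group would be handed to a separate job in whatever queueing system is in use.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 50  # the group size mentioned in the post


def batches(friend_ids: Sequence[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `size` friend ids."""
    for start in range(0, len(friend_ids), size):
        yield list(friend_ids[start:start + size])


# 1,200 friends become 24 independent units of work.
groups = list(batches(list(range(1, 1201))))
print(len(groups))  # 24
```

Because each group is independent, one slow Facebook response delays only the friends in that group rather than the whole page.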
Performance: Caching, Caching, and Caching
The new design for the product displays many friends and their recommendations on the primary splash page. This is in contrast to the old design, which allowed viewing only one friend’s recommendations at a time. This presented several performance challenges.
Each Facebook attribute triggers a recommendation, and each recommendation shows items from the marketplace by issuing a search query. The new product displays four recommendations per friend in batches of 20 friends, so each batch can require as many as 80 search queries. Assuming an average response time of ~200ms per search, this could add up to load times in excess of 15 seconds. (!)
Luckily, the distribution of Facebook likes (and corresponding gift recommendations) is very sharp: the most popular 5,000 recommendations represent over 90% of all recommendations made by the product. Therefore, caching listing results at a per-recommendation level granularity provides us with tremendous speedups: 200ms search requests optimize to ~2ms memcache requests.
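To illustrate the arithmetic and the caching idea, here is a minimal Python sketch. A plain dictionary stands in for memcache, and the class and function names are hypothetical; the point is only that repeated recommendations skip the search backend entirely.

```python
from typing import Callable, Dict, List


def worst_case_ms(friends: int, recs_per_friend: int, ms_per_search: int) -> int:
    """Total time if every recommendation triggers a sequential search."""
    return friends * recs_per_friend * ms_per_search


class RecommendationCache:
    """Cache listing results per recommendation (a dict stands in for memcache)."""

    def __init__(self, search: Callable[[str], List[int]]):
        self._search = search
        self._store: Dict[str, List[int]] = {}
        self.misses = 0

    def listings_for(self, recommendation: str) -> List[int]:
        if recommendation not in self._store:
            self.misses += 1  # only a miss pays the full search cost
            self._store[recommendation] = self._search(recommendation)
        return self._store[recommendation]


print(worst_case_ms(20, 4, 200))  # 16000 ms, i.e. 16 seconds

cache = RecommendationCache(lambda rec: [len(rec)])  # fake search backend
for rec in ["brooklyn", "video games", "brooklyn", "brooklyn"]:
    cache.listings_for(rec)
print(cache.misses)  # 2
```

The more skewed the distribution of recommendations, the fewer searches are actually issued, which is why a sharp distribution makes per-recommendation caching so effective.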
Client and Server-side Facebook API
We make heavy usage of the Facebook client API to authenticate users during initial Facebook connection phases: we did not want to recreate the javascript authentication flow supported by Facebook’s javascript SDK. However, we also needed server side API access for deeper queries required by the core recommender algorithm.
Complicating matters, we also recently released a feature that allows you to connect your Etsy account with your Facebook account. Managing tokens and authentication across the two systems while also allowing users to shop for gift ideas without an Etsy account presents several technical challenges.
Dealing with Backend Latency on the Frontend
Perhaps the biggest improvements made this year stem from a tighter coupling between the backend recommendation generation process and the frontend display. The initial creation process can take 3 or more seconds, and providing user feedback and context throughout is critical.
You may have noticed that your recommendations fill in “on the fly” as they’re created. As each of the asynchronous Gearman job workers completes its recommendation task, we stream results back to the client, which then renders them immediately via ajax. The end goal here is to enable the user to see recommendations appear as soon as possible, providing a more immediate shopping experience.
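The "render as each job finishes" pattern can be sketched in Python as follows. Threads from the standard library stand in for Gearman workers here, and the function names are assumptions; the real system streams results to the browser rather than yielding them from a generator.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterator, List, Sequence


def stream_recommendations(
    friend_batches: Sequence[List[int]],
    recommend: Callable[[List[int]], List[str]],
) -> Iterator[List[str]]:
    """Yield each batch's results as soon as that batch finishes,
    instead of waiting for the slowest batch."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(recommend, batch) for batch in friend_batches]
        for future in as_completed(futures):
            yield future.result()


def fake_recommend(batch: List[int]) -> List[str]:
    # Placeholder for the real recommendation job.
    return [f"gift idea for friend {friend_id}" for friend_id in batch]


for results in stream_recommendations([[1, 2], [3]], fake_recommend):
    print(results)  # completion order is not guaranteed
```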
Data Mining
Of course, the core of the recommender system is the recommender algorithm and the supporting data. The core algorithm is responsible for understanding the meaning of a given Facebook attribute in an Etsy context. For example, the artist “Pink” is a popular musician on Facebook. However, a query for “pink” returns substantially different results on Etsy.
The core gift recommendation algorithm is overviewed in a post from earlier this year. We’ve also made several improvements since then. We’re smarter in analyzing gender when retrieving appropriate listing suggestions, and we’ve also taken another pass and removed bad listings based on data from the first year of the product.
Precision vs Recall, and the End Experience
“Gift Ideas for Your Friends” provides a different experience compared to other traditional recommender algorithms. For example, Netflix’s algorithms take a collaborative approach in which your entire profile is analyzed in aggregate, and recommendations are created by comparing your favorite movies to those of other users.
In contrast, “Gift Ideas for Your Friends” makes point-based recommendations off of a single attribute of your friend’s Facebook profile. Jim likes burning man. Kurt likes video games. Chad likes Brooklyn.
In information retrieval terms, the goal of the gift recommender is to optimize for precision: to make a handful of good recommendations based on a given set of attributes. There are lots of things that your mother likes on Etsy that aren’t represented in her Facebook profile, and the gift recommender will “miss” these recommendation opportunities. This is in contrast to Netflix style recommendations, where the goal is to optimize for recall: given your entire movie history, provide recommendations that capture your taste as a whole.
In fact, for the general gift giving problem, optimizing for precision is a more natural objective: you generally buy your mother only one or two gifts each year. Your mother might appreciate gifts like vintage glassware, amethyst jewelry, raku pottery, etc. A successful holiday gift really only requires buying her one of these items. Netflix style recommendations are aimed at capturing the various aspects of your taste and have stronger expectations for movie recommendations across all genres / styles that you may like.
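For readers less familiar with the two terms, here is a small Python sketch using the standard definitions: precision is the fraction of recommended items that are relevant, and recall is the fraction of relevant items that were recommended. The sets below are made-up examples, not product data.

```python
from typing import Set, Tuple


def precision_recall(recommended: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of recommendations."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Everything your mother would like (illustrative), versus one good gift idea.
likes = {"vintage glassware", "amethyst jewelry", "raku pottery", "linen aprons"}
gift_ideas = {"vintage glassware"}
print(precision_recall(gift_ideas, likes))  # (1.0, 0.25)
```

A single well-chosen recommendation scores perfectly on precision while covering only a quarter of her tastes, which is an acceptable trade when only one gift is needed.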
The ultimate goal of the product is to provide a glimpse of Etsy through your friends and their existence on Facebook. We view the recommendations and sample results not as the final word in what to buy, but rather as a landing pad to dive into the marketplace. Diving into Chad’s recommendations for “Brooklyn” could then lead to a search for “brooklyn bridge” and purchasing an 8×10 photo.
At Etsy, we build our system in a continuously deployed environment which allows us to quickly iterate and experiment. We view everything we build as somewhat of an experiment, and the Facebook gifter is no exception. We look forward to the future of “Gift Ideas for Your Friends” and social commerce in general on Etsy.
Etsy at Hadoop World
Tomorrow afternoon, I will be speaking at Hadoop World about Data Mining for Product Search Ranking. The talk will cover recent attempts to improve search at Etsy using clickstream analysis in Hadoop. If you’re attending Hadoop World, come say hello!
Grace Hopper Celebration of Women in Computing 2011
We are proud to announce that we are a sponsor of this year’s Grace Hopper Celebration of Women in Computing. The conference is being held next week, November 9-11, at the Oregon Convention Center in Portland, OR.
Grace Hopper Celebration of Women in Computing is the largest technical conference for women in computing. Michelle D’Netto, Rachel Vecchitto, and I are so excited to be attending the conference. We cannot wait to catch up with women leaders in industry, academia and government, and to have the wonderful opportunity to meet the future of women in computing.
Please come meet us at the Etsy booth in the exhibit hall, or at least bump into us if you will be in Portland.
P.S. Thank you, Grace Hopper, for your words of wisdom: “It’s easier to ask for forgiveness than it is to get permission.”
Localizing Logically for a Global Marketplace
An often-overlooked (or underestimated) aspect of internationalizing a website is determining how to localize for a given visitor. You might detect that a visitor is “German” because their Geo IP locates them as connecting from Germany, or because their browser accept language is German. But what if your detection is faulty… they’re using a shared computer, or on vacation, or the Geo IP is just plain wrong?
At Etsy, we wanted to use a tiny bit of magic to help provide the best localized experience for visitors, while still allowing users control over their experience. Websites often use too much magic (e.g. automatically setting incorrect localization settings with a difficult-to-find escape hatch) or use too little (e.g. forcing visitors to a lame splash page where they must choose their home country/language from dropdowns).
For a global marketplace like Etsy, localization breaks down into three components:
- Language—what language should we use to display the site (UI chrome, shops, listings, help content, emails, etc.)?
- Region—what region (country) should we assume the user is from, when showing regional content (e.g. blog posts) or during shopping (e.g. search, checkout, shipping)?
- Currency—what currency should we display prices in?
For a given visitor, localizing to the correct language, region and currency provides the best experience on Etsy. We spent time understanding our visitors and members, determining the best cues for our detection logic, and created an EtsyLocale helper class to encapsulate all localization-related functionality on the site.
Understand your visitors
We started by examining primary and secondary use cases.
Primary use cases:
- English-speaking American user (en/US/USD)
- English-speaking Canadian user (en/CA/CAD)
- British English-speaking UK user (en-GB/UK/GBP)
- British English-speaking Australian user (en-GB/AU/AUD)
- German-speaking German user (de/DE/EUR)
- French-speaking French user (fr/FR/EUR)
Secondary use cases:
- Curious American user wanting to see German translations (de/DE/EUR with easy route back to en/US/USD)
- German user on vacation in America (de/DE/EUR)
- Spanish-speaking American (es/US/USD)
- French-speaking Canadian (fr/CA/CAD)
- British ex-pat living in Japan (en-GB/JP/JPY)
- English-speaking user using friend’s German computer (en/DE/EUR)
Some countries have clear majority preferences for language, but some don’t! And there are plenty of edge case exceptions to think through for any userbase. For that reason, we allow language, region & currency to be set independently.
Cues
At Etsy, we use a series of cues to determine which language, region & currency to show a user. In decreasing order of “signal strength”:
- User preference (language, currency and/or region preference set manually by signed-in member)
- Cookie preference (language, currency and/or region preference set manually by signed-out visitor)
- Primary browser accept language (e.g. en-US or de-DE or de or ko)
- User profile address country (e.g. DE or US)
- GeoIP region (e.g. DE or US)
We also have ccTLDs (e.g. etsy.de) and subdomains (e.g. de.etsy.com) that we use for marketing efforts and for search indexing (SEO) purposes, which we’ll discuss in a later blog post.
Magic
When a visitor comes to Etsy without preferences set, we iterate through the above cues in order, and for each cue we see if there’s a clear mapping between the cue and our set of supported languages, regions and currencies. Once we’ve computed our best guess for language, region & currency, we show a gentle nag at the bottom of the page:

The nag is persistent, and is in both the detected language and English. We always nag instead of auto-setting to minimize confusion/surprises (magic). We don’t nag our primary market users (English-speaking users in the US). All visitors/members have the ability to change their language, region & currency preferences by clicking links in the footer of any page:

We store these preferences as cookie preferences for signed-out visitors. When a user registers, we migrate these cookie preferences to user preferences. When a user signs out, we don’t write their language/region/currency preferences back out as a cookie, providing a “clean slate” experience for other users of that browser. When we add support for new languages/regions/currencies, we treat that as a new “version” of the preferences, re-nagging where appropriate.
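The cue iteration described in this section can be sketched as follows. This is a simplified Python illustration, not the EtsyLocale implementation: the supported sets, the defaults, and the representation of each cue as a dictionary are assumptions for this example.

```python
from typing import Dict, List

# Hypothetical supported values and defaults, loosely based on the use cases above.
SUPPORTED = {
    "language": {"en", "en-GB", "de", "fr"},
    "region": {"US", "CA", "UK", "AU", "DE", "FR"},
    "currency": {"USD", "CAD", "GBP", "AUD", "EUR"},
}
DEFAULTS = {"language": "en", "region": "US", "currency": "USD"}


def resolve_locale(cues: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick language, region and currency independently.

    `cues` is ordered from strongest to weakest signal; each cue may
    supply any subset of the three components. The first supported
    value found for a component wins.
    """
    resolved = {}
    for component, supported in SUPPORTED.items():
        resolved[component] = DEFAULTS[component]
        for cue in cues:
            value = cue.get(component)
            if value in supported:
                resolved[component] = value
                break
    return resolved


# German user on vacation in America: no saved preferences,
# browser says de-DE, GeoIP says US.
cues = [
    {},  # user preference
    {},  # cookie preference
    {"language": "de", "region": "DE", "currency": "EUR"},  # accept language
    {},  # profile address country
    {"region": "US", "currency": "USD"},  # GeoIP
]
print(resolve_locale(cues))
```

Because the browser language outranks GeoIP in this ordering, the vacationing user resolves to de/DE/EUR, matching the secondary use case listed earlier.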
Encapsulate
We encapsulate all of this logic in an EtsyLocale() object, which is available across our stack, for easy access to the current visitor/member’s language, region and currency, e.g.
if (EtsyLocale::getInstance()->getRegion() == "DE") {
    include "hello_etsy_berlin_meetup.tpl";
}
We make use of Smarty modifiers to format and display prices based on EtsyLocale->getCurrency(). Our translation tools (specifically, translateMsg()) make use of EtsyLocale->getLanguage() to determine which translations to use.
We use PHP’s built-in setlocale() methods for date formatting (including month name translations), number formatting, string alphabetization and so on. PHP’s setlocale() function has varying support for locale formats. For example, if an Etsy visitor has German language preferences but French region preferences, we might represent that locale string as “de_FR” in PHP. However, setlocale() doesn’t understand that we should use German month names and number formatting for “de_FR”. So, to be safe, we pass in a list of locale strings that setlocale() should attempt to use including a more generic language-only locale—in this case (“de_FR”, “de_DE”).
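Building that fallback list can be sketched like this. The example is in Python purely for illustration (the site itself is PHP), and the default-region table and function name are hypothetical.

```python
from typing import Dict, List

# Hypothetical mapping from a language to its most generic locale region.
DEFAULT_REGION: Dict[str, str] = {"de": "DE", "fr": "FR", "en": "US"}


def locale_candidates(language: str, region: str) -> List[str]:
    """Return locale strings to try in order: the exact language/region
    pair first, then a generic locale for the language."""
    candidates = [f"{language}_{region}"]
    fallback = f"{language}_{DEFAULT_REGION.get(language, region)}"
    if fallback not in candidates:
        candidates.append(fallback)
    return candidates


print(locale_candidates("de", "FR"))  # ['de_FR', 'de_DE']
```

The resulting list is what would be handed to the locale-setting call, so that an unknown combination degrades to the correct language instead of the system default.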
You get a lot for free with PHP locales, but there are still a lot of holes to plug. At Etsy, we needed to develop date formats for each region, e.g. short dates (“Dec 1, 2010” for en-US, “01. Dez. 2010” for de) and long dates (“December 1, 2010” for en-US, “01 décembre 2010” for fr). Using setlocale() too aggressively for number formatting can cause SQL-incompatible float writing (e.g. “1.234,56” instead of “1,234.56“). And keep in mind you often need to use multibyte-aware functions in PHP to take advantage of locale settings.
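One way to avoid the float-writing problem is to keep display formatting and storage formatting in separate code paths. The Python sketch below illustrates the idea with hypothetical helper names; it is not how our PHP code is structured.

```python
def format_for_display(value: float, group_sep: str, decimal_sep: str) -> str:
    """Format a number for humans using the given separators."""
    english = f"{value:,.2f}"  # always "1,234.56" style, independent of locale
    # Swap both separators in a single pass so they cannot clobber each other.
    return english.translate(str.maketrans({",": group_sep, ".": decimal_sep}))


def format_for_sql(value: float) -> str:
    """Format a number for storage: no grouping, '.' as the decimal point."""
    return f"{value:.2f}"


print(format_for_display(1234.56, ".", ","))  # 1.234,56 (German style)
print(format_for_sql(1234.56))                # 1234.56
```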
Na, was sagt ihr?
We’d love to hear from you… any examples of well-localized sites, problems you’ve come across, PHP tricks/solutions? Share with us below. Stay tuned for more about Etsy’s internationalization.