Flurry APIs now support CORS

We recently upgraded all of our REST APIs to support Cross-Origin Resource Sharing (CORS). If you’re not familiar with CORS, this means that you can now access Flurry APIs via JavaScript on web pages served from domains other than flurry.com. Get started now or read on to learn more about CORS.

Why do we need CORS?

Modern browsers implement a same-origin policy when processing and rendering the data associated with a website. Under this policy, scripts in a web page can only read resources (for example, via AJAX) that come from the same host (origin) as the page itself. For example, if you load flurry.com, your browser will only let the JavaScript in the page load resources from flurry.com.

Why is a same-origin policy important? Cross-site scripting attacks, which take advantage of cross-origin interactions, are one of the more common ways personal information is stolen these days. A basic attack would show the user a web page that looks just like the login page for your website. When the user enters their credentials into the malicious page, believing it to be yours, the malicious site takes the credentials and uses JavaScript to log them into your website using AJAX. Once logged in through JavaScript, it can steal data, manipulate the account, and change the password, all using AJAX behind the scenes without the user being aware.

Such attacks appear to the attacked service as legitimate traffic since they originate from a normal computer browser – complete with the cookies you have set. By preventing access to resources not hosted on the origin, and hence preventing AJAX from reaching another host, the browser is protecting you from this kind of attack. 

However, with the rise of HTML5, more and more web content is loaded dynamically through JavaScript and rendered in the browser. There are now very legitimate uses for cross-origin resource access in JavaScript, including widgets, applications and content management.

What is CORS?

CORS bridges the gap between security and flexibility by allowing a host to specify which resources are available from non-origin domains. This allows you to make REST APIs available for access from other domains in the browser, but not your login page. 

Adding CORS support is as simple as adding an extra HTTP response header that specifies what origins can access a given resource. To allow any domain to access a resource, you would include the following HTTP header in responses to requests for that resource:

Access-Control-Allow-Origin: *

Or, to only allow access from Flurry’s website domain you would use the following:

Access-Control-Allow-Origin: http://flurry.com
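To sketch what this looks like server-side, here is a toy example using the JDK’s built-in HttpServer; the /api path and response body are made up for illustration, not Flurry’s actual implementation:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class CorsExample {
    // Starts a tiny HTTP server whose /api responses carry the CORS header.
    public static HttpServer start() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/api", exchange -> {
            // Allow any origin to read this resource from browser JavaScript
            exchange.getResponseHeaders().add("Access-Control-Allow-Origin", "*");
            byte[] body = "{\"ok\":true}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Any page, regardless of origin, can now read responses from /api via AJAX, while endpoints that don’t set the header remain protected by the browser’s same-origin policy.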

Note that since the CORS header arrives in the response, the request has already been made by the time your browser evaluates whether to allow access to the result. This is important to keep in mind: even if the browser detects a CORS violation, the request will already have been processed on your servers. (For non-simple requests, such as those using HTTP methods other than GET/POST/HEAD or custom headers, browsers first send a preflight OPTIONS request and check its CORS headers before issuing the actual request.)

Not all browsers support CORS right now but most modern browsers do. You can read more on the CORS Wikipedia page.


The Delicate Art of Organizing Data in HBase

Here at Flurry we make extensive use of HBase, a distributed column-store database built on top of Apache Hadoop and HDFS. For those used to relational database systems, HBase may seem quite limited in the ways you can query data. Rows in HBase can have unlimited columns, but can only be accessed by a single row key. This is somewhat like declaring a primary key on a SQL table, except that you can only query the table by this key. So if you wish to look up a row by more than one column, as in SQL where you would perform a select * from table where col1 = val1 and col2 = val2, you would have to read the entire table and then filter it. As you can probably tell, reading an entire table can be an extremely costly operation, especially at the data sizes HBase usually handles. One solution is to use a composite key, combining multiple pieces of data into a single row key.

What’s a composite key?

HBase stores its data, both row keys and column values, as plain byte arrays. Since it would be tedious to manipulate those directly in our code, we store our keys in a container class that knows how to serialize itself. For example:

https://gist.github.com/2912716
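The gist isn’t reproduced inline, but a self-contained sketch consistent with the description (the field names and types are our assumptions) might look like:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Assumed shape: an int userId followed by a String applicationId.
public class UserAppKey {
    final int userId;          // assumed non-negative; see the "gotchas" discussion
    final String applicationId;

    public UserAppKey(int userId, String applicationId) {
        this.userId = userId;
        this.applicationId = applicationId;
    }

    // Serialize to the byte[] used as the HBase row key:
    // 4 big-endian bytes for userId, then writeUTF's length-prefixed string.
    public byte[] toBytes() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(userId);
        out.writeUTF(applicationId);
        return bytes.toByteArray();
    }
}
```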

Here we have two fields forming a single key with a method to serialize it to a byte array. Now any time we want to look up or save a row, we can use this class to do so. 

While this allows us to instantly get a row if we know both the userId and applicationId, what if we want to look up multiple rows using just a subset of the fields in our row key? Because HBase stores rows sorted lexicographically by comparing the key byte arrays, we can perform a scan to effectively select on one of our index fields instead of requiring both. Using the above index as an example, if we wanted to get all rows with a specific userId:

https://gist.github.com/2912730
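Again, the gist isn’t shown inline; a sketch of the start/stop row computation, assuming the same key layout (a big-endian int userId followed by writeUTF(applicationId)), might be:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class UserScanBounds {
    // Same assumed serialization as the key class above.
    public static byte[] key(int userId, String applicationId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(userId);
        out.writeUTF(applicationId);
        return bytes.toByteArray();
    }

    // All rows for userId fall in [startRow, stopRow): the empty string sorts
    // before every other applicationId, so (userId, "") is the smallest key
    // for this user and (userId + 1, "") is the smallest key of the next user.
    public static byte[] startRow(int userId) throws IOException {
        return key(userId, "");
    }

    public static byte[] stopRow(int userId) throws IOException {
        return key(userId + 1, "");
    }
    // With the HBase client these bounds would be passed as:
    //   Scan scan = new Scan(startRow(userId), stopRow(userId));
}
```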

Since we’ve set the start and stop rows of the Scanner, HBase can give us all the rows with the specified userId without having to read the whole table. However there is a limitation here. Since we write out the userId first and the applicationId second, our data is organized such that all rows with the same userId are adjacent to each other, and rows with the same applicationId but different userIds are not. Thus if we want to query by just the applicationId in this example, we need to scan the entire table. 

There are two “gotchas” to this serialization approach that make it work as we expect. First, we assume here that userId is greater than 0. The binary representation of a negative two’s-complement integer sorts lexicographically after the largest positive number. So if we intend to have negative userIds, we would need to change our serialization method to preserve ordering. Secondly, DataOutput.writeUTF serializes a string by first writing a short (2 bytes) for its length, then all the characters in the string. Under this serialization, the empty string naturally sorts first lexicographically. If we serialized strings using a different method, such that the empty string were not first, then our scan would stop somewhere in the next userId.
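The first gotcha is easy to demonstrate. A small sketch (using the same DataOutput serialization) showing that negative ints break lexicographic ordering, along with one common fix, flipping the sign bit so signed int order matches unsigned byte order:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SignGotcha {
    // Naive serialization: writeInt emits the raw two's-complement bytes.
    public static byte[] intKey(int userId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(userId);
        return bytes.toByteArray();
    }

    // Flipping the sign bit maps signed int order onto unsigned byte order,
    // a common fix if negative ids are possible.
    public static byte[] orderPreservingKey(int userId) throws IOException {
        return intKey(userId ^ Integer.MIN_VALUE);
    }
}
```

With the naive encoding, -1 serializes to 0xFFFFFFFF and therefore sorts after 1 (0x00000001); with the sign bit flipped, the ordering matches the numeric one.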

As you can see, just putting our index data in our row key is not enough. The serialization format of our components determines how well we can exploit the properties of HBase lexicographic key ordering, and the order of the fields in our serialization determines what queries we are able to make. Deciding what order to write our row key components in needs to be heuristically driven by the reads we need to perform quickly. 

How to organize your key

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation, I find it easiest to think of your key like a funnel: start with the piece of data you always need to partition on, and narrow down to the more specific data that you don’t often need to distinguish. Using the above example, if we almost always partition our data by userId, putting it first in our key is a good idea. That way we can easily select all the applicationIds for a userId, and also select a specific applicationId for a userId when we need to. However, if we often look up data by applicationId across all users, then we would probably want to put the applicationId first.

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

When your key gets too complex

If your queries go beyond the simple funnel model, then it’s time to consider adding another table. Those used to heavily normalized relational databases will instinctively shy away from repeating data in multiple tables, but HBase is designed to handle large amounts of data, so we can exploit that to overcome the limitations of a single index. Adding another table that stores the same cells under a differently organized key can reduce your need to perform full table scans, which are extremely expensive time-wise, at the cost of space, which is significantly cheaper if you’re running at Hadoop scale. Using the above example, if we routinely needed to query by just userId or just applicationId, we would probably want to create a second table. Assuming we still want to be able to query by both userId and applicationId, the key order for the second table would depend on the distribution of users per application and applications per user. If a typical user has more applications than a typical application has users, then a scan over a (userId, applicationId) key for one user would take longer than a scan over an (applicationId, userId) key for one application, and vice versa.

The downside to this approach is the added complexity of ensuring that the data in both of your tables is the same. You have to ensure that whenever you write data to one table, you write it to both tables simultaneously.  It helps to have all of your HBase reading and writing encapsulated so that individual producers or consumers are not accessing the HBase client code directly, and the use of multiple tables to represent a single HBase-backed entity is opaque to its users.
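As a rough sketch of that encapsulation (in-memory sorted maps stand in for the two HBase tables here; all names are made up):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

public class DualIndexStore {
    // Comparator mirroring HBase's unsigned lexicographic row-key ordering.
    private static final Comparator<byte[]> LEX = Arrays::compareUnsigned;

    // Stand-ins for the (userId, appId)-keyed and (appId, userId)-keyed tables.
    final TreeMap<byte[], byte[]> byUser = new TreeMap<>(LEX);
    final TreeMap<byte[], byte[]> byApp = new TreeMap<>(LEX);

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // The single write path: every put lands in both tables, so callers
    // never need to know that two tables exist.
    public void put(byte[] userId, byte[] appId, byte[] value) {
        byUser.put(concat(userId, appId), value);
        byApp.put(concat(appId, userId), value);
    }
}
```

Readers then pick whichever map (table) matches the leading component of their query, and writers only ever call put.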

If you’ve got a lot of rows

Sometimes storing large parts of your data in the key can be hazardous. Scanning multiple rows is usually more costly than reading a single row, even if that single row has many columns. So if performing a scan on the first part of your composite key often returns many rows, then you might be better off reorganizing your table to convert some of the parts of your composite key to column names. The tradeoff there is that HBase reads entire rows at a time, so while you can instruct the HBase client to only return certain columns, this will not speed up your queries. In order to do that, you need to use column families. 

Getting it right the first time

Finally, always take great care to pick the right row key structure for your current and future querying needs. While most developers are used to throwing out much of the code they write, refactoring as business requirements evolve, changing the structure of an HBase table is very costly and should be avoided whenever possible. Unlike most relational databases, where adding a new index is a simple operation with a mild performance hit, changing an HBase table’s structure requires a full rewrite of all the data in the table. At the scale you’d normally use Hadoop and HBase for, this can be extremely time consuming. Furthermore, if you can’t afford for your table to be offline during the process, you need to handle the migration while still allowing reads from the old and new tables, and writes to the new table, all while maintaining data integrity.

HBase or not HBase

All of this can make HBase intimidating to first-time users. Don’t be afraid! These techniques are common to most NoSQL systems, which are the future of large-scale data storage. Mastering this new world unlocks a massively powerful system for large data sets and analyses that were never possible before. While you need to spend more time up front on your data schema, you gain the ability to easily work with petabytes of data and tables containing hundreds of billions of rows. That’s big data. We’ll be writing more about HBase in the future; let us know if there’s something you’d like us to cover.

Here at Flurry, we are constantly working to evolve our big data handling strategies. If that sounds interesting, we are hiring engineers! Please check out http://flurry.com/jobs for more information.

 


Scaling @ Flurry: Measure Twice, Plan Once

Working in web operations can be quite exciting when you get paged in the middle of the night to debug a problem, then burn away the night hours formulating a solution on the fly using only the resources you have at hand. It’s thrilling to make that sort of heroic save, but the business flows much more smoothly when you can prepare for the problems before they even exist. At Flurry we watch our systems closely, find patterns in traffic and systems’ behavioral changes, and generally put solutions in place before we encounter any capacity-related problems.

A good example of this sort of capacity planning took place late last year. During the Christmas season Flurry usually sees an increase in traffic as people fire up their new mobile phones, so we knew in advance of December 2011 that we’d need to accommodate a jump in traffic. But how much? Fortunately, we had retained bandwidth and session data from earlier years, so we were able to estimate our maximum bandwidth draw based on the increases we experienced previously, and our estimate was within 5% of our actual maximum throughput. There are probably some variables we still need to account for in our model, but we were able to get sufficient resources in place to make it through the holiday season without any serious problems. Having worked in online retail, I can tell you that not getting paged over a holiday weekend is something wonderful.

Figure: doubling of outbound bandwidth from November to December 2011, with December 24th highlighted in yellow.

Outside of annual events, we also keep a constant eye on our daily peak traffic rates. For example, we watch bandwidth to ensure we aren’t hitting limits on any networking choke points, and requests-per-second is a valuable metric since it helps us determine scaling requirements like overall disk capacity (each request taking up a small amount of space in our data stores) and the processing throughput our systems can achieve overall. The overall throughput includes elements like networking devices (switches, routers, load balancers) or CPU time spent handling and analyzing the incoming data.

Other metrics we watch on a regular basis include disk utilization, number of incoming requests, latency for various operations (time to complete service requests, but also metrics like software deployment speed, mapreduce job runtimes, etc.), CPU, memory, and bandwidth utilization, as well as application metrics for services like MySQL, nginx, haproxy, and Flurry-specific application metrics. Taken altogether, these measurements allow us to gauge the overall health and trends of our systems’ traffic patterns, from which we can then extrapolate when certain pieces of the system might be nearing capacity limits.

Changes in traffic aren’t the only source of capacity flux, though. Because the Flurry software is a continually changing creature, Flurry operations regularly coordinates with the development teams on upcoming changes that might increase database connections, add processing time per incoming request, or have similar effects. Working closely with our developers also allows Flurry to achieve operational improvements like bandwidth offloading by integrating content delivery networks into our data delivery mechanisms.

One area I think we could improve is in understanding what our various services are capable of when going all-out. We’ve done some one-off load tests to get an idea of upper limits for requests per second per server, and have used that research as a baseline for rough determinations of hardware requirements, but the changing nature of our software makes that a moving target. Getting more automated capacity tests would be handy in both planning hardware requirements and for potentially surfacing performance-impacting software changes.

Overall, though, I think we’ve done pretty well. Successful capacity planning doesn’t prevent every problem, but paying attention to our significant metrics allows us to grow our infrastructure to meet upcoming demand, saving our urgent-problem-solving resources for the truly emergent behavior instead of scrambling to keep up with predictable capacity issues.
