Scaling Build and Deployment Systems

Flurry is growing rapidly in every sense of the word: customers, teams, and applications. We see 650 million unique users per month from mobile applications that embed our SDKs, and those applications generate over 1.6 billion transactions per day. Our HBase cluster of more than 1,000 nodes holds many petabytes of data and is growing fast.

While Flurry’s growth is great news for our customers, employees, and investors, it creates challenges for Release Management: more applications, more developers, and more development and test environments to support. We need to manage all of that well, quickly, and reliably.

In short, we need continuous integration to rapidly build, deploy, and test our applications.

Deployment Tools @ Flurry

To support our continuous integration, we set up three core tools.

  • Source Control: We use GitHub Enterprise to manage our source code and various configuration files. We follow a variation of the Git Flow development process in which every feature is developed on its own branch and every merge takes the form of a pull request, which is also code reviewed (a sketch of the typical flow follows this list).
  • Continuous Build: We use Jenkins to build code, deploy it to our QA and test environments, and run our JUnit tests. Jenkins is set up to run the JUnit tests automatically when developers check in code and when they create pull requests for their branches. Jenkins also runs Selenium tests with SauceLabs every night against the QA environments and Production.
  • Task Tracking: We use Jira (with the Greenhopper agile plugin) for ticket management, planning, and tracking of enhancements and bug fixes. All three tools are well integrated through various plug-ins that allow them to share information and trigger actions in one another.
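
For a concrete picture of that source control flow, here is a minimal sketch; the branch and commit names are invented, and the pull request itself is opened and reviewed in the GitHub Enterprise web UI, where Jenkins picks it up and runs the JUnit suite.

    # Start a feature branch off the main development branch (names are examples)
    git checkout develop
    git pull origin develop
    git checkout -b feature/ad-report-export

    # Commit the work and push the branch to GitHub Enterprise
    git add .
    git commit -m "Add CSV export for ad reports"
    git push -u origin feature/ad-report-export

    # Open a pull request against develop in the GitHub Enterprise UI for code review;
    # merging happens only through that pull request.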

Challenges at Scale

Our setup for continuous integration has served us well but has some challenges.

  • Too Many Jobs: We have more than 50 Jenkins jobs, over 130 deployment scripts, and more than 1,600 configuration files for the CI tools and applications. Each new application and each new QA environment adds to the pile. While we are whizzes at writing bash scripts and Java programs, this is clearly not scalable in the long term.
  • Slow Deployments: For security reasons, our Jenkins server cannot deploy war files and config files directly to Production servers. For Production deployments, we run a Jenkins job that copies the files built by other jobs to a server in the data center over a locked-down, one-way secure tunnel. Ops staff then manually run various scripts to push the files to the Production servers and restart them. This is inefficient in both time and people.
  • Test Overrun: Our JUnit test suite has over 1,000 test cases, which take about an hour to run. As the number of developers grows, the test runs triggered by their pull requests are clogging the Jenkins build server. We release to Production every two weeks and would like to cut that down to daily, or at least every few days, but the build, deploy, test, analyze, and fix cycle is too long to allow it.

Improving the Process: Design for Speed

The speed of an engineering team is directly tied to how quickly it can release and deploy, so we needed to get faster. We have taken a number of steps to address these challenges.

  • We optimized our JUnit test cases by removing unnecessary sleep statements and stubbing out the deployments to our test CDN, which reduces network wait time.
  • We upgraded the build system to bigger, faster hardware and parallelized the JUnit test runs so that we can run multiple test jobs at the same time (see the sketch after this list).
  • We added dedicated Jenkins slave servers that can share the burden during times of heavy parallel building.
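
As a rough sketch of the parallel test runs (assuming a Maven build with the Surefire plugin; the fork count shown is illustrative, not our exact configuration):

    # Run the JUnit suite in several forked JVMs instead of one (Surefire example; values are illustrative)
    mvn test -DforkCount=4 -DreuseForks=true

    # The same idea applies across machines: Jenkins dispatches independent test jobs
    # to slave nodes so heavy pull-request activity does not queue up on the master.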

Overall we have reduced the time to run the entire test suite to 15 minutes.

To make it easier to manage the Jenkins jobs, we removed a lot of old jobs and combined others using parameterized builds.  We renamed the remaining Jenkins jobs to follow a standard naming convention and organized them into tabbed views.  We now have a dozen jobs laid out where people can find them.
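
As an illustration of how a parameterized job gets kicked off, Jenkins’ remote API accepts a POST with the build parameters; the job name, parameter names, and credentials below are hypothetical.

    # Trigger a parameterized Jenkins build over the remote API (all values are examples)
    curl -X POST "https://jenkins.example.com/job/deploy-apps/buildWithParameters" \
         --user "releng:API_TOKEN" \
         --data-urlencode "TARGET_ENV=qa2" \
         --data-urlencode "APPS=portal,reporting"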

All of the improvement steps have helped, but we needed more fundamental changes.

Improving the Process: RPM-Based Software Deployments

We changed our build and deployment process to use RPM repositories, where every environment has its own local repository of RPMs. In the new process, the Jenkins job builds the war files, then bundles each war file into an RPM along with its install script. The job also builds RPMs for each application’s config files, email templates, and data files, and for the HBase and Hadoop config files. Once all of the RPMs are built, the job rsyncs them to the target environment’s repo. It then runs yum install over ssh against each server in the environment to get it to update itself. Once all the servers are updated, the job restarts them all at once. The job is parameterized so that users can build and deploy a single app, a set of apps, all apps, or just some config files.
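
A condensed sketch of what one pass of that job does is below; the package names, hosts, repo path, and service name are illustrative, and the real job is driven by Jenkins parameters rather than hard-coded lists.

    #!/bin/bash
    # Illustrative outline of the RPM build-and-deploy flow; not the actual Jenkins job.
    set -e

    REPO_HOST="qa2-repo.example.com"          # per-environment yum repository (example host)
    REPO_PATH="/var/yum/flurry"
    SERVERS="app01 app02 app03"               # servers in the target environment (example list)

    # 1. Package each war plus its install script, and each set of config files, into RPMs
    rpmbuild -bb specs/portal.spec
    rpmbuild -bb specs/portal-config.spec

    # 2. Push the freshly built RPMs to the environment's repo and refresh its metadata
    rsync -av ~/rpmbuild/RPMS/noarch/ "$REPO_HOST:$REPO_PATH/"
    ssh "$REPO_HOST" "createrepo --update $REPO_PATH"

    # 3. Have every server in the environment pull and install its updates from that repo
    for host in $SERVERS; do
        ssh "$host" "sudo yum clean metadata && sudo yum -y install portal portal-config"
    done

    # 4. Restart the updated servers together once everything is installed
    for host in $SERVERS; do
        ssh "$host" "sudo service jetty restart"   # service name is an example
    done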

The developers have access to Jenkins so that they can update the QA environments at will without having to involve Release Engineering.

The RPM-based build and deployment process gives us several advantages. The install scripts are embedded in the RPMs, which does away with the cluttered hierarchy of scripts called by the Jenkins jobs. The same tools and processes used for deployments in the Dev and QA environments can now be safely used in Production.

By having a repo for each environment, we only have to deploy the RPMs once to that repo. Each server in the environment then pulls the RPMs from its repo. This saves a lot of time and network bandwidth for our remote environments, whose servers used to get files directly from Jenkins.
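
On the server side, pointing a machine at its environment’s repository is just a small yum repo definition along these lines (the repo name, path, and host are invented for illustration):

    # /etc/yum.repos.d/flurry-qa2.repo -- example definition on each server in the QA2 environment
    [flurry-qa2]
    name=Flurry QA2 local RPMs
    baseurl=http://qa2-repo.example.com/flurry
    enabled=1
    gpgcheck=0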

RPMs support dependencies, which instruct yum to deploy a group of other RPMs before deploying the given RPM. For example, we can make an application’s RPM dependent on the application’s config file RPM, so that when we install the application, yum automatically installs the config file RPM. The dependency feature also allows us to set up a parent RPM for each class of server, where the parent RPM depends on all of the application RPMs that run on that class of server. We simply execute yum install with the parent RPM, and yum downloads and installs all of the application RPMs and their dependent config file RPMs needed for that server. In the future we will add dependencies for Java, Jetty, and various OS packages to the parent RPMs. This will allow us to kick-start a new server and fully provision it at the push of a button.
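
The dependency chain comes down to a few Requires lines in the spec files. Here is a stripped-down example of a parent package for one server class; all of the package names are invented for illustration.

    # flurry-apiserver.spec -- stripped-down parent package for an example server class
    Name:           flurry-apiserver
    Version:        1.0
    Release:        1
    Summary:        Meta package that pulls in everything an API server runs
    License:        Proprietary
    BuildArch:      noarch

    # yum resolves these automatically, so installing the parent package
    # installs the applications and their config packages as well
    Requires:       portal
    Requires:       portal-config
    Requires:       reporting
    Requires:       reporting-config

    %description
    Parent RPM for the API server class; it owns no files of its own.

    %files

Provisioning a server of that class then comes down to a single yum install of the parent package against the environment’s repo.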

Conclusion

As with any change in process and tools, there were a few gotchas. The Jenkins slave server was easy to set up, but a lot of the tools and configuration needed to support our JUnit test runs had to be copied over from the Jenkins master server. We also found a few places where the concurrent JUnit test runs stepped on common files.

Overall, the changes have sped up and cleaned up our build and deployments.  They have allowed us to better manage what we have and to handle our future growth.


Hardware Fault Detection and Planning

The Flurry Operations Team handles 4,428 spinning disks across 1,107 servers with a team of 6 awesome operations engineers. Since Flurry is a startup, we don’t have an on-site tech team to handle all of the hardware issues that come up at the datacenter. As we’ve grown from 400 disks to over 4,000, we’ve improved our process for handling servers with disk hardware failures.

The most common hardware alerts we receive come from Self-Monitoring, Analysis and Reporting Technology, better known as SMART. This tool tests and monitors disks, detecting and reporting potential disk issues in the hope of warning the admin before a disastrous failure appears. (Find out more about SMART errors.)
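
Checking a drive by hand looks something like this (the device path is an example):

    # Quick health verdict from the drive's SMART data
    smartctl -H /dev/sda

    # Full attribute dump -- reallocated sectors, read error rates, temperature, and so on
    smartctl -a /dev/sda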

Flurry lives and dies by the data stored in our Hadoop and HBase cluster, so when a disk issue happens we need to respond quickly and decisively to prevent data loss and performance impacts. We generally receive critical and non-critical alerts on around 1% of active cluster disks each month, not all of which need immediate attention.

Monitoring 400 disks: SMART error detected on host

When we were a humble cluster of 100 servers, it was easy to log into a box, gracefully stop the Hadoop services, unmount the drive, and start the Hadoop daemons back up. Most of the actionable alerts we saw were High Read Errors or Uncorrectable Sectors, which tend to indicate potentially fatal disk issues.
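
On a Hadoop 1.x node, that routine looked roughly like the sketch below; the daemon names and mount point are illustrative, and the exact steps depend on the cluster configuration.

    # Gracefully pull the node's Hadoop daemons before touching the disk
    hadoop-daemon.sh stop tasktracker
    hadoop-daemon.sh stop datanode

    # Unmount the failing drive and drop it from the DataNode's data directories
    umount /data/3
    # (remove /data/3 from dfs.data.dir in hdfs-site.xml before restarting)

    hadoop-daemon.sh start datanode
    hadoop-daemon.sh start tasktracker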

Hadoop tends to let the filesystem handle marking the sectors as bad or unreadable, forcing reads to fall back to another replica. Hadoop is pretty good about remapping the blocks, but doing so can increase read latency and generally degrades the overall performance of the cluster. Did I already mention that we don’t like performance degradation?

Monitoring 1200 disks: Find those bad drives fast


Our first datacenter expansion in 2011 consisted of a buildout of an additional 200 servers. Each server has 4 x 1TB drives used in the cluster, so that’s 800 disks in this buildout. During pre-production diagnostic tests, we had a 0.5% failure rate on the new disks.

Once the initial issues were resolved, we deployed the servers into production. The 200 new servers averaged 2.67 disks going bad per month for the period before our next datacenter buildout, and our original 400 disks started reporting 2 new issues a month. That’s a jump from 0.3% to 0.6% disk issues a month, possibly due to their age.

Monitoring 2400 disks: Throwing more servers in the mix


Four months later, we needed to double our infrastructure to handle all of the new data we were processing for our analytics. This time we were adding 1,200 new disks to the cluster, and saw a similar level of issues. The pre-production diagnostic tests only shook out 0.02% of the disks as bad.

Around this time, our drive SMART checks increased from under 1% to 1.3% failures a month. This was also during the Holiday App Boom, as seen here and here. We were spending too much time ferrying drives back and forth between the office and the datacenter, and started questioning our diagnostics, our urgency and response to SMART errors, and our steps for replacing a drive.


Our servers have temperature sensors that we started monitoring manually, and we noticed the new servers were idling around 84°F, a range where we tend to see higher hardware failure rates. We started graphing the temperatures and watched them climb to 89°F as we brought servers into production. There was a lot we needed to do and not enough resources to do it, other than bug the NOCs to come up with strategies to bring us down to 77°F.
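
The readings themselves are easy to pull by hand; the commands below are generic examples rather than our exact tooling, and the device path is illustrative.

    # Chassis and ambient temperatures from the BMC (requires ipmitool and IPMI access)
    ipmitool sdr type Temperature

    # Per-drive temperature straight from SMART
    smartctl -A /dev/sda | grep -i temperature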

Monitoring 4800 disks: Finally some cooling relief


Ten months later, we once again doubled our infrastructure and migrated all of our servers into a new space where we now benefit from more efficient datacenter cooling. Where we used to average 77°F, we now run between 60°F and 65°F. Guess what happened to our average monthly SMART errors: the rate has dropped to 0.4% since the move. There may be several factors at play here:

  1. Higher temperatures definitely seemed to contribute to higher failure rates.
  2. We had a burn-in time for those first 2,400 disks.
  3. The load across the disks had lightened after such a large expansion.

Monitoring N disks: Scripts and processes to automate our busy lives

We’ve also improved our process for taking servers with SMART alerts out of rotation by creating a script that smartd calls when there’s an issue. In order to automate this, we’ve allowed the smartd check to take servers out at will. By modifying smartd.conf a bit, we have the check call our shell script, which runs a few checks and then gracefully stops the Hadoop and HBase processes; this spreads the data on the affected disks out to healthy servers. We’ve also included a check to make sure the number of servers we take down does not exceed our Hadoop HDFS replication factor, which further reduces the risk of taking out multiple replicas of the same blocks of data. Once all is complete, the script notifies the Operations team of the tasks performed or skipped. We have open sourced this script on our GitHub account here so you can fork and use it yourself.
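
To give a flavor of the mechanism, here is a heavily simplified sketch; the real script is the one linked above, and the smartd.conf line, addresses, paths, and parsing below are illustrations only.

    #!/bin/bash
    # /usr/local/bin/smart-alert.sh -- simplified sketch of a smartd alert handler.
    #
    # smartd calls it via a line like this in /etc/smartd.conf
    # (-M exec requires -m; the address and path are examples):
    #   DEVICESCAN -a -m ops@example.com -M exec /usr/local/bin/smart-alert.sh

    REPLICATION_FACTOR=3

    # How many datanodes are already down? (parsing simplified for illustration)
    DEAD=$(hadoop dfsadmin -report 2>/dev/null | grep -o '[0-9]* dead' | head -1 | awk '{print $1}')
    DEAD=${DEAD:-0}

    # Never take more nodes out than HDFS replication can tolerate
    if [ "$DEAD" -ge $((REPLICATION_FACTOR - 1)) ]; then
        echo "SMART alert on $(hostname): $DEAD nodes already down, leaving daemons running" \
            | mail -s "SMART alert (skipped)" ops@example.com
        exit 0
    fi

    # Gracefully drain this node so HDFS re-replicates its blocks onto healthy servers
    hbase-daemon.sh stop regionserver
    hadoop-daemon.sh stop tasktracker
    hadoop-daemon.sh stop datanode

    echo "SMART alert on $(hostname): hadoop/hbase daemons stopped for disk replacement" \
        | mail -s "SMART alert (handled)" ops@example.com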

What about the physical disks? Instead of having an engineer go down and pull disks from each server, we plan on utilizing our Remote Hands service to perform that task for us, so we can focus on handling the broader-reaching issues. There were times when we batched up disk collection and engineers would carry 30 drives back to the office (walking barefoot, uphill both ways).

As always, we’re trying to do things more efficiently.  A few improvements we have in the plans include:

  1. Have the script unmount the bad disk and bring the server back into production (roughly sketched after this list).
  2. Have the script email Remote Hands with the server, disk, location, and issue so they can swap the bad drive.
  3. Once the disk is swapped, mount the new drive and return the server to production.
  4. Adapt the script to handle other hardware alerts and issues (network cards, cables, memory, mainboards).
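
A rough sketch of what steps 1 and 3 might look like; the device name, mount point, and filesystem are illustrative, and the config edits are only described in comments.

    # 1. Retire the failing drive and bring the server back into production
    umount /data/3
    # (remove /data/3 from dfs.data.dir in hdfs-site.xml so the DataNode skips it)
    hadoop-daemon.sh start datanode
    hadoop-daemon.sh start tasktracker

    # 3. After Remote Hands swap the drive: format, remount, restore the config, restart
    mkfs.ext4 /dev/sdd1
    mount /dev/sdd1 /data/3
    # (add /data/3 back to dfs.data.dir, then restart the DataNode to start using it)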

We’ve learned from those grueling earlier days, and we continue to make hardware management a priority. With a small team managing a large cluster, it’s important to automate simple, repetitive tasks and to lean on the services you are already paying for. I, for one, welcome our new robotic overlords.
