Continuous Delivery @ Viki

At Viki we like to release frequently: at least once a day, and often several times a day. This can be challenging in an aggressive agile team[1] where many things are always going on in parallel and everybody still commits to a single mainline of the repository. Sticking to continuous deployment requires discipline, and that discipline comes from following a workflow that everybody in the team is comfortable with. When features are released every day, various teams (Marketing, Customer Support, Business) need to be aware of what’s in the pipeline and how things are being rolled out. In most companies, a deployment team takes care of release management and of communicating to the rest of the world what is being released.

Ideal Scenario:

  1. Your Product Managers write feature/bug stories and prioritize them at any time during the day.
  2. Your engineering team believes in Continuous Integration, and all engineers commit to the master branch (or trunk) several times a day.
  3. If the build (CI) is green, that tag from the master branch is pushed to the QA environment so that stories can be delivered.
  4. Once all stories are accepted, the (QA) staging tag is deployed to production. Happy ending of the day!

Practical Scenario:

  1. You have 5 Product Managers who request feature/bug stories and prioritize them at any time during the day.
  2. You have 10 engineers working on 5 stories, all committing to the master branch (or trunk) several times a day.
  3. The build is green only 70% of the time during the day.
  4. By the time the build is green, master contains the commits for 4 finished stories and intermediate commits for 1 unfinished story.
  5. QA delivers the 4 finished stories; 3 are accepted and 1 is rejected.
  6. Among the 3 accepted stories, the Marketing team decides to hold 1 story even though it is ready.
  7. Now we have 2 stories ready for production, 1 held by Marketing, 1 rejected, and 1 unfinished.
  8. The commits for the 2 production-ready stories are interleaved with the commits for all the non-ready stories.
  9. Of the 2 ready-to-deploy stories, 1 is marked urgent, but its commits landed while the CI build was red (not necessarily because of those commits).
  10. What would you deploy today, and how? (Cherry-pick the commits for a feature that has no green build? FAIL.) You postpone the release!
    • ==Day Rolls Over==
  11. 3 new stories are requested by the Product Managers.
  12. An engineer quickly finishes 2 of the new stories, and their commits go to master. The 3rd new story might take long to finish, but its intermediate commits are pushed to master too.
  13. The build is green and a tag on master is pushed to the QA environment.
  14. The 2 new stories are delivered and accepted.
  15. Which staging tag on master would you deploy to production? At any given time, master contains commits from unfinished or undelivered stories.

Solutions:

  1. Sure, you can use feature toggles, but they only make sense for long-running sets of stories (ones that span weeks). When every story starts to have a feature toggle, the system gets polluted with if-else everywhere, which is again difficult to manage and error-prone.
  2. You can also ask engineers to keep a separate feature branch for each story and rebase onto master often. This brings its own overheads:
    • Time is spent merging changes.
    • It needs a single controller of the release branch who pulls the changes and makes sure that what goes live is vetted. (This controller can soon become a bottleneck in the process.)
    • Engineers work in isolation and cannot push intermediate commits to master until the feature is complete.
  3. Use an All-Accepted Marker: this is a commit on master below which all stories have been accepted. Above the marker there are a few commits for accepted stories (interleaved with commits for unfinished ones) that can be cleanly cherry-picked.

Workflow for All-Accepted Marker Deployment:

  1. Find a suitable commit below which all stories are accepted.
  2. Branch out from it to a new “Release” branch.
  3. Inform stakeholders about which features are being released and about the locked-down release marker.
  4. Cherry-pick the related commits from above the marker onto the release branch.
  5. Build the release branch and wait for it to go green.
  6. Deploy the release branch to production.
  7. Automate this entire process.
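
Steps 1 and 4 of this workflow can be sketched in Ruby. This is only an illustration and not the Deployment Pipeline tool's actual code: the `Commit` struct, the status symbols, and both method names are assumptions made for the example, and commits are assumed to be listed oldest first with one story per commit.

```ruby
# Hypothetical model: one commit belongs to exactly one tracker story.
Commit = Struct.new(:sha, :story_id)

# Walks master from the oldest commit and returns the last commit below which
# every commit belongs to an accepted story (nil if the very first one is not).
def all_accepted_marker(commits, statuses)
  marker = nil
  commits.each do |commit|
    break unless statuses[commit.story_id] == :accepted
    marker = commit
  end
  marker
end

# Commits above the marker that belong to accepted stories. These are the
# candidates for cherry-picking onto the release branch.
def cherry_pick_candidates(commits, statuses, marker)
  start = marker ? commits.index(marker) + 1 : 0
  commits.drop(start).select { |commit| statuses[commit.story_id] == :accepted }
end

commits = [
  Commit.new("a1", 101), Commit.new("b2", 102), Commit.new("c3", 103),
  Commit.new("d4", 101), Commit.new("e5", 104)
]
statuses = { 101 => :accepted, 102 => :accepted, 103 => :finished, 104 => :accepted }

marker = all_accepted_marker(commits, statuses)
puts "marker: #{marker.sha}"
puts "cherry-pick: #{cherry_pick_candidates(commits, statuses, marker).map(&:sha).join(', ')}"
```

Whether a candidate really cherry-picks cleanly still has to be checked by building the release branch, which is why the workflow waits for a green build before deploying.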

Deployment Pipeline is the tool that helps a deployment manager with the above workflow:

developers-machine:~/workspace/repository (master)$ pipeline help
Tasks:
  pipeline help [TASK]       # Describe available tasks or one specific task
  pipeline release_plan      # Prepares a release plan
  pipeline setup             # Setup Deployment Pipeline Tool
  pipeline status            # lists all stories with their status
  pipeline suitable_release  # Suggests a release commit to be picked and also includes a release plan

Options:
  [--config=CONFIG]  # A ruby file that defines relevant constants & configs. accepts ENV $PIPELINE_CONFIG
                     # Default: /Users/dev_home/.pipeline_config

By the way, we are currently growing our engineering team, check our current openings and write us to jobs@viki.com if you are interested in joining the Viki team!
;)

[1]: A team so motivated by the release-often philosophy that it releases to production multiple times a day. It uses a DVCS like Git and an agile story tracker like Pivotal Tracker. It has adopted TDD and Continuous Integration as a way of life. Every engineer commits to master all the time.

Finding unaccepted commits with git_story

As Viki engineers, we work closely with our product team in developing new features for Viki. We start by creating stories together. Those stories are then implemented and tested by engineers. At the end, they have to be tested again and accepted by the product team. We use this process to ensure that our product is built with both sides in mind: the technical point of view and the user experience point of view.

But when it comes to releasing our features to production, we find it hard to tell whether a set of to-be-released commits all belong to accepted stories.

We keep a link between each Pivotal Tracker story and its Git commits so that we can trace back the history of development, but that is not enough. We need a mechanism to list all unaccepted stories, because it is quite hard and error-prone to manually cross-check commits against stories. After researching for a while, we decided to write a gem of our own, called git_story, to help us with this.

To use this gem, run a command like this:

$ git story <commit-before-first-commit> <last-commit>

For example,

$ git story e1a4be4 9a5bfa42

And get back the result,

9a5bfa42712ba2a5cc76b504966d05bfd848892c #29606203 delivered from Admin
9f40f97dfa2200c4fdd94aa38d03d52d9123bb69 #29973257 delivered from Core
ac7f751c58d16453b2c4b4c9005cd5f8936cdd18 #29306719 finished from Data

The result contains the Git commit SHA-1, the story ID, the status of the story, and the project the story comes from. With this information we can easily figure out whether it is safe to release to production.
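
As a rough sketch of how such output could feed a release check, the Ruby below parses lines shaped like the sample above. The helper names, the struct, and the rule that anything not "accepted" blocks a release are assumptions for illustration; none of this is part of git_story itself.

```ruby
# Hypothetical helpers, not part of git_story. They expect lines shaped like
# "<sha> #<story id> <status> from <project>", as in the sample output.
StoryCommit = Struct.new(:sha, :story_id, :status, :project)

LINE_FORMAT = /\A(\h{7,40}) #(\d+) (\w+) from (.+)\z/

def parse_git_story(output)
  output.each_line.map { |line|
    match = LINE_FORMAT.match(line.strip)
    match && StoryCommit.new(match[1], match[2], match[3], match[4])
  }.compact # lines that do not match the format are skipped
end

# Assumption: any listed commit whose story is not "accepted" blocks a release.
def blocking_commits(commits)
  commits.reject { |commit| commit.status == "accepted" }
end

SAMPLE_OUTPUT = <<~OUT
  9a5bfa42712ba2a5cc76b504966d05bfd848892c #29606203 delivered from Admin
  9f40f97dfa2200c4fdd94aa38d03d52d9123bb69 #29973257 delivered from Core
  ac7f751c58d16453b2c4b4c9005cd5f8936cdd18 #29306719 finished from Data
OUT

blocking = blocking_commits(parse_git_story(SAMPLE_OUTPUT))
if blocking.empty?
  puts "safe to release"
else
  puts "hold: stories #{blocking.map(&:story_id).join(', ')} are not accepted yet"
end
```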

For more details about this gem, please visit GitHub.


Darcy Laycock (@sutto) on API driven development

Last Saturday was RedDotRubyConf 2012 in Singapore. Darcy Laycock (@sutto) gave a very nice talk about API driven development and was kind enough to pay us a visit at Viki and talk a bit more about his experiences. Here are some notes from the conference talk and from the discussions with us.

Speed

  • The expected response time for an API should be lower than 100ms. If it takes longer than that, you should probably worry.
  • Caveat: search APIs can take longer; that is expected, so no need to worry.
  • See: this presentation by Instagram on how to work around the slow parts of an API
  • Darcy and his team have found 50ms to be the sweet spot

Documentation

  • The best API documentation is handwritten (i.e. not auto-generated from code). Be sure to present your documentation in an easily navigable format.
  • Consider including a console to explore the API. Consider using or following the example of Apigee.
  • Like Echonest, provide a test auth key that developers can use to test with, but with low limits.
  • The Stripe, Twitter and GitHub APIs are generally good examples to follow.

Internal protection

  • For protection of internal APIs from outsiders, you could use an API key approach.
  • However, Darcy’s team have found AWS firewalls to be a good alternative: they only allow internal apps to access their API endpoints.
  • This approach is simple and gives the same result as the key approach with less complexity.

API architecture design

  • One approach is to separate out APIs – e.g. have the Graph resource separate from the People resource, and combine those resources at another API service in front.
  • Benefits of this approach: it’s easier to understand and easier to test. Also, if you want to change the implementation of an API resource (for instance, switch the Graph resource to a graph database), you can do that and know the API will continue to work.
  • If speed is an issue here, use HTTP caching in front of each resource, because most of the time will be spent on IO-blocking operations.
  • Also use threading at the combination stage, by requesting multiple resources at the same time. This is beneficial even with Ruby’s Global Interpreter Lock. And use Fibers in Ruby 1.9. There may still be some logic problems, but this is worth the speed benefit from taking a concurrent approach.
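
The threading point can be illustrated with a minimal Ruby sketch. The `fetch_all` helper and the two lambdas are hypothetical stand-ins for HTTP calls to the separate resource services; `sleep` simulates IO wait, during which other Ruby threads are free to run.

```ruby
# Runs every fetcher in its own thread and collects the results by name.
# `fetchers` maps a resource name to any callable that performs the request.
def fetch_all(fetchers)
  threads = fetchers.map do |name, fetcher|
    Thread.new { [name, fetcher.call] }
  end
  # Thread#value waits for the thread to finish and returns its block's result.
  threads.map(&:value).to_h
end

# Stand-ins for calls to the Graph and People services (hypothetical).
fetchers = {
  graph:  -> { sleep 0.2; { friends: [2, 3] } },
  people: -> { sleep 0.2; { name: "Alice" } }
}

started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
combined = fetch_all(fetchers)
elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts combined.inspect
puts format("took %.2fs for two 0.2s requests", elapsed)
```

Because both requests wait at the same time, the total should be close to the slowest single request rather than the sum of the two.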

Design of APIs

  • On API design, Foursquare does something that Darcy has not seen in any other API: they offer compact and full representations of objects. ‘Compact’ is the bare minimum needed to show in a list, e.g. a list of users; a ‘full’ object is, say, the full profile of a user.
  • Each object returns a flag saying compact or full.
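
A small Ruby sketch of the compact/full idea follows. The field names and the `representation` flag are illustrative assumptions and do not reproduce Foursquare's actual schema.

```ruby
# Field names are illustrative only, not Foursquare's real schema.
User = Struct.new(:id, :name, :bio, :checkin_count) do
  # Bare minimum for list views, flagged so clients know what they received.
  def compact
    { representation: "compact", id: id, name: name }
  end

  # Everything in the compact form plus the detail fields.
  def full
    compact.merge(representation: "full", bio: bio, checkin_count: checkin_count)
  end
end

user = User.new(1, "Alice", "Coffee enthusiast", 42)
puts user.compact.inspect # suitable for a list of users
puts user.full.inspect    # suitable for a profile page
```

Building the full form on top of the compact one keeps the two in sync: a client that understands the compact fields can always read them from a full object too.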

Hypermedia APIs

  • ‘Hypermedia’ is the term people use to describe the REST of Roy Fielding’s original thesis. Rails doesn’t actually implement real REST.
  • A distinctive feature of a Hypermedia API is that it is self-describing: parts link to other parts from within the API responses, just like on the web.

Security

  • iOS-only apps can choose to not implement OAuth.
  • In general, Darcy recommends OAuth, and 2.0 over 1.0, because the standard will cover all use cases.
  • OAuth 2 has something called ‘bearer’ tokens, that can be used for 2-legged authentication. This approach skips the typical OAuth dance.

Caching

  • Use a Rack middleware in front of rack-cache to validate the access token.
  • Make the access token part of the cache key.
  • You may also use a Rack middleware to set an arbitrary HTTP “Vary” header, so that responses are cached based on other keys (e.g. the country of the request).
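
These caching notes can be sketched as two plain Rack-style middlewares in Ruby. Both class names, the bearer-token format, and the `X-Country` header are assumptions made for the example; the sketch relies on a cache that honours the Vary response header to keep separate entries per token and country, which is one way of making the token part of the cache key.

```ruby
# Sits in FRONT of the cache: rejects requests without a known bearer token,
# so cached responses are never served to unauthenticated callers.
class RequireToken
  def initialize(app, valid_tokens)
    @app = app
    @valid_tokens = valid_tokens
  end

  def call(env)
    token = env["HTTP_AUTHORIZATION"].to_s[/\ABearer (.+)\z/, 1]
    unless @valid_tokens.include?(token)
      return [401, { "Content-Type" => "text/plain" }, ["invalid token"]]
    end
    @app.call(env)
  end
end

# Sits BEHIND the cache: lists request headers in Vary so that a cache which
# honours Vary stores one variant per distinct value of those headers.
# (Rack 3 expects lowercase response header names; adjust "Vary" if needed.)
class VaryOn
  def initialize(app, header_names)
    @app = app
    @header_names = header_names
  end

  def call(env)
    status, headers, body = @app.call(env)
    vary = [headers["Vary"], *@header_names].compact.join(", ")
    [status, headers.merge("Vary" => vary), body]
  end
end

# In a real stack the cache middleware would sit between these two layers.
api   = ->(_env) { [200, { "Content-Type" => "text/plain" }, ["ok"]] }
stack = RequireToken.new(VaryOn.new(api, ["Authorization", "X-Country"]), ["secret"])

status, headers, _body = stack.call("HTTP_AUTHORIZATION" => "Bearer secret")
puts status
puts headers["Vary"]
```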

Thanks to our fantastic intern Cedric (@ejames_c) for helping out with taking notes.


Singapore Ruby Brigade September Meeting

This month we hosted the monthly Singapore Ruby Brigade meeting in the Viki office.

Presentations:

Todo for next meeting: projector, better food and maybe to start a little bit later.

Thank you everyone for attending.

Images on google+

Finally, multiple profiles in Chrome

One of my biggest problems with Chrome was the inability to use multiple profiles. Testing the site as users with different privileges means logging in and out all the time, which is very time-consuming.

Also, keeping private life separate from the office was an issue: different Gmail, different Pivotal Tracker login, different GitHub account, different tabs, different everything. Logging in and out all the time is not an option. Using Firefox for one and Chrome for the other is passable, but not perfect. What if you need more than two sets of logins? I am actively using four profiles. Chrome, Firefox, Iron, Safari? Give me a break.

Finally, Chrome has addressed the issue, and multiple profiles are in the dev builds. It works very well; it even remembers the tabs opened for each profile. Here is how you can activate it:

  1. Download developer version from http://www.google.com/chrome/intl/en/eula_dev.html
  2. Install it, then type in chrome://flags into the address bar. On the page, search for multiple profiles. Enable it.
  3. Restart your browser
  4. Type chrome://settings/personal into the address bar. Now you will see a ‘Users’ section. Click Add new User. It immediately opens a new browser window (not a tab) with the new user selected. Now you can see the profile marker in the top right corner of the window. Close this window, and you will see that the preferences screen now has two users: a default user and ‘User 1’. You can now edit the profile names and customize their icons.
  5. Restart Chrome again. Now the profile marker will be visible on the main window as well, and you can switch users at ease. Every time you switch a user, a new window will be opened, and all tabs of that user will be restored from the last session.

@akomba

Autoscaling Heroku Dynos

Cloud computing solutions have become very popular because they are easy to use, very scalable, and low in cost, and they offer high reliability and performance. Here at Viki, we rely on Heroku’s architecture to scale our application.

Paying for the perceived demand

While Heroku has gained credibility as one of the best Rails cloud hosting solutions, it has a few quirks in its business model that are not so friendly to its customers. For one, it is not really a true pay-per-usage solution.

Dynos and workers are set manually by customers according to what they “think” is enough for their app to support its user traffic. When we set our dynos manually, we pay a constant amount for them, based on our perceived demand of what the traffic might be.

In reality, the amount of traffic an app generates varies over time, with unpredictable sharp peaks and troughs. Most often, dynos are set at levels well above what the app actually requires. It is not possible to achieve true pay-per-usage, because we are paying for “perceived demand” and not for the number of dynos we actually use, and the perceived number is often more than required. A good business model for Heroku, but not very cost-efficient for its customers.

Implementing Autoscaling

There are a few solutions in the community that claim to autoscale dynos, like heroscale or the autoscale-heroku gem. However, none of them seems to work effectively at this point in time.

So we decided to create our own solution to autoscale Heroku dynos, so that we pay only for the dynos our app actually uses.

Here’s a technical summary of what we did; note that this works seamlessly for us:

We came up with a Ruby script that pulls capacity metrics at 3-minute intervals from New Relic, the system analytics tool we use on Heroku to track application performance. The busy_percent metric tells us what percentage of (dynos + workers) the application is actually using, based on traffic. Since this is the only useful metric we can get from New Relic, we had to adjust it so that it makes sense for the dynos alone.

By getting the current number of dynos and workers set in Heroku using
heroku dynos --app <your_app>,
we determined the projected number of dynos actually in use (used_dynos) by apportioning busy_percent according to the dynos-to-workers ratio. We then set the number of dynos (should_dynos) so that (used_dynos / should_dynos) = 80%. In other words, we aim for a modest 80% utilization of dynos, so that the script has enough buffer to react in case of a peak. So we try to pay for only a little more than what we use. Simple, right?
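
The arithmetic can be sketched in Ruby as follows. This is not the script itself, only one reading of the description: it assumes busy_percent applies evenly to dynos and workers, and the method signatures and the minimum of one dyno are choices made for the example.

```ruby
TARGET_UTILIZATION = 0.8 # the 80% utilization target from the text

# Assumption: busy_percent applies evenly to dynos and workers, so the dynos'
# share of the busy capacity is simply busy_percent of the current dyno count.
def used_dynos(busy_percent, dynos)
  busy_percent / 100.0 * dynos
end

# Smallest whole number of dynos that keeps utilization at or below the target.
# The floor of one dyno is an assumption so the app never scales to zero.
def should_dynos(used, target: TARGET_UTILIZATION, min: 1)
  [(used / target).ceil, min].max
end

current_dynos = 20
busy_percent  = 50.0

used = used_dynos(busy_percent, current_dynos)
puts "used_dynos:   #{used}"
puts "should_dynos: #{should_dynos(used)}"
```

With 20 dynos at 50% busy, about 10 dynos are in use, so roughly 13 dynos are needed to sit at or just under 80% utilization.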

The Pessimism Ratio

There were, however, a few issues we had to face when creating this script:

1) Because busy_percent is an average, it does not reflect the true maximum, since dyno usage fluctuates. When a surge comes, the maximum busy_percent better reflects the number of dynos required. We solved this by adding a “projected” 20% to the average to arrive at a perceived maximum.

2) If the proportion of workers to dynos gets large, the formula does not work as well because its sensitivity drops. For example, suppose we have 10 workers and 1 dyno, the dyno is fully utilized, and no workers are working. New Relic’s busy_percent will show only 9%, because it is 1/11. No matter how much that dyno struggles, the script will think everything is fine, even after apportioning. We added a “pessimism” ratio so that we can manually tune the sensitivity of the projected used_dynos according to the current number of workers.
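
Both adjustments can be illustrated with a short Ruby sketch. The exact formulas here are assumptions, since the text does not spell them out: the 20% is treated as a relative increase on busy_percent, and the pessimism ratio (0.0 to 1.0) is modelled as the fraction of the workers' share of busy capacity that is attributed to dynos instead.

```ruby
# Assumed interpretation, not the original script:
#   headroom  - relative increase applied to the average busy_percent
#   pessimism - 0.0 splits busy capacity by the dyno/worker ratio,
#               1.0 attributes all busy capacity to dynos
def projected_used_dynos(busy_percent, dynos, workers, headroom: 0.20, pessimism: 0.0)
  total        = dynos + workers # assumed to be greater than zero
  peak_percent = [busy_percent * (1 + headroom), 100.0].min
  busy_units   = peak_percent / 100.0 * total
  dyno_share   = dynos.to_f / total
  weight       = dyno_share + pessimism * (1 - dyno_share)
  busy_units * weight
end

# The example from the text: 1 busy dyno, 10 idle workers, busy_percent = 1/11.
busy_percent = 100.0 / 11

optimistic  = projected_used_dynos(busy_percent, 1, 10, pessimism: 0.0)
pessimistic = projected_used_dynos(busy_percent, 1, 10, pessimism: 1.0)

puts format("pessimism 0.0 -> %.2f dynos in use", optimistic)
puts format("pessimism 1.0 -> %.2f dynos in use", pessimistic)
```

With no pessimism the single struggling dyno barely registers, while full pessimism attributes all of the busy capacity to dynos and pushes the projection above the one dyno currently running, which would trigger a scale-up.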

The results of this simple script have been very rewarding. Our peak period is between 3am and 4am, when the application uses as many as 36 dynos. Because of this, we used to run 36 dynos 24/7. Now the app averages 20 dynos, sometimes dropping as low as 13. Our bill dropped by about $2000 per month; that is a saving of about 24 grand per year. Not bad for 2 days of work.

The Code

Here it is. Run it with a 3 minute cron and it should work like a charm if you are using New Relic as well! Drop us a line if you find any problems with it.

http://github.com/viki-org/heroku-autoscale/blob/master/autoscale