Wednesday, July 20, 2011

Accessing the data center from the cloud with OpenVPN

This post was inspired by a recent exercise I went through at the prompting of my colleague Dan Mesh. The goal was to have Amazon EC2 instances connect securely to servers at a data center using OpenVPN.

In this scenario, we have a server within the data center running OpenVPN in server mode. The server has a publicly accessible IP (via a firewall NAT) with UDP port 1194 exposed. Cloud instances running OpenVPN in client mode connect to this server, get a route to an internal data center network pushed to them, and are then able to access servers on that internal network over a VPN tunnel.

Here are some concrete details about the network topology that I'm going to discuss.

Server A at the data center has an internal IP address of 10.10.10.10 and is part of the internal network 10.10.10.0/24. There is a NAT on the firewall mapping external IP X.Y.Z.W to the internal IP of server A. There is also a rule that allows UDP traffic on port 1194 to X.Y.Z.W.

I have an EC2 instance from which I want to reach server B on the internal data center network, with IP 10.10.10.20.

Install and configure OpenVPN on server A

Since server A is running Ubuntu (10.04 to be exact), I used this very good guide, with an important exception: I didn't want to configure the server in bridging mode; I preferred the simpler tunneling mode. In bridging mode, the internal network that server A is part of (10.10.10.0/24 in my case) is directly exposed to OpenVPN clients. In tunneling mode, there is a tunnel created between clients and server A on a separate, dedicated network. I preferred the tunneling option because it doesn't require any modifications to the network setup of server A (no bridging interface required), and because it provides better security for my requirements (I can target individual servers on the internal network and configure them to be accessed via VPN). YMMV of course.

For the initial installation and key creation for OpenVPN, I followed the guide. When it came to configuring the OpenVPN server, I created these entries in /etc/openvpn/server.conf:

server 172.16.0.0 255.255.255.0
push "route 10.10.10.0 255.255.255.0"
tls-auth ta.key 0 

The first directive specifies that the OpenVPN tunnel will be established on a new 172.16.0.0/24 network. The server will get the IP 172.16.0.1, while OpenVPN clients that connect to the server will get 172.16.0.6 etc.

The second directive pushes a static route to the internal data center network 10.10.10.0/24 to all connected OpenVPN clients. This way each client will know how to get to machines on that internal network, without the need to create static routes manually on the client.

The tls-auth entry provides extra security to help prevent DoS attacks and UDP port flooding.
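If the key-creation step didn't already produce a ta.key file, it can be generated on server A with OpenVPN's built-in static key generator (a quick sketch; run it from /etc/openvpn or adjust the path in server.conf accordingly):

# cd /etc/openvpn
# openvpn --genkey --secret ta.key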

Note that I didn't have to include any bridging-related scripts or other information in server.conf.
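For reference, here is a minimal sketch of what the full server.conf might look like with these entries in place. It is based on the sample server.conf shipped with OpenVPN; the certificate, key and dh file names are assumptions and should match whatever your key-creation step produced:

port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key
dh dh1024.pem
server 172.16.0.0 255.255.255.0
push "route 10.10.10.0 255.255.255.0"
tls-auth ta.key 0
keepalive 10 120
persist-key
persist-tun
status openvpn-status.log
verb 3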

At this point, if you start the OpenVPN service on server A via 'service openvpn start', you should see an extra tun0 network interface when you run ifconfig. Something like this:


tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:172.16.0.1  P-t-P:172.16.0.2  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:168 (168.0 B)  TX bytes:168 (168.0 B)

Also, the routing information will now include the 172.16.0.0 network:

# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.0.2      0.0.0.0         255.255.255.255 UH        0 0          0 tun0
172.16.0.0      172.16.0.2      255.255.255.0   UG        0 0          0 tun0
...etc

Install and configure OpenVPN on clients

Here again I followed the Ubuntu OpenVPN guide. The steps are very simple:

1) apt-get install openvpn

2) scp the following files (which were created on the server during the OpenVPN server install process above) from server A to the client, into the /etc/openvpn directory: 

ca.crt
ta.key
client_hostname.crt 
client_hostname.key


3) Customize client.conf:

# cp /usr/share/doc/openvpn/examples/sample-config-files/client.conf /etc/openvpn

Edit client.conf and specify:

remote X.Y.Z.W 1194     (where X.Y.Z.W is the external IP of server A)

cert client_hostname.crt
key client_hostname.key
tls-auth ta.key 1
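Putting it all together, a minimal client.conf might look roughly like this (a sketch based on the sample client.conf; the certificate and key file names should match the files copied over in step 2):

client
dev tun
proto udp
remote X.Y.Z.W 1194
resolv-retry infinite
nobind
persist-key
persist-tun
ca ca.crt
cert client_hostname.crt
key client_hostname.key
tls-auth ta.key 1
verb 3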

Now if you start the OpenVPN service on the client via 'service openvpn start', you should see a tun0 interface when you run ifconfig:


tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:172.16.0.6  P-t-P:172.16.0.5  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:168 (168.0 B)  TX bytes:168 (168.0 B)

You should also see routing information related to both the tunneling network 172.16.0.0/24 and to the internal data center network 10.10.10.0/24 (whose route was pushed from the server):

# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.0.5      0.0.0.0         255.255.255.255 UH        0 0          0 tun0
172.16.0.1      172.16.0.5      255.255.255.255 UGH       0 0          0 tun0
10.10.10.0      172.16.0.5      255.255.255.0   UG        0 0          0 tun0
....etc

At this point, the client and server A should be able to ping each other on their 172.16 IP addresses. From the client you should be able to ping server A's IP 172.16.0.1, and from server A you should be able to ping the client's IP 172.16.0.6.

Create static route to tunneling network on server B and enable IP forwarding on server A

Remember that the goal was for the client to access server B on the internal data center network, with IP address 10.10.10.20. For this to happen, I needed to add a static route on server B to the tunneling network 172.16.0.0/24, with server A's IP 10.10.10.10 as the gateway:

# route add -net 172.16.0.0/24 gw 10.10.10.10
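Note that a route added this way does not survive a reboot of server B. Assuming server B is also a Debian/Ubuntu box, one way to make it persistent is an 'up' line in the appropriate interface stanza of /etc/network/interfaces (a sketch; adjust the interface name to your setup):

up route add -net 172.16.0.0/24 gw 10.10.10.10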

The final piece of the puzzle is to allow server A to act as a router at this point, by enabling IP forwarding (which is disabled by default). So on server A I did:

# sysctl -w net.ipv4.ip_forward=1
# echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf

At this point, I was able to access server B from the client by using server B's 10.10.10.20 IP address.

We've just started to experiment with this setup, so I'm not yet sure if it's production ready. I wanted to jot down these things though because they weren't necessarily obvious, despite some decent blog posts and OpenVPN documentation. Hopefully they'll help somebody else out there too.



Thursday, June 30, 2011

A strategy for handling DNS in EC2 with Route 53

In my previous post I showed how to use the boto library to manage Route 53 DNS zones. Here I will show a strategy for handling DNS within an EC2 infrastructure using Route 53.

Let's assume you have a registered domain name called mycompanycloud.com. You want all your EC2 instances to use that domain name to communicate with each other. Assume you launch a database instance that you want to refer to as db01.mycompanycloud.com. What you do is you add a CNAME record in the DNS zone for mycompanycloud.com and point it to the external AWS name assigned to that instance. For example:
# route53 add_record ZONEID db01.mycompanycloud.com CNAME ec2-51-10-11-89.compute-1.amazonaws.com 3600

The advantage of this method is that DNS queries for db01.mycompanycloud.com from within EC2 will eventually resolve the CNAME to the internal IP address of the instance, while DNS queries from outside EC2 will resolve it to the external IP address -- which is in general exactly what you want.
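If you launch instances programmatically, the CNAME can also be created at launch time with boto, using the same Route 53 API I covered in the previous post. Here's a minimal sketch; the zone ID and the instance's public DNS name are placeholders:

from boto.route53.connection import Route53Connection
from boto.route53.record import ResourceRecordSets

conn = Route53Connection()   # credentials come from ~/.boto

zone_id = 'ZONEID'   # the mycompanycloud.com zone ID, as shown by 'route53 ls'
public_dns = 'ec2-51-10-11-89.compute-1.amazonaws.com'   # from the launched instance

# create db01.mycompanycloud.com as a CNAME pointing to the instance's public DNS name
changes = ResourceRecordSets(conn, zone_id)
change = changes.add_change("CREATE", 'db01.mycompanycloud.com.', "CNAME", 3600)
change.add_value(public_dns)
changes.commit()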

There's one more caveat: if you need the default DNS and search domain in /etc/resolv.conf to be mycompanycloud.com, you need to configure the DHCP client to use that domain, by adding this line to /etc/dhcp3/dhclient.conf:

supersede domain-name "mycompanycloud.com ec2.internal compute-1.internal" ;

Then edit/overwrite /etc/resolv.conf and specify:

nameserver 172.16.0.23
domain mycompanycloud.com
search mycompanycloud.com ec2.internal compute-1.internal

The line in dhclient.conf will ensure that your custom resolv.conf file will be preserved across reboots -- which is not usually the case in EC2 with the default DHCP behavior (thanks to Gerald Chao for pointing out this solution to me).

Of course, you should have all this in the Chef or Puppet recipes you use when you build out a new instance.

I've been applying this strategy for a while and it works out really well; it also means I don't have to run and maintain my own BIND servers in EC2.


Monday, June 20, 2011

Managing Amazon Route 53 DNS with boto

Here's a quick post that shows how to manage Amazon Route 53 DNS zones and records using the ever-useful boto library from Mitch Garnaat. Route 53 is a typical pay-as-you-go inexpensive AWS service which you can use to host your DNS zones. I wanted to play with it a bit, and some Google searches revealed two good blog posts: "Boto and Amazon Route53" by Chris Moyer and "Using boto to manage Route 53" by Rob Ballou. I want to thank those two guys for blogging about Route 53, their posts were a great help to me in figuring things out.

Install boto

My machine is running Ubuntu 10.04 with Python 2.6. I ran 'easy_install boto', which installed boto-2.0rc1. This also installs several utilities in /usr/local/bin; the one of interest for this article is /usr/local/bin/route53, which provides an easy command-line-oriented way of interacting with Route 53.

Create boto configuration file

I created ~/.boto containing the Credentials section with the AWS access key and secret key:
# cat ~/.boto
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY


Interact with Route 53 via the route53 utility

If you just run 'route53', the command will print the help text for its usage. For our purpose, we'll make sure there are no errors when we run:

# route53 ls

If you don't have any DNS zones already created, this will return nothing.

Create a new DNS zone with route53

We'll create a zone called 'mytestzone.com':

# route53 create mytestzone.com
Pending, please add the following Name Servers:
 ns-674.awsdns-20.net
 ns-1285.awsdns-32.org
 ns-1986.awsdns-56.co.uk
 ns-3.awsdns-00.com

Note that you will have to properly register 'mytestzone.com' with a registrar, then point the name server settings at that registrar to the name servers returned when the Route 53 zone was created (in our case the 4 name servers above).

At this point, if you run 'route53 ls' again, you should see your newly created zone. You need to make note of the zone ID:

root@m2:~# route53 ls
================================================================================
| ID:   MYZONEID
| Name: mytestzone.com.
| Ref:  my-ref-number
================================================================================
{}

You can also get the existing records from a given zone by running the 'route53 get' command which also takes the zone ID as an argument:

# route53 get MYZONEID
Name                                   Type  TTL                  Value(s)
mytestzone.com.                        NS    172800               ns-674.awsdns-20.net.,ns-1285.awsdns-32.org.,ns-1986.awsdns-56.co.uk.,ns-3.awsdns-00.com.
mytestzone.com.                        SOA   900                  ns-674.awsdns-20.net. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

Adding and deleting DNS records using route53

Let's add an A record to the zone we just created. The route53 utility provides an 'add_record' command which takes the zone ID as an argument, followed by the name, type, value and TTL of the new record, and an optional comment. The TTL is also optional, and defaults to 600 seconds if not specified. Here's how to add an A record with a TTL of 3600 seconds:

# route53 add_record MYZONEID test.mytestzone.com A SOME_IP_ADDRESS 3600
{u'ChangeResourceRecordSetsResponse': {u'ChangeInfo': {u'Status': u'PENDING', u'SubmittedAt': u'2011-06-20T23:01:23.851Z', u'Id': u'/change/CJ2GH5O38HYKP0'}}}

Now if you run 'route53 get MYZONEID' you should see your newly added record.

To delete a record, use the 'route53 del_record' command, which takes the same arguments as add_record. Here's how to delete the record we just added:

# route53 del_record MYZONEID test.mytestzone.com. A SOME_IP_ADDRESS
{u'ChangeResourceRecordSetsResponse': {u'ChangeInfo': {u'Status': u'PENDING', u'SubmittedAt': u'2011-06-21T01:14:35.343Z', u'Id': u'/change/C2B0EHROD8HEG8'}}}

Managing Route 53 programmatically with boto

As useful as the route53 command-line utility is, sometimes you need to interact with the Route 53 service from within your program. Since this post is about boto, I'll show some Python code that uses the Route 53 functionality.

Here's how you open a connection to the Route 53 service:

from boto.route53.connection import Route53Connection
conn = Route53Connection()

(this assumes you have the AWS credentials in the ~/.boto configuration file)

Here's how you retrieve and walk through all your Route 53 DNS zones, selecting a zone by name:

ROUTE53_ZONE_NAME = "mytestzone.com."

zones = {}
conn = Route53Connection()

results = conn.get_all_hosted_zones()
zones = results['ListHostedZonesResponse']['HostedZones']
found = 0
for zone in zones:
    print zone
    if zone['Name'] == ROUTE53_ZONE_NAME:
        found = 1
        break
if not found:
    print "No Route53 zone found for %s" % ROUTE53_ZONE_NAME

(note that you need the ending period in the zone name that you're looking for, as in "mytestzone.com.")

Here's how you add a CNAME record with a TTL of 60 seconds to an existing zone (assuming the 'zone' variable contains the zone you're looking for). You need to operate on the zone ID, which is the identifier following the text '/hostedzone/' in the 'Id' field of the variable 'zone'.

from boto.route53.record import ResourceRecordSets
zone_id = zone['Id'].replace('/hostedzone/', '')
changes = ResourceRecordSets(conn, zone_id)
change = changes.add_change("CREATE", 'test2.%s' % ROUTE53_ZONE_NAME, "CNAME", 60)
change.add_value("some_other_name")
changes.commit()

To delete a record, you use the exact same code as above, but with "DELETE" instead of "CREATE".
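For example, here's a sketch of deleting the CNAME record created above -- note that for a DELETE the name, type, TTL and value all have to match the existing record:

changes = ResourceRecordSets(conn, zone_id)
change = changes.add_change("DELETE", 'test2.%s' % ROUTE53_ZONE_NAME, "CNAME", 60)
change.add_value("some_other_name")
changes.commit()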

I leave other uses of the 'route53' utility and of the boto Route 53 API as an exercise to the reader.

Wednesday, June 01, 2011

Technical books that influenced my career

Here's a list of 25 technical books that had a strong influence on my career, presented in a somewhat chronological order of my encounters with them:

  1. "The Art of Computer Programming", esp. vol. 3 "Sorting and Searching" - Donald Knuth
  2. "Operating Systems" - William Stallings
  3. "Introduction to Algorithms" - Thomas Cormen et al.
  4. "The C Programming Language" - Brian Kernighan and Dennis Ritchie
  5. "Programming Windows" - Charles Petzold
  6. "Writing Solid Code" - Steve Maguire
  7. "The Practice of Programming" - Brian Kernighan and Rob Pike
  8. "Computer Networks - a Systems Approach" - Larry Peterson and Bruce Davie
  9. "TCP/IP Illustrated" - W. Richard Stevens
  10. "Distributed Systems - Concepts And Design" - George Coulouris et al.
  11. "DNS and BIND" - Cricket Liu and Paul Albitz
  12. "UNIX and Linux System Administration Handbook" - Evi Nemeth et al.
  13. "The Mythical Man-Month" - Fred Brooks
  14. "Programming Perl" - Larry Wall et al.
  15. "Counter Hack Reloaded: a Step-by-Step Guide to Computer Attacks and Effective Defenses" - Edward Skoudis and Tom Liston
  16. "Programming Python" - Mark Lutz
  17. "Lessons Learned in Software Testing" - Cem Kaner, James Bach, Bret Pettichord
  18. "Refactoring - Improving the Design of Existing Code" - Martin Fowler
  19. "The Pragmatic Programmer" - Andrew Hunt and David Thomas
  20. "Becoming a Technical Leader" - Gerald Weinberg
  21. "Extreme Programming Explained" - Kent Beck
  22. "Programming Amazon Web Services" - James Murty
  23. "Building Scalable Web Sites" - Cal Henderson
  24. "RESTful Web Services" - Leonard Richardson, Sam Ruby
  25. "The Art of Capacity Planning" - John Allspaw
What is your list?

Tuesday, May 24, 2011

Setting up RAID 0 across ephemeral drives on EC2 instances (and surviving reboots!)

I've been experimenting with setting up RAID 0 across ephemeral drives on EC2 instances. The initial setup, be it with mdadm and lvm, or directly with lvm, is not that hard -- what has proven challenging is surviving reboots. Unless you perform certain tricks, your EC2 instance will be blissfully unaware of its new setup after a reboot. What's more, if you try to mount the new striped volume at boot time by adding it to /etc/fstab, chances are you won't even be able to ssh into the instance anymore. It happened to me many times while experimenting, hence this blog post.

Update: I realize I didn't go into details about the use case of this type of setup. This is useful if you don't want to incur EBS performance and reliability penalties, and yet you have a data set that is larger than the 400 GB offered by an individual ephemeral drive. Of course, if your instance dies, so do the ephemeral drives (after all they are named like this for a reason...) -- so make sure you have a good backup/disaster recovery strategy for the data you store there!

In the following, I will assume you want to set up RAID 0 across the four ephemeral drives that come with an EC2 m1.xlarge instance, and which are exposed as devices /dev/sdb through /dev/sde. By default, /dev/sdb is mounted as /mnt, while the other drives aren't mounted. 

I also assume you want to create 1 volume group encompassing the RAID 0 array, and within that volume group you want to create 2 logical volumes with associated XFS file systems, and also 1 logical volume for swap.

Step 1 - unmount /dev/sdb

# umount /dev/sdb

(also comment out the entry corresponding to /dev/sdb in /etc/fstab)

Step 2 - install lvm2 and mdadm

For an unattended install of these packages (slightly complicated by the fact that mdadm also needs postfix), I do:

# DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm lvm2

Step 3 - manually load the dm-mod module

# modprobe dm-mod

(this seems to be a bug in devmapper in Ubuntu)

If you want to set up RAID 0 via lvm directly, you can skip steps 4 and 5 (a sketch of that lvm-only variant follows step 7 below). From what I've read, you get better performance if you do the RAID 0 setup with mdadm. Also, if you need any other RAID level, you need to use mdadm.

Step 4 - configure RAID 0 array via mdadm

# mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

Verify:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon May 23 22:35:20 2011
     Raid Level : raid0
     Array Size : 1761463296 (1679.86 GiB 1803.74 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon May 23 22:35:20 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 256K

           UUID : 03f63ee3:607fb777:f9441841:42247c4d (local to host adb08lvm)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       3       8       64        3      active sync   /dev/sde

Step 5 - increase the read-ahead on the RAID device for better performance

# blockdev --setra 65536 /dev/md0

Step 6 - create physical volume from the RAID 0 array

# pvcreate /dev/md0

(if you didn't want to use mdadm, you would call pvcreate against each of the /dev/sdb through /dev/sde devices)

Step 7 - create volume group called vg0 spanning the RAID 0 array

# vgcreate vg0 /dev/md0

(if you didn't want to use mdadm, you would run vgcreate and specify the 4 devices /dev/sdb through /dev/sde)

Verify:

# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "vg0" using metadata type lvm2

# pvscan
  PV /dev/md0   VG vg0   lvm2 [1.64 TiB / 679.86 GiB free]
  Total: 1 [1.64 TiB] / in use: 1 [1.64 TiB] / in no VG: 0 [0   ]
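As an aside, here's roughly what the lvm-only variant (no mdadm) of steps 6 and 7 would look like. The striping then happens at lvcreate time, where -i is the number of stripes and -I the stripe size in KB -- this is just a sketch:

# pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
# vgcreate vg0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# lvcreate --name data1 --size 500G -i 4 -I 256 vg0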

Step 8 - create 3 logical volumes within the vg0 volume group

Each local drive is 400 GB, so the total size for the volume group is 1.6 TB. I'll create 2 logical volumes at 500 GB each, and a 10 GB logical volume for swap.

# lvcreate --name data1 --size 500G vg0
# lvcreate --name data2 --size 500G vg0
# lvcreate --name swap --size 10G vg0

Verify:

# lvscan
  ACTIVE            '/dev/vg0/data1' [500.00 GiB] inherit
  ACTIVE            '/dev/vg0/data2' [500.00 GiB] inherit
  ACTIVE            '/dev/vg0/swap' [10.00 GiB] inherit

Step 9 - create XFS file systems and mount them

We'll create XFS file systems for the data1 and data2 logical volumes. The names of the devices used for mkfs are the ones displayed via the lvscan command above. Then we'll mount the 2 file systems as /data1 and /data2.

# mkfs.xfs /dev/vg0/data1
# mkfs.xfs /dev/vg0/data2
# mkdir /data1
# mkdir /data2
# mount -t xfs -o noatime /dev/vg0/data1 /data1
# mount -t xfs -o noatime /dev/vg0/data2 /data2

Step 10 - create and enable swap partition

# mkswap /dev/vg0/swap
# swapon /dev/vg0/swap

At this point, you should have a fully functional setup. The slight problem is that if you add the newly created file systems to /etc/fstab and reboot, you may not be able to ssh back into your instance -- at least that's what happened to me. I was able to ping the IP of the instance, but ssh would fail.

I finally redid the whole thing on a new instance (I created the RAID 0 directly with lvm, bypassing the mdadm step), but didn't add the file systems to /etc/fstab. After rebooting and running lvscan, I noticed that the logical volumes I had created were all marked as 'inactive':

# lvscan
  inactive            '/dev/vg0/data1' [500.00 GiB] inherit
  inactive            '/dev/vg0/data2' [500.00 GiB] inherit
  inactive            '/dev/vg0/swap' [10.00 GiB] inherit

This was after I ran 'modprobe dm-mod' manually, otherwise the lvscan command would complain:

  /proc/misc: No entry for device-mapper found
  Is device-mapper driver missing from kernel?
  Failure to communicate with kernel device-mapper driver.

A Google search revealed this thread which offered a solution: run 'lvchange -ay' against each logical volume so that the volume becomes active. Only after doing this was I able to see the logical volumes and mount them.

So I added these lines to /etc/rc.local:

/sbin/modprobe dm-mod
/sbin/lvscan
/sbin/lvchange -ay /dev/vg0/data1
/sbin/lvchange -ay /dev/vg0/data2
/sbin/lvchange -ay /dev/vg0/swap
/bin/mount -t xfs -o noatime /dev/vg0/data1  /data1
/bin/mount -t xfs -o noatime /dev/vg0/data2  /data2
/sbin/swapon /dev/vg0/swap

After a reboot, everything was working as expected. Note that I am doing the mounting of the file systems and the enabling of the swap within the rc.local script, and not via /etc/fstab. If you try to do it in fstab, it is too early in the boot sequence, so the logical volumes will be inactive and the mount will fail, with the dire consequence that you won't be able to ssh back into your instance (at least in my case).

This was still not enough when creating the RAID 0 array with mdadm. When I used mdadm, even when adding the lines above to /etc/rc.local, the /dev/md0 device was not there after the reboot, so the mount would still fail. The thread I mentioned above does discuss this case at some point, and I also found a Server Fault thread on this topic. The solution in my case was to modify the mdadm configuration file /etc/mdadm/mdadm.conf and:

a) change the DEVICE variable to point to my 4 devices:

DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde

b) add an ARRAY variable containing the UUID of /dev/md0 (which you can get via 'mdadm --detail /dev/md0'):

ARRAY /dev/md0 level=raid0 num-devices=4 UUID=03f63ee3:607fb777:f9441841:42247c4d
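Alternatively, you can let mdadm generate an equivalent ARRAY line and append it to the config file (review the file afterwards):

# mdadm --detail --scan >> /etc/mdadm/mdadm.conf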

This change, together with the custom lines in /etc/rc.local, finally enabled me to have a functional RAID 0 array and functional file systems and swap across the ephemeral drives in my EC2 instance.

I hope this will be useful to somebody out there and will avoid some head-against-the-wall moments that I had to go through....

Monday, May 09, 2011

Managing infrastructures in the cloud, with lessons learned the hard way

Here is a collection of blog posts I wrote over the last 3 years or so. Some of them are practical step-by-step tutorials on using various tools for managing cloud instances, while others talk about lessons learned the hard way, by deploying large-scale infrastructures in the cloud. I am aggregating them here for ease of future reference:

Lessons learned
Working with EC2-specific tools
Load balancing (ELB and HAProxy)

Friday, May 06, 2011

Upgrading the GD library in Ubuntu

We needed to use ImageFlow for some internal testing of image manipulations (esp. reflections). With a stock php5/libgd2 install in Ubuntu 10.04, some calls to the ImageFlow library would fail with:

"GD library is too old. Version 2.0.1 or later is required, and 2.0.28 is strongly recommended."

The libraries installed by Ubuntu were:
$ dpkg -l | grep libgd2
rc  libgd2-noxpm                               2.0.36~rc1~dfsg-3ubuntu1.9.04.1         GD Graphics Library version 2 (without XPM s
ii  libgd2-xpm                                 2.0.36~rc1~dfsg-3ubuntu1.9.04.1         GD Graphics Library version 2
$ dpkg -l | grep php5-gd
ii  php5-gd                                    5.2.6.dfsg.1-3ubuntu4.6                 GD module for php5
The issue here is that Ubuntu does not use the version of GD which is bundled with PHP. See this discussion for more details.

So...some googling around later, I stumbled on this great howtoforge post by patusovniak on "Recompiling PHP5 with bundled support for GD in Ubuntu". It also serves as a good overview of building Ubuntu packages from source. The only observation I have is that after I ran the step
dpkg-buildpackage -rfakeroot

I had to install all .deb packages in /usr/src. So I did
cd /usr/src
dpkg -i *.deb

When running phpinfo(), the GD section now looks like this:
gd

GD Support enabled
GD Version bundled (2.0.34 compatible)
FreeType Support enabled
FreeType Linkage with freetype
FreeType Version 2.3.11
T1Lib Support enabled
GIF Read Support enabled
GIF Create Support enabled
JPEG Support enabled
libJPEG Version 6b
PNG Support enabled
libPNG Version 1.2.42
WBMP Support enabled
XPM Support enabled
XBM Support enabled

Hopefully this will be useful to someone out there trying to desperately use a newer version of GD with PHP in Ubuntu...

Wednesday, April 27, 2011

Lessons learned from deploying a production database in EC2

In light of the Big EC2 Outage of 2011, I thought I'd offer some perspective on my experiences in deploying and maintaining a production database in EC2. I find it amusing to read some blog posts (esp. the one from SmugMug) where people brag about how they never went down during the EC2 outage, while saying in the same breath that their database infrastructure was hosted somewhere else...duh!

I will give a short history of why we (Evite) ended up hosting our database infrastructure in EC2. After all, this is not how people start using the cloud, since it's much easier to deploy web or app servers in the cloud. I will also highlight some lessons learned along the way.

In the summer of 2009 we decided to follow the example of FriendFeed and store our data in an almost schema-less fashion, but still use MySQL. At the time, NoSQL products such as Cassandra, Riak, etc were still very much untested at large scale, so we thought we'd use something we're familiar with. We designed our database layer from the get go to be horizontally scalable by sharding at the application layer. We store our data in what we call 'buckets', which are MySQL tables with an ID/key and a blob of JSON data corresponding to that ID, plus a few other date/time-related columns for storing the creation and update timestamps for the JSON blob. We started with 1,024 such buckets spread across 8 MySQL instances, so 128 buckets per instance. The number 8 seemed like a good compromise between capacity and cost, and we also did some initial load testing against one server to confirm this number.

We initially rolled out the database infrastructure on 8 Dell PE2970s, each with 16 GB of RAM and 2 quad-core CPUs. Each server ran 2 MySQL instances, for a total of 16, out of which 8 were active at any time, and the other 8 were passive -- each of the active MySQL instances was in a master-master pair with a passive instance running on a different server. This was done so that if any server went down, we still had 8 active MySQL instances in the mix. We had HAProxy load balancing across each pair of active/passive instances, sending all traffic to the active one, unless it went down, at which point traffic would be sent automatically to the passive one (I blogged about this setup and its caveats here).

As for the version of MySQL, we ran 5.1.37, which was pretty new at the time.

At this point, we did a lot of load testing using BrowserMob, which allowed us to exercise our application in an end-to-end fashion, in the same way a regular user would. All load tests pointed to the fact that we had indeed sufficient firepower at our disposal for the DB layer.

Two important things to note here:

1) We ran the load test against empty databases;
2) We couldn't do a proper 'dark launching' for a variety of reasons, the main one being that the 'legacy' code we were replacing was in a state where nobody dared to touch it -- so we couldn't send existing production traffic to our new DB infrastructure;

We deployed this infrastructure in production in May/June 2009, and it performed well for a few months. At some point, in late September 2009, and with our highest traffic of the year expected to start before Halloween, we started to see a performance degradation. The plain vanilla version of MySQL we used didn't seem to exercise the CPU cores uniformly, and CPU wait time was also increasing.

I should also point out here that our application is *very* write-intensive, so the fact that we had 2 MySQL instances per server, both in a master-master setup with another 2 instances running on a different server, started to tax the CPU and RAM resources of each server more and more. In particular, because each server had only 16 GB RAM, the innodb_buffer_pool_size (set initially at 4 GB for each of the 2 MySQL instances) was becoming insufficient, due also to the constant increase in our database size. It also turned out we were updating the JSON blobs too frequently and in some cases unnecessarily, thus causing even more I/O.

At this point, we had a choice of either expanding our hardware at the data center, or shooting for 'infinite' horizontal scale by deploying in EC2. We didn't want to wait 2-3 weeks for the former to happen, so we decided to go into the cloud. We also took the opportunity to do the following:

  • we replaced vanilla MySQL with the Percona XtraDB distribution, which includes a multitude of patches that improve the performance of MySQL especially on multi-core servers
  • we engaged Percona consultants to audit our MySQL setup and recommend improvements, especially in the I/O area
  • we 'flattened' our MySQL server farm by deploying 16 MySQL masters (each an m1.xlarge in EC2) backed by 16 MySQL slaves (each an m1.large in EC2); we moved away from master-master to a simpler master-slave, because the complexities and the potential subtle issues of the master-master setup were not worth the hassle (in short, we have seen at least one case where the active master was overloaded, so it stopped responding to the HAProxy health checks; this caused HAProxy to fail over to the passive master, which wasn't fully caught up replication-wise with the active one; this caused us a lot of grief)
  • we eliminated the unnecessary JSON blob updates, which tremendously reduced the writes to our database
Both moving to the Percona distribution and engaging the Percona experts turned out to be really beneficial to us. Here are just some of the recommendations from Percona that we applied on the master DBs:
  • increase innodb_buffer_pool_size to 8 GB
  • store different MySQL data file types on different EBS volumes; we set aside one EBS volume for each of these types of files (a sketch of the corresponding my.cnf entries follows this list):
    • data files
    • innodb transaction logs
    • binlogs
    • temporary files (we actually have 3 EBS volumes for 3 temp directories that we specify in my.cnf)
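Here's a rough sketch of the corresponding my.cnf entries; the /mnt/mysql/* mount points are hypothetical placeholders for the EBS volume mounts, and tmpdir accepts a colon-separated list of directories:

[mysqld]
innodb_buffer_pool_size   = 8G
datadir                   = /mnt/mysql/data
innodb_log_group_home_dir = /mnt/mysql/innodb-logs
log_bin                   = /mnt/mysql/binlogs/mysql-bin
tmpdir                    = /mnt/mysql/tmp1:/mnt/mysql/tmp2:/mnt/mysql/tmp3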
These recommendations, plus the fact that we were only running one MySQL instance per server, plus the reduction of unnecessary blob updates, gave us a #winning formula at the time. We were able to sustain what is for us the highest traffic of the year, the week after Thanksgiving. But....not all was rosy. On the Tuesday of that week we lost one DB master due to the fact that the EBS volume corresponding to the MySQL data directory went AWOL, causing the CPU to get pegged at 100% I/O wait. We had to fail over to the slave, and rebuild another master from scratch. Same thing happened again that Thursday. We thought it was an unfortunate coincidence at the time, but knowing what we know now, I believe we melted those EBS volumes with our writes. Apologies to the other EC2 customers sharing those volumes with us...

Ever since the move to EC2 we've been relatively happy with the setup, with the exception of fairly frequent EBS issues. The main symptom of such an EBS issue is I/O wait pegged at > 90% for that specific server, which triggers elevated errors across our application server pool. The usual MO for us is to give that server 15-30 minutes to recover, then if it doesn't, to go ahead and fail over to the slave.

One good thing is that we got to be *really* good at this failover procedure. My colleague Marco Garcia and I can do the following real quick even if you wake us up at 1 AM (like the EC2 outage did last week):
  • fail over the DB master1 to a slave (call it slave1)
  • launch another m1.xlarge EC2 instance to act as a new master (call it master2); this instance is automatically set up via Chef
  • take an xtrabackup of slave1 to another EBS volume (I described this in more detail here)
  • take a snapshot of the EBS volume, then create another volume out of the snapshot, in the zone of the master2
  • restore the xtrabackup files from the new EBS volume into master2
  • configure master2 as a slave to slave1, let replication catch up
  • at an appropriate time, switch the application from slave1 to master2
  • configure slave1 back as a slave to master2
When I say 'real quick', I have to qualify it -- we have to wait quite a bit for the xtrabackup to happen, then for the backup files to be transferred over to the new master, either via EBS snapshot or via scp. That's where most of the time goes in this disaster recovery procedure -- think 'hours'.

You may question our use of EBS volumes. Because we wanted to split the various MySQL file types across multiple disks, and because we wanted to make sure we have enough disk capacity, we couldn't just use ephemeral disks. Note that we did also try to stripe multiple EBS volumes into a RAID 0 array, especially for the MySQL datadir, but we didn't notice a marked performance improvement, while the overall reliability of the array was still tied to the least performing of the volumes in the stripe. Not #winning.

We've been quite patient with this setup, even with the almost constant need to babysit or account for flaky EBS volumes, until the EC2 outage of last week. We thought we were protected against massive EC2 failures because each MySQL master had its slave in a different EC2 availability zone -- however our mistake was that all of the zones were within the same region, US East.

During the first few hours of the outage, all of our masters and slaves in zone us-east-1a got frozen. The first symptom was that all existing queries within MySQL would not complete and would just hang there, since they couldn't write to disk. Then things got worse and we couldn't even connect to MySQL. So we failed over all masters to their corresponding slaves. This was fine until mid-day on the 1st day of the outage, when we had another master fail, this time in zone us-east-1b. To compound the issue, that master happened to have the slave in us-east-1a, so we were hosed at that point.

It was time for plan B, which was to launch a replacement master in another region (we chose US West) and yet another server in another cloud (we chose Rackspace), then to load the database from backups. We take full mysqldump backups of all our databases every 8 hours, and incrementals (which in our case is the data from the last 24 hours) every hour. We save those to S3 and to Rackspace CloudFiles. So at least there we were well equipped to do a restore. We also had the advantage of having deployed a slave in Rackspace via LittleChef, so we had all that setup (we couldn't use our regular Chef server setup in EC2 at the time). However, while we were busy recovering that server, we got lucky and the server that misbehaved in us-east-1b came back online, so we were able to put it back into the production pool. We did take a maintenance window while this was happening for around 2 hours, but that was the only downtime we had during the whole EC2 outage. Not bad when everything is said and done.

One important thing to note is that even though we were up and running, we had largely lost our redundancy -- we either lost masters, so we failed over to slaves, or we lost slaves. In each case, we had only 1 MySQL server to rely on, which didn't give us a warm and fuzzy feeling. So we spent most of last week rebuilding our redundancy. BTW, this is something that I haven't seen emphasized enough in the blog posts about the EC2 outage. Many people bragged about how they never went down, but they never mentioned the fact that they needed to spend a long time rebuilding their redundancy. This is probably because they never did, instead banking on Amazon to recover the affected zone.

At this point, we are shooting for moving our database server pool back in the data center, this time on much beefier hardware. We are hoping to consolidate the 16 masters that we currently have on 4 Dell C2100s maxed out with CPUs, RAM and disks, with 4 slaves that we will deploy at a different data center. The proper sizing of the new DB pool is to be determined though at this point. We plan on starting with one Dell C2100 which will replace one of the existing masters, then start consolidating more masters, all while production traffic is hitting it. Another type of dark launching if you will -- because there's nothing like production!

I still think going into EC2 wasn't such a bad idea, because it allowed us to observe our data access patterns and how they affect MySQL. The fact that we were horizontally scalable from day one gave us the advantage of being able to launch new instances and add capacity that way if needed. At this point, we could choose to double our database server count in EC2, but this means double the headaches in babysitting those EBS volumes....so we decided against it. We are able though to take everything we learned in EC2 during the past 6 months and easily deploy anywhere else.

OK, so now a recap with some of the lessons learned:
  • Do dark launches whenever possible -- I said it before, and the above story says it again, it's very hard to replicate production traffic at scale. Even with a lot of load testing, you won't uncover issues that will become apparent in production. This is partly due to the fact that many issues arise after a certain time, or after a certain volume (database size, etc) is reached, and load testing generally doesn't cover those situations.
  • It's hard to scale a database -- everybody knows that. If we were to design our data layer today, we would probably look at one of the more mature NoSQL solutions out there (although that is still a fairly risky endeavor in my mind). Our sharded MySQL solution (which we use like I said in an almost NoSQL fashion) is OK, but comes with a whole slew of issues of its own, not the least being that maintenance and disaster recovery are not trivial.
  • If you use a write-intensive MySQL database, use real hardware -- virtualization doesn't cut it, and EBS especially so. And related to this:
  • Engage technology experts early -- looking back, we should have engaged Percona much earlier in the game, and we should have asked for their help in properly sizing our initial DB cluster
  • Failover can be easy, but rebuilding redundancy at the database layer is always hard -- I'd like to see more discussion on this issue, but this has been my conclusion based on our experiences. And related to this:
  • Automated deployments and configuration management can be quick and easy, but restoring the data is a time sink -- it's relatively easy to launch a new instance or set up a new server with Chef/LittleChef/Puppet/etc. It's what happens afterwards that takes a long time, namely restoring the data in order to bring that server into the production pool. Here I am talking mostly about database servers. It's much easier if you only have web/app servers to deal with that have little or no state of their own (looking at you SmugMug). This being said, you need to have an automated deployment/config mgmt strategy if you use the cloud, otherwise you're doing it wrong.
  • Rehearse your disaster recovery procedures -- we were forced to do it due to the frequent failures we had in EC2. This turned out to be an advantage for us during the Big Outage.
  • Don't blame 'the cloud' for your outages -- this has already been rehashed to death by all the post-mortem blogs after the EC2 outage, but it does bear repeating. If you use 'the cloud', expect that each and every instance can go down at any moment, no matter who your cloud provider is. Architect your infrastructure accordingly.
  • If you do use the cloud, use more than one -- I think that multi-cloud architectures will become the norm, especially after the EC2 outage.
  • It's not in production if it is not monitored and graphed -- this is a no-brainer, but it's surprising how often this rule is breached in practice. The first thing we do after building a new server is put it in Nagios, Ganglia and Munin.


Friday, April 15, 2011

Installing and configuring Graphite

Here are some notes I jotted down while installing and configuring Graphite, which isn't a trivial task, although the official documentation isn't too bad. The next step is to turn them into a Chef recipe. These instructions apply to Ubuntu 10.04 32-bit with Python 2.6.5 so YMMV.

Install pre-requisites

# apt-get install python-setuptools
# apt-get install python-memcache python-sqlite
# apt-get install apache2 libapache2-mod-python pkg-config
# easy_install-2.6 django

Install pixman, cairo and pycairo

# wget http://cairographics.org/releases/pixman-0.20.2.tar.gz
# tar xvfz pixman-0.20.2.tar.gz
# cd pixman-0.20.2
# ./configure; make; make install

# wget http://cairographics.org/releases/cairo-1.10.2.tar.gz
# tar xvfz cairo-1.10.2.tar.gz
# cd cairo-1.10.2
# ./configure; make; make install

BTW, the pycairo install was the funkiest I've seen so far for a Python package, and that says a lot:

# wget http://cairographics.org/releases/py2cairo-1.8.10.tar.gz
# tar xvfz py2cairo-1.8.10.tar.gz
# cd pycairo-1.8.10
# ./configure --prefix=/usr
# make; make install
# echo '/usr/local/lib' > /etc/ld.so.conf.d/pycairo.conf
# ldconfig

Install graphite packages (carbon, whisper, graphite webapp)

# wget http://launchpad.net/graphite/0.9/0.9.8/+download/graphite-web-0.9.8.tar.gz
# wget http://launchpad.net/graphite/0.9/0.9.8/+download/carbon-0.9.8.tar.gz
# wget http://launchpad.net/graphite/0.9/0.9.8/+download/whisper-0.9.8.tar.gz

# tar xvfz whisper-0.9.8.tar.gz
# cd whisper-0.9.8
# python setup.py install

# tar xvfz carbon-0.9.8.tar.gz
# cd carbon-0.9.8
# python setup.py install
# cd /opt/graphite/conf
# cp carbon.conf.example carbon.conf
# cp storage-schemas.conf.example storage-schemas.conf

# tar xvfz graphite-web-0.9.8.tar.gz
# cd graphite-web-0.9.8
# python check-dependencies.py
# python setup.py install

Configure Apache virtual host for graphite webapp

Although the Graphite source distribution comes with an example vhost configuration for Apache, it didn't quite work for me. Here's what ended up working -- many thanks to my colleague Marco Garcia for figuring this out.
# cd /etc/apache2/sites-available/
# cat graphite

<VirtualHost *:80>
ServerName graphite.mysite.com
DocumentRoot "/opt/graphite/webapp"
ErrorLog /opt/graphite/storage/log/webapp/error.log
CustomLog /opt/graphite/storage/log/webapp/access.log common
<Location "/">
SetHandler python-program
PythonPath "['/opt/graphite/webapp'] + sys.path"
PythonHandler django.core.handlers.modpython
SetEnv DJANGO_SETTINGS_MODULE graphite.settings
PythonDebug Off
PythonAutoReload Off
</Location>
<Location "/content/">
SetHandler None
</Location>
<Location "/media/">
SetHandler None
</Location>
Alias /media/ "/usr/local/lib/python2.6/dist-packages/Django-1.3-py2.6.egg/django/contrib/admin/media/"
</VirtualHost>

# cd /etc/apache2/sites-enabled/
# ln -s ../sites-available/graphite 001-graphite

Make sure mod_python is enabled:

# ls -la /etc/apache2/mods-enabled/python.load

Create Django database for graphite webapp

# cd /opt/graphite/webapp/graphite
# python manage.py syncdb

Apply permissions on storage directory

# chown -R www-data:www-data /opt/graphite/storage/

Restart Apache

# service apache2 restart

Start data collection server (carbon-cache)

# cd /opt/graphite/bin
# ./carbon-cache.py start

At this point, if you go to graphite.mysite.com, you should see the dashboard of the Graphite web app.

Test data collection

The Graphite source distribution comes with an example client written in Python that sends data to the Carbon collecting server every minute. You can find it in graphite-web-0.9.8/examples/example-client.py.

Sending data is very easy -- like we say in Devops, just open a socket!


import sys
import time
import os
import platform
import subprocess
from socket import socket

CARBON_SERVER = '127.0.0.1'
CARBON_PORT = 2003
delay = 60
if len(sys.argv) > 1:
    delay = int(sys.argv[1])

def get_loadavg():
    # For more details, "man proc" and "man uptime"
    if platform.system() == "Linux":
        return open('/proc/loadavg').read().strip().split()[:3]
    else:
        command = "uptime"
        process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
        os.waitpid(process.pid, 0)
        output = process.stdout.read().replace(',', ' ').strip().split()
        length = len(output)
        return output[length - 3:length]

sock = socket()
try:
    sock.connect((CARBON_SERVER, CARBON_PORT))
except:
    print "Couldn't connect to %(server)s on port %(port)d" % {'server': CARBON_SERVER, 'port': CARBON_PORT}
    sys.exit(1)

while True:
    now = int(time.time())
    lines = []
    # We're gonna report all three loadavg values
    loadavg = get_loadavg()
    lines.append("system.loadavg_1min %s %d" % (loadavg[0], now))
    lines.append("system.loadavg_5min %s %d" % (loadavg[1], now))
    lines.append("system.loadavg_15min %s %d" % (loadavg[2], now))

    message = '\n'.join(lines) + '\n'  # all lines must end in a newline
    print "sending message\n"
    print '-' * 80
    print message
    print
    sock.sendall(message)
    time.sleep(delay)


Some observations about the above code snippet:
  • the format of a message to be sent to a Graphite/Carbon server is very simple: "metric_path value timestamp\n"
  • metric_path is a completely arbitrary name -- it is a string containing substrings delimited by dots. Think of it as an SNMP OID, where the most general name is at the left and the most specific is at the right
    • in the example above, the 3 metric_path strings are system.loadavg_1min, system.loadavg_5min and system.loadavg_15min
Establish retention policies

This is explained very well in the 'Getting your data into Graphite' portion of the docs. What you want to do is specify a retention configuration for each set of metrics that you send to Graphite. This is accomplished by editing the /opt/graphite/conf/storage-schemas.conf file (the one we copied from the example file earlier). For the example above, which sends the load average for 1, 5 and 15 min to Graphite every minute, we can specify the following retention policy:

[loadavg]
priority = 100
pattern = ^system\.loadavg*
retentions = 60:43200,900:350400

This tells Graphite that all metric_paths starting with system.loadavg should be stored with a retention policy that keeps per-minute (60-second) precision data for 30 days (43,200 one-minute data points), and per-15-minute (900-second) precision data for 10 years (350,400 such data points).

Go wild with stats!

At this point, if you run the example client, you should be able to go to the Graphite dashboard and expand the Graphite->system path and see the 3 metrics being captured: loadavg_1min, loadavg_5min and loadavg_15min. Clicking on each one will populate the graph with the corresponding data line. If you're logged in into the dashboard, you can also save a given graph.

The sky is the limit at this point in terms of the data you can capture and visualize with Graphite. As an example, I parse a common maillog file that captures all email sent out through our system. I 'tail' the file every minute and count how many messages were sent out in total, and per mail server in our mail cluster. I send this data to Graphite and watch it in near-realtime (the retention policy in my case is similar to the loadavg one above).
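As a rough illustration, here's a sketch of that kind of script, run from cron every minute. The log path, regular expression and metric names are made up for the example, and it uses the same plain-text Carbon protocol on port 2003 as the client above:

import re
import subprocess
import time
from socket import socket

CARBON_SERVER = '127.0.0.1'   # or wherever carbon-cache runs
CARBON_PORT = 2003
MAILLOG = '/var/log/maillog'  # hypothetical log path

# grab roughly the last minute's worth of log lines; a real script would
# remember its offset in the file instead of using a fixed line count
output = subprocess.Popen(['tail', '-n', '5000', MAILLOG],
                          stdout=subprocess.PIPE).communicate()[0]

now = int(time.time())
total = 0
per_server = {}
for line in output.splitlines():
    # count lines that look like a successful delivery; the regex is purely
    # illustrative and depends on your MTA's log format
    m = re.search(r'relay=(\S+?)[\[,].*status=sent', line)
    if m:
        total += 1
        server = m.group(1)
        per_server[server] = per_server.get(server, 0) + 1

lines = ['mail.sent.total %d %d' % (total, now)]
for server, count in per_server.items():
    lines.append('mail.sent.%s %d %d' % (server.replace('.', '_'), count, now))

sock = socket()
sock.connect((CARBON_SERVER, CARBON_PORT))
sock.sendall('\n'.join(lines) + '\n')
sock.close()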

Here's what the Graphite graph looks like:





In another blog post I'll talk about Etsy's statsd and its Python equivalent pystatsd, to which my colleague Josh Frederick contributed the server-side code.


Monday, March 28, 2011

Working in the multi-cloud with libcloud

I just posted my slides on "Working on the multi-cloud with libcloud" to Slideshare. It's a talk I gave at the SoCal Piggies meeting in February 2011.

Friday, March 25, 2011

ABM - "Always Be Monitoring"

What prompted this post was an incident we had in the very early hours of this past Tuesday, when we started to see a lot of packet loss, increased latency and timeouts between some of our servers hosted at a data center on the US East Coast, and some instances we have running in EC2, also in the US East region. The symptoms were increased error rates in some application calls that we were making from one back-end server cluster at the data center into another back-end cluster in EC2. These errors weren't affecting our customers too much, because all failed requests were posted to various queues and reprocessed.

There had also been network maintenance done that night within the data center, so we weren't sure initially whether it was our outbound connectivity into EC2 or general inbound connectivity into EC2 that was the culprit. What was also strange (and unexpected) was that several EC2 availability zones seemed to be affected -- mostly us-east-1d, but we were also seeing increased latency and timeouts into 1b and 1a. That made it hard to decide whether the issue was with EC2 or with us.

Running traceroutes from different source machines (some being our home machines in California, another one being a Rackspace cloud server instance in Chicago) revealed that packet loss and increased latency occurred almost all the time at the same hop: a router within the Level 3 network upstream from the Amazon EC2 network. What was frustrating too was that the AWS Status dashboard showed everything absolutely green. Now you can argue that this wasn't necessarily an EC2 issue, but if I were Amazon I would like to monitor the major inbound network paths into my infrastructure -- especially when it has the potential to affect several availability zones at once.

This whole issue lasted approximately 3.5 hours, then it miraculously stopped. Somebody must have fixed a defective router. Twitter reports from other people experiencing the exact same issue revealed that the issue was seen as fixed for them at the very minute that it was fixed for us too.

This incident brought home a valuable point for me though: we needed more monitors than we had available. We were monitoring connectivity 1) within the data center, 2) within EC2, and 3) between our data center and EC2. However, we also needed to monitor 4) inbound connectivity into EC2 from sources outside of our data center infrastructure. Only by triangulating (for lack of a better term) our monitoring in this manner would we be sure which network path was to blame. Note that we already had Pingdom set up to monitor various URLs within our site, but like I said, the front-end stuff wasn't affected too much by that particular issue that night.

So...the next day we started up a small Rackspace cloud server in Chicago, and a small Linode VPS in Fremont, California, and we added them to our Nagios installation. We run the same exact checks from these servers into EC2 that we run from our data center into EC2. This makes network issues faster to troubleshoot, although unfortunately not easier to solve -- because we could be depending on a 3rd party to solve them.

I guess a bigger point to make, other than ABM/Always Be Monitoring, is OYA/Own Your Availability (I didn't come up with this, I personally first saw it mentioned by the @fastip guys). To me, what this means is to deploy your infrastructure across multiple providers (data centers/clouds) so that you don't have a single point of failure at the provider level. This is obviously easier said than done....but we're working on it as far as our infrastructure goes.

Wednesday, March 16, 2011

What I like and don't like to see in a technical presentation

What I like to see:

  • Live demo of the technology/tool/process you are describing (or at least a screencast)
  • Lessons learned -- the most interesting ones are the failures
  • If you're presenting something you created:
    • compare and contrast it with existing solutions 
    • convince me you're not suffering from the NIH syndrome
    • convince me your creation was born out of necessity, ideally from issues you needed to solve in production
  • Hard data (charts/dashboards)
  • Balance between being too shallow and going too deep when covering your topic
    • keep in mind both the HOW and the WHY of the topic
  • Going above and beyond the information I can obtain with a simple Google search for that topic
  • Pointers to any tools/resources you reference (GitHub pages preferred)

What I don't like to see:

  • Cute slides with images and only a couple of words (unless you provide generous slide notes in some form)
  • Humor is fine, but not if it's all there is
  • Hand-waving / chest-pounding
  • Vaporware
  • No knowledge of existing solutions with established communities
    • you're telling me you're smarter than everybody else in the room but you're not backing up that assertion
  • Simple usage examples that I can also get via Google searches
  • Abandoning the WHY for the HOW
  • Abandoning the HOW for the WHY

Monday, March 14, 2011

Deployment and hosting open space at PyCon

One of the most interesting events for me this year at PyCon was an Open Space session organized by Nate Aune on deployment, hosting and configuration management. The session was very well attended, and it included representatives of a large range of companies. Here are some of them, if memory serves well: Disqus, NASA, Opscode, DjangoZoom, Eucalyptus, ep.io, Gondor, Whiskey Media ... and many more that I wish I could remember (if you were there and want to add anything, please leave a comment here).

Here are some things my tired brain remembers from the discussions we had:

  • everybody seems to be using virtualenv when deploying their Python applications
  • everybody seems to be using Fabric in one way or another to push changes to remote nodes
  • the participants seemed to be split almost equally between Puppet and Chef for provisioning
  • the more disciplined of the companies (ep.io for example) use Puppet/Chef both for provisioning and application deployment and configuration (ep.io still uses Fabric for stopping/starting services on remote nodes for example)
  • other companies (including us at Evite) use Chef/Puppet for automated provisioning of the OS + pre-requisite packages, then use Fabric to push the deployment of the application because they prefer the synchronous aspect of a push approach
  • upgrading database schemas is hard; many people only do additive changes (NoSQL makes this easier, and as far as relational databases go, PostgreSQL makes it easier than MySQL)
  • many people struggle with how best to bundle their application with other types of files, such as haproxy or nginx configurations
    • at Evite we face the same issue, and we came up with the notion of a bundle, a directory structure that contains the virtualenv of the application, the configuration files for the application, and all the other configuration files for programs that interact with our application -- haproxy, nginx, supervisord for example
    • when we do a deploy, we check out a bundle via a revision tag, then we push the bundle to a given app server (see the rough Fabric sketch after this list)
  • some people prefer to take the OS package approach here, and bundle all the above types of files in an rpm or deb package
  • Noah Kantrowitz has released 2 Chef-related Python tools that I was not aware of: PyChef (a Python client that knows how to query a Chef server) and commis (a Python implementation of a Chef server, with the goal of being less complicated to install than its Ruby counterpart)
  • LittleChef was mentioned as a way to run Chef Solo on a remote node via fabric, thus giving you the control of a 'push' method combined with the advantage of using community cookbooks already published for Chef
  • I had to leave towards the end of the meeting, when people started to discuss the hosting aspect, so I don't have a lot to add here -- but it is interesting to me to see quite a few companies that have Platform-as-a-Service (PaaS) offerings for Python hosting: DjangoZoom, ep.io, Gondor (ep.io can host any WSGI application, while DjangoZoom and Gondor are focused on Django)
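
To make the bundle/push idea above more concrete, here is a rough Fabric sketch of what such a deploy task could look like. This is an illustration only, not our actual fabfile: the host names, release tag, paths and supervisord program name are all made up.

from fabric.api import env, run, put, cd, local

# hypothetical app servers and deploy user
env.hosts = ['app1.example.com', 'app2.example.com']
env.user = 'deploy'

BUNDLE_TAG = 'release-1.2.3'        # hypothetical revision tag
REMOTE_BUNDLE_DIR = '/opt/bundles'  # hypothetical target directory

def deploy():
    """Check out a bundle by tag, push it to each app server, restart the app."""
    # export the tagged bundle contents (app code + config files in this simplified sketch)
    local('git archive --format=tar --prefix=bundle-%s/ %s | gzip > /tmp/bundle.tar.gz'
          % (BUNDLE_TAG, BUNDLE_TAG))
    put('/tmp/bundle.tar.gz', '/tmp/bundle.tar.gz')
    with cd(REMOTE_BUNDLE_DIR):
        run('tar xzf /tmp/bundle.tar.gz')
        run('supervisorctl restart myapp')  # hypothetical supervisord program name

Running 'fab deploy' would then push the tagged bundle to each app server in turn.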

All in all, there were some very interesting discussions that showed that pretty much everybody is struggling with similar issues. There is no silver bullet, but there are some tools and approaches that can help make your life easier in this area. My impression is that the field of automated deployments and configuration management, even though changing fast, is also maturing fast, with a handful of tools dominating the space. It's an exciting space to play in!

Tuesday, March 08, 2011

Monitoring is for ops what testing is for dev

Devops. It's the new buzzword. Go to any tech conference these days and you're sure to find an expert panel on the 'what' and 'why' of devops. These panels tend to be light on the 'how', because that's where the rubber meets the road. I tried to give a step-by-step description of how you can become a Ninja Rockstar Internet Samurai devops in my blog post on 'How to whip your infrastructure into shape'.

Here I just want to say that I am struck by the parallels that exist between the activities of developer testing and operations monitoring. It's not a new idea by any means, but it's been growing on me recently.

Test-infected vs. monitoring-infected

Good developers are test-infected. It doesn't matter too much whether they write tests before or after writing their code -- what matters is that they do write those tests as soon as possible, and that they don't consider their code 'done' until it has a comprehensive suite of tests. And of course test-infected developers are addicted to watching those dots in the output of their favorite test runner.

Good ops engineers are monitoring-infected. They don't consider their infrastructure build-out 'done' until it has a comprehensive suite of monitoring checks, notifications and alerting rules, and also one or more dashboard-type systems that help them visualize the status of the resources in the infrastructure.

Adding tests vs. adding monitoring checks

Whenever a bug is found, a good developer will add a unit test for it. It serves as a proof that the bug is now fixed, and also as a regression test for that bug.
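
In code, that habit looks something like this -- a toy example with a made-up function and bug, just to illustrate the shape of a regression test:

import unittest

def parse_port(value):
    """Hypothetical helper that used to crash when value was None."""
    if value is None:
        return 8000  # fall back to a documented default instead of crashing
    return int(value)

class ParsePortRegressionTest(unittest.TestCase):
    # added when the None bug was found; it now doubles as a regression test
    def test_none_returns_default(self):
        self.assertEqual(parse_port(None), 8000)

    def test_string_port(self):
        self.assertEqual(parse_port('8080'), 8080)

if __name__ == '__main__':
    unittest.main()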

Whenever something unexpectedly breaks within the systems infrastructure, a good ops engineer will add a monitoring check for it, and if possible a graph showing metrics related to the resource that broke. This ensures that alerts will go out in a timely manner next time things break, and that correlations can be made by looking at the metrics graphs for the various resources involved.
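
The monitoring equivalent is a small Nagios-style check. Here is a minimal sketch of one -- the path and thresholds are hypothetical, and a real check would take them as command line options:

#!/usr/bin/env python
# Minimal Nagios-style plugin: one line of output plus an exit code
# (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
import os
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk(path='/var/spool/myapp', warn=80, crit=90):
    try:
        st = os.statvfs(path)
    except OSError as e:
        print('DISK UNKNOWN - %s' % e)
        return UNKNOWN
    used_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks
    msg = '%s is %.1f%% full | used_pct=%.1f%%;%d;%d' % (path, used_pct, used_pct, warn, crit)
    if used_pct >= crit:
        print('DISK CRITICAL - ' + msg)
        return CRITICAL
    if used_pct >= warn:
        print('DISK WARNING - ' + msg)
        return WARNING
    print('DISK OK - ' + msg)
    return OK

if __name__ == '__main__':
    sys.exit(check_disk())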

Ignoring broken tests vs. ignoring monitoring alerts

When a test starts failing, you can either fix it so that the bar goes green, or you can ignore it. Similarly, if a monitoring alert goes off, you can either fix the underlying issue, or you can ignore it by telling yourself it's not really critical.

The problem with ignoring broken tests and monitoring alerts is that this attitude leads slowly but surely to the Broken Window Syndrome. You train yourself to ignore issues that sooner or later will become critical (it's a matter of when, not if).

A good developer will make sure there are no broken tests in their Continuous Integration system, and a good ops engineer will make sure all alerts are accounted for and the underlying issues fixed.

Improving test coverage vs. improving monitoring coverage

Although 100% test coverage is not sufficient for your code to be bug-free, having something around 80-90% code coverage is still a good sign that you as a developer are disciplined about writing those tests. This makes you sleep better at night and gives you pride in producing quality code.

For ops engineers, sleeping better at night is directly proportional to the quantity and quality of the monitors they have in place for their infrastructure. The more monitors, the better the chances that issues are caught early and fixed before they escalate into the dreaded 2 AM pager alert.

Measure and graph everything

The more dashboards you have as a devops, the better insight you have into how your infrastructure behaves, from both a code and an operational point of view. I am inspired in this area by the work that's done at Etsy, where they are graphing every interesting metric they can think of (see their 'Measure Anything, Measure Everything' blog post).

As a developer, you want to see your code coverage graphs showing decent values, close to that mythical 100%. As an ops engineer, you want to see uptime graphs that are close to the mythical 5 9's.

But maybe even more importantly, you want insight into metrics that tie directly into your business. At Evite, processing messages and sending email reliably is our bread and butter, so we track those processes closely and we have dashboards for metrics related to them. Spikes, either up or down, are investigated quickly.

Here are some examples of the dashboards we have. For now these use homegrown data collection tools and the Google Visualization API, but we're looking into using Graphite soon.

  • Outgoing email messages in the last hour (spiking at close to 100 messages/second)
  • Size of various queues we use to process messages (using a homegrown queuing mechanism)
  • Percentage of errors across some of our servers

Associated with these metrics, we have Nagios alerts that fire when certain thresholds are crossed. This combination allows our devops team to sleep better at night.
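
And since I mentioned Graphite above: feeding it data is pleasantly simple, because its carbon daemon accepts a plaintext protocol of 'metric.path value timestamp' lines, by default on TCP port 2003. Here is a quick sketch; the host and the metric name/value are made up:

#!/usr/bin/env python
# Push a single metric to Graphite's carbon daemon over the plaintext protocol.
import socket
import time

CARBON_HOST = 'graphite.example.com'  # hypothetical
CARBON_PORT = 2003                    # carbon's default plaintext port

def send_metric(path, value, timestamp=None):
    if timestamp is None:
        timestamp = int(time.time())
    line = '%s %s %d\n' % (path, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
    try:
        sock.sendall(line.encode('ascii'))
    finally:
        sock.close()

if __name__ == '__main__':
    # e.g. number of outgoing email messages sent in the last minute (made-up name/value)
    send_metric('email.outgoing_per_minute', 5400)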

Saturday, February 26, 2011

AWS CloudFormation is a provisioning and not a config mgmt tool

There's a lot of buzz on Twitter on how the recently announced AWS CloudFormation service spells the death of configuration management tools such as Puppet/Chef/cfengine/bcfg2. I happen to think that the opposite is true.

CloudFormation is a great way to provision what it calls a 'stack' in your EC2 infrastructure. A stack comprises several AWS resources such as EC2 instances, EBS volumes, Elastic Load Balancers, Elastic IPs, RDS databases, etc. Note that it was always possible to do this with your own homegrown tools, by calling the various APIs offered by these services/resources in concert. What CloudFormation brings to the table is an easy way to describe the relationships between these resources via a JSON file which they call a template.

Some people get tripped up by the inclusion in the CloudFormation sample templates of applications such as WordPress, Joomla or Redmine -- they think that CloudFormation deals with application deployments and configuration management. If you look closely at one of these sample templates, let's say the Joomla one, you'll see that what happens is simply that a pre-baked AMI containing the Joomla installation is used when launching the EC2 instances included in the CloudFormation stack. Also, the UserData mechanism is used to pass certain values to the instance. They do add a nice feature here: you can reference attributes defined in other parts of the stack template, such as the DB endpoint address in this example:

"UserData": {
          "Fn::Base64": {
            "Fn::Join": [
              ":",
              [
                {
                  "Ref": "JoomlaDBName"
                },
                {
                  "Ref": "JoomlaDBUser"
                },
                {
                  "Ref": "JoomlaDBPwd"
                },
                {
                  "Ref": "JoomlaDBPort"
                },
                {
                  "Fn::GetAtt": [
                    "JoomlaDB",
                    "Endpoint.Address"
                  ]
                },
                {
                  "Ref": "WebServerPort"
                },
                {
                  "Fn::GetAtt": [
                    "ElasticLoadBalancer",
                    "DNSName"
                  ]
                }
              ]
            ]
          }
        },

However, all this was also possible before CloudFormation. You were always able to bake your own AMI containing your own application, and use the UserData mechanism to run whatever you want at instance creation time. Nothing new here. This is NOT configuration management. This will NOT replace the need for a solid deployment and configuration management tool. Why? Because rolling your own AMI results in an opaque 'black box' deployment. You need to document and version your pre-baked AMIs carefully, then develop a mechanism for associating an AMI ID with a list of packages installed on that AMI. If you think about it, you actually end up writing an asset management tool. Then if you need to deploy a new version of the application, you either bake a new AMI (painful), or you reach for a real deployment/config mgmt tool to do it.

The alternative, which I espouse, is to start with a bare-bones AMI (I use the official Ubuntu AMIs provided by Canonical) and employ the UserData mechanism to bootstrap the installation of a configuration management client such as chef-client or the Puppet client. The newly created instance then 'phones home' to your central configuration management server (Chef server or Puppetmaster for example) and finds out how to configure itself. The beauty of this approach is that the config mgmt server keeps track of the customizations made on the client. No need for you to document that separately -- just use the search functions provided by the config mgmt tool to find out which packages and applications have been installed on the client.
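
Here is a sketch of that bootstrapping idea using libcloud (the same library I use elsewhere on this blog), with Puppet as the example client. The AMI ID, keypair name, credentials and puppet master hostname are placeholders, and the exact package name and client flags vary by Puppet/Chef version:

#!/usr/bin/env python
# Sketch: launch a bare-bones Ubuntu AMI and bootstrap a config mgmt client
# via user data. AMI ID, credentials, keypair and puppet master are placeholders.
from libcloud.types import Provider
from libcloud.providers import get_driver
from libcloud.base import NodeImage, NodeSize

BOOTSTRAP = """#!/bin/bash
apt-get update
apt-get install -y puppet
# phone home to the config mgmt server (flags vary by Puppet version)
puppetd --server puppetmaster.example.com --onetime --no-daemonize --verbose
"""

EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver('MyAccessID', 'MySecretKey')

image = NodeImage(id='ami-88f504e1', name='', driver='')  # bare-bones Ubuntu AMI
size = NodeSize(id='m1.small', name='', ram=None, disk=None,
                bandwidth=None, price=None, driver='')

node = conn.create_node(name='web1', image=image, size=size,
                        ex_keyname='mykey', ex_userdata=BOOTSTRAP)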

The bare-bones AMI + config mgmt approach does result in EC2 instances taking longer to get fully configured initially (as opposed to the pre-baked AMI technique), but the flexibility and control you gain over those instances is well worth it.

One other argument, which I almost don't need to make, is that the pre-baked AMI technique is very specific to EC2. You will have to reinvent the wheel if you want to deploy your infrastructure to a different cloud provider, or inside your private cloud or data center.

So.....do continue to hone your skills with a good configuration management tool and learn how to fully utilize it. It will serve you well, both in EC2 and in other environments.

Tuesday, February 22, 2011

Cheesecake project now on GitHub

I received a feature request for the Cheesecake project last week (thanks Joost Cassee!), so as an experiment I also put the code up on GitHub. Hopefully the 'social coding' aspect will kick in and more people will be interested in the project. One can dream.

HAProxy monitoring with Nagios and Munin

HAProxy is one of the most widely used (if not THE most widely used) software load balancing solution out there. I definitely recommend it if you're looking for a very solid and very fast piece of software for your load balancing needs. I blogged about it before, but here I want to describe ways to monitor it with Nagios (for alerting purposes) and Munin (for resource graphing purposes).

HAProxy Nagios plugin

Near the top of Google searches for 'haproxy nagios plugin' is this message to the haproxy mailing list from Jean-Christophe Toussaint, which contains links to a Nagios plugin he wrote for checking HAProxy. This plugin is what I ended up using. It's a Perl script which needs the Nagios::Plugin CPAN module installed. Once you've done that, drop check_haproxy.pl into your Nagios libexec directory, then configure it to check the HAProxy stats with a command line similar to this:

/usr/local/nagios/libexec/check_haproxy.pl -u 'http://your.haproxy.server.ip:8000/haproxy;csv' -U hauser -P hapasswd

This assumes that you have HAProxy configured to output its statistics on port 8000. I have these lines in /etc/haproxy/haproxy.cfg:
# status page.
listen stats 0.0.0.0:8000
    mode http
    stats enable
    stats uri /haproxy
    stats realm HAProxy
    stats auth hauser:hapasswd

Note that the Nagios plugin actually requests the stats in CSV format. The output of the plugin is something like:

HAPROXY OK -  cluster1 (Active: 60/60) cluster2 (Active: 169/169) | t=0.131051s;2;10;0; sess_cluster1=0sessions;;;0;20000 sess_cluster2=78sessions;;;0;20000

It shows the active clusters in your HAProxy configuration (e.g. cluster2), together with the number of backends that are UP out of the total number of backends for that cluster (e.g. 169/169), and the number of active sessions for each cluster. If any backend is DOWN, the check returns a CRITICAL status and you'll get a Nagios alert.

HAProxy Munin plugins

Another Google search, this time for HAProxy and Munin, reveals another message to the haproxy mailing list with links to 4 Munin plugins written by Bart van der Schans:

- haproxy_check_duration: monitors the duration of the health checks per server
- haproxy_errors: monitors the rate of 5xx response headers per backend
- haproxy_sessions: monitors the rate of (tcp) sessions per backend
- haproxy_volume: monitors the bps in and out per backend

I downloaded the plugins, dropped them into /usr/share/munin/plugins, symlink-ed them into /etc/munin/plugins, and added this stanza to /etc/munin/plugin-conf.d/munin-node:

[haproxy*]
user haproxy
env.socket /var/lib/haproxy/stats.socket

However, note that for the plugins to work properly you need 2 things:

1) Configure HAProxy to use a socket that can be queried for stats. I did this by adding these lines to the global section in my haproxy.cfg file:

chroot /var/lib/haproxy
user haproxy
group haproxy
stats socket /var/lib/haproxy/stats.socket uid 1002 gid 1002

(where in my case 1002 is the uid of the haproxy user, and 1002 the gid of the haproxy group)

After doing 'service haproxy reload', you can check that the socket stats work as expected by doing something like this (assuming you have socat installed):

echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats.socket stdio

This should output the HAProxy stats in CSV format. (A bit further down there is also a quick Python sketch for querying this socket programmatically.)

2) Edit the 4 plugins and change the 'exit 1' statement to 'exit 0' at the top of each plugin:

if ( $ARGV[0] eq "autoconf" ) {
    print_autoconf();
    exit 0;
} elsif ( $ARGV[0] eq "config" ) {
    print_config();
    exit 0;
} elsif ( $ARGV[0] eq "dump" ) {
    dump_stats();
    exit 0;
} else {
    print_values();
    exit 0;
}

If you don't do this, the plugins will exit with code 1 even in the case of success, and this will be interpreted by munin-node as an error. Consequently, you will scratch your head wondering why no haproxy-related links and graphs are showing up on your munin stats page.

Once you do all this, do 'service munin-node reload' on the node running the HAProxy Munin plugins, then check that the plugins are working as expected by cd-ing into the /etc/munin/plugins directory and running each plugin through the 'munin-run' utility. For example:

# munin-run haproxy_sessions 
cluster2.value 146761052
cluster1.value 0
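
As an aside, you can query the same stats socket directly from Python, which comes in handy if you ever want to build your own one-off checks on top of it. A quick sketch (the socket path matches the haproxy.cfg lines above):

#!/usr/bin/env python
# Send 'show stat' to the HAProxy stats socket and print the CSV response.
import socket

def haproxy_stats(socket_path='/var/lib/haproxy/stats.socket'):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    sock.sendall('show stat\n'.encode('ascii'))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    sock.close()
    return b''.join(chunks).decode('ascii', 'replace')

if __name__ == '__main__':
    # print the CSV header plus the first few lines of stats
    for line in haproxy_stats().splitlines()[:5]:
        print(line)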

That's it. These plugins make it fairly easy for you to get more peace of mind and better sleep at night. Although it's well known that in #devops we don't sleep that much anyway...

Tuesday, January 25, 2011

Using AWS Elastic Load Balancing with a password-protected site

Scenario: you have a password-protected site running in EC2 that you want handled via Amazon Elastic Load Balancing. The problem is that the HTTP healthchecks from the ELB to the instance hosting your site will fail, because they will get a 401 HTTP status code instead of 200. Hence the instance will be marked as 'out of service' by the ELB.

My solution was to serve one static file (I called it 'check.html'; it just contains the text 'it works!') without password protection, and to point the ELB health check target at that file.

In my case, I have nginx handling both the dynamic app (which is a Django app running on port 8000) and the static files. Here are the relevant excerpts from nginx.conf (check.html is in /usr/local/nginx/static-content):

http {
    include       mime.types;
    default_type  application/octet-stream;

    upstream django {
        server 127.0.0.1:8000;
    }

    server {
        listen       80;

        location / {
            proxy_pass http://django/;
            auth_basic            "Restricted";
            auth_basic_user_file  /usr/local/nginx/conf/.htpasswd;
        }

        location ~* ^.+check\.html$
        {
            root   /usr/local/nginx/static-content;
        }
    }
}
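
To sanity-check the setup, I like to verify that the health check URL really is reachable without credentials while the rest of the site still is not. A quick sketch (the hostname is a placeholder):

#!/usr/bin/env python
# Verify that /check.html returns 200 without auth (what the ELB healthcheck sees)
# while / still returns 401. The hostname below is a placeholder.
try:
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:  # Python 2
    from urllib2 import urlopen, HTTPError

BASE = 'http://ec2-1-2-3-4.compute-1.amazonaws.com'

def status(path):
    try:
        return urlopen(BASE + path).getcode()
    except HTTPError as e:
        return e.code

print('/check.html -> %s' % status('/check.html'))  # expect 200
print('/           -> %s' % status('/'))            # expect 401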

Wednesday, January 19, 2011

Passing user data to EC2 Ubuntu instances with libcloud

While I'm on the topic of libcloud, I've been trying to pass user data to newly created EC2 instances running Ubuntu. The libcloud EC2 driver has an extra parameter called ex_userdata for the create_node method, and that's what I've been trying to use.

However, the gotcha here is that the value of that argument needs to be the contents of the user data file, and not the path to the file.

So...here's what worked for me:

1) Created a test user data file with following contents:
#!/bin/bash

apt-get update
apt-get install -y munin-node python2.6-dev
hostname coolstuff

2) Used the following script to create the node (I also created a keypair which I passed to create_node as the ex_keyname argument):
#!/usr/bin/env python

import os, sys
from libcloud.types import Provider 
from libcloud.providers import get_driver 
from libcloud.base import NodeImage, NodeSize, NodeLocation
 
EC2_ACCESS_ID     = 'MyAccessID'
EC2_SECRET_KEY    = 'MySecretKey'
 
EC2Driver = get_driver(Provider.EC2) 
conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)

keyname = sys.argv[1]
resp = conn.ex_create_keypair(name=keyname)
key_material = resp.get('keyMaterial')
if not key_material:
    sys.exit(1)
private_key = '/root/.ssh/%s.pem' % keyname
f = open(private_key, 'w')
f.write(key_material + '\n')
f.close()
os.chmod(private_key, 0600)

ami = "ami-88f504e1" # Ubuntu 10.04 32-bit
i = NodeImage(id=ami, name="", driver="")
s = NodeSize(id="m1.small", name="", ram=None, disk=None, bandwidth=None, price=None, driver="")
locations = conn.list_locations()
for location in locations:
    if location.availability_zone.name == 'us-east-1b':
        break

userdata_file = "/root/proj/test_libcloud/userdata.sh"
# pass the *contents* of the user data file, not its path
userdata_contents = open(userdata_file).read()

node = conn.create_node(name='tst', image=i, size=s, location=location, ex_keyname=keyname, ex_userdata=userdata_contents)
print node.__dict__

3) Waited for the newly created node to get to the Running state, then ssh-ed into the node using the key I created and verified that munin-node and python2.6-dev were installed, and also that the hostname was changed to 'coolstuff'.
# ssh -i ~/.ssh/lc1.pem ubuntu@domU-12-31-38-00-2C-3B.compute-1.internal

ubuntu@coolstuff:~$ dpkg -l | grep munin
ii  munin-common                      1.4.4-1ubuntu1                    network-wide graphing framework (common)
ii  munin-node                        1.4.4-1ubuntu1                    network-wide graphing framework (node)

ubuntu@coolstuff:~$ dpkg -l | grep python2.6-dev
ii  python2.6-dev                     2.6.5-1ubuntu6                    Header files and a static library for Python

ubuntu@coolstuff:~$ hostname
coolstuff

Anyway....hope this will be useful to somebody one day, even if that somebody is myself ;-)

libcloud 0.4.2 and SSL

Libcloud 0.4.2 was released yesterday. Among its new features is an important one: SSL certificate validation is now supported when opening a connection to a cloud provider. However, for this to work, you have to jump through a couple of hoops.

1) Python 2.5 doesn't ship with the ssl module in its standard library (2.6 does), so you need to install it from PyPI. The current version of the ssl package is 1.15.

2) By default, SSL cert validation is disabled in libcloud.

If you open a connection to a provider you get:

/usr/lib/python2.5/site-packages/libcloud/httplib_ssl.py:55:
UserWarning: SSL certificate verification is disabled, this can pose a
security risk. For more information how to enable the SSL certificate
verification, please visit the libcloud documentation.
 warnings.warn(libcloud.security.VERIFY_SSL_DISABLED_MSG)


To get past the warning, you need to enable SSL cert validation and also provide a path to a file containing common CA certificates (if you don't have that file, you can download cacert.pem from http://curl.haxx.se/docs/caextract.html for example). Add these lines before opening a connection:

import libcloud.security
libcloud.security.VERIFY_SSL_CERT = True
libcloud.security.CA_CERTS_PATH.append("/path/to/cacert.pem")

As an aside, the libcloud wiki page on SSL is very helpful and I used it to figure out what to do.
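
For completeness, here is what a minimal script looks like with certificate validation turned on, using the same 0.4.x import paths as in my other libcloud examples (the credentials and the cacert.pem path are placeholders):

#!/usr/bin/env python
# Enable SSL certificate validation before opening the libcloud connection.
import libcloud.security
libcloud.security.VERIFY_SSL_CERT = True
libcloud.security.CA_CERTS_PATH.append('/path/to/cacert.pem')

from libcloud.types import Provider
from libcloud.providers import get_driver

EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver('MyAccessID', 'MySecretKey')
print(conn.list_nodes())  # no SSL warning should be printed this time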
