Tuesday, December 6, 2016

Pivotal Cloud Foundry - Ops Manager Backup & Recovery


As we know, PCF Ops Manager is a singleton component with no High Availability. If it is corrupted or crashes for some reason, PCF platform-level operations may be hampered, though it does not affect the Elastic Runtime (ELR) or running applications.

It is always advisable to test the backup and recovery procedure of Ops Manager. I have done this using the following three methods:
1) Export / Import via Ops Manager GUI
2) Using cfops (http://www.cfops.io/)
3) Image-level VADP backup of the Ops Manager VM (in case of VMware)

In this test, the assumption is that only Ops Manager is affected, while the BOSH Director and other PCF/ELR VMs are running fine.


1) Export / Import via Ops Manager GUI:
This is a fairly manual way of backing up Ops Manager, where you export the settings via the Ops Manager GUI.
The export includes base VM images, necessary packages, and references to the installation IP addresses, which means it can be large and may take some time.
The browser saves a file containing the exported content.
 

Now assume that the Ops Manager VM is corrupted/crashed/lost (for this test, I am just powering it off).

Now let us deploy a new Ops Manager with the same IP and hostname.
Once deployed, open a web browser, type the FQDN of Ops Manager, and choose Import Existing Installation.
Provide the decryption passphrase (set during the initial setup of Ops Manager) and the path of the backed-up/exported installation.zip file, then press Import.
Again, this import process may take time depending on the size of the data to be imported.
Once the import is completed, it will display the login screen. Log in with your username/password and you will see a message that the import was successful.


While this method works, it may not be straightforward to automate for regular backups. Now let's test the second method, using cfops.


2) "CF OPS" (http://www.cfops.io/)
This is an automation utility from Pivotal which can be downloaded from https://github.com/pivotalservices/cfops/releases and installed on a jumpbox.

Once done, a backup can be triggered via a simple script like the one below, and it can be scheduled via cron or other scheduling tools.
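Here is a minimal sketch of such a script, assuming the cfops flags documented in its README; the hostname, credentials, and backup path below are placeholders from my lab, so adjust them to your environment and verify against your cfops version.

#!/bin/bash
# backup-opsman.sh - minimal cfops backup sketch (placeholder values)
BACKUP_DIR=/backups/opsman/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

cfops backup --tile ops-manager \
  --opsmanagerhost opsman.pivotal.local \
  --adminuser admin --adminpass 'ADMIN_PASSWORD' \
  --opsmanageruser ubuntu --opsmanagerpass 'UBUNTU_PASSWORD' \
  -d "$BACKUP_DIR"

To schedule it nightly, a crontab entry such as "0 2 * * * /usr/local/bin/backup-opsman.sh" would do.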


This is a very simple script for testing. For production use, it can be extended with error handling, retention, and logging.

Now assume that the Ops Manager VM is corrupted/crashed/lost (for this test, I am just powering it off).

Now let us deploy a new Ops Manager with the same IP and hostname.

Once deployed, open a web browser and type the FQDN of Ops Manager. Choose internal authentication for this fresh setup.

This will bring up a fresh Ops Manager without any configuration/tiles.

DO NOT set up authentication at this stage. Just go ahead with the restoration using cfops.

Here the assumption is that the original BOSH Director is still operational.
If that is not the case, rename (or remove) bosh-state.json to force re-creation of the BOSH Director. Removing bosh-state.json causes Ops Manager to treat the deploy as a new deployment, recreating missing virtual machines (VMs), including BOSH. The new deployment ignores existing VMs such as your Pivotal Cloud Foundry deployment.
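The restore itself is the mirror image of the backup; again a minimal sketch with placeholder hostnames, credentials, and paths (the bosh-state.json location is as I recall it for PCF 1.7, so verify it on your version):

# Only if the original BOSH Director is gone: rename bosh-state.json on the
# new Ops Manager VM so that Apply Changes recreates the Director
# ssh ubuntu@opsman.pivotal.local \
#   'sudo mv /var/tempest/workspaces/default/deployments/bosh-state.json /var/tempest/workspaces/default/deployments/bosh-state.json.bak'

cfops restore --tile ops-manager \
  --opsmanagerhost opsman.pivotal.local \
  --adminuser admin --adminpass 'ADMIN_PASSWORD' \
  --opsmanageruser ubuntu --opsmanagerpass 'UBUNTU_PASSWORD' \
  -d /backups/opsman/20161206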

Once restored, open the FQDN in a web browser; it will not ask you to set up authentication and will directly prompt for the username/password.
You will need to apply the changes, and once done... it's all good!


And during all these operations, the Elastic Runtime and the deployed apps were running perfectly fine.


3) Image-level VADP backup of the Ops Manager VM (in case of VMware)
This is like backing up any standard VM via enterprise backup software leveraging the VADP API.

For this testing, I am using EMC Avamar.

Now assume that the Ops Manager VM is corrupted/crashed/lost (for this test, I am actually deleting the VM).
Let us restore it from the Avamar backup.

Once the restore is completed, power the VM on (if you did not select that option during the restore), wait a couple of minutes for it to start, and open the FQDN in a web browser.

It will prompt for the decryption passphrase.
Once entered, it will prompt for the login/password. That's it... it's fully operational again.
I feel this is the simplest way of backing up and restoring Ops Manager.

Hope this will be useful.



Thursday, December 1, 2016

Pivotal Cloud Foundry - Stemcell Upgrade



I have been working on some upgrade scenarios for a PCF environment, and this is one of them.

A stemcell is an OS image which contains a bare-minimum OS with a few utilities, agents, and configuration. The Cloud Foundry BOSH team frequently releases new stemcell versions; when a new version addresses (let's say) a security vulnerability, the stemcell must be upgraded. Let's do that...


Existing Setup

       Ops Mgr and BOSH Director @ v1.7.14
       ELR @ v1.7.35
       Stemcell @ ubuntu trusty 3233.3
Goal
       Upgrade Stemcell to "ubuntu trusty 3233.4" 
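
If you don't already have the new stemcell, it can be downloaded from Pivotal Network or bosh.io. As a sketch (this follows the bosh.io URL pattern for the vSphere stemcell; verify the exact stemcell name for your IaaS):

wget 'https://bosh.io/d/stemcells/bosh-vsphere-esxi-ubuntu-trusty-go_agent?v=3233.4' \
  -O bosh-stemcell-3233.4-vsphere-esxi-ubuntu-trusty-go_agent.tgz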

I have a simple PHP test application (3 instances) deployed, and I am hitting it continuously using a while loop and curl to simulate application usage.
[root@myApp php]# while true; do curl myphpapp.pivotal.local; echo; sleep 1; done
This is a Test App, Current Time [03:50:26]
This is a Test App, Current Time [03:50:27]
This is a Test App, Current Time [03:50:28]
This is a Test App, Current Time [03:50:29]
...

Now let's start the upgrade steps:
  1. Go to the Ops Manager Installation Dashboard, open the ELR tile, go to Settings, Stemcell, and upload the new stemcell to ELR
  2. Go back to the Installation Dashboard and apply the changes
  3. This will start the whole upgrade process
  4. The upgrade follows the "canary" method: BOSH will first try to upgrade a small number of servers (usually 1), the "canary", and only if that succeeds will the remaining servers be upgraded
  5. During this process, my application was accessible without any issue
  6. ** As I am running this without high availability due to limited resources in my lab, some of the components were not available and I could not push new apps or make any changes to the running apps ** I also turned "VM resurrection" OFF during the process to avoid any conflicts
  7. Normally you should not see this happening, as a production environment should always be designed with high availability; the unavailability of one instance/server of a component during the upgrade should not affect that component's other instances/servers
  8. And after a while, the upgrade is successful
  9. The new version can be verified as shown in the sketch after this list
  10. I must turn "VM resurrection" back ON at the end :)
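
For reference, the verification and the resurrection toggle can both be done from a jumpbox with the old (v1) BOSH CLI; a minimal sketch, where the director address is a placeholder from my lab:

bosh target 192.168.10.5      # placeholder BOSH Director address; then bosh login
bosh stemcells                # should now list bosh-vsphere-esxi-ubuntu-trusty-go_agent 3233.4
bosh vm resurrection on       # re-enable the Resurrector after the upgrade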

That's it. The upgrade is successful. Hope this will be useful.