When the Cloud is a Rockin': High Availability in Apache CloudStack

CloudStack currently provides a variety of bespoke high availability mechanisms for resources such as virtual machines, hosts, and virtual routers. Each of these implementations duplicates the HA check/recovery cycle, as well as the concurrency, persistence, and clustering required to manage high availability for any CloudStack resource. The High Availability Resource Management Service has been developed to consolidate these concerns, providing a robust, extensible HA mechanism. Using this service, plugins only need to define health check, activity check, and fence operations.

John Burwell

June 02, 2016

Transcript

  1. The Cloud Specialists
     When the Cloud is a Rockin': High Availability in Apache CloudStack
     shapeblue.com • @ShapeBlue
     John Burwell • @john_burwell
     VP of Software Engineering
  2. About Me
     • VP of Software Engineering @ ShapeBlue
     • Member, Apache CloudStack PMC (June 2013)
     • Ran operations and designed automated provisioning for analytic/virtualization clouds
     • Led architectural design and server-side development of a SaaS physical security platform
  3. There's No "I" in Team
     • Rohit Yadav
     • Abhi Prateek
     • Murali Reddy
     • Boris Stoyanov
  4. Motivation
     "Currently [sic] KVM HA works by monitoring an NFS based heartbeat file and it can often fail whenever this network share becomes slower, causing the hypervisors to reboot. … This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors doing it?"
     - Nux, 15 October 2015, CLOUDSTACK-8943
  5. Limitations/Issues
     • Limited to hosts and VMs using NFS storage
     • Tight coupling between the Agent and HighAvailabilityManager
     • False positives which interrupt the operation of healthy resources
     Inconsistent behavior prevents operators from trusting KVM HA
  6. Build vs. Buy
     Pros
     • Integration with the CloudStack control plane and abstractions
     • Simpler configuration
     • Integrated instrumentation and logging
     Cons
     • Complex mechanism to implement, test, and maintain
     • Forgoing a proven, battle-tested implementation
     • Less functionality initially
     A robust infrastructure control plane must include the ability to recover and fence resources
  7. HA Resource Management Service
     HA Resource Management Service
     • Manages per resource FSM
     • Persistence
     • Concurrency/Back Pressure
     • Common Business Logic
     Plugin (HA Provider)
     • Resource-specific Business Logic
     Resource
  8. Goals
     • Loose coupling between resources and HA
     • Consolidate orthogonal HA concerns
     • Prove the correct operation of the HA Resource Management Service and HA Providers independently
     • Leverage CloudStack abstractions
     • Develop a model for architectural evolution
     To create a trustworthy system, operational correctness must be the prevailing priority
  9. Terms and Concepts
     • Health Check: An idempotent check of a resource to directly verify its proper operation
     • Activity Check: An idempotent check to observe the side-effects of a resource's proper operation
     • Eligibility: An idempotent determination of a resource's eligibility for HA management
     • Recovery: Take potentially destructive actions to bring a resource back to a healthy state
     • Fence: Take potentially destructive actions to prevent an unrecoverable resource from impacting the health of its peers
  10. States
      • DISABLED: The resource is part of a partition where HA operations have been disabled, or HA operations have been disabled for the resource itself.
      • INITIALIZING: The initial health and eligibility of the resource for HA management is currently being determined.
      • AVAILABLE: The resource is available based on the passage of the most recent health check, and its containing partition has an HA state of ACTIVE.
      • INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE, but its current state does not support HA check and/or recovery operations.
      • SUSPECT: The resource is pending an activity check due to failing its most recent health check.
      • CHECKING: An activity check is currently being performed on the resource.
      • RECOVERING: Recovery operations are in progress to bring the resource back to a healthy state.
      • DEGRADED: The resource cannot be managed by the control plane but passed its most recent activity check, indicating that the resource is still servicing end-user requests.
      • FENCED: The resource is not operating normally and automated attempts to recover it failed. Manual operator intervention is required to recover the resource.
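
The state descriptions above can be read as a small state machine. The following is a minimal sketch of that reading, not the service's actual implementation: the `HaState` enum lists the nine states from the slide, while `HaTransitions` and all of its method names are hypothetical. The transitions are inferred only from the wording of the descriptions, and two of them are explicit assumptions: that a failed activity check starts recovery, and that a successful recovery sends the resource back to INITIALIZING to be re-evaluated.

```java
/** The nine HA states named on the States slide. */
enum HaState {
    DISABLED, INITIALIZING, AVAILABLE, INELIGIBLE, SUSPECT,
    CHECKING, RECOVERING, DEGRADED, FENCED
}

/** Hypothetical transition helpers inferred from the state descriptions. */
final class HaTransitions {
    private HaTransitions() {}

    /** A passed health check means AVAILABLE; a failed one leaves the resource SUSPECT. */
    static HaState afterHealthCheck(boolean healthy) {
        return healthy ? HaState.AVAILABLE : HaState.SUSPECT;
    }

    /** A SUSPECT resource is CHECKING while its activity check runs. */
    static HaState startActivityCheck(HaState current) {
        if (current != HaState.SUSPECT) {
            throw new IllegalStateException("activity check requires SUSPECT, was " + current);
        }
        return HaState.CHECKING;
    }

    /** Activity observed: DEGRADED. No activity: assumed to start recovery. */
    static HaState afterActivityCheck(boolean active) {
        return active ? HaState.DEGRADED : HaState.RECOVERING;
    }

    /** Failed recovery ends in FENCED; success is assumed to re-enter INITIALIZING. */
    static HaState afterRecovery(boolean recovered) {
        return recovered ? HaState.INITIALIZING : HaState.FENCED;
    }

    /** FENCED is the only state described as requiring manual operator intervention. */
    static boolean needsOperator(HaState state) {
        return state == HaState.FENCED;
    }

    public static void main(String[] args) {
        // Walk the unhappy path: failed health check, no activity, failed recovery.
        HaState state = afterHealthCheck(false);
        state = startActivityCheck(state);
        state = afterActivityCheck(false);
        state = afterRecovery(false);
        System.out.println(state + " needsOperator=" + needsOperator(state));
    }
}
```

Keeping the transitions as pure functions of the check results is one way to test the state model separately from any provider, which matches the stated goal of proving the service and the providers correct independently.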
  11. State Model (diagram)
  12. HA Provider Interface

          public interface HAProvider<R> extends Adapter {
              ResourceType resourceType();
              ResourceSubType resourceSubType();
              boolean isEligible(R r);
              boolean isHealthy(R r) throws HACheckerException;
              boolean hasActivity(R r) throws HACheckerException;
              boolean recover(R r) throws HARecoveryException;
              boolean fence(R r) throws HAFenceException;
          }
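
To make the plugin side of this interface concrete, here is a minimal sketch of what a provider might look like. It is illustrative only: `SketchHaProvider` is a simplified stand-in for the slide's `HAProvider<R>` (the `Adapter` supertype, the resource type accessors, and the checked exceptions are omitted so the example compiles on its own), and `FakeHost` and `FakeHostProvider` are hypothetical in-memory types, not CloudStack classes.

```java
/** Simplified stand-in for the slide's HAProvider<R>; see the assumptions above. */
interface SketchHaProvider<R> {
    boolean isEligible(R r);
    boolean isHealthy(R r);
    boolean hasActivity(R r);
    boolean recover(R r);
    boolean fence(R r);
}

/** Hypothetical in-memory host used only for illustration. */
final class FakeHost {
    boolean haEnabled = true;        // eligibility input
    boolean agentResponding = true;  // direct evidence of proper operation
    boolean recentDiskWrites = true; // side effect of proper operation
    boolean powerCycleWorks = true;  // whether a restart brings the agent back
    boolean poweredOn = true;
}

/** Hypothetical provider: the plugin supplies only resource-specific logic. */
final class FakeHostProvider implements SketchHaProvider<FakeHost> {
    // The three checks are idempotent: they read state and never change it.
    @Override public boolean isEligible(FakeHost h) { return h.haEnabled; }
    @Override public boolean isHealthy(FakeHost h) { return h.poweredOn && h.agentResponding; }
    @Override public boolean hasActivity(FakeHost h) { return h.poweredOn && h.recentDiskWrites; }

    /** Potentially destructive: restart the host and report whether it came back. */
    @Override public boolean recover(FakeHost h) {
        if (!h.powerCycleWorks) {
            return false;
        }
        h.poweredOn = true;
        h.agentResponding = true;
        return true;
    }

    /** Potentially destructive: power the host off so it cannot affect its peers. */
    @Override public boolean fence(FakeHost h) {
        h.poweredOn = false;
        h.agentResponding = false;
        h.recentDiskWrites = false;
        return true;
    }

    public static void main(String[] args) {
        FakeHostProvider provider = new FakeHostProvider();
        FakeHost host = new FakeHost();
        host.agentResponding = false; // agent lost, but the host still writes to storage
        System.out.println("healthy=" + provider.isHealthy(host)
                + " activity=" + provider.hasActivity(host));
    }
}
```

In this sketch the provider never decides when a check or a fence runs. That sequencing, along with persistence and concurrency, is what the deck assigns to the HA Resource Management Service.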
  13. KVM Host HA
      KVM Host HA Provider
      • Storage Processor: Activity Check
      • Host: Recover / Fence using OOBM
      • KVM Agent: Health Check
  14. Concurrency Model
      • Producer/consumer model
      • Size bounded work queues
      • Time bounded operations
      • Fixed sized thread pools
      • Idempotent operations are ephemeral
      • Non-idempotent operations are managed through AsyncJobManager using a new time-delayed dispatcher
      HA operations cannot overwhelm the control plane
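
The first four bullets above map onto standard `java.util.concurrent` building blocks. The sketch below shows one way they could fit together for an idempotent check; it is an assumption about the shape of such a component, not the service's code. `BoundedHaChecks` and its methods are hypothetical names, and the AsyncJobManager path for non-idempotent operations is not modelled.

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical runner: fixed pool, size-bounded queue, time-bounded checks. */
final class BoundedHaChecks {
    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    BoundedHaChecks(int threads, int queueCapacity, long timeoutMillis) {
        // core == max gives a fixed-size pool; the bounded queue plus AbortPolicy
        // rejects new work instead of letting a backlog grow without limit.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Producer side: returns empty when the queue is full (back pressure). */
    Optional<Future<Boolean>> trySubmit(Callable<Boolean> check) {
        try {
            return Optional.of(pool.submit(check));
        } catch (RejectedExecutionException e) {
            return Optional.empty();
        }
    }

    /**
     * Waits at most timeoutMillis (queue wait included) for a check result.
     * Returns empty on timeout or interruption; a check that throws is
     * treated as a failed check in this sketch.
     */
    Optional<Boolean> await(Future<Boolean> pending) {
        try {
            return Optional.of(pending.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the worker so it is freed for other checks
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.of(Boolean.FALSE);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pending.cancel(true);
            return Optional.empty();
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        BoundedHaChecks checks = new BoundedHaChecks(2, 4, 500);
        Optional<Boolean> result = checks.trySubmit(() -> true).flatMap(checks::await);
        System.out.println("health check result: " + result);
        checks.shutdown();
    }
}
```

Because the checks are idempotent, a rejected or timed-out check can simply be dropped and retried on the next cycle, which is presumably why the slide calls such operations ephemeral.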
  15. Status
      • Focused on KVM host HA
      • Initial implementation started to validate the design
      • Draft specification: functional spec will be published in the next 1-2 weeks
      • Robust unit and integration test model to verify both the service and KVM host HA provider
      • Delivery of the first version in July 2016 for inclusion in 4.10 (August 2016)
  16. What's Next
      • Support Nested HA Resources
      • Instrumentation
      • Migrate VM HA to the HA Resource Management Service
  17. Questions? Comments?
      #cloudstackworks