Slide 1

Slide 1 text

The Cloud Specialists
When the Cloud is a Rockin': High Availability in Apache CloudStack
shapeblue.com • @ShapeBlue
John Burwell • @john_burwell
VP of Software Engineering

Slide 2

Slide 2 text

About Me
• VP of Software Engineering @ ShapeBlue
• Member, Apache CloudStack PMC (June 2013)
• Ran operations and designed automated provisioning for analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS physical security platform

Slide 3

Slide 3 text

There's No "I" in Team
• Rohit Yadav
• Abhi Prateek
• Murali Reddy
• Boris Stoyanov

Slide 4

Slide 4 text

Motivation
Currently [sic] KVM HA works by monitoring an NFS based heartbeat file and it can often fail whenever this network share becomes slower, causing the hypervisors to reboot. … This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors doing it?
- Nux, 15 October 2015, CLOUDSTACK-8943

Slide 5

Slide 5 text

Limitations/Issues
• Limited to hosts and VMs using NFS storage
• Tight coupling between the Agent and HighAvailabilityManager
• False positives which interrupt the operation of healthy resources
Inconsistent behavior prevents operators from trusting KVM HA

Slide 6

Slide 6 text

Build vs. Buy
Pros
• Integration with the CloudStack control plane and abstractions
• Simpler configuration
• Integrated instrumentation and logging
Cons
• Complex mechanism to implement, test, and maintain
• Foregoing a proven, battle-tested implementation
• Less functionality initially
A robust infrastructure control plane must include the ability to recover and fence resources

Slide 7

Slide 7 text

HA Resource Management Service
HA Resource Management Service
• Manages per-resource FSM
• Persistence
• Concurrency/Back Pressure
• Common Business Logic
HA Provider (plugin)
• Resource-specific Business Logic
Resource

Slide 8

Slide 8 text

Goals
• Loose coupling between resources and HA
• Consolidate orthogonal HA concerns
• Prove the correct operation of the HA Resource Management Service and HA Providers independently
• Leverage CloudStack abstractions
• Develop a model for architectural evolution
To create a trustworthy system, operational correctness must be the prevailing priority

Slide 9

Slide 9 text

Terms and Concepts
• Health Check: An idempotent check of a resource to directly verify its proper operation
• Activity Check: An idempotent check to observe the side effects of a resource's proper operation
• Eligibility: An idempotent determination of a resource's eligibility for HA management
• Recovery: Take potentially destructive actions to bring a resource back to a healthy state
• Fence: Take potentially destructive actions to prevent an unrecoverable resource from impacting the health of its peers

Slide 10

Slide 10 text

States
• DISABLED: The resource is part of a partition where HA operations have been disabled, or HA operations have been disabled for the resource itself.
• INITIALIZING: The initial health and eligibility of the resource for HA management is currently being determined.
• AVAILABLE: The resource passed its most recent health check and its containing partition has an HA state of ACTIVE.
• INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE, but the resource's current state does not support HA check and/or recovery operations.
• SUSPECT: The resource is pending an activity check because it failed its most recent health check.
• CHECKING: An activity check is currently being performed on the resource.
• RECOVERING: Recovery operations are in progress to bring the resource back to a healthy state.
• DEGRADED: The resource cannot be managed by the control plane but passed its most recent activity check, indicating that the resource is still servicing end-user requests.
• FENCED: The resource is not operating normally and automated attempts to recover it failed. Manual operator intervention is required to recover the resource.
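The state list above can be sketched as a Java enum with a small transition table. This is a minimal illustration, not the project's implementation: the names `HaState` and `HaTransitions` are hypothetical, and the table contains only the transitions that the state descriptions imply (for example, a failed health check moves AVAILABLE to SUSPECT). The actual state model may allow more or different transitions.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** The nine HA states named on the slide (enum name is hypothetical). */
enum HaState {
    DISABLED, INITIALIZING, AVAILABLE, INELIGIBLE, SUSPECT,
    CHECKING, RECOVERING, DEGRADED, FENCED
}

/**
 * Hypothetical transition table. Only transitions implied by the state
 * descriptions are listed; this is an assumption, not the real model.
 */
final class HaTransitions {
    private static final Map<HaState, Set<HaState>> ALLOWED =
            new EnumMap<>(HaState.class);

    static {
        for (HaState s : HaState.values()) {
            ALLOWED.put(s, EnumSet.noneOf(HaState.class));
        }
        allow(HaState.INITIALIZING, HaState.AVAILABLE);  // initial health check passed
        allow(HaState.INITIALIZING, HaState.INELIGIBLE); // not eligible for HA
        allow(HaState.AVAILABLE, HaState.SUSPECT);       // health check failed
        allow(HaState.SUSPECT, HaState.CHECKING);        // activity check started
        allow(HaState.CHECKING, HaState.DEGRADED);       // activity check passed
        allow(HaState.CHECKING, HaState.RECOVERING);     // no activity observed
        allow(HaState.RECOVERING, HaState.FENCED);       // recovery failed
    }

    private static void allow(HaState from, HaState to) {
        ALLOWED.get(from).add(to);
    }

    /** Returns true if the sketch's table permits moving from one state to the other. */
    static boolean canTransition(HaState from, HaState to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

Keeping the table in one place means an illegal move, such as going straight from FENCED to AVAILABLE without operator action, can be rejected before any state is persisted.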

Slide 11

Slide 11 text

State Model [diagram]

Slide 12

Slide 12 text

HA Provider Interface

    public interface HAProvider<R> extends Adapter {
        ResourceType resourceType();
        ResourceSubType resourceSubType();
        boolean isEligible(R r);
        boolean isHealthy(R r) throws HACheckerException;
        boolean hasActivity(R r) throws HACheckerException;
        boolean recover(R r) throws HARecoveryException;
        boolean fence(R r) throws HAFenceException;
    }
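To show how a caller might drive the five provider operations, here is a self-contained sketch. It does not use the CloudStack types: `SimpleHaProvider` is a hypothetical stand-in that drops `Adapter`, the resource type methods, and the checked exceptions, and the `Outcome` labels and the ordering in `HaCheckCycle.evaluate` are assumptions drawn from the terms and state descriptions in this deck, not from the actual service.

```java
/** Hypothetical stand-in for the slide's interface, without checked exceptions. */
interface SimpleHaProvider<R> {
    boolean isEligible(R r);
    boolean isHealthy(R r);
    boolean hasActivity(R r);
    boolean recover(R r);
    boolean fence(R r);
}

/** Hypothetical outcome labels for one evaluation pass. */
enum Outcome { INELIGIBLE, AVAILABLE, DEGRADED, RECOVERED, FENCED }

final class HaCheckCycle {
    /**
     * Assumed ordering: eligibility, then health check, then activity check,
     * and only then the potentially destructive recover and fence steps.
     */
    static <R> Outcome evaluate(SimpleHaProvider<R> provider, R resource) {
        if (!provider.isEligible(resource)) {
            return Outcome.INELIGIBLE;
        }
        if (provider.isHealthy(resource)) {
            return Outcome.AVAILABLE;
        }
        if (provider.hasActivity(resource)) {
            return Outcome.DEGRADED; // unmanaged, but still serving requests
        }
        if (provider.recover(resource)) {
            return Outcome.RECOVERED;
        }
        provider.fence(resource);
        return Outcome.FENCED;
    }
}

/** Test double whose answers are fixed at construction time. */
final class FakeHostProvider implements SimpleHaProvider<String> {
    private final boolean healthy;
    private final boolean active;
    private final boolean recoverable;
    boolean fenced = false;

    FakeHostProvider(boolean healthy, boolean active, boolean recoverable) {
        this.healthy = healthy;
        this.active = active;
        this.recoverable = recoverable;
    }

    public boolean isEligible(String host) { return true; }
    public boolean isHealthy(String host) { return healthy; }
    public boolean hasActivity(String host) { return active; }
    public boolean recover(String host) { return recoverable; }
    public boolean fence(String host) { fenced = true; return true; }
}
```

Because the resource-specific logic sits behind the interface, the evaluation order can be exercised with a fake provider such as `FakeHostProvider`, which fits the stated goal of proving the service and the providers independently.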

Slide 13

Slide 13 text

KVM Host HA
KVM Host HA Provider
• Health Check: KVM Agent
• Activity Check: Storage Processor
• Recover / Fence using OOBM: Host

Slide 14

Slide 14 text

Concurrency Model
• Producer/consumer model
• Size-bounded work queues
• Time-bounded operations
• Fixed-size thread pools
• Idempotent operations are ephemeral
• Non-idempotent operations are managed through AsyncJobManager using a new time-delayed dispatcher
HA operations cannot overwhelm the control plane
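The bounded-queue, fixed-pool, time-limited parts of this model can be sketched with the standard `java.util.concurrent` classes. This is an illustration under stated assumptions, not the CloudStack code: `BoundedCheckExecutor` and `CheckResult` are hypothetical names, the sketch covers only the ephemeral idempotent checks, and it does not model the AsyncJobManager path for non-idempotent operations.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical result labels for one check submission. */
enum CheckResult { PASSED, FAILED, REJECTED, TIMED_OUT }

final class BoundedCheckExecutor {
    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    BoundedCheckExecutor(int threads, int queueSize, long timeoutMillis) {
        // Fixed-size pool in front of a size-bounded queue. AbortPolicy throws
        // when the queue is full, which acts as back pressure on producers.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Submits a check and waits at most timeoutMillis for its result. */
    CheckResult runCheck(Callable<Boolean> check) {
        Future<Boolean> future;
        try {
            future = pool.submit(check);
        } catch (RejectedExecutionException e) {
            return CheckResult.REJECTED; // queue full: drop rather than pile up work
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS)
                    ? CheckResult.PASSED : CheckResult.FAILED;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the check is idempotent
            return CheckResult.TIMED_OUT;
        } catch (ExecutionException e) {
            return CheckResult.FAILED; // the check itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return CheckResult.FAILED;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A rejected or timed-out check can simply be dropped and retried on a later cycle because the checks are idempotent, which is what lets the queue and the pool stay bounded.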

Slide 15

Slide 15 text

Status
• Focused on KVM host HA
• Initial implementation started, validating the design
• Draft specification: functional spec will be published in the next 1-2 weeks
• Robust unit and integration test model to verify both the service and KVM host HA provider
• Delivery of the first version in July 2016 for inclusion in 4.10 (August 2016)

Slide 16

Slide 16 text

What's Next
• Support Nested HA Resources
• Instrumentation
• Migrate VM HA to the HA Resource Management Service

Slide 17

Slide 17 text

Questions? Comments?
#cloudstackworks