Let's Connect on Vodafone 360 - Using Apache ActiveMQ in Mobile Web 2.0
How to meet design goals and non-functional requirements: an architecture for using JMS with ActiveMQ in very large backends, with a focus on high availability and high scalability.
Vodafone 360 25 May 2010
Dirk Fröhner
People Services / Vodafone Internet Services (VIS) / Group Marketing

Table of contents
Introduction
What is Vodafone 360
JMS components in Vodafone 360
JMS architecture in Vodafone 360
> Design goals
Experience / problems / best practice with ActiveMQ
> Bugs, problems, troubleshooting, testing

Introduction
What this presentation is all about
> What we do with JMS in the Vodafone 360 backend
> Non-functional requirements: design goals for the JMS architecture
> Experience: problems, best practice, testing
→ An overview of how JMS can be used in Mobile Web 2.0

360 works on a range of mobiles and PC
> Vodafone 360 phones: best Vodafone 360 experience
> Over 100 other mobile phones: with Vodafone 360 services
> Website: www.360.com

360: Progress since launch
• Launched in 8 markets: DE, ES, UK, IT, NL, PT, GR & IE
• 500K registered 360 customers
• Sold ca. 800K devices with 360 services on them
• Currently ca. 15 device models that have all/some 360 services pre-embedded
• 360 services downloadable to over 100 popular devices
• Works on most major phone platforms, from S60 through to Apple and Android
• Ca. 9K apps in Apps Shop

JMS architecture in Vodafone 360
Most important design goals to meet non-functional requirements
> Reliable messaging
> High availability
> Horizontal scalability
> Performance

JMS architecture in Vodafone 360
Excursus: JMS messaging paradigms: Queues
> Multiple consumers can subscribe to a queue destination
> It is guaranteed that exactly one consumer (eventually) receives a particular message
> Multiple consumers can therefore be used to spread the load of messages from the queue
[Diagram: Producers, JMS Broker with Queue Destination, Consumer A, Consumer B]

JMS architecture in Vodafone 360
Excursus: JMS messaging paradigms: Topics
> Multiple consumers can subscribe to a topic destination
> All consumers receive each message
> Multiple consumers therefore lead to more message load from the topic
[Diagram: Producers, JMS Broker with Topic Destination, Consumer A, Consumer B]

JMS architecture in Vodafone 360
Excursus: JMS messaging paradigms: Virtual Topics
> Producers send messages to a topic destination
> Consumers subscribe to a queue destination
> Consumers can be grouped: each group receives each message, but within a particular group exactly one consumer receives a particular message
> Multiple consumers can therefore be used for load distribution within each group
[Diagram: Producers, JMS Broker with Topic Destination and Queue Destinations, Consumer Group A (Consumer A1, Consumer A2), Consumer Group B]
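
The virtual topic behaviour described above can be modelled without a broker. The following is a minimal sketch in plain Python, not ActiveMQ code: the class `VirtualTopicModel` and its methods are hypothetical names made up for illustration, and the strict round-robin choice inside a group is a simplifying assumption (a real broker dispatches according to prefetch and consumer speed). Only the destination names follow ActiveMQ's default virtual topic convention, `VirtualTopic.<name>` for producers and `Consumer.<group>.VirtualTopic.<name>` for each group's queue.

```python
from collections import defaultdict
from itertools import cycle


class VirtualTopicModel:
    """Toy model of virtual topic dispatch (hypothetical, not an ActiveMQ API)."""

    def __init__(self, topic):
        self.topic = topic                  # e.g. "VirtualTopic.Orders"
        self.groups = defaultdict(list)     # group name -> consumer names
        self.received = defaultdict(list)   # consumer name -> messages
        self._next = {}                     # group name -> round-robin iterator

    def queue_name(self, group):
        # ActiveMQ's default naming pattern for the per-group queue.
        return f"Consumer.{group}.{self.topic}"

    def subscribe(self, group, consumer):
        self.groups[group].append(consumer)
        # Restart the rotation so that it includes the new consumer.
        self._next[group] = cycle(list(self.groups[group]))

    def publish(self, message):
        # Every group gets a copy (topic semantics between groups) ...
        for group in self.groups:
            # ... but only one consumer per group gets it (queue semantics).
            consumer = next(self._next[group])
            self.received[consumer].append(message)


model = VirtualTopicModel("VirtualTopic.Orders")
model.subscribe("A", "A1")
model.subscribe("A", "A2")
model.subscribe("B", "B1")
for i in range(4):
    model.publish(f"m{i}")
```

In this run group A shares the four messages between A1 and A2, while B1, alone in group B, receives all four.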

JMS architecture in Vodafone 360
Reliable messaging
Non-functional requirement
> A message, once sent to the broker, will survive an outage of the broker and / or the consumers
Implementation
> Needs a persistence layer (JDBC database or file system based)
> Message is stored in the persistence layer before the ACK is sent to the producer
> Message is removed from the persistence layer after the ACK is received from the consumer
Drawbacks
> Performance loss due to access to the persistence layer: thorough throughput tests are essential
> Careless setup of the persistence layer can make it even worse
> Concurrency issues in past versions (5.3.0.4) can bring the broker to a complete standstill once the number of persistent messages becomes significant
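
The ordering of persistence and acknowledgements can be illustrated with a small sketch. This is a toy model in Python under stated assumptions: `PersistentStore` stands in for the JDBC or file-based persistence layer, `ReliableBroker` for a broker process, and neither name is an ActiveMQ class.

```python
class PersistentStore:
    """Stand-in for the JDBC or file-based persistence layer (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def insert(self, msg_id, body):
        self.rows[msg_id] = body

    def delete(self, msg_id):
        self.rows.pop(msg_id, None)


class ReliableBroker:
    """Toy broker process that keeps no message state of its own."""

    def __init__(self, store):
        self.store = store

    def send(self, msg_id, body):
        # 1. Persist first ...
        self.store.insert(msg_id, body)
        # 2. ... and only then acknowledge to the producer.
        return "ACK"

    def consumer_ack(self, msg_id):
        # 3. Remove the message only after the consumer acknowledged it.
        self.store.delete(msg_id)

    def pending(self):
        return sorted(self.store.rows)


store = PersistentStore()
broker = ReliableBroker(store)
broker.send("m1", "hello")
broker.send("m2", "world")
broker.consumer_ack("m1")

# Simulated broker outage: a new broker process attaches to the same store
# and still finds the message that no consumer has acknowledged yet.
restarted = ReliableBroker(store)
```

Because the store is written before the producer sees its ACK, the unacknowledged message `m2` is still pending for the restarted broker.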

JMS architecture in Vodafone 360
High availability: Standalone JMS node
> One host
> One broker process
> If host dies, messaging dies
> If process dies, messaging dies
> Only vertical scaling (naturally limited)
[Diagram: Producers, Host A with Broker 01, Consumers]

JMS architecture in Vodafone 360
High availability: HA JMS node
> Two hosts
> Two broker processes
> If host A dies, broker 012 takes over
> If broker 011 dies, broker 012 takes over
> Only vertical scaling (naturally limited)
[Diagram: Producers, Host A with Broker 011 (Master), Host B with Broker 012 (Slave), Consumers]

JMS architecture in Vodafone 360
High availability
Non-functional requirement
> When a master process dies (e.g. due to CPU meltdown), a slave process can instantly take over
> The new active broker has instant access to all persistent messages and none of them will be lost
Implementation
> Build a JMS node with at least two processes, master and slave (or even multiple slaves)
> Use a shared persistence layer that all processes can access
> Use a mutex mechanism on the persistence layer to determine who is master and who is slave
> Clients need to connect with failover and include the host:port of every process
Drawbacks
> Topic messages in the dead master will be lost (naturally)
> Requires expensive additional hardware in case of a shared filesystem
> ActiveMQ has to rely on the HA capabilities of the underlying persistence layer (naturally)
• e.g. issues with an unstable MySQL HA cluster that made the brokers shut down repeatedly
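
On the client side, "connect with failover" means handing the client a failover transport URI that lists every broker process of the node. A minimal sketch of building such a URI follows; the helper name `failover_uri` and the host names are assumptions for illustration, and port 61616 is ActiveMQ's default OpenWire port rather than a value taken from the slides.

```python
def failover_uri(endpoints, randomize=False):
    """Build an ActiveMQ failover transport URI from (host, port) pairs."""
    if not endpoints:
        raise ValueError("at least one broker endpoint is required")
    uris = ",".join(f"tcp://{host}:{port}" for host, port in endpoints)
    flag = "true" if randomize else "false"
    return f"failover:({uris})?randomize={flag}"


# Master and slave of one HA JMS node (hypothetical host names).
uri = failover_uri([("hostA", 61616), ("hostB", 61616)])
print(uri)
```

With `randomize=false` the client tries the endpoints in the listed order; in a master/slave pair only the current master accepts connections, so the client ends up on whichever process holds the lock.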

JMS architecture in Vodafone 360
Horizontal scalability
Horizontal scalability is the most important design goal
> Vodafone 360 needs to be able to serve several tens of millions of users
> Vertical scalability is naturally limited
> Message load needs to be spread horizontally
– Leads to the obvious approach of having more than one JMS node
– Different ways of organizing the JMS nodes and their interaction can be applied
– Not all approaches are suitable for all types of destinations

JMS architecture in Vodafone 360
Horizontal scalability
Approach 1a: network of brokers (NWOB)
> On-board proposal of ActiveMQ
> Consists of a number of JMS nodes that know about each other
> Nodes keep themselves aligned regarding connected clients through advisory messages
> Messages are routed to other nodes in case there is no (available) local app consumer
> Scalability details in a minute...
[Diagram: Producers, Node 1 with Broker 01, Node 2 with Broker 02, Consumers]

JMS architecture in Vodafone 360
Horizontal scalability
Approach 1b: HA network of brokers
> n hosts, n*2 broker processes
> Nodes spread crosswise on hosts
> No node lost when one host dies
> Combines HA with NWOB (required)
> NWOB provides horizontal scalability...
> ...for queues, if clients are set up properly
> ...not for topics (we'll see why not)
[Diagram: Producers, Host A with Broker 011 and Broker 022, Host B with Broker 012 and Broker 021, Consumers; Node 1 = Brokers 011 / 012, Node 2 = Brokers 021 / 022]

JMS architecture in Vodafone 360
Horizontal scalability: HA NWOB for queues
Producer view
> Producers can spread their load on all nodes
> If existing nodes cannot handle the load fast enough, so that producers are slowed down, new nodes can be added
> → production should be uniformly distributed across all nodes
Consumer view
> Consumers can spread their greed on all nodes
> If existing nodes cannot serve the greed fast enough, so that consumers sit idle, new nodes can be added
> → consumption should take place from all nodes
Broker view
> For each queue destination there has to be at least one producer and one consumer against every node
• To make use of the width of the NWOB
• To avoid unnecessary message forwarding between the nodes
• To avoid stuck messages when consumers reconnect with randomize=true
> → Tuple (producer, broker, consumer) should always be considered in a holistic way
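
The broker-view rule for queues (at least one producer and one consumer against every node) lends itself to an automated check. The sketch below assumes you can obtain, for one queue, a mapping from client name to the node it is connected to; the function name `uncovered_nodes` and the input format are hypothetical, and how the mapping is collected (for example from JMX) is left open.

```python
def uncovered_nodes(nodes, producers, consumers):
    """Return the nodes that lack a producer or a consumer for one queue.

    `producers` and `consumers` map a client name to the node it is
    connected to. The result maps each offending node to what it has.
    """
    with_producer = set(producers.values())
    with_consumer = set(consumers.values())
    return {
        node: {
            "producer": node in with_producer,
            "consumer": node in with_consumer,
        }
        for node in nodes
        if node not in with_producer or node not in with_consumer
    }


gaps = uncovered_nodes(
    nodes=["n1", "n2", "n3"],
    producers={"p1": "n1", "p2": "n2", "p3": "n3"},
    consumers={"c1": "n1", "c2": "n1"},
)
```

Here `n2` and `n3` are reported because both consumers sit on `n1`: messages produced against `n2` and `n3` would all have to be forwarded through the network of brokers.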

JMS architecture in Vodafone 360
Horizontal scalability: HA NWOB for topics
Producer view (isolated view same as for queues)
> Producers can spread their load on all nodes
> If existing nodes cannot handle the load fast enough, so that producers are slowed down, new nodes can be added
> → production should be uniformly distributed across all nodes
Consumer view (isolated view same as for queues)
> Consumers can spread their greed on all nodes
> If existing nodes cannot serve the greed fast enough, so that consumers sit idle, new nodes can be added
> → consumption should take place from all nodes
Broker view
> Number of incoming and outgoing messages per broker does not decrease
> No load distribution, but load multiplication
• If n1 wasn't able to cope with the load of topic messages, it won't be better with the additional node
• In the above example: still 2 messages in and 2 messages out, now on every node
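
The difference between load distribution and load multiplication in the broker view can be made concrete with a little arithmetic. The model below is a deliberate simplification and rests on assumptions not stated on the slide: production is spread uniformly over the nodes, every node has exactly one local subscriber that needs every topic message, and only messages arriving at a node and messages delivered to its local consumers are counted. The function name `per_node_load` is hypothetical.

```python
def per_node_load(total_messages, nodes, kind):
    """Messages in and out per broker node for one interval (simplified)."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    if kind == "queue":
        # Each message is consumed exactly once, so the work is divided.
        share = total_messages / nodes
        return {"in": share, "out": share}
    if kind == "topic":
        # The local subscriber needs every message: whatever is not produced
        # locally arrives through the network of brokers instead.
        return {"in": float(total_messages), "out": float(total_messages)}
    raise ValueError(f"unknown destination kind: {kind}")


queue_load = per_node_load(total_messages=2, nodes=2, kind="queue")
topic_load = per_node_load(total_messages=2, nodes=2, kind="topic")
```

For two messages on two nodes, the queue case yields one message in and one out per node, while the topic case stays at two in and two out on every node, which matches the slide's example.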

JMS architecture in Vodafone 360
Horizontal scalability: partitioning for topics
Partitioning for topics for horizontal scalability and independent JMS nodes (no NWOB)
> Find a way to divide the load of the topic into parts of about the same size
> Avoid unnecessary overhead of NWOB
Partitioning by number of messages
> Easy approach: round-robin production on topic partitions (producers p1, ..., p4 in the slide diagram)
> Especially suitable when each consumer process really needs each message
Partitioning by message property values: sharding
> Sophisticated approach: find a message property whose values can be used to partition the set of messages
> Can also reduce the total number of messages on the wire: if consumer processes are not interested in every message, but only in a certain set of partitions, they connect only to those nodes that serve those partitions
> Needs a sharding-aware lib on producer and consumer side
> Avoids the overhead of massive use of message selectors
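
Both partitioning strategies boil down to a rule that picks the JMS node for a message. The following sketch shows one possible form of each rule; the node names, function names and the choice of CRC32 as the hash are assumptions for illustration, not the mechanism used in the Vodafone 360 backend.

```python
import zlib
from itertools import cycle

# Independent JMS nodes, one topic partition each (hypothetical names).
NODES = ["jms-node-1", "jms-node-2", "jms-node-3", "jms-node-4"]


def round_robin(nodes):
    """Partitioning by number of messages: rotate through the nodes."""
    return cycle(nodes)


def shard_for(property_value, nodes):
    """Sharding: map a message property value to a fixed node."""
    # crc32 is stable across processes, unlike Python's built-in hash()
    # for strings, so producers and consumers agree on the mapping.
    digest = zlib.crc32(str(property_value).encode("utf-8"))
    return nodes[digest % len(nodes)]


rotation = round_robin(NODES)
targets = [next(rotation) for _ in range(6)]

# A consumer interested only in certain property values connects only to
# the nodes those values map to.
wanted_nodes = {shard_for(user, NODES) for user in ("user-1", "user-2")}
```

With round robin, a consumer that needs every message has to connect to all nodes. With sharding, producer and consumer must share the same mapping function, which is the role of the sharding-aware library mentioned above.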

JMS architecture in Vodafone 360
Performance
Performance as part of the architecture goals
> Unfortunately at odds with reliability and redundancy
> Fortunately supported by horizontal scalability
– Faster production through more nodes to produce against
– Faster consumption through more nodes serving the consumers' greed
– Possibility to apply virtual topics to also speed up consumption on topics
> Performance is also influenced by the persistence layer

Experience / problems / best practice with ActiveMQ
We can share a few experience-related things with you
> Infamous bugs
> Social aspects
> Troubleshooting
> Testing

Experience / problems / best practice with ActiveMQ
Infamous bugs and problems
Multiplication of topic messages in a NWOB / frozen topics
> Becomes obvious only in a larger NWOB (e.g. eight nodes)
> The number of messages routed around becomes several orders of magnitude higher than the number of messages actually produced (even after accounting for multiple subscribers)
> In conjunction with a concurrency issue that prevents topic messages from being swapped into the temp storage, this can lead to "frozen" topics
Looping thread in DefaultJDBCAdapter
> Due to a concurrency issue, a thread can end up looping on a collection inside the DefaultJDBCAdapter implementation when there is a significant number of persistent messages
> Unfortunately, the looping thread also blocks a monitor that all transport threads need to enter to do their work
> Results in a practical paralysis of the broker: no production or consumption can take place anymore

Experience / problems / best practice with ActiveMQ
Social aspects
It's all the fault of JMS!
> When something is suddenly fupped, especially after a deployment, we all know the responsible entity:
• either JMS
• or the JMS guy
• or both
> Opportunity to make it easy on yourself and refuse to check your own code / config until the JMS guy provides all evidence that JMS is working fine
> But maybe it's actually
• The persistence layer
• The network
• Broken clients
• Eyjafjallajokull

Experience / problems / best practice with ActiveMQ
Troubleshooting: Logs
> Factory settings require some rework to get a decent setup
> Helpful to observe what happens, but logs usually don't tell you what does not happen
> Give basic hints on startup: whether the config is accessible or totally broken, and whether other JMS nodes can be discovered and connected to
> Good support for debugging message routing, especially with the logging interceptor enabled (unfortunately fills the HDD very soon)
> Not very helpful for analyzing technical problems, semantic configuration problems or broken clients

Experience / problems / best practice with ActiveMQ
Troubleshooting: JMX / jconsole
> Essential tool that gathers all JMX probes of ActiveMQ
> Reveals mistakes that occur again and again everywhere (misconfiguration of the broker, broken clients, network issues):
• Too many connections
• No connections
• Too many queue subscriptions
• No subscriptions
• Consumer blocked in message processing
• Multicast not functioning properly (for discovery in a NWOB)

Experience / problems / best practice with ActiveMQ
Troubleshooting: JMX / jconsole
> When brokers reside behind a firewall, configure both JMX ports to fixed values in the XML config:
    <managementContext>
      <managementContext createConnector="true" rmiServerPort="4711" connectorPort="4712" />
    </managementContext>
> The above config makes the broker connectable for jconsole via this URI:
    service:jmx:rmi://host:4711/jndi/rmi://host:4712/jmxrmi
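
When many brokers have to be monitored, the JMX service URL can be derived from the two configured ports instead of being typed by hand. A minimal sketch, assuming the port assignment shown above (the RMI server port first, the connector port in the JNDI part); the helper name `jmx_service_url` is hypothetical.

```python
def jmx_service_url(host, rmi_server_port, connector_port):
    """Build the JMX service URL for a broker with fixed JMX ports."""
    return (
        f"service:jmx:rmi://{host}:{rmi_server_port}"
        f"/jndi/rmi://{host}:{connector_port}/jmxrmi"
    )


url = jmx_service_url("host", rmi_server_port=4711, connector_port=4712)
print(url)
```

The resulting string can be entered as the remote process address in jconsole; both ports have to be open in the firewall.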

Experience / problems / best practice with ActiveMQ
Troubleshooting: ActiveMQ web console
> Webapp in embedded Jetty that presents a subset of the JMX probes
> Reported not to be under regular development
> Doesn't show all essential values
> But provides a nice summary of the state of the destinations

Experience / problems / best practice with ActiveMQ
Troubleshooting: jconsole vs. ActiveMQ web console
> A combination of both is needed
> Recommended: a webapp that provides the JMX probes in an overview of all JMS nodes in an environment
• That saves you from stepping through several windows of jconsole
• Or several browser tabs of the web console
• But includes all important values at a glance

Experience / problems / best practice with ActiveMQ
Troubleshooting: Thread dumps
> Although in most cases no detectable deadlock occurs, a thread dump can immediately highlight what is going wrong, especially for concurrency issues
> Run $> kill -3 <pid> several times, 20 to 30 seconds apart
> Analyze with tda and look for long-running threads, or for monitors that are never released while everybody else is waiting for them
> E.g. the issue with the DefaultJDBCAdapter is revealed by a thread dump
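
Taking several dumps at a fixed interval is easy to script. The sketch below is one way to do it on a Unix-like system, where `kill -3` corresponds to `SIGQUIT`; the function name and its parameters are hypothetical, and the `send` and `sleep` arguments exist only so the loop can be exercised without signalling a real process.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_seconds=25,
                         send=os.kill, sleep=time.sleep):
    """Send SIGQUIT (kill -3) to a JVM `count` times, pausing in between."""
    for i in range(count):
        # The JVM reacts by printing a thread dump to its standard output.
        send(pid, signal.SIGQUIT)
        if i < count - 1:
            sleep(interval_seconds)


# Dry run with stand-ins that only record what would happen.
calls = []
request_thread_dumps(
    1234,
    send=lambda pid, sig: calls.append(("kill", pid, sig)),
    sleep=lambda seconds: calls.append(("sleep", seconds)),
)
```

Threads that show the same stack in every dump, or a monitor held by the same thread throughout, are the first candidates to look at in tda.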

Experience / problems / best practice with ActiveMQ
Troubleshooting: tcpdump
> For certain cases where certain forces still need certain proof of whether JMS messages go over the wire or not
> Although JMX and thread dumps usually reveal everything already

Experience / problems / best practice with ActiveMQ
Testing
> Strongly recommended: test the JMS infrastructure thoroughly and separately
> Tests should cover both non-functional and pseudo-functional aspects:
• Test throughput
• Test reliability
• Considering the actual destination infrastructure
• Considering the estimated message load
• With different scenarios regarding production and consumption rate
> Expected output:
• Hints on adjusting the broker config
• Hints on adjusting the producer / consumer config
• Hints on adjusting the persistence config
• Hints on adjusting the network infrastructure
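
A throughput test needs little more than a timed send loop. The following is a minimal sketch and not a complete test harness: `measure_throughput` is a hypothetical helper, and `send` is whatever callable hands one message to your JMS client, which is not shown here. The `clock` parameter only makes the timing replaceable.

```python
import time


def measure_throughput(send, message_count, clock=time.perf_counter):
    """Return messages per second for `message_count` calls to `send`."""
    start = clock()
    for i in range(message_count):
        send(f"test-message-{i}")
    elapsed = clock() - start
    return message_count / elapsed if elapsed > 0 else float("inf")


# Dry run: a list stands in for the producer, a fake clock for real time.
sent = []
ticks = iter([0.0, 2.0])
rate = measure_throughput(sent.append, 100, clock=lambda: next(ticks))
```

Running the same loop against persistent and non-persistent destinations, and with different numbers of parallel producers and consumers, gives the scenario comparison that the list above asks for.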