
Automating Diagnosis of Cellular Radio Access Network Problems

Anand Iyer
October 17, 2017

Transcript

  1. Automating Diagnosis of Cellular
    Radio Access Network Problems
    Anand Iyer ⋆, Li Erran Li ⬩, Ion Stoica ⋆
    ⋆ University of California, Berkeley ⬩ Uber Technologies
    ACM MobiCom 2017

  2. Cellular Radio Access Networks (RAN)

  3. Connect billions of users to the Internet every day…

  4. Cellular RANs
    § Must ensure user satisfaction
    § Subscribers expect a high quality of experience (QoE)
    § Critical component for operators
    § High-impact business for operators
    Operators must ensure optimal
    RAN performance 24x7

  5. Emerging applications demand even more
    stringent performance requirements…

  6. (No text on this slide.)

  7. Ensuring high end-user QoE is hard

  8. Image courtesy: Alcatel-Lucent
    RAN Performance Problems Prevalent

  9. When performance problems occur…
    … users are frustrated
    When users are frustrated, operators lose money

  10. Operators must understand the
    impacting factors and diagnose
    RAN performance problems quickly

  11. Existing RAN Troubleshooting Techniques

  12. Monitor Key Performance Indicators (KPIs)
    Existing RAN Troubleshooting Techniques

  13. Monitor Key Performance Indicators (KPIs)
    Existing RAN Troubleshooting Techniques
    Drops: 0
    Drops: 10
    Drops: 0
    Drops: 1
    Drops: 0
    Drops: 3

  14. Monitor Key Performance Indicators (KPIs)
    Existing RAN Troubleshooting Techniques
    Drops: 0
    Drops: 10
    Drops: 0
    Drops: 1
    Drops: 0
    Drops: 3
    Poor KPI → (mostly manual) root cause analysis
    Drops: 10
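
    A minimal sketch of this monitoring step, assuming a hypothetical per-cell drop log and an illustrative threshold (neither is specified in the deck):

        # Sketch: flag cells whose drop count exceeds an illustrative threshold.
        from collections import Counter

        def flag_poor_cells(drop_events, threshold=5):
            """drop_events: iterable of cell IDs, one entry per dropped connection (assumed format)."""
            drops_per_cell = Counter(drop_events)
            return sorted(cell for cell, n in drops_per_cell.items() if n >= threshold)

        # The cell with 10 drops is flagged for (mostly manual) root-cause analysis.
        print(flag_poor_cells(["cellA"] * 10 + ["cellB"] * 1 + ["cellC"] * 3))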

  15. Root cause analysis involves field trials

  16. {Technicians +
    Equipment} x
    Multiple Trips
    =
    Multi-billion $$$/year
    Root cause analysis involves field trials

  17. {Technicians +
    Equipment} x
    Multiple Trips
    =
    Multi-billion $$$/year
    $22 B spent per year on network
    management & troubleshooting!
    Root cause analysis involves field trials

  18. Can cellular network operators
    automate the diagnosis of RAN
    performance problems?

  19. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.

  20. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.
    § Studied portion of RAN for over a year
    § 13,000+ base stations serving live users

  21. A Typical Week
    [Plot: per-day values over one week (Sun–Sat)]

  22. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.
    § Studied portion of RAN for over a year
    § 13,000+ base stations serving over 2 million users
    § 1000s of trouble tickets

  23. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.
    § Studied portion of RAN for over a year
    § 13,000+ base stations serving over 2 million users
    § 1000s of trouble tickets
    § Significant effort by the operator to resolve them

  24. Existing RAN Troubleshooting
    § Slow & Ineffective
    § Many problems incorrectly diagnosed
    § Source of disagreements
    § Which team should solve this problem?
    § Wasted efforts
    § Known root-causes
    § Recurring problems

  25. Existing RAN Troubleshooting
    § Slow & Ineffective
    § Many problems incorrectly diagnosed
    § Source of disagreements
    § Which team should solve this problem?
    § Wasted efforts
    § Known root-causes
    § Recurring problems
    Need more fine-grained information for diagnosis

  26. Fine-grained Information
    § Logging everything ideal, but impossible

  27. Fine-grained Information
    [LTE architecture diagram: User Equipment (UE) – Base Station (eNodeB) – Serving Gateway (S-GW) – Packet Gateway (P-GW) – Internet, with the Mobility Management Entity (MME) and Home Subscriber Server (HSS) on the control plane; control-plane and data-plane paths shown]
    § Logging everything ideal, but impossible

  28. Fine-grained Information
    [Same LTE architecture diagram, highlighting the radio bearer between UE and eNodeB]
    § Logging everything ideal, but impossible

  29. Fine-grained Information
    [Same diagram, highlighting the radio bearer and the GTP tunnel between eNodeB and the gateways]
    § Logging everything ideal, but impossible

  30. Fine-grained Information
    [Same diagram, highlighting the radio bearer, the GTP tunnel, and the end-to-end EPS bearer between UE and P-GW]
    § Logging everything ideal, but impossible

  31. Fine-grained Information
    § Control plane procedures

  32. RRC Connection Re-establishment
    [Message sequence among UE, eNodeB, and MME: radio link failure detected →
    RRC Connection Re-establishment Request → UE context established →
    RRC Connection Re-establishment Complete → RRC Connection Reconfiguration;
    on supervision timer expiry: UE Context Release Request →
    UE Context Release Command → UE Context Release Complete]
    Fine-grained Information
    § Control plane procedures
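
    As an illustration of how such procedure logs could be mined, a sketch under an assumed per-bearer event format (the message names come from the slide above; the detection rule is an assumption, not the paper's definition):

        # Sketch: count failed RRC connection re-establishments in an assumed per-bearer log.
        def failed_reestablishments(events):
            """events: ordered procedure names for one bearer (assumed log format)."""
            failures, pending = 0, False
            for ev in events:
                if ev == "RRC Connection Re-establishment Request":
                    pending = True
                elif ev == "RRC Connection Re-establishment Complete":
                    pending = False
                elif ev == "UE Context Release Complete" and pending:
                    failures += 1   # request never completed before the UE context was released
                    pending = False
            return failures

        print(failed_reestablishments([
            "RRC Connection Re-establishment Request",
            "UE Context Release Request",
            "UE Context Release Command",
            "UE Context Release Complete",
        ]))  # -> 1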

  33. Fine-grained Information
    § Control plane procedures
    Bearer traces and control plane procedure logs
    provide necessary fine-grained information for
    efficient diagnosis

  34. Fine-grained Information
    § Control plane procedures
    Bearer traces and control plane procedure logs
    provide necessary fine-grained information for
    efficient diagnosis
    A 3-step approach to leveraging rich bearer-level
    traces for RAN performance diagnosis

  35. Step 1: Isolate Problems to RAN
    End-User QoE

  36. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related

  37. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Internet

  38. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet

  39. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet
    RAN Core

  40. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet
    RAN Core

  41. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet
    RAN Core
    Leverage existing trouble ticket system

  42. Step 2: Classify RAN Problems

  43. Step 2: Classify RAN Problems
    Coverage

  44. Step 2: Classify RAN Problems
    Coverage Interference

  45. Step 2: Classify RAN Problems
    Coverage Interference Congestion

  46. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration

  47. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes

  48. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes
    Others

  49. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes
    Others

  50. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes
    Others
    RSRP RSRQ CQI SINR eNodeB Params
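
    A sketch of what this classification step could look like, using illustrative thresholds and only a subset of the classes (the actual rules and cut-offs are not given in the deck):

        # Sketch: assign a bearer to a problem class from its radio measurements.
        # Thresholds and rule order are illustrative, not taken from the paper.
        def classify_ran_problem(rsrp_dbm, rsrq_db, sinr_db, prb_utilization):
            if rsrp_dbm < -115:
                return "Coverage"        # reference signal power is weak
            if rsrq_db < -15 or sinr_db < 0:
                return "Interference"    # signal present but quality is poor
            if prb_utilization > 0.9:
                return "Congestion"      # cell resources nearly exhausted
            return "Others"              # Configuration / Network State Changes need other inputs

        print(classify_ran_problem(rsrp_dbm=-120, rsrq_db=-10, sinr_db=5, prb_utilization=0.4))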

  51. Step 3: Model KPI at Bearer Level

  52. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility

  53. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features

  54. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Event Metrics

  55. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Event Metrics Non-Event/Volume Metrics

  56. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Classification Models
    Event Metrics Non-Event/Volume Metrics

  57. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Classification Models Regression Models
    Event Metrics Non-Event/Volume Metrics
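
    A sketch of this split, with scikit-learn as an assumed toolchain and an illustrative feature subset (the deck does not name a library): event metrics such as drops feed a classification model, volume metrics such as throughput feed a regression model, both over per-bearer features.

        # Sketch: event metric -> classification model, volume metric -> regression model.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

        # Per-bearer features: [uplink SINR (dB), RSRQ (dB), CQI] -- an illustrative subset.
        X = np.array([[-9.0, -12.0, 7.0],
                      [-13.0, -18.0, 4.0],
                      [-3.0, -9.0, 11.0]])
        dropped = np.array([0, 1, 0])                  # event metric: drop / success
        throughput_mbps = np.array([12.0, 1.5, 30.0])  # volume metric

        drop_model = DecisionTreeClassifier(max_depth=3).fit(X, dropped)
        tput_model = DecisionTreeRegressor(max_depth=3).fit(X, throughput_mbps)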

  58. Bearer-level Modeling for Event Metrics
    § Connection Drops
    § Key retainability
    metric
    § How well network can
    complete
    connections

  59. [Plot: Drop Rate (%) vs Day (Sun–Sat)]
    Bearer-level Modeling for Event Metrics
    § Connection Drops
    § Key retainability
    metric
    § How well network can
    complete
    connections

  60. [Plot: Drop Rate (%) vs Day (Sun–Sat)]
    Bearer-level Modeling for Event Metrics
    § Connection Drops
    § Key retainability
    metric
    § How well network can
    complete
    connections
    Build decision trees to explain drops
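
    A sketch of this step, assuming scikit-learn and synthetic bearer features: fit a tree on drop/success labels and print its rules, which yields thresholds of the kind shown on the next slides.

        # Sketch: learn an explainable decision tree for connection drops (synthetic data).
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, export_text

        rng = np.random.default_rng(0)
        X = np.column_stack([rng.uniform(-20, 20, 500),   # uplink SINR (dB)
                             rng.uniform(-20, -5, 500),   # RSRQ (dB)
                             rng.uniform(0, 15, 500)])    # CQI
        # Synthetic labels: drops concentrate at low uplink SINR or poor RSRQ.
        y = ((X[:, 0] < -11.75) | (X[:, 1] < -16.5)).astype(int)

        tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
        print(export_text(tree, feature_names=["uplink_sinr", "rsrq", "cqi"]))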

  61. [Decision tree: splits on RSRQ availability, Uplink SINR > −11.75, RSRQ > −16.5,
    Uplink SINR > −5.86, and CQI > 5.875; leaves labeled Success or Drop]
    Bearer-level Modeling for Event Metrics

  62. (Same decision tree as slide 61)
    Bearer-level Modeling for Event Metrics

  63. (Same decision tree as slide 61)
    Bearer-level Modeling for Event Metrics

  64. Findings from Call Drop Analysis
    [Histogram: Probability vs RSRQ (dB) — Reference Signal Quality at the UE]

  65. Findings from Call Drop Analysis
    [Histograms: Probability vs RSRQ (dB) — Reference Signal Quality at the UE;
    Probability vs CQI — Channel Quality at the UE]

  66. Findings from Call Drop Analysis
    [Histograms: Probability vs RSRQ (dB) — Reference Signal Quality at the UE;
    Probability vs CQI — Channel Quality at the UE]
    Shapes not identical (should be, ideally)

  67. [CDF: Probability vs SINR Difference (dB), −5 to 20 dB, for ρ = 1/3, 1, 5/3]
    Findings from Call Drop Analysis

  68. [CDF: Probability vs SINR Difference (dB), −5 to 20 dB, for ρ = 1/3, 1, 5/3]
    Finding: P-CQI is not CRC protected!
    Findings from Call Drop Analysis

  69. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about

  70. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    Paper excerpt:
    "…diversity, a 29% overhead for each PRB exists on average because of
    resources allocated to the physical downlink control channel, physical
    broadcast channel and reference signals. The physical layer has a
    BLER target of 10%.
    Account for MAC Sub-layer Retransmissions: The MAC sub-layer
    performs retransmissions. We denote the MAC efficiency as η_MAC.
    It is computed as the ratio of total first transmissions over total
    transmissions. We compute η_MAC using our traces. The predicted
    throughput due to transmit diversity is calculated as:
    tput_RLC,div = (1.0 − η_MAC) × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
    PRB_div denotes the total PRBs allocated for transmit diversity.
    TxTime_div is the total transmission time for transmit diversity."

  71. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callout: η_MAC = MAC efficiency)

  72. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callouts: η_MAC = MAC efficiency, PRB_div = # physical resource blocks)

  73. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callouts: η_MAC = MAC efficiency, PRB_div = # physical resource blocks, TxTime_div = transmission time)

  74. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callouts: η_MAC = MAC efficiency, PRB_div = # physical resource blocks, TxTime_div = transmission time, SINR_div = link-adapted SINR)
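
    A small sketch of the predicted-throughput formula reproduced on slide 70 (the dB-to-linear SINR conversion and the example inputs are assumptions; 0.9, 0.29, and 180 come from the excerpt's BLER target, PRB overhead, and the 180 kHz bandwidth of a PRB):

        # Sketch: predicted RLC throughput for transmit diversity, per the formula on slide 70.
        import math

        def predicted_tput_rlc_div(eta_mac, prb_div, sinr_div_db, tx_time_div):
            """eta_mac: MAC efficiency; prb_div: PRBs for transmit diversity;
            sinr_div_db: link-adapted SINR in dB (conversion to linear is an assumption);
            tx_time_div: total transmission time for transmit diversity."""
            sinr_linear = 10 ** (sinr_div_db / 10.0)
            bits = (1.0 - eta_mac) * 0.9 * (1 - 0.29) * 180 * prb_div * math.log2(1 + sinr_linear)
            return bits / tx_time_div

        print(predicted_tput_rlc_div(eta_mac=0.1, prb_div=1000, sinr_div_db=10.0, tx_time_div=0.5))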

  75. [CDF: Probability (0.86–1.0) vs Prediction Error (%), 0–100]
    Bearer-level Modeling for Volume Metrics

  76. [CDF: Probability (0.86–1.0) vs Prediction Error (%), 0–100]
    Bearer-level Modeling for Volume Metrics
    Works well in most scenarios

  77. Findings from Throughput Analysis
    § Problematic for some cells

  78. Findings from Throughput Analysis
    § Problematic for some cells
    § To understand why, computed loss of efficiency

  79. Findings from Throughput Analysis
    [Distribution: Probability vs SINR Loss (dB), 0–25]
    § Problematic for some cells
    § To understand why, computed loss of efficiency
    SINR implied by actual throughput vs. SINR computed from parameters

  80. Findings from Throughput Analysis
    [Distribution: Probability vs SINR Loss (dB), 0–25]
    § Problematic for some cells
    § To understand why, computed loss of efficiency
    SINR implied by actual throughput vs. SINR computed from parameters
    Finding: Link adaptation is slow to adapt!
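
    One way to compute that loss, sketched by inverting the slide-70 throughput formula (this inversion and the reuse of its constants are assumptions, not code from the paper):

        # Sketch: SINR loss = SINR computed from parameters minus SINR implied by actual throughput.
        import math

        def implied_sinr_db(actual_tput, eta_mac, prb_div, tx_time_div):
            # Invert the slide-70 formula for SINR (same illustrative constants and units).
            spectral_eff = actual_tput * tx_time_div / ((1.0 - eta_mac) * 0.9 * (1 - 0.29) * 180 * prb_div)
            return 10 * math.log10(2 ** spectral_eff - 1)

        def sinr_loss_db(param_sinr_db, actual_tput, eta_mac, prb_div, tx_time_div):
            return param_sinr_db - implied_sinr_db(actual_tput, eta_mac, prb_div, tx_time_div)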

  81. Can cellular network operators
    automate the diagnosis of RAN
    performance problems?

  82. Towards Full Automation
    § Our methodology is amenable to automation
    § Models can be built on-demand automatically

  83. Usefulness to the Operator
    [Plot: Drop Rate (%) vs Day (Sun–Sat)]

  84. Usefulness to the Operator
    [Plot: Drop Rate (%) (0.3–0.7) vs Day (Sun–Sat); two curves: Total and Explained]

  85. Towards Full Automation
    § Our methodology is amenable to automation
    § Models can be built on-demand automatically
    § Full automation for next generation networks:
    § Need to build 1000s of models
    § Need to keep the models updated
    § Need real-time diagnosis

  86. We’ve made some progress towards this…
    § Cells exhibit performance similarity
    May be able to group cells by performance
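
    A sketch of what such grouping could look like; k-means over per-cell KPI vectors is an illustrative choice, and the features and data below are synthetic:

        # Sketch: group cells by their KPI vectors (synthetic data, illustrative features).
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(1)
        # One row per cell: [drop rate (%), mean throughput (Mbps), mean SINR (dB)]
        cell_kpis = np.vstack([rng.normal([0.4, 20.0, 12.0], 0.5, (50, 3)),
                               rng.normal([2.0, 5.0, 3.0], 0.5, (50, 3))])
        groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cell_kpis)
        print(np.bincount(groups))   # cells per performance group; models could be built per group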

  87. Summary
    § Experience working with a tier-1 operator
    § 2 million users, over a period of 1 year
    § Leveraging bearer-level traces could be the
    key to automating RAN diagnosis
    § Proposed bearer-level modeling
    § Unearthed several insights
    § Fully automated diagnosis needs more effort
