Slide 1

Slide 1 text

Automating Diagnosis of Cellular Radio Access Network Problems. Anand Iyer⋆, Li Erran Li⬩, Ion Stoica⋆ (⋆University of California, Berkeley; ⬩Uber Technologies). ACM MobiCom 2017

Slide 2

Slide 2 text

Cellular Radio Access Networks (RAN)

Slide 3

Slide 3 text

Connect billions of users to the Internet every day…

Slide 4

Slide 4 text

Cellular RANs § Must provide user satisfaction § Subscribers expect high quality of experience (QoE) § Critical component for operators § High-impact business for operators. Operators must ensure optimal RAN performance 24x7

Slide 5

Slide 5 text

Emerging applications demand even more stringent performance requirements…

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Ensuring high end-user QoE is hard

Slide 8

Slide 8 text

RAN Performance Problems Prevalent (Image courtesy: Alcatel-Lucent)

Slide 9

Slide 9 text

When performance problems occur… users are frustrated. When users are frustrated, operators lose money.

Slide 10

Slide 10 text

Operators must understand the impacting factors and diagnose RAN performance problems quickly

Slide 11

Slide 11 text

Existing RAN Troubleshooting Techniques

Slide 12

Slide 12 text

Existing RAN Troubleshooting Techniques § Monitor Key Performance Indicators (KPIs)

Slide 13

Slide 13 text

Existing RAN Troubleshooting Techniques § Monitor Key Performance Indicators (KPIs) — Drops: 0, Drops: 10, Drops: 0, Drops: 1, Drops: 0, Drops: 3

Slide 14

Slide 14 text

Existing RAN Troubleshooting Techniques § Monitor Key Performance Indicators (KPIs) — Drops: 0, Drops: 10, Drops: 0, Drops: 1, Drops: 0, Drops: 3 § Poor KPI → (mostly manual) root cause analysis (Drops: 10)
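To make the KPI-monitoring workflow above concrete, here is a minimal sketch of flagging cells whose drop counts cross an alerting threshold; the threshold value, cell identifiers, and data layout are illustrative assumptions, not details from the talk.

```python
# Hypothetical sketch: flag cells whose drop KPI exceeds a threshold, which in
# current practice triggers a (mostly manual) root-cause analysis.
DROP_THRESHOLD = 5  # illustrative threshold, not from the talk

def flag_poor_kpi_cells(drop_counts):
    """drop_counts: dict mapping cell id -> connection drops in the last interval."""
    return [cell for cell, drops in drop_counts.items() if drops >= DROP_THRESHOLD]

if __name__ == "__main__":
    counts = {"cell-A": 0, "cell-B": 10, "cell-C": 1, "cell-D": 3}
    print(flag_poor_kpi_cells(counts))  # ['cell-B'] -> escalate for diagnosis
```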

Slide 15

Slide 15 text

Root cause analysis involves field trials

Slide 16

Slide 16 text

{Technicians + Equipment} x Multiple Trips = Multi-billion $$$/year Root cause analysis involves field trials

Slide 17

Slide 17 text

{Technicians + Equipment} x Multiple Trips = Multi-billion $$$/year $22 B spent per year on network management & troubleshooting! Root cause analysis involves field trials

Slide 18

Slide 18 text

Can cellular network operators automate the diagnosis of RAN performance problems?

Slide 19

Slide 19 text

Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S.

Slide 20

Slide 20 text

Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied portion of RAN for over a year § 13,000+ base stations serving live users

Slide 21

Slide 21 text

A Typical Week [Figure: weekly time-series plot; axis labels not recoverable from the export]

Slide 22

Slide 22 text

Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied portion of RAN for over a year § 13,000+ base stations serving over 2 million users § 1000s of trouble tickets

Slide 23

Slide 23 text

Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied portion of RAN for over a year § 13,000+ base stations serving over 2 million users § 1000s of trouble tickets § Significant effort by the operator to resolve them

Slide 24

Slide 24 text

Existing RAN Troubleshooting § Slow & Ineffective § Many problems incorrectly diagnosed § Source of disagreements § Which team should solve this problem? § Wasted effort § Known root causes § Recurring problems

Slide 25

Slide 25 text

Existing RAN Troubleshooting § Slow & Ineffective § Many problems incorrectly diagnosed § Source of disagreements § Which team should solve this problem? § Wasted effort § Known root causes § Recurring problems Need more fine-grained information for diagnosis

Slide 26

Slide 26 text

Fine-grained Information § Logging everything ideal, but impossible

Slide 27

Slide 27 text

Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible

Slide 28

Slide 28 text

Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible Radio bearer

Slide 29

Slide 29 text

Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible Radio bearer GTP tunnel

Slide 30

Slide 30 text

Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible Radio bearer GTP tunnel EPS Bearer

Slide 31

Slide 31 text

Fine-grained Information § Control plane procedures

Slide 32

Slide 32 text

Fine-grained Information § Control plane procedures — Example: RRC Connection Re-establishment (UE, eNodeB, MME): Radio link failure detected → RRC Connection Re-establishment Request → RRC Connection Re-establishment (UE context established) → RRC Connection Re-establishment Complete → RRC Connection Reconfiguration. If the supervision timer expires: UE Context Release Request → UE Context Release Command → UE Context Release Complete.
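As a rough illustration of how such control-plane procedure logs can be mined, the sketch below counts re-establishment attempts and completions in a per-UE message sequence; the flat-list log format is an assumption, and only the message names shown on the slide are used.

```python
# Hypothetical sketch: count radio-link-failure-triggered re-establishment
# attempts and completions in a per-UE control-plane log. Message names follow
# the sequence on the slide; the flat-list log format is an illustrative assumption.

REQUEST = "RRC Connection Re-establishment Request"
COMPLETE = "RRC Connection Re-establishment Complete"

def reestablishment_outcomes(messages):
    """messages: ordered control-plane message names observed for one UE."""
    attempts = sum(1 for m in messages if m == REQUEST)
    completions = sum(1 for m in messages if m == COMPLETE)
    return attempts, completions

if __name__ == "__main__":
    log = [REQUEST, "RRC Connection Re-establishment", COMPLETE,
           "RRC Connection Reconfiguration"]
    print(reestablishment_outcomes(log))  # (1, 1): the radio link failure was recovered
```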

Slide 33

Slide 33 text

Fine-grained Information § Control plane procedures Bearer traces and control plane procedure logs provide necessary fine-grained information for efficient diagnosis

Slide 34

Slide 34 text

Fine-grained Information § Control plane procedures Bearer traces and control plane procedure logs provide necessary fine-grained information for efficient diagnosis 3 step approach to leverage rich bearer-level traces for RAN performance diagnosis

Slide 35

Slide 35 text

Step 1: Isolate Problems to RAN End-User QoE

Slide 36

Slide 36 text

Step 1: Isolate Problems to RAN End-User QoE Client Related

Slide 37

Slide 37 text

Step 1: Isolate Problems to RAN End-User QoE Client Related Internet

Slide 38

Slide 38 text

Step 1: Isolate Problems to RAN End-User QoE Client Related Cellular Network Internet

Slide 39

Slide 39 text

Step 1: Isolate Problems to RAN End-User QoE Client Related Cellular Network Internet RAN Core

Slide 40

Slide 40 text

Step 1: Isolate Problems to RAN End-User QoE Client Related Cellular Network Internet RAN Core

Slide 41

Slide 41 text

Step 1: Isolate Problems to RAN End-User QoE Client Related Cellular Network Internet RAN Core Leverage existing trouble ticket system

Slide 42

Slide 42 text

Step 2: Classify RAN Problems

Slide 43

Slide 43 text

Step 2: Classify RAN Problems Coverage

Slide 44

Slide 44 text

Step 2: Classify RAN Problems Coverage Interference

Slide 45

Slide 45 text

Step 2: Classify RAN Problems Coverage Interference Congestion

Slide 46

Slide 46 text

Step 2: Classify RAN Problems Coverage Interference Congestion Configuration

Slide 47

Slide 47 text

Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network State Changes

Slide 48

Slide 48 text

Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network State Changes Others

Slide 49

Slide 49 text

Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network State Changes Others

Slide 50

Slide 50 text

Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network State Changes Others RSRP RSRQ CQI SINR eNodeB Params
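One hedged way to picture the classification step is a rule that sorts a bearer into these bins from its radio measurements; the thresholds below are illustrative assumptions and not the paper's actual classification logic.

```python
# Hypothetical sketch: assign a bearer to one of the problem classes above using
# its radio measurements. All thresholds are illustrative assumptions, not the
# rules from the paper; "Network State Changes" would additionally need a
# history of eNodeB parameter changes, which is omitted here.

def classify_bearer(rsrp_dbm, rsrq_db, sinr_db, cqi, prb_utilization):
    if rsrp_dbm < -110:                  # very weak reference signal power
        return "Coverage"
    if rsrq_db < -15 or sinr_db < 0:     # signal present, but quality is poor
        return "Interference"
    if prb_utilization > 0.9:            # cell resources nearly exhausted
        return "Congestion"
    if cqi < 4:                          # poor reported quality despite decent radio conditions
        return "Configuration"
    return "Others"

# Example: decent RSRP but poor RSRQ/SINR points at interference.
print(classify_bearer(rsrp_dbm=-95, rsrq_db=-17, sinr_db=-2, cqi=6, prb_utilization=0.4))
```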

Slide 51

Slide 51 text

Step 3: Model KPI at Bearer Level

Slide 52

Slide 52 text

Step 3: Model KPI at Bearer Level Accessibility Retainability PHY layer Throughput Quality Traffic Volume Connection Counts Mobility

Slide 53

Slide 53 text

Step 3: Model KPI at Bearer Level Accessibility Retainability PHY layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features

Slide 54

Slide 54 text

Step 3: Model KPI at Bearer Level Accessibility Retainability PHY layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Event Metrics

Slide 55

Slide 55 text

Step 3: Model KPI at Bearer Level Accessibility Retainability PHY layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Event Metrics Non-Event/Volume Metrics

Slide 56

Slide 56 text

Step 3: Model KPI at Bearer Level Accessibility Retainability PHY layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Classification Models Event Metrics Non-Event/Volume Metrics

Slide 57

Slide 57 text

Step 3: Model KPI at Bearer Level Accessibility Retainability PHY layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Classification Models Regression Models Event Metrics Non-Event/Volume Metrics

Slide 58

Slide 58 text

Bearer-level Modeling for Event Metrics § Connection Drops § Key retainability metric § How well the network sustains connections to completion

Slide 59

Slide 59 text

Bearer-level Modeling for Event Metrics § Connection Drops § Key retainability metric § How well the network sustains connections to completion [Figure: plot not recoverable from the export]

Slide 60

Slide 60 text

Bearer-level Modeling for Event Metrics § Connection Drops § Key retainability metric § How well the network sustains connections to completion [Figure: plot not recoverable from the export] Build decision trees to explain drops

Slide 61

Slide 61 text

Bearer-level Modeling for Event Metrics [Decision tree over bearer-level features: splits on Uplink SINR > −11.75, RSRQ available?, RSRQ > −16.5, Uplink SINR > −5.86, and CQI > 5.875; leaves labeled Success or Drop]

Slide 62

Slide 62 text

Bearer-level Modeling for Event Metrics [Decision tree over bearer-level features: splits on Uplink SINR > −11.75, RSRQ available?, RSRQ > −16.5, Uplink SINR > −5.86, and CQI > 5.875; leaves labeled Success or Drop]

Slide 63

Slide 63 text

Bearer-level Modeling for Event Metrics [Decision tree over bearer-level features: splits on Uplink SINR > −11.75, RSRQ available?, RSRQ > −16.5, Uplink SINR > −5.86, and CQI > 5.875; leaves labeled Success or Drop]
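The decision-tree modeling can be sketched with scikit-learn as below; the feature set mirrors the tree on the slide, but the training data is synthetic and the resulting splits are not the paper's.

```python
# Hypothetical sketch: fit a decision tree that explains connection drops from
# bearer-level radio features, in the spirit of the tree shown on the slide.
# The data here is synthetic; the paper trains on real bearer traces.
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["uplink_sinr", "rsrq", "cqi"]

# Each row: [uplink SINR (dB), RSRQ (dB), CQI]; label 1 = drop, 0 = success.
X = [[-14.0, -18.0, 3.0], [-3.0, -9.0, 10.0], [-12.5, -17.0, 4.0],
     [2.0, -8.0, 12.0], [-6.5, -16.8, 5.0], [1.0, -10.0, 9.0]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=FEATURES))  # human-readable split rules
```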

Slide 64

Slide 64 text

Findings from Call Drop Analysis § Reference Signal Quality at UE [Figure: distribution plot; axis values not recoverable from the export]

Slide 65

Slide 65 text

Findings from Call Drop Analysis § Reference Signal Quality at UE § Channel Quality at UE [Figures: distribution plots; axis values not recoverable from the export]

Slide 66

Slide 66 text

Findings from Call Drop Analysis § Reference Signal Quality at UE § Channel Quality at UE — Shapes not identical (should be, ideally) [Figures: distribution plots; axis values not recoverable from the export]

Slide 67

Slide 67 text

Findings from Call Drop Analysis [Figure: Probability vs. SINR Difference (dB), curves for ρ=1/3, ρ=1, ρ=5/3]

Slide 68

Slide 68 text

Findings from Call Drop Analysis [Figure: Probability vs. SINR Difference (dB), curves for ρ=1/3, ρ=1, ρ=5/3] — Insight: P-CQI is not CRC protected!

Slide 69

Slide 69 text

Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about

Slide 70

Slide 70 text

Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about
Excerpt from the paper: For transmit diversity, a 29% overhead exists for each PRB on average because of resources allocated to the physical downlink control channel, physical broadcast channel, and reference signals. The physical layer has a BLER target of 10%.
Account for MAC sub-layer retransmissions: the MAC sub-layer performs retransmissions. We denote the MAC efficiency as η_MAC, computed from our traces as the ratio of total first transmissions over total transmissions. The predicted throughput due to transmit diversity is calculated as:
tput_RLC,div = η_MAC × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
where PRB_div denotes the total PRBs allocated for transmit diversity, and TxTime_div is the total transmission time for transmit diversity.

Slide 71

Slide 71 text

Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about
tput_RLC,div = η_MAC × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
Callout: η_MAC — MAC efficiency

Slide 72

Slide 72 text

Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about
tput_RLC,div = η_MAC × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
Callouts: η_MAC — MAC efficiency; PRB_div — # physical resource blocks

Slide 73

Slide 73 text

Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about
tput_RLC,div = η_MAC × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
Callouts: η_MAC — MAC efficiency; PRB_div — # physical resource blocks; TxTime_div — transmission time

Slide 74

Slide 74 text

Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about
tput_RLC,div = η_MAC × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
Callouts: η_MAC — MAC efficiency; PRB_div — # physical resource blocks; TxTime_div — transmission time; SINR_div — link-adapted SINR
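A literal transcription of the throughput model into code may help; the reading of the MAC term as a multiplicative efficiency, the 180 kHz-per-PRB bandwidth, and the use of a linear SINR ratio follow the reconstruction above and should be treated as assumptions rather than a verified re-implementation of the paper.

```python
# Sketch of the predicted RLC throughput for transmit diversity, transcribed from
# the model excerpt above. Unit handling (180 kHz per PRB, SINR as a linear ratio)
# is an assumption made for illustration.
import math

def predicted_tput_rlc_div(eta_mac, prb_div, sinr_div, tx_time_div):
    """eta_mac: MAC efficiency (first transmissions / total transmissions)
    prb_div: total PRBs allocated for transmit diversity
    sinr_div: link-adapted SINR for transmit diversity (linear ratio)
    tx_time_div: total transmission time for transmit diversity (seconds)"""
    return (eta_mac                    # MAC sub-layer retransmission efficiency
            * 0.9                      # 10% physical-layer BLER target
            * (1 - 0.29)               # 29% per-PRB control/broadcast/reference-signal overhead
            * 180e3 * prb_div          # 180 kHz of bandwidth per physical resource block
            * math.log2(1 + sinr_div)  # Shannon-style spectral efficiency
            ) / tx_time_div

# Illustrative numbers only:
print(predicted_tput_rlc_div(eta_mac=0.92, prb_div=500, sinr_div=8.0, tx_time_div=0.25))
```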

Slide 75

Slide 75 text

Bearer-level Modeling for Volume Metrics [Figure: Probability vs. Prediction Error (%)]

Slide 76

Slide 76 text

Bearer-level Modeling for Volume Metrics [Figure: Probability vs. Prediction Error (%)] — Works well in most scenarios

Slide 77

Slide 77 text

Findings from Throughput Analysis § Problematic for some cells

Slide 78

Slide 78 text

Findings from Throughput Analysis § Problematic for some cells § To understand why, we computed the loss of efficiency

Slide 79

Slide 79 text

Findings from Throughput Analysis § Problematic for some cells § To understand why, we computed the loss of efficiency: SINR implied by the actual throughput vs. SINR computed from the reported channel parameters [Figure: Probability vs. SINR Loss (dB)]

Slide 80

Slide 80 text

Findings from Throughput Analysis § Problematic for some cells § To understand why, we computed the loss of efficiency: SINR implied by the actual throughput vs. SINR computed from the reported channel parameters [Figure: Probability vs. SINR Loss (dB)] — Insight: Link adaptation is slow to adapt!

Slide 81

Slide 81 text

Can cellular network operators automate the diagnosis of RAN performance problems?

Slide 82

Slide 82 text

Towards Full Automation § Our methodology is amenable to automation § Models can be built on-demand automatically

Slide 83

Slide 83 text

Usefulness to the Operator [Figure: drop rate per day; values not recoverable from the export]

Slide 84

Slide 84 text

Usefulness to the Operator [Figure: Drop Rate (%) by day (Sun–Sat), comparing Total vs. Explained drops]

Slide 85

Slide 85 text

Towards Full Automation § Our methodology is amenable to automation § Models can be built on-demand automatically § Full automation for next generation networks: § Need to build 1000s of models § Need to keep the models updated § Need real-time diagnosis

Slide 86

Slide 86 text

We've made some progress towards this… § Cells exhibit performance similarity → we may be able to group cells by performance
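A possible shape for that grouping step is clustering per-cell KPI vectors, sketched below with k-means; the features, values, and cluster count are illustrative assumptions, not the talk's method.

```python
# Hypothetical sketch: group cells with similar performance so one model can
# cover a whole group. Features and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Per-cell feature vector: [drop rate (%), mean throughput (Mbps), mean SINR (dB)]
cells = np.array([[0.4, 12.0, 9.0],
                  [0.5, 11.5, 8.5],
                  [2.1, 3.0, 1.0],
                  [1.9, 3.5, 0.5]])

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cells)
print(groups)  # cells with similar behavior share a label, e.g. [0 0 1 1]
```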

Slide 87

Slide 87 text

Summary § Experience working with a tier-1 operator § 2 million users, over a period of 1 year § Leveraging bearer-level traces could be the key to automating RAN diagnosis § Proposed bearer-level modeling § Unearthed several insights § Fully automated diagnosis needs more effort