Automating Diagnosis of Cellular Radio Access Network Problems

Anand Iyer
October 17, 2017


Transcript

  1. Automating Diagnosis of Cellular Radio Access Network Problems. Anand Iyer ⋆, Li Erran Li ⬩, Ion Stoica ⋆ (⋆ University of California, Berkeley; ⬩ Uber Technologies). ACM MobiCom 2017
  2. Cellular Radio Access Networks (RAN)

  3. Connect billions of users to the Internet every day…

  4. Cellular RANs § Must provide user satisfaction § Subscribers expect high quality of experience (QoE) § Critical component for operators § High-impact business for the operators. Operators must ensure optimal RAN performance 24x7
  5. Emerging applications demand even more stringent performance requirements…

  6. None
  7. Ensuring high end-user QoE is hard

  8. RAN Performance Problems Prevalent (image courtesy: Alcatel-Lucent)

  9. When performance problems occur… users are frustrated. When users are frustrated, operators lose money
  10. Operators must understand the impacting factors and diagnose RAN performance

    problems quickly
  11. Existing RAN Troubleshooting Techniques

  12. Monitor Key Performance Indicators (KPIs) Existing RAN Troubleshooting Techniques

  13. Monitor Key Performance Indicators (KPIs) Existing RAN Troubleshooting Techniques Drops:

    0 Drops: 10 Drops: 0 Drops: 1 Drops: 0 Drops: 3
  14. Monitor Key Performance Indicators (KPIs) Existing RAN Troubleshooting Techniques Drops:

    0 Drops: 10 Drops: 0 Drops: 1 Drops: 0 Drops: 3 Poor KPI → (mostly manual) root cause analysis Drops: 10
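Concretely, the existing workflow boils down to thresholding per-cell counters like the drop counts above and handing flagged cells to engineers for (mostly manual) root cause analysis. A minimal sketch of that KPI-flagging loop, with hypothetical cell names, counts, and alarm threshold:

```python
# Sketch of today's KPI-based monitoring: aggregate per-cell drop counts and
# flag cells whose drop rate crosses a threshold for manual root cause
# analysis. Cell names, counts, and the 0.5% threshold are hypothetical.
connection_attempts = {"cell_A": 1200, "cell_B": 950, "cell_C": 1100}
connection_drops    = {"cell_A": 0,    "cell_B": 10,  "cell_C": 3}

DROP_RATE_THRESHOLD = 0.005  # flag cells dropping more than 0.5% of connections

def cells_needing_rca(attempts, drops, threshold=DROP_RATE_THRESHOLD):
    """Return cells whose drop-rate KPI is poor enough to trigger RCA."""
    flagged = []
    for cell, n_attempts in attempts.items():
        rate = drops.get(cell, 0) / n_attempts
        if rate > threshold:
            flagged.append((cell, rate))
    return flagged

print(cells_needing_rca(connection_attempts, connection_drops))
# e.g. [('cell_B', 0.0105...)] -> handed off to engineers / field trials
```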
  15. Root cause analysis involves field trials

  16. {Technicians + Equipment} x Multiple Trips = Multi-billion $$$/year Root

    cause analysis involves field trials
  17. {Technicians + Equipment} x Multiple Trips = Multi-billion $$$/year $22

    B spent per year on network management & troubleshooting! Root cause analysis involves field trials
  18. Can cellular network operators automate the diagnosis of RAN performance

    problems?
  19. Our Experience § Working with a cellular network operator §

    Tier-1 operator in the U.S.
  20. Our Experience § Working with a cellular network operator §

    Tier-1 operator in the U.S. § Studied portion of RAN for over a year § 13,000+ base stations serving live users
  21. A Typical Week [figure: chart; axis labels not recoverable from the transcript]
  22. Our Experience § Working with a cellular network operator §

    Tier-1 operator in the U.S. § Studied portion of RAN for over a year § 13,000+ base stations serving over 2 million users § 1000s of trouble tickets
  23. Our Experience § Working with a cellular network operator §

    Tier-1 operator in the U.S. § Studied portion of RAN for over a year § 13,000+ base stations serving over 2 million users § 1000s of trouble tickets § Significant effort by the operator to resolve them
  24. Existing RAN Troubleshooting § Slow & Ineffective § Many problems

    incorrectly diagnosed § Source of disagreements § Which team should solve this problem? § Wasted efforts § Known root-causes § Recurring problems
  25. Existing RAN Troubleshooting § Slow & Ineffective § Many problems

    incorrectly diagnosed § Source of disagreements § Which team should solve this problem? § Wasted efforts § Known root-causes § Recurring problems Need more fine-grained information for diagnosis
  26. Fine-grained Information § Logging everything ideal, but impossible

  27. Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway

    (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible
  28. Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway

    (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible Radio bearer
  29. Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway

    (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible Radio bearer GTP tunnel
  30. Fine-grained Information Base Station (eNodeB) Serving Gateway (S-GW) Packet Gateway

    (P-GW) Mobility Management Entity (MME) Home Subscriber Server (HSS) Internet Control Plane Data Plane User Equipment (UE) § Logging everything ideal, but impossible Radio bearer GTP tunnel EPS Bearer
  31. Fine-grained Information § Control plane procedures

  32. Fine-grained Information § Control plane procedures [sequence diagram, UE / eNodeB / MME: radio link failure detected → RRC Connection Re-establishment Request → RRC Connection Re-establishment → RRC Connection Re-establishment Complete → RRC Connection Reconfiguration (UE context established); supervision timer expires → UE Context Release Request → UE Context Release Command → UE Context Release Complete]
  33. Fine-grained Information § Control plane procedures Bearer traces and control

    plane procedure logs provide necessary fine-grained information for efficient diagnosis
  34. Fine-grained Information § Control plane procedures Bearer traces and control

    plane procedure logs provide necessary fine-grained information for efficient diagnosis 3 step approach to leverage rich bearer-level traces for RAN performance diagnosis
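To make "bearer-level traces" concrete, here is an illustrative sketch of what a single trace record could combine: per-bearer radio measurements plus the outcome of control-plane procedures such as the RRC re-establishment flow shown above. The field names are hypothetical, not the operator's actual trace schema.

```python
# Illustrative bearer-level trace record: radio measurements plus the outcome
# of control-plane procedures for one EPS bearer. Field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BearerTrace:
    bearer_id: str            # identifies the EPS bearer (radio bearer + GTP tunnel)
    cell_id: str              # serving cell / eNodeB sector
    rsrp_dbm: float           # reference signal received power
    rsrq_db: Optional[float]  # reference signal received quality (may be missing)
    cqi: float                # channel quality indicator reported by the UE
    uplink_sinr_db: float     # uplink signal-to-interference-plus-noise ratio
    prb_allocated: int        # physical resource blocks used by this bearer
    tx_time_ms: float         # total transmission time
    release_cause: str        # e.g. "normal", "radio_link_failure", "timer_expiry"

    @property
    def dropped(self) -> bool:
        """A bearer counts as a drop if it was not released normally."""
        return self.release_cause != "normal"

# Example record for a bearer that ended in a radio link failure:
b = BearerTrace("b-001", "cell_A", -101.0, -17.2, 4.5, -12.3, 220, 850.0,
                "radio_link_failure")
print(b.dropped)  # True
```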
  35. Step 1: Isolate Problems to RAN End-User QoE

  36. Step 1: Isolate Problems to RAN End-User QoE Client Related

  37. Step 1: Isolate Problems to RAN End-User QoE Client Related

    Internet
  38. Step 1: Isolate Problems to RAN End-User QoE Client Related

    Cellular Network Internet
  39. Step 1: Isolate Problems to RAN End-User QoE Client Related

    Cellular Network Internet RAN Core
  40. Step 1: Isolate Problems to RAN End-User QoE Client Related

    Cellular Network Internet RAN Core
  41. Step 1: Isolate Problems to RAN End-User QoE Client Related

    Cellular Network Internet RAN Core Leverage existing trouble ticket system
  42. Step 2: Classify RAN Problems

  43. Step 2: Classify RAN Problems Coverage

  44. Step 2: Classify RAN Problems Coverage Interference

  45. Step 2: Classify RAN Problems Coverage Interference Congestion

  46. Step 2: Classify RAN Problems Coverage Interference Congestion Configuration

  47. Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network

    State Changes
  48. Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network

    State Changes Others
  49. Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network

    State Changes Others
  50. Step 2: Classify RAN Problems Coverage Interference Congestion Configuration Network

    State Changes Others RSRP RSRQ CQI SINR eNodeB Params
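A minimal sketch of how Step 2's binning could look as code, using the per-bearer measurements named on the slide (RSRP, RSRQ, SINR) plus eNodeB state. The thresholds and the exact rules are hypothetical placeholders, not the paper's classifier.

```python
# Sketch of Step 2: bin a problematic bearer into one of the RAN problem
# classes using radio measurements and eNodeB parameters. Thresholds and
# rules are hypothetical placeholders.
def classify_ran_problem(rsrp_dbm, rsrq_db, sinr_db, prb_utilization,
                         config_changed, state_changed):
    if config_changed:
        return "Configuration"
    if state_changed:
        return "Network State Changes"
    if rsrp_dbm < -115:                        # weak signal -> coverage hole
        return "Coverage"
    if rsrp_dbm >= -115 and (sinr_db < 0 or rsrq_db < -15):
        return "Interference"                  # strong signal but poor quality
    if prb_utilization > 0.9:                  # cell resources nearly exhausted
        return "Congestion"
    return "Others"

print(classify_ran_problem(rsrp_dbm=-108, rsrq_db=-14, sinr_db=-3,
                           prb_utilization=0.4,
                           config_changed=False, state_changed=False))
# -> "Interference"
```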
  51. Step 3: Model KPI at Bearer Level

  52. Step 3: Model KPI at Bearer Level Accessibility Retainability PHY

    layer Throughput Quality Traffic Volume Connection Counts Mobility
  53. Step 3: Model KPI at Bearer Level Accessibility Retainability PHY

    layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features
  54. Step 3: Model KPI at Bearer Level Accessibility Retainability PHY

    layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Event Metrics
  55. Step 3: Model KPI at Bearer Level Accessibility Retainability PHY

    layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Event Metrics Non-Event/Volume Metrics
  56. Step 3: Model KPI at Bearer Level Accessibility Retainability PHY

    layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Classification Models Event Metrics Non-Event/Volume Metrics
  57. Step 3: Model KPI at Bearer Level Accessibility Retainability PHY

    layer Throughput Quality Traffic Volume Connection Counts Mobility Model performance metrics at bearer level using classification bin parameters as features Classification Models Regression Models Event Metrics Non-Event/Volume Metrics
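A small sketch of Step 3's split between model families: event metrics get classification models, non-event/volume metrics get regression models, both over the classification-bin parameters as features. scikit-learn is used purely as an illustrative choice; the metric and feature names are assumptions.

```python
# Sketch of Step 3: dispatch each KPI to a model family. Event metrics
# (e.g. connection drops) -> classification; non-event/volume metrics
# (e.g. throughput) -> regression. Names and model choices are illustrative.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor

EVENT_METRICS = {"connection_drop", "accessibility_failure"}
VOLUME_METRICS = {"throughput", "traffic_volume"}

FEATURES = ["rsrp_dbm", "rsrq_db", "cqi", "uplink_sinr_db", "prb_allocated"]

def make_model(metric: str):
    """Pick a model family based on the kind of KPI being explained."""
    if metric in EVENT_METRICS:
        return DecisionTreeClassifier(max_depth=4)   # interpretable splits
    if metric in VOLUME_METRICS:
        return RandomForestRegressor(n_estimators=50)
    raise ValueError(f"unknown metric: {metric}")

# model = make_model("connection_drop")
# model.fit(X[FEATURES], y)   # X: per-bearer features, y: per-bearer labels
```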
  58. Bearer-level Modeling for Event Metrics § Connection Drops § Key

    retainability metric § How well network can complete connections
  59. Bearer-level Modeling for Event Metrics § Connection Drops § Key retainability metric § How well network can complete connections [figure: chart; labels not recoverable from the transcript]
  60. Bearer-level Modeling for Event Metrics § Connection Drops § Key retainability metric § How well network can complete connections [same figure] Build decision trees to explain drops
  61. Bearer-level Modeling for Event Metrics [decision tree over bearer features, splitting on RSRQ availability, Uplink SINR > -11.75, RSRQ > -16.5, Uplink SINR > -5.86, and CQI > 5.875, with leaves labeled Success or Drop]
  62. Bearer-level Modeling for Event Metrics [same decision tree]
  63. Bearer-level Modeling for Event Metrics [same decision tree]
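The tree above can be read as a simple per-bearer predictor. The sketch below hard-codes the split thresholds shown on the slide (Uplink SINR, RSRQ, CQI); the exact tree topology is approximated from the transcript, so treat it as illustrative rather than the paper's exact model.

```python
# Illustrative reading of the drop/success decision tree from the slide.
# The thresholds are the ones shown; the topology is approximated from the
# transcript, so this is a sketch of the idea, not the learned model itself.
def predict_bearer_outcome(uplink_sinr_db, cqi, rsrq_db=None):
    if rsrq_db is not None:                  # RSRQ available for this bearer
        if rsrq_db > -16.5:
            return "Success"
        return "Drop" if uplink_sinr_db <= -11.75 else "Success"
    # RSRQ not reported: fall back to uplink SINR and CQI
    if uplink_sinr_db <= -5.86:
        return "Drop"
    return "Success" if cqi > 5.875 else "Drop"

print(predict_bearer_outcome(uplink_sinr_db=-13.0, cqi=7.0, rsrq_db=-17.5))  # Drop
```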
  64. Findings from Call Drop Analysis [figure: Reference Signal Quality at UE]
  65. Findings from Call Drop Analysis [figures: Reference Signal Quality at UE; Channel Quality at UE]
  66. Findings from Call Drop Analysis [figures: Reference Signal Quality at UE; Channel Quality at UE] Shapes not identical (should be, ideally)
  67. Findings from Call Drop Analysis [figure: Probability vs. SINR Difference (dB), curves for ρ=1/3, ρ=1, ρ=5/3]
  68. Findings from Call Drop Analysis [same figure] Finding / Insight: P-CQI is not CRC protected!
  69. Bearer-level Modeling for Volume Metrics § Throughput § Key metric

    users really care about
  70. Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about [excerpt from the paper:] For transmit diversity, a 29% overhead for each PRB exists on average because of resources allocated to the physical downlink control channel, physical broadcast channel and reference signals. The physical layer has a BLER target of 10%. Account for MAC sub-layer retransmissions: the MAC sub-layer performs retransmissions. We denote the MAC efficiency as η_MAC, computed as the ratio of total first transmissions over total transmissions; we compute η_MAC using our traces. The predicted throughput due to transmit diversity is calculated as

      tput_RLC,div = η_MAC × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div

      where PRB_div denotes the total PRBs allocated for transmit diversity and TxTime_div is the total transmission time for transmit diversity.
  71. Bearer-level Modeling for Volume Metrics [same slide, annotation:] MAC efficiency
  72. Bearer-level Modeling for Volume Metrics [same slide, annotations:] MAC efficiency, # Physical Resource Blocks
  73. Bearer-level Modeling for Volume Metrics [same slide, annotations:] MAC efficiency, # Physical Resource Blocks, Transmission time
  74. Bearer-level Modeling for Volume Metrics [same slide, annotations:] MAC efficiency, # Physical Resource Blocks, Transmission time, Link-adapted SINR
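The formula above transcribes almost directly into code. A sketch, with the constants taken from the excerpt; the argument units and the example bearer are hypothetical.

```python
import math

# Sketch transcribing the predicted-throughput formula from the excerpt above:
# MAC efficiency x 10% BLER target (0.9) x 29% per-PRB overhead (1 - 0.29)
# x 180 (one PRB spans 180 kHz) x PRBs x log2(1 + SINR) / transmission time.
# Illustrative only; units follow the excerpt, not a full 3GPP-level model.
def predicted_tput_rlc_div(mac_efficiency, prb_div, sinr_div, tx_time_div):
    """Predicted RLC throughput under transmit diversity, per the excerpt's formula.

    mac_efficiency: ratio of first transmissions to total transmissions
    prb_div:        total PRBs allocated for transmit diversity
    sinr_div:       link-adapted SINR (linear, not dB)
    tx_time_div:    total transmission time for transmit diversity
    """
    return (mac_efficiency * 0.9 * (1 - 0.29) * 180
            * prb_div * math.log2(1 + sinr_div)) / tx_time_div

# Hypothetical bearer: 90% MAC efficiency, 2000 PRBs, SINR of 16 (~12 dB), 1.5 time units.
print(predicted_tput_rlc_div(0.9, 2000, 16.0, 1.5))
```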
  75. Bearer-level Modeling for Volume Metrics [figure: Probability vs. Prediction Error (%)]
  76. Bearer-level Modeling for Volume Metrics [same figure] Works well in most scenarios
  77. Findings from Throughput Analysis § Problematic for some cells

  78. Findings from Throughput Analysis § Problematic for some cells §

    To understand why, computed loss of efficiency
  79. Findings from Throughput Analysis § Problematic for some cells § To understand why, computed loss of efficiency: actual-throughput SINR vs. SINR computed using parameters [figure: Probability vs. SINR Loss (dB)]
  80. Findings from Throughput Analysis § Problematic for some cells § To understand why, computed loss of efficiency [same figure] Finding / Insight: Link adaptation slow to adapt!
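One way to compute such a loss is to invert the throughput model: find the SINR that would explain the achieved throughput and compare it with the SINR implied by the reported parameters. A sketch under that assumption, reusing the constants from the formula above; function names and inputs are hypothetical.

```python
import math

# Sketch of the efficiency-loss idea: invert the throughput formula to get the
# SINR implied by the achieved throughput, then compare with the reported SINR.
# Names, inputs, and the inversion itself are illustrative assumptions.
def implied_sinr_db(actual_tput, prb_div, tx_time_div, mac_efficiency):
    """SINR (dB) that would explain the observed throughput under the model."""
    spectral_bits = actual_tput * tx_time_div / (
        mac_efficiency * 0.9 * (1 - 0.29) * 180 * prb_div)
    sinr_linear = 2 ** spectral_bits - 1
    return 10 * math.log10(sinr_linear)

def sinr_loss_db(actual_tput, reported_sinr_db, prb_div, tx_time_div, mac_efficiency):
    """Gap between the reported SINR and the SINR the throughput actually implies."""
    return reported_sinr_db - implied_sinr_db(actual_tput, prb_div,
                                              tx_time_div, mac_efficiency)

# Hypothetical cell: parameters say 12 dB, but the achieved throughput only
# supports a lower effective SINR -> ~6.5 dB loss for these made-up numbers.
print(sinr_loss_db(actual_tput=300_000, reported_sinr_db=12.0,
                   prb_div=2000, tx_time_div=1.5, mac_efficiency=0.9))
```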
  81. Can cellular network operators automate the diagnosis of RAN performance

    problems?
  82. Towards Full Automation § Our methodology is amenable to automation

    § Models can be built on-demand automatically
  83. Usefulness to the Operator [figure: chart; labels not recoverable from the transcript]
  84. Usefulness to the Operator [figure: Drop Rate (%) by Day (Sun through Sat), curves for Total and Explained]
  85. Towards Full Automation § Our methodology is amenable to automation

    § Models can be built on-demand automatically § Full automation for next generation networks: § Need to build 1000s of models § Need to keep the models updated § Need real-time diagnosis
  86. We’ve made some progress towards this… § Cells exhibit performance similarity § May be able to group cells by performance
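One plausible way to act on that similarity is to cluster cells by their KPI profiles and build models per group rather than per cell. The sketch below uses k-means and a hypothetical feature set; it illustrates the idea, not the talk's actual method.

```python
# Cluster cells by KPI profile so one model can serve a group of similar cells.
# scikit-learn's KMeans and the feature choice are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-cell KPI vectors: [drop rate %, mean SINR dB, mean CQI, PRB utilization]
cell_kpis = np.array([
    [0.4, 11.0, 8.1, 0.35],
    [0.5, 10.5, 7.9, 0.40],
    [2.1,  2.0, 4.2, 0.92],
    [1.9,  2.5, 4.0, 0.88],
])

features = StandardScaler().fit_transform(cell_kpis)
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(groups)  # e.g. [0 0 1 1]: healthy cells vs. congested / interference-limited cells
```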
  87. Summary § Experience working with a tier-1 operator § 2

    million users, over a period of 1 year § Leveraging bearer-level traces could be the key to automating RAN diagnosis § Proposed bearer-level modeling § Unearthed several insights § Fully automated diagnosis needs more effort