Automating Diagnosis of Cellular Radio Access Network Problems

Anand Iyer
October 17, 2017


Transcript

  1. 1.

Automating Diagnosis of Cellular Radio Access Network Problems. Anand Iyer⋆, Li Erran Li⬩, Ion Stoica⋆ (⋆University of California, Berkeley; ⬩Uber Technologies). ACM MobiCom 2017.
  2. 4.

Cellular RANs § Must provide user satisfaction § Subscribers expect high quality of experience (QoE) § Critical component for operators § High-impact business for operators. Operators must ensure optimal RAN performance 24x7.
  4. 14.

Existing RAN Troubleshooting Techniques: Monitor Key Performance Indicators (KPIs) [figure: per-cell drop counts, e.g., Drops: 0, Drops: 1, Drops: 3, Drops: 10] Poor KPI → (mostly manual) root cause analysis
  5. 17.

Root cause analysis involves field trials. {Technicians + Equipment} × Multiple Trips = Multi-billion $$$/year. $22B spent per year on network management & troubleshooting!
  6. 20.

Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied a portion of the RAN for over a year § 13,000+ base stations serving live users
  7. 21.

A Typical Week [figure: daily time series for a typical week; labels not recoverable from the transcript]
  8. 23.

    Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied a portion of the RAN for over a year § 13,000+ base stations serving over 2 million users § 1000s of trouble tickets § Significant effort by the operator to resolve them
  10. 25.

    Existing RAN Troubleshooting § Slow & ineffective § Many problems incorrectly diagnosed § Source of disagreements: which team should solve this problem? § Wasted efforts on known root causes and recurring problems. Need more fine-grained information for diagnosis.
  12. 30.

    Fine-grained Information [EPC architecture figure: User Equipment (UE) → Base Station (eNodeB) → Serving Gateway (S-GW) → Packet Gateway (P-GW) → Internet, with Mobility Management Entity (MME) and Home Subscriber Server (HSS); control plane vs. data plane] § Logging everything would be ideal, but is impossible § Radio bearer § GTP tunnel § EPS bearer
  16. 32.

Fine-grained Information § Control plane procedures [message sequence chart: RRC Connection Re-establishment among UE, eNodeB, MME. Radio link failure detected → RRC Connection Re-establishment Request → RRC Connection Re-establishment Complete → RRC Connection Reconfiguration (UE context established); supervision timer expires → UE Context Release Request → UE Context Release Command → UE Context Release Complete]
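
    A minimal sketch of how such control-plane logs could be used to label bearer outcomes; the event names and record layout are assumptions for illustration, not the operator's schema:

    ```python
    # Hypothetical sketch: label a bearer as dropped from its control-plane log.
    from typing import Dict, List

    def is_dropped(events: List[Dict]) -> bool:
        """A bearer is treated as dropped if a radio link failure is followed by
        a UE context release without a successful re-establishment in between."""
        rlf_pending = False
        for ev in events:  # events assumed ordered by timestamp
            if ev["type"] == "RADIO_LINK_FAILURE":
                rlf_pending = True
            elif ev["type"] == "RRC_CONN_REESTABLISHMENT_COMPLETE":
                rlf_pending = False  # UE recovered; not a drop
            elif ev["type"] == "UE_CONTEXT_RELEASE_COMPLETE" and rlf_pending:
                return True          # released while recovery was still pending
        return False
    ```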
  17. 34.

    Fine-grained Information § Control plane procedures. Bearer traces and control plane procedure logs provide the necessary fine-grained information for efficient diagnosis. A 3-step approach leverages rich bearer-level traces for RAN performance diagnosis.
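
    Since the following steps model per-bearer features, here is a minimal sketch of what a bearer-level trace record might carry; the field names are assumptions chosen to match the quantities the talk mentions:

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class BearerRecord:
        """Hypothetical per-bearer trace record (not the operator's format)."""
        bearer_id: str
        cell_id: str
        rsrp_dbm: Optional[float]        # reference signal received power
        rsrq_db: Optional[float]         # reference signal received quality
        cqi: Optional[float]             # channel quality indicator
        uplink_sinr_db: Optional[float]  # uplink signal-to-interference+noise
        prb_allocated: int               # physical resource blocks used
        tx_time_s: float                 # total transmission time
        dropped: bool                    # outcome label from control-plane log
    ```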
  21. 41.

Step 1: Isolate Problems to RAN [figure: end-user QoE problems attributed to the client, the Internet, or the cellular network (RAN vs. core)] Leverage existing trouble ticket system
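
    A hedged sketch of the triage this step implies; the ticket fields and labels below are hypothetical:

    ```python
    def attribute_ticket(ticket: dict) -> str:
        """Coarsely attribute an end-user QoE ticket; labels are hypothetical."""
        domain = ticket.get("domain")  # e.g., "client", "internet", "cellular"
        if domain != "cellular":
            return domain or "unknown"
        # within the cellular network, separate RAN from core problems
        return "ran" if ticket.get("segment") == "ran" else "core"
    ```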
  22. 50.

Step 2: Classify RAN Problems into Coverage, Interference, Congestion, Configuration, Network State Changes, and Others, using RSRP, RSRQ, CQI, SINR, and eNodeB parameters
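
    An illustrative classification-bin sketch; the thresholds below are invented placeholders, not the operator's bin boundaries:

    ```python
    def classify_ran_problem(rsrp: float, rsrq: float, sinr: float,
                             prb_utilization: float, config_changed: bool) -> str:
        """Map bearer measurements to a problem class (illustrative thresholds)."""
        if config_changed:
            return "configuration / network state change"
        if rsrp < -110:             # very weak reference signal: coverage hole
            return "coverage"
        if rsrq < -15 or sinr < 0:  # signal present but noisy: interference
            return "interference"
        if prb_utilization > 0.9:   # cell resources near exhaustion
            return "congestion"
        return "others"
    ```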
  23. 57.

    Step 3: Model KPIs at the Bearer Level (Accessibility, Retainability, PHY-layer Throughput, Quality, Traffic Volume, Connection Counts, Mobility). Model performance metrics at the bearer level using classification bin parameters as features: Event Metrics → Classification Models; Non-Event/Volume Metrics → Regression Models.
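
    A minimal sketch of that split, using scikit-learn for illustration (the talk does not prescribe a library):

    ```python
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestRegressor

    def fit_event_model(X, y_dropped):
        """Event metrics (e.g., connection drops) -> classification model."""
        return DecisionTreeClassifier(max_depth=4).fit(X, y_dropped)

    def fit_volume_model(X, y_throughput):
        """Non-event/volume metrics (e.g., throughput) -> regression model.
        (The talk derives an analytical throughput model; a learned regressor
        is shown here only as one way to realize the regression branch.)"""
        return RandomForestRegressor(n_estimators=100).fit(X, y_throughput)
    ```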
  29. 60.

    Bearer-level Modeling for Event Metrics § Connection drops § Key retainability metric § How well the network can complete connections. Build decision trees to explain drops.
  32. 61.

Bearer-level Modeling for Event Metrics [decision-tree figure: splits on RSRQ availability, RSRQ > -16.5, Uplink SINR > -11.75, Uplink SINR > -5.86, and CQI > 5.875; leaves labeled Success / Drop]
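
    One plausible wiring of that tree as code; the flattened figure does not preserve the exact branch layout, so treat this as an illustration of the learned thresholds rather than the paper's exact tree:

    ```python
    from typing import Optional

    def predict_drop(rsrq: Optional[float], uplink_sinr: float, cqi: float) -> str:
        """Illustrative drop/success decision tree (branch wiring assumed)."""
        if rsrq is not None:            # "RSRQ Available?" branch
            if rsrq > -16.5:
                return "Success"
            return "Success" if uplink_sinr > -11.75 else "Drop"
        if uplink_sinr > -5.86:         # RSRQ missing: fall back to uplink SINR
            return "Success"
        return "Success" if cqi > 5.875 else "Drop"
    ```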
  35. 66.

    Findings from Call Drop Analysis [CDF figures: Reference Signal Quality (RSRQ) at UE; Channel Quality (CQI) at UE] Shapes not identical (should be, ideally)
  38. 68.

    Findings from Call Drop Analysis [figure: Probability vs. SINR Difference (dB), curves for ρ=1/3, ρ=1, ρ=5/3] Finding Insight: P-CQI is not CRC protected!
  40. 74.

    Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about. From the paper: under transmit diversity, a 29% overhead exists for each PRB on average because of resources allocated to the physical downlink control channel, physical broadcast channel, and reference signals; the physical layer has a BLER target of 10%. The MAC sub-layer performs retransmissions, so we account for MAC efficiency $\eta_{MAC}$, computed from our traces as the ratio of total first transmissions to total transmissions. The predicted throughput under transmit diversity is:

    $$\mathrm{tput}_{RLC}^{div} = \eta_{MAC} \times 0.9 \times (1 - 0.29) \times 180 \times \mathrm{PRB}_{div} \times \log_2(1 + \mathrm{SINR}_{div}) \,/\, \mathrm{TxTime}_{div}$$

    where $\mathrm{PRB}_{div}$ is the total number of physical resource blocks allocated for transmit diversity, $\mathrm{TxTime}_{div}$ is the total transmission time for transmit diversity, and $\mathrm{SINR}_{div}$ is the link-adapted SINR.
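
    A direct transcription of that model as code; interpreting the constant 180 as one 180 kHz PRB over a 1 ms subframe, and SINR as a dB value, are my assumptions:

    ```python
    import math

    def predicted_tput_rlc_div(eta_mac: float, prb_div: float,
                               sinr_div_db: float, tx_time_div_s: float) -> float:
        """Predicted RLC throughput (bits/s) under transmit diversity.
        0.9 = (1 - 10% BLER target); (1 - 0.29) removes per-PRB overhead."""
        sinr_linear = 10 ** (sinr_div_db / 10.0)   # dB -> linear (assumption)
        total_bits = (eta_mac * 0.9 * (1 - 0.29) * 180
                      * prb_div * math.log2(1 + sinr_linear))
        return total_bits / tx_time_div_s
    ```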
  45. 76.

    Bearer-level Modeling for Volume Metrics [CDF figure: Probability vs. Prediction Error (%)] Works well in most scenarios
  47. 80.

    Findings from Throughput Analysis § Problematic for some cells § To understand why, computed loss of efficiency: the SINR implied by actual throughput vs. the SINR computed from link parameters [figure: Probability vs. SINR Loss (dB)] Finding Insight: Link adaptation is slow to adapt!
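
    A sketch of that loss-of-efficiency computation, inverting the throughput model above; the variable names are mine:

    ```python
    import math

    def sinr_loss_db(actual_tput_bps: float, eta_mac: float, prb_div: float,
                     tx_time_div_s: float, reported_sinr_db: float) -> float:
        """SINR loss = reported (link-adapted) SINR minus the SINR implied by
        the throughput actually achieved. Positive values suggest link
        adaptation lagging behind channel conditions."""
        bits_per_log2 = eta_mac * 0.9 * (1 - 0.29) * 180 * prb_div
        implied_linear = 2 ** (actual_tput_bps * tx_time_div_s / bits_per_log2) - 1
        implied_db = 10 * math.log10(max(implied_linear, 1e-12))
        return reported_sinr_db - implied_db
    ```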
  50. 82.

    Towards Full Automation § Our methodology is amenable to automation

    § Models can be built on-demand automatically
  51. 84.

    Usefulness to the Operator [figure: Drop Rate (%) by day, Sun through Sat, showing Total vs. Explained drops]
  53. 85.

    Towards Full Automation § Our methodology is amenable to automation

    § Models can be built on-demand automatically § Full automation for next generation networks: § Need to build 1000s of models § Need to keep the models updated § Need real-time diagnosis
  54. 86.

We’ve made some progress towards this… § Cells exhibit performance similarity § May be able to group cells by performance (a clustering sketch follows)
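
    A hedged sketch of grouping cells by performance similarity; k-means over per-cell KPI summaries is my choice of method, not necessarily the authors':

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def group_cells(kpi_matrix: np.ndarray, n_groups: int = 8) -> np.ndarray:
        """kpi_matrix: one row per cell; columns are KPI summaries such as
        drop rate, mean throughput, and mean CQI. Returns a group id per cell."""
        X = StandardScaler().fit_transform(kpi_matrix)  # normalize KPI scales
        return KMeans(n_clusters=n_groups, n_init=10).fit_predict(X)
    ```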
  55. 87.

    Summary § Experience working with a tier-1 operator § 2

    million users, over a period of 1 year § Leveraging bearer-level traces could be the key to automating RAN diagnosis § Proposed bearer-level modeling § Unearthed several insights § Fully automated diagnosis needs more effort