Automating Diagnosis of Cellular Radio Access Network Problems

Anand Iyer
October 17, 2017


Transcript

  1. Automating Diagnosis of Cellular Radio Access Network Problems. Anand Iyer⋆, Li Erran Li⬩, Ion Stoica⋆ (⋆ University of California, Berkeley; ⬩ Uber Technologies). ACM MobiCom 2017
  2. Cellular RANs § Subscribers expect high quality of experience (QoE) § A critical component and a high-impact business for operators § Operators must ensure optimal RAN performance 24x7
  3. Existing RAN Troubleshooting Techniques § Monitor Key Performance Indicators (KPIs) [figure: cells annotated with drop counts, e.g., Drops: 0, Drops: 1, Drops: 3, Drops: 10] § Poor KPI → (mostly manual) root cause analysis
  4. Root cause analysis involves field trials § {Technicians + Equipment} × Multiple Trips = Multi-billion $$$/year § $22B spent per year on network management & troubleshooting!
  5. Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied a portion of the RAN for over a year § 13,000+ base stations serving live users
  6. A Typical Week [figure: KPI time series over a typical week; axis and tick labels unrecoverable]
  7.–8. Our Experience § Working with a cellular network operator § Tier-1 operator in the U.S. § Studied a portion of the RAN for over a year § 13,000+ base stations serving over 2 million users § 1000s of trouble tickets § Significant effort by the operator to resolve them
  9.–10. Existing RAN Troubleshooting § Slow & ineffective § Many problems incorrectly diagnosed § Source of disagreements: which team should solve this problem? § Wasted effort on known root causes and recurring problems § Need more fine-grained information for diagnosis
  11.–14. Fine-grained Information [diagram: LTE architecture with User Equipment (UE), Base Station (eNodeB), Serving Gateway (S-GW), Packet Gateway (P-GW), Mobility Management Entity (MME), Home Subscriber Server (HSS), and the Internet; control plane vs. data plane; the radio bearer runs between UE and eNodeB, the GTP tunnel between eNodeB and the gateways, and the end-to-end EPS bearer spans both] § Logging everything would be ideal, but is impossible
  15. Fine-grained Information § Control plane procedures [message sequence chart: RRC Connection Re-establishment among UE, eNodeB, and MME] § Radio link failure detected → RRC Connection Re-establishment Request → UE Context Established → RRC Connection Re-establishment Complete → RRC Connection Reconfiguration § Supervision timer expires → UE Context Release Request → UE Context Release Command → UE Context Release Complete
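
To illustrate how such procedure logs can be consumed downstream, here is a minimal sketch that scans a per-bearer sequence of control-plane events and flags bearers whose connection ended in a failure-then-release pattern. The event names and the flat-list trace format are hypothetical illustrations, not the operator's actual schema:

```python
# Minimal sketch: flag bearers whose control-plane procedure log shows a
# radio link failure followed by a UE context release (a dropped connection).
# Event names and the trace format here are hypothetical, not from the talk.
from typing import Iterable

DROP_TRIGGERS = {"radio_link_failure", "rrc_reestablishment_request"}
RELEASE_EVENTS = {"ue_context_release_complete"}

def bearer_dropped(events: Iterable[str]) -> bool:
    """True if a drop trigger is later followed by a context release."""
    saw_trigger = False
    for ev in events:
        if ev in DROP_TRIGGERS:
            saw_trigger = True
        elif ev in RELEASE_EVENTS and saw_trigger:
            return True
    return False

trace = ["rrc_connection_setup", "radio_link_failure",
         "rrc_reestablishment_request", "ue_context_release_request",
         "ue_context_release_command", "ue_context_release_complete"]
print(bearer_dropped(trace))  # True
```
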
  16.–17. Fine-grained Information § Control plane procedures § Bearer traces and control-plane procedure logs provide the necessary fine-grained information for efficient diagnosis § A three-step approach to leverage rich bearer-level traces for RAN performance diagnosis
  18. Step 1: Isolate Problems to the RAN [diagram: end-user QoE problems attributed to the client, the Internet, or the cellular network (RAN vs. core)] § Leverage the existing trouble ticket system
  19. Step 2: Classify RAN Problems § Problem classes: Coverage, Interference, Congestion, Configuration, Network State Changes, Others § Classification features: RSRP, RSRQ, CQI, SINR, eNodeB parameters
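
To make this classification step concrete, here is a hedged sketch of a rule-based classifier over these features; all threshold values are invented for illustration and are not from the paper:

```python
# Illustrative sketch of Step 2: bucket a bearer's radio measurements into a
# problem class. Threshold values below are hypothetical, not the paper's.
def classify_ran_problem(rsrp_dbm: float, rsrq_db: float,
                         sinr_db: float, prb_utilization: float) -> str:
    if rsrp_dbm < -110:                   # weak reference signal power
        return "coverage"
    if rsrq_db < -15 and sinr_db < 0:     # adequate power but poor quality
        return "interference"
    if prb_utilization > 0.9:             # resource blocks near exhaustion
        return "congestion"
    return "others"                       # configuration / state changes need
                                          # eNodeB parameter history to detect

print(classify_ran_problem(-120, -10, 5, 0.4))   # coverage
print(classify_ran_problem(-95, -17, -3, 0.5))   # interference
```
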
  20.–25. Step 3: Model KPIs at the Bearer Level § KPI categories: Accessibility, Retainability, PHY layer, Throughput, Quality, Traffic Volume, Connection Counts, Mobility § Model performance metrics at the bearer level, using the classification-bin parameters as features § Event metrics → classification models § Non-event/volume metrics → regression models
  26.–28. Bearer-level Modeling for Event Metrics § Connection drops § Key retainability metric: how well the network can complete connections [figure: drop-related statistics; tick labels unrecoverable] § Build decision trees to explain drops
  29.–31. Bearer-level Modeling for Event Metrics [decision tree over bearer features, splitting on RSRQ availability, Uplink SINR > -11.75, RSRQ > -16.5, Uplink SINR > -5.86, and CQI > 5.875; leaves labeled Success or Drop; slides 30–31 highlight individual paths through the tree]
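
As a minimal sketch of this step, the snippet below fits an interpretable drop/success decision tree on bearer-level features. scikit-learn's DecisionTreeClassifier and the synthetic data are illustrative assumptions, not the authors' exact tooling:

```python
# Sketch of building an interpretable drop/success tree from bearer records.
# Feature names follow the slide; the training data here is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.uniform(-20, 20, n),   # uplink SINR (dB)
    rng.uniform(-20, -3, n),   # RSRQ (dB)
    rng.uniform(0, 15, n),     # CQI
])
# Synthetic ground truth: drops concentrate at poor SINR/RSRQ/CQI.
y = ((X[:, 0] < -11.75) | ((X[:, 1] < -16.5) & (X[:, 2] < 5.875))).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["uplink_sinr", "rsrq", "cqi"]))
```
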
  32.–34. Findings from Call Drop Analysis [figures: distribution of Reference Signal Quality (RSRQ) at the UE; distribution of Channel Quality (CQI) at the UE; tick labels unrecoverable] § Shapes not identical (should be, ideally)
  35.–36. Findings from Call Drop Analysis [figure: probability vs. SINR difference (dB), curves for ρ=1/3, ρ=1, ρ=5/3] § Insight: P-CQI is not CRC protected!
  37.–41. Bearer-level Modeling for Volume Metrics § Throughput § Key metric users really care about § (excerpt from the paper) Due to transmit diversity, a 29% overhead exists on average for each PRB because of resources allocated to the physical downlink control channel, physical broadcast channel, and reference signals; the physical layer has a BLER target of 10%. The MAC sub-layer performs retransmissions; the MAC efficiency $\eta_{MAC}$ is computed from the traces as the ratio of total first transmissions over total transmissions. The predicted throughput due to transmit diversity is:

    $\mathrm{tput}_{RLC}^{di} = \eta_{MAC} \times 0.9 \times (1 - 0.29) \times 180 \times PRB_{di} \times \log_2(1 + SINR_{di}) \,/\, TxTime_{di}$

    where $PRB_{di}$ is the total PRBs allocated for transmit diversity and $TxTime_{di}$ is the total transmission time for transmit diversity § Annotated terms (builds on slides 38–41): MAC efficiency ($\eta_{MAC}$), # physical resource blocks ($PRB_{di}$), transmission time ($TxTime_{di}$), link-adapted SINR ($SINR_{di}$)
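
A direct transcription of that formula as a hedged sketch: variable names follow the slide, the result is in kbit/s (180 kHz per PRB), and SINR is assumed to be reported in dB and converted to linear for the Shannon term:

```python
import math

def predicted_tput_rlc_di(eta_mac: float, prb_di: float,
                          sinr_di_db: float, tx_time_di_s: float) -> float:
    """Predicted RLC throughput under transmit diversity (kbit/s).

    eta_mac: MAC efficiency = first transmissions / total transmissions.
    prb_di: total PRBs allocated for transmit diversity.
    sinr_di_db: link-adapted SINR in dB (converted to linear below).
    tx_time_di_s: total transmission time in seconds.
    """
    sinr_linear = 10 ** (sinr_di_db / 10)        # dB -> linear
    return (eta_mac * 0.9 * (1 - 0.29) * 180     # BLER target, PRB overhead, 180 kHz/PRB
            * prb_di * math.log2(1 + sinr_linear) / tx_time_di_s)

# Example: 1000 PRBs over 0.5 s at 10 dB SINR with 95% MAC efficiency.
print(f"{predicted_tput_rlc_di(0.95, 1000, 10.0, 0.5):.0f} kbit/s")
```
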
  42.–43. Bearer-level Modeling for Volume Metrics [figure: CDF of throughput prediction error (%); probability ranges from 0.86 to 1 over 0–100% error] § Works well in most scenarios
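
For reference, a minimal sketch (synthetic data, not the paper's measurements) of how such a prediction-error CDF is computed from per-bearer predicted and observed throughput:

```python
import numpy as np

def prediction_error_cdf(predicted: np.ndarray, actual: np.ndarray):
    """Relative prediction error (%) and its empirical CDF."""
    err = 100 * np.abs(predicted - actual) / actual
    err_sorted = np.sort(err)
    cdf = np.arange(1, len(err) + 1) / len(err)
    return err_sorted, cdf

rng = np.random.default_rng(1)
actual = rng.uniform(1e3, 5e4, 10_000)                    # bearer throughputs
predicted = actual * rng.normal(1.0, 0.05, actual.shape)  # 5% model noise
err, cdf = prediction_error_cdf(predicted, actual)
print(f"90th percentile error: {err[np.searchsorted(cdf, 0.9)]:.1f}%")
```
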
  44.–46. Findings from Throughput Analysis § Problematic for some cells § To understand why, computed the loss of efficiency: the SINR implied by the actual throughput vs. the SINR computed from reported parameters [figure: distribution of SINR loss (dB), roughly 0–25 dB] § Insight: Link adaptation is slow to adapt!
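
One plausible way to compute that SINR loss, sketched under the assumption that the throughput model above is inverted; the slides do not spell out the exact procedure:

```python
import math

def effective_sinr_db(tput_kbps: float, eta_mac: float,
                      prb: float, tx_time_s: float) -> float:
    """Invert the throughput formula to recover the effectively achieved SINR (dB)."""
    bits_per_hz = tput_kbps * tx_time_s / (eta_mac * 0.9 * (1 - 0.29) * 180 * prb)
    sinr_linear = 2 ** bits_per_hz - 1
    return 10 * math.log10(sinr_linear)

# SINR loss = SINR implied by reported parameters minus SINR actually achieved.
reported_sinr_db = 15.0
achieved = effective_sinr_db(tput_kbps=8000, eta_mac=0.95, prb=1000, tx_time_s=0.5)
print(f"SINR loss: {reported_sinr_db - achieved:.1f} dB")
```
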
  47. Towards Full Automation § Our methodology is amenable to automation

    § Models can be built on-demand automatically
  48.–49. Usefulness to the Operator [figure: Drop Rate (%) vs. Day (Sun–Sat), two curves labeled Total and Explained; drop rate roughly 0.3–0.7%]
  50. Towards Full Automation § Full automation for next-generation networks: § Need to build 1000s of models § Need to keep the models updated § Need real-time diagnosis
  51. We’ve made some progress towards this… § Cells exhibit performance similarity § May be able to group cells by performance (see the sketch below)
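
A minimal sketch of such grouping, assuming each cell is summarized by a vector of KPI statistics; k-means is an illustrative choice, not a method the talk names:

```python
# Illustrative grouping of cells by performance similarity.
# Feature construction and k-means are assumptions, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Per-cell KPI summary vectors: [drop rate, throughput, SINR]
cells = np.column_stack([
    rng.beta(2, 200, 500) * 100,   # drop rate (%)
    rng.uniform(5, 50, 500),       # throughput (Mbit/s)
    rng.uniform(-5, 25, 500),      # SINR (dB)
])
labels = KMeans(n_clusters=5, n_init=10).fit_predict(
    StandardScaler().fit_transform(cells))
print(np.bincount(labels))  # number of cells in each performance group
```
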
  52. Summary § Experience working with a tier-1 operator: 2 million users, over one year § Leveraging bearer-level traces could be the key to automating RAN diagnosis § Proposed bearer-level modeling § Unearthed several insights § Fully automated diagnosis needs more effort