
Automating Diagnosis of Cellular Radio Access Network Problems

Anand Iyer
October 17, 2017

Transcript

  1. Automating Diagnosis of Cellular
    Radio Access Network Problems
    Anand Iyer ⋆, Li Erran Li ⬩, Ion Stoica ⋆
    ⋆ University of California, Berkeley ⬩ Uber Technologies
    ACM MobiCom 2017

  2. Cellular Radio Access Networks (RAN)

  3. Connect billions of users to the Internet every day…

  4. Cellular RANs
    § Must ensure user satisfaction
    § Subscribers expect a high quality of experience (QoE)
    § Critical component for operators
    § High-impact business for operators
    Operators must ensure optimal
    RAN performance 24x7

  5. Emerging applications demand even more
    stringent performance requirements…

  6. (No text on this slide.)

  7. Ensuring high end-user QoE is hard

  8. Image courtesy: Alcatel-Lucent
    RAN Performance Problems Prevalent

  9. When performance problems occur…
    … users are frustrated
    When users are frustrated, operators lose money

  10. Operators must understand the
    impacting factors and diagnose
    RAN performance problems quickly

  11. Existing RAN Troubleshooting Techniques

  12. Monitor Key Performance Indicators (KPIs)
    Existing RAN Troubleshooting Techniques

  13. Monitor Key Performance Indicators (KPIs)
    Existing RAN Troubleshooting Techniques
    Drops: 0
    Drops: 10
    Drops: 0
    Drops: 1
    Drops: 0
    Drops: 3

  14. Monitor Key Performance Indicators (KPIs)
    Existing RAN Troubleshooting Techniques
    Drops: 0
    Drops: 10
    Drops: 0
    Drops: 1
    Drops: 0
    Drops: 3
    Poor KPI → (mostly manual) root cause analysis
    Drops: 10
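
    A minimal sketch of this monitoring step, assuming a hypothetical per-cell drop log and an illustrative threshold (neither is specified in the deck):

        # Sketch: flag cells whose drop count exceeds an illustrative threshold.
        from collections import Counter

        def flag_poor_cells(drop_events, threshold=5):
            """drop_events: iterable of cell IDs, one entry per dropped connection (assumed format)."""
            drops_per_cell = Counter(drop_events)
            return sorted(cell for cell, n in drops_per_cell.items() if n >= threshold)

        # The cell with 10 drops is flagged for (mostly manual) root-cause analysis.
        print(flag_poor_cells(["cellA"] * 10 + ["cellB"] * 1 + ["cellC"] * 3))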

  15. Root cause analysis involves field trials

  16. {Technicians +
    Equipment} x
    Multiple Trips
    =
    Multi-billion $$$/year
    Root cause analysis involves field trials

  17. {Technicians +
    Equipment} x
    Multiple Trips
    =
    Multi-billion $$$/year
    $22 B spent per year on network
    management & troubleshooting!
    Root cause analysis involves field trials

  18. Can cellular network operators
    automate the diagnosis of RAN
    performance problems?

  19. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.

  20. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.
    § Studied portion of RAN for over a year
    § 13,000+ base stations serving live users

  21. A Typical Week
    [Plot: per-day values over one week (Sun–Sat)]

  22. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.
    § Studied portion of RAN for over a year
    § 13,000+ base stations serving over 2 million users
    § 1000s of trouble tickets

  23. Our Experience
    § Working with a cellular network operator
    § Tier-1 operator in the U.S.
    § Studied portion of RAN for over a year
    § 13,000+ base stations serving over 2 million users
    § 1000s of trouble tickets
    § Significant effort by the operator to resolve them

  24. Existing RAN Troubleshooting
    § Slow & Ineffective
    § Many problems incorrectly diagnosed
    § Source of disagreements
    § Which team should solve this problem?
    § Wasted efforts
    § Known root-causes
    § Recurring problems

  25. Existing RAN Troubleshooting
    § Slow & Ineffective
    § Many problems incorrectly diagnosed
    § Source of disagreements
    § Which team should solve this problem?
    § Wasted efforts
    § Known root-causes
    § Recurring problems
    Need more fine-grained information for diagnosis

  26. Fine-grained Information
    § Logging everything ideal, but impossible

  27. Fine-grained Information
    [LTE architecture diagram: User Equipment (UE) – Base Station (eNodeB) – Serving Gateway (S-GW) – Packet Gateway (P-GW) – Internet, with the Mobility Management Entity (MME) and Home Subscriber Server (HSS) on the control plane; control-plane and data-plane paths shown]
    § Logging everything ideal, but impossible

  28. Fine-grained Information
    [Same LTE architecture diagram, highlighting the radio bearer between UE and eNodeB]
    § Logging everything ideal, but impossible

  29. Fine-grained Information
    [Same diagram, highlighting the radio bearer and the GTP tunnel between eNodeB and the gateways]
    § Logging everything ideal, but impossible

  30. Fine-grained Information
    [Same diagram, highlighting the radio bearer, the GTP tunnel, and the end-to-end EPS bearer between UE and P-GW]
    § Logging everything ideal, but impossible

  31. Fine-grained Information
    § Control plane procedures

  32. RRC Connection Re-establishment
    [Message sequence among UE, eNodeB, and MME: radio link failure detected →
    RRC Connection Re-establishment Request → UE context established →
    RRC Connection Re-establishment Complete → RRC Connection Reconfiguration;
    on supervision timer expiry: UE Context Release Request →
    UE Context Release Command → UE Context Release Complete]
    Fine-grained Information
    § Control plane procedures
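
    As an illustration of how such procedure logs could be mined, a sketch under an assumed per-bearer event format (the message names come from the slide above; the detection rule is an assumption, not the paper's definition):

        # Sketch: count failed RRC connection re-establishments in an assumed per-bearer log.
        def failed_reestablishments(events):
            """events: ordered procedure names for one bearer (assumed log format)."""
            failures, pending = 0, False
            for ev in events:
                if ev == "RRC Connection Re-establishment Request":
                    pending = True
                elif ev == "RRC Connection Re-establishment Complete":
                    pending = False
                elif ev == "UE Context Release Complete" and pending:
                    failures += 1   # request never completed before the UE context was released
                    pending = False
            return failures

        print(failed_reestablishments([
            "RRC Connection Re-establishment Request",
            "UE Context Release Request",
            "UE Context Release Command",
            "UE Context Release Complete",
        ]))  # -> 1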

  33. Fine-grained Information
    § Control plane procedures
    Bearer traces and control plane procedure logs
    provide necessary fine-grained information for
    efficient diagnosis

  34. Fine-grained Information
    § Control plane procedures
    Bearer traces and control plane procedure logs
    provide necessary fine-grained information for
    efficient diagnosis
    A 3-step approach to leveraging rich bearer-level
    traces for RAN performance diagnosis

  35. Step 1: Isolate Problems to RAN
    End-User QoE

  36. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related

  37. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Internet

  38. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet

  39. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet
    RAN Core

  40. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet
    RAN Core

  41. Step 1: Isolate Problems to RAN
    End-User QoE
    Client Related Cellular Network Internet
    RAN Core
    Leverage existing trouble ticket system

  42. Step 2: Classify RAN Problems

  43. Step 2: Classify RAN Problems
    Coverage

  44. Step 2: Classify RAN Problems
    Coverage Interference

  45. Step 2: Classify RAN Problems
    Coverage Interference Congestion

  46. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration

  47. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes

  48. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes
    Others

  49. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes
    Others

  50. Step 2: Classify RAN Problems
    Coverage Interference Congestion
    Configuration
    Network State
    Changes
    Others
    RSRP RSRQ CQI SINR eNodeB Params
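
    A sketch of what this classification step could look like, using illustrative thresholds and only a subset of the classes (the actual rules and cut-offs are not given in the deck):

        # Sketch: assign a bearer to a problem class from its radio measurements.
        # Thresholds and rule order are illustrative, not taken from the paper.
        def classify_ran_problem(rsrp_dbm, rsrq_db, sinr_db, prb_utilization):
            if rsrp_dbm < -115:
                return "Coverage"        # reference signal power is weak
            if rsrq_db < -15 or sinr_db < 0:
                return "Interference"    # signal present but quality is poor
            if prb_utilization > 0.9:
                return "Congestion"      # cell resources nearly exhausted
            return "Others"              # Configuration / Network State Changes need other inputs

        print(classify_ran_problem(rsrp_dbm=-120, rsrq_db=-10, sinr_db=5, prb_utilization=0.4))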

  51. Step 3: Model KPI at Bearer Level

  52. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility

  53. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features

  54. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Event Metrics

  55. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Event Metrics Non-Event/Volume Metrics

  56. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Classification Models
    Event Metrics Non-Event/Volume Metrics

  57. Step 3: Model KPI at Bearer Level
    Accessibility Retainability PHY layer Throughput
    Quality Traffic Volume Connection Counts Mobility
    Model performance metrics at bearer level using
    classification bin parameters as features
    Classification Models Regression Models
    Event Metrics Non-Event/Volume Metrics
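
    A sketch of this split, with scikit-learn as an assumed toolchain and an illustrative feature subset (the deck does not name a library): event metrics such as drops feed a classification model, volume metrics such as throughput feed a regression model, both over per-bearer features.

        # Sketch: event metric -> classification model, volume metric -> regression model.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

        # Per-bearer features: [uplink SINR (dB), RSRQ (dB), CQI] -- an illustrative subset.
        X = np.array([[-9.0, -12.0, 7.0],
                      [-13.0, -18.0, 4.0],
                      [-3.0, -9.0, 11.0]])
        dropped = np.array([0, 1, 0])                  # event metric: drop / success
        throughput_mbps = np.array([12.0, 1.5, 30.0])  # volume metric

        drop_model = DecisionTreeClassifier(max_depth=3).fit(X, dropped)
        tput_model = DecisionTreeRegressor(max_depth=3).fit(X, throughput_mbps)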

  58. Bearer-level Modeling for Event Metrics
    § Connection Drops
    § Key retainability
    metric
    § How well network can
    complete
    connections

  59. [Plot: Drop Rate (%) vs Day (Sun–Sat)]
    Bearer-level Modeling for Event Metrics
    § Connection Drops
    § Key retainability
    metric
    § How well network can
    complete
    connections

  60. [Plot: Drop Rate (%) vs Day (Sun–Sat)]
    Bearer-level Modeling for Event Metrics
    § Connection Drops
    § Key retainability
    metric
    § How well network can
    complete
    connections
    Build decision trees to explain drops
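
    A sketch of this step, assuming scikit-learn and synthetic bearer features: fit a tree on drop/success labels and print its rules, which yields thresholds of the kind shown on the next slides.

        # Sketch: learn an explainable decision tree for connection drops (synthetic data).
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, export_text

        rng = np.random.default_rng(0)
        X = np.column_stack([rng.uniform(-20, 20, 500),   # uplink SINR (dB)
                             rng.uniform(-20, -5, 500),   # RSRQ (dB)
                             rng.uniform(0, 15, 500)])    # CQI
        # Synthetic labels: drops concentrate at low uplink SINR or poor RSRQ.
        y = ((X[:, 0] < -11.75) | (X[:, 1] < -16.5)).astype(int)

        tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
        print(export_text(tree, feature_names=["uplink_sinr", "rsrq", "cqi"]))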

  61. [Decision tree: splits on RSRQ availability, Uplink SINR > −11.75, RSRQ > −16.5,
    Uplink SINR > −5.86, and CQI > 5.875; leaves labeled Success or Drop]
    Bearer-level Modeling for Event Metrics

  62. (Same decision tree as slide 61)
    Bearer-level Modeling for Event Metrics

  63. (Same decision tree as slide 61)
    Bearer-level Modeling for Event Metrics

  64. Findings from Call Drop Analysis
    [Histogram: Probability vs RSRQ (dB) — Reference Signal Quality at the UE]

  65. Findings from Call Drop Analysis
    [Histograms: Probability vs RSRQ (dB) — Reference Signal Quality at the UE;
    Probability vs CQI — Channel Quality at the UE]

  66. Findings from Call Drop Analysis
    [Histograms: Probability vs RSRQ (dB) — Reference Signal Quality at the UE;
    Probability vs CQI — Channel Quality at the UE]
    Shapes not identical (should be, ideally)

  67. [CDF: Probability vs SINR Difference (dB), −5 to 20 dB, for ρ = 1/3, 1, 5/3]
    Findings from Call Drop Analysis

  68. [CDF: Probability vs SINR Difference (dB), −5 to 20 dB, for ρ = 1/3, 1, 5/3]
    Finding: P-CQI is not CRC protected!
    Findings from Call Drop Analysis

  69. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about

  70. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    Paper excerpt:
    "…diversity, a 29% overhead for each PRB exists on average because of
    resources allocated to the physical downlink control channel, physical
    broadcast channel and reference signals. The physical layer has a
    BLER target of 10%.
    Account for MAC Sub-layer Retransmissions: The MAC sub-layer
    performs retransmissions. We denote the MAC efficiency as η_MAC.
    It is computed as the ratio of total first transmissions over total
    transmissions. We compute η_MAC using our traces. The predicted
    throughput due to transmit diversity is calculated as:
    tput_RLC,div = (1.0 − η_MAC) × 0.9 × (1 − 0.29) × 180 × PRB_div × log2(1 + SINR_div) / TxTime_div
    PRB_div denotes the total PRBs allocated for transmit diversity.
    TxTime_div is the total transmission time for transmit diversity."

  71. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callout: η_MAC = MAC efficiency)

  72. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callouts: η_MAC = MAC efficiency, PRB_div = # physical resource blocks)

  73. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callouts: η_MAC = MAC efficiency, PRB_div = # physical resource blocks, TxTime_div = transmission time)

  74. Bearer-level Modeling for Volume Metrics
    § Throughput
    § Key metric users really care about
    (Paper excerpt repeated from slide 70; callouts: η_MAC = MAC efficiency, PRB_div = # physical resource blocks, TxTime_div = transmission time, SINR_div = link-adapted SINR)
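
    A small sketch of the predicted-throughput formula reproduced on slide 70 (the dB-to-linear SINR conversion and the example inputs are assumptions; 0.9, 0.29, and 180 come from the excerpt's BLER target, PRB overhead, and the 180 kHz bandwidth of a PRB):

        # Sketch: predicted RLC throughput for transmit diversity, per the formula on slide 70.
        import math

        def predicted_tput_rlc_div(eta_mac, prb_div, sinr_div_db, tx_time_div):
            """eta_mac: MAC efficiency; prb_div: PRBs for transmit diversity;
            sinr_div_db: link-adapted SINR in dB (conversion to linear is an assumption);
            tx_time_div: total transmission time for transmit diversity."""
            sinr_linear = 10 ** (sinr_div_db / 10.0)
            bits = (1.0 - eta_mac) * 0.9 * (1 - 0.29) * 180 * prb_div * math.log2(1 + sinr_linear)
            return bits / tx_time_div

        print(predicted_tput_rlc_div(eta_mac=0.1, prb_div=1000, sinr_div_db=10.0, tx_time_div=0.5))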

  75. [CDF: Probability (0.86–1.0) vs Prediction Error (%), 0–100]
    Bearer-level Modeling for Volume Metrics

  76. [CDF: Probability (0.86–1.0) vs Prediction Error (%), 0–100]
    Bearer-level Modeling for Volume Metrics
    Works well in most scenarios

  77. Findings from Throughput Analysis
    § Problematic for some cells

  78. Findings from Throughput Analysis
    § Problematic for some cells
    § To understand why, computed loss of efficiency

  79. Findings from Throughput Analysis
    [Distribution: Probability vs SINR Loss (dB), 0–25]
    § Problematic for some cells
    § To understand why, computed loss of efficiency
    SINR implied by actual throughput vs. SINR computed from parameters

  80. Findings from Throughput Analysis
    [Distribution: Probability vs SINR Loss (dB), 0–25]
    § Problematic for some cells
    § To understand why, computed loss of efficiency
    SINR implied by actual throughput vs. SINR computed from parameters
    Finding: Link adaptation is slow to adapt!
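
    One way to compute that loss, sketched by inverting the slide-70 throughput formula (this inversion and the reuse of its constants are assumptions, not code from the paper):

        # Sketch: SINR loss = SINR computed from parameters minus SINR implied by actual throughput.
        import math

        def implied_sinr_db(actual_tput, eta_mac, prb_div, tx_time_div):
            # Invert the slide-70 formula for SINR (same illustrative constants and units).
            spectral_eff = actual_tput * tx_time_div / ((1.0 - eta_mac) * 0.9 * (1 - 0.29) * 180 * prb_div)
            return 10 * math.log10(2 ** spectral_eff - 1)

        def sinr_loss_db(param_sinr_db, actual_tput, eta_mac, prb_div, tx_time_div):
            return param_sinr_db - implied_sinr_db(actual_tput, eta_mac, prb_div, tx_time_div)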

  81. Can cellular network operators
    automate the diagnosis of RAN
    performance problems?

  82. Towards Full Automation
    § Our methodology is amenable to automation
    § Models can be built on-demand automatically

  83. Usefulness to the Operator
    [Plot: Drop Rate (%) vs Day (Sun–Sat)]

  84. Usefulness to the Operator
    [Plot: Drop Rate (%) (0.3–0.7) vs Day (Sun–Sat); two curves: Total and Explained]

  85. Towards Full Automation
    § Our methodology is amenable to automation
    § Models can be built on-demand automatically
    § Full automation for next generation networks:
    § Need to build 1000s of models
    § Need to keep the models updated
    § Need real-time diagnosis

  86. We’ve made some progress towards this…
    § Cells exhibit performance similarity
    May be able to group cells by performance
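
    A sketch of what such grouping could look like; k-means over per-cell KPI vectors is an illustrative choice, and the features and data below are synthetic:

        # Sketch: group cells by their KPI vectors (synthetic data, illustrative features).
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(1)
        # One row per cell: [drop rate (%), mean throughput (Mbps), mean SINR (dB)]
        cell_kpis = np.vstack([rng.normal([0.4, 20.0, 12.0], 0.5, (50, 3)),
                               rng.normal([2.0, 5.0, 3.0], 0.5, (50, 3))])
        groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cell_kpis)
        print(np.bincount(groups))   # cells per performance group; models could be built per group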

  87. Summary
    § Experience working with a tier-1 operator
    § 2 million users, over a period of 1 year
    § Leveraging bearer-level traces could be the
    key to automating RAN diagnosis
    § Proposed bearer-level modeling
    § Unearthed several insights
    § Fully automated diagnosis needs more effort
