Wi-Fi Issues Fall 2021 - Spring 2022

As of March 2, 2022, recent Wi-Fi issues are resolved

Troubleshooting efforts in January identified another bug in the vendor software that caused authentication errors and timeouts, which in turn made devices unresponsive when connecting to campus Wi-Fi. The vendor developed a bug fix, which was tested and validated in the campus Wi-Fi system, but applying the permanent fix to the system caused additional issues. The changes were backed out, and the vendor developed an updated fix, which was applied to the system overnight March 1-2. These changes have resolved the known system issues that were causing campus-wide Wi-Fi authentication and connectivity problems. Individual connection issues may still occur for various reasons, including local access point failures and device configurations. To help us address any new issues that arise, please report all Wi-Fi issues to the service desk (contact info below) so that we can troubleshoot and resolve them.

I thought the problems with Wi-Fi were fixed in December. Why are we having Wi-Fi issues again?

In January, a previously unseen issue was detected that caused additional campus-wide problems. New versions of device operating systems created malformed network packets that corrupted the Wi-Fi system, causing connectivity problems. Workarounds were implemented to mitigate the issue while the vendor updated the error handling in the Wi-Fi system software to properly handle these conditions. This additional vendor bug fix was installed on the campus Wi-Fi system overnight March 1-2, 2022, to permanently address the issue.

Next Steps

Our goal is to minimize disruptions to accessing campus Wi-Fi. However, because ongoing updates are needed to maintain the system, there is unfortunately always potential for unforeseen issues to arise. We will continue to monitor the system, sensors, and user-reported tickets to detect persistent system health problems and other issues that broadly affect user experience. Individual issues with connecting to eduroam will be addressed on a case-by-case basis.

How You Can Help 

If you encounter persistent issues connecting to Wi-Fi and need assistance:

  • Students with personal devices - Drop-in IT support for students is available in Eshleman Hall (1st floor) and Doe Library (190 Doe); see hours of operation. You can also contact Student Technology Services at 510-642-4357 or email sts-help@berkeley.edu

  • Faculty, Staff, and Student employees - Appointments will be available via remote support tools and in-person at your campus location. Drop-in support for faculty and staff is available in Dwinelle Hall, Room 128, across from the Academic Innovation Studio, Monday through Friday, 9 a.m. to 3 p.m. You can also contact the ITCS Service Desk at 510-664-9000 (option 1) or submit a ticket

The information you provide will help us better understand the impact of the problems and improve our troubleshooting efforts. When reporting an issue, the following details help us diagnose, troubleshoot, and resolve it: time of day; specific location/building/room; which network you were using (eduroam, AirBears2, or CalVisitor); type of device you are using; specific problem(s) encountered; screenshots of error messages; how often you have experienced this at similar/different locations within the past 24-48 hours.
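For anyone gathering these details in a structured way (for example, a departmental intake form), the list above could be captured as a simple record. This is a minimal sketch only: the class name, field names, and the `missing_details` helper are hypothetical and are not part of any Berkeley IT ticketing system.

```python
from dataclasses import dataclass, field, fields
from typing import List

# Network names as they appear in the reporting guidance above.
KNOWN_NETWORKS = ("eduroam", "AirBears2", "CalVisitor")


@dataclass
class WifiIssueReport:
    """Hypothetical record holding the details requested in a Wi-Fi report."""
    time_of_day: str = ""
    location: str = ""          # building and room
    network: str = ""           # one of KNOWN_NETWORKS
    device_type: str = ""
    problem: str = ""
    screenshots: List[str] = field(default_factory=list)  # file names or links
    recurrence: str = ""        # how often in the past 24-48 hours


def missing_details(report: WifiIssueReport) -> List[str]:
    """Return the names of details that are blank or not recognized."""
    missing = [f.name for f in fields(report) if not getattr(report, f.name)]
    # A filled-in but unrecognized network name is also worth flagging.
    if report.network and report.network not in KNOWN_NETWORKS:
        missing.append("network")
    return missing


report = WifiIssueReport(
    time_of_day="10:05 a.m.",
    location="Dwinelle Hall, Room 128",
    network="eduroam",
)
print(missing_details(report))
```

Checking for blank fields before a report is submitted is one way to make sure the service desk receives everything it needs on the first contact.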

Spring 2022 Wi-Fi Issues & Response Timeline

  • Jan. 25-27 - Berkeley IT (bIT) identified an issue with campus Wi-Fi that was impacting service in multiple locations. The issue was related to a bug in our vendor’s software and the way it interacted with a recently released set of user devices and operating systems. It probably existed before Jan. 25 and had likely been causing sporadic problems across campus since December or January, but it only became apparent when a large volume of users returned to campus after winter break and the delayed start of in-person classes. Over time the issue spread, affecting more and more of the Wi-Fi coverage on campus. A workaround was identified that involved disabling the fast-roaming protocol, a technology that improves scalability and ensures consistent connectivity for users as they move from place to place. Berkeley IT implemented this workaround pending a software update to resolve the issue.

  • Feb. 7-10 - Wi-Fi on campus experienced significant performance and connectivity issues in many locations. Initial investigation by Berkeley IT identified performance issues on two of the controllers that manage connectivity and network service for campus Wi-Fi (part of a redundant set of controllers, of which there are currently 9). Further investigation found that the problem resulted from two combined issues:

    • A network misconfiguration elsewhere on campus had resulted in a loop that caused the controllers to be impacted by unexpectedly high levels of traffic. This loop was exacerbated by the infrastructure in place to maintain connectivity for the remaining obsolete Wi-Fi infrastructure. Berkeley IT is working to replace the infrastructure with current systems that match the rest of the campus Wi-Fi network.

    • The impacted controllers had recently been added to the cluster. Deeper investigation found that the other controllers had a configuration in place that made them less susceptible to loops like this. The new controllers had not automatically inherited this configuration when they were added to the cluster.

  • Feb. 10 - The above issues were fully identified and resolved. The loop was removed, and the configuration of the two new cluster members was aligned with the rest of the cluster to prevent a recurrence.

  • Feb. 11 - Our vendor provided a software update to resolve the issues experienced on Jan. 25-27. We installed this update but discovered that one of the changes it contained had not been adequately tested by the vendor and caused Wi-Fi controllers to crash repeatedly. Thanks to the redundancy of the controller cluster, service was maintained, but users across campus experienced disruption when the controller they were connected through crashed and their service migrated to a still-operational controller. Berkeley IT reverted to the previous software version to resolve this issue.

  • Feb. 22-23 - One of the Wi-Fi controllers for campus crashed and rebooted on the evening of February 22nd. When it rejoined the cluster it did so in an unstable state, causing some users to appear to be online, but unable to use the network. The controller was removed from the cluster on February 23rd to permit troubleshooting and this action restored service for campus users.

  • March 1 - Our vendor provided a new software update to resolve the issues originally experienced in late January. Berkeley IT applied this update and was able to restore the fast-roaming protocol that had been disabled in January. In addition, the issue with the controller that crashed on Feb. 22 was identified: a configuration was no longer present on the controller, which was impairing its ability to properly rejoin the cluster. This issue was resolved and the controller was returned to the cluster.
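The Feb. 7-10 and March 1 entries both involve a controller whose configuration differed from the rest of the cluster. A check along the following lines could surface that kind of drift. It is a hypothetical sketch, not Berkeley IT's tooling or the vendor's API: the controller names and the `loop_protection` setting are invented for illustration, and each controller's configuration is assumed to be available as a plain dictionary of hashable values.

```python
from collections import Counter


def find_config_drift(configs):
    """Compare each controller's settings against the cluster majority.

    configs maps controller name -> {setting: value}.
    Returns {controller: {setting: (actual, expected)}} for every mismatch.
    A setting that is absent on a controller is reported with actual = None.
    """
    all_keys = set()
    for settings in configs.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        # The most common value across the cluster is treated as expected.
        values = [settings.get(key) for settings in configs.values()]
        expected, _ = Counter(values).most_common(1)[0]
        for name, settings in configs.items():
            actual = settings.get(key)
            if actual != expected:
                drift.setdefault(name, {})[key] = (actual, expected)
    return drift


# Illustrative data: three established members and one newly added member
# that did not inherit a (hypothetical) loop-protection setting.
cluster = {
    "controller-1": {"loop_protection": "enabled"},
    "controller-2": {"loop_protection": "enabled"},
    "controller-3": {"loop_protection": "enabled"},
    "controller-new": {},
}
print(find_config_drift(cluster))
```

Treating the majority value as the expected one is a simplification; a real check would more likely compare each controller against a reviewed baseline configuration.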

As of Dec. 8, 2021, campus Fall Wi-Fi issues known at that time were fully resolved

Troubleshooting efforts in October and November identified another bug in the vendor firmware that caused authentication errors and timeouts and made devices unresponsive when connecting to campus Wi-Fi. This bug affected only campus Wi-Fi; it did not affect the access point models used in residence hall Wi-Fi. The vendor developed a bug fix, which was tested and validated in the campus Wi-Fi system, and the permanent fix was applied to the system. Additional Wi-Fi authentication hardware was also installed, adding capacity to avoid system overload during the peaks in Wi-Fi connection traffic that occur when classes end and begin throughout the day. These changes resulted in the full resolution of the known system issues causing authentication and connectivity problems at that point in time.

Why did we have Wi-Fi issues in the fall?

We started Wi-Fi improvements in FY 2018, moving from old, obsolete Cisco Wi-Fi equipment to new Aruba equipment. The Aruba platform was selected through an RFP to provide a better Wi-Fi experience across campus, especially in large classroom environments. Prior to the start of the pandemic in 2020, Aruba equipment had been deployed in the majority of instructional buildings on campus and was operating in a stable manner (this map shows the locations that have migrated to Aruba equipment, highlighted in green). Because available funding was prioritized for improving classroom spaces, the buildings that house mainly administrative functions and the sports/entertainment venues remain on the old, obsolete (7-10 year old) system.

While diagnosing the widespread start-of-semester Wi-Fi issues, we discovered that vendor software updates in late 2019 introduced bugs that cause system performance issues under peak load and high user roaming activity. Unfortunately, when COVID hit and faculty, students, and staff went fully remote, the low level of activity on campus kept these bugs from being detected. As people returned to campus in large numbers this Fall and classroom activity ramped up, these bugs manifested, causing a painful and disruptive experience for a large number of faculty, students, and staff who use campus Wi-Fi. Although these issues had been present in Aruba equipment around the world for many months, UC Berkeley was one of the first customers to detect them, due to our large environment and the load that we started placing on the system (many other college campuses were similarly affected). We engaged with the highest levels of vendor management and obtained their urgent attention, both for resolution of the issues and for prevention of future problems.

Initial resolution efforts improved the stability of the underlying infrastructure but did not fully resolve the issues experienced by faculty and students connecting to Wi-Fi on campus. The initial issue manifested as Wi-Fi access points losing connection to the system at peak times, which caused devices to lose their connections and be unable to reconnect. Mitigating this issue improved overall stability for devices that were able to connect. Subsequent analysis revealed that devices were still experiencing delays of 15 minutes or longer when connecting. In October, additional changes were implemented that largely addressed these issues, but a persistent issue remained that affected device connectivity in specific locations. Continued troubleshooting efforts in November identified another vendor software bug, which was fixed on the system in early December. Additional authentication system capacity was also added in early December, resulting in the full resolution of the remaining connectivity issues at that point in time.

Issues & Response Summary

When bIT started diagnosing the problem in the Fall of 2021, it quickly became apparent that the situation required vendor involvement, so we engaged with Aruba to help troubleshoot and diagnose the issues. Workarounds and fixes were implemented to stabilize the system and, while they fixed the problem identified, they also uncovered additional issues. Further workarounds were implemented to stabilize the system, and the vendor continued to work on a permanent resolution to the bug.

As bIT proceeded with troubleshooting, they realized that the data captured through monitoring was not providing a complete picture of the actual user experience, so they dispatched personnel to perform on-the-ground testing in some of the hardest-hit areas to better understand what was occurring. Sensors that simulate user activity were also installed to help measure user experience. This work continued throughout September and October, resulting in additional tuning while vendor work on the permanent fix continued.

Ongoing troubleshooting and analysis uncovered performance issues in the Wi-Fi authentication system when devices attempted to connect at the start of classes each day. Implementation of 802.11r ‘fast roaming’ protocols, installation of two vendor software bug fixes, several authentication system tuning changes, and the installation of additional authentication system capacity have resulted in much-improved Wi-Fi performance campus-wide and the resolution of these issues at that point in time.

Fall Wi-Fi Issues & Response Timeline

  • Pre-March 2020 - Just prior to the start of the pandemic, changes were made in the vendor software that controls and manages device/AP/controller connections on the Wi-Fi management system. These changes introduced flaws in how the system handles such connections under high load in very large, complex environments like UCB. The issue remained largely dormant during the pandemic because of the extremely low population on campus (and in other similarly large Wi-Fi environments) but became obvious during the first full week of instruction. Other higher education customers of the vendor who were remote during the pandemic and returned to campus instruction in the Fall experienced the same issues.

Fall 2021

  • Aug. 30 - bIT started diagnosing the problem when large-scale issues were first detected, with access points losing connection to the system and causing large numbers of devices to lose connection as well. It quickly became apparent that this was not something that had been experienced before, so bIT engaged with the Wi-Fi vendor to troubleshoot and diagnose the issues. Over the next two days, workarounds were identified and implemented to stabilize the system. System stability was observed from the afternoon of Sept. 2 through Sept. 7.
  • Sept. 8 - A software fix from the vendor was applied to the system, and while it fixed the problems identified, it uncovered an additional issue. Additional mitigations were implemented to stabilize the system that afternoon.
  • Sept. 9 - An instructor helpline was opened with extended hours of operation to assist with any Wi-Fi issues impacting instruction. Additional workarounds were implemented to stabilize the system and it has remained stable since then. The vendor continued to work on a permanent fix, and bIT worked with them on a plan to test, validate, and schedule the remaining fix in a way that minimizes or eliminates any further large-scale disruption to instruction.
  • Aug. 30 to Sept. 13 - As bIT proceeded with troubleshooting, they realized that the data captured through system monitoring was not providing a complete picture of the actual user experience, so they dispatched personnel to perform on-the-ground testing in some of the hardest-hit areas to fully understand what was occurring and to enhance troubleshooting efforts.
  • Sept. 13 to Oct. 4 - Additional analysis of gathered data and inspection of system settings revealed that a system setting needed to be changed. This setting has to do with the 802.11r protocol for ‘fast roaming’, and is expected to reduce authentication traffic and improve user experience.

  • Oct. 1 - Wi-Fi authentication traffic for CalVisitor routed to virtual servers to offload that traffic from the AirBears2/eduroam authentication servers. Two sensors added on campus to measure and detect user experience issues and assist with troubleshooting efforts.

  • Oct. 7 - 802.11r ‘fast roaming’ protocol implemented in the early morning, with observed positive impact and elimination of ‘timeout errors’ in the system. Additional tuning of system settings implemented in the early afternoon, resulting in a reduction of Wi-Fi association errors and overall improved Wi-Fi connection performance.

  • Oct. 8 - Three additional sensors added on campus to provide additional user experience data.

  • Oct. 12-14 - Permanent fix received from the vendor and installed to resolve the Wi-Fi controller software bug.

  • Oct. 15 - Wi-Fi authentication tuning changes implemented to reduce system congestion and improve system performance.

  • Oct. 18 - Additional vendor recommended changes implemented to improve the performance of the authentication system.

  • Oct. 21 - Re-enabled monitoring that had been disabled to avoid system overload while the permanent fix was being developed. Added temporary capacity to Wi-Fi authentication by adding virtual machine resources.
  • Oct. 25 - Timer settings changed to improve system handling of authentication timeouts.
  • Oct. 27-29 - Additional tuning changes implemented to improve authentication system performance and reduce device connection wait times. Reboots of individual Access Points to address location-specific issues affecting device connections and connection speeds.
  • Nov. 1-30 - Additional data gathering and troubleshooting resulted in another vendor bug being identified. This issue was more localized when it occurred and resulted in significant instability for connections to eduroam. This firmware bug in the Wi-Fi access points prevented authentication and caused connection failures for many users on the affected access point or in the affected location. Extensive, repeated data gathering and troubleshooting occurred during this month, resulting in identification of the bug, development of a fix, and testing and validation of the fix in the campus system.
  • Dec. 4-5 - Bug fix implemented on campus Wi-Fi, fully resolving remaining known authentication and connectivity software issues.
  • Dec. 8 - Authentication systems which handle user connections to eduroam were overwhelmed by everyone returning to campus and utilizing Wi-Fi in a new, hybrid way. Tuning during October mitigated some of these issues. Additional system capacity (hardware) was purchased, installed, and brought online the morning of Dec. 8 to fully resolve this issue.