Issues with login

Incident Report for Marketplace® Simulations

Postmortem

What happened?

During scheduled maintenance, a new Apache configuration was deployed as part of routine infrastructure updates. After the maintenance window concluded, services appeared to be operating normally; however, an issue affecting the legacy LTI 1.1 integration associated with the game.ilsworld.com domain was not detected during initial validation.

When users began reporting problems accessing the system through this integration, troubleshooting efforts initially misidentified the source of the issue. During the investigation, an improper restart of the upstream Apache servers caused a temporary full outage of web services. Once the root cause was identified and services were correctly restarted, full functionality was restored.

Timeline

  • 3:00 pm: Maintenance started.
  • 3:07 pm: Maintenance finished; no issues were observed with the performance of the website.
  • 4:36 pm: First report received that LTI 1.1 was not working properly. All services for users not using LTI 1.1 were working properly.
  • 4:50 pm: Determined that only users connecting through LTI 1.1 to https://game.ilsworld.com/lti/ were having issues. All services for other users were working properly.
  • 4:54 pm: All web servers were restarted, but the restart was performed improperly. No services were operational.
  • 5:36 pm: All services operational.

How did this impact customers?

The initial impact, which began at 3:07 pm and was first reported at 4:36 pm, was limited to customers using the LTI 1.1 integration linking to the game.ilsworld.com domain. During troubleshooting of this issue, a misstep at 4:54 pm resulted in the complete loss of web services, which lasted until 5:36 pm.

How did this incident occur?

The incident resulted from a combination of a configuration gap, incomplete testing coverage, and an operational error during troubleshooting.

During scheduled maintenance, a new Apache configuration was deployed. The configuration did not correctly route certain requests associated with the legacy LTI 1.1 integration using the game.ilsworld.com/lti/ endpoint. Because this legacy integration path was not included in the post-maintenance testing checklist, the issue was not detected when the maintenance window concluded.

When the first user reports were received, the issue was initially misdiagnosed as a problem with the upstream Apache servers. As part of troubleshooting, all upstream Apache servers were restarted. The restart procedure was executed incorrectly, which temporarily made the upstream web services unavailable and resulted in a full outage until services were restored.

Why wasn't this caught in a testing environment?

The issue was not detected during post-maintenance testing because the testing checklist did not include validation of the legacy LTI 1.1 integration that uses the game.ilsworld.com domain. As a result, the routing behavior for this specific integration path was not exercised after the new Apache configuration was deployed.

In addition, our nginx configuration did not include sufficient fallback routing to ensure that requests associated with this integration were forwarded to the appropriate Apache upstream clusters. Because the affected integration path was not tested, this configuration gap remained undetected until users began accessing the system through LTI 1.1.

Why did recovery take so long?

Following the completion of the scheduled maintenance, services were verified as operational, and the maintenance window was closed. After this point, only limited infrastructure staff remained available on-site. When the first support ticket was received at 4:36 pm, troubleshooting began immediately.

However, corrective actions focused on restarting the Apache upstream servers, which did not resolve the underlying problem and temporarily resulted in a broader service outage. Once additional infrastructure staff became available and joined the troubleshooting effort, the root cause was identified, and the services were restored at 5:36 pm.

Remediation and follow-up

Expand Post-Maintenance Testing Coverage
Post-maintenance testing procedures will be updated to include validation of legacy integrations, including LTI 1.1 connections to the game.ilsworld.com domain, to ensure that all supported access paths are verified before maintenance is considered complete.
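
As an illustration only, an automated post-maintenance checklist could look something like the sketch below. It is not our actual tooling. The `Check` type, the `CHECKLIST` contents, the accepted status codes, and the example.com placeholder URL are all assumptions made for the example; only the https://game.ilsworld.com/lti/ address comes from this report. A real LTI 1.1 validation would likely need a signed launch request instead of a plain GET.

```python
from dataclasses import dataclass
import urllib.error
import urllib.request


@dataclass(frozen=True)
class Check:
    name: str
    url: str
    # Statuses treated as "reachable" for this path (an assumption).
    ok_statuses: frozenset = frozenset({200, 302})


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, etc.


def run_checklist(checks, fetch=http_status):
    """Return a description of each failing check; empty means all passed."""
    failures = []
    for check in checks:
        status = fetch(check.url)
        if status not in check.ok_statuses:
            failures.append(f"{check.name}: got HTTP {status}")
    return failures


# Hypothetical checklist: every supported access path gets an entry,
# including the legacy integration that was missed in this incident.
CHECKLIST = [
    Check("Main site (placeholder URL)", "https://example.com/"),
    Check("Legacy LTI 1.1 endpoint", "https://game.ilsworld.com/lti/"),
]
```

Calling `run_checklist(CHECKLIST)` with the default fetcher performs real requests; the maintenance window would be closed only if the returned list is empty.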

Improve Configuration Resilience
The nginx configuration will be reviewed and enhanced to ensure more robust routing and fallback handling so that requests can be properly directed to available Apache upstream clusters even if one routing path encounters issues.
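
The idea of fallback routing can be shown with a small model. The sketch below is not nginx configuration and does not reflect our production setup; the cluster names, the routing table, and the `pick_upstream` function are hypothetical. It only illustrates the intended behavior: use the most specific matching route, and fall back to another available cluster when that route's cluster is unavailable or no route matches.

```python
def pick_upstream(host, path, routes, healthy, default):
    """Choose an upstream cluster for a request.

    routes  : dict mapping (host, path_prefix) -> cluster name
    healthy : set of cluster names currently able to serve traffic
    default : cluster used when no healthy route matches
    Returns a cluster name, or None if nothing healthy is available.
    """
    candidates = [
        (prefix, cluster)
        for (route_host, prefix), cluster in routes.items()
        if route_host == host and path.startswith(prefix)
    ]
    # Try the most specific (longest) matching prefix first.
    for _prefix, cluster in sorted(candidates, key=lambda c: len(c[0]), reverse=True):
        if cluster in healthy:
            return cluster
    # No healthy explicit route: fall back to the default cluster.
    return default if default in healthy else None


# Hypothetical routing table for illustration.
ROUTES = {
    ("game.ilsworld.com", "/lti/"): "apache_legacy_lti",
    ("game.ilsworld.com", "/"): "apache_main",
}
```

With this table, a request for `/lti/launch` on game.ilsworld.com goes to `apache_legacy_lti` when that cluster is healthy and to `apache_main` otherwise, so a single broken routing path does not leave the request without a destination.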

Strengthen Post-Maintenance Monitoring
Infrastructure staff will maintain extended monitoring coverage following scheduled maintenance windows. At least two members of the Infrastructure team will remain available for a defined observation period after maintenance concludes to promptly respond to any issues that arise.

Review Operational Procedures
Operational procedures for restarting critical infrastructure components will be reviewed and documented to ensure that restart operations are performed consistently and safely during troubleshooting.
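
One common safeguard for restarts is to proceed one server at a time and stop as soon as a server fails to come back, so that a faulty restart cannot take down every server at once. The sketch below illustrates that pattern under stated assumptions; it is not the documented procedure referred to above, and this report does not specify what went wrong in the restart. The `restart` and `is_healthy` callables are hypothetical hooks that an operator would supply.

```python
import time


def rolling_restart(servers, restart, is_healthy,
                    wait_seconds=5.0, max_checks=12, sleep=time.sleep):
    """Restart servers one at a time, verifying each before moving on.

    restart(server)    : hypothetical hook that restarts one server
    is_healthy(server) : hypothetical hook returning True once it serves traffic
    Returns the list of servers restarted and verified.
    Raises RuntimeError at the first server that does not recover,
    leaving the remaining servers untouched.
    """
    done = []
    for server in servers:
        restart(server)
        for _ in range(max_checks):
            if is_healthy(server):
                break
            sleep(wait_seconds)
        else:
            # The health check never passed: halt before touching more servers.
            raise RuntimeError(
                f"{server} did not recover; halted after restarting {done}"
            )
        done.append(server)
    return done
```

Because the loop halts on the first failure, the servers that have not yet been restarted keep serving traffic while the problem is investigated.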

Posted Mar 05, 2026 - 10:50 EST

Resolved

This incident has been resolved.
Posted Mar 03, 2026 - 18:12 EST

Update

We are continuing to investigate this issue.
Posted Mar 03, 2026 - 18:03 EST

Investigating

We are currently investigating issues with login to our gameplay site.
Posted Mar 03, 2026 - 17:34 EST
This incident affected: Marketplace® Simulations Web Services and Marketplace® Simulations Game Processing.