[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T0000).
[00:00:04] tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:38] o/
[00:07:16] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:44] (PS2) Gergő Tisza: [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408)
[00:07:44] PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:25] !log uploaded docker-report 0.0.4-1~deb9u1 to stretch-wikimedia (T179696)
[00:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:29] T179696: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696
[00:09:42] (PS2) Cwhite: profile: drop ECS messages on legacy cluster [puppet] - https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565)
[00:09:58] (PS2) Legoktm: docker_registry_ha: Make registery-homepage-builder Python 3.5 compatible [puppet] - https://gerrit.wikimedia.org/r/657210 (https://phabricator.wikimedia.org/T179696)
[00:10:42] (CR) Gergő Tisza: [C: +2] [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408) (owner: Gergő Tisza)
[00:12:33] (Merged) jenkins-bot: [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408) (owner: Gergő Tisza)
[00:12:56] (PS3) Legoktm: docker_registry_ha: Make registry-homepage-builder Python 3.5 compatible [puppet] - https://gerrit.wikimedia.org/r/657210 (https://phabricator.wikimedia.org/T179696)
[00:15:25] (CR) Legoktm: [C: +2] docker_registry_ha: Make registry-homepage-builder Python 3.5 compatible [puppet] - https://gerrit.wikimedia.org/r/657210 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[00:18:24] (PS2) Gergő Tisza: Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309)
[00:21:02] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:20] SRE, MediaWiki-Containers, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) Got pretty close, one last sticking point is that `docker_report` hardcodes connecting to the registry over HTTPS. So if you try `https://localhost` th...
[00:26:54] (PS1) Legoktm: docker_registry_ha: Disable build-homepage job for now [puppet] - https://gerrit.wikimedia.org/r/657216 (https://phabricator.wikimedia.org/T179696)
[00:29:06] (CR) Legoktm: [C: +2] docker_registry_ha: Disable build-homepage job for now [puppet] - https://gerrit.wikimedia.org/r/657216 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[00:29:29] (CR) Gergő Tisza: [C: +2] Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309) (owner: Gergő Tisza)
[00:30:20] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:655863|(no-op) GrowthExperiments: Disable link recommendations (T261408)]] (duration: 01m 05s)
[00:30:21] (Merged) jenkins-bot: Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309) (owner: Gergő Tisza)
[00:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:24] T261408: Add a link engineering: Maintenance script for retrieving, caching, and updating search index - https://phabricator.wikimedia.org/T261408
[00:34:12] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:29] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:656284|Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 (T270309)]] (duration: 01m 03s)
[00:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:32] T270309: Instrument banner module on newcomer homepage - https://phabricator.wikimedia.org/T270309
[00:51:48] SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) a:Legoktm
[00:55:00] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:58:08] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:03:54] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:04:00] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:04:40] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:04:52] (PS1) Legoktm: Switch to native Debian package [docker-images/docker-report] - https://gerrit.wikimedia.org/r/657218
[01:05:16] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/pag
[01:05:16] e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:06:52] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:07:28] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test G
[01:07:28] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:09:52] PROBLEM - Check systemd state on aqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:36] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:10:48] PROBLEM - cassandra-a service on aqs1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:11:26] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:11:36] PROBLEM - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: connect to address 10.64.48.148 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:13:34] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:34] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:34] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:16:18] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test G
[01:16:18] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:30] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITI
[01:17:30] article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:32] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test G
[01:17:32] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:18:24] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pag
[01:18:24] e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:19:27] too many connections to aqs cassandra?
[01:19:41] SRE, Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (EBernhardson) We could probably cancel this? In T271493 we are fixing the data size issues which will remove the need to re-shard.
[01:19:42] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:19:56] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:19:56] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:20:56] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:23:42] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:25:20] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[01:26:32] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRI
[01:26:32] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:30:00] RECOVERY - cassandra-a service on aqs1006 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:32:14] RECOVERY - Check systemd state on aqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:42] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:34:00] RECOVERY - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is OK: TCP OK - 0.000 second response time on 10.64.48.148 port 9042 https://phabricator.wikimedia.org/T93886
[01:34:32] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:35:44] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:36:16] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:36:52] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:38:20] PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:22] PROBLEM - cassandra-a service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:48:16] RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:20] RECOVERY - cassandra-a service on aqs1004 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:54:56] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.126 port 9042 https://phabricator.wikimedia.org/T93886
[02:08:58] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:10:08] PROBLEM - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[02:11:08] PROBLEM - cassandra-b service on aqs1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:11:58] PROBLEM - Check systemd state on aqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:10] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikime
[02:12:10] ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:13:00] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:15:18] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:16:08] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:18:40] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:22:18] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:23:04] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/m
[02:23:04] file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:24:00] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.w
[02:24:00] /Services/Monitoring/aqs
[02:25:40] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:26:22] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:27:32] RECOVERY - cassandra-b service on aqs1005 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:28:24] RECOVERY - Check systemd state on aqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:54] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/pe
[02:29:54] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:29:54] RECOVERY - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.190 port 9042 https://phabricator.wikimedia.org/T93886
[02:32:14] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:32:16] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:33:08] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:33:56] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:19:04] SRE, Traffic, Wikimedia-Logstash, observability, and 3 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (Krinkle) a:Krinkle→None
[03:19:30] SRE, Traffic, Wikimedia-Logstash, observability, and 3 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (Krinkle) Open→Resolved a:Krinkle
[03:58:00] (PS1) Patsagorn Y.: Create patroller user group for thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149)
[03:58:02] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[04:21:29] (PS2) Patsagorn Y.: Create patroller user group for thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149)
[04:41:34] (CR) HitomiAkane: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[04:46:19] (CR) HitomiAkane: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[05:01:47] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[05:42:04] (PS1) Andrew Bogott: Revert "nova: install the novavendordata api in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/657116
[05:45:38] (PS1) Andrew Bogott: Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117
[05:46:08] (CR) jerkins-bot: [V: -1] Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117 (owner: Andrew Bogott)
[05:59:25] (CR) Andrew Bogott: [C: +2] Revert "nova: install the novavendordata api in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/657116 (owner: Andrew Bogott)
[06:10:24] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[06:20:24] (PS1) Andrew Bogott: nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273)
[06:21:09] (CR) jerkins-bot: [V: -1] nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[06:29:54] (PS2) Andrew Bogott: nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273)
[06:45:15] SRE, LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (Legoktm) Sidenote: if this is straightforward to do, it would be nice if we could create an LDAP group of users in the admin `deployment` group, so we can replace the manuall...
[06:57:47] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[07:32:57] (PS5) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[07:37:04] (CR) Ryan Kemper: "Right now this rips out all the old `relforge100[1,2]` stuff, but also has the new `relforge100[3,4]` stuff in it. I imagine I'll want to " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[07:44:23] SRE, Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (dcausse) Open→Declined Agreed, we should re-assess the shard sizes end of March 2021.
[07:44:59] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 13 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27531/console" [puppet] - https://gerrit.wikimedia.org/r/656833 (https://phabricator.wikimedia.org/T267175) (owner: DCausse)
[07:54:52] (CR) Ryan Kemper: [V: +1 C: +2] [wdqs] disable async imports [puppet] - https://gerrit.wikimedia.org/r/656833 (https://phabricator.wikimedia.org/T267175) (owner: DCausse)
[08:05:02] (CR) Muehlenhoff: [C: +2] Add cuminunpriv1001 [puppet] - https://gerrit.wikimedia.org/r/657129 (owner: Muehlenhoff)
[08:13:13] (CR) Gehel: [C: -1] "minor syntax issue, see comment inline" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[08:16:00] (CR) Muehlenhoff: [C: +2] Bump timeout for accessing RAID in smart_data_dump [puppet] - https://gerrit.wikimedia.org/r/657060 (owner: Muehlenhoff)
[08:18:22] (CR) Ayounsi: interface_automation: Clean up old interfaces on run (6 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[08:19:54] (CR) Ayounsi: [C: +1] interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - https://gerrit.wikimedia.org/r/657207 (owner: CRusnov)
[08:20:43] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (MoritzMuehlenhoff) OSI statement: https://opensource.org/node/1099
[08:23:20] (PS1) Nikerabbit: Add flag to toggle the usage of the group synchronization cache [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428)
[08:26:36] (PS1) Elukey: varnish: avoid python-request UA bots for AQS [puppet] - https://gerrit.wikimedia.org/r/657288
[08:30:22] (CR) Elukey: "https://w.wiki/uzb is the traffic mentioned above for reference.." [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[08:35:30] (CR) Ema: varnish: avoid python-request UA bots for AQS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[08:40:31] (PS2) Elukey: varnish: block python-request UA bots for AQS [puppet] - https://gerrit.wikimedia.org/r/657288
[08:43:55] (PS1) Muehlenhoff: Add Tyler as approval contact for Phabricator-related groups [puppet] - https://gerrit.wikimedia.org/r/657291
[08:52:42] (CR) jerkins-bot: [V: -1] Add flag to toggle the usage of the group synchronization cache [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: Nikerabbit)
[08:53:14] (CR) Nikerabbit: "recheck" [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: Nikerabbit)
[08:58:42] (PS2) Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[08:59:56] (CR) jerkins-bot: [V: -1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[09:01:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2018.codfw.wmnet
[09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:13] (PS1) Gergő Tisza: [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/657292
[09:04:04] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[09:07:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:08:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2018.codfw.wmnet
[09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:13] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2019.codfw.wmnet
[09:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2019.codfw.wmnet
[09:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:54] !log configure Lumen interfaces
[09:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:11] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T270439 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:22:27] (PS3) Filippo Giunchedi: debian: add packaging [debs/phalerts] - https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453)
[09:23:02] (CR) Effie Mouzeli: [C: +1] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: Ahmon Dancy)
[09:23:31] (CR) Nikerabbit: [C: +2] "Backport" [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: Nikerabbit)
[09:24:17]
!log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2020.codfw.wmnet [09:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, three nits inline" (033 comments) [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [09:31:32] (03CR) 10Ladsgroup: "ping 😄" [puppet] - 10https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup) [09:31:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2020.codfw.wmnet [09:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2021.codfw.wmnet [09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:15] !log installing cuminunpriv1001 [09:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:37] (03PS1) 10Filippo Giunchedi: toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406) [09:32:52] (03CR) 10jerkins-bot: [V: 04-1] toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [09:34:55] (03PS4) 10Filippo Giunchedi: debian: add packaging [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) [09:35:36] (03PS1) 10Vgutierrez: ATS: provide client port information to varnish [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) [09:36:06] (03CR) 10Filippo Giunchedi: "Thank you for the quick review!" 
(033 comments) [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [09:37:18] (03PS2) 10Filippo Giunchedi: toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406) [09:38:30] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [09:39:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2021.codfw.wmnet [09:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:44] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [09:39:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2023.codfw.wmnet [09:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:07] (03CR) 10Filippo Giunchedi: [C: 03+2] toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [09:41:42] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] debian: add packaging [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [09:42:40] (03CR) 10Elukey: "John I did a quick very high level pass, I really like the pre/post actions to do, will review again the code change later on! 
Thanks a lo" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [09:43:34] 10SRE, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi Resolving, with recent rsyslog on centrallog hosts we haven't experienced this bug [09:46:15] (03Merged) 10jenkins-bot: Add flag to toggle the usage of the group synchronization cache [extensions/Translate] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: 10Nikerabbit) [09:47:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2023.codfw.wmnet [09:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2024.codfw.wmnet [09:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:31] (03CR) 10Ema: [C: 03+1] ATS: provide client port information to varnish [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) (owner: 10Vgutierrez) [09:57:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2024.codfw.wmnet [09:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:24] (03CR) 10Elukey: [C: 03+1] "Thanks! 
<3" [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) (owner: 10Vgutierrez) [09:59:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2025.codfw.wmnet [09:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2025.codfw.wmnet [10:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2026.codfw.wmnet [10:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:52] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10amy_rc) @KFrancis My full name is Amrutha Varshini Chandra First Name - Amrutha Varshini Last Name - Chandra and WMDE email Address - amrutha.chandra@wikimedia.d... [10:14:22] (03CR) 10Klausman: varnish: block python-request UA bots for AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey) [10:14:40] (03PS1) 10Muehlenhoff: Fix MAC address [puppet] - 10https://gerrit.wikimedia.org/r/657297 [10:14:44] (03CR) 10WMDE-Fisch: "This change is ready for review." [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657308 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [10:16:49] (03CR) 10Elukey: varnish: block python-request UA bots for AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey) [10:16:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2026.codfw.wmnet [10:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:29] (03CR) 10Gehel: "Decommission of old servers and configuration of the new ones should be split into 2 different CR." 
[puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [10:17:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2027.codfw.wmnet [10:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:48] 10SRE, 10cloud-services-team (Kanban): apt key for `thirdparty/ceph-nautilus/buster` has expired. - https://phabricator.wikimedia.org/T259873 (10aborrero) [10:19:26] (03CR) 10Muehlenhoff: [C: 03+2] Fix MAC address [puppet] - 10https://gerrit.wikimedia.org/r/657297 (owner: 10Muehlenhoff) [10:20:46] (03CR) 10Volans: [C: 04-1] "Agree with Arzhel's comments" (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [10:22:18] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov) [10:24:40] (03PS1) 10Matthias Mullie: Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 [10:24:54] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] [DONT MERGE] cloud: drop NAT exceptions for dumps NFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [10:26:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2027.codfw.wmnet [10:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:41] (03CR) 10Vgutierrez: [C: 03+2] ATS: provide client port information to varnish [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) (owner: 10Vgutierrez) [10:26:41] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet [10:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:58] (03PS2) 10Matthias Mullie: Remove MediaSearch survey [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) [10:34:11] (03PS1) 10Marostegui: db1079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657300 [10:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13842 and previous config saved to /var/cache/conftool/dbconfig/20210120-103449-marostegui.json [10:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:53] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [10:35:14] (03CR) 10Marostegui: [C: 03+2] db1079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657300 (owner: 10Marostegui) [10:35:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet [10:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10lilients_WMDE) I tested the online tool which seems to be really cool. But accessing the event logging metrics in presto analytics hive I get the following error message: ` presto error: F... [10:37:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2029.codfw.wmnet [10:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:52] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "We can backport this, but I don't think this is strictly necessary. This number doesn't do much. It's an upper limit. 
It will only have an" [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657308 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [10:41:53] (03PS1) 10Marostegui: Revert "db1079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657312 [10:42:37] (03CR) 10Marostegui: [C: 03+2] Revert "db1079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657312 (owner: 10Marostegui) [10:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13844 and previous config saved to /var/cache/conftool/dbconfig/20210120-104257-root.json [10:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:09] (03PS1) 10Arturo Borrero Gonzalez: network: add cloud_networks_public constant and use it in ferm [puppet] - 10https://gerrit.wikimedia.org/r/657301 [10:44:29] (03PS2) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) [10:46:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2029.codfw.wmnet [10:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:44] (03CR) 10Ema: [C: 04-1] varnish: check for debug=1 value in X-Analytics header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [10:47:55] (03PS1) 10Ayounsi: Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) [10:48:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2030.codfw.wmnet [10:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:44] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+1] Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi) [10:49:25] (03CR) 10Ayounsi: [C: 03+2] Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi) [10:50:02] (03Merged) 10jenkins-bot: Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi) [10:51:12] (03PS6) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [10:51:28] !log Discard the non-whitelisted 172.16.0.0/12 traffic - T209082 [10:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2030.codfw.wmnet [10:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:04] (03CR) 10Ayounsi: [C: 03+1] "LGTM, but I'd prefer someone with better Puppet/ferm skills to review it as well." [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez) [10:58:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13845 and previous config saved to /var/cache/conftool/dbconfig/20210120-105801-root.json [10:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:39] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [11:13:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13846 and previous config saved to /var/cache/conftool/dbconfig/20210120-111305-root.json [11:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [11:16:13] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10akosiaris) >>! In T179696#6760378, @Legoktm wrote: > Got pretty close, one last sticking point is that `docker_report` hardcodes connecting to t... [11:16:50] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) >>! In T179696#6760378, @Legoktm wrote: > Got pretty close, one last sticking point is that `docker_report` hardcodes connecting to the reg... 
[11:22:45] (03CR) 10Volans: "I don't mind it but currently the whole project is using format() and it's good to have consistency and not everyone likes f-strings (as t" [cookbooks] - 10https://gerrit.wikimedia.org/r/656923 (owner: 10RhinosF1) [11:23:57] (03CR) 10RhinosF1: "> Patch Set 3:" [cookbooks] - 10https://gerrit.wikimedia.org/r/656923 (owner: 10RhinosF1) [11:25:03] (03CR) 10Klausman: varnish: block python-request UA bots for AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey) [11:25:09] (03CR) 10Klausman: [C: 03+1] varnish: block python-request UA bots for AQS [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey) [11:28:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13847 and previous config saved to /var/cache/conftool/dbconfig/20210120-112808-root.json [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:14] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [11:37:43] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=icinga1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [11:51:04] (03CR) 10Arturo Borrero Gonzalez: "PCC report is fine:" [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez) [11:51:41] 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10jcrespo) [11:54:28] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2032.codfw.wmnet [11:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:53] (03PS4) 10Jbond: 
ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) [11:58:36] (03PS2) 10Hnowlan: similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1200). [12:00:04] Kizule and Matthias: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:24] I can deploy today! [12:00:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/657149 (https://phabricator.wikimedia.org/T135991) (owner: 10Dave Pifke) [12:00:32] (03CR) 10Muehlenhoff: [C: 03+2] webperf: enable Apache base::service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/657149 (https://phabricator.wikimedia.org/T135991) (owner: 10Dave Pifke) [12:00:38] o/ [12:00:50] (03CR) 10Matthias Mullie: [C: 03+1] Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:01:10] matthiasmullie: or you can if you wish, you're the only customer who's here :) (didn't recognize you under the irc handle) [12:01:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2032.codfw.wmnet [12:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:10] (03CR) 10Urbanecm: [C: 03+2] Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias 
Mullie) [12:02:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2033.codfw.wmnet [12:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:53] oh okay - yeah I'm happy to self-deploy, no need to waste your time :) [12:02:56] (03CR) 10Hnowlan: [C: 03+2] similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:03:06] (03Merged) 10jenkins-bot: Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie) [12:03:52] (03CR) 10Volans: "Thanks for the fixes, last question inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [12:04:25] (03Merged) 10jenkins-bot: similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:08:26] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2fc57b259: Remove MediaSearch survey (duration: 01m 10s) [12:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2033.codfw.wmnet [12:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2034.codfw.wmnet [12:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:54] !log EU config window done [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:09] (03PS5) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 
10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) [12:11:38] (03CR) 10Jbond: "thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [12:12:47] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:33] (03PS1) 10Jcrespo: admin: Provide cluster access to lilients [puppet] - 10https://gerrit.wikimedia.org/r/657330 (https://phabricator.wikimedia.org/T272264) [12:16:43] (03PS2) 10Jcrespo: admin: Provide cluster access to lilients [puppet] - 10https://gerrit.wikimedia.org/r/657330 (https://phabricator.wikimedia.org/T272264) [12:17:53] (03PS1) 10Hnowlan: similar-users: change container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657331 (https://phabricator.wikimedia.org/T268837) [12:18:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) @lilients_WMDE Please confirm your data seems correct at the above patch, thank you! [12:19:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2034.codfw.wmnet [12:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet [12:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:07] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) @KFrancis ^^^ Thank you for your quick response to both. 
[12:23:39] (03CR) 10Hnowlan: [C: 03+2] similar-users: change container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657331 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:24:58] (03Merged) 10jenkins-bot: similar-users: change container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657331 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:25:23] heads up, I noticed icinga check latency has gone up significantly, going to restart icinga in 5 min [12:26:27] upps, matthiasmullie, sorry, I totally forgot I'm SWATting :/ [12:27:23] No worries. I’m done with my patch, and I think I was the only one [12:27:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet [12:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:47] great :) [12:27:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [12:27:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [12:27:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [12:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:00] Urbanecm: there was another patch scheduled that window, but user doesn't appear to be around - do we still want to swat that, or not since user is not here? [12:29:20] let's ignore it as they're not around [12:29:37] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:25] Urbanecm: are you swatting now/soon ? 
should I hold off on restarting icinga just in case ? [12:31:37] godog: no, I'm done :) [12:31:43] ack [12:31:50] !log bounce icinga on alert1001 [12:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2036.codfw.wmnet [12:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2036.codfw.wmnet [12:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2037.codfw.wmnet [12:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:57] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: drop ECS messages on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [12:46:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2037.codfw.wmnet [12:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:48] (03PS1) 10Hnowlan: similar-users: use puppet ca bundle with requests via env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/657333 (https://phabricator.wikimedia.org/T268837) [12:49:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2087.codfw.wmnet with reason: Schema change T267767 [12:49:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2087.codfw.wmnet with reason: Schema change T267767 [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:51] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 [12:49:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:26] volans: ooo, fancy cookbook logging! [12:52:18] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [12:58:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet [12:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:56] (03PS2) 10Muehlenhoff: Disable bast3004/bast4002/bast5001 as bastions [puppet] - 10https://gerrit.wikimedia.org/r/656894 (https://phabricator.wikimedia.org/T257324) [12:58:57] (03CR) 10Hnowlan: [C: 03+2] similar-users: use puppet ca bundle with requests via env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/657333 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:59:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10lilients_WMDE) >>! In T272264#6761382, @jcrespo wrote: > @lilients_WMDE Please confirm your data seems correct at the above patch, thank you! Looks fine. Thanks! [13:00:28] (03Merged) 10jenkins-bot: similar-users: use puppet ca bundle with requests via env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/657333 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [13:02:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [13:02:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [13:02:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[13:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:15] Urbanecm: was my change deployed or reverted? Should I revert it now? [13:04:33] Nikerabbit: it was neither deployed nor reverted AFAIK (so we're still in that bad state of uncertainty). Since it's a train blocker, I'm happy to deploy it now, provided you know how to check it works at testwiki. [13:06:36] Urbanecm: I think I can. I need to check I have the necessary rights, and https://phabricator.wikimedia.org/T157997 may be a blocker [13:07:10] Nikerabbit: you do, it's more of "do I feel confident syncing backports, not being a regular deployer" [13:07:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet [13:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:40] Urbanecm: sorry I didn't quite understand your last message [13:09:41] Nikerabbit: sorry. You are a deployer, so you can technically sync it out. But, as I know you're not doing deployments regularly, I'm offering to do it for you - so you only need to make sure it works as intended and doesn't break something else :) [13:10:33] Urbanecm: yes, my comment was about testing. I just checked while we are chatting that I can reproduce the error, so I am able to test it [13:10:44] ah, great [13:10:51] I'll ping you when it's ready then [13:11:27] (03CR) 10Jcrespo: [C: 03+2] admin: Provide cluster access to lilients [puppet] - 10https://gerrit.wikimedia.org/r/657330 (https://phabricator.wikimedia.org/T272264) (owner: 10Jcrespo) [13:11:33] godog: you wanted to do something with icinga earlier, is that done? (in other words, can I deploy now)? 
[13:14:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [13:14:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:14:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [13:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:18] Nikerabbit: pulled onto mwdebug1001, please test [13:14:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2039.codfw.wmnet [13:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:42] Urbanecm: do you know if jobqueue is also affected by mwdebug-extension? [13:14:46] (03PS1) 10Hnowlan: similar-users: one replica in staging, 2 in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837) [13:15:03] Nikerabbit: I'm not sure about that. [13:15:21] well, we should see soon [13:15:22] (03CR) 10Muehlenhoff: [C: 03+2] Disable bast3004/bast4002/bast5001 as bastions [puppet] - 10https://gerrit.wikimedia.org/r/656894 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff) [13:15:26] (note only testwikis are affected by wmf.27 by now) [13:16:48] Urbanecm: error is coming from mw1306, so I assume jobqueue is not affected by mwdebug [13:17:11] Nikerabbit: ack. Is there any other way to test it? 
If not, I'll just sync it [13:17:53] Urbanecm: I can't think of any other way (maybe some shell.php hackery but have not prepared for that either) [13:18:04] okay, I'll sync it then [13:18:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) @lilients_WMDE access has been merged, it will take a few minutes to be deployed throughout the cluster (all servers). When it does, I will take care of kerber... [13:19:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [13:20:44] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/Translate/: 20decbd5cc3de0af655b9419cf69fc442ab056a4: Add flag to toggle the usage of the group synchronization cache (T272428; T182433) (duration: 01m 10s) [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:50] T182433: Implement a stronger synchronization in RepoNG and Translate - https://phabricator.wikimedia.org/T182433 [13:20:50] T272428: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 [13:20:54] Nikerabbit: synced. Can you check now? [13:20:59] Urbanecm: checking [13:22:56] Urbanecm: looks good to me [13:23:03] great, thanks! [13:24:36] Urbanecm: thanks for the help, sorry for the trouble [13:26:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [13:26:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [13:26:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[13:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:22] (03PS1) 10Urbanecm: Set wgTranslateGroupSynchronizationCache to false explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657337 (https://phabricator.wikimedia.org/T272428) [13:27:41] Nikerabbit: I also feel we should explicitly set that flag to false in wmf config, to prevent this from recurring when it's set to true in extension.json. [13:28:22] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [13:28:22] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [13:28:23] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:44] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2039.codfw.wmnet [13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [13:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:12] Urbanecm: sorry was in a meeting, yeah all done [13:30:43] no problem, I synced it anyway, as you didn't object :) [13:31:57] Urbanecm: I doubt it is ever enabled by default (and we're adding another check too), but I'll keep it in mind [13:32:25] ack [13:32:34] kormat: lol (fancy logging) [13:32:36] (03PS1) 10Jcrespo: Add Dom Walden 
(dwalden) to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/657338 (https://phabricator.wikimedia.org/T272477) [13:35:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [13:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet [13:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] (03CR) 10Jcrespo: [C: 03+2] Add Dom Walden (dwalden) to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/657338 (https://phabricator.wikimedia.org/T272477) (owner: 10Jcrespo) [13:38:38] (03PS4) 10Joal: profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) [13:41:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet [13:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10Ottomata) > I tested the online tool which seems to be really cool. But accessing the event logging metrics in presto analytics hive I get the following error message:... 
[13:44:52] (03PS3) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) [13:46:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [13:53:07] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [13:55:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet [13:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:07] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Dom Walden - https://phabricator.wikimedia.org/T272477 (10dom_walden) >>! In T272477#6761745, @jcrespo wrote: > Access has been deployed to LDAP and you should have immediate access to logstash. > > Please read the note that i... [14:00:04] brennen and liw: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1400). 
[14:00:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet [14:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet [14:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:23] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) For the record, I'm running the script as follows on registry1001: ` /usr/local/bin/registry-homepage-builder docker-registry.wikimedia.or... [14:07:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet [14:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:24] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) @Cmjohnson before racking the remaining 6 nodes (that we can do it in another task) could you check an-worker1119 and an-worker1131 to see if they... 
[14:07:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet [14:07:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] services: similar-users discovery and LVS component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [14:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:01] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:12:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1075.eqiad.wmnet with reason: Rebooting for T272255 [14:12:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1075.eqiad.wmnet with reason: Rebooting for T272255 [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db1075 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13848 and previous config saved to /var/cache/conftool/dbconfig/20210120-141230-kormat.json [14:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet [14:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet [14:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:57] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - 
10https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [14:19:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet [14:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:57] (03CR) 10Muehlenhoff: "Looks good, couple of nits inline" (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:20:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet [14:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:33] (03CR) 10Thcipriani: [C: 03+1] Add Tyler as approval contact for Phabricator-related groups [puppet] - 10https://gerrit.wikimedia.org/r/657291 (owner: 10Muehlenhoff) [14:21:40] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T272255', diff saved to https://phabricator.wikimedia.org/P13849 and previous config saved to /var/cache/conftool/dbconfig/20210120-142139-kormat.json [14:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:03] brennen, liw: o/ I have some config changes coming in, what's the train status? Can I sync those? 
[14:26:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet [14:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1076.eqiad.wmnet with reason: Rebooting for T272255 [14:26:32] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1076.eqiad.wmnet with reason: Rebooting for T272255 [14:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13850 and previous config saved to /var/cache/conftool/dbconfig/20210120-142636-kormat.json [14:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:49] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet [14:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:28] (03PS1) 10Ottomata: Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657343 (https://phabricator.wikimedia.org/T271165) [14:28:21] (03PS2) 10Hnowlan: similar-users: one replica in staging, 2 in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837) [14:29:03] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:30:24] (03CR) 10Hnowlan: [C: 03+2] similar-users: one replica in staging, 2 in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [14:30:41] (03CR) 10Ottomata: [C: 03+2] 
Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657343 (https://phabricator.wikimedia.org/T271165) (owner: 10Ottomata) [14:31:39] brennen, liw: looks like the train is blocked? Proceeding with config changes; they are low risk, please tell me if I should stop [14:31:45] (03Merged) 10jenkins-bot: similar-users: one replica in staging, 2 in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [14:31:48] ottomata, brennen is asleep, I'm not moving train this window, go ahead [14:31:49] (03PS1) 10David Caro: Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) [14:31:53] ok thank you [14:32:08] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [14:32:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [14:32:09] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . 
[14:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet [14:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:25] (03CR) 10Andrew Bogott: [C: 03+1] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) (owner: 10David Caro) [14:33:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez) [14:34:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet [14:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:53] (03CR) 10David Caro: [C: 03+2] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) (owner: 10David Caro) [14:34:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add support for php deployments (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [14:35:22] (03Merged) 10jenkins-bot: Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) (owner: 10David Caro) [14:35:53] (03PS9) 10Elukey: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [14:35:59] ^ :o [14:36:00] :) [14:36:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 
2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Andrew) [14:36:37] (03PS1) 10Ottomata: Declare streams QuickSurveysResponses and QuickSurveyInitiation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657346 (https://phabricator.wikimedia.org/T271165) [14:37:05] (03PS2) 10Ottomata: Declare streams QuickSurveysResponses and QuickSurveyInitiation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657346 (https://phabricator.wikimedia.org/T271165) [14:38:08] (03CR) 10Elukey: [C: 03+2] Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [14:39:03] (03CR) 10Ottomata: [C: 03+2] Declare streams QuickSurveysResponses and QuickSurveyInitiation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657346 (https://phabricator.wikimedia.org/T271165) (owner: 10Ottomata) [14:40:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet [14:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:23] (03PS1) 10Andrew Bogott: nova vendordata: fix ## template: jinja intro [puppet] - 10https://gerrit.wikimedia.org/r/657348 (https://phabricator.wikimedia.org/T271273) [14:44:18] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: fix ## template: jinja intro [puppet] - 10https://gerrit.wikimedia.org/r/657348 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [14:44:52] (03PS1) 10CDanis: klaxon: autorestart on envfile changes [puppet] - 10https://gerrit.wikimedia.org/r/657349 [14:45:21] (03PS1) 10Hnowlan: similar-users: make worker timeout configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 
(https://phabricator.wikimedia.org/T268837) [14:45:48] !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 33%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13851 and previous config saved to /var/cache/conftool/dbconfig/20210120-144547-kormat.json [14:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet [14:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet [14:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate QuickSurveys schemas to EventGate on testwiki - T271165, T271166 (duration: 01m 06s) [14:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:55] T271165: QuickSurveyInitiation Event Platform Migration - https://phabricator.wikimedia.org/T271165 [14:47:55] T271166: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166 [14:48:23] (03PS2) 10Andrew Bogott: Revert "Nova: add a simple vendordata REST service" [puppet] - 10https://gerrit.wikimedia.org/r/657117 [14:48:50] (03CR) 10jerkins-bot: [V: 04-1] Revert "Nova: add a simple vendordata REST service" [puppet] - 10https://gerrit.wikimedia.org/r/657117 (owner: 10Andrew Bogott) [14:51:03] (03CR) 10CDanis: [C: 03+2] klaxon: autorestart on envfile changes [puppet] - 10https://gerrit.wikimedia.org/r/657349 (owner: 10CDanis) [14:51:32] (03PS3) 10Andrew Bogott: Revert "Nova: add a simple vendordata REST service" [puppet] - 10https://gerrit.wikimedia.org/r/657117 [14:52:45] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Nova: add a simple vendordata REST service" [puppet] - 10https://gerrit.wikimedia.org/r/657117 (owner: 
10Andrew Bogott) [14:53:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet [14:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:15] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [14:53:55] (03PS1) 10Ottomata: Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657351 (https://phabricator.wikimedia.org/T271165) [14:55:28] (03CR) 10Ottomata: [C: 03+2] Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657351 (https://phabricator.wikimedia.org/T271165) (owner: 10Ottomata) [14:55:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet [14:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:56] (03CR) 10Jbond: [C: 03+2] ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:55:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1109.eqiad.wmnet with reason: Rebooting for T272255 [14:56:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1109.eqiad.wmnet with reason: Rebooting for T272255 [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:05] !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13852 and previous config saved to 
/var/cache/conftool/dbconfig/20210120-145605-kormat.json [14:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:09] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate QuickSurveys schemas to EventGate on all wikis - T271165, T271166 (duration: 01m 05s) [14:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:13] T271165: QuickSurveyInitiation Event Platform Migration - https://phabricator.wikimedia.org/T271165 [14:57:13] T271166: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166 [14:57:15] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 238408568 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:59:31] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 484728 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:59:34] !log elukey@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[14:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:52] !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 66%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13853 and previous config saved to /var/cache/conftool/dbconfig/20210120-150051-kormat.json [15:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:39] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10wikitrent) [15:01:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet [15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13854 and previous config saved to /var/cache/conftool/dbconfig/20210120-150216-kormat.json [15:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:18] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet [15:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet [15:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:13] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (10ArielGlenn) >>! In T271718#6755969, @jcrespo wrote: > Next meeting is expected to happen on 25 Ja... 
[15:09:27] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:12:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet [15:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:14] (03CR) 10Ottomata: [C: 03+2] "> you expose your helm-releases to changes in the infra which will ofc not be picked up automatically" [puppet] - 10https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: 10Ottomata) [15:13:35] (03PS2) 10Muehlenhoff: Add Tyler as approval contact for Phabricator-related groups [puppet] - 10https://gerrit.wikimedia.org/r/657291 [15:15:09] (03CR) 10Muehlenhoff: [C: 03+2] Add Tyler as approval contact for Phabricator-related groups [puppet] - 10https://gerrit.wikimedia.org/r/657291 (owner: 10Muehlenhoff) [15:15:55] !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13855 and previous config saved to /var/cache/conftool/dbconfig/20210120-151555-kormat.json [15:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:13] jouncebot now [15:16:13] For the next 0 hour(s) and 43 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1400) [15:17:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet [15:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13856 and previous config saved to /var/cache/conftool/dbconfig/20210120-151719-kormat.json [15:17:23] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet [15:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:05] !log 1.36.0-wmf.27 train unblocked, proceeding to group0 (T271341) [15:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:14] T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341 [15:21:23] (03PS1) 10Brennen Bearnes: group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657357 [15:21:25] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657357 (owner: 10Brennen Bearnes) [15:22:14] (03CR) 10Gmodena: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:22:25] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657357 (owner: 10Brennen Bearnes) [15:23:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet [15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:10] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.27 [15:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:26] (03PS1) 10Arturo Borrero Gonzalez: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - 10https://gerrit.wikimedia.org/r/657358 [15:29:36] (03CR) 10Hnowlan: [C: 03+2] similar-users: make worker timeout configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:30:44] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/657358 (owner: 10Arturo Borrero Gonzalez) [15:31:08] (03Merged) 10jenkins-bot: similar-users: make worker timeout configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:31:42] (03CR) 10Abijeet Patro: [C: 03+1] Set wgTranslateGroupSynchronizationCache to false explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657337 (https://phabricator.wikimedia.org/T272428) (owner: 10Urbanecm) [15:32:01] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [15:32:01] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:32:01] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [15:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:07] !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13857 and previous config saved to /var/cache/conftool/dbconfig/20210120-153223-kormat.json [15:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:27] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:34:27] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [15:34:27] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . 
[15:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet [15:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] 10SRE, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi This policy change has been implemented for >6 months now and seems to work well (i.e. no incidents left acknowledged) [15:43:34] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [15:43:34] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:43:34] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [15:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:26] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [15:46:26] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [15:46:26] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[15:46:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet [15:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13858 and previous config saved to /var/cache/conftool/dbconfig/20210120-154726-kormat.json [15:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:11] !log elukey@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [15:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [15:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet [15:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:24] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[15:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:34] (03PS1) 10Ladsgroup: refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) [15:59:33] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:45] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27536/" [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:04:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet [16:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:27] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) @BBlack Can you schedule this for this afternoon? 
[16:06:15] (03PS1) 10Ladsgroup: analytics: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657364 (https://phabricator.wikimedia.org/T209953) [16:06:36] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) @Cmjohnson Yes, just let me know a timeframe and we'll get it ready [16:08:15] (03CR) 10Elukey: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:08:28] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27537/" [puppet] - 10https://gerrit.wikimedia.org/r/657364 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:08:37] (03CR) 10David Caro: [C: 03+1] "\o/ yay" [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (owner: 10Arturo Borrero Gonzalez) [16:09:23] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) For the record, I got the script running, by using ` /usr/local/bin/registry-homepage-builder docker-registry.wikimedia.org /root/homepag... [16:09:27] (03CR) 10Elukey: [C: 03+2] analytics: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657364 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:09:50] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) @BBlack Okay, 2 hours from now. 
1300 EST [16:10:33] (03CR) 10Ladsgroup: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:20:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [16:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:46] (03CR) 10Alexandros Kosiaris: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [16:25:09] (03CR) 10Elukey: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:29:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [16:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [16:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:34] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) @Cmjohnson - we'll have it ready then. 
[16:35:44] (03PS3) 10Alexandros Kosiaris: Introduce linkrecommendation{,-external} [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) [16:37:02] (03CR) 10Alexandros Kosiaris: Introduce linkrecommendation{,-external} (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [16:37:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce linkrecommendation{,-external} [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [16:37:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [16:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [16:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:56] (03PS2) 10Ladsgroup: refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) [16:39:33] (03CR) 10Ladsgroup: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:39:47] (03CR) 10Volans: [C: 03+1] Introduce linkrecommendation{,-external} (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [16:40:15] akosiaris: you might have forgot to submit and deploy^^^ ;) [16:42:38] (03PS7) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [16:44:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [16:44:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/657155 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [16:46:00] volans: done. I was just working through gerrit complaining about "1 unresolved comment". But you resolved it for me :-) [16:46:15] nice race condition [16:46:16] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [16:46:16] :D [16:46:33] (03PS3) 10CDanis: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 [16:48:23] (03CR) 10CDanis: Send pages with user's email address, if available (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis) [16:48:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] network: add cloud_networks_public constant and use it in ferm [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez) [16:49:24] (03PS4) 10CDanis: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 [16:51:53] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [16:52:18] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) Hello, @wikitrent! Aside from the approval, may I ask you for your Wikitech user name -or create one if you don't have one- and wikimedia email handle? Also, while not technically required... 
[16:53:45] (03CR) 10CDanis: [C: 03+1] swift: decrease object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [16:54:45] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:56:10] (03PS1) 10Cwhite: logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) [16:56:12] (03PS1) 10Cwhite: profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565) [16:56:31] (03CR) 10Arturo Borrero Gonzalez: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [16:56:57] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [17:01:13] (03PS2) 10Arturo Borrero Gonzalez: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (https://phabricator.wikimedia.org/T272486) [17:06:23] (03PS1) 10Joal: Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415) [17:06:25] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) >>! In T179696#6761286, @Joe wrote: > Given the banner page we're creating is for use by the public, I think it can simply run against... [17:06:37] elukey: --^ [17:07:57] (03CR) 10jerkins-bot: [V: 04-1] Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415) (owner: 10Joal) [17:08:26] 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) [17:08:29] joal: I think require goes for last [17:08:33] Arf [17:08:34] ok [17:08:37] 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) [17:09:20] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10wikitrent) hi @jcrespo ! Thanks for getting back to me! I've linked the LDAP account and my username is Wikitrent. 
Email is thand@wikimedia.org [17:10:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:13:30] (03PS2) 10Joal: Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415) [17:15:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:16:25] (03CR) 10Elukey: [C: 03+2] Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415) (owner: 10Joal) [17:19:35] * volans looking ^^^ [17:19:48] volans: please stop breaking netbox [17:19:50] (03PS1) 10Hnowlan: similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 [17:19:55] was not me :D [17:20:19] elukey: netbox = volans [17:21:48] (03CR) 10Gmodena: [C: 03+1] similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 (owner: 10Hnowlan) [17:22:49] papaul: :D [17:22:58] he can't escape from that [17:23:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [17:23:52] akosiaris: I'm running the sre.dns.netbox cookbook to address the above ^^^ until we decide if we use or not netbox as source of truth for the .svc. zonefiles this will fire as there is a diff although it's not actually included live. 
Sorry for the noise [17:23:59] papaul: we joke about it but we are lucky to have Riccardo patroling netbox :D [17:24:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:24:57] !log volans@cumin2001 START - Cookbook sre.dns.netbox [17:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:05] elukey: yes he does a great job i give him credit for that always avaible to help [17:25:54] <3 :) [17:25:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) 05Open→03Resolved All tasks have been completed, please try to access your cluster account to confirm it works as intended and you have the right permissions. For kerberos, you... [17:26:07] volans: we will change netbox name to netvolans [17:26:16] lol [17:26:29] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:27:45] (03CR) 10Hnowlan: [C: 03+2] similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 (owner: 10Hnowlan) [17:28:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:29:11] (03Merged) 10jenkins-bot: similar-users: new version of image, more debugging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 (owner: 10Hnowlan) [17:29:41] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) Get up to deploying the service in staging, it seems working! Updated all... [17:31:37] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [17:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:27] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Aklapper) Hi and welcome @wikitrent. Please update the team's onboarding docs to link to https://phabricator.wikimedia.org/project/profile/1564/ which has instructions. Thanks a lot! :) [17:34:26] (03CR) 10Elukey: profile::analytics::refinery::job::hdfs_cleaner Update (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [17:35:12] (03PS1) 10Brennen Bearnes: Catch ClosestFilterVersionNotFoundException in ViewDiff [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657322 (https://phabricator.wikimedia.org/T272505) [17:35:17] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:36:23] (03CR) 10Brennen Bearnes: [C: 03+2] Catch ClosestFilterVersionNotFoundException in ViewDiff [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657322 (https://phabricator.wikimedia.org/T272505) (owner: 10Brennen Bearnes) [17:36:38] 10SRE, 10LDAP-Access-Requests: 
Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) @Aklapper thanks for mentioning it, I was in a good mood I didn't want to be very inflexible with procedures for a new colleague, but following and using the templates helps indeed to speed... [17:36:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:40:06] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [17:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:51] (03PS1) 10Jcrespo: admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) [17:43:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:43:22] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[17:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:24] (03CR) 10RLazarus: [C: 03+1] Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis) [17:44:40] (03CR) 10CDanis: [C: 03+2] Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis) [17:46:14] (03Merged) 10jenkins-bot: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis) [17:49:38] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:50:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:51:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:51:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:52:57] (03PS2) 10CRusnov: interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 [17:52:59] (03CR) 10CRusnov: interface_automation: Clean up old interfaces on run (036 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [17:53:01] (03PS2) 10CRusnov: interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 [17:54:06] (03CR) 10jerkins-bot: [V: 04-1] interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [17:54:19] (03CR) 10jerkins-bot: [V: 04-1] interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov) [17:54:27] !log lvs1015: stopping pybal with puppet disabled for T272258 [17:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:34] T272258: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 [17:56:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:56:55] BGP alerts are related to pybal stop above, will ack them all shortly [17:57:12] probably also a "mediawiki exceptions/minute" alert incoming shortly [17:58:13] !log volans@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:23] ACKNOWLEDGEMENT - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1015 hw maint - T272258 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:23] ACKNOWLEDGEMENT - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1015 hw maint - 
T272258 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:50] cdanis: from the pybal shift? [17:59:13] bblack: yeah -- doing so resets all open connections; there's always some disruption visible to internal clients [17:59:36] it doesn't have to, but it might commonly due to [17:59:53] IME it commonly does :) [18:01:31] (03PS3) 10CRusnov: interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 [18:01:33] (03PS3) 10CRusnov: interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 [18:01:34] !log lvs1015 - shutdown for T272258 [18:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:38] T272258: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 [18:05:21] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:07:54] (03Merged) 10jenkins-bot: Catch ClosestFilterVersionNotFoundException in ViewDiff [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657322 (https://phabricator.wikimedia.org/T272505) (owner: 10Brennen Bearnes) [18:08:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2316.codfw.wmnet with reason: REIMAGE [18:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27538/" [puppet] - 10https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:09:19] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:09:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2325.codfw.wmnet with reason: REIMAGE [18:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:10] (03CR) 10Dzahn: "noop confirmed on rdb1006" [puppet] - 10https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:10:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2327.codfw.wmnet with reason: REIMAGE [18:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:10:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2316.codfw.wmnet with reason: REIMAGE [18:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2329.codfw.wmnet with reason: REIMAGE [18:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2325.codfw.wmnet with reason: REIMAGE [18:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:32] OSPF status 👀 [18:14:10] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) @BBlack I swapped both optics on lvs1015 and b2 xe-2/0/3 [18:14:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2329.codfw.wmnet with reason: REIMAGE [18:14:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:58] jouncebot now [18:14:58] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [18:15:00] noting here that i'm going to sling out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/657322 [18:15:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:17] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2327.codfw.wmnet with reason: REIMAGE [18:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:57] XioNoX: AFAICT, on cr2-esams, xe-0/1/3 (Lumen transport to cr2-eqiad) is flapping? [18:16:54] no planned maintenance [18:17:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster [18:17:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster [18:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:32] (03PS1) 10Jbond: icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 [18:18:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:19:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:19:42] (03CR) 10Hnowlan: services: similar-users discovery and LVS component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [18:20:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:55] looking [18:22:10] !log ganeti - creating 105G virtual harddisk and adding to releases1002 for T272092 [18:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:15] T272092: Request volume for Docker images and container filesystems on releases machines - https://phabricator.wikimedia.org/T272092 [18:22:17] s/105/150 [18:23:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:19] could be a faulty optic, great :) [18:24:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:34] !log ganeti - creating 150G virtual hard disk and adding it to releases2002 for T272092 [18:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:49] (03CR) 10jerkins-bot: [V: 04-1] icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [18:25:00] !log draining esams-eqiad link [18:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 
(10Cmjohnson) the issue I ran into is db1169 was created in netbox w/out a mgmt ip. I didn't see until I went through and assigned mgmt IP's. so now everything is 1 off, I... [18:26:31] 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Cmjohnson) This is planned for this coming Friday around 1100 EST [18:27:38] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) 05Open→03Resolved resolving this task, if the issue persists please re-open and ping me. Thanks! [18:29:40] !log lvs1015: re-enabling puppet + pybal - T272258 [18:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:46] T272258: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 [18:32:36] 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10Cmjohnson) Locally, ms-be1046 is not coming up either. I tried removing the power and psu's waiting for 45 secs and plugging back in. The server will not power-on. A Dell support task will need to be opened. [18:33:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2325.codfw.wmnet'] ` an... [18:33:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2316.codfw.wmnet'] ` an... [18:34:41] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) This server is out of warranty, I do not have any HP 4TB disks to replace it with. 
I may have some old Dell ones I can use. They should work [18:35:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2327.codfw.wmnet'] ` an... [18:36:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2329.codfw.wmnet'] ` an... [18:36:57] (03CR) 10Elukey: [C: 03+2] refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [18:37:36] (03CR) 10Elukey: "Ah snap this collides with another change that we just merged, can you rebase Amir? Sorry :(" [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [18:39:16] 01/20/2021 18:10:30 GMT - Light level testing has identified an issue 21 miles from the test site and Field Operations is working with the Transport NOC for the next step to take. 
[18:42:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:42:49] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/AbuseFilter/includes/View/AbuseFilterViewDiff.php: Backport: [[gerrit:657366|Catch ClosestFilterVersionNotFoundException in ViewDiff (T272505)]] (duration: 01m 06s) [18:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:53] T272505: FilterLookup: No version of filter [x] closest to [y] found - https://phabricator.wikimedia.org/T272505 [18:45:37] (03PS5) 10Legoktm: mailman3: Add parts for Postorius (web interface) [puppet] - 10https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup) [18:46:02] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Record: 1 Date/Time: 08/31/2020 17:37:02 Source: system Severity: Ok Description: Log cleared. ------------------------------------------------------------------------------- Recor... [18:46:41] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Swapped DIMM A4 with DIMM B4, cleared the system log and powered on. Let's see if the error returns, stays the same or changes. 
[18:47:11] (03CR) 10Legoktm: [C: 03+2] mailman3: Add parts for Postorius (web interface) [puppet] - 10https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup) [18:49:31] I created https://phabricator.wikimedia.org/T272524 to not forget to put it back in service [18:51:05] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2316.codfw.wmnet [18:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2325.codfw.wmnet [18:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:35] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2327.codfw.wmnet [18:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:45] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2329.codfw.wmnet [18:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:13] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 3.463e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [18:53:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2316.codfw.wmnet [18:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2329.codfw.wmnet [18:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:53] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2327.codfw.wmnet [18:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:24] !log dzahn@cumin1001 conftool action : 
set/pooled=yes; selector: name=mw2325.codfw.wmnet [18:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:31] (03PS1) 10Ottomata: Refactor EventLogging Event Platform PHP integration [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) [18:55:33] (03PS1) 10Ottomata: Fix possible undefined index warning in arg checking in EventServiceClient [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) [18:57:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:57:29] (03CR) 10Mholloway: [C: 03+1] Refactor EventLogging Event Platform PHP integration [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata) [18:57:34] (03CR) 10Mholloway: [C: 03+1] Fix possible undefined index warning in arg checking in EventServiceClient [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata) [18:58:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:58:47] PROBLEM - Host clouddb1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:59:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:59:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:00:04] brennen and liw: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1900). [19:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1900). [19:00:04] Kizule: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:12] (03PS3) 10Ladsgroup: refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) [19:00:19] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10wikitrent) @Aklapper & @jcrespo I didn't follow what you were asking and created this task from the link: https://phabricator.wikimedia.org/T272525 Can we ignore/delete the n... [19:00:20] jouncebot: I'm glad to hear it. ;) [19:00:34] Kizule: finally you arrived :) [19:00:36] I can deploy today [19:00:52] RoanKattouw, Niharika, Urbanecm: the train is blocked from rolling forward, but also where it should be theoretically at the moment (group0), so feel free to go ahead with backports. [19:01:02] (03CR) 10Ladsgroup: "Done. It actually did the work I was doing, so I just dropped the file from my patch." [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [19:01:19] brennen: ack, thanks, I'm going to do config only for now anyway :) [19:01:22] Urbanecm: Yup, I’ve been unstable lately and juggling with so many things. 
[19:01:42] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) [19:01:48] (03CR) 10Urbanecm: [C: 03+2] Enable visualeditor on kuwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655714 (https://phabricator.wikimedia.org/T270841) (owner: 10Zoranzoki21) [19:02:44] (03Merged) 10jenkins-bot: Enable visualeditor on kuwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655714 (https://phabricator.wikimedia.org/T270841) (owner: 10Zoranzoki21) [19:03:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:09] (03PS1) 10Ayounsi: Add Lumen transit BGP to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/657395 (https://phabricator.wikimedia.org/T270439) [19:04:14] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) Don't worry, please read when you have time Phabricator (and task management)'s documentation, it will help you for future tickets): https://www.mediawiki.org/wiki/Bug_... [19:04:25] Kizule: please test at mwdebug1001 [19:04:30] Urbanecm: Sure [19:04:47] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Fast response by the server, after swapping the DIMM, the server was stuck in a continuous reboot. connected the console and see that the server is failing during post at the memory check. Not s... 
[19:05:10] (03CR) 10Ayounsi: [C: 03+2] Add Lumen transit BGP to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/657395 (https://phabricator.wikimedia.org/T270439) (owner: 10Ayounsi) [19:05:39] (03Merged) 10jenkins-bot: Add Lumen transit BGP to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/657395 (https://phabricator.wikimedia.org/T270439) (owner: 10Ayounsi) [19:07:07] Urbanecm: VE disappeared from "beta features" (it is okay). But, there isn't pen in the source editor with which I can switch to the visual editor. [19:07:27] Kizule: mind screenshotting? [19:07:36] Urbanecm: I'll. [19:08:24] (03PS1) 10Ayounsi: Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/657396 [19:08:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:09:21] Urbanecm: Everything looks good, maybe some caching issues happened in my browser, but now everything is correct for me. [19:09:23] (03CR) 10Dzahn: [C: 03+1] "confirmed on mwmaint (wikitech LDAP) and ldap-corp (OIT LDAP). full-time employee, UID matches, lgtm, all it needs is aezell's approval o" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo) [19:09:28] okay [19:09:30] (03CR) 10Ayounsi: [C: 03+2] Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/657396 (owner: 10Ayounsi) [19:10:05] (03Merged) 10jenkins-bot: Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/657396 (owner: 10Ayounsi) [19:10:24] Urbanecm: So... you can sync this. 
[19:10:30] syncing [19:11:18] !log urbanecm@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: a736d97463e7a42b41dbcff19a8c2c3c62f8bf6d: Enable visualeditor on kuwiki by default (T270841; 1/2) (duration: 01m 04s) [19:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:24] T270841: Enable VisualEditor by default for all users of the ku.wikipedia - https://phabricator.wikimedia.org/T270841 [19:11:36] !log add BGP to Lumen in eqiad [19:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:55] !log urbanecm@deploy1001 Synchronized wmf-config/config/kuwiki.yaml: a736d97463e7a42b41dbcff19a8c2c3c62f8bf6d: Enable visualeditor on kuwiki by default (T270841; 2/2) (duration: 01m 05s) [19:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:12] Kizule: here you go :) [19:13:29] (03CR) 10Joal: "Thanks for the review @elukey - I'm sorry for the forgotten changes :S" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [19:13:31] (03PS1) 10Urbanecm: [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526) [19:13:41] (03PS1) 10Effie Mouzeli: mediawiki: reduce the number of cached keys that trigger a restart [puppet] - 10https://gerrit.wikimedia.org/r/657398 (https://phabricator.wikimedia.org/T245183) [19:13:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:13:53] Urbanecm: Thank you, everything works. 
[19:14:08] (03PS1) 10David Caro: wmcs.backup.images: Fix full backup creation [puppet] - 10https://gerrit.wikimedia.org/r/657399 (https://phabricator.wikimedia.org/T272510) [19:14:09] 10SRE, 10ops-eqiad, 10Analytics-Radar, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson @elukey I swapped the SSD. The only spare I had is 300GB. It's new. Feel free to do what you need. I am resolving this t... [19:14:09] great :) [19:14:36] (03PS4) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) [19:16:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE [19:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2333.codfw.wmnet with reason: REIMAGE [19:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:21] (03PS2) 10Urbanecm: [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526) [19:17:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) @Aklapper @jcrespo - thanks for the help with this task! > Hi and welcome @wikitrent. Please update the team's onboarding docs to link to https://phabricator.wikimed... 
[19:17:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:00] (03CR) 10Urbanecm: [C: 03+2] [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm) [19:18:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2335.codfw.wmnet with reason: REIMAGE [19:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE [19:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2337.codfw.wmnet with reason: REIMAGE [19:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:05] (03Merged) 10jenkins-bot: [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm) [19:20:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2333.codfw.wmnet with reason: REIMAGE [19:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:39] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) I did swap it with a 4TB disk from a Dell server. Hopefully, this works. 
[19:20:51] (03PS1) 10Andrew Bogott: nova vendordata/firstboot: move puppet logic into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) [19:22:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2335.codfw.wmnet with reason: REIMAGE [19:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:30] !log urbanecm@deploy1001 Synchronized static/images/project-logos: 13fb338249b3ec73e380c4971ee697f28a2f6d76: [enwiki] Update celebration logo to "option A" (T272526) (duration: 01m 05s) [19:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:34] T272526: Change enwiki logo to "Option A" until February 4 - https://phabricator.wikimedia.org/T272526 [19:23:00] 10SRE, 10Graphoid, 10Platform Engineering, 10serviceops: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495 (10Jdlrobson) Opened T272530 with suggested high priority [19:24:00] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2337.codfw.wmnet with reason: REIMAGE [19:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:30] !log depool and repool thumbor* to upgrade python-thumbor-wikimedia to v2.9 [19:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:27:15] PROBLEM - mediawiki-installation DSH group on mw2337 is CRITICAL: Host mw2337 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:27:49] RECOVERY - Device not healthy -SMART- on an-coord1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1002&var-datasource=eqiad+prometheus/ops [19:27:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:28:27] (03CR) 10Jcrespo: "BTW, I don't think management approval is needed according to our procedures, but it helps establishing the connection between the coporat" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo) [19:29:33] PROBLEM - Apache HTTP on mw2337 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:29:59] (03PS1) 10Urbanecm: Revert "[enwiki] Update celebration logo to "option A"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657323 (https://phabricator.wikimedia.org/T272526) [19:30:15] (03CR) 10Urbanecm: [C: 03+2] Revert "[enwiki] Update celebration logo to "option A"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657323 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm) [19:30:29] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) FYI, this is only blocked on @aezel response T272489#6762016 (not deployed yet). 
[19:31:06] (03Merged) 10jenkins-bot: Revert "[enwiki] Update celebration logo to "option A"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657323 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm) [19:31:44] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) a:05Tchanders→03aezell [19:32:02] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) p:05Triage→03High [19:33:41] !log urbanecm@deploy1001 Synchronized static/images/project-logos: 5c941678ec739dd6b5257b4a8f866b7e3a257f45: Revert: [enwiki] Update celebration logo to "option A" (T272526) (duration: 01m 04s) [19:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:46] T272526: Change enwiki logo to "Option A" until February 4 - https://phabricator.wikimedia.org/T272526 [19:34:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:35:28] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) p:05Triage→03High a:03JTannerWMF Waiting on additional user information to be provided, following Aklapper provided link. [19:38:33] PROBLEM - Host mw2337 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2331.codfw.wmnet'] ` an... [19:39:05] Urbanecm: why reverted? 
[19:39:35] (03PS1) 10CDanis: add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) [19:39:36] legoktm: because it's too big and it would require dropping HD logo [19:39:43] it looks like this https://usercontent.irccloud-cdn.com/file/kxXQmCdD/image.png [19:39:49] RECOVERY - Host mw2337 is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [19:39:51] RECOVERY - Apache HTTP on mw2337 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 1.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:40:11] PROBLEM - Check systemd state on mw2337 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2333.codfw.wmnet'] ` an... [19:40:19] Urbanecm: gotcha :| [19:40:27] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1009 is OK: (C)5e+06 ge (W)1e+06 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [19:40:43] sadly i wasn't able to make it fit easily, so i reverted :/ [19:40:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2335.codfw.wmnet'] ` an... 
[19:41:35] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2337.codfw.wmnet'] ` an... [19:42:01] RECOVERY - Check systemd state on mw2337 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:12] legoktm: I'll put a note on the task momentarily :). [19:42:56] :) thanks [19:44:13] PROBLEM - PHP opcache health on mw2337 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:45:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:47:40] !log lvs1015: stopping pybal to try to fix a lingering ifup service state issue on the host, which may require downing an interface [19:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:38] !log lvs1015: bringing pybal back online [19:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:00] (03PS1) 10CDanis: add bot_posts_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/657406 [19:55:48] (03PS1) 10Tpt: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) [19:57:06] (03CR) 10jerkins-bot: [V: 04-1] Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt) [19:57:17] (03PS2) 10Tpt: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 
(https://phabricator.wikimedia.org/T272163) [19:57:51] (03PS2) 10CDanis: add bot_posts_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/657406 [19:57:53] legoktm: {{done}} as https://phabricator.wikimedia.org/T272526#6763287. [19:58:26] (03CR) 10CDanis: [V: 03+2 C: 03+2] add bot_posts_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/657406 (owner: 10CDanis) [20:00:04] brennen and liw: Time to snap out of that daydream and deploy Mediawiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T2000). [20:02:23] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) All appears healthy now and downtimes are removed, and librenms isn't showing those errors on the interface anymore, either. Thanks! [20:06:12] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:29] !log 1.36.0-wmf.27 (T271341) train status as of deploy window: currently blocked at group0 on T272508 [20:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:33] T272508: PropertyInfoSnakUrlExpander: Bad value for parameter $snak->getDataValue(): must be a DataValues\StringValue - https://phabricator.wikimedia.org/T272508 [20:06:34] T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341 [20:08:48] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10aezell) As @wikitrent's manager, I approve this access. 
[20:11:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:11:41] (03PS2) 10CDanis: add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) [20:12:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:13:28] (03PS3) 10CDanis: add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) [20:14:23] (03PS2) 10Andrew Bogott: nova vendordata/firstboot: move puppet logic into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) [20:15:32] (03CR) 10CDanis: "Added tests, re-checking once more, then will disable puppet on cps, merge, test on a few, re-enable." 
[puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) (owner: 10CDanis) [20:15:42] PROBLEM - PHP opcache health on mw2329 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:16:06] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:24] ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕒🍵 sudo cumin A:cp 'disable-puppet "cdanis deploying I558346d T272330"' [20:17:28] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕒🍵 sudo cumin A:cp 'disable-puppet "cdanis deploying I558346d T272330"' [20:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:21:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:36] (03CR) 10CDanis: [C: 03+2] add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) (owner: 10CDanis) [20:22:40] jouncebot now [20:22:40] For the next 1 hour(s) and 37 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T2000) [20:23:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:11] !log 1.36.0-wmf.27 (T271341) train: proceeding to group1 [20:23:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:16] T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341 [20:24:30] 10SRE, 10puppet-compiler: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis) [20:24:42] 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis) [20:25:23] (03PS1) 10Brennen Bearnes: group1 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657409 [20:25:25] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657409 (owner: 10Brennen Bearnes) [20:26:16] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657409 (owner: 10Brennen Bearnes) [20:28:48] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.27 [20:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:30:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:31:54] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.27 (duration: 03m 05s) [20:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:32] brennen: this time the spike was real, but just a spike.
It's already back down, so it looks like that's from the deploy itself [20:32:36] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:32:40] there we go [20:33:42] (03CR) 10Cwhite: [C: 03+2] profile: drop ECS messages on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:34:44] whew. [20:36:09] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) [20:37:05] (03Abandoned) 10Effie Mouzeli: Revert "hieradata: enable onhost memcached on mw2271" [puppet] - 10https://gerrit.wikimedia.org/r/631230 (owner: 10Effie Mouzeli) [20:37:28] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) a:05Jclark-ctr→03RobH All of the servers are in the racks, idracs are setup including db1169 and db1175. Outstanding items that @robh will do - raid - p...
[20:38:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:41:15] !log restart mc-gp2001, mc-gp2002, mc-gp2003 for T269596 [20:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:58] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:18] 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis) There's one related problem, which is that `enable-puppet` should check the given message both with and without appending ` - $SUDO_USER`, as perhaps you set a disable-puppet from a context where th... [20:44:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2331.codfw.wmnet [20:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:35] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2333.codfw.wmnet [20:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:48] 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis) >>! In T272539#6763461, @CDanis wrote: > There's one related problem, which is that `enable-puppet` should check the given message both with and without appending ` - $SUDO_USER`, as perhaps you set... 
[20:45:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2335.codfw.wmnet [20:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2337.codfw.wmnet [20:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:45] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2331.codfw.wmnet [20:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2333.codfw.wmnet [20:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2335.codfw.wmnet [20:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:46] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin A:cp 'enable-puppet "cdanis deploying I558346d T272330"' [20:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2337.codfw.wmnet [20:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [20:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:42] (03PS3) 10Effie Mouzeli: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [20:48:58] (03CR) 10Effie Mouzeli: [C: 03+1] profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [20:49:43] (03CR) 10Effie Mouzeli: [C: 03+1] "@elukey, since puppet does not trigger a memcached restart, I think 
we can merge it and slowly do the restarts" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [20:53:18] PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:54:40] ^^ looking [20:55:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:55:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:56:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:56:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [20:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:07] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:57:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:57:30] RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:57:34] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:58:00] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) a:05aezell→03jcrespo [20:58:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:58:07] (03CR) 10Effie Mouzeli: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/27548/" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [20:59:27] (03CR) 10Effie Mouzeli: [C: 03+1] "Thanks @ahmon for having a look" [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi) [20:59:44] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T2100). 
[21:02:33] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [21:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:07:18] (03CR) 10Dzahn: [C: 03+1] "+1, the approval is already there regardless of it being required" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo) [21:12:37] (03CR) 10Dzahn: [C: 03+2] "compiled on everything, including alert1001 (this is a defined type, not a class and used in base and all over the place). shows there are" [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:13:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet [21:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:32] RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:14:34] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:14:40] (03PS1) 10Legoktm: docker_registry_ha: Enable and fix build-homepage job [puppet] - 10https://gerrit.wikimedia.org/r/657412 (https://phabricator.wikimedia.org/T179696) [21:14:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2338.codfw.wmnet with reason: REIMAGE [21:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2339.codfw.wmnet with reason: REIMAGE [21:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:15:47] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2339.codfw.wmnet with reason: REIMAGE [21:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2351.codfw.wmnet with reason: REIMAGE [21:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2353.codfw.wmnet with reason: REIMAGE [21:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:17:03] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 848763 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [21:17:27] (03CR) 10Kosta Harlan: [C: 03+1] [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657292 (owner: 10Gergő Tisza) [21:17:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2338.codfw.wmnet with reason: REIMAGE [21:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet [21:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2353.codfw.wmnet with reason: REIMAGE [21:19:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:10] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2351.codfw.wmnet with reason: REIMAGE [21:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:18] (03PS6) 10Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [21:22:10] PROBLEM - Apache HTTP on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:22:10] PROBLEM - Memcached on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [21:22:48] ^ those are being reimaged by mutante [21:23:02] (03PS2) 10Legoktm: docker_registry_ha: Enable and fix build-homepage job [puppet] - 10https://gerrit.wikimedia.org/r/657412 (https://phabricator.wikimedia.org/T179696) [21:23:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:23:10] (03CR) 10Legoktm: [C: 03+2] docker_registry_ha: Enable and fix build-homepage job [puppet] - 10https://gerrit.wikimedia.org/r/657412 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [21:24:03] !log milimetric@deploy1001 Started deploy [analytics/refinery@1313244]: Regular analytics weekly train [analytics/refinery@1313244] [21:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:34] RECOVERY - mediawiki-installation DSH group on mw2337 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:28:50] PROBLEM - PHP7 rendering on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:29:22] PROBLEM - 
PHP opcache health on mw2335 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:29:28] RECOVERY - Check systemd state on registry2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:31:30] PROBLEM - Memcached on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [21:31:43] (03PS7) 10Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [21:32:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:34:02] PROBLEM - Host mw2339 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:34:55] !log milimetric@deploy1001 Finished deploy [analytics/refinery@1313244]: Regular analytics weekly train [analytics/refinery@1313244] (duration: 10m 52s) [21:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:02] PROBLEM - Check systemd state on registry2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:21] !log milimetric@deploy1001 Started deploy [analytics/refinery@1313244] (thin): Regular analytics weekly train THIN [analytics/refinery@1313244] [21:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:28] !log milimetric@deploy1001 Finished deploy [analytics/refinery@1313244] (thin): Regular analytics weekly train THIN [analytics/refinery@1313244] (duration: 00m 07s) [21:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:37:14] RECOVERY - PHP7 rendering on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:37:16] RECOVERY - Host mw2339 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [21:37:18] RECOVERY - Memcached on mw2351 is OK: TCP OK - 0.034 second response time on 10.192.32.201 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [21:37:18] RECOVERY - Memcached on mw2339 is OK: TCP OK - 0.034 second response time on 10.192.32.117 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [21:37:20] RECOVERY - Apache HTTP on mw2351 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:37:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2338.codfw.wmnet'] ` an... 
[21:37:32] PROBLEM - PHP opcache health on mw2339 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:37:54] PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:37:56] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2339.codfw.wmnet'] ` an... [21:38:06] (03PS3) 10Andrew Bogott: nova vendordata/firstboot: move puppet logic into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) [21:38:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2351.codfw.wmnet'] ` an... [21:39:01] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2353.codfw.wmnet'] ` an... 
[21:40:24] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:40:32] PROBLEM - Ensure local MW versions match expected deployment on mw2337 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:41:22] PROBLEM - Ensure local MW versions match expected deployment on mw2335 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:42:04] (03PS1) 10Effie Mouzeli: varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/657416 (https://phabricator.wikimedia.org/T181368) [21:43:26] PROBLEM - mediawiki-installation DSH group on mw2339 is CRITICAL: Host mw2339 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:43:41] (03PS8) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [21:44:05] (03CR) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [21:46:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:48:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:49:56] PROBLEM - PHP opcache health on mw2316 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:51:19] PROBLEM 
- PHP opcache health on mw2325 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:53:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:56:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:00:19] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:38] (03PS4) 10Andrew Bogott: nova vendordata/firstboot: move puppet config into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) [22:04:19] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:47] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:09] PROBLEM - mediawiki-installation DSH group on mw2351 is CRITICAL: Host mw2351 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:09:14] mediawiki-installation DSH group - don't worry, fixing it right now [22:09:26] it's the reimaging, and because I was on a break [22:09:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:11:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:11:39] RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:07] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:15] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:15:47] RECOVERY - Check systemd state on registry1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2338.codfw.wmnet [22:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:10] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet [22:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:33] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2351.codfw.wmnet [22:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:16:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2353.codfw.wmnet [22:16:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2338.codfw.wmnet [22:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet [22:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2351.codfw.wmnet [22:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2353.codfw.wmnet [22:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:51] PROBLEM - PHP opcache health on mw2327 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:19:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster [22:19:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster [22:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:21:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:22:12] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:22:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:23:21] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:23:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:24:10] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2355.codfw.wmnet'] ` Of... [22:24:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:24:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:25:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:25:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:27:02] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) [22:27:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:28:38] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [22:28:40] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Dzahn) [22:30:27] (03PS7) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [22:32:39] (03PS8) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [22:35:24] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27553/console" [puppet] - 
10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:37:17] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:41:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2355.codfw.wmnet with reason: REIMAGE [22:41:15] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:52] https://docker-registry.wikimedia.org/ should work now [22:42:09] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [22:43:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2355.codfw.wmnet with reason: REIMAGE [22:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2357.codfw.wmnet with reason: REIMAGE [22:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:43] RECOVERY - mediawiki-installation DSH group on mw2339 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:43:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2359.codfw.wmnet 
with reason: REIMAGE [22:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2361.codfw.wmnet with reason: REIMAGE [22:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2357.codfw.wmnet with reason: REIMAGE [22:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2361.codfw.wmnet with reason: REIMAGE [22:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:50] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2359.codfw.wmnet with reason: REIMAGE [22:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2359.codfw.wmnet with reason: new install on buster [22:49:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2359.codfw.wmnet with reason: new install on buster [22:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:29] PROBLEM - PHP opcache health on mw2331 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:58:41] PROBLEM - Ensure local MW versions match expected deployment on mw2333 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [22:58:43] PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 522 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [22:59:17] PROBLEM - PHP opcache health on mw2333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:59:58] sorry, trying to minimize the noise but they still happen sometimes [23:00:20] every once in a while there is a race condition and the downtime is not set [23:00:29] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:25] !log mw2331, mw2333 - scap pull [23:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2355.codfw.wmnet'] ` an... [23:03:34] !log updated docker-registry.discovery.wmnet/wikimedia-buster image [23:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:49] RECOVERY - Ensure local MW versions match expected deployment on mw2337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [23:04:21] RECOVERY - Ensure local MW versions match expected deployment on mw2333 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [23:04:23] RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [23:05:27] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:39] RECOVERY - Ensure local MW versions match expected deployment on mw2335 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [23:05:57] PROBLEM - Check systemd state on registry1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:01] PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:11] that's the build-homepage.service currently being worked on by lego [23:06:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2361.codfw.wmnet'] ` an... [23:06:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2355.codfw.wmnet [23:06:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2359.codfw.wmnet'] ` an... 
[23:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:05] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2359.codfw.wmnet [23:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2361.codfw.wmnet [23:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:20] it looks like the script works independently, but when all 4 servers run the systemd job at the same time, the registry times out and the job fails [23:07:37] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2357.codfw.wmnet'] ` an... [23:08:51] legoktm: ah, you can randomize the minute like this: $minute = Integer(seeded_rand(60, $title)) [23:08:52] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) https://docker-registry.wikimedia.org/ ta-da Tested by: * `docker pull docker-registry.discovery.wmnet/wikimedia-buster:latest` on a... [23:09:03] RECOVERY - mediawiki-installation DSH group on mw2351 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:09:37] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Jclark-ctr) [23:10:02] mutante: ahh, thanks. Is $title just a fixed string? [23:10:03] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:10:05] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2357.codfw.wmnet [23:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Jclark-ctr) @Cmjohnson. host are racked and cabled netbox is updated host , port aqs1010 7 aqs1011 23 aqs1012 36 aqs1013 31 aqs1014 6 aqs1015 14 [23:12:20] legoktm: yea, a string. it's because my example is from inside a defined type which then uses systemd::timer::job so $title is re-using the title of the defined type [23:12:29] ack [23:12:30] you can just use a random word too [23:13:09] (03PS1) 10Legoktm: docker_registry_ha: Randomize timing of build-homepage job [puppet] - 10https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) [23:13:53] (03CR) 10Dzahn: [C: 03+1] "that's how I do it for planet feed updates as well" [puppet] - 10https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:14:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:14:07] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27554/console" [puppet] - 10https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:15:00] or could use $::fqdn as seed i guess [23:15:52] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27555/console" [puppet] - 10https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:16:49] mutante: PCC https://puppet-compiler.wmflabs.org/compiler1002/27555/ has 
them all moving the minute to 20...is that just a limitation of the compiler? or is that not right? [23:17:40] legoktm: hmm.. I wonder what happens if you use $::fqdn as part of the seed and compile it again [23:17:49] not entirely sure [23:19:37] (PS2) Legoktm: docker_registry_ha: Randomize timing of build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) [23:19:49] * legoktm tries [23:19:54] I did not have the exact same case, i had just one host per DC that is active but 10 different jobs [23:20:05] which all get to use different seeds [23:20:47] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27556/console" [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm) [23:21:07] ok, they're all different now [23:21:17] not perfectly spread out but pretty good enough [23:21:20] looks like it works [23:21:24] thank you :) [23:21:24] yep [23:21:27] yw!
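Editor's note on the exchange above: a seeded random value depends only on its seed, so a fixed string seed gives every host the same minute (as the first PCC run showed), while mixing the host name into the seed gives each host its own stable minute. The Python sketch below illustrates that idea only. It is not Puppet's seeded_rand algorithm; the function name seeded_minute, the SHA-256 hashing, the seed strings, and the host names are all assumptions made for illustration.

```python
import hashlib


def seeded_minute(seed: str, max_value: int = 60) -> int:
    """Map a seed string to a stable integer in [0, max_value).

    Illustrative stand-in for a seeded random function; this is NOT
    the algorithm Puppet uses, only the same general idea.
    """
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_value


# Hypothetical host names, for illustration only.
hosts = [
    "registry1001.example",
    "registry1002.example",
    "registry2001.example",
    "registry2002.example",
]

# Fixed seed: every host computes the same minute, so the jobs still collide.
fixed = {host: seeded_minute("build-homepage") for host in hosts}

# Seed that includes the host name: each host gets its own repeatable minute.
per_host = {host: seeded_minute(f"{host}-build-homepage") for host in hosts}

if __name__ == "__main__":
    for host in hosts:
        print(f"{host}: fixed seed -> {fixed[host]:2d}, per-host seed -> {per_host[host]:2d}")
```

Per-host values can still coincide by chance, which fits the remark that the result was "not perfectly spread out".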
[23:21:42] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker_registry_ha: Randomize timing of build-homepage job [puppet] - 10https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:22:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on releases2002.codfw.wmnet with reason: rebooting to add a disk [23:22:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases2002.codfw.wmnet with reason: rebooting to add a disk [23:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:51] now hopefully within the next hour they'll all recover [23:25:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2357.codfw.wmnet [23:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2355.codfw.wmnet [23:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2359.codfw.wmnet [23:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2361.codfw.wmnet [23:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[23:27:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:28:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:28:35] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:30:24] !log releases2002 - rebooting VM [23:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:12] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:35:36] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:38] RECOVERY - Check systemd state on registry1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2363.codfw.wmnet with reason: REIMAGE [23:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2365.codfw.wmnet with reason: REIMAGE [23:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2367.codfw.wmnet with reason: REIMAGE [23:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2369.codfw.wmnet with reason: REIMAGE [23:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2363.codfw.wmnet with reason: REIMAGE [23:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:02] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:49:06] RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2367.codfw.wmnet with reason: REIMAGE [23:49:57] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:18] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) Skimming the puppet role, there's: ` # this could be removed when buster or next debian includes a 2.7+ version apt::pin { 'strech_wikimedia_docker_registry_27': packag... [23:51:32] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2365.codfw.wmnet with reason: REIMAGE [23:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2369.codfw.wmnet with reason: REIMAGE [23:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:56:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:58:56] RECOVERY - Check systemd state on registry2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state