[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T0000).
[00:00:04] <jouncebot>	 tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:38] <tgr_>	 o/
[00:07:16] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:44] <wikibugs>	 (PS2) Gergő Tisza: [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408)
[00:07:44] <icinga-wm>	 PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:25] <legoktm>	 !log uploaded docker-report 0.0.4-1~deb9u1 to stretch-wikimedia (T179696)
[00:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:29] <stashbot>	 T179696: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696
[00:09:42] <wikibugs>	 (PS2) Cwhite: profile: drop ECS messages on legacy cluster [puppet] - https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565)
[00:09:58] <wikibugs>	 (PS2) Legoktm: docker_registry_ha: Make registery-homepage-builder Python 3.5 compatible [puppet] - https://gerrit.wikimedia.org/r/657210 (https://phabricator.wikimedia.org/T179696)
[00:10:42] <wikibugs>	 (CR) Gergő Tisza: [C: +2] [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408) (owner: Gergő Tisza)
[00:12:33] <wikibugs>	 (Merged) jenkins-bot: [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408) (owner: Gergő Tisza)
[00:12:56] <wikibugs>	 (PS3) Legoktm: docker_registry_ha: Make registry-homepage-builder Python 3.5 compatible [puppet] - https://gerrit.wikimedia.org/r/657210 (https://phabricator.wikimedia.org/T179696)
[00:15:25] <wikibugs>	 (CR) Legoktm: [C: +2] docker_registry_ha: Make registry-homepage-builder Python 3.5 compatible [puppet] - https://gerrit.wikimedia.org/r/657210 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[00:18:24] <wikibugs>	 (PS2) Gergő Tisza: Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309)
[00:21:02] <icinga-wm>	 PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:20] <wikibugs>	 SRE, MediaWiki-Containers, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) Got pretty close, one last sticking point is that `docker_report` hardcodes connecting to the registry over HTTPS. So if you try `https://localhost` th...
[00:26:54] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Disable build-homepage job for now [puppet] - https://gerrit.wikimedia.org/r/657216 (https://phabricator.wikimedia.org/T179696)
[00:29:06] <wikibugs>	 (CR) Legoktm: [C: +2] docker_registry_ha: Disable build-homepage job for now [puppet] - https://gerrit.wikimedia.org/r/657216 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[00:29:29] <wikibugs>	 (CR) Gergő Tisza: [C: +2] Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309) (owner: Gergő Tisza)
[00:30:20] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:655863|(no-op) GrowthExperiments: Disable link recommendations (T261408)]] (duration: 01m 05s)
[00:30:21] <wikibugs>	 (Merged) jenkins-bot: Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309) (owner: Gergő Tisza)
[00:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:24] <stashbot>	 T261408: Add a link engineering: Maintenance script for retrieving, caching, and updating search index - https://phabricator.wikimedia.org/T261408
[00:34:12] <icinga-wm>	 RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:29] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:656284|Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 (T270309)]] (duration: 01m 03s)
[00:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:32] <stashbot>	 T270309: Instrument banner module on newcomer homepage - https://phabricator.wikimedia.org/T270309
[00:51:48] <wikibugs>	 SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) a:Legoktm
[00:55:00] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:58:08] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:03:54] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:04:00] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:04:40] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:04:52] <wikibugs>	 (PS1) Legoktm: Switch to native Debian package [docker-images/docker-report] - https://gerrit.wikimedia.org/r/657218
[01:05:16] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/pag
[01:05:16] <icinga-wm>	 e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:06:52] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:07:28] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test G
[01:07:28] <icinga-wm>	 sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:09:52] <icinga-wm>	 PROBLEM - Check systemd state on aqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:36] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:10:48] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:11:26] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:11:36] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: connect to address 10.64.48.148 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:13:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:16:18] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test G
[01:16:18] <icinga-wm>	 sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:30] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITI
[01:17:30] <icinga-wm>	  article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:32] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test G
[01:17:32] <icinga-wm>	 sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:18:24] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pag
[01:18:24] <icinga-wm>	 e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:19:27] <legoktm>	 too many connections to aqs cassandra?
[01:19:41] <wikibugs>	 SRE, Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (EBernhardson) We could probably cancel this? In T271493 we are fixing the data size issues which will remove the need to re-shard.
[01:19:42] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:19:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:19:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:20:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:23:42] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:25:20] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[01:26:32] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRI
[01:26:32] <icinga-wm>	 ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:30:00] <icinga-wm>	 RECOVERY - cassandra-a service on aqs1006 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:32:14] <icinga-wm>	 RECOVERY - Check systemd state on aqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:42] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:34:00] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is OK: TCP OK - 0.000 second response time on 10.64.48.148 port 9042 https://phabricator.wikimedia.org/T93886
[01:34:32] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:35:44] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:36:16] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:36:52] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:38:20] <icinga-wm>	 PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:22] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:48:16] <icinga-wm>	 RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:20] <icinga-wm>	 RECOVERY - cassandra-a service on aqs1004 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:54:56] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.126 port 9042 https://phabricator.wikimedia.org/T93886
[02:08:58] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:10:08] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[02:11:08] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:11:58] <icinga-wm>	 PROBLEM - Check systemd state on aqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:10] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikime
[02:12:10] <icinga-wm>	 ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:13:00] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:15:18] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:16:08] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:18:40] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:22:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:23:04] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/m
[02:23:04] <icinga-wm>	 file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:24:00] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.w
[02:24:00] <icinga-wm>	 /Services/Monitoring/aqs
[02:25:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:26:22] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:27:32] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1005 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:28:24] <icinga-wm>	 RECOVERY - Check systemd state on aqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:54] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/pe
[02:29:54] <icinga-wm>	 t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:29:54] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.190 port 9042 https://phabricator.wikimedia.org/T93886
[02:32:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:32:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:33:08] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:33:56] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:19:04] <wikibugs>	 SRE, Traffic, Wikimedia-Logstash, observability, and 3 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (Krinkle) a:Krinkle→None
[03:19:30] <wikibugs>	 SRE, Traffic, Wikimedia-Logstash, observability, and 3 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (Krinkle) Open→Resolved a:Krinkle
[03:58:00] <wikibugs>	 (PS1) Patsagorn Y.: Create patroller user group for thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149)
[03:58:02] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[04:21:29] <wikibugs>	 (PS2) Patsagorn Y.: Create patroller user group for thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149)
[04:41:34] <wikibugs>	 (CR) HitomiAkane: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[04:46:19] <wikibugs>	 (CR) HitomiAkane: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[05:01:47] <icinga-wm>	 RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[05:42:04] <wikibugs>	 (PS1) Andrew Bogott: Revert "nova: install the novavendordata api in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/657116
[05:45:38] <wikibugs>	 (PS1) Andrew Bogott: Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117
[05:46:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117 (owner: Andrew Bogott)
[05:59:25] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "nova: install the novavendordata api in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/657116 (owner: Andrew Bogott)
[06:10:24] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[06:20:24] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273)
[06:21:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[06:29:54] <wikibugs>	 (PS2) Andrew Bogott: nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273)
[06:45:15] <wikibugs>	 SRE, LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (Legoktm) Sidenote: if this is straightforward to do, it would be nice if we could create an LDAP group of users in the admin `deployment` group, so we can replace the manuall...
[06:57:47] <icinga-wm>	 RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[07:32:57] <wikibugs>	 (PS5) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[07:37:04] <wikibugs>	 (CR) Ryan Kemper: "Right now this rips out all the old `relforge100[1,2]` stuff, but also has the new `relforge100[3,4]` stuff in it. I imagine I'll want to " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[07:44:23] <wikibugs>	 SRE, Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (dcausse) Open→Declined Agreed, we should re-assess the shard sizes end of March 2021.
[07:44:59] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 13 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27531/console" [puppet] - https://gerrit.wikimedia.org/r/656833 (https://phabricator.wikimedia.org/T267175) (owner: DCausse)
[07:54:52] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] [wdqs] disable async imports [puppet] - https://gerrit.wikimedia.org/r/656833 (https://phabricator.wikimedia.org/T267175) (owner: DCausse)
[08:05:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add cuminunpriv1001 [puppet] - https://gerrit.wikimedia.org/r/657129 (owner: Muehlenhoff)
[08:13:13] <wikibugs>	 (CR) Gehel: [C: -1] "minor syntax issue, see comment inline" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[08:16:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Bump timeout for accessing RAID in smart_data_dump [puppet] - https://gerrit.wikimedia.org/r/657060 (owner: Muehlenhoff)
[08:18:22] <wikibugs>	 (CR) Ayounsi: interface_automation: Clean up old interfaces on run (6 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[08:19:54] <wikibugs>	 (CR) Ayounsi: [C: +1] interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - https://gerrit.wikimedia.org/r/657207 (owner: CRusnov)
[08:20:43] <wikibugs>	 SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (MoritzMuehlenhoff) OSI statement: https://opensource.org/node/1099
[08:23:20] <wikibugs>	 (PS1) Nikerabbit: Add flag to toggle the usage of the group synchronization cache [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428)
[08:26:36] <wikibugs>	 (PS1) Elukey: varnish: avoid python-request UA bots for AQS [puppet] - https://gerrit.wikimedia.org/r/657288
[08:30:22] <wikibugs>	 (CR) Elukey: "https://w.wiki/uzb is the traffic mentioned above for reference.." [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[08:35:30] <wikibugs>	 (CR) Ema: varnish: avoid python-request UA bots for AQS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[08:40:31] <wikibugs>	 (PS2) Elukey: varnish: block python-request UA bots for AQS [puppet] - https://gerrit.wikimedia.org/r/657288
[08:43:55] <wikibugs>	 (PS1) Muehlenhoff: Add Tyler as approval contact for Phabricator-related groups [puppet] - https://gerrit.wikimedia.org/r/657291
[08:52:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add flag to toggle the usage of the group synchronization cache [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: Nikerabbit)
[08:53:14] <wikibugs>	 (CR) Nikerabbit: "recheck" [extensions/Translate] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: Nikerabbit)
[08:58:42] <wikibugs>	 (PS2) Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[08:59:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[09:01:21] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2018.codfw.wmnet
[09:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:13] <wikibugs>	 (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657292
[09:04:04] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[09:07:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:08:41] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2018.codfw.wmnet
[09:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:13] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2019.codfw.wmnet
[09:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:04] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2019.codfw.wmnet
[09:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:54] <XioNoX>	 !log configure Lumen interfaces
[09:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:11] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T270439 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:22:27] <wikibugs>	 (03PS3) 10Filippo Giunchedi: debian: add packaging [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453)
[09:23:02] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy)
[09:23:31] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+2] "Backport" [extensions/Translate] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: 10Nikerabbit)
[09:24:17] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2020.codfw.wmnet
[09:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, three nits inline" (033 comments) [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[09:31:32] <wikibugs>	 (03CR) 10Ladsgroup: "ping 😄" [puppet] - 10https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup)
[09:31:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2020.codfw.wmnet
[09:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2021.codfw.wmnet
[09:32:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:15] <moritzm>	 !log installing cuminunpriv1001
[09:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:37] <wikibugs>	 (03PS1) 10Filippo Giunchedi: toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406)
[09:32:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi)
[09:34:55] <wikibugs>	 (03PS4) 10Filippo Giunchedi: debian: add packaging [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453)
[09:35:36] <wikibugs>	 (03PS1) 10Vgutierrez: ATS: provide client port information to varnish [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953)
[09:36:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the quick review!" (033 comments) [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[09:37:18] <wikibugs>	 (03PS2) 10Filippo Giunchedi: toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406)
[09:38:30] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[09:39:22] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2021.codfw.wmnet
[09:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[09:39:51] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2023.codfw.wmnet
[09:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] toil: remove rsyslog_tls_remedy [puppet] - 10https://gerrit.wikimedia.org/r/657293 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi)
[09:41:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] debian: add packaging [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[09:42:40] <wikibugs>	 (03CR) 10Elukey: "John I did a quick very high level pass, I really like the pre/post actions to do, will review again the code change later on! Thanks a lo" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[09:43:34] <wikibugs>	 10SRE, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi Resolving, with recent rsyslog on centrallog hosts we haven't experienced this bug
[09:46:15] <wikibugs>	 (03Merged) 10jenkins-bot: Add flag to toggle the usage of the group synchronization cache [extensions/Translate] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657306 (https://phabricator.wikimedia.org/T272428) (owner: 10Nikerabbit)
[09:47:54] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2023.codfw.wmnet
[09:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:04] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2024.codfw.wmnet
[09:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:31] <wikibugs>	 (03CR) 10Ema: [C: 03+1] ATS: provide client port information to varnish [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) (owner: 10Vgutierrez)
[09:57:07] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2024.codfw.wmnet
[09:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Thanks! <3" [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) (owner: 10Vgutierrez)
[09:59:25] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2025.codfw.wmnet
[09:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:07] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2025.codfw.wmnet
[10:05:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:12] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2026.codfw.wmnet
[10:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:52] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10amy_rc) @KFrancis  My full name is Amrutha Varshini Chandra First Name - Amrutha Varshini    Last Name - Chandra  and WMDE email Address - amrutha.chandra@wikimedia.d...
[10:14:22] <wikibugs>	 (03CR) 10Klausman: varnish: block python-request UA bots for AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[10:14:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix MAC address [puppet] - 10https://gerrit.wikimedia.org/r/657297
[10:14:44] <wikibugs>	 (03CR) 10WMDE-Fisch: "This change is ready for review." [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657308 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch)
[10:16:49] <wikibugs>	 (03CR) 10Elukey: varnish: block python-request UA bots for AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[10:16:57] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2026.codfw.wmnet
[10:16:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:29] <wikibugs>	 (03CR) 10Gehel: "Decommission of old servers and configuration of the new ones should be split into 2 different CR." [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[10:17:54] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2027.codfw.wmnet
[10:17:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:48] <wikibugs>	 10SRE, 10cloud-services-team (Kanban): apt key for `thirdparty/ceph-nautilus/buster` has expired. - https://phabricator.wikimedia.org/T259873 (10aborrero)
[10:19:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix MAC address [puppet] - 10https://gerrit.wikimedia.org/r/657297 (owner: 10Muehlenhoff)
[10:20:46] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Agree with Arzhel's comments" (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov)
[10:22:18] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov)
[10:24:40] <wikibugs>	 (03PS1) 10Matthias Mullie: Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299
[10:24:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] [DONT MERGE] cloud: drop NAT exceptions for dumps NFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez)
[10:26:27] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2027.codfw.wmnet
[10:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:41] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: provide client port information to varnish [puppet] - 10https://gerrit.wikimedia.org/r/657296 (https://phabricator.wikimedia.org/T271953) (owner: 10Vgutierrez)
[10:26:41] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet
[10:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:58] <wikibugs>	 (03PS2) 10Matthias Mullie: Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419)
[10:34:11] <wikibugs>	 (03PS1) 10Marostegui: db1079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657300
[10:34:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13842 and previous config saved to /var/cache/conftool/dbconfig/20210120-103449-marostegui.json
[10:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:53] <stashbot>	 T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008
[10:35:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1079: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657300 (owner: 10Marostegui)
[10:35:20] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet
[10:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10lilients_WMDE) I tested the online tool which seems to be really cool. But accessing the event logging metrics in presto analytics hive I get the following error message:   ` presto error: F...
[10:37:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2029.codfw.wmnet
[10:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:52] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "We can backport this, but I don't think this is strictly necessary. This number doesn't do much. It's an upper limit. It will only have an" [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657308 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch)
[10:41:53] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657312
[10:42:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657312 (owner: 10Marostegui)
[10:42:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13844 and previous config saved to /var/cache/conftool/dbconfig/20210120-104257-root.json
[10:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:09] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: network: add cloud_networks_public constant and use it in ferm [puppet] - 10https://gerrit.wikimedia.org/r/657301
[10:44:29] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397)
[10:46:39] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2029.codfw.wmnet
[10:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:44] <wikibugs>	 (03CR) 10Ema: [C: 04-1] varnish: check for debug=1 value in X-Analytics header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli)
[10:47:55] <wikibugs>	 (03PS1) 10Ayounsi: Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082)
[10:48:20] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2030.codfw.wmnet
[10:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi)
[10:49:25] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi)
[10:50:02] <wikibugs>	 (03Merged) 10jenkins-bot: Discard the non-whitelisted 172.16.0.0/12 traffic [homer/public] - 10https://gerrit.wikimedia.org/r/657302 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi)
[10:51:12] <wikibugs>	 (03PS6) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[10:51:28] <XioNoX>	 !log Discard the non-whitelisted 172.16.0.0/12 traffic - T209082
[10:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2030.codfw.wmnet
[10:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "LGTM, but I'd prefer someone with better Puppet/ferm skills to review it as well." [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez)
[10:58:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13845 and previous config saved to /var/cache/conftool/dbconfig/20210120-105801-root.json
[10:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:39] <icinga-wm>	 RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[11:13:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13846 and previous config saved to /var/cache/conftool/dbconfig/20210120-111305-root.json
[11:13:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy)
[11:16:13] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10akosiaris) >>! In T179696#6760378, @Legoktm wrote: > Got pretty close, one last sticking point is that `docker_report` hardcodes connecting to t...
[11:16:50] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) >>! In T179696#6760378, @Legoktm wrote: > Got pretty close, one last sticking point is that `docker_report` hardcodes connecting to the reg...
[11:22:45] <wikibugs>	 (03CR) 10Volans: "I don't mind it but currently the whole project is using format() and it's good to have consistency and not everyone likes f-strings (as t" [cookbooks] - 10https://gerrit.wikimedia.org/r/656923 (owner: 10RhinosF1)
[11:23:57] <wikibugs>	 (03CR) 10RhinosF1: "> Patch Set 3:" [cookbooks] - 10https://gerrit.wikimedia.org/r/656923 (owner: 10RhinosF1)
[11:25:03] <wikibugs>	 (03CR) 10Klausman: varnish: block python-request UA bots for AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[11:25:09] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] varnish: block python-request UA bots for AQS [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[11:28:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13847 and previous config saved to /var/cache/conftool/dbconfig/20210120-112808-root.json
[11:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:14] <wikibugs>	 (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond)
[11:37:43] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=icinga1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[11:51:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "PCC report is fine:" [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez)
[11:51:41] <wikibugs>	 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10jcrespo)
[11:54:28] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2032.codfw.wmnet
[11:54:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:53] <wikibugs>	 (03PS4) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702)
[11:58:36] <wikibugs>	 (03PS2) 10Hnowlan: similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837)
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1200).
[12:00:04] <jouncebot>	 Kizule and Matthias: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:24] <Urbanecm>	 I can deploy today!
[12:00:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/657149 (https://phabricator.wikimedia.org/T135991) (owner: 10Dave Pifke)
[12:00:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] webperf: enable Apache base::service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/657149 (https://phabricator.wikimedia.org/T135991) (owner: 10Dave Pifke)
[12:00:38] <matthiasmullie>	 o/
[12:00:50] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie)
[12:01:10] <Urbanecm>	 matthiasmullie: or you can if you wish, you're the only customer who's here :) (didn't recognize you under the irc handle)
[12:01:19] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2032.codfw.wmnet
[12:01:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie)
[12:02:27] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2033.codfw.wmnet
[12:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:53] <matthiasmullie>	 oh okay - yeah I'm happy to self-deploy, no need to waste your time :)
[12:02:56] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:03:06] <wikibugs>	 (03Merged) 10jenkins-bot: Remove MediaSearch survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657299 (https://phabricator.wikimedia.org/T258419) (owner: 10Matthias Mullie)
[12:03:52] <wikibugs>	 (03CR) 10Volans: "Thanks for the fixes, last question inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond)
[12:04:25] <wikibugs>	 (03Merged) 10jenkins-bot: similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:08:26] <logmsgbot>	 !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2fc57b259: Remove MediaSearch survey (duration: 01m 10s)
[12:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:39] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2033.codfw.wmnet
[12:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:50] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2034.codfw.wmnet
[12:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:54] <matthiasmullie>	 !log EU config window done
[12:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:09] <wikibugs>	 (03PS5) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702)
[12:11:38] <wikibugs>	 (03CR) 10Jbond: "thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond)
[12:12:47] <icinga-wm>	 PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:33] <wikibugs>	 (03PS1) 10Jcrespo: admin: Provide cluster access to lilients [puppet] - 10https://gerrit.wikimedia.org/r/657330 (https://phabricator.wikimedia.org/T272264)
[12:16:43] <wikibugs>	 (03PS2) 10Jcrespo: admin: Provide cluster access to lilients [puppet] - 10https://gerrit.wikimedia.org/r/657330 (https://phabricator.wikimedia.org/T272264)
[12:17:53] <wikibugs>	 (03PS1) 10Hnowlan: similar-users: change container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657331 (https://phabricator.wikimedia.org/T268837)
[12:18:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) @lilients_WMDE Please confirm your data seems correct at the above patch, thank you!
[12:19:14] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2034.codfw.wmnet
[12:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:16] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet
[12:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:07] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) @KFrancis ^^^ Thank you for your quick response to both.
[12:23:39] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] similar-users: change container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657331 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:24:58] <wikibugs>	 (03Merged) 10jenkins-bot: similar-users: change container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657331 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:25:23] <godog>	 heads up, I noticed icinga check latency has gone up significantly, going to restart icinga in 5 min
[12:26:27] <Urbanecm>	 oops, matthiasmullie, sorry, I totally forgot I'm SWATting :/
[12:27:23] <matthiasmullie>	 No worries. I’m done with my patch, and I think I was the only one
[12:27:36] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet
[12:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:47] <Urbanecm>	 great :)
[12:27:57] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[12:27:57] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[12:27:57] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[12:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:00] <matthiasmullie>	 Urbanecm: there was another patch scheduled for that window, but the user doesn't appear to be around - do we still want to swat that, or not since the user is not here?
[12:29:20] <Urbanecm>	 let's ignore it as they're not around
[12:29:37] <icinga-wm>	 RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:25] <godog>	 Urbanecm: are you swatting now/soon ? should I hold off on restarting icinga just in case ?
[12:31:37] <Urbanecm>	 godog: no, I'm done :)
[12:31:43] <godog>	 ack
[12:31:50] <godog>	 !log bounce icinga on alert1001
[12:31:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:17] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2036.codfw.wmnet
[12:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2036.codfw.wmnet
[12:41:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:37] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2037.codfw.wmnet
[12:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] profile: drop ECS messages on legacy cluster [puppet] - https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[12:46:17] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2037.codfw.wmnet
[12:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:48] <wikibugs>	 (PS1) Hnowlan: similar-users: use puppet ca bundle with requests via env var [deployment-charts] - https://gerrit.wikimedia.org/r/657333 (https://phabricator.wikimedia.org/T268837)
[12:49:47] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2087.codfw.wmnet with reason: Schema change T267767
[12:49:47] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2087.codfw.wmnet with reason: Schema change T267767
[12:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:51] <stashbot>	 T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[12:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:26] <kormat>	 volans: ooo, fancy cookbook logging!
[12:52:18] <icinga-wm>	 RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[12:58:22] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet
[12:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:56] <wikibugs>	 (PS2) Muehlenhoff: Disable bast3004/bast4002/bast5001 as bastions [puppet] - https://gerrit.wikimedia.org/r/656894 (https://phabricator.wikimedia.org/T257324)
[12:58:57] <wikibugs>	 (CR) Hnowlan: [C: +2] similar-users: use puppet ca bundle with requests via env var [deployment-charts] - https://gerrit.wikimedia.org/r/657333 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[12:59:43] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (lilients_WMDE) >>! In T272264#6761382, @jcrespo wrote: > @lilients_WMDE Please confirm your data seems correct at the above patch, thank you!  Looks fine. Thanks!
[13:00:28] <wikibugs>	 (Merged) jenkins-bot: similar-users: use puppet ca bundle with requests via env var [deployment-charts] - https://gerrit.wikimedia.org/r/657333 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[13:02:13] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[13:02:13] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[13:02:13] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[13:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:15] <Nikerabbit>	 Urbanecm: was my change deployed or reverted? Should I revert it now?
[13:04:33] <Urbanecm>	 Nikerabbit: it was neither deployed nor reverted AFAIK (so we're still in that bad state of uncertainty). Since it's a train blocker, I'm happy to deploy it now, provided you know how to check it works at testwiki.
[13:06:36] <Nikerabbit>	 Urbanecm: I think I can. I need to check I have the necessary rights, and https://phabricator.wikimedia.org/T157997 may be a blocker
[13:07:10] <Urbanecm>	 Nikerabbit: you do, it's more of "do I feel confident syncing backports, not being a regular deployer"
[13:07:11] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet
[13:07:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:40] <Nikerabbit>	 Urbanecm: sorry I didn't quite understand your last message
[13:09:41] <Urbanecm>	 Nikerabbit: sorry. You are a deployer, so you can technically sync it out. But, as I know you're not doing deployments regularly, I'm offering to do it for you - so you only need to make sure it works as intended and doesn't break something else :)
[13:10:33] <Nikerabbit>	 Urbanecm: yes, my comment was about testing. I just checked while we were chatting that I can reproduce the error, so I am able to test it
[13:10:44] <Urbanecm>	 ah, great
[13:10:51] <Urbanecm>	 I'll ping you when it's ready then
[13:11:27] <wikibugs>	 (CR) Jcrespo: [C: +2] admin: Provide cluster access to lilients [puppet] - https://gerrit.wikimedia.org/r/657330 (https://phabricator.wikimedia.org/T272264) (owner: Jcrespo)
[13:11:33] <Urbanecm>	 godog: you wanted to do something with icinga earlier, is that done? (in other words, can I deploy now)?
[13:14:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[13:14:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[13:14:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[13:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:18] <Urbanecm>	 Nikerabbit: pulled onto mwdebug1001, please test
[13:14:21] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2039.codfw.wmnet
[13:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:42] <Nikerabbit>	 Urbanecm: do you know if jobqueue is also affected by mwdebug-extension?
[13:14:46] <wikibugs>	 (PS1) Hnowlan: similar-users: one replica in staging, 2 in prod [deployment-charts] - https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837)
[13:15:03] <Urbanecm>	 Nikerabbit: I'm not sure about that.
[13:15:21] <Nikerabbit>	 well, we should see soon
[13:15:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Disable bast3004/bast4002/bast5001 as bastions [puppet] - https://gerrit.wikimedia.org/r/656894 (https://phabricator.wikimedia.org/T257324) (owner: Muehlenhoff)
[13:15:26] <Urbanecm>	 (note only testwikis are affected by wmf.27 by now)
[13:16:48] <Nikerabbit>	 Urbanecm: error is coming from mw1306, so I assume jobqueue is not affected by mwdebug
[13:17:11] <Urbanecm>	 Nikerabbit: ack. Is there any other way to test it? If not, I'll just sync it
[13:17:53] <Nikerabbit>	 Urbanecm: I can't think of any other way (maybe some shell.php hackery but have not prepared for that either)
[13:18:04] <Urbanecm>	 okay, I'll sync it then
[13:18:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (jcrespo) @lilients_WMDE access has been merged, it will take a few minutes to be deployed throughout the cluster (all servers). When it does, I will take care of kerber...
[13:19:00] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (jcrespo)
[13:20:44] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/Translate/: 20decbd5cc3de0af655b9419cf69fc442ab056a4: Add flag to toggle the usage of the group synchronization cache (T272428; T182433) (duration: 01m 10s)
[13:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:50] <stashbot>	 T182433: Implement a stronger synchronization in RepoNG and Translate - https://phabricator.wikimedia.org/T182433
[13:20:50] <stashbot>	 T272428: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428
[13:20:54] <Urbanecm>	 Nikerabbit: synced. Can you check now?
[13:20:59] <Nikerabbit>	 Urbanecm: checking
[13:22:56] <Nikerabbit>	 Urbanecm: looks good to me
[13:23:03] <Urbanecm>	 great, thanks!
[13:24:36] <Nikerabbit>	 Urbanecm: thanks for the help, sorry for the trouble
[13:26:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[13:26:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[13:26:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[13:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:22] <wikibugs>	 (PS1) Urbanecm: Set wgTranslateGroupSynchronizationCache to false explicitly [mediawiki-config] - https://gerrit.wikimedia.org/r/657337 (https://phabricator.wikimedia.org/T272428)
[13:27:41] <Urbanecm>	 Nikerabbit: I also feel we should explicitly set that flag to false in wmf config, to prevent this from recurring when it's set to true in extension.json.
[13:28:22] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[13:28:22] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[13:28:23] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[13:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:44] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2039.codfw.wmnet
[13:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet
[13:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:12] <godog>	 Urbanecm: sorry was in a meeting, yeah all done
[13:30:43] <Urbanecm>	 no problem, I synced it anyway, as you didn't object :)
[13:31:57] <Nikerabbit>	 Urbanecm: I doubt it will ever be enabled by default (and we're adding another check too), but I'll keep it in mind
[13:32:25] <Urbanecm>	 ack
[13:32:34] <volans>	 kormat: lol (fancy logging)
[13:32:36] <wikibugs>	 (PS1) Jcrespo: Add Dom Walden (dwalden) to the list of privileged ldap users [puppet] - https://gerrit.wikimedia.org/r/657338 (https://phabricator.wikimedia.org/T272477)
[13:35:04] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet
[13:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:24] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet
[13:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:47] <wikibugs>	 (CR) Jcrespo: [C: +2] Add Dom Walden (dwalden) to the list of privileged ldap users [puppet] - https://gerrit.wikimedia.org/r/657338 (https://phabricator.wikimedia.org/T272477) (owner: Jcrespo)
[13:38:38] <wikibugs>	 (PS4) Joal: profile::analytics::refinery Create HDFS folders [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560)
[13:41:25] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet
[13:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:50] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (Ottomata) > I tested the online tool which seems to be really cool. But accessing the event logging metrics in presto analytics hive I get the following error message:...
[13:44:52] <wikibugs>	 (PS3) Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560)
[13:46:20] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (jcrespo)
[13:53:07] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[13:55:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet
[13:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:07] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for Dom Walden - https://phabricator.wikimedia.org/T272477 (dom_walden) >>! In T272477#6761745, @jcrespo wrote: > Access has been deployed to LDAP and you should have immediately access to logstash. >  > Please read the note that i...
[14:00:04] <jouncebot>	 brennen and liw: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1400).
[14:00:28] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet
[14:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet
[14:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:23] <wikibugs>	 SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Joe) For the record, I'm running the script as follows on registry1001:  ` /usr/local/bin/registry-homepage-builder docker-registry.wikimedia.or...
[14:07:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet
[14:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:24] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) @Cmjohnson before racking the remaining 6 nodes (that we can do it in another task) could you check  an-worker1119 and an-worker1131 to see if they...
[14:07:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
[14:07:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] services: similar-users discovery and LVS component (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[14:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:01] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:12:25] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1075.eqiad.wmnet with reason: Rebooting for T272255
[14:12:26] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1075.eqiad.wmnet with reason: Rebooting for T272255
[14:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:31] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1075 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13848 and previous config saved to /var/cache/conftool/dbconfig/20210120-141230-kormat.json
[14:12:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
[14:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:05] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
[14:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:57] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: use a bit of jinja and inject dhcp_domain into cloud-config [puppet] - https://gerrit.wikimedia.org/r/657230 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[14:19:05] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
[14:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:57] <wikibugs>	 (CR) Muehlenhoff: "Looks good, couple of nits inline" (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[14:20:06] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
[14:20:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:33] <wikibugs>	 (CR) Thcipriani: [C: +1] Add Tyler as approval contact for Phabricator-related groups [puppet] - https://gerrit.wikimedia.org/r/657291 (owner: Muehlenhoff)
[14:21:40] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T272255', diff saved to https://phabricator.wikimedia.org/P13849 and previous config saved to /var/cache/conftool/dbconfig/20210120-142139-kormat.json
[14:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:03] <ottomata>	 brennen, liw: o/ I have some config changes coming in, what's the train status? Can I sync those?
[14:26:03] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
[14:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:31] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1076.eqiad.wmnet with reason: Rebooting for T272255
[14:26:32] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1076.eqiad.wmnet with reason: Rebooting for T272255
[14:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:37] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13850 and previous config saved to /var/cache/conftool/dbconfig/20210120-142636-kormat.json
[14:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:49] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[14:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:28] <wikibugs>	 (PS1) Ottomata: Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657343 (https://phabricator.wikimedia.org/T271165)
[14:28:21] <wikibugs>	 (PS2) Hnowlan: similar-users: one replica in staging, 2 in prod [deployment-charts] - https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837)
[14:29:03] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: Jbond)
[14:30:24] <wikibugs>	 (CR) Hnowlan: [C: +2] similar-users: one replica in staging, 2 in prod [deployment-charts] - https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[14:30:41] <wikibugs>	 (CR) Ottomata: [C: +2] Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657343 (https://phabricator.wikimedia.org/T271165) (owner: Ottomata)
[14:31:39] <ottomata>	 brennen, liw: looks like the train is blocked? Proceeding with config changes, they are low risk, please tell me if I should stop
[14:31:45] <wikibugs>	 (Merged) jenkins-bot: similar-users: one replica in staging, 2 in prod [deployment-charts] - https://gerrit.wikimedia.org/r/657335 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[14:31:48] <liw>	 ottomata, brennen is asleep, I'm not moving train this window, go ahead
[14:31:49] <wikibugs>	 (PS1) David Caro: Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486)
[14:31:53] <ottomata>	 ok thank you
[14:32:08] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[14:32:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[14:32:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[14:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:12] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[14:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:25] <wikibugs>	 (CR) Andrew Bogott: [C: +1] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) (owner: David Caro)
[14:33:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/657301 (owner: Arturo Borrero Gonzalez)
[14:34:10] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
[14:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:53] <wikibugs>	 (CR) David Caro: [C: +2] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) (owner: David Caro)
[14:34:58] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] Add support for php deployments (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/651757 (owner: Giuseppe Lavagetto)
[14:35:22] <wikibugs>	 (Merged) jenkins-bot: Revert "Discard the non-whitelisted 172.16.0.0/12 traffic" [homer/public] - https://gerrit.wikimedia.org/r/657345 (https://phabricator.wikimedia.org/T272486) (owner: David Caro)
[14:35:53] <wikibugs>	 (PS9) Elukey: Add new service eventstreams-internal [deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: Ottomata)
[14:35:59] <ottomata>	 ^ :o
[14:36:00] <ottomata>	 :)
[14:36:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (Andrew)
[14:36:37] <wikibugs>	 (PS1) Ottomata: Declare streams QuickSurveysResponses and QuickSurveyInitiation [mediawiki-config] - https://gerrit.wikimedia.org/r/657346 (https://phabricator.wikimedia.org/T271165)
[14:37:05] <wikibugs>	 (PS2) Ottomata: Declare streams QuickSurveysResponses and QuickSurveyInitiation [mediawiki-config] - https://gerrit.wikimedia.org/r/657346 (https://phabricator.wikimedia.org/T271165)
[14:38:08] <wikibugs>	 (CR) Elukey: [C: +2] Add new service eventstreams-internal [deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: Ottomata)
[14:39:03] <wikibugs>	 (CR) Ottomata: [C: +2] Declare streams QuickSurveysResponses and QuickSurveyInitiation [mediawiki-config] - https://gerrit.wikimedia.org/r/657346 (https://phabricator.wikimedia.org/T271165) (owner: Ottomata)
[14:40:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet
[14:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:19] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
[14:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:23] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: fix ## template: jinja intro [puppet] - https://gerrit.wikimedia.org/r/657348 (https://phabricator.wikimedia.org/T271273)
[14:44:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: fix ## template: jinja intro [puppet] - https://gerrit.wikimedia.org/r/657348 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[14:44:52] <wikibugs>	 (PS1) CDanis: klaxon: autorestart on envfile changes [puppet] - https://gerrit.wikimedia.org/r/657349
[14:45:21] <wikibugs>	 (PS1) Hnowlan: similar-users: make worker timeout configurable [deployment-charts] - https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837)
[14:45:48] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 33%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13851 and previous config saved to /var/cache/conftool/dbconfig/20210120-144547-kormat.json
[14:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:47] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
[14:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:54] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[14:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:50] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate QuickSurveys schemas to EventGate on testwiki - T271165, T271166 (duration: 01m 06s)
[14:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:55] <stashbot>	 T271165: QuickSurveyInitiation Event Platform Migration - https://phabricator.wikimedia.org/T271165
[14:47:55] <stashbot>	 T271166: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166
[14:48:23] <wikibugs>	 (PS2) Andrew Bogott: Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117
[14:48:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117 (owner: Andrew Bogott)
[14:51:03] <wikibugs>	 (CR) CDanis: [C: +2] klaxon: autorestart on envfile changes [puppet] - https://gerrit.wikimedia.org/r/657349 (owner: CDanis)
[14:51:32] <wikibugs>	 (PS3) Andrew Bogott: Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117
[14:52:45] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "Nova: add a simple vendordata REST service" [puppet] - https://gerrit.wikimedia.org/r/657117 (owner: Andrew Bogott)
[14:53:03] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[14:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:15] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[14:53:55] <wikibugs>	 (PS1) Ottomata: Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/657351 (https://phabricator.wikimedia.org/T271165)
[14:55:28] <wikibugs>	 (CR) Ottomata: [C: +2] Migrate QuickSurveyInitiation and QuickSurveysResponses to eventgate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/657351 (https://phabricator.wikimedia.org/T271165) (owner: Ottomata)
[14:55:34] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[14:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond)
[14:55:59] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1109.eqiad.wmnet with reason: Rebooting for T272255
[14:56:00] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1109.eqiad.wmnet with reason: Rebooting for T272255
[14:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:05] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13852 and previous config saved to /var/cache/conftool/dbconfig/20210120-145605-kormat.json
[14:56:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:09] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate QuickSurveys schemas to EventGate on all wikis - T271165, T271166 (duration: 01m 05s)
[14:57:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:13] <stashbot>	 T271165: QuickSurveyInitiation Event Platform Migration - https://phabricator.wikimedia.org/T271165
[14:57:13] <stashbot>	 T271166: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166
[14:57:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 238408568 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:59:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 484728 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:59:34] <logmsgbot>	 !log elukey@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[14:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:52] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 66%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13853 and previous config saved to /var/cache/conftool/dbconfig/20210120-150051-kormat.json
[15:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:39] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10wikitrent)
[15:01:53] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
[15:01:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:16] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13854 and previous config saved to /var/cache/conftool/dbconfig/20210120-150216-kormat.json
[15:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:18] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
[15:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:02] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
[15:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:13] <wikibugs>	 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (10ArielGlenn) >>! In T271718#6755969, @jcrespo wrote: > Next meeting is expected to happen on 25 Ja...
[15:09:27] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:12:09] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
[15:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:14] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "> you expose your helm-releases to changes in the infra which will ofc not be picked up automatically" [puppet] - 10https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: 10Ottomata)
[15:13:35] <wikibugs>	 (03PS2) 10Muehlenhoff: Add Tyler as approval contact for Phabricator-related groups [puppet] - 10https://gerrit.wikimedia.org/r/657291
[15:15:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Tyler as approval contact for Phabricator-related groups [puppet] - 10https://gerrit.wikimedia.org/r/657291 (owner: 10Muehlenhoff)
[15:15:55] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13855 and previous config saved to /var/cache/conftool/dbconfig/20210120-151555-kormat.json
[15:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:13] <brennen>	 jouncebot now
[15:16:13] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1400)
[15:17:01] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
[15:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:20] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13856 and previous config saved to /var/cache/conftool/dbconfig/20210120-151719-kormat.json
[15:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:50] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
[15:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:05] <brennen>	 !log 1.36.0-wmf.27 train unblocked, proceeding to group0 (T271341)
[15:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:14] <stashbot>	 T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341
[15:21:23] <wikibugs>	 (03PS1) 10Brennen Bearnes: group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657357
[15:21:25] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657357 (owner: 10Brennen Bearnes)
[15:22:14] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:22:25] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657357 (owner: 10Brennen Bearnes)
[15:23:34] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
[15:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:10] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.27
[15:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:26] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - 10https://gerrit.wikimedia.org/r/657358
[15:29:36] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] similar-users: make worker timeout configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:30:44] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (owner: 10Arturo Borrero Gonzalez)
[15:31:08] <wikibugs>	 (03Merged) 10jenkins-bot: similar-users: make worker timeout configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/657350 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:31:42] <wikibugs>	 (03CR) 10Abijeet Patro: [C: 03+1] Set wgTranslateGroupSynchronizationCache to false explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657337 (https://phabricator.wikimedia.org/T272428) (owner: 10Urbanecm)
[15:32:01] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[15:32:01] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:32:01] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[15:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:07] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13857 and previous config saved to /var/cache/conftool/dbconfig/20210120-153223-kormat.json
[15:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:27] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:34:27] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[15:34:27] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[15:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:19] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
[15:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:27] <wikibugs>	 10SRE, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi This policy change has been implemented for >6 months now and seems to work well (i.e. no incidents left acknowledged)
[15:43:34] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[15:43:34] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:43:34] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[15:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:26] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' .
[15:46:26] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' .
[15:46:26] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:46:28] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
[15:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:27] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13858 and previous config saved to /var/cache/conftool/dbconfig/20210120-154726-kormat.json
[15:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:10] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[15:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:11] <logmsgbot>	 !log elukey@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[15:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:42] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
[15:56:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
[15:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:24] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:58:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:34] <wikibugs>	 (03PS1) 10Ladsgroup: refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953)
[15:59:33] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:45] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27536/" [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:04:34] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
[16:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) @BBlack Can you schedule this for this afternoon?
[16:06:15] <wikibugs>	 (03PS1) 10Ladsgroup: analytics: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657364 (https://phabricator.wikimedia.org/T209953)
[16:06:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) @Cmjohnson Yes, just let me know a timeframe and we'll get it ready
[16:08:15] <wikibugs>	 (03CR) 10Elukey: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:08:28] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27537/" [puppet] - 10https://gerrit.wikimedia.org/r/657364 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:08:37] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "\o/ yay" [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (owner: 10Arturo Borrero Gonzalez)
[16:09:23] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) For the record, I got the script running, by using   ` /usr/local/bin/registry-homepage-builder docker-registry.wikimedia.org /root/homepag...
[16:09:27] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] analytics: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657364 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:09:50] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) @BBlack Okay, 2 hours from now.   1300 EST
[16:10:33] <wikibugs>	 (03CR) 10Ladsgroup: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:20:56] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet
[16:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:25:09] <wikibugs>	 (03CR) 10Elukey: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:29:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet
[16:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:45] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet
[16:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) @Cmjohnson - we'll have it ready then.
[16:35:44] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Introduce linkrecommendation{,-external} [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978)
[16:37:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Introduce linkrecommendation{,-external} (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:37:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce linkrecommendation{,-external} [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:37:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet
[16:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:26] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet
[16:37:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:56] <wikibugs>	 (03PS2) 10Ladsgroup: refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953)
[16:39:33] <wikibugs>	 (03CR) 10Ladsgroup: refinery: Migrate hiera() to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[16:39:47] <wikibugs>	 (03CR) 10Volans: [C: 03+1] Introduce linkrecommendation{,-external} (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:40:15] <volans>	 akosiaris: you might have forgotten to submit and deploy^^^ ;)
[16:42:38] <wikibugs>	 (03PS7) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[16:44:46] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet
[16:44:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/657155 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm)
[16:46:00] <akosiaris>	 volans: done. I was just working through gerrit complaining about "1 unresolved comment". But you resolved it for me :-)
[16:46:15] <akosiaris>	 nice race condition
[16:46:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[16:46:16] <volans>	 :D
[16:46:33] <wikibugs>	 (03PS3) 10CDanis: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117
[16:48:23] <wikibugs>	 (03CR) 10CDanis: Send pages with user's email address, if available (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis)
[16:48:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] network: add cloud_networks_public constant and use it in ferm [puppet] - 10https://gerrit.wikimedia.org/r/657301 (owner: 10Arturo Borrero Gonzalez)
[16:49:24] <wikibugs>	 (03PS4) 10CDanis: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117
[16:51:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[16:52:18] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) Hello, @wikitrent!  Aside from the approval, may I ask you for your Wikitech user name -or create one if you don't have one- and wikimedia email handle?  Also, while not technically required...
[16:53:45] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] swift: decrease object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi)
[16:54:45] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[16:56:10] <wikibugs>	 (03PS1) 10Cwhite: logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565)
[16:56:12] <wikibugs>	 (03PS1) 10Cwhite: profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565)
[16:56:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)
[16:56:57] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[17:01:13] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (https://phabricator.wikimedia.org/T272486)
[17:06:23] <wikibugs>	 (03PS1) 10Joal: Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415)
[17:06:25] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) >>! In T179696#6761286, @Joe wrote: > Given the banner page we're creating is for use by the public, I think it can simply run against...
[17:06:37] <joal>	 elukey: --^
[17:07:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415) (owner: 10Joal)
[17:08:26] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH)
[17:08:29] <elukey>	 joal: I think require goes last
[17:08:33] <joal>	 Arf
[17:08:34] <joal>	 ok
[17:08:37] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH)
[17:09:20] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10wikitrent) hi @jcrespo !  Thanks for getting back to me! I've linked the LDAP account and my username is Wikitrent. Email is thand@wikimedia.org
[17:10:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:13:30] <wikibugs>	 (03PS2) 10Joal: Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415)
[17:15:27] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:16:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Update HDFS folder creation for analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/657372 (https://phabricator.wikimedia.org/T271415) (owner: 10Joal)
[17:19:35] * volans looking ^^^
[17:19:48] <elukey>	 volans: please stop breaking netbox
[17:19:50] <wikibugs>	 (03PS1) 10Hnowlan: similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373
[17:19:55] <volans>	 was not me :D
[17:20:19] <papaul>	 elukey: netbox = volans 
[17:21:48] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 (owner: 10Hnowlan)
[17:22:49] <elukey>	 papaul: :D 
[17:22:58] <papaul>	 he can't escape from that 
[17:23:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo)
[17:23:52] <volans>	 akosiaris: I'm running the sre.dns.netbox cookbook to address the above ^^^ Until we decide whether or not to use netbox as the source of truth for the .svc. zonefiles, this will fire, as there is a diff even though it's not actually included live. Sorry for the noise
[17:23:59] <elukey>	 papaul: we joke about it but we are lucky to have Riccardo patrolling netbox :D
[17:24:09] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[17:24:57] <logmsgbot>	 !log volans@cumin2001 START - Cookbook sre.dns.netbox
[17:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:05] <papaul>	 elukey: yes, he does a great job, I give him credit for that, always available to help
[17:25:54] <volans>	 <3 :)
[17:25:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) 05Open→03Resolved All tasks have been completed, please try to access your cluster account to confirm it works as intended and you have the right permissions.  For kerberos, you...
[17:26:07] <papaul>	 volans: we will change netbox name to netvolans
[17:26:16] <volans>	 lol
[17:26:29] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:27:45] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 (owner: 10Hnowlan)
[17:28:27] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:29:11] <wikibugs>	 (03Merged) 10jenkins-bot: similar-users: new version of image, more debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/657373 (owner: 10Hnowlan)
[17:29:41] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) Get up to deploying the service in staging, it seems working! Updated all...
[17:31:37] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[17:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:27] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Aklapper) Hi and welcome @wikitrent. Please update the team's onboarding docs to link to https://phabricator.wikimedia.org/project/profile/1564/ which has instructions. Thanks a lot! :)
[17:34:26] <wikibugs>	 (03CR) 10Elukey: profile::analytics::refinery::job::hdfs_cleaner Update (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[17:35:12] <wikibugs>	 (03PS1) 10Brennen Bearnes: Catch ClosestFilterVersionNotFoundException in ViewDiff [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657322 (https://phabricator.wikimedia.org/T272505)
[17:35:17] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:36:23] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Catch ClosestFilterVersionNotFoundException in ViewDiff [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657322 (https://phabricator.wikimedia.org/T272505) (owner: 10Brennen Bearnes)
[17:36:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) @Aklapper thanks for mentioning it, I was in a good mood I didn't want to be very inflexible with procedures for a new colleague, but following and using the templates helps indeed to speed...
[17:36:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:40:06] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[17:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:51] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489)
[17:43:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:43:22] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[17:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:24] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis)
[17:44:40] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis)
[17:46:14] <wikibugs>	 (03Merged) 10jenkins-bot: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis)
[17:49:38] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:50:22] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:51:14] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:51:50] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:52:57] <wikibugs>	 (03PS2) 10CRusnov: interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199
[17:52:59] <wikibugs>	 (CR) CRusnov: interface_automation: Clean up old interfaces on run (6 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[17:53:01] <wikibugs>	 (03PS2) 10CRusnov: interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207
[17:54:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov)
[17:54:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov)
[17:54:27] <bblack>	 !log lvs1015: stopping pybal with puppet disabled for T272258
[17:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:34] <stashbot>	 T272258: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258
[17:56:01] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:56:55] <bblack>	 BGP alerts are related to pybal stop above, will ack them all shortly
[17:57:12] <cdanis>	 probably also a "mediawiki exceptions/minute" alert incoming shortly
[17:58:13] <logmsgbot>	 !log volans@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:23] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1015 hw maint - T272258 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:58:23] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1015 hw maint - T272258 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:58:50] <bblack>	 cdanis: from the pybal shift?
[17:59:13] <cdanis>	 bblack: yeah -- doing so resets all open connections; there's always some disruption visible to internal clients
[17:59:36] <bblack>	 it doesn't have to, but it might commonly due to <things>
[17:59:53] <cdanis>	 IME it commonly does :)
[18:01:31] <wikibugs>	 (03PS3) 10CRusnov: interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199
[18:01:33] <wikibugs>	 (03PS3) 10CRusnov: interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207
[18:01:34] <bblack>	 !log lvs1015 - shutdown for T272258
[18:01:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:38] <stashbot>	 T272258: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258
[18:05:21] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:07:54] <wikibugs>	 (03Merged) 10jenkins-bot: Catch ClosestFilterVersionNotFoundException in ViewDiff [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657322 (https://phabricator.wikimedia.org/T272505) (owner: 10Brennen Bearnes)
[18:08:40] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2316.codfw.wmnet with reason: REIMAGE
[18:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27538/" [puppet] - 10https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:09:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:09:25] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2325.codfw.wmnet with reason: REIMAGE
[18:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:10] <wikibugs>	 (03CR) 10Dzahn: "noop confirmed on rdb1006" [puppet] - 10https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:10:17] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2327.codfw.wmnet with reason: REIMAGE
[18:10:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:10:40] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2316.codfw.wmnet with reason: REIMAGE
[18:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2329.codfw.wmnet with reason: REIMAGE
[18:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2325.codfw.wmnet with reason: REIMAGE
[18:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:32] <cdanis>	 OSPF status 👀
[18:14:10] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10Cmjohnson) @BBlack I swapped both optics on lvs1015 and b2 xe-2/0/3
[18:14:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2329.codfw.wmnet with reason: REIMAGE
[18:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:58] <brennen>	 jouncebot now
[18:14:58] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 45 minute(s)
[18:15:00] <brennen>	 noting here that i'm going to sling out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/657322
[18:15:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:15:17] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2327.codfw.wmnet with reason: REIMAGE
[18:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:57] <cdanis>	 XioNoX: AFAICT, on cr2-esams, xe-0/1/3 (Lumen transport to cr2-eqiad) is flapping?
[18:16:54] <XioNoX>	 no planned maintenance
[18:17:09] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster
[18:17:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster
[18:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:32] <wikibugs>	 (03PS1) 10Jbond: icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385
[18:18:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:19:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:19:42] <wikibugs>	 (CR) Hnowlan: services: similar-users discovery and LVS component (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[18:20:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:21:55] <XioNoX>	 looking
[18:22:10] <mutante>	 !log ganeti - creating 105G virtual harddisk and adding to releases1002 for T272092
[18:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:15] <stashbot>	 T272092: Request volume for Docker images and container filesystems on releases machines - https://phabricator.wikimedia.org/T272092
[18:22:17] <mutante>	 s/105/150
[18:23:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:23:19] <XioNoX>	 could be a faulty optic, great :)
[18:24:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:24:34] <mutante>	 !log ganeti - creating 150G virtual hard disk and adding it to releases2002 for T272092
[18:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond)
[18:25:00] <XioNoX>	 !log draining esams-eqiad link
[18:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) the issue I ran into is db1169 was created in netbox w/out a mgmt ip.  I didn't see until I went through and assigned mgmt IP's. so now everything is 1 off, I...
[18:26:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Cmjohnson) This is planned for this coming Friday around 1100 EST
[18:27:38] <wikibugs>	 SRE, ops-eqiad, Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (Cmjohnson) Open→Resolved resolving this task, if the issue persists please re-open and ping me.  Thanks!
[18:29:40] <bblack>	 !log lvs1015: re-enabling puppet + pybal - T272258
[18:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:46] <stashbot>	 T272258: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258
[18:32:36] <wikibugs>	 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10Cmjohnson) Locally, ms-be1046 is not coming up either. I tried removing the power and psu's waiting for 45 secs and plugging back in. The server will not power-on.  A Dell support task will need to be opened.
[18:33:13] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2325.codfw.wmnet'] `  an...
[18:33:46] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2316.codfw.wmnet'] `  an...
[18:34:41] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) This server is out of warranty, I do not have any HP 4TB disks to replace it with.  I may have some old Dell ones I can use.  They should work
[18:35:30] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2327.codfw.wmnet'] `  an...
[18:36:06] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2329.codfw.wmnet'] `  an...
[18:36:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[18:37:36] <wikibugs>	 (03CR) 10Elukey: "Ah snap  this collides with another change that we just merged, can you rebase Amir? Sorry :(" [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[18:39:16] <XioNoX>	 01/20/2021 18:10:30 GMT - Light level testing has identified an issue 21 miles from the test site and Field Operations is working with the Transport NOC for the next step to take.
[18:42:06] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:42:49] <logmsgbot>	 !log brennen@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/AbuseFilter/includes/View/AbuseFilterViewDiff.php: Backport: [[gerrit:657366|Catch ClosestFilterVersionNotFoundException in ViewDiff (T272505)]] (duration: 01m 06s)
[18:42:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:53] <stashbot>	 T272505: FilterLookup: No version of filter [x] closest to [y] found - https://phabricator.wikimedia.org/T272505
[18:45:37] <wikibugs>	 (03PS5) 10Legoktm: mailman3: Add parts for Postorius (web interface) [puppet] - 10https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup)
[18:46:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Record:      1 Date/Time:   08/31/2020 17:37:02 Source:      system Severity:    Ok Description: Log cleared. ------------------------------------------------------------------------------- Recor...
[18:46:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Swapped DIMM A4 with DIMM B4, cleared the system log and powered on.  Let's see if the error returns, stays the same or changes.
[18:47:11] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] mailman3: Add parts for Postorius (web interface) [puppet] - 10https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup)
[18:49:31] <XioNoX>	 I created https://phabricator.wikimedia.org/T272524 to not forget to put it back in service
[18:51:05] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2316.codfw.wmnet
[18:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2325.codfw.wmnet
[18:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:35] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2327.codfw.wmnet
[18:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:45] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2329.codfw.wmnet
[18:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:13] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 3.463e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009
[18:53:34] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2316.codfw.wmnet
[18:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:42] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2329.codfw.wmnet
[18:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:53] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2327.codfw.wmnet
[18:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:24] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2325.codfw.wmnet
[18:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:31] <wikibugs>	 (03PS1) 10Ottomata: Refactor EventLogging Event Platform PHP integration [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121)
[18:55:33] <wikibugs>	 (03PS1) 10Ottomata: Fix possible undefined index warning in arg checking in EventServiceClient [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121)
[18:57:08] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:57:29] <wikibugs>	 (03CR) 10Mholloway: [C: 03+1] Refactor EventLogging Event Platform PHP integration [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata)
[18:57:34] <wikibugs>	 (03CR) 10Mholloway: [C: 03+1] Fix possible undefined index warning in arg checking in EventServiceClient [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata)
[18:58:08] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:58:47] <icinga-wm>	 PROBLEM - Host clouddb1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:59:13] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:59:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:59:58] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:00:04] <jouncebot>	 brennen and liw: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1900).
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T1900).
[19:00:04] <jouncebot>	 Kizule: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:12] <wikibugs>	 (03PS3) 10Ladsgroup: refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953)
[19:00:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10wikitrent) @Aklapper  & @jcrespo  I didn't follow what you were asking and created this task from the link: https://phabricator.wikimedia.org/T272525  Can we ignore/delete the n...
[19:00:20] <Kizule>	 jouncebot: I'm glad to hear it. ;)
[19:00:34] <Urbanecm>	 Kizule: finally you arrived :)
[19:00:36] <Urbanecm>	 I can deploy today
[19:00:52] <brennen>	 RoanKattouw, Niharika, Urbanecm: the train is blocked from rolling forward, but also where it should be theoretically at the moment (group0), so feel free to go ahead with backports.
[19:01:02] <wikibugs>	 (03CR) 10Ladsgroup: "Done. It actually did the work I was doing, so I just dropped the file from my patch." [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[19:01:19] <Urbanecm>	 brennen: ack, thanks, I'm going to do config only for now anyway :)
[19:01:22] <Kizule>	 Urbanecm: Yup, I’ve been unstable lately and juggling with so many things.
[19:01:42] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo)
[19:01:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable visualeditor on kuwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655714 (https://phabricator.wikimedia.org/T270841) (owner: 10Zoranzoki21)
[19:02:44] <wikibugs>	 (03Merged) 10jenkins-bot: Enable visualeditor on kuwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655714 (https://phabricator.wikimedia.org/T270841) (owner: 10Zoranzoki21)
[19:03:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:04:09] <wikibugs>	 (03PS1) 10Ayounsi: Add Lumen transit BGP to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/657395 (https://phabricator.wikimedia.org/T270439)
[19:04:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) Don't worry, please read when you have time Phabricator (and task management)'s documentation, it will help you for future tickets): https://www.mediawiki.org/wiki/Bug_...
[19:04:25] <Urbanecm>	 Kizule: please test at mwdebug1001
[19:04:30] <Kizule>	 Urbanecm: Sure
[19:04:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Fast response by the server, after swapping the DIMM, the server was stuck in a continuous reboot. connected the console and see that the server is failing during post at the memory check.  Not s...
[19:05:10] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add Lumen transit BGP to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/657395 (https://phabricator.wikimedia.org/T270439) (owner: 10Ayounsi)
[19:05:39] <wikibugs>	 (03Merged) 10jenkins-bot: Add Lumen transit BGP to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/657395 (https://phabricator.wikimedia.org/T270439) (owner: 10Ayounsi)
[19:07:07] <Kizule>	 Urbanecm: VE disappeared from "beta features" (it is okay). But, there isn't pen in the source editor with which I can switch to the visual editor.
[19:07:27] <Urbanecm>	 Kizule: mind screenshotting?
[19:07:36] <Kizule>	 Urbanecm: I'll.
[19:08:24] <wikibugs>	 (03PS1) 10Ayounsi: Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/657396
[19:08:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:09:21] <Kizule>	 Urbanecm: Everything looks good, maybe some caching issues happened in my browser, but now everything is correct for me.
[19:09:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "confirmed on mwmaint (wikitech LDAP) and ldap-corp (OIT LDAP).  full-time employee, UID matches, lgtm, all it needs is aezell's approval o" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo)
[19:09:28] <Urbanecm>	 okay
[19:09:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/657396 (owner: 10Ayounsi)
[19:10:05] <wikibugs>	 (03Merged) 10jenkins-bot: Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/657396 (owner: 10Ayounsi)
[19:10:24] <Kizule>	 Urbanecm: So... you can sync this.
[19:10:30] <Urbanecm>	 syncing
[19:11:18] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: a736d97463e7a42b41dbcff19a8c2c3c62f8bf6d: Enable visualeditor on kuwiki by default (T270841; 1/2) (duration: 01m 04s)
[19:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:24] <stashbot>	 T270841: Enable VisualEditor by default for all users of the ku.wikipedia - https://phabricator.wikimedia.org/T270841
[19:11:36] <XioNoX>	 !log add BGP to Lumen in eqiad
[19:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:55] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/config/kuwiki.yaml: a736d97463e7a42b41dbcff19a8c2c3c62f8bf6d: Enable visualeditor on kuwiki by default (T270841; 2/2) (duration: 01m 05s)
[19:13:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:12] <Urbanecm>	 Kizule: here you go :)
[19:13:29] <wikibugs>	 (CR) Joal: "Thanks for the review @elukey - I'm sorry for the forgotten changes :S" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[19:13:31] <wikibugs>	 (03PS1) 10Urbanecm: [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526)
[19:13:41] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki: reduce the number of cached keys that trigger a restart [puppet] - 10https://gerrit.wikimedia.org/r/657398 (https://phabricator.wikimedia.org/T245183)
[19:13:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:13:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:13:53] <Kizule>	 Urbanecm: Thank you, everything works.
[19:14:08] <wikibugs>	 (03PS1) 10David Caro: wmcs.backup.images: Fix full backup creation [puppet] - 10https://gerrit.wikimedia.org/r/657399 (https://phabricator.wikimedia.org/T272510)
[19:14:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Radar, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson @elukey I swapped the SSD.  The only spare I had is 300GB. It's new.  Feel free to do what you need. I am resolving this t...
[19:14:09] <Urbanecm>	 great :)
[19:14:36] <wikibugs>	 (03PS4) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560)
[19:16:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE
[19:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:09] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2333.codfw.wmnet with reason: REIMAGE
[19:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:21] <wikibugs>	 (03PS2) 10Urbanecm: [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526)
[19:17:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) @Aklapper @jcrespo - thanks for the help with this task!  > Hi and welcome @wikitrent. Please update the team's onboarding docs to link to https://phabricator.wikimed...
[19:17:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:18:00] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm)
[19:18:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2335.codfw.wmnet with reason: REIMAGE
[19:18:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE
[19:18:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:00] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2337.codfw.wmnet with reason: REIMAGE
[19:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:05] <wikibugs>	 (03Merged) 10jenkins-bot: [enwiki] Update celebration logo to "option A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657397 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm)
[19:20:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2333.codfw.wmnet with reason: REIMAGE
[19:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:39] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) I did swap it with a 4TB disk from a Dell server. Hopefully, this works.
[19:20:51] <wikibugs>	 (03PS1) 10Andrew Bogott: nova vendordata/firstboot: move puppet logic into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273)
[19:22:18] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2335.codfw.wmnet with reason: REIMAGE
[19:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:30] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/project-logos: 13fb338249b3ec73e380c4971ee697f28a2f6d76: [enwiki] Update celebration logo to "option A" (T272526) (duration: 01m 05s)
[19:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:34] <stashbot>	 T272526: Change enwiki logo to "Option A" until February 4 - https://phabricator.wikimedia.org/T272526
[19:23:00] <wikibugs>	 10SRE, 10Graphoid, 10Platform Engineering, 10serviceops: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495 (10Jdlrobson) Opened T272530 with suggested high priority
[19:24:00] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2337.codfw.wmnet with reason: REIMAGE
[19:24:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:30] <effie>	 !log depool and repool thumbor* to upgrade python-thumbor-wikimedia to v2.9 
[19:24:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:27:15] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2337 is CRITICAL: Host mw2337 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:27:49] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on an-coord1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1002&var-datasource=eqiad+prometheus/ops
[19:27:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:28:27] <wikibugs>	 (03CR) 10Jcrespo: "BTW, I don't think management approval is needed according to our procedures, but it helps establishing the connection between the coporat" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo)
[19:29:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw2337 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[19:29:59] <wikibugs>	 (03PS1) 10Urbanecm: Revert "[enwiki] Update celebration logo to "option A"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657323 (https://phabricator.wikimedia.org/T272526)
[19:30:15] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "[enwiki] Update celebration logo to "option A"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657323 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm)
[19:30:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) FYI, this is only blocked on @aezel response T272489#6762016 (not deployed yet).
[19:31:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "[enwiki] Update celebration logo to "option A"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657323 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm)
[19:31:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) a:05Tchanders→03aezell
[19:32:02] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) p:05Triage→03High
[19:33:41] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/project-logos: 5c941678ec739dd6b5257b4a8f866b7e3a257f45: Revert: [enwiki] Update celebration logo to "option A" (T272526) (duration: 01m 04s)
[19:33:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:46] <stashbot>	 T272526: Change enwiki logo to "Option A" until February 4 - https://phabricator.wikimedia.org/T272526
[19:34:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:35:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) p:05Triage→03High a:03JTannerWMF Waiting on additional user information to be provided, following Aklapper provided link.
[19:38:33] <icinga-wm>	 PROBLEM - Host mw2337 is DOWN: PING CRITICAL - Packet loss = 100%
[19:39:00] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2331.codfw.wmnet'] `  an...
[19:39:05] <legoktm>	 Urbanecm: why reverted?
[19:39:35] <wikibugs>	 (03PS1) 10CDanis: add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330)
[19:39:36] <Urbanecm>	 legoktm: because it's too big and it would require dropping the HD logo
[19:39:43] <Urbanecm>	 it looks like this https://usercontent.irccloud-cdn.com/file/kxXQmCdD/image.png
[19:39:49] <icinga-wm>	 RECOVERY - Host mw2337 is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms
[19:39:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw2337 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 1.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:40:11] <icinga-wm>	 PROBLEM - Check systemd state on mw2337 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:11] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2333.codfw.wmnet'] `  an...
[19:40:19] <legoktm>	 Urbanecm: gotcha :|
[19:40:27] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1009 is OK: (C)5e+06 ge (W)1e+06 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009
[19:40:43] <Urbanecm>	 sadly i wasn't able to make it fit easily, so i reverted :/
[19:40:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2335.codfw.wmnet'] `  an...
[19:41:35] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2337.codfw.wmnet'] `  an...
[19:42:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2337 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:42:12] <Urbanecm>	 legoktm: I'll put a note on the task momentarily :). 
[19:42:56] <legoktm>	 :) thanks
[19:44:13] <icinga-wm>	 PROBLEM - PHP opcache health on mw2337 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:45:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:47:40] <bblack>	 !log lvs1015: stopping pybal to try to fix a lingering ifup service state issue on the host, which may require downing an interface
[19:47:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:38] <bblack>	 !log lvs1015: bringing pybal back online
[19:50:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:00] <wikibugs>	 (03PS1) 10CDanis: add bot_posts_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/657406
[19:55:48] <wikibugs>	 (03PS1) 10Tpt: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163)
[19:57:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt)
[19:57:17] <wikibugs>	 (03PS2) 10Tpt: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163)
[19:57:51] <wikibugs>	 (03PS2) 10CDanis: add bot_posts_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/657406
[19:57:53] <Urbanecm>	 legoktm: {{done}} as https://phabricator.wikimedia.org/T272526#6763287. 
[19:58:26] <wikibugs>	 (03CR) 10CDanis: [V: 03+2 C: 03+2] add bot_posts_blocked_nets [labs/private] - 10https://gerrit.wikimedia.org/r/657406 (owner: 10CDanis)
[20:00:04] <jouncebot>	 brennen and liw: Time to snap out of that daydream and deploy Mediawiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T2000).
[20:02:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) All appears healthy now and downtimes are removed, and librenms isn't showing those errors on the interface anymore, either.  Thanks!
[20:06:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:06:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:29] <brennen>	 !log 1.36.0-wmf.27 (T271341) train status as of deploy window: currently blocked at group0 on T272508
[20:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:33] <stashbot>	 T272508: PropertyInfoSnakUrlExpander: Bad value for parameter $snak->getDataValue(): must be a DataValues\StringValue - https://phabricator.wikimedia.org/T272508
[20:06:34] <stashbot>	 T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341
[20:08:48] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10aezell) As @wikitrent's manager, I approve this access.
[20:11:04] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:11:41] <wikibugs>	 (03PS2) 10CDanis: add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330)
[20:12:35] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:13:28] <wikibugs>	 (03PS3) 10CDanis: add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330)
[20:14:23] <wikibugs>	 (03PS2) 10Andrew Bogott: nova vendordata/firstboot: move puppet logic into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273)
[20:15:32] <wikibugs>	 (03CR) 10CDanis: "Added tests, re-checking once more, then will disable puppet on cps, merge, test on a few, re-enable." [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) (owner: 10CDanis)
[20:15:42] <icinga-wm>	 PROBLEM - PHP opcache health on mw2329 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:16:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:24] <cdanis>	 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕒🍵 sudo cumin A:cp 'disable-puppet "cdanis deploying I558346d T272330"'
[20:17:28] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕒🍵 sudo cumin A:cp 'disable-puppet "cdanis deploying I558346d T272330"'
[20:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:21:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:36] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] add bot_posts_blocked_nets: IP ranges to block POSTs from bot U-As [puppet] - 10https://gerrit.wikimedia.org/r/657402 (https://phabricator.wikimedia.org/T272330) (owner: 10CDanis)
[20:22:40] <brennen>	 jouncebot now
[20:22:40] <jouncebot>	 For the next 1 hour(s) and 37 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T2000)
[20:23:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:23:11] <brennen>	 !log 1.36.0-wmf.27 (T271341) train: proceeding to group1
[20:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:16] <stashbot>	 T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341
[20:24:30] <wikibugs>	 10SRE, 10puppet-compiler: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis)
[20:24:42] <wikibugs>	 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis)
[20:25:23] <wikibugs>	 (03PS1) 10Brennen Bearnes: group1 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657409
[20:25:25] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657409 (owner: 10Brennen Bearnes)
[20:26:16] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657409 (owner: 10Brennen Bearnes)
[20:28:48] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.27
[20:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:14] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:30:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:31:54] <logmsgbot>	 !log brennen@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.27 (duration: 03m 05s)
[20:31:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:32] <mutante>	 brennen: this time the spike was real, but only a spike. it's already back down, so it looks like it came from the deploy itself
[20:32:36] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:32:40] <mutante>	 there we go
[20:33:42] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: drop ECS messages on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[20:34:44] <brennen>	 whew.
[20:36:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson)
[20:37:05] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: Revert "hieradata: enable onhost memcached on mw2271" [puppet] - 10https://gerrit.wikimedia.org/r/631230 (owner: 10Effie Mouzeli)
[20:37:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) a:05Jclark-ctr→03RobH All of the servers are in the racks, idracs are setup including db1169 and db1175.   Outstanding items that @robh will do  - raid - p...
[20:38:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:41:15] <effie>	 !log restart mc-gp2001, mc-gp2002, mc-gp2003 for T269596
[20:41:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:58] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet
[20:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:18] <wikibugs>	 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis) There's one related problem, which is that `enable-puppet` should check the given message both with and without appending ` - $SUDO_USER`, as perhaps you set a disable-puppet from a context where th...
[20:44:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2331.codfw.wmnet
[20:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:35] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2333.codfw.wmnet
[20:44:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:48] <wikibugs>	 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10CDanis) >>! In T272539#6763461, @CDanis wrote: > There's one related problem, which is that `enable-puppet` should check the given message both with and without appending ` - $SUDO_USER`, as perhaps you set...
[20:45:15] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2335.codfw.wmnet
[20:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2337.codfw.wmnet
[20:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:45] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2331.codfw.wmnet
[20:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:18] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2333.codfw.wmnet
[20:46:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:40] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2335.codfw.wmnet
[20:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:46] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin A:cp 'enable-puppet "cdanis deploying I558346d T272330"' 
[20:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:50] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2337.codfw.wmnet
[20:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:25] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet
[20:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:42] <wikibugs>	 (03PS3) 10Effie Mouzeli: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[20:48:58] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[20:49:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "@elukey, since puppet does not trigger a memcached restart, I think we can merge it and slowly do the restarts" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[20:53:18] <icinga-wm>	 PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:54:40] <shdubsh>	 ^^ looking
[20:55:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:55:52] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:56:14] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:56:17] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet
[20:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:07] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:57:25] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:57:30] <icinga-wm>	 RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:57:34] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[20:58:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) a:05aezell→03jcrespo
[20:58:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:58:07] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/27548/" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[20:59:27] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Thanks @ahmon for having a look" [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi)
[20:59:44] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[21:00:05] <jouncebot>	 chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T2100).
[21:02:33] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet
[21:02:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:02] <icinga-wm>	 PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:07:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "+1, the approval is already there regardless of it being required" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo)
[21:12:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "compiled on everything, including alert1001 (this is a defined type, not a class and used in base and all over the place). shows there are" [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:13:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet
[21:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:32] <icinga-wm>	 RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:14:34] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:14:40] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Enable and fix build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657412 (https://phabricator.wikimedia.org/T179696)
[21:14:55] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2338.codfw.wmnet with reason: REIMAGE
[21:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:15] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2339.codfw.wmnet with reason: REIMAGE
[21:15:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:47] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2339.codfw.wmnet with reason: REIMAGE
[21:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:09] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2351.codfw.wmnet with reason: REIMAGE
[21:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:28] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2353.codfw.wmnet with reason: REIMAGE
[21:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:17:03] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 848763 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[21:17:27] <wikibugs>	 (CR) Kosta Harlan: [C: +1] [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/657292 (owner: Gergő Tisza)
[21:17:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2338.codfw.wmnet with reason: REIMAGE
[21:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:31] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet
[21:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:35] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2353.codfw.wmnet with reason: REIMAGE
[21:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:10] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2351.codfw.wmnet with reason: REIMAGE
[21:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:18] <wikibugs>	 (PS6) Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[21:22:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:10] <icinga-wm>	 PROBLEM - Memcached on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached
[21:22:48] <legoktm>	 ^ those are being reimaged by mutante
[21:23:02] <wikibugs>	 (PS2) Legoktm: docker_registry_ha: Enable and fix build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657412 (https://phabricator.wikimedia.org/T179696)
[21:23:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:23:10] <wikibugs>	 (CR) Legoktm: [C: +2] docker_registry_ha: Enable and fix build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657412 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[21:24:03] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@1313244]: Regular analytics weekly train [analytics/refinery@1313244]
[21:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:34] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2337 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:28:50] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:29:22] <icinga-wm>	 PROBLEM - PHP opcache health on mw2335 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:29:28] <icinga-wm>	 RECOVERY - Check systemd state on registry2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:31:30] <icinga-wm>	 PROBLEM - Memcached on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached
[21:31:43] <wikibugs>	 (PS7) Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[21:32:18] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:34:02] <icinga-wm>	 PROBLEM - Host mw2339 is DOWN: PING CRITICAL - Packet loss = 100%
[21:34:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:34:55] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@1313244]: Regular analytics weekly train [analytics/refinery@1313244] (duration: 10m 52s)
[21:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:02] <icinga-wm>	 PROBLEM - Check systemd state on registry2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:21] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@1313244] (thin): Regular analytics weekly train THIN [analytics/refinery@1313244]
[21:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:28] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@1313244] (thin): Regular analytics weekly train THIN [analytics/refinery@1313244] (duration: 00m 07s)
[21:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:37:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:37:16] <icinga-wm>	 RECOVERY - Host mw2339 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[21:37:18] <icinga-wm>	 RECOVERY - Memcached on mw2351 is OK: TCP OK - 0.034 second response time on 10.192.32.201 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[21:37:18] <icinga-wm>	 RECOVERY - Memcached on mw2339 is OK: TCP OK - 0.034 second response time on 10.192.32.117 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[21:37:20] <icinga-wm>	 RECOVERY - Apache HTTP on mw2351 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:37:21] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2338.codfw.wmnet'] `  an...
[21:37:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2339 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:37:54] <icinga-wm>	 PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:37:56] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2339.codfw.wmnet'] `  an...
[21:38:06] <wikibugs>	 (PS3) Andrew Bogott: nova vendordata/firstboot: move puppet logic into cloud-init [puppet] - https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273)
[21:38:28] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2351.codfw.wmnet'] `  an...
[21:39:01] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2353.codfw.wmnet'] `  an...
[21:40:24] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[21:40:32] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2337 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[21:41:22] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2335 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[21:42:04] <wikibugs>	 (PS1) Effie Mouzeli: varnish: include X-Client-Port in X-Analytics [puppet] - https://gerrit.wikimedia.org/r/657416 (https://phabricator.wikimedia.org/T181368)
[21:43:26] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2339 is CRITICAL: Host mw2339 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:43:41] <wikibugs>	 (PS8) Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[21:44:05] <wikibugs>	 (CR) Effie Mouzeli: varnish: Set debug=1 in X-Analytics header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: Effie Mouzeli)
[21:46:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:48:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:49:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2316 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:51:19] <icinga-wm>	 PROBLEM - PHP opcache health on mw2325 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:53:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:56:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:00:19] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:38] <wikibugs>	 (PS4) Andrew Bogott: nova vendordata/firstboot: move puppet config into cloud-init [puppet] - https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273)
[22:04:19] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:47] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:09] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2351 is CRITICAL: Host mw2351 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:09:14] <mutante>	 mediawiki-installation DSH group - don't worry, fixing it right now
[22:09:26] <mutante>	 it's the reimaging and because I was on a break
[22:09:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:11:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:11:39] <icinga-wm>	 RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:12:07] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:15] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:15:47] <icinga-wm>	 RECOVERY - Check systemd state on registry1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:51] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2338.codfw.wmnet
[22:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet
[22:16:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:33] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2351.codfw.wmnet
[22:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:37] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:16:48] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2353.codfw.wmnet
[22:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:42] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2338.codfw.wmnet
[22:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:49] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet
[22:17:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2351.codfw.wmnet
[22:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2353.codfw.wmnet
[22:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:51] <icinga-wm>	 PROBLEM - PHP opcache health on mw2327 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:19:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster
[22:19:28] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2327.codfw.wmnet with reason: new install on buster
[22:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:21:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:22:12] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:22:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:23:21] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:23:54] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:24:10] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2355.codfw.wmnet'] `  Of...
[22:24:22] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:24:47] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:25:03] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:25:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:27:02] <wikibugs>	 SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm)
[22:27:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:28:38] <wikibugs>	 SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn)
[22:28:40] <wikibugs>	 SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Dzahn)
[22:30:27] <wikibugs>	 (PS7) Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696)
[22:32:39] <wikibugs>	 (PS8) Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696)
[22:35:24] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27553/console" [puppet] - https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[22:37:17] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[22:41:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2355.codfw.wmnet with reason: REIMAGE
[22:41:15] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:52] <legoktm>	 https://docker-registry.wikimedia.org/ should work now
[22:42:09] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[22:43:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2355.codfw.wmnet with reason: REIMAGE
[22:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:25] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2357.codfw.wmnet with reason: REIMAGE
[22:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:43] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2339 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:43:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2359.codfw.wmnet with reason: REIMAGE
[22:43:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2361.codfw.wmnet with reason: REIMAGE
[22:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2357.codfw.wmnet with reason: REIMAGE
[22:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:15] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2361.codfw.wmnet with reason: REIMAGE
[22:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:50] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2359.codfw.wmnet with reason: REIMAGE
[22:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:55] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2359.codfw.wmnet with reason: new install on buster
[22:49:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2359.codfw.wmnet with reason: new install on buster
[22:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:29] <icinga-wm>	 PROBLEM - PHP opcache health on mw2331 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:58:41] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2333 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[22:58:43] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 522 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[22:59:17] <icinga-wm>	 PROBLEM - PHP opcache health on mw2333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:59:58] <mutante>	 sorry, trying to minimize the noise but they still happen sometimes
[23:00:20] <mutante>	 every once in a while there is a race condition and the downtime is not set
[23:00:29] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:25] <mutante>	 !log mw2331, mw2333 - scap pull
[23:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:26] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2355.codfw.wmnet'] `  an...
[23:03:34] <legoktm>	 !log updated docker-registry.discovery.wmnet/wikimedia-buster image
[23:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:49] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[23:04:21] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2333 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[23:04:23] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[23:05:27] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:39] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2335 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[23:05:57] <icinga-wm>	 PROBLEM - Check systemd state on registry1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:01] <icinga-wm>	 PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:11] <mutante>	 that's the build-homepage.service currently being worked on by lego
[23:06:19] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2361.codfw.wmnet'] `  an...
[23:06:56] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2355.codfw.wmnet
[23:06:58] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2359.codfw.wmnet'] `  an...
[23:06:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:05] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2359.codfw.wmnet
[23:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:15] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2361.codfw.wmnet
[23:07:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:20] <legoktm>	 it looks like the script works independently, but when all 4 servers run the systemd job at the same time, the registry times out and the job fails
[23:07:37] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2357.codfw.wmnet'] `  an...
[23:08:51] <mutante>	 legoktm: ah, you can randomize the minute like this:  $minute = Integer(seeded_rand(60, $title))
[23:08:52] <wikibugs>	 SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) https://docker-registry.wikimedia.org/ ta-da  Tested by: *  `docker pull docker-registry.discovery.wmnet/wikimedia-buster:latest` on a...
[23:09:03] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2351 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[23:09:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (Jclark-ctr)
[23:10:02] <legoktm>	 mutante: ahh, thanks. Is $title just a fixed string?
[23:10:03] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:10:05] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2357.codfw.wmnet
[23:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (Jclark-ctr) @Cmjohnson. host are racked and cabled netbox is updated  host , port aqs1010 7 aqs1011 23 aqs1012 36 aqs1013 31 aqs1014 6 aqs1015 14
[23:12:20] <mutante>	 legoktm: yea, a string. it's because my example is from inside a defined type which then uses systemd::timer::job so $title is re-using the title of the defined type
[23:12:29] <legoktm>	 ack
[23:12:30] <mutante>	 you can just use a random word too
[23:13:09] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Randomize timing of build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696)
[23:13:53] <wikibugs>	 (CR) Dzahn: [C: +1] "that's how I do it for planet feed updates as well" [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:14:05] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:14:07] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27554/console" [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:15:00] <mutante>	 or could use $::fqdn as seed i guess 
[23:15:52] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27555/console" [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:16:49] <legoktm>	 mutante: PCC https://puppet-compiler.wmflabs.org/compiler1002/27555/ has them all moving the minute to 20...is that just a limitation of the compiler? or is that not right?
[23:17:40] <mutante>	 legoktm: hmm.. I wonder what happens if you use $::fqdn as part of the seed and compile it again
[23:17:49] <mutante>	 not entirely sure 
[23:19:37] <wikibugs>	 (PS2) Legoktm: docker_registry_ha: Randomize timing of build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696)
[23:19:49] * legoktm tries
[23:19:54] <mutante>	 I did not have the exact same case, i had just one host per DC that is active but 10 different jobs
[23:20:05] <mutante>	 which all get to use different seeds
[23:20:47] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27556/console" [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:21:07] <legoktm>	 ok, they're all different now
[23:21:17] <legoktm>	 not perfectly spread out but pretty good enough
[23:21:20] <mutante>	 looks like it works
[23:21:24] <legoktm>	 thank you :)
[23:21:24] <mutante>	 yep
[23:21:27] <mutante>	 yw!
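Editor's note on the exchange above: a minimal Python sketch of why a fixed seed put every host on the same minute, and why adding the hostname to the seed spread them out. This is an illustration only, not Puppet's `seeded_rand` implementation, so it will not reproduce the minute values PCC showed. The function name `seeded_minute`, the seed string `build-homepage`, and the use of short hostnames in place of `$::fqdn` are assumptions made for the example.

```python
import hashlib


def seeded_minute(seed: str, upper: int = 60) -> int:
    """Map a seed string to a stable integer in [0, upper).

    Hypothetical stand-in for a seeded random function: the same seed
    always yields the same value, on every host and on every run.
    """
    digest = hashlib.md5(seed.encode("utf-8")).hexdigest()
    return int(digest, 16) % upper


# Hostnames taken from the alerts in this log; the real FQDNs are not shown here.
hosts = ["registry1001", "registry1002", "registry2001", "registry2002"]

# Fixed seed: every host computes the same minute, so the jobs still collide.
fixed = {host: seeded_minute("build-homepage") for host in hosts}

# Seed that includes the host name: each host gets its own stable minute.
per_host = {host: seeded_minute(f"build-homepage-{host}") for host in hosts}

if __name__ == "__main__":
    print("fixed seed:   ", fixed)
    print("per-host seed:", per_host)
```

With only 60 possible minutes, two hosts can still land on the same value or on adjacent ones, which fits the "not perfectly spread out" remark.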
[23:21:42] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] docker_registry_ha: Randomize timing of build-homepage job [puppet] - https://gerrit.wikimedia.org/r/657446 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:22:42] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on releases2002.codfw.wmnet with reason: rebooting to add a disk
[23:22:42] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases2002.codfw.wmnet with reason: rebooting to add a disk
[23:22:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:51] <legoktm>	 now hopefully within the next hour they'll all recover
[23:25:00] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2357.codfw.wmnet
[23:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2355.codfw.wmnet
[23:25:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2359.codfw.wmnet
[23:25:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:23] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2361.codfw.wmnet
[23:25:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:51] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:27:27] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:28:05] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:28:35] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:30:24] <mutante>	 !log releases2002 - rebooting VM 
[23:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:12] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:35:36] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:41:38] <icinga-wm>	 RECOVERY - Check systemd state on registry1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:53] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2363.codfw.wmnet with reason: REIMAGE
[23:45:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2365.codfw.wmnet with reason: REIMAGE
[23:46:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:07] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2367.codfw.wmnet with reason: REIMAGE
[23:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2369.codfw.wmnet with reason: REIMAGE
[23:47:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2363.codfw.wmnet with reason: REIMAGE
[23:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:02] <icinga-wm>	 PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:49:06] <icinga-wm>	 RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:56] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2367.codfw.wmnet with reason: REIMAGE
[23:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:18] <wikibugs>	 SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm) Skimming the puppet role, there's:  `     # this could be removed when buster or next debian includes a 2.7+ version     apt::pin { 'strech_wikimedia_docker_registry_27':         packag...
[23:51:32] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2365.codfw.wmnet with reason: REIMAGE
[23:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2369.codfw.wmnet with reason: REIMAGE
[23:51:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:56:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:58:56] <icinga-wm>	 RECOVERY - Check systemd state on registry2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state