[00:23:47] * Krinkle staging on mwdebug1002
[00:32:44] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.25/includes/: Idc19cc29764a / T220854 - hot fix (duration: 05m 37s)
[00:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:48] T220854: "Preview mode" parser output gets re-used for page views (affects addWarning and REVISIONID processing) - https://phabricator.wikimedia.org/T220854
[00:40:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[00:45:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[00:47:19] Operations, SRE-Access-Requests: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (bd808) > somehow make it possible to use changePassword.php from mwmaint hosts? This might be the most straightforward thing...
[01:19:10] (PS1) Alex Monk: deployment-prep: Update with live files [software/swift-ring] - https://gerrit.wikimedia.org/r/503544
[01:46:53] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Evan Prodromou - https://phabricator.wikimedia.org/T220226 (Krenair) run the `id` command on any *.wmflabs host.
alternatively `ldapsearch -LLLx member=uid=evanp,ou=people,dc=wikimedia,dc=org dn`
[01:56:45] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (Papaul) @Dzahn thanks
[02:07:18] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-public1-b-codfw] - member ge-...
[02:07:21] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (Papaul)
[02:13:57] (PS1) MaxSem: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780)
[02:14:34] (CR) MaxSem: [C: -2] "Waiting for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/LoginNotify/+/503465/ to be live everywhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: MaxSem)
[02:22:33] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:42:53] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 28073984 and 0 seconds
[02:49:03] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:50:51] PROBLEM - puppet last run on centrallog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:58:37] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[03:59:03] PROBLEM - puppet last run on cp1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:11:55] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:17:17] RECOVERY - puppet last run on centrallog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:25:03] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:25:31] RECOVERY - puppet last run on cp1085 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:38:21] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:29:07] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:30:09] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:56:37] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:00:45] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:42:27] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[08:05:27] !log repooling wdqs1008 - data transfer completed - T220830
[08:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:32] T220830: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830
[08:08:53] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:13:44] (PS1) Framawiki: Add www4.bibl.ulaval.ca to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/503561 (https://phabricator.wikimedia.org/T220704)
[10:13:43] (PS1) Odder: Upload a new Apple Touch icon for German Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/503638 (https://phabricator.wikimedia.org/T202902)
[10:18:49] Operations, Office-IT, Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (Elitre)
[10:25:36] (PS1) Odder: Add a new Apple Touch icon to configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/503639 (https://phabricator.wikimedia.org/T202902)
[10:31:17] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 27 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[10:46:49] !log depooling maps2001 for postgres init
[10:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:36] Operations, ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (ops-monitoring-bot)
[11:25:52] (PS1) Hoo man: Revert "WikibaseClient: Conditionally enable mapframe support" [mediawiki-config] - https://gerrit.wikimedia.org/r/503643 (https://phabricator.wikimedia.org/T218051)
[12:23:16] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28109 MB (5% inode=99%)
[12:24:34] RECOVERY - Disk space on
elastic1025 is OK: DISK OK
[12:38:02] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:58] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:35:30] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:37:00] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:38:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:39:30] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200)
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:39:38] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:40:00] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:40:44] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:40:50] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:50:43] Operations, fundraising-tech-ops, procurement: SSL renewal: frdata.wm.o expires 19/05/13 - https://phabricator.wikimedia.org/T220882 (cwdent)
[14:05:14] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[14:10:02] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:11:21] uhhh
[14:11:24] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:11:24] An error has occurred while searching: Search is currently too busy. Please try again later.
[14:11:40] where should i report this?
[14:11:41] https://usercontent.irccloud-cdn.com/file/bSR8186g/image.png
[14:19:12] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:20:28] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:26] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:48] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:21:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:24:04] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:25:24] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed}
(article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:26:30] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:26:42] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:29:06] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:29:40] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:30:58] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:34:47] (PS5) Revi: Add enwiki to azwiki import source [mediawiki-config] - https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104)
[14:38:48] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source
and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:40:40] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:41:26] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:41:58] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:43:46] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:44:56] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:45:06] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:45:40] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:46:06] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:46:18] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:47:34] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:48:44] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:48:54] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:53:12] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200)
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:55:36] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:56:36] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:56:54] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:57:40] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:57:56] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:58:28] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[14:59:09] Working for me again now
[14:59:42] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL:
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:00:38] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:01:00] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:01:34] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:01:58] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:02:52] Praxidicae maybe file a task and tag it #operations.
[15:04:42] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:07:02] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:07:20] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:08:52] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:09:32] PROBLEM - Check systemd state on cloudvirt1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:09:38] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:10:10] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:10:54] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:11:06] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:12:12] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:13:33] paladox, good idea. Praxidicae, do you want to file?
[15:14:48] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:15:00] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:15:22] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:16:18] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:17:24] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:17:58] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:20:00] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:22:44] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL:
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:22:48] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:23:18] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 170.5 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[15:24:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:25:06] RECOVERY - Check systemd state on cloudvirt1015 is OK: OK - running: The system is fully operational
[15:27:30] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[15:27:52] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:30:18] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:30:18] PROBLEM - Check systemd state on cloudvirt1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:30:32] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:32:56] RECOVERY - Check systemd state on cloudvirt1015 is OK: OK - running: The system is fully operational
[15:33:24] !log restart recommendation_api on scb2001
[15:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:47] !log restart recommendation_api on scb1001
[15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:00] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:41:00] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:41:14] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:42:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:42:30] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:43:50] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:45:18] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:46:28] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:46:50] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:47:32] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[15:47:46] RECOVERY - recommendation_api endpoints health
on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:48:50] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:49:28] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:51:52] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:57:55] RhinosF1, paladox, Praxidicae: we have some people looking into the search and recommendation_api issues now. Thanks for the pings that got us to start looking. [15:58:09] Thanks bd808 [15:58:58] !log restart elasticsearch on elastic1027 [15:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:18] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:01:52] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:02:17] <_joe_> uhm [16:02:21] <_joe_> looks like it failed? [16:02:43] <_joe_> ● elasticsearch.service loaded failed failed Elasticsearch [16:03:45] <_joe_> ebernhardson: I think you restarted the wrong service [16:05:31] probably the elasticsearch_6@ service is the right one? [16:06:24] sigh, yes i restarted the wrong one ... but then the question is why did it shutdown anyways? [16:06:56] or at least, the graphs to compare nodes showed an immediate change ... 
[16:07:07] <_joe_> indeed [16:07:10] i see the two expected elasticsearch instances running on the server at least [16:07:10] <_joe_> lemme see something [16:08:41] <_joe_> you apparently restarted the instances as well [16:10:17] <_joe_> frankly it seems the whole cluster is suffering, but that node was in particularly bad shape [16:10:33] hmm, do we need to force delete whatever `sudo service elasticsearch` does? [16:10:45] <_joe_> yes, I'm going to fix that [16:10:49] things have picked back up, the pool counter (limits concurrency) has completely stopped rejecting things [16:11:15] <_joe_> yes [16:11:25] <_joe_> we're also not seeing any user-visible error [16:11:25] search threadpool isn't full anywhere, so elastic shouldn't be rejecting queries either [16:11:32] there is additional load for some reason though [16:11:47] <_joe_> but I guess we should look at what's going on before it gets uglier [16:11:53] <_joe_> but it's not an emergency anymore [16:13:30] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [16:17:04] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:20:47] Thanks bd08 [16:28:17] (PS1) Krinkle: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T220623) [16:29:02] (CR) jerkins-bot: [V: -1] profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T220623) (owner: Krinkle) [16:29:22] (PS1) EBernhardson: Switch more_like and regex elsaticsearch queries from eqiad to codw [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 [16:29:32] (PS2) Krinkle: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) [16:30:19] (CR) jerkins-bot: [V: -1] profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle) [16:30:34] (CR) Giuseppe Lavagetto: [C: +1] "Seems sensible given the current situation." [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 (owner: EBernhardson) [16:30:42] (CR) EBernhardson: [C: +2] Switch more_like and regex elsaticsearch queries from eqiad to codw [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 (owner: EBernhardson) [16:30:55] (CR) Gehel: [C: +1] "LGTM in principle, I don't know the actual config format."
[mediawiki-config] - https://gerrit.wikimedia.org/r/503653 (owner: EBernhardson) [16:31:21] (PS2) EBernhardson: Switch more_like and regex elsaticsearch queries from eqiad to codw [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 [16:31:31] (CR) EBernhardson: [C: +2] Switch more_like and regex elsaticsearch queries from eqiad to codw [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 (owner: EBernhardson) [16:32:02] (PS3) Krinkle: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) [16:32:41] (Merged) jenkins-bot: Switch more_like and regex elsaticsearch queries from eqiad to codw [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 (owner: EBernhardson) [16:33:03] (CR) jerkins-bot: [V: -1] profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle) [16:33:14] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:39:01] (PS1) EBernhardson: Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw" [mediawiki-config] - https://gerrit.wikimedia.org/r/503655 [16:39:03] (CR) EBernhardson: [C: +2] Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw" [mediawiki-config] - https://gerrit.wikimedia.org/r/503655 (owner: EBernhardson) [16:40:09] (Merged) jenkins-bot: Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw" [mediawiki-config] - https://gerrit.wikimedia.org/r/503655 (owner: EBernhardson) [16:42:34] (CR) jenkins-bot: Switch more_like and regex elsaticsearch queries from eqiad to
codw [mediawiki-config] - https://gerrit.wikimedia.org/r/503653 (owner: EBernhardson) [16:42:36] (CR) jenkins-bot: Revert "Switch more_like and regex elsaticsearch queries from eqiad to codw" [mediawiki-config] - https://gerrit.wikimedia.org/r/503655 (owner: EBernhardson) [16:43:26] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:10:08] PROBLEM - Check systemd state on cloudvirt1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:23:34] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 85.86 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [17:24:20] RECOVERY - Check systemd state on cloudvirt1015 is OK: OK - running: The system is fully operational [17:29:34] PROBLEM - Check systemd state on cloudvirt1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:34:48] RECOVERY - Check systemd state on cloudvirt1015 is OK: OK - running: The system is fully operational [18:29:00] PROBLEM - nova-compute proc minimum on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:31:18] PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:31:38] PROBLEM - dhclient process on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:31:38] PROBLEM - kvm ssl cert on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:31:58] PROBLEM - Check systemd state on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:32:08] PROBLEM - SSH on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:32:10] PROBLEM - Disk space on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:32:18] PROBLEM - DPKG on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:32:42] PROBLEM - configured eth on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:34:22] PROBLEM - puppet last run on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:34:55] taking a look [18:35:08] PROBLEM - nova-compute proc maximum on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:35:17] Its due to a reboot see -cloud godog [18:35:38] oh, thanks Zppix ! 
[18:36:16] Np [18:36:39] oops. sorry godog. I've been !log-ging in -cloud as I poke at that one [18:36:40] planned reboot? [18:36:50] Not exactly [18:36:57] you know those page, right? [18:37:04] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1015 is CRITICAL: connect to address 10.64.20.31 port 5666: Connection refused [18:37:04] yeah planned, following what appears to be hardware fault [18:37:11] T220853 [18:37:38] np bd808 [18:38:11] can folks downtime next time? just save a bunch of people checking on on their sat evening (or day I guess for your tz) [18:38:44] apergos: I don't have downtime super powers in icinga :/ I should find out how to change that [18:39:01] ah yeah, for sure :) [18:39:07] I would not have guessed that [18:39:39] anyone that's gotta to do reboots should have that [18:39:43] * bd808 is a mysterious 0.78% root [18:40:21] well hope they hw issue gets worked out and the rest of your day is quiet [18:40:24] eh [18:45:04] Operations, SRE-Access-Requests, monitoring: Allow WMCS to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (ayounsi) [18:45:06] apergos, bd808: https://phabricator.wikimedia.org/T220887 [18:45:33] XioNoX: I think it's just me (only non-root) [18:46:19] !log 3h downtime for cloudvirt1015 [18:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:24] I guess that can go into access requests if it's only you [18:46:29] and name you in the title [18:46:48] I'm not sure I'd describe that as planned maintenance [18:48:03] I opened the task for discussion, you're welcome to edit it :) [18:48:24] sec [18:48:37] cool [18:49:34] Operations, SRE-Access-Requests, monitoring: Allow WMCS to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (Krenair) [18:50:06] Operations, SRE-Access-Requests, monitoring: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (ArielGlenn) [18:50:07]
edited [18:50:47] thx :) closing IRC [18:51:24] Operations, SRE-Access-Requests, monitoring: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (Zppix) Shouldnt really require discussion on if he should be granted the rights imho [18:51:26] I was only here for the page (10pm on Sat) so gone again [19:01:22] RECOVERY - configured eth on cloudvirt1015 is OK: OK - interfaces up [19:01:30] RECOVERY - dhclient process on cloudvirt1015 is OK: PROCS OK: 0 processes with command name dhclient [19:01:30] RECOVERY - nova-compute proc minimum on cloudvirt1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:32] RECOVERY - kvm ssl cert on cloudvirt1015 is OK: Certificate will not expire https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:00] RECOVERY - SSH on cloudvirt1015 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:02:02] RECOVERY - Disk space on cloudvirt1015 is OK: DISK OK [19:02:12] RECOVERY - DPKG on cloudvirt1015 is OK: All packages OK [19:02:26] RECOVERY - nova-compute proc maximum on cloudvirt1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:32] RECOVERY - ensure kvm processes are running on cloudvirt1015 is OK: PROCS OK: 4 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:06:06] RECOVERY - puppet last run on cloudvirt1015 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [19:07:22] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt1015 is OK: OK: synced at Sat 2019-04-13 19:07:20 UTC.
[19:08:22] RECOVERY - Check systemd state on cloudvirt1015 is OK: OK - running: The system is fully operational [19:53:16] (CR) Jforrester: "Is this change needed for anything in particular, or is just cleanup?" [mediawiki-config] - https://gerrit.wikimedia.org/r/499768 (owner: Addshore) [20:04:22] Operations, SRE-Access-Requests: Membership in "analytics" group for Bryan Davis - https://phabricator.wikimedia.org/T220892 (bd808) [20:05:52] bd808: You can also disable notifications via puppet, on hiera adding: 'profile::base::notifications: disabled ' to the yaml, see for instance db1112.yaml [20:06:21] bd808: You'd of course need to wait for puppet to run on the host and on icinga...but it is an alternative to avoid paging if you are going to do maintenance [20:06:30] But yeah, I think you should have powers to downtime hosts! [20:07:27] marostegui: I'd also need a root to merge the puppet patch ;) [20:39:35] Operations, SRE-Access-Requests: Membership in "analytics" group for Bryan Davis - https://phabricator.wikimedia.org/T220892 (Krenair) I think you mean the `researchers` group? [20:48:11] Operations, SRE-Access-Requests: Membership in "researchers" group for Bryan Davis - https://phabricator.wikimedia.org/T220892 (bd808) [20:48:29] Operations, SRE-Access-Requests: Membership in "researchers" group for Bryan Davis - https://phabricator.wikimedia.org/T220892 (bd808) >>! In T220892#5109144, @Krenair wrote: > I think you mean the `researchers` group? I did indeed. Thanks for pointing that out. [20:53:39] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (JoKalliauer) [21:53:20] Operations, ops-eqiad, DC-Ops, Patch-For-Review, and 2 others: VMs on cloudvirt1015 crashing - https://phabricator.wikimedia.org/T220853 (Andrew) @RobH and/or @Cmjohnson, I'm hoping the above is enough to pass on to Dell for a replacement part.
Let us know if you need more details. [21:58:20] PROBLEM - Long running screen/tmux on notebook1003 is CRITICAL: CRIT: Long running SCREEN process. (user: fsalutari PID: 15618, 1739243s 1728000s). [21:59:24] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: elastic class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:02:43] (PS1) QChris: Add .gitreview [debs/elastalert] - https://gerrit.wikimedia.org/r/503674 [22:02:45] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/elastalert] - https://gerrit.wikimedia.org/r/503674 (owner: QChris) [22:04:55] (PS1) Krinkle: webperf: Remove arclamp subscriber from mwlog servers [puppet] - https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) [22:05:13] (CR) Krinkle: "TODO: Manually delete/stop all the indirectly ensured resources, or puppetise it somehow?" [puppet] - https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [22:19:26] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:51:10] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:58:43] (PS1) DannyS712: en.wikiversity VisualEditor Changing Active Namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) [23:04:29] (PS2) DannyS712: en.wikiversity: configure VisualEditor active namespaces Add Draft, Help, Portal, School, and Wikiversity namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) [23:04:54] (PS3) DannyS712: en.wikiversity: configure VisualEditor active namespaces Add Draft, Help, Portal, School, and Wikiversity namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) [23:10:19] (CR) Zoranzoki21: [C: +1] "This patch itself looks good, but I have some minor suggestions for improvement." (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: DannyS712) [23:13:34] (PS4) DannyS712: Add 5 active namespaces for VisualEditor on en.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) [23:16:59] (CR) Zoranzoki21: [C: +1] "Yes, it is. Now looks 100% good."
[mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: DannyS712) [23:45:15] Operations: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair) [23:48:28] Operations: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair) [23:50:47] Operations: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair) [23:51:16] Operations: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (Krenair) Help from anyone with answering the questions around monitoring_hosts and deployment_hosts would be appreciated.