[00:25:47] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[00:43:25] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:59:05] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: mediawikiwiki. [Query snipped]
[01:13:17] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1020.47 seconds
[01:15:13] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:43:50] Is T214963 bad enough to emergency-deploy the fix in VE?
[03:43:51] T214963: Database contention due to concurrent requests to ApiOptions - https://phabricator.wikimedia.org/T214963
[04:15:45] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[04:29:31] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:35:49] PROBLEM - Freshness of OCSP Stapling files on cp5008 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[04:46:45] PROBLEM - Freshness of OCSP Stapling files on cp1084 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[04:56:31] PROBLEM - Freshness of OCSP Stapling files on cp5012 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[04:56:49] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[04:58:01] PROBLEM - Freshness of OCSP Stapling files on cp5002 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[04:59:21] PROBLEM - Freshness of OCSP Stapling files on cp5007 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[05:16:19] PROBLEM - Freshness of OCSP Stapling files on cp1090 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[05:32:19] PROBLEM - Freshness of OCSP Stapling files on cp2001 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[05:35:05] PROBLEM - Freshness of OCSP Stapling files on cp3037 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[05:43:19] PROBLEM - Freshness of OCSP Stapling files on cp4029 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[05:44:20] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/VisualEditor/lib/ve/src/ui/dialogs/ve.ui.FindAndReplaceDialog.js: b/src/ui/dialogs/ve.ui.FindAndReplaceDialog.js T214963 Hot-deploy VE fix to stop hitting user pref writes without debounce (duration: 01m 02s)
[05:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:24] T214963: Database contention due to concurrent requests to ApiOptions - https://phabricator.wikimedia.org/T214963
[05:55:59] PROBLEM - Freshness of OCSP Stapling files on cp1089 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
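The dbstore1002 alerts at 00:59 and 01:13 above show the s3 SQL replication thread stopping on a TokuDB "Got error 22" and the lag accumulating afterwards. A minimal sketch of how one might inspect that replication connection by hand, assuming MariaDB multi-source replication with an 's3' connection name and local socket access (both assumptions for illustration; this is not the actual WMF tooling):

```bash
# Show only the interesting fields of the 's3' replication connection
# (the connection name and local access method are assumptions).
sudo mysql -e "SHOW SLAVE 's3' STATUS\G" \
  | grep -E 'Slave_(IO|SQL)_Running|Last_SQL_Errno|Last_SQL_Error|Seconds_Behind_Master'

# Once the underlying storage-engine error has been dealt with,
# restart just that connection:
sudo mysql -e "START SLAVE 's3'"
```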
[05:59:29] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:01:11] PROBLEM - Freshness of OCSP Stapling files on cp3038 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[06:18:20] PROBLEM - Freshness of OCSP Stapling files on cp3030 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[06:18:27] PROBLEM - Freshness of OCSP Stapling files on cp3035 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[06:23:09] PROBLEM - Freshness of OCSP Stapling files on cp2005 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[06:23:29] PROBLEM - Freshness of OCSP Stapling files on cp5009 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[06:26:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[06:26:59] PROBLEM - Freshness of OCSP Stapling files on cp4023 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[06:28:30] (CR) Gehel: "Looks reasonable, will need a bit more in depth review before merging." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) (owner: Mathew.onipe)
[06:31:21] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:31:33] PROBLEM - puppet last run on ores1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:57:48] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:59] RECOVERY - puppet last run on ores1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:13] PROBLEM - Freshness of OCSP Stapling files on cp5005 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[07:03:27] PROBLEM - Freshness of OCSP Stapling files on cp3040 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[07:08:09] PROBLEM - Freshness of OCSP Stapling files on cp2024 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[07:29:03] PROBLEM - Freshness of OCSP Stapling files on cp1077 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[07:29:37] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:46:09] PROBLEM - Freshness of OCSP Stapling files on cp3044 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[07:54:21] PROBLEM - Freshness of OCSP Stapling files on cp1088 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
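The puppet failures on mw1300 and ores1007 above each reported a single failed File resource and recovered on the next run. When that happens, a quick way to confirm whether the failure was transient is to run the agent once in the foreground and watch its output; a minimal sketch using the plain upstream invocation (not any site-specific wrapper):

```bash
# One foreground agent run; --test implies --detailed-exitcodes, so an
# exit code of 4 or 6 means at least one resource failed, while 0 or 2
# means the run was clean.
sudo puppet agent --test
echo "exit code: $?"
```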
[07:56:49] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[07:58:04] FYI those OCSP alerts are known and harmless, sorry for the spam
[07:59:31] PROBLEM - Freshness of OCSP Stapling files on cp2004 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[08:08:13] PROBLEM - Freshness of OCSP Stapling files on cp2022 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[08:13:41] PROBLEM - Freshness of OCSP Stapling files on cp3033 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[08:17:23] PROBLEM - Freshness of OCSP Stapling files on cp2019 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[08:29:13] PROBLEM - Freshness of OCSP Stapling files on cp3045 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[08:59:15] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:00:33] PROBLEM - Freshness of OCSP Stapling files on cp3042 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[09:03:39] PROBLEM - Freshness of OCSP Stapling files on cp2006 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[09:10:53] PROBLEM - Freshness of OCSP Stapling files on cp2020 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[09:26:37] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[09:43:09] PROBLEM - Freshness of OCSP Stapling files on cp1080 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[10:19:09] PROBLEM - Freshness of OCSP Stapling files on cp2017 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[10:21:41] PROBLEM - Freshness of OCSP Stapling files on cp1079 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[10:21:59] PROBLEM - Freshness of OCSP Stapling files on cp3034 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[10:22:59] volans: should we ack them?
[10:25:59] emzl|off: probably, but I think it can wait until tomorrow
[10:26:35] PROBLEM - Freshness of OCSP Stapling files on cp3049 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[10:28:55] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:56:09] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[11:01:03] PROBLEM - Freshness of OCSP Stapling files on cp1087 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:04:15] PROBLEM - Freshness of OCSP Stapling files on cp2010 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:05:47] PROBLEM - Freshness of OCSP Stapling files on cp2013 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:08:15] PROBLEM - Freshness of OCSP Stapling files on cp2016 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:11:03] PROBLEM - Freshness of OCSP Stapling files on cp3047 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:15:43] PROBLEM - Freshness of OCSP Stapling files on cp4032 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:20:47] PROBLEM - Freshness of OCSP Stapling files on cp3046 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:23:29] PROBLEM - Freshness of OCSP Stapling files on cp1076 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:29:19] PROBLEM - Freshness of OCSP Stapling files on cp1082 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:36:21] PROBLEM - Freshness of OCSP Stapling files on cp2026 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:39:53] PROBLEM - Freshness of OCSP Stapling files on cp1086 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:41:59] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:05] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time
[11:45:29] PROBLEM - Freshness of OCSP Stapling files on cp1083 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[11:59:51] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:25:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[12:30:03] PROBLEM - Freshness of OCSP Stapling files on cp2023 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[12:54:03] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1024.43 seconds
[12:54:31] PROBLEM - Freshness of OCSP Stapling files on cp4031 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[12:56:09] PROBLEM - Freshness of OCSP Stapling files on cp4030 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[13:09:11] PROBLEM - Freshness of OCSP Stapling files on cp3043 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[13:13:27] PROBLEM - Freshness of OCSP Stapling files on cp1078 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[13:17:11] PROBLEM - Freshness of OCSP Stapling files on cp5011 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[13:19:11] PROBLEM - Freshness of OCSP Stapling files on cp3036 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[13:21:05] PROBLEM - Freshness of OCSP Stapling files on cp4022 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[13:23:31] PROBLEM - Freshness of OCSP Stapling files on cp1085 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[13:27:33] PROBLEM - Freshness of OCSP Stapling files on cp2012 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[13:29:41] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:36:09] PROBLEM - Freshness of OCSP Stapling files on cp3041 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[13:39:53] PROBLEM - Freshness of OCSP Stapling files on cp4025 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[13:44:03] PROBLEM - Freshness of OCSP Stapling files on cp5006 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-rsa-unified.ocsp is more than 259500 secs old!
[13:56:53] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[13:58:21] PROBLEM - Freshness of OCSP Stapling files on cp3032 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[14:00:11] PROBLEM - Freshness of OCSP Stapling files on cp4021 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[14:13:21] PROBLEM - Freshness of OCSP Stapling files on cp5003 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[14:19:05] PROBLEM - Freshness of OCSP Stapling files on cp1081 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp is more than 259500 secs old!
[15:44:18] Operations, Traffic: cp nodes still try to OCSP staple the already expired digicert-2017 certificate - https://phabricator.wikimedia.org/T215103 (Vgutierrez)
[15:44:54] Operations, Traffic: cp nodes still try to OCSP staple the already expired digicert-2017 certificate - https://phabricator.wikimedia.org/T215103 (Vgutierrez) p:Triage→Normal
[15:48:01] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:49:04] (PS1) Vgutierrez: ssl: get rid of the expired digicert-2017 certificate [puppet] - https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103)
[15:54:14] (CR) Vgutierrez: "After merging this I suspect we should manually clean the ocsp.d conf files to avoid more stapling attempts" [puppet] - https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103) (owner: Vgutierrez)
[15:59:25] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
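The OCSP freshness alerts above fire when a stapling file under /var/cache/ocsp has not been refreshed within the 259500-second threshold quoted in the alert text, which is what happens once the digicert-2017 certificate has expired and its staples can no longer be renewed (T215103). A minimal sketch reproducing the age check by hand on a cp host; the loop is illustrative, not the actual Icinga plugin:

```bash
#!/bin/bash
# Flag any staple file older than the threshold quoted in the alerts.
threshold=259500
now=$(date +%s)
for f in /var/cache/ocsp/*.ocsp; do
    age=$(( now - $(stat -c %Y "$f") ))
    if (( age > threshold )); then
        echo "CRITICAL: $f is ${age} secs old"
    fi
done
```

As the review comment on https://gerrit.wikimedia.org/r/487584 notes, removing the expired certificate from puppet also means cleaning up the corresponding ocsp.d conf entries (and presumably the stale .ocsp files) by hand so that no further stapling attempts are made.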
[16:19:47] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:26:55] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[17:25:31] PROBLEM - ElasticSearch shard size check on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1538158521(68gb), commonswiki_file_1538134444(52gb)
[17:27:27] Oops...that's much
[17:27:42] I suspect segment merges going on
[18:24:15] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:28:07] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.706 second response time
[18:29:21] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:32:11] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:32:52] what is
[18:34:09] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.16 seconds
[18:40:07] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.468 second response time
[18:48:01] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:56:43] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[19:02:15] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.370 second response time
[19:02:45] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:06:15] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:11:38] Operations, Phabricator, RelEng-Archive-FY201718-Q1: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129 (Alroilim)
[19:14:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.544 second response time
[19:18:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:27:53] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (Paladox)
[19:27:56] Operations, Phabricator, RelEng-Archive-FY201718-Q1: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129 (Paladox)
[19:28:11] Operations, Phabricator, RelEng-Archive-FY201718-Q1: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129 (Paladox)
[19:32:00] (PS1) CRusnov: Properly detect connected ports. [software/netbox-reports] - https://gerrit.wikimedia.org/r/487599
[19:37:46] Operations, Analytics, Product-Analytics, User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (elukey) After a chat with Chase, it seems that @aborrero has already done a similar thing for the tool-forge hosts. Anything that we can share with pup...
[19:42:06] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Next step: * ssh to mc1022 and execute `...
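For the shard size alert at 17:25 above, the _cat/shards API is a quick way to see which commonswiki shards have grown and whether ongoing segment merges are a plausible cause. A minimal sketch, assuming direct HTTP access to the cluster behind search.svc.codfw.wmnet on the default port 9200 (both assumptions; the production service may be fronted differently):

```bash
# List the commonswiki shards sorted by on-disk size, largest first.
curl -s 'http://search.svc.codfw.wmnet:9200/_cat/shards/commonswiki_*?v&bytes=gb&s=store:desc' | head -n 20
```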
[19:43:30] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (Gopavasanth)
[19:59:11] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:05:43] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[20:06:59] !log parsoid was failed on scandium and alerting, the service parsoid-vd was restarted and appears to have come back
[20:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:03] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:23:49] well that's slightly more of a problem i think? puppet-master is red
[20:28:05] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.418 second response time
[20:30:07] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.534 second response time
[20:34:09] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:36:29] to be clear puppet-master is still running, but it entered failed state when it didn't go away during a restart or something and so there's a disowned puppet-master working away while systemd thinks it's failed.
[20:38:39] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:39:03] chaomodus: if you're already with a laptop and connected
[20:39:41] well i guess the question is what to do about this
[20:39:45] kill and restart?
[20:39:46] could you restart pdfrender on the scb hosts where it's failing?
[20:39:50] okay
[20:39:53] one at a time ofc
[20:40:04] they seem to recover no?
[20:41:04] not really, better to restart them ;)
[20:41:18] !log restarted pdfrender on scb1004
[20:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:25] they start "flapping" after a bit in those cases
[20:41:53] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[20:42:04] regarding the puppetmaster one... do you have more details? I can login in 3 minutes if needed
[20:42:06] !log restarted pdfrender on scb1003
[20:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:27] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.008 second response time
[20:43:07] which details do you require? It appears that systemd forgot about the running one (which i've seen happen when it's trying to stop something that isn't nice about it), and so there's a working puppet-master that's not owned by systemd, so it considers the unit failed
[20:43:31] (because when it tries to start puppet-master, it exits because the port is occupied)
[20:44:07] puppet-master should not run, we run it via apache
[20:44:14] ah
[20:44:34] well the unit is failed with the message that it can't allocate the port
[20:45:39] ok, most likely someone has run a puppet master command directly
[20:45:47] ah that makes sense
[20:45:59] the obvious fix to me would be to kill the running puppet master and restart the unit
[20:46:06] I would try a systemctl reset-failed $unitname
[20:46:12] okay
[20:47:00] if that works that's ok for now
[20:47:07] we can have a look later on
[20:47:31] p.s. thanks for keeping an eye on things ;)
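For the puppetmaster1001 incident above, the unit is failed only in systemd's bookkeeping (the stray puppet-master process could not bind its port, while the real service runs via Apache), so clearing the failed state is enough once the stray process is gone. A minimal sketch; the exact unit name is an assumption for illustration:

```bash
# Inspect why the unit is marked failed, then clear the failed state.
systemctl status puppet-master.service --no-pager
sudo systemctl reset-failed puppet-master.service

# The "Check systemd state" alert recovers once nothing is left failed:
systemctl --failed --no-legend
```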
[20:48:05] welp that seemed to work, the service isn't red anymore
[20:48:14] hey no problem i have 2 hours before boarding :)
[20:48:17] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational
[20:49:17] great, thanks a lot, have a safe trip!
[20:49:46] * volans offline for a bit but with laptop, page me if needed
[20:49:47] thanks!
[21:08:17] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:15:31] mumble
[21:21:17] Operations, SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (faidon) Let's not wait for a meeting, approved!
[21:21:39] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100%
[21:21:44] Operations, SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (faidon) Stalled→Open a:mark→None
[21:26:37] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[21:53:01] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Tgr) > The m. subdomain does not appear to reflect best current practice. Going by Wikipedia's list of top websites: # Google: a big grab ba...
[22:16:06] (PS1) CRusnov: Reorganize and add tox/CI support for repository. [software/netbox-reports] - https://gerrit.wikimedia.org/r/487612
[22:20:03] (CR) GTirloni: [C: +1] wmcs: monitoring: refactor code into roles/profiles [puppet] - https://gerrit.wikimedia.org/r/487482 (owner: Arturo Borrero Gonzalez)
[23:29:22] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:39:21] Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Paladox) Thank you @Dzahn i guess we should do both?
[23:56:38] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational