[00:00:32] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) [00:00:39] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) a:03Dzahn [00:01:27] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [00:01:34] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [00:02:29] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) 05Open>03Resolved this ticket is resolved, einsteinium has been replaced by icinga1001 on stretch. the rest of the steps will be part of the de... [00:03:28] (03PS3) 10Dzahn: icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T209738) [00:03:50] (03CR) 10jerkins-bot: [V: 04-1] icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T209738) (owner: 10Dzahn) [00:04:10] (03PS4) 10Dzahn: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) [00:04:36] (03CR) 10jerkins-bot: [V: 04-1] icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:08:18] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10colewhite) I'll re-title the case and claim it to implement the metrics collection. 
[00:08:59] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Push check latency and check execution time to Prometheus - https://phabricator.wikimedia.org/T208066 (10colewhite) p:05Normal>03Low a:03colewhite [00:12:15] (03PS4) 10Dzahn: icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T202782) [00:12:17] (03PS5) 10Dzahn: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) [00:12:19] (03PS1) 10Dzahn: decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) [00:13:06] (03CR) 10jerkins-bot: [V: 04-1] icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:13:57] (03CR) 10jerkins-bot: [V: 04-1] icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:16:42] (03PS2) 10Dzahn: decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) [00:17:21] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:17:28] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10BBlack) >>! In T119366#4754978, @Bawolff wrote: > Fwiw: im of the opinion that date magic words should reduce varnish cache to at least 24 hours, maybe... 
[00:17:36] (03CR) 10jerkins-bot: [V: 04-1] decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) (owner: 10Dzahn) [00:19:27] (03PS5) 10Dzahn: icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T202782) [00:20:37] (03PS6) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [00:20:58] (03PS7) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [00:23:59] (03PS6) 10Dzahn: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) [00:28:16] (03PS6) 10Dzahn: icinga: remove einsteinium as an alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/473278 (https://phabricator.wikimedia.org/T202782) [00:36:24] (03PS1) 10Dzahn: remove icinga-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/474392 (https://phabricator.wikimedia.org/T209738) [00:37:48] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10kruusamagi) >>! In T119366#4754973, @Bawolff wrote: >>>! In T119366#4754971, @kruusamagi wrote: >> For me, it seems that the issue has grown even bigger... 
[00:42:07] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [00:52:14] (03PS11) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [00:54:44] (03PS12) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [00:57:20] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/13566/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [01:04:51] (03PS2) 10Herron: kafka_shipper: use mmrm1stspace to remove leading space in msg field [puppet] - 10https://gerrit.wikimedia.org/r/474317 (https://phabricator.wikimedia.org/T206454) [01:05:11] (03PS8) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [01:06:06] (03CR) 10Herron: [C: 032] kafka_shipper: use mmrm1stspace to remove leading space in msg field [puppet] - 10https://gerrit.wikimedia.org/r/474317 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [01:06:31] (03Abandoned) 10Dzahn: icinga: on stretch, use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:11:25] (03PS2) 10Herron: kafka_shipper: update syslog json template [puppet] - 10https://gerrit.wikimedia.org/r/474319 (https://phabricator.wikimedia.org/T206454) [01:12:28] (03CR) 10Herron: [C: 032] kafka_shipper: update syslog json template [puppet] - 10https://gerrit.wikimedia.org/r/474319 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [01:15:35] (03PS2) 10Herron: kafka_shipper: add apache2 to lookup table with kafka output [puppet] - 10https://gerrit.wikimedia.org/r/474320 (https://phabricator.wikimedia.org/T205852) [01:16:56] (03CR) 10Herron: [C: 032] kafka_shipper: add 
apache2 to lookup table with kafka output [puppet] - 10https://gerrit.wikimedia.org/r/474320 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [01:48:29] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Smalyshev) > ensuring that the data in the WDQS nodes accurately reflects the data upstre... [01:55:09] PROBLEM - puppet last run on db1107 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:55:16] (03PS1) 10Bstorm: sonofgridengine: configure grid hosts from OpenStack [puppet] - 10https://gerrit.wikimedia.org/r/474400 (https://phabricator.wikimedia.org/T200557) [01:58:52] (03CR) 10Bstorm: [C: 032] sonofgridengine: configure grid hosts from OpenStack [puppet] - 10https://gerrit.wikimedia.org/r/474400 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [02:25:53] RECOVERY - puppet last run on db1107 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:54:16] !log Deployed patches for T208112, T208109, T208110 [02:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:03] (03PS1) 10Catrope: Add default for new CN variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474406 (https://phabricator.wikimedia.org/T208112) [02:56:05] (03PS1) 10Catrope: Add and grant banner-protect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474407 (https://phabricator.wikimedia.org/T208109) [03:01:35] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[03:07:27] (03CR) 10Catrope: [C: 032] Add default for new CN variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474406 (https://phabricator.wikimedia.org/T208112) (owner: 10Catrope) [03:07:32] (03CR) 10Catrope: [C: 032] Add and grant banner-protect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474407 (https://phabricator.wikimedia.org/T208109) (owner: 10Catrope) [03:08:28] (03Merged) 10jenkins-bot: Add default for new CN variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474406 (https://phabricator.wikimedia.org/T208112) (owner: 10Catrope) [03:08:35] (03Merged) 10jenkins-bot: Add and grant banner-protect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474407 (https://phabricator.wikimedia.org/T208109) (owner: 10Catrope) [03:11:45] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [03:15:44] (03CR) 10jenkins-bot: Add default for new CN variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474406 (https://phabricator.wikimedia.org/T208112) (owner: 10Catrope) [03:15:46] (03CR) 10jenkins-bot: Add and grant banner-protect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474407 (https://phabricator.wikimedia.org/T208109) (owner: 10Catrope) [03:30:55] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[03:31:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 900.39 seconds [03:42:13] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [04:15:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 192.09 seconds [05:20:37] (03PS1) 10Jayprakash12345: Enable NewUserMessage Extension on tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474409 [05:23:14] (03PS2) 10Jayprakash12345: Enable NewUserMessage Extension on tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474409 (https://phabricator.wikimedia.org/T209432) [05:34:45] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:41:33] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [05:42:08] PROBLEM - MariaDB Slave IO: s3 on db1078 is CRITICAL: CRITICAL slave_io_state could not connect [05:50:09] PROBLEM - MariaDB Slave Lag: s3 on db1078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.23 seconds [06:17:35] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:21:39] (03PS1) 10Giuseppe Lavagetto: db-eqiad: depool db1078 from s3, it crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474417 [06:22:40] <_joe_> apergos: I'll check grafana for a minute then merge this [06:25:55] 10Operations, 10DBA: MariaDB killed by systemd with ABRT6 - https://phabricator.wikimedia.org/T209754 (10Joe) [06:26:00] that means most regular traffic will go to the master but I don't see what choice we have [06:26:23] <_joe_> no, a lot will go to the vslow and recentchanges hosts [06:26:26] <_joe_> see the weights [06:26:31] <_joe_> but yes, no alternative [06:26:42] <_joe_> and this is an ongoing outage, thanks mediawiki loadbalancer [06:27:23] <_joe_> ok, merging [06:27:35] (03CR) 10Giuseppe Lavagetto: [C: 032] db-eqiad: depool db1078 from s3, it crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474417 (owner: 10Giuseppe Lavagetto) [06:28:44] 10Operations, 10DBA: MariaDB killed by systemd with ABRT6 - https://phabricator.wikimedia.org/T209754 (10colewhite) The server was depooled: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/474417/ [06:29:16] (03Merged) 10jenkins-bot: db-eqiad: depool db1078 from s3, it crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474417 (owner: 10Giuseppe Lavagetto) [06:29:45] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:34:03] 10Operations, 10DBA: MariaDB killed by systemd with ABRT6 - https://phabricator.wikimedia.org/T209754 (10Marostegui) Thank you for letting us know Thanks also @Joe for calling me up. 
We will take it from here :-) [06:37:37] 10Operations, 10DBA: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (10Marostegui) p:05Triage>03High [06:38:28] !log oblivian@deploy1001 Synchronized wmf-config/db-eqiad.php: Depooling db1078 (duration: 00m 59s) [06:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:39] (03CR) 10jenkins-bot: db-eqiad: depool db1078 from s3, it crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474417 (owner: 10Giuseppe Lavagetto) [06:41:15] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [06:42:08] 10Operations, 10DBA: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (10Marostegui) MySQL got corrupted - this host needs to be rebuilt. [06:43:53] (03PS1) 10Marostegui: db1078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/474447 (https://phabricator.wikimedia.org/T209754) [06:44:39] (03CR) 10Marostegui: [C: 032] db1078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/474447 (https://phabricator.wikimedia.org/T209754) (owner: 10Marostegui) [06:49:00] 10Operations, 10DBA, 10Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (10Marostegui) I haven't found anything on HW logs that might indicate a HW malfunction [06:51:58] RECOVERY - MariaDB Slave IO: s3 on db1078 is OK: OK slave_io_state Slave_IO_Running: Yes [06:55:33] * apergos eyes the recovery page skeptically [06:55:39] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:56:15] RECOVERY - MariaDB Slave Lag: s3 on db1078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:56:34] <_joe_> marostegui: ^^ uh? shouldn't it break like soon? 
[06:56:58] * volans|off here [06:57:11] <_joe_> volans|off: late to the party :P [06:57:41] It was me starting replication again - just in case we really need that host before recloning (which we shouldn't) [06:58:14] that was a very fast catchup [06:58:39] It wasn't delayed too much and that host has SSDs :) [06:59:21] impressive, the power of ssds [06:59:50] and we will get no pages if it decides to crash and burn again, yes? [07:00:13] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:12] !log 'reset modified attributes' on IcingaUI for db1078 (and mgmt) and all its services [07:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:53] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [07:12:39] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Marostegui) [07:16:25] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Banyek) When I started to run puppet on the new Parsercache hosts the other day I disabled notifications in the same way, but the hosts were reporting errors. [07:17:52] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Marostegui) >>! In T209757#4755397, @Banyek wrote: > When I started to run puppet on the new Parsercache hosts the other day I disabled notifications in the... 
[07:17:56] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Volans) [07:20:55] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Banyek) There was an error with pt-heartbeat indeed, but that error was reported to IRC, which shouldn't have happened if disabling notifications worked. (Or maybe I... [07:22:37] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:22:39] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:22:41] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:22:45] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:22:49] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:23:25] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:24:09] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:25:45] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [07:25:46] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Marostegui) For the record I have downtimed db1078 (without touching notifications anymore to avoid messing with any investigation). [07:40:25] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:41:33] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [07:56:39] PROBLEM - IPMI Sensor Status on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [08:04:49] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:09:15] RECOVERY - DPKG on notebook1004 is OK: All packages OK [08:09:15] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient [08:09:17] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [08:09:21] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [08:09:27] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up [08:10:04] (should be recovered in a bit) [08:10:07] RECOVERY - Disk space on notebook1004 is OK: DISK OK [08:14:08] elukey: what was up? and do I need to check on this when I see it? [08:15:07] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:16:59] 10Operations, 10Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (10Volans) [08:17:22] this is for the random systemd unit failures on ruthenium ^^^ [08:24:56] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Hello903hello) We just need to add proper redirect rules for zh-yue.wiktionary.org to yue.wiktionary.org at the current stage, period. [08:25:55] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Sat 2018-11-17 08:25:54 UTC. 
[08:26:45] RECOVERY - IPMI Sensor Status on notebook1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [08:29:25] PROBLEM - SSH ganeti2005.mgmt on ganeti2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:41:36] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Volans) p:05Triage>03High Things that I've found so far, some may be unrelated but still need a fix anyway. === Permissions It seems that https://gerrit.wikimed... [08:41:39] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [08:56:11] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:01:43] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:01:45] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:01:45] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:01:49] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:01:53] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:02:29] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:03:19] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:11:45] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [09:28:21] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [09:29:25] RECOVERY - SSH ganeti2005.mgmt on 
ganeti2005.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0) [09:39:23] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient [09:39:25] RECOVERY - DPKG on notebook1004 is OK: All packages OK [09:39:27] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [09:39:31] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:39:35] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up [09:40:11] RECOVERY - Disk space on notebook1004 is OK: DISK OK [09:44:05] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:51:36] (03PS3) 10Alexandros Kosiaris: package_builder: Switch to class declaration syntax [puppet] - 10https://gerrit.wikimedia.org/r/473782 [09:54:20] (03PS1) 10Alexandros Kosiaris: ores::redis: Set maxmemory-policy: volatile-lur [puppet] - 10https://gerrit.wikimedia.org/r/474450 (https://phabricator.wikimedia.org/T209628) [09:58:29] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Sat 2018-11-17 09:58:27 UTC. [11:36:20] (03CR) 10Zoranzoki21: [C: 031] "Hi, patch looks good. But please fix commit message per my comment. Thanks!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (owner: 10Takidelfin) [13:01:30] (03PS1) 10Zoranzoki21: Add tboverride permission to extendedmover group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474458 (https://phabricator.wikimedia.org/T209753) [14:06:04] (03Abandoned) 10Zoranzoki21: Remove duplicates of comments about task T206935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472919 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [14:26:27] 10Operations, 10Icinga, 10monitoring: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) >>! 
In T209757#4755427, @Volans wrote: > === Init files > Those two init files seems to still have the old paths for jessie and are not compatible with stret... [14:27:41] (03PS1) 10Dzahn: fix path to puppet_hosts/services in default_icinga.sh [puppet] - 10https://gerrit.wikimedia.org/r/474463 (https://phabricator.wikimedia.org/T209757) [14:33:46] (03PS1) 10Dzahn: icinga: do not manage retention.dat in puppet [puppet] - 10https://gerrit.wikimedia.org/r/474464 (https://phabricator.wikimedia.org/T209757) [14:34:16] (03PS2) 10Dzahn: icinga: fix path to puppet_hosts/services in default_icinga.sh [puppet] - 10https://gerrit.wikimedia.org/r/474463 (https://phabricator.wikimedia.org/T209757) [14:35:17] (03PS3) 10Dzahn: icinga: fix path to puppet_hosts/services in default_icinga.sh [puppet] - 10https://gerrit.wikimedia.org/r/474463 (https://phabricator.wikimedia.org/T202782) [14:37:30] (03CR) 10Dzahn: [C: 032] "/etc/icinga/puppet_hosts.cfg: cannot open `/etc/icinga/puppet_hosts.cfg' (No such file or directory)" [puppet] - 10https://gerrit.wikimedia.org/r/474463 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:38:55] akosiaris: hi, i see a pending change on puppetmaster [14:40:02] can merge both and take a look at package_builder. checking how that compiles [14:41:38] oh.. also used in CI/labs [14:43:22] merging both, looks harmless syntax change.. ack [14:45:39] (03CR) 10Dzahn: "merged on puppetmaster. noop on boron." [puppet] - 10https://gerrit.wikimedia.org/r/473782 (owner: 10Alexandros Kosiaris) [14:49:07] 10Operations, 10Commons, 10Multimedia, 10media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (10SJu) The problem is still continuing... I propose to switch cross-wiki uploads off and forbid it until the problem is solved. 
[14:50:41] (03PS2) 10Dzahn: icinga: do not manage retention.dat in puppet [puppet] - 10https://gerrit.wikimedia.org/r/474464 (https://phabricator.wikimedia.org/T209757) [14:51:15] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) [14:51:17] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [14:51:42] (03PS3) 10Dzahn: icinga: do not manage retention.dat in puppet [puppet] - 10https://gerrit.wikimedia.org/r/474464 (https://phabricator.wikimedia.org/T209757) [14:53:24] (03PS4) 10Dzahn: icinga: do not manage retention.dat in puppet [puppet] - 10https://gerrit.wikimedia.org/r/474464 (https://phabricator.wikimedia.org/T209757) [14:54:19] (03CR) 10Dzahn: [C: 032] icinga: do not manage retention.dat in puppet [puppet] - 10https://gerrit.wikimedia.org/r/474464 (https://phabricator.wikimedia.org/T209757) (owner: 10Dzahn) [14:57:13] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) >>! In T209757#4755427, @Volans wrote: > Notice: /Stage[main]/Icinga/File[/var/lib/icinga/retention.dat]/group: group changed 'nagios' t... 
[15:05:12] (03PS1) 10Dzahn: test disabling icinga notifications on ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/474465 [15:06:09] (03PS2) 10Dzahn: test disabling icinga notifications on ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/474465 [15:06:55] (03CR) 10Dzahn: [C: 032] "T209757" [puppet] - 10https://gerrit.wikimedia.org/r/474465 (owner: 10Dzahn) [15:18:08] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) To summarize: - permissions on retention.dat: They are now: 56M -rw-r--r-- 1 nagios nagios 56M Nov 17 14:53 retention.dat and... [15:19:36] interesting. I'm getting session errors when I save, even when I do it over again [15:19:44] I wonder how my session became broken like that [15:21:52] And there's no log data from the session mismatch. I would kind of expect something going wrong with my session to trigger a logging event [15:26:52] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Marostegui) And db1078 picked it up too and has all notifications disabled. I guess this is fixed then or is there any other follow up needed? [15:27:10] wtf, The tokens on the edit page are invalid for me, but the tokens on other pages (e.g. js csrfToken are valid) [15:30:00] meh, upon further experimentation, all the tokens seem invalid [15:31:07] which would be more consistent with my session being borked [15:34:01] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) Also db1078 specifically is now fixed without further steps. Before the changes above just some services had notifications disabled in... 
[15:41:51] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [15:41:56] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) 05Open>03Resolved a:03Dzahn >>! In T209757#4755730, @Marostegui wrote: > And db1078 picked it up too and has all notifications dis... [15:45:07] (03PS1) 10Dzahn: Revert "test disabling icinga notifications on ununpentium" [puppet] - 10https://gerrit.wikimedia.org/r/474468 [15:45:48] (03CR) 10Dzahn: [C: 032] "this was just a test for T209757" [puppet] - 10https://gerrit.wikimedia.org/r/474468 (owner: 10Dzahn) [15:51:21] * bawolff gave up on trying to debug what was wrong with my session and just logged in and out again [15:51:47] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) reverted my test patch on ununpentium and all notifications are enabled again, while db1078 still has all disabled. so that worked too..... [16:28:41] 10Operations, 10Citoid, 10Patch-For-Review, 10Service-deployment-requests, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10akosiaris) This has now been deployed to the kubernetes staging cluster. ` akosiaris@deploy1001:~$ curl -d 'http://www.nytimes.com/2018/06/1... [16:34:12] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 7.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1var-server=wtp2020var-datasource=codfw%2520prometheus%252Fops [16:38:38] 10Operations, 10ops-eqiad, 10Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ayounsi) 05Open>03Resolved a:03ayounsi Correct, thanks! 
[17:04:55] (03PS1) 10Andrew Bogott: update the toolserver.org IP to point to eqiad1-r [dns] - 10https://gerrit.wikimedia.org/r/474475 (https://phabricator.wikimedia.org/T209769) [17:06:29] (03CR) 10Andrew Bogott: [C: 032] update the toolserver.org IP to point to eqiad1-r [dns] - 10https://gerrit.wikimedia.org/r/474475 (https://phabricator.wikimedia.org/T209769) (owner: 10Andrew Bogott) [17:24:20] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate stable.toolserver.org expired [17:25:26] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate stable.toolserver.org valid until 2018-12-25 21:52:38 +0000 (expires in 38 days) [17:38:10] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472917 (https://phabricator.wikimedia.org/T209250) (owner: 10Zoranzoki21) [17:41:20] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate stable.toolserver.org expired [17:46:55] (03CR) 10Urbanecm: [C: 031] Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251) (owner: 10Zoranzoki21) [17:48:04] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472918 (https://phabricator.wikimedia.org/T209252) (owner: 10Zoranzoki21) [17:48:12] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate stable.toolserver.org valid until 2018-12-25 21:52:38 +0000 (expires in 38 days) [17:48:19] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474458 (https://phabricator.wikimedia.org/T209753) (owner: 10Zoranzoki21) [17:51:34] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate stable.toolserver.org expired [18:00:01] (03PS1) 10Zoranzoki21: IS.php: Cosmetic changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474478 
[18:32:20] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate stable.toolserver.org valid until 2019-02-15 17:31:48 +0000 (expires in 89 days) [18:38:06] (03PS1) 10Andrew Bogott: Toolserver: fix ErrorDocument rule in apache config [puppet] - 10https://gerrit.wikimedia.org/r/474481 (https://phabricator.wikimedia.org/T209769) [18:39:34] (03CR) 10Andrew Bogott: [C: 032] Toolserver: fix ErrorDocument rule in apache config [puppet] - 10https://gerrit.wikimedia.org/r/474481 (https://phabricator.wikimedia.org/T209769) (owner: 10Andrew Bogott) [19:04:09] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list - https://phabricator.wikimedia.org/T209726 (10Aklapper) @Beeblebrox: Which exact list on https://lists.wikimedia.org/mailman/listinfo is this about? [19:28:42] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:38:52] 10Operations, 10Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (10ssastry) p:05Triage>03High [19:42:18] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [19:56:56] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:27:40] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [21:30:58] (03CR) 10Takidelfin: "> Patch Set 2: Code-Review+1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (owner: 10Takidelfin) [21:31:46] (03PS3) 10Takidelfin: InitialiseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (https://phabricator.wikimedia.org/T206952) [21:45:48] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:11:54] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [22:16:26] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:41:26] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [22:42:36] PROBLEM - Disk space on analytics1039 is CRITICAL: DISK CRITICAL - free space: / 1366 MB (2% inode=97%) [23:00:28] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:00:28] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:00:40] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:00:42] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:00:50] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:00:52] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:03:32] PROBLEM - Check the NTP synchronisation status of timesyncd on 
notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:03:36] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused [23:09:32] RECOVERY - DPKG on notebook1004 is OK: All packages OK [23:09:34] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient [23:09:44] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [23:09:48] RECOVERY - Disk space on notebook1004 is OK: DISK OK [23:09:56] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:09:58] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up [23:13:52] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:33:40] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Sat 2018-11-17 23:33:38 UTC. [23:40:36] PROBLEM - Disk space on analytics1039 is CRITICAL: DISK CRITICAL - free space: / 1657 MB (3% inode=97%) [23:43:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) Loading finished, overall took 8 days and 9 hours... [23:55:14] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.