[00:19:10] (03PS3) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211)
[00:27:39] (03PS4) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211)
[00:29:23] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:30:49] (03PS5) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211)
[00:31:04] (03CR) 10Andrew Bogott: "This works now -- the earlier issue was just a copy/paste mistake." [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott)
[00:34:39] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[00:35:26] 10Operations, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) The host crashed again today and got rebooted, nothing in `getsel` and from `getraclog` I just got: ` -------------------------------------------------------------------------------- SeqNumber = 226 Mes...
[00:36:05] 10Operations, 10Icinga, 10monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Volans) I got also an email alert from our external monitoring today and upon checking `icinga1001` I noticed that the uptime coincide with a reboot around that time....
[01:14:01] (03CR) 10Anomie: wiki replicas: Remove reference to old comment fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489242 (https://phabricator.wikimedia.org/T212972) (owner: 10Anomie)
[02:05:28] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10Tgr) > According to the article Censorship of Wikipedia, one effect of the switch to https was that it is now not possible to censor individual...
[02:27:04] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10Krenair) >>! In T215071#4941080, @Tgr wrote: > At some point maybe we could downgrade their security and just letsencrypt them How exactly woul...
[03:35:26] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:38:32] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[04:01:56] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:08] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[04:57:54] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) Question, when will this patch go live?
[05:59:30] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:27:59] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:55] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time
[06:32:51] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:33:39] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:34:17] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.54 seconds
[06:34:33] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[06:37:27] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.516 second response time
[06:37:41] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:38:01] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 827.90 seconds
[06:49:42] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.22 seconds
[06:58:30] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:22] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:00:44] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 259.09 seconds
[07:37:12] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[07:40:34] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 235.99 seconds
[08:59:36] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
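(The "Check systemd state" alerts above come from a probe of systemd's overall status. The real plugin lives in the operations/puppet repo; the following is only a minimal sketch of the idea, assuming `systemctl is-system-running` is available, with illustrative output strings and exit codes.)

```python
#!/usr/bin/env python3
"""Sketch of a systemd-state Nagios/Icinga probe (output strings and codes illustrative)."""
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3

def main():
    try:
        # "running" means fully operational; "degraded" means one or more units failed.
        state = subprocess.run(
            ["systemctl", "is-system-running"],
            capture_output=True, text=True,
        ).stdout.strip()
    except OSError as exc:
        print(f"UNKNOWN - could not query systemd: {exc}")
        return UNKNOWN

    if state == "running":
        print("OK - running: The system is fully operational")
        return OK
    if state == "degraded":
        print("CRITICAL - degraded: The system is operational but one or more units failed")
        return CRITICAL
    print(f"CRITICAL - systemd reports state '{state}'")
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())
```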
[09:22:32] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 829.42 seconds
[09:25:54] !log Disable notifications for lag checks on dbstore1002 - T210478
[09:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:58] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[09:28:27] marostegui: thanks :)
[09:31:36] :-)
[09:34:46] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[09:47:52] PROBLEM - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[09:47:53] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215718
[09:47:57] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215718 (10ops-monitoring-bot)
[09:56:56] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:56:58] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:56:58] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:08] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:08] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:10] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:16] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:24] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:26] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:26] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:30] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) Another crash just happened: ` InnoDB: We intentionally crash the server, because it appears to be hung. 2019-02-10 09:54:08 7fa2cfff...
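(The `!log Disable notifications for lag checks` entry above boils down to Icinga external commands on the monitoring host. Production wraps this in its own helper tooling; the sketch below only shows the underlying DISABLE_SVC_NOTIFICATIONS mechanism, with the command-file path, host and service names being assumptions for illustration.)

```python
#!/usr/bin/env python3
"""Sketch: silence a set of Icinga service notifications via the external command pipe."""
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed location of Icinga's command pipe
HOST = "dbstore1002"
SERVICES = [f"MariaDB Slave Lag: s{i}" for i in range(1, 9)]  # illustrative service names

def disable_notifications(host, services):
    now = int(time.time())
    # Standard Nagios/Icinga 1.x external command syntax, one command per line.
    with open(CMD_FILE, "w") as cmd:
        for svc in services:
            cmd.write(f"[{now}] DISABLE_SVC_NOTIFICATIONS;{host};{svc}\n")

if __name__ == "__main__":
    disable_notifications(HOST, SERVICES)
```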
[09:57:36] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:38] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:42] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:54] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:56] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:56] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:58:00] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:58:00] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:58:02] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:58:02] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:58:06] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[10:03:14] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:06:44] marostegui: lovely
[10:07:06] so in theory with your recent changes the slaves should restart after the crash
[10:07:12] not complaining
[10:07:14] right?
[10:07:48] yeah
[10:08:02] And with the idempotent mode it shouldn't complain about broken replication on x1
[10:50:18] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave
[10:50:18] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave
[10:50:22] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave
[10:50:46] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[11:48:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:48:32] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:48:32] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:48:38] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:48:44] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:48:46] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:48:52] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:02] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:02] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:02] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:10] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:10] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:17] elukey: Guess what, the my.cnf had "skip-slave-start" twice, lovely...
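(A quick, hypothetical way to confirm that a directive such as `skip-slave-start` really does appear twice in a my.cnf. This is not the tooling actually used, the default path is an assumption, and it deliberately ignores config sections for simplicity.)

```python
#!/usr/bin/env python3
"""Sketch: report options that appear more than once in a MySQL/MariaDB config file."""
from collections import Counter
import sys

def duplicated_options(path):
    counts = Counter()
    with open(path) as cnf:
        for raw in cnf:
            line = raw.split("#", 1)[0].strip()          # drop comments and whitespace
            if not line or line.startswith(("[", "!")):  # skip blanks, [sections], !include
                continue
            # MySQL/MariaDB treat dashes and underscores in option names interchangeably.
            name = line.split("=", 1)[0].strip().lower().replace("-", "_")
            counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/mysql/my.cnf"  # assumed path
    for name, n in sorted(duplicated_options(path).items()):
        print(f"{name} appears {n} times")
```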
[11:49:22] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:24] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:24] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:25] I will send a patch to get rid of the second one XD
[11:49:34] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:36] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:42] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:52:12] (03PS1) 10Marostegui: dbstore.my.cnf: Remove the second skip-slave-start [puppet] - 10https://gerrit.wikimedia.org/r/489450 (https://phabricator.wikimedia.org/T213670)
[12:23:50] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) I should note that, concerning the edge case regression, it has always been the case that, even after the mobile user taps/click...
[12:27:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[12:28:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[12:32:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] zone_validator: require -z argument zones dir [dns] - 10https://gerrit.wikimedia.org/r/489287 (owner: 10BBlack)
[12:33:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] deploy-check: integrate other checks, no-gdnsd opt [dns] - 10https://gerrit.wikimedia.org/r/489288 (owner: 10BBlack)
[12:34:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] update README and run-tests.sh [dns] - 10https://gerrit.wikimedia.org/r/489289 (owner: 10BBlack)
[12:34:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] authdns-local-update: update deploy-check.py args [puppet] - 10https://gerrit.wikimedia.org/r/489292 (owner: 10BBlack)
[12:35:06] (03Abandoned) 10Alexandros Kosiaris: Run zole validation on generated zonefiles [dns] - 10https://gerrit.wikimedia.org/r/489277 (owner: 10Alexandros Kosiaris)
[12:35:49] (03PS7) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251
[14:33:25] (03PS1) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457
[14:34:29] (03PS2) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457
[14:34:51] (03PS3) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457)
[14:35:15] (03PS4) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457)
[14:39:19] (03CR) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox)
[15:24:44] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) ti
[15:24:44] response was received
[15:27:33] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received
[15:28:03] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received
[15:29:49] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy
[15:31:37] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png
[15:31:37] p) timed out before a response was received
[15:32:44] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10BBlack) >>! In T215071#4941080, @Tgr wrote: >> According to the article Censorship of Wikipedia, one effect of the switch to https was that it i...
[15:32:53] PROBLEM - Ensure certcentral-backend is running only in the active node on certcentral1001 is CRITICAL: PROCS CRITICAL: 0 processes with args certcentral-backend
[15:33:29] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) timed out before a response was received
[15:33:46] Onimisionipe, mateusbs17 : any chance you are around to have a look at those maps alerts ? I'm in the mountains with lousy 3g on a mobile phone
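(The "kartotherian endpoints health" checks above probe a list of tile and marker URLs and report any that error out or time out. Production generates these checks from the service's endpoint spec with its own checker; the sketch below only illustrates the pattern, and the host, port, URL list and timeout are assumptions.)

```python
#!/usr/bin/env python3
"""Sketch: probe a list of service endpoints and flag errors/timeouts (illustrative values)."""
import socket
import urllib.error
import urllib.request

BASE = "http://maps1004.eqiad.wmnet:6533"    # assumed host and port
ENDPOINTS = [
    "/osm-intl/9/207/163@1.5x.png",          # default scaled tile
    "/osm-intl/11/828/655.png",              # tile in the middle of the ocean, with overzoom
    "/v4/marker/pin-m-fuel+ffffff.png",      # pushpin marker
]
TIMEOUT = 5  # seconds, illustrative

def failing_endpoints(base, endpoints):
    failures = []
    for path in endpoints:
        try:
            with urllib.request.urlopen(base + path, timeout=TIMEOUT) as resp:
                if resp.status >= 400:
                    failures.append(f"{path}: HTTP {resp.status}")
        except (urllib.error.URLError, socket.timeout) as exc:
            failures.append(f"{path}: {exc}")
    return failures

if __name__ == "__main__":
    problems = failing_endpoints(BASE, ENDPOINTS)
    if problems:
        print("CRITICAL: " + "; ".join(problems))
    else:
        print("OK: All endpoints are healthy")
```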
[15:34:05] RECOVERY - Ensure certcentral-backend is running only in the active node on certcentral1001 is OK: PROCS OK: 1 process with args certcentral-backend
[15:34:35] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy
[15:36:29] RECOVERY - kartotherian endpoints health on maps1004 is OK: All endpoints are healthy
[15:37:42] Seems to recover, so I won't worry too much atm
[15:46:14] (03PS9) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553)
[15:46:37] (03PS2) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115
[16:47:24] (03PS1) 10Paladox: gerrit: Increase httpd.threads in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/489475
[16:48:17] (03PS2) 10Paladox: gerrit: Increase httpd.threads in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/489475
[17:14:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[19:16:50] !log forcing reboot of icinga1001 because it's stuck again (no ping, no ssh, CPU stuck messages on console) - T214760
[19:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:54] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760
[19:19:23] (03PS1) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:19:25] (03PS1) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:21:11] welcome back icinga-wm
[19:23:41] (03PS2) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:23:47] (03PS2) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:32:48] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) `icinga1001` was stuck again today, but in a slightly different way that gave us some additional information. No ping, no ssh and no icinga web were working and no racadm errors were logged, but...
[19:39:40] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10Volans) I'm merging this with T214760 as those are now clearly just two different manifestation of the same issue (stuck and reboot) and we have the same entries in `getraclog`: ` --------------...
[19:40:01] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10Volans)
[19:42:12] (03PS3) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:42:18] (03PS3) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:43:22] (03PS4) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:43:27] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) As an additional datapoint from T210108, that I've just merged into this task, is that we had 2 reboots that showed the same symptoms during provisioning, I think even before the Icinga softwar...
[19:43:28] (03PS4) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:57:09] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Volans) The host is stuck again (no ping, no ssh, nothing in console but `[ 2451.381422] m`, nothing new on `getsel ` or `getraclog`, forcing a reboot,
[19:57:56] !log force rebooting mw1299, stuck again - T215569
[20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:12] T215569: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569
[20:00:21] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:03:19] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100%
[20:05:51] wow that failed again fast
[20:06:22] volans|off: I think it's a fscked cpu
[20:07:00] But it should be depooled etc
[20:07:08] it's already set as inactive
[20:07:30] (no lvs, no dsh)
[20:08:39] Probably easiest just to leave it for Chris to run diagnostics and warranty repair
[20:09:58] indeed, not planning to do much on it, I already have my other lemon ;)
[20:10:22] I'll just try another reboot keeping the console open to see if anything shows up there
[20:11:28] actually, re-reading the task, we already have evidence of CPU errors
[20:11:34] acking on icinga
[20:12:54] ACKNOWLEDGEMENT - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% Volans Faulty CPU - T215569
[20:14:40] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Volans) The host already re-crashed, I'm leaving it as is for now. I've ack'ed the alerts on icinga.
[20:32:40] (03PS5) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[20:32:56] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Marostegui) >>! In T214760#4941652, @Volans wrote: > > I propose to failover to icinga2001 until we find out what's wrong with this one and we fix it. > Thoughts? The existing downtimes would be lost?
[20:36:03] (03PS6) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[20:42:18] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:45:47] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) >>! In T214760#4941860, @Marostegui wrote: > The existing downtimes would be lost? Absolutely not, we already sync the icinga state file periodically and the failover procedure has a specific s...
[20:52:34] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10Discovery-Search (Current work), 10Patch-For-Review: Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10bd808)
[21:13:14] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[21:21:49] (03Abandoned) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) (owner: 10Paladox)
[21:23:12] (03PS5) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482 (https://phabricator.wikimedia.org/T214631)
[21:24:36] (03Abandoned) 10Paladox: gerrit: Make ci comments pretty under PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/489397 (https://phabricator.wikimedia.org/T215658) (owner: 10Paladox)
[21:24:44] (03PS1) 10BryanDavis: wiki replicas: Expose ipblocks_restrictions table [puppet] - 10https://gerrit.wikimedia.org/r/489576 (https://phabricator.wikimedia.org/T209819)
[21:36:13] 10Puppet, 10Cloud-VPS, 10serviceops, 10Patch-For-Review, 10Technical-Debt: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (10bd808)
[21:44:21] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808)
[21:48:13] (03PS2) 10Nuria: Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680)
[21:48:51] (03CR) 10jerkins-bot: [V: 04-1] Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[21:48:52] (03PS3) 10Nuria: Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680)
[21:50:02] (03PS4) 10Nuria: Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680)
[21:50:03] (03CR) 10jerkins-bot: [V: 04-1] Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[21:50:28] (03CR) 10Nuria: Add ability for superset to connect to new staging DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[21:51:24] (03CR) 10jerkins-bot: [V: 04-1] Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[22:11:49] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10bd808) The next step here is to determine the hardware requi...
[23:54:39] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (10Krinkle) (Moving to Meta, as it only affects mwdebug and only the php70 runtime which we don't expose publicly anymore.)