[00:19:10] (03PS3) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211)
[00:27:39] (03PS4) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211)
[00:29:23] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:30:49] (03PS5) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211)
[00:31:04] (03CR) 10Andrew Bogott: "This works now -- the earlier issue was just a copy/paste mistake." [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott)
[00:34:39] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[00:35:26] 10Operations, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) The host crashed again today and got rebooted, nothing in `getsel` and from `getraclog` I just got: ` -------------------------------------------------------------------------------- SeqNumber = 226 Mes...
[00:36:05] 10Operations, 10Icinga, 10monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Volans) I got also an email alert from our external monitoring today and upon checking `icinga1001` I noticed that the uptime coincide with a reboot around that time....
[01:14:01] (03CR) 10Anomie: wiki replicas: Remove reference to old comment fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489242 (https://phabricator.wikimedia.org/T212972) (owner: 10Anomie)
[02:05:28] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10Tgr) > According to the article Censorship of Wikipedia, one effect of the switch to https was that it is now not possible to censor individual...
[02:27:04] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10Krenair) >>! In T215071#4941080, @Tgr wrote: > At some point maybe we could downgrade their security and just letsencrypt them How exactly woul...
[03:35:26] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:38:32] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[04:01:56] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:08] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[04:57:54] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) Question, when will this patch go live?
[05:59:30] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:27:59] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:55] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time
[06:32:51] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:33:39] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:34:17] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.54 seconds
[06:34:33] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[06:37:27] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.516 second response time
[06:37:41] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:38:01] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 827.90 seconds
[06:49:42] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.22 seconds
[06:58:30] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:22] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:00:44] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 259.09 seconds
[07:37:12] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[07:40:34] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 235.99 seconds
[08:59:36] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
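(The "Check systemd state" alerts above come from a probe of systemd's overall status. The real plugin lives in the operations/puppet repo; the following is only a minimal sketch of the idea, assuming `systemctl is-system-running` is available, with illustrative output strings and exit codes.)

```python
#!/usr/bin/env python3
"""Sketch of a systemd-state Nagios/Icinga probe (output strings and codes illustrative)."""
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3

def main():
    try:
        # "running" means fully operational; "degraded" means one or more units failed.
        state = subprocess.run(
            ["systemctl", "is-system-running"],
            capture_output=True, text=True,
        ).stdout.strip()
    except OSError as exc:
        print(f"UNKNOWN - could not query systemd: {exc}")
        return UNKNOWN

    if state == "running":
        print("OK - running: The system is fully operational")
        return OK
    if state == "degraded":
        print("CRITICAL - degraded: The system is operational but one or more units failed")
        return CRITICAL
    print(f"CRITICAL - systemd reports state '{state}'")
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())
```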
[09:22:32] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 829.42 seconds
[09:25:54] !log Disable notifications for lag checks on dbstore1002 - T210478
[09:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:58] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[09:28:27] marostegui: thanks :)
[09:31:36] :-)
[09:34:46] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[09:47:52] PROBLEM - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[09:47:53] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215718
[09:47:57] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215718 (10ops-monitoring-bot)
[09:56:56] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:56:58] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:56:58] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:08] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:08] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:10] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:16] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:24] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:26] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:26] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:30] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) Another crash just happened: ` InnoDB: We intentionally crash the server, because it appears to be hung. 2019-02-10 09:54:08 7fa2cfff...
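(The `!log Disable notifications for lag checks` entry above boils down to Icinga external commands on the monitoring host. Production wraps this in its own helper tooling; the sketch below only shows the underlying DISABLE_SVC_NOTIFICATIONS mechanism, with the command-file path, host and service names being assumptions for illustration.)

```python
#!/usr/bin/env python3
"""Sketch: silence a set of Icinga service notifications via the external command pipe."""
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed location of Icinga's command pipe
HOST = "dbstore1002"
SERVICES = [f"MariaDB Slave Lag: s{i}" for i in range(1, 9)]  # illustrative service names

def disable_notifications(host, services):
    now = int(time.time())
    # Standard Nagios/Icinga 1.x external command syntax, one command per line.
    with open(CMD_FILE, "w") as cmd:
        for svc in services:
            cmd.write(f"[{now}] DISABLE_SVC_NOTIFICATIONS;{host};{svc}\n")

if __name__ == "__main__":
    disable_notifications(HOST, SERVICES)
```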
[09:57:36] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:38] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:42] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:54] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:57:56] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:57:56] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:58:00] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:58:00] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:58:02] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[09:58:02] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[09:58:06] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[10:03:14] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:06:44] marostegui: lovely
[10:07:06] so in theory with your recent changes the slaves should restart after the crash
[10:07:12] not complaining
[10:07:14] right?
[10:07:48] yeah
[10:08:02] And with the idempotent mode it shouldn't complain about broken replication on x1
[10:50:18] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave
[10:50:18] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave
[10:50:22] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave
[10:50:46] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[11:48:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:48:32] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:48:32] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:48:38] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:48:44] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:48:46] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:48:52] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:02] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:02] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:02] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:10] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:10] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:17] elukey: Guess what, the my.cnf had "skip-slave-start" twice, lovely...
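(A quick, hypothetical way to confirm that a directive such as `skip-slave-start` really does appear twice in a my.cnf. This is not the tooling actually used, the default path is an assumption, and it deliberately ignores config sections for simplicity.)

```python
#!/usr/bin/env python3
"""Sketch: report options that appear more than once in a MySQL/MariaDB config file."""
from collections import Counter
import sys

def duplicated_options(path):
    counts = Counter()
    with open(path) as cnf:
        for raw in cnf:
            line = raw.split("#", 1)[0].strip()          # drop comments and whitespace
            if not line or line.startswith(("[", "!")):  # skip blanks, [sections], !include
                continue
            # MySQL/MariaDB treat dashes and underscores in option names interchangeably.
            name = line.split("=", 1)[0].strip().lower().replace("-", "_")
            counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/mysql/my.cnf"  # assumed path
    for name, n in sorted(duplicated_options(path).items()):
        print(f"{name} appears {n} times")
```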
[11:49:22] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:24] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:24] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:25] I will send a patch to get rid of the second one XD
[11:49:34] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:49:36] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:49:42] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:52:12] (03PS1) 10Marostegui: dbstore.my.cnf: Remove the second skip-slave-start [puppet] - 10https://gerrit.wikimedia.org/r/489450 (https://phabricator.wikimedia.org/T213670)
[12:23:50] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) I should note that, concerning the edge case regression, it has always been the case that, even after the mobile user taps/click...
[12:27:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[12:28:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[12:32:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] zone_validator: require -z argument zones dir [dns] - 10https://gerrit.wikimedia.org/r/489287 (owner: 10BBlack)
[12:33:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] deploy-check: integrate other checks, no-gdnsd opt [dns] - 10https://gerrit.wikimedia.org/r/489288 (owner: 10BBlack)
[12:34:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] update README and run-tests.sh [dns] - 10https://gerrit.wikimedia.org/r/489289 (owner: 10BBlack)
[12:34:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] authdns-local-update: update deploy-check.py args [puppet] - 10https://gerrit.wikimedia.org/r/489292 (owner: 10BBlack)
[12:35:06] (03Abandoned) 10Alexandros Kosiaris: Run zole validation on generated zonefiles [dns] - 10https://gerrit.wikimedia.org/r/489277 (owner: 10Alexandros Kosiaris)
[12:35:49] (03PS7) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251
[14:33:25] (03PS1) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457
[14:34:29] (03PS2) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457
[14:34:51] (03PS3) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457)
[14:35:15] (03PS4) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457)
[14:39:19] (03CR) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox)
[15:24:44] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) ti
[15:24:44] response was received
[15:27:33] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received
[15:28:03] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received
[15:29:49] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy
[15:31:37] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png
[15:31:37] p) timed out before a response was received
[15:32:44] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10BBlack) >>! In T215071#4941080, @Tgr wrote: >> According to the article Censorship of Wikipedia, one effect of the switch to https was that it i...
[15:32:53] PROBLEM - Ensure certcentral-backend is running only in the active node on certcentral1001 is CRITICAL: PROCS CRITICAL: 0 processes with args certcentral-backend
[15:33:29] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) timed out before a response was received
[15:33:46] Onimisionipe, mateusbs17 : any chance you are around to have a look at those maps alerts ? I'm in the mountains with lousy 3g on a mobile phone
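(The "kartotherian endpoints health" checks above probe a list of tile and marker URLs and report any that error out or time out. Production generates these checks from the service's endpoint spec with its own checker; the sketch below only illustrates the pattern, and the host, port, URL list and timeout are assumptions.)

```python
#!/usr/bin/env python3
"""Sketch: probe a list of service endpoints and flag errors/timeouts (illustrative values)."""
import socket
import urllib.error
import urllib.request

BASE = "http://maps1004.eqiad.wmnet:6533"    # assumed host and port
ENDPOINTS = [
    "/osm-intl/9/207/163@1.5x.png",          # default scaled tile
    "/osm-intl/11/828/655.png",              # tile in the middle of the ocean, with overzoom
    "/v4/marker/pin-m-fuel+ffffff.png",      # pushpin marker
]
TIMEOUT = 5  # seconds, illustrative

def failing_endpoints(base, endpoints):
    failures = []
    for path in endpoints:
        try:
            with urllib.request.urlopen(base + path, timeout=TIMEOUT) as resp:
                if resp.status >= 400:
                    failures.append(f"{path}: HTTP {resp.status}")
        except (urllib.error.URLError, socket.timeout) as exc:
            failures.append(f"{path}: {exc}")
    return failures

if __name__ == "__main__":
    problems = failing_endpoints(BASE, ENDPOINTS)
    if problems:
        print("CRITICAL: " + "; ".join(problems))
    else:
        print("OK: All endpoints are healthy")
```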
[15:34:05] RECOVERY - Ensure certcentral-backend is running only in the active node on certcentral1001 is OK: PROCS OK: 1 process with args certcentral-backend
[15:34:35] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy
[15:36:29] RECOVERY - kartotherian endpoints health on maps1004 is OK: All endpoints are healthy
[15:37:42] Seems to recover, so I won't worry too much atm
[15:46:14] (03PS9) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553)
[15:46:37] (03PS2) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115
[16:47:24] (03PS1) 10Paladox: gerrit: Increase httpd.threads in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/489475
[16:48:17] (03PS2) 10Paladox: gerrit: Increase httpd.threads in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/489475
[17:14:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[19:16:50] !log forcing reboot of icinga1001 because it's stuck again (no ping, no ssh, CPU stuck messages on console) - T214760
[19:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:54] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760
[19:19:23] (03PS1) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:19:25] (03PS1) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:21:11] welcome back icinga-wm
[19:23:41] (03PS2) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:23:47] (03PS2) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:32:48] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) `icinga1001` was stuck again today, but in a slightly different way that gave us some additional information. No ping, no ssh and no icinga web were working and no racadm errors were logged, but...
[19:39:40] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10Volans) I'm merging this with T214760 as those are now clearly just two different manifestation of the same issue (stuck and reboot) and we have the same entries in `getraclog`: ` --------------...
[19:40:01] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10Volans)
[19:42:12] (03PS3) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:42:18] (03PS3) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:43:22] (03PS4) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482
[19:43:27] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) As an additional datapoint from T210108, that I've just merged into this task, is that we had 2 reboots that showed the same symptoms during provisioning, I think even before the Icinga softwar...
[19:43:28] (03PS4) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[19:57:09] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Volans) The host is stuck again (no ping, no ssh, nothing in console but `[ 2451.381422] m`, nothing new on `getsel ` or `getraclog`, forcing a reboot,
[19:57:56] !log force rebooting mw1299, stuck again - T215569
[20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:12] T215569: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569
[20:00:21] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:03:19] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100%
[20:05:51] wow that failed again fast
[20:06:22] volans|off: I think it's a fscked cpu
[20:07:00] But it should be depooled etc
[20:07:08] it's already set as inactive
[20:07:30] (no lvs, no dsh)
[20:08:39] Probably easiest just to leave it for Chris to run diagnostics and warranty repair
[20:09:58] indeed, not planning to do much on it, I already have my other lemon ;)
[20:10:22] I'll just try another reboot keeping the console open to see if anything shows up there
[20:11:28] actually, re-reading the task, we already have evidence of CPU errors
[20:11:34] acking on icinga
[20:12:54] ACKNOWLEDGEMENT - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% Volans Faulty CPU - T215569
[20:14:40] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Volans) The host already re-crashed, I'm leaving it as is for now. I've ack'ed the alerts on icinga.
[20:32:40] (03PS5) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[20:32:56] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Marostegui) >>! In T214760#4941652, @Volans wrote: > > I propose to failover to icinga2001 until we find out what's wrong with this one and we fix it. > Thoughts? The existing downtimes would be lost?
[20:36:03] (03PS6) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483
[20:42:18] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:45:47] 10Operations, 10ops-eqiad, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) >>! In T214760#4941860, @Marostegui wrote: > The existing downtimes would be lost? Absolutely not, we already sync the icinga state file periodically and the failover procedure has a specific s...
[20:52:34] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10Discovery-Search (Current work), 10Patch-For-Review: Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10bd808)
[21:13:14] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[21:21:49] (03Abandoned) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) (owner: 10Paladox)
[21:23:12] (03PS5) 10Paladox: Add support for "recheck" as button in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489482 (https://phabricator.wikimedia.org/T214631)
[21:24:36] (03Abandoned) 10Paladox: gerrit: Make ci comments pretty under PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/489397 (https://phabricator.wikimedia.org/T215658) (owner: 10Paladox)
[21:24:44] (03PS1) 10BryanDavis: wiki replicas: Expose ipblocks_restrictions table [puppet] - 10https://gerrit.wikimedia.org/r/489576 (https://phabricator.wikimedia.org/T209819)
[21:36:13] 10Puppet, 10Cloud-VPS, 10serviceops, 10Patch-For-Review, 10Technical-Debt: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (10bd808)
[21:44:21] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808)
[21:48:13] (03PS2) 10Nuria: Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680)
[21:48:51] (03CR) 10jerkins-bot: [V: 04-1] Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[21:48:52] (03PS3) 10Nuria: Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680)
[21:50:02] (03PS4) 10Nuria: Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680)
[21:50:03] (03CR) 10jerkins-bot: [V: 04-1] Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[21:50:28] (03CR) 10Nuria: Add ability for superset to connect to new staging DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[21:51:24] (03CR) 10jerkins-bot: [V: 04-1] Add ability for superset to connect to new staging DB [puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria)
[22:11:49] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10bd808) The next step here is to determine the hardware requi...
[23:54:39] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (10Krinkle) (Moving to Meta, as it only affects mwdebug and only the php70 runtime which we don't expose publicly anymore.)