[00:03:35] (CR) jenkins-bot: In LocalSettings.php use a relative path to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/486004 (owner: Tim Starling)
[00:38:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:39:51] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:20:02] Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/486792
[01:24:39] Reedy: Thanks.
[01:27:27] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 161.13 seconds
[01:35:05] * James_F ponders painting go-faster stripes on the side of the CI cloud.
[02:24:59] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/WikibaseMediaInfo/src/WikibaseMediaInfoHooks.php: Hot-deploy Ic2b08cb27 in WBMI to fix Commons File page display (duration: 00m 49s)
[02:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:37] Everything seems fine.
[02:33:55] (Fine for an I-just-emergency-deployed-on-a-Saturday-night fine.)
[03:27:47] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:28:25] !log Fix x1 on dbstore1002 - T213670
[03:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:29] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670
[03:32:25] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1041.97 seconds
[03:37:31] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:37:59] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz],File[/usr/share/GeoIP/GeoIPCity.dat.test]
[03:47:13] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.29 seconds
[04:03:47] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:04:15] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:31:06] (CR) Mathew.onipe: [C: +1] icinga: add context manager for downtimed hosts [software/spicerack] - https://gerrit.wikimedia.org/r/486530 (owner: Volans)
[05:55:11] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.85 seconds
[07:02:07] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1022.52 seconds
[07:28:59] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000323, end_log_pos 279872044
[07:43:23] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1022.45 seconds
[09:05:49] PROBLEM - very high load average likely xfs on ms-be1020 is CRITICAL: CRITICAL - load average: 181.89, 116.83, 63.93
[09:06:35] PROBLEM - MD RAID on ms-be1020 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[09:06:36] ACKNOWLEDGEMENT - MD RAID on ms-be1020 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214778
[09:06:40] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (ops-monitoring-bot)
[09:09:13] PROBLEM - Check systemd state on ms-be1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:10:01] PROBLEM - Disk space on ms-be1020 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdm1 is not accessible: Input/output error
[09:13:27] PROBLEM - very high load average likely xfs on ms-be1020 is CRITICAL: CRITICAL - load average: 54.10, 117.43, 90.50
[09:15:59] RECOVERY - very high load average likely xfs on ms-be1020 is OK: OK - load average: 15.52, 76.49, 79.07
[09:16:21] PROBLEM - swift-container-updater on ms-be1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:16:23] PROBLEM - Disk space on ms-be1020 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdm1 is not accessible: Input/output error
[09:33:23] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 240.24 seconds
[09:37:51] Operations, ops-eqiad: ms-be1020 shows I/O errors - https://phabricator.wikimedia.org/T214779 (elukey) p:Triage→Normal
[09:37:59] godog: --^
[09:39:26] As far as I can see traffic to the host is zero, so I am inclined not to do anything like shutdown/etc.. before taking my plane :)
[09:40:56] Operations, ops-eqiad: ms-be1020 shows I/O errors - https://phabricator.wikimedia.org/T214779 (elukey)
[09:40:58] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (elukey)
[09:41:03] course there was already a task!
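
The ms-be1020 alerts above (degraded MD RAID, I/O errors on /srv/swift-storage/sdm1, swift-container-updater not running) all point at a failing disk on that Swift backend. As a rough illustration, here is a minimal sketch of the first-pass triage an on-call might run while the host is still reachable over SSH; the /dev/md0 and /dev/sdm device names are only inferred from the alerts and would need confirming, and these are stock Linux tools rather than the exact commands used on the day:

  # Which md array is degraded, and which member dropped out?
  cat /proc/mdstat
  sudo mdadm --detail /dev/md0    # assumed array name; confirm against /proc/mdstat

  # Is the kernel logging I/O errors for the suspect device?
  sudo dmesg -T | grep -iE 'i/o error|sdm'

  # SMART health and error log for the disk behind the broken partition (sdm1):
  sudo smartctl -H -l error /dev/sdm
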
[09:41:05] sigh
[09:41:57] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (elukey) Reporting in here as well: SSH: ` #~ ssh ms-be1020.eqiad.wmnet -bash: /usr/share/bash-completion/bash_completion: Input/output error elukey@ms-be1020:~$ df -h -bash: /bin/df: Input/output error elukey...
[09:41:59] PROBLEM - HHVM rendering on mw2193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:43:07] RECOVERY - HHVM rendering on mw2193 is OK: HTTP OK: HTTP/1.1 200 OK - 75444 bytes in 0.328 second response time
[09:50:53] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[10:02:55] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:05:54] Operations, Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (jcrespo) It broke again on echo_unread_wikis, I fixed it.
[10:07:14] thanks jynus --^
[10:07:45] I am honestly tired of that
[10:08:30] I completely get it, I really hope that February will be a good month to migrate to the new dbstore nodes
[10:12:00] (afk)
[10:14:41] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.05 seconds
[11:01:47] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1002.47 seconds
[11:18:07] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:20:27] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[11:32:45] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:35:19] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational
[11:57:51] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:08:23] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[12:17:18] (CR) MarcoAurelio: [C: +1] "LGTM now. Added @Daimona here for doublechecking as the abuse filter configs are being changed these days." [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[12:23:04] (CR) Daimona Eaytoy: [C: +1] "LGTM, and copying my comment here on the task to make sure no misunderstanding will arise." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[12:27:50] (PS10) Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364)
[12:28:09] (CR) MarcoAurelio: [C: +1] Enable blocking feature of AbuseFilter in zh.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[12:30:52] (PS11) Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364)
[12:33:08] (CR) Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[13:33:33] PROBLEM - Check systemd state on ms-be1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:40:47] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:11] RECOVERY - Check systemd state on ms-be1019 is OK: OK - running: The system is fully operational
[13:44:27] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[13:57:17] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[13:59:39] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:00:45] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[14:26:11] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:27:17] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[14:41:09] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[14:43:46] 10Operations: ms-be1034 icinga alers - https://phabricator.wikimedia.org/T214796 (jijiki)
[15:00:18] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (Dzahn) Open→Resolved a:Dzahn @Podzemnik Done! I added you to the admins, you can now see on the list info page: WikimediaNZ-l list run by brian....
[15:04:24] Operations, User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (Dzahn) > transitioning the ~1400 existing tasks currently in "backlog" on the workboard to "acknowledged" So i just got 250 surprise notifications over night and it means it's hard to...
[15:19:13] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[15:21:37] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:22:41] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 75491 bytes in 0.111 second response time
[15:30:37] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:37:05] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[15:37:21] PROBLEM - very high load average likely xfs on ms-be1034 is CRITICAL: CRITICAL - load average: 117.98, 105.34, 87.00
[15:38:35] Anybody available?
[15:38:49] To restart ms-be1034
[15:42:43] the host is up, just very loaded
[15:43:29] Yeah..
[15:44:42] apergos: what will you do?
[15:44:49] don't know yet
[15:48:21] still looking around to see what's what
[15:51:31] btw jijik.i created: https://phabricator.wikimedia.org/T214796
[15:51:49] I saw it
[15:51:59] A restart should fix it...well, I hope
[15:52:10] Looking around wikitech
[15:52:10] a restart of swift services?
[15:52:25] Of the Swift server
[15:53:05] There are incidents of CPU maxing out..which led to a rolling restart of all servers
[15:54:14] I don't know much about swift but I know it's multi-instance storage and redundancy should be in place. So a power cycle of one server should not hurt
[15:56:36] when I look at cpu per host in grafana, it doesn't look exceptional for ms-be1034
[15:56:48] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=swift&var-instance=All&from=1548532602817&to=1548604089497
[15:58:49] RECOVERY - very high load average likely xfs on ms-be1034 is OK: OK - load average: 55.86, 70.34, 78.86
[15:59:15] and a recovery, I wonder if that's godog already, the message was passed on to him
[15:59:20] Load is normal
[16:00:33] hi, no didn't touch it! though glad to see it seems back to normal?
[16:00:42] well for the moment
[16:00:57] when I go look at the channel logs it seems to have been flapping for the last couple hours
[16:01:20] Yeah...so we might get another alert soon
[16:02:00] ok thanks, I will take a closer look now
[16:08:36] what is all this 'remote drive not mounted' cruft in the previous background log?
[16:09:24] ah nm looks like it was trying to replicate to/from 1020 which is out of service
[16:15:50] meh hotel wifi doing what hotel wifis do
[16:15:57] ok checking now
[16:17:51] I'll start from ms-be1020 btw as it seems the culprit
[16:18:47] ok
[16:18:59] sorry bout hotel wifi
[16:21:37] PROBLEM - Host ms-be1020 is DOWN: PING CRITICAL - Packet loss = 100%
[16:22:24] !log powercycle ms-be1020 - T214778
[16:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:28] T214778: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778
[16:22:50] heheh me too onimisionipe !
[16:22:57] ah the degraded raid
[16:23:03] RECOVERY - MD RAID on ms-be1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[16:23:09] RECOVERY - Host ms-be1020 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[16:23:11] RECOVERY - Check systemd state on ms-be1020 is OK: OK - running: The system is fully operational
[16:23:43] RECOVERY - Disk space on ms-be1020 is OK: DISK OK
[16:24:05] RECOVERY - swift-container-updater on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[16:24:20] I saw the alerts and the ticket on that and figured it was unrelated :-/
[16:24:56] shouldn't the host be automatically depooled when it's in that state? I thought...
[16:26:15] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:29:03] apergos: yeah in practice swift works around a single host down, looks like 1034 was struggling with the additional load
[16:29:51] gotcha
[16:30:04] * apergos raises a jetlagged eyebrow in swift's direction
[16:30:12] thank you for fixing it up!
[16:30:15] apergos: lol
[16:30:22] godog: Thanks!
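
The exchange above leans on Swift's built-in redundancy: with multiple object replicas spread across backends, one host being down (or, as here, extremely slow but still reachable) is absorbed by the rest of the cluster, at the cost of extra replication load on peers such as ms-be1034. A minimal sketch of how one might confirm the cluster has settled once ms-be1020 is back, using stock Swift admin tools; these are generic checks rather than the ones actually run here, and swift-dispersion-report assumes /etc/swift/dispersion.conf is populated:

  # Replication stats per object server: failure counts and how long ago
  # each node last completed a replication pass.
  swift-recon object --replication

  # Per-disk usage plus any unmounted devices; a disk like sdm1 that failed
  # to come back after the reboot would show up here.
  swift-recon object --diskusage --unmounted

  # End-to-end view: what fraction of the dispersion objects/containers
  # still have all replicas reachable.
  swift-dispersion-report
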
[16:30:41] there's a ticket if you want to comment/close, too, godog
[16:30:53] https://phabricator.wikimedia.org/T214796
[16:33:19] thanks! I'll let things recover for a bit and update both that and the ms-be1020 task
[16:40:50] awesome!
[17:02:48] 10Operations: ms-be1034 icinga alers - https://phabricator.wikimedia.org/T214796 (fgiunchedi) Open→Resolved a:fgiunchedi root cause looks like it was additional load from ms-be1020 being extremely slow (but reachable) and traffic ramping up in eqiad, resolving this in favor of {T214778}
[17:03:54] \o/
[17:03:56] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (fgiunchedi) a:fgiunchedi Host is back after a powercycle, looks like the raid controller freaked out. Leaving this open to upgrade the controller firmware.
[18:06:27] lol godog woke up and ninja'd everything
[19:49:55] Operations, Citoid, Prod-Kubernetes, serviceops, and 2 others: Citoid automated monitoring times out due to Zotero v2 - https://phabricator.wikimedia.org/T211411 (mobrovac) Open→Resolved There have been no timeouts recorded by the automatic check scripts since the deploy, so looking good....
[20:36:26] (CR) Zoranzoki21: [C: +1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[20:47:19] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.07 seconds
[21:27:10] (PS1) BryanDavis: toolforge: fix script naming for run-parts [puppet] - https://gerrit.wikimedia.org/r/486822 (https://phabricator.wikimedia.org/T87001)
[21:36:23] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 256.45 seconds
[22:03:14] Operations, Wikimedia-Mailing-lists: request of a new mailing list WIKI-BNCF - https://phabricator.wikimedia.org/T214059 (MarcoAurelio) I don't think (excuse me if I'm wrong) that @Giaccai can create mailing lists on Wikimedia. As such unless she's required to provide some info which blocks this task to...
[22:04:29] mutante: hi - in which status is T52864 ?
[22:04:30] T52864: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864
[22:59:50] (CR) Krinkle: "Fixed in If08fceae29842af828f53f8c" [mediawiki-config] - https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: Gergő Tisza)
[23:00:07] (PS1) Krinkle: Remove wgTemplateStylesAllowedUrls override (matches default) [mediawiki-config] - https://gerrit.wikimedia.org/r/486829
[23:00:41] (CR) Krinkle: [C: -1] "Wait until the Depends-On commit is merged and deployed everywhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/486829 (owner: Krinkle)
[23:02:29] (PS1) Krinkle: Document why ActiveAbstract is loaded in this way [mediawiki-config] - https://gerrit.wikimedia.org/r/486830
[23:05:09] (CR) Krinkle: "Fixed in Ida8a347f8c836b360c2a83cd6c4f53f08e7da9a3" [mediawiki-config] - https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: ArielGlenn)
[23:28:26] (CR) Brion VIBBER: [C: +2] Document why ActiveAbstract is loaded in this way [mediawiki-config] - https://gerrit.wikimedia.org/r/486830 (owner: Krinkle)
[23:29:31] (Merged) jenkins-bot: Document why ActiveAbstract is loaded in this way [mediawiki-config] - https://gerrit.wikimedia.org/r/486830 (owner: Krinkle)
[23:38:35] (CR) jenkins-bot: Document why ActiveAbstract is loaded in this way [mediawiki-config] - https://gerrit.wikimedia.org/r/486830 (owner: Krinkle)
[23:59:37] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
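
dbstore1002's replication kept flapping through the day: the s8/s4 lag alerts, and the 07:28 x1 break on wikishared.echo_unread_wikis (Errno 1032 / HA_ERR_KEY_NOT_FOUND, meaning a row-based event referenced a row the replica no longer has) that jcrespo cleared by hand. For context, here is a minimal sketch of how that kind of break is commonly cleared on a MariaDB replica. It assumes a multi-source setup with a connection named 'x1', as the check name suggests; the commands are generic MariaDB administration, not necessarily what was actually run on dbstore1002:

  # Where and why did the SQL thread stop?
  sudo mysql -e "SHOW SLAVE 'x1' STATUS\G" | grep -E 'Slave_SQL_Running|Last_SQL_Errno|Last_SQL_Error'

  # Preferred fix when the data matters: recreate the missing row in
  # wikishared.echo_unread_wikis from the master's copy, then restart the
  # SQL thread with START SLAVE 'x1'.
  # Quicker fix that accepts a known one-row drift: skip the single broken event.
  sudo mysql -e "STOP SLAVE 'x1'; SET @@default_master_connection='x1'; SET GLOBAL sql_slave_skip_counter=1; START SLAVE 'x1';"
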