[00:02:03] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[00:29:25] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:32:49] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31849824 and 1 seconds
[02:34:17] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 215064 and 10 seconds
[02:52:39] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational
[03:05:23] PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[03:32:41] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:15:29] PROBLEM - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,9 instance=db1072:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad+prometheus/ops
[06:31:31] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:32:49] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:58:53] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:07] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:07:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:10:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:10:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:10:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:17:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:17:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:35:33] Operations, Wikidata, Wikidata-Query-Service: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Multichill) @Smalyshev : Maybe do it like jsub on the Toollabs: Give an option to add the expected runtime? Based on this the load bal...
[09:07:23] PROBLEM - Device not healthy -SMART- on helium is CRITICAL: cluster=misc device=megaraid,14 instance=helium:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=helium&var-datasource=eqiad+prometheus/ops
[10:18:26] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226878 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:18:29] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226878 (ops-monitoring-bot)
[10:39:19] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226880 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:39:22] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226880 (ops-monitoring-bot)
[12:13:30] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226887 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:13:33] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226887 (ops-monitoring-bot)
[12:47:38] (PS1) Volans: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519752
[12:47:40] (PS1) Volans: debian: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519753
[12:48:22] (CR) Volans: "TBD when we're ready to cut a new release..." [software/conftool] - https://gerrit.wikimedia.org/r/519752 (owner: Volans)
[12:55:04] (CR) Volans: "Tested on boron." [software/conftool] - https://gerrit.wikimedia.org/r/519753 (owner: Volans)
[13:37:09] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226889 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:37:12] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226889 (ops-monitoring-bot)
[13:46:27] (CR) Gilles: [C: +2] Serve JPG when WEBP conversion fails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: Gilles)
[13:53:19] (CR) Gilles: [V: +2 C: +2] Serve JPG when WEBP conversion fails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: Gilles)
[14:05:24] (PS1) Gilles: Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/519755 (https://phabricator.wikimedia.org/T226707)
[14:08:34] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226890 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:08:42] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226890 (ops-monitoring-bot)
[14:39:59] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226891 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:40:03] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226891 (ops-monitoring-bot)
[14:40:41] (CR) Gilles: [V: +2 C: +2] Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/519755 (https://phabricator.wikimedia.org/T226707) (owner: Gilles)
[14:50:00] Hi, can anyone refresh the UnconnectedPages special page on srwiki? It has been outdated for many days
[14:50:50] It is a special page which is always updated instantly, not like BrokenRedirects, which gets refreshed every 3 days
[15:11:48] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226890 (Zoranzoki21)
[15:11:50] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[15:11:58] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226891 (Zoranzoki21)
[15:12:00] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[15:12:18] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226878 (Zoranzoki21)
[15:12:20] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[15:12:23] (CR) Urbanecm: [C: +1] "Is the task number right?" [mediawiki-config] - https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) (owner: Catrope)
[15:12:30] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226880 (Zoranzoki21)
[15:12:33] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[15:12:44] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226889 (Zoranzoki21)
[15:12:46] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[15:13:09] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226891 (Zoranzoki21)
[15:13:10] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[15:13:41] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226887 (Zoranzoki21)
[15:13:43] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[16:55:48] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226894 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:55:52] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226894 (ops-monitoring-bot)
[17:16:45] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226895 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:16:49] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226895 (ops-monitoring-bot)
[19:20:39] (PS1) Urbanecm: Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899)
[19:24:56] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226894 (Reedy)
[19:24:58] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226895 (Reedy)
[19:26:12] (CR) MarcoAurelio: [C: -1] "Can we please use the dedicated abusefilter.php file? AF config is such a mess nowadays, being spread into several config files. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: Urbanecm)
[19:33:13] Urbanecm: probably some config at https://codesearch.wmflabs.org/operations/?q=abusefilter-view-private&i=nope&files=&repos= is also duplicate with the abusefilter-modify then :)
[19:33:22] I've not checked carefully though
[19:33:59] hauskatze, that's more than possible
[19:34:04] going to update my patch btw
[19:34:16] and I'm going to have dinner :)
[19:36:22] (PS2) Urbanecm: Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899)
[19:37:00] (PS3) Urbanecm: Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899)
[19:39:18] hauskatze, ^^ does it LGTY ^^? :)
[19:40:05] I honestly have no idea why it ended up like that
[19:40:17] * Reedy goes with "it seemed a good idea at the time"
[19:41:26] Reedy, that's the most probable cause, as with (almost?) anything :). Wondering if it's a good idea to move the per-project variables into IS.php...
[19:42:06] In my breaking of FR config... I did that with simple config variables
[19:42:51] I suggest most of that file can be slotted into IS.php
[19:42:58] A few into CS
[19:43:19] looks so
[19:43:40] * Urbanecm puts this on the "things he will do when he has time" list
[19:43:43] :)
[19:43:57] I'd probably try and do it in stages
[19:44:10] good idea
[19:44:11] Move things like wgAbuseFilterNotifications out, set a default, the wiki overrides
[19:44:22] ala "simple" config
[19:44:39] Makes it a bit easier for someone else to review it too
[19:45:07] noted :)
[19:45:17] I'm not saying do every variable individually or anything like that
[19:45:40] But depending on usages, 2 or 3 patches probably works
[19:49:17] btw, Reedy, what do you think about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/518298 ?
[19:58:06] * hauskatze back
[19:58:22] Honestly, IS is already very big. I'd leave AF things in abusefilter.php
[19:58:31] Same happens with FlaggedRevs
[20:00:10] (CR) MarcoAurelio: [C: +1] "/me approves :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: Urbanecm)
[20:00:12] hauskatze, big and convenient. Honestly, don't have any problems with editing it from the size side
[20:00:24] me neither
[20:00:35] IMHO, as long as stuff is in one place
[20:00:38] it's okay
[20:01:17] but I think Daimona moved/was moving stuff back to abusefilter.php so we'd better reach a consensus
[20:01:24] IIRC there was a task about it
[20:01:36] I'll look for it after finishing dinner
[20:01:52] thought you already finished it :)
[20:01:59] or I can do it
[20:02:12] I'm having dessert :)
[20:02:31] do you mean T145931?
[20:02:32] T145931: AbuseFilter permissions spread all over InitialiseSettings and abusefilter.php: they should be in one place - https://phabricator.wikimedia.org/T145931
[20:02:56] https://phabricator.wikimedia.org/T145931
[20:02:59] yup
[20:04:10] well in theory, don't have problems with moving it either way
[20:04:44] just wondering if we shouldn't keep permissions in IS
[20:05:05] for instance, after merging Daimona's patch, cswiki's arbcom and engineer groups would be defined in two places
[20:33:13] (CR) Huji: [C: +1] Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: Urbanecm)
[21:38:12] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226905 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:38:15] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226905 (ops-monitoring-bot)
[21:41:53] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226905 (Krenair)
[21:41:55] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226894 (Krenair)
[21:58:50] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[22:00:16] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[22:09:35] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226906 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:09:38] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226906 (ops-monitoring-bot)
[22:33:11] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226906 (Krenair)
[22:33:14] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226894 (Krenair)
[22:41:17] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226906 (Zoranzoki21)
[22:41:19] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[22:53:15] Operations: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (Reedy)
[23:12:20] ACKNOWLEDGEMENT - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226909 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:12:24] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226909 (ops-monitoring-bot)
[23:14:06] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[23:26:49] Operations: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (Zoranzoki21) p:Triage→High Still happening...
[23:27:09] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T226909 (Zoranzoki21)
[23:27:11] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (Zoranzoki21)
[23:41:24] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
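Editor's note on the 19:44 suggestion ("set a default, the wiki overrides"): the idea can be sketched roughly as below. This is only an illustration in Python, not the real mediawiki-config code, which is PHP and is not shown in this log. The setting name wgAbuseFilterNotifications comes from the discussion; the SETTINGS table, the resolve helper, and every value in it are hypothetical.

```python
# Hypothetical sketch of a "default plus per-wiki override" settings table.
# Only the setting name appears in the log; the values are made up.
SETTINGS = {
    "wgAbuseFilterNotifications": {
        "default": False,   # assumed default, for illustration only
        "arwiki": "udp",    # assumed per-wiki override, for illustration only
    },
}


def resolve(setting: str, wiki: str):
    """Return the wiki-specific value if one is defined, else the default."""
    values = SETTINGS[setting]
    return values.get(wiki, values["default"])


if __name__ == "__main__":
    for wiki in ("arwiki", "cswiki"):
        print(wiki, resolve("wgAbuseFilterNotifications", wiki))
```

Keeping each variable as one entry with a default and a short list of overrides is what would let the move be split into 2 or 3 reviewable patches, as suggested at 19:45.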