[00:51:27] (PS1) Urbanecm: Allow ptwiki's bureaucrats to grant/revoke rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/481662 (https://phabricator.wikimedia.org/T212735)
[01:01:09] (PS1) Urbanecm: Use localized wgMetaNamespace and wgMetaNamespaceTalk in satwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/481663 (https://phabricator.wikimedia.org/T211294)
[03:33:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 916.55 seconds
[04:21:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 213.38 seconds
[09:46:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:48:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[09:49:23] seems a single spike problem (recurrent issue) --^
[09:50:11] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:50:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:54:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[10:00:07] PROBLEM - Device not healthy -SMART- on db2047 is CRITICAL: cluster=mysql device=cciss,0 instance=db2047:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw%2520prometheus%252Fops
[10:31:38] (PS3) Hashar: contint: remove unused classes [puppet] - https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361)
[10:33:09] (CR) Hashar: "Those classes are not the Jenkins slaves on labs, they are not used on contint1001 / contint2001 :-)" [puppet] - https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361) (owner: Hashar)
[10:44:59] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational
[11:39:59] PROBLEM - Disk space on orespoolcounter2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.56: Connection reset by peer
[11:40:01] PROBLEM - Disk space on alcyone is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.16: Connection reset by peer
[11:40:25] PROBLEM - configured eth on ping2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.22: Connection reset by peer
[11:40:39] PROBLEM - Check size of conntrack table on alcyone is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.16: Connection reset by peer
[11:40:47] PROBLEM - dhclient process on ping2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.22: Connection reset by peer
[11:41:11] PROBLEM - Check systemd state on ping2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.22: Connection reset by peer
[11:41:33] PROBLEM - DPKG on orespoolcounter2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.56: Connection reset by peer
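For readers unfamiliar with what a change like the [00:51:27] ptwiki patch above usually amounts to: in stock MediaWiki, letting one group grant and revoke membership in another group is a small configuration tweak. The sketch below is illustrative only, written as plain LocalSettings.php rather than the per-wiki InitialiseSettings.php overrides that mediawiki-config actually uses, and it assumes the change touches only the bureaucrat/rollbacker pairing.

```php
<?php
// Illustrative sketch only -- not the contents of Gerrit change 481662.
// Wikimedia's mediawiki-config expresses this as per-wiki overrides in
// InitialiseSettings.php; plain LocalSettings.php form is shown for clarity.

// Let bureaucrats add users to the 'rollbacker' group ...
$wgAddGroups['bureaucrat'][] = 'rollbacker';

// ... and remove them from it again.
$wgRemoveGroups['bureaucrat'][] = 'rollbacker';
```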
[11:41:49] PROBLEM - Check whether ferm is active by checking the default input chain on alcyone is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.16: Connection reset by peer
[11:42:27] PROBLEM - Check systemd state on orespoolcounter2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.56: Connection reset by peer
[11:42:53] PROBLEM - Check size of conntrack table on ping2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.22: Connection reset by peer
[11:43:53] PROBLEM - Check size of conntrack table on orespoolcounter2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.56: Connection reset by peer
[11:43:55] PROBLEM - ganeti-noded running on ganeti2006 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded
[11:44:19] PROBLEM - Check size of conntrack table on alcyone is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.16: Connection reset by peer
[11:44:55] PROBLEM - Check systemd state on orespoolcounter2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.56: Connection reset by peer
[11:44:57] PROBLEM - Disk space on alcyone is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.16: Connection reset by peer
[11:45:17] RECOVERY - Check size of conntrack table on ping2001 is OK: OK: nf_conntrack is 0 % full
[11:45:17] RECOVERY - configured eth on ping2001 is OK: OK - interfaces up
[11:45:23] RECOVERY - Check whether ferm is active by checking the default input chain on alcyone is OK: OK ferm input default policy is set
[11:45:25] RECOVERY - Check size of conntrack table on alcyone is OK: OK: nf_conntrack is 0 % full
[11:45:35] RECOVERY - dhclient process on ping2001 is OK: PROCS OK: 0 processes with command name dhclient
[11:45:59] PROBLEM - Check systemd state on ping2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:46:01] RECOVERY - Disk space on orespoolcounter2001 is OK: DISK OK
[11:46:01] RECOVERY - Check systemd state on orespoolcounter2001 is OK: OK - running: The system is fully operational
[11:46:03] RECOVERY - Disk space on alcyone is OK: DISK OK
[11:46:13] RECOVERY - Check size of conntrack table on orespoolcounter2001 is OK: OK: nf_conntrack is 0 % full
[11:46:19] RECOVERY - DPKG on orespoolcounter2001 is OK: All packages OK
[11:46:21] RECOVERY - ganeti-noded running on ganeti2006 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded
[11:47:11] RECOVERY - Check systemd state on ping2001 is OK: OK - running: The system is fully operational
[12:49:53] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[12:50:20] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2047 is CRITICAL: cluster=mysql device=cciss,0 instance=db2047:9100 job=node site=codfw Marostegui https://phabricator.wikimedia.org/T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw%2520prometheus%252Fops
[14:21:38] (CR) Jforrester: Set wgNoticeProjects for wikimedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: MacFan4000)
[15:45:27] Operations, Release Pipeline, Release-Engineering-Team, Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (hashar)
[16:25:31] (PS4) Framawiki: Publish throttle-analyze at noc [mediawiki-config] - https://gerrit.wikimedia.org/r/481267 (https://phabricator.wikimedia.org/T187894)
[17:29:41] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[17:33:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[17:37:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[17:39:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[17:44:48] Puppet, Continuous-Integration-Infrastructure: Need a better way of testing puppet patches for contint/integration stuff - https://phabricator.wikimedia.org/T126370 (hashar) Open→Declined The jobs now run in Docker containers and the hosts have a very straightforward puppet manifest. Since puppet...
[19:56:16] (CR) Gergő Tisza: [C: -1] Require an 8-byte new password for all users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479571 (https://phabricator.wikimedia.org/T211622) (owner: Jforrester)
[20:01:56] (CR) Gergő Tisza: [C: -1] "Uh, can we not do this? MinimumPasswordLengthToLogin is an antifeature that should really not be used except after known compromises when " [mediawiki-config] - https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: Jforrester)
[20:05:28] (CR) Gergő Tisza: "This is already the default, from core. Icc70122fab1b5 cleans it up, along with a bunch of other things." [mediawiki-config] - https://gerrit.wikimedia.org/r/479572 (https://phabricator.wikimedia.org/T208441) (owner: Jforrester)
[20:09:15] (CR) Gergő Tisza: "No harm in this, but it's a no-op (the list only has 10K passwords, and it's unlikely that will ever change as we have switched to Bloom f" [mediawiki-config] - https://gerrit.wikimedia.org/r/479573 (owner: Jforrester)
[20:21:07] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 774.73 seconds
[20:24:06] Operations, Wikimedia-Mailing-lists: Create mailing list for Wikimedia's Google Code-in mentors - https://phabricator.wikimedia.org/T212747 (Aklapper) p:Triage→Lowest
[21:28:32] (CR) Jforrester: [C: -2] "> Patch Set 1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: Jforrester)
[21:35:03] (PS1) Reedy: Add testcommons.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/481795 (https://phabricator.wikimedia.org/T197616)
[21:38:53] (PS1) Reedy: Add testcommons.wikimedia.org to prod_sites.pp [puppet] - https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616)
[21:39:47] (PS2) Reedy: Add testcommons.wikimedia.org to prod_sites.pp [puppet] - https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616)
[21:39:49] (CR) Jforrester: "Ideally we'd like this done on 2019-01-02 so that we can get production fully tested with SDC items ahead of deployment to real Commons ne" [dns] - https://gerrit.wikimedia.org/r/481795 (https://phabricator.wikimedia.org/T197616) (owner: Reedy)
[21:51:12] (PS2) Reedy: Add test-commons.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/481795 (https://phabricator.wikimedia.org/T197616)
[21:52:29] (PS3) Reedy: Add test-commons.wikimedia.org to prod_sites.pp [puppet] - https://gerrit.wikimedia.org/r/481796 (https://phabricator.wikimedia.org/T197616)
[22:11:57] (CR) Gergő Tisza: [C: -1] "Enforcing it is fine (the easy way is to refuse finishing the login process unless the user changes their password; a more complex but nic" [mediawiki-config] - https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: Jforrester)
[22:16:50] (CR) Gergő Tisza: "Hm, I guess password reset via email should still work so it's not that bad. Still a crude approach, IMO." [mediawiki-config] - https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: Jforrester)
[22:47:35] (CR) Jforrester: [C: -2] "> Patch Set 1: -Code-Review" [mediawiki-config] - https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: Jforrester)
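The password-policy back-and-forth above (the 19:56, 20:01, 20:09, 22:11 and 22:16 comments from Gergő Tisza) is easier to follow with the relevant MediaWiki knobs in front of you. The sketch below is a minimal illustration in LocalSettings.php form; the values are made up and the wiki scoping used in the actual Gerrit changes is not reproduced here.

```php
<?php
// Illustrative values only -- not the settings proposed in the Gerrit changes.

// Minimum length enforced when a password is *set or changed*
// (the "8-byte new password" change under review at 19:56).
$wgPasswordPolicy['policies']['default']['MinimalPasswordLength'] = 8;

// MinimumPasswordLengthToLogin is the stricter knob Gergő Tisza calls an
// antifeature at 20:01: accounts whose *existing* password is shorter than
// this cannot log in at all until the password is reset.
$wgPasswordPolicy['policies']['default']['MinimumPasswordLengthToLogin'] = 1;

// The "list only has 10K passwords" remark at 20:09 refers to the separate
// common-password check, a boolean policy (named PasswordNotInLargeBlacklist
// in MediaWiki of this era).
$wgPasswordPolicy['policies']['default']['PasswordNotInLargeBlacklist'] = true;
```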