[01:36:01] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232834 (Matthewrbowker)
[02:24:10] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232834 (Krenair) While I was an agent I wrote scripts for the admins to turn half of that process into a simple form. What happened with that? I also don't understand why you're proposing setting u...
[02:26:22] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:03] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:14] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:34] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:41] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232878 (Matthewrbowker) >>! In T133476#2232876, @Krenair wrote: > While I was an agent I wrote scripts for the admins to turn half of that process into a simple form. What happened with that? Erm.....
[02:27:44] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:27:52] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:52] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:53] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:28:33] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:28:52] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:31:53] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:33:03] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212
[02:33:22] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:33:23] RECOVERY - DPKG on mw1142 is OK: All packages OK
[02:33:43] RECOVERY - Disk space on mw1142 is OK: DISK OK
[02:33:44] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full
[02:33:54] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[02:34:02] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:34:02] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up
[02:34:03] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm
[02:34:27] (PS2) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[02:34:29] (PS1) Yuvipanda: uwsgi: Allow specifying multiple values for keys [puppet] - https://gerrit.wikimedia.org/r/285053
[02:34:33] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[02:34:43] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[02:35:40] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[02:39:14] (PS3) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[02:40:14] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[02:41:34] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232879 (Krenair) >>! In T133476#2232878, @Matthewrbowker wrote: >>>! In T133476#2232876, @Krenair wrote: >> While I was an agent I wrote scripts for the admins to turn half of that process into a si...
[02:44:02] (PS4) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[02:45:01] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[02:45:58] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232880 (Krenair) So I guess the tricky part would be finding a sane way for OTRS admins to control the LDAP groups. We'd presumably want it integrated with either OTRS (sounds from "This module has...
[02:46:54] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:03] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:53] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[02:49:02] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[02:52:45] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232883 (Krenair) And I suppose an argument in favour of a new separate LDAP system would be the existing OTRS/OTRS wiki users conflicting with the existing LDAP users - although maybe we could put O...
[03:00:10] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232886 (Matthewrbowker) >>! In T133476#2232879, @Krenair wrote: >>>! In T133476#2232878, @Matthewrbowker wrote: >>>>! In T133476#2232876, @Krenair wrote: >>> While I was an agent I wrote scripts for...
[03:07:26] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232888 (Krenair) >>! In T133476#2232886, @Matthewrbowker wrote: >>>! In T133476#2232879, @Krenair wrote: >>>>! In T133476#2232878, @Matthewrbowker wrote: >>>>>! In T133476#2232876, @Krenair wrote: >...
[03:09:03] (PS5) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[03:09:06] (PS2) Yuvipanda: uwsgi: Allow specifying multiple values for keys [puppet] - https://gerrit.wikimedia.org/r/285053
[03:10:30] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[03:12:06] (PS6) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[03:12:07] (PS3) Yuvipanda: uwsgi: Allow specifying multiple values for keys [puppet] - https://gerrit.wikimedia.org/r/285053
[03:12:51] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2232903 (scfc) @Magnus: https://petscan.wmflabs.org/ seems to work fine. Did you mean something else?
[03:13:20] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[03:16:12] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.066 second response time
[03:17:15] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 64810 bytes in 0.120 second response time
[03:18:03] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[03:22:43] (PS4) Yuvipanda: uwsgi: Allow specifying multiple values for keys [puppet] - https://gerrit.wikimedia.org/r/285053
[03:23:40] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232904 (Matthewrbowker) >>! In T133476#2232888, @Krenair wrote: >>>! In T133476#2232886, @Matthewrbowker wrote: >>>>! In T133476#2232879, @Krenair wrote: >>>>>! In T133476#2232878, @Matthewrbowker w...
[03:26:13] (CR) Yuvipanda: [C: 2] uwsgi: Allow specifying multiple values for keys [puppet] - https://gerrit.wikimedia.org/r/285053 (owner: Yuvipanda)
[03:31:01] (CR) Ori.livneh: "Nice :)" [puppet] - https://gerrit.wikimedia.org/r/285053 (owner: Yuvipanda)
[03:34:07] ori: :D
[03:35:14] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:23] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:52] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:32] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:52] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:36] (CR) Tim Landscheidt: "The template uses the variable @ldapconfig that is set to the class parameter $ldap::role::config::labs::ldapconfig's value in various pla" [puppet] - https://gerrit.wikimedia.org/r/279682 (owner: Dzahn)
[03:43:03] Operations, DBA, Labs, Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#2232909 (scfc)
[03:45:42] PROBLEM - Disk space on ms-be2007 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdg1 is not accessible: Input/output error
[03:45:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[03:46:03] PROBLEM - RAID on ms-be2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[03:46:54] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:49:11] (PS7) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[03:49:13] (PS1) Yuvipanda: uwsgi: Always die on term! [puppet] - https://gerrit.wikimedia.org/r/285054
[03:50:32] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[03:50:51] (CR) jenkins-bot: [V: -1] uwsgi: Always die on term! [puppet] - https://gerrit.wikimedia.org/r/285054 (owner: Yuvipanda)
[03:51:06] (CR) Tim Landscheidt: [C: 1] toollabs: flake8 [puppet] - https://gerrit.wikimedia.org/r/283664 (owner: Ladsgroup)
[03:52:24] Let's pretend I went on a tirade now against aligning arrows
[03:54:11] (PS8) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[03:54:13] (PS2) Yuvipanda: uwsgi: Always die on term! [puppet] - https://gerrit.wikimedia.org/r/285054
[03:54:56] (PS2) Yuvipanda: toollabs: flake8 [puppet] - https://gerrit.wikimedia.org/r/283664 (owner: Ladsgroup)
[03:55:08] (CR) Yuvipanda: [C: 2 V: 2] "Thank you for the patch!" [puppet] - https://gerrit.wikimedia.org/r/283664 (owner: Ladsgroup)
[03:55:26] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[03:57:02] (PS9) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[04:01:13] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:01:34] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[04:02:03] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:13] RECOVERY - Disk space on ms-be2007 is OK: DISK OK
[04:02:22] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:42] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:06:15] (PS10) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[04:07:21] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[04:08:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:09:18] (PS11) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[04:09:45] (PS3) Yuvipanda: uwsgi: Always die on term! [puppet] - https://gerrit.wikimedia.org/r/285054
[04:09:52] (CR) Yuvipanda: [C: 2 V: 2] uwsgi: Always die on term! [puppet] - https://gerrit.wikimedia.org/r/285054 (owner: Yuvipanda)
[04:13:11] (CR) Tim Landscheidt: "@Dzahn: Did you mean to abandon this change?" [puppet] - https://gerrit.wikimedia.org/r/271735 (owner: Dzahn)
[04:22:04] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: puppet fail
[04:50:42] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[05:15:06] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2232933 (Krd) Strong oppose. No need, more problems created than resolved, if any resolved at at. Additionally, no prior discussion has taken place at the appropriate venue, which would have been the...
[05:32:50] (PS12) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[05:33:57] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[05:42:58] (PS13) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[05:44:00] (CR) jenkins-bot: [V: -1] [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: Yuvipanda)
[05:45:38] (PS14) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[06:31:31] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:31] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:41] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:11] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:11] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:31] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:40] (PS15) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[06:36:42] (PS1) Yuvipanda: quarry: Make Quarry HTTPS only [puppet] - https://gerrit.wikimedia.org/r/285057 (https://phabricator.wikimedia.org/T107627)
[06:37:34] (CR) Yuvipanda: [C: 2 V: 2] quarry: Make Quarry HTTPS only [puppet] - https://gerrit.wikimedia.org/r/285057 (https://phabricator.wikimedia.org/T107627) (owner: Yuvipanda)
[06:50:58] (PS16) Yuvipanda: [WIP] MySQL backend for storing roles / hiera data for labs [puppet] - https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412)
[06:51:00] (PS1) Yuvipanda: quarry: Enforce https only at nginx level [puppet] - https://gerrit.wikimedia.org/r/285058 (https://phabricator.wikimedia.org/T107627)
[06:51:02] (PS1) Yuvipanda: extdist: Enforce HTTPS [puppet] - https://gerrit.wikimedia.org/r/285059 (https://phabricator.wikimedia.org/T133484)
[06:52:12] (CR) Yuvipanda: [C: 2 V: 2] quarry: Enforce https only at nginx level [puppet] - https://gerrit.wikimedia.org/r/285058 (https://phabricator.wikimedia.org/T107627) (owner: Yuvipanda)
[06:52:25] (CR) Yuvipanda: [C: 2 V: 2] extdist: Enforce HTTPS [puppet] - https://gerrit.wikimedia.org/r/285059 (https://phabricator.wikimedia.org/T133484) (owner: Yuvipanda)
[06:56:11] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:33] Operations, Labs, Labs-Infrastructure, Quarry, and 3 others: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#2233035 (yuvipanda) It is now!
[06:56:39] Operations, Labs, Labs-Infrastructure, Quarry, and 3 others: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#2233036 (yuvipanda) Open>Resolved a:yuvipanda
[06:57:11] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:12] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:41] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:50] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:58:10] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:21] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:18:35] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 55 failures
[07:21:17] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:25:26] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[07:45:33] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:47:14] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:47:43] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:47:44] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:47:44] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:47:44] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:47:54] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:04] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:23] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:30] (CR) Dereckson: "We now also need to add flow_computed.dblist introduced in c99dce08" [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson)
[07:48:35] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:48:54] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:04] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:23] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:03] RECOVERY - Disk space on mw1140 is OK: DISK OK
[07:53:43] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:53:44] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[07:53:44] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:55:54] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 73 failures
[07:59:21] (PS1) Dereckson: noc: jobqueue-eqiad.php.txt → jobqueue.php [mediawiki-config] - https://gerrit.wikimedia.org/r/285061
[07:59:54] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:59:55] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:59:55] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:00:23] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:01:04] (PS1) Dereckson: noc: PoolCounterSettings-eqiad.php → PoolCounterSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324)
[08:02:15] (PS2) Dereckson: Flow dblist on noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/281977
[08:02:38] (CR) Dereckson: "PS2: +flow_computed.dblist" [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson)
[08:04:34] (CR) Dereckson: [C: 1] Add *.asc-test.nl to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/284712 (https://phabricator.wikimedia.org/T133286) (owner: Urbanecm)
[08:12:39] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed
[08:12:47] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[08:18:48] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:18:58] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:37:48] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[08:37:48] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full
[08:37:49] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[08:37:58] RECOVERY - DPKG on mw1140 is OK: All packages OK
[08:38:08] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[08:38:19] RECOVERY - Disk space on mw1140 is OK: DISK OK
[08:38:38] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up
[08:38:47] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:39:08] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed
[08:39:17] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[08:39:28] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.060 second response time
[08:39:28] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[08:40:38] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:40:38] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 66871 bytes in 0.095 second response time
[09:22:57] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:24:08] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:24:37] PROBLEM - RAID on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:24:37] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:24:38] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:24:48] PROBLEM - nutcracker port on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:24:58] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:25:27] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:25:28] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:29] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:39] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:32:21] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:33:12] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:34:33] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:49:32] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:49:33] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient
[09:49:51] RECOVERY - DPKG on mw1138 is OK: All packages OK
[09:49:52] RECOVERY - nutcracker port on mw1138 is OK: TCP OK - 0.000 second response time on port 11212
[09:50:12] RECOVERY - Disk space on mw1138 is OK: DISK OK
[09:50:32] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:50:43] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm
[09:51:03] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[09:53:21] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up
[09:53:31] RECOVERY - RAID on mw1138 is OK: OK: no RAID installed
[09:53:31] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 9 % full
[09:53:31] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.048 second response time
[09:54:03] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 66343 bytes in 0.150 second response time
[09:54:12] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[09:57:53] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[10:00:36] (PS1) Yuvipanda: tools: Remove unused import [puppet] - https://gerrit.wikimedia.org/r/285065
[10:00:38] (PS1) Yuvipanda: toollabs: Check env $PORT before $2 [puppet] - https://gerrit.wikimedia.org/r/285066 (https://phabricator.wikimedia.org/T98442)
[10:01:13] (CR) Yuvipanda: [C: 2 V: 2] tools: Remove unused import [puppet] - https://gerrit.wikimedia.org/r/285065 (owner: Yuvipanda)
[10:10:17] (CR) Yuvipanda: [C: 2] toollabs: Check env $PORT before $2 [puppet] - https://gerrit.wikimedia.org/r/285066 (https://phabricator.wikimedia.org/T98442) (owner: Yuvipanda)
[10:53:29] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2233119 (Magnus) Wouldn't it be better for http to always redirect to https?
[12:22:23] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (tom29739) I think in Tools the tool never 'sees' http because everything goes through the proxy. So all tools should be compatible if http is...
[13:23:57] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[13:36:57] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2233299 (yuvipanda) >>! In T128409#2217073, @valhallasw wrote: >> * naively grep all the projects' PHP and JavaScript code looking for hardcoded http:...
[13:38:09] (PS1) Yuvipanda: Revert "dynamicproxy: custom log schema (http/https) for tools" [puppet] - https://gerrit.wikimedia.org/r/285070 (https://phabricator.wikimedia.org/T128409)
[13:38:17] (PS2) Yuvipanda: Revert "dynamicproxy: custom log schema (http/https) for tools" [puppet] - https://gerrit.wikimedia.org/r/285070 (https://phabricator.wikimedia.org/T128409)
[13:38:44] (CR) Yuvipanda: [C: 2 V: 2] Revert "dynamicproxy: custom log schema (http/https) for tools" [puppet] - https://gerrit.wikimedia.org/r/285070 (https://phabricator.wikimedia.org/T128409) (owner: Yuvipanda)
[13:43:53] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (BBlack) >>! In T128409#2233237, @tom29739 wrote: > In Tools the tool never 'sees' http because everything goes through the proxy: >>>! In T1...
[13:50:57] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:29:22] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:30:23] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:30:52] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:02] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:31:03] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:12] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:22] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:23] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:43] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:32:12] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:32:13] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:35:04] RECOVERY - Disk space on mw1135 is OK: DISK OK
[14:35:04] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[14:35:04] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[14:35:23] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:36:13] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:36:13] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:36:33] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:41:12] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:13] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:41:22] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:41:24] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:55:23] RECOVERY - Disk space on mw1135 is OK: DISK OK [14:55:23] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:55:23] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [14:55:33] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:55:34] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [14:55:44] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.039 second response time [14:55:53] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [14:56:13] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [14:56:23] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 10 % full [14:56:23] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [14:56:24] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [14:56:42] RECOVERY - DPKG on mw1135 is OK: All packages OK [14:56:42] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 66355 bytes in 0.099 second response time [14:57:02] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed [15:02:03] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:02:41] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:09] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [15:03:40] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:41] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:19] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:05:39] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:05:45] (PS12) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:06:09] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:06:23] (CR) BBlack: letsencrypt module guts + acme-setup script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: BBlack) [15:08:22] Operations, OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2233464 (Krenair) >>! In T133476#2232933, @Krd wrote: > No need, more problems created than resolved, if any resolved at at. I don't think you could say no problems resolved until details were confi... [15:09:12] Operations, Discovery, Wikidata, Wikidata-Query-Service, Varnish: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2233471 (Mushroom) [15:11:18] Operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2233472 (BBlack) [15:13:50] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:13:50] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:00] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:16] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2233473 (Krenair) deployment-cache-text04:/etc/puppet/puppet.conf.d/1...
[15:14:21] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:29] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:30] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:36] (PS13) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:15:10] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:15:50] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:02] Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2233475 (Krenair) [15:16:05] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2233474 (Krenair) [15:21:10] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [15:24:20] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:24:20] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [15:27:21] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed [15:27:39] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [15:27:39] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 2 % full [15:27:48] (PS14) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:27:50] (PS9) BBlack: create letsencrypt module, install acme-tiny [puppet] - 
https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: Dzahn) [15:27:50] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:27:51] RECOVERY - Disk space on mw1135 is OK: DISK OK [15:27:51] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [15:28:00] RECOVERY - DPKG on mw1135 is OK: All packages OK [15:28:09] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [15:28:31] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [15:28:40] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 64354 bytes in 0.081 second response time [15:29:21] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.051 second response time [15:35:05] (CR) BBlack: [C: 1] "acme-setup PS14 variant has had some testing against the acme staging server with jessie nginx config as docced. 
The bundling and puppeti" [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: BBlack) [15:36:29] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 68 failures [15:37:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5157531 keys - replication_delay is 648 [15:39:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5139529 keys - replication_delay is 0 [15:40:01] (PS15) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:40:27] (CR) BBlack: [C: 1] "PS15: pep8 fixups" [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: BBlack) [15:45:00] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: puppet fail [15:53:57] (PS1) BBlack: update-ocsp: time validity check bugfix [puppet] - https://gerrit.wikimedia.org/r/285072 [16:05:11] Operations, Traffic, Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#2233508 (Krenair) doesn't look like apache is involved in this - see modules/rcstream/templates/rcstream.nginx.erb in the puppet repo [16:05:37] Operations, Traffic, Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#2233510 (Krenair) (of course now I see it was me who added that project in the first place, oops...) 
[16:06:50] (PS1) Reedy: Disable Special:GlobalAllocation as it OOMs [mediawiki-config] - https://gerrit.wikimedia.org/r/285073 (https://phabricator.wikimedia.org/T55443) [16:11:31] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:26] Operations, Traffic, Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#723098 (BBlack) See also https://gerrit.wikimedia.org/r/#/c/284760/ pending patch, from trying to fix the /=>404 issue for HTTPS reasons in T132521 (the basic HTTPS issues... [16:28:31] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:50] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:50] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:02] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:49] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:36] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:36] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:36] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:15] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:33:15] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:33:34] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:33:55] RECOVERY - Disk space on mw1135 is OK: DISK OK [16:34:24] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:34:24] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [16:35:34] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [16:35:34] RECOVERY - DPKG on mw1135 is OK: All packages OK [16:37:52] (CR) Awight: [C: 1] "Thank you and apologies for the ongoing fiasco!" [mediawiki-config] - https://gerrit.wikimedia.org/r/285073 (https://phabricator.wikimedia.org/T55443) (owner: Reedy) [16:39:55] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:24] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:25] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:25] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:44] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:50] (PS2) Tim Landscheidt: Remove unused import in labs [puppet] - https://gerrit.wikimedia.org/r/279896 (owner: Ladsgroup) [16:41:34] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:41:35] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:45] (CR) Glaisher: "I'll note that this page does work *sometimes* and is useful when it works. 
Unless this causes huge issues, I'm not sure if we should be d" [mediawiki-config] - https://gerrit.wikimedia.org/r/285073 (https://phabricator.wikimedia.org/T55443) (owner: Reedy) [16:56:24] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [16:56:24] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [16:56:24] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:56:54] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:02:43] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:04:03] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:32] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:44] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:13:53] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [17:14:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 603 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5143313 keys - replication_delay is 603 [17:20:02] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:52] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5138977 keys - replication_delay is 0 [17:30:13] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [17:36:32] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:43:13] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [17:47:43] (PS1) Alex Monk: Get rid of redirects for non-resolving/parked domains [puppet] - https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) [17:49:23] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:25] (PS1) Alex Monk: Set up yue.wikipedia.org DNS record [dns] - https://gerrit.wikimedia.org/r/285085 (https://phabricator.wikimedia.org/T105999) [17:56:48] (PS1) Alex Monk: Redirect yue.wikipedia.org to zh-yue.wikipedia.org for now [puppet] - https://gerrit.wikimedia.org/r/285086 (https://phabricator.wikimedia.org/T105999) [17:57:01] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Redirect yue.wikipedia.org to zh-yue.wikipedia.org - https://phabricator.wikimedia.org/T105999#2233666 (Krenair) a:Krenair [18:12:02] PROBLEM - NTP on mw1135 is CRITICAL: NTP CRITICAL: No response from NTP server [18:20:02] RECOVERY - NTP on mw1135 is OK: NTP OK: Offset -0.05896604061 secs [18:25:23] RECOVERY - Disk space on mw1135 is OK: DISK OK [18:27:42] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 61 failures [18:31:33] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:43:02] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [18:46:52] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [18:48:22] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:50:02] (CR) Reedy: "Having a way to cause OOMs isn't good for the cluster." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/285073 (https://phabricator.wikimedia.org/T55443) (owner: Reedy) [18:53:22] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:33] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:42] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:04:44] !log Deployed change Ib7e248ccf to statsv (commit id 5323cece2b3; task T132770) [20:04:46] T132770: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770 [20:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:52] PROBLEM - NTP on mw1135 is CRITICAL: NTP CRITICAL: No response from NTP server [20:05:53] bd808: I love Stashbot. The Phabricator "Mentioned in SAL" thing is great. [20:06:34] :) Thanks. I think that g.reg asked for that bit [20:08:02] nope, it was go.dog in T108720 [20:08:03] T108720: pick up ticket mentions from !log lines - https://phabricator.wikimedia.org/T108720 [20:39:48] <_joe_> indeed it's very useful [20:39:51] <_joe_> thanks [20:42:14] Operations, Wikimedia-General-or-Unknown: Page on aswikisource not accessible via page title, only via "curid" - https://phabricator.wikimedia.org/T133505#2233789 (Ciencia_Al_Poder) https://as.wikisource.org/w/api.php?action=query&prop=info&pageids=30 displays: ```lang=json { "batchcomplete": "",... 
[20:45:12] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:45:48] !log ran namespaceDupes.php on aswikisource [20:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:16] Operations, Wikimedia-General-or-Unknown: Page on aswikisource not accessible via page title, only via "curid" - https://phabricator.wikimedia.org/T133505#2233772 (Reedy) ``` reedy@tin:~$ mwscript namespaceDupes.php aswikisource id=869 ns=0 dbk=Author:গণেশ_গগৈ *** dest title exists and --add-prefix not s... [20:47:12] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5142938 keys - replication_delay is 0 [21:07:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5144289 keys - replication_delay is 636 [21:21:33] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5143434 keys - replication_delay is 0 [21:49:05] RECOVERY - NTP on mw1135 is OK: NTP OK: Offset -0.09681725502 secs [21:51:13] (CR) Legoktm: "Thanks, do you want to make the eqiad and codfw files visible too?" [mediawiki-config] - https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) (owner: Dereckson) [22:04:44] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [22:10:45] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:15:15] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:17:16] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [22:22:05] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [22:22:34] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [22:22:34] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [22:22:45] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [22:23:05] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [22:23:05] RECOVERY - DPKG on mw1135 is OK: All packages OK [22:23:14] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.413 second response time [22:23:24] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:23:24] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:23:34] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 1 % full [22:23:44] RECOVERY - Disk space on mw1135 is OK: DISK OK [22:23:55] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed [22:24:15] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 64364 bytes in 0.309 second response time [22:25:36] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:28:24] Operations, WMF-Legal, Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2233892 (Platonides) @Slaporte, you will probably know better the target authors. Who will be changing this site and how often? Currently, there seems to be 9 page... 
[22:45:22] ping mutante [23:25:24] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:25:35] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:25:54] PROBLEM - SSH on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:04] PROBLEM - dhclient process on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:15] PROBLEM - configured eth on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:24] PROBLEM - salt-minion processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:25] PROBLEM - HHVM processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:35] PROBLEM - nutcracker port on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:54] PROBLEM - nutcracker process on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:56] PROBLEM - Check size of conntrack table on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:34] PROBLEM - RAID on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:35] PROBLEM - Disk space on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:44] PROBLEM - DPKG on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:32] RECOVERY - HHVM processes on mw1116 is OK: PROCS OK: 6 processes with command name hhvm [23:33:51] RECOVERY - Disk space on mw1116 is OK: DISK OK [23:34:00] RECOVERY - RAID on mw1116 is OK: OK: no RAID installed [23:39:10] PROBLEM - HHVM processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:39:41] PROBLEM - RAID on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:39:50] PROBLEM - Disk space on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:50:30] RECOVERY - nutcracker port on mw1116 is OK: TCP OK - 0.000 second response time on port 11212 [23:50:41] RECOVERY - dhclient process on mw1116 is OK: PROCS OK: 0 processes with command name dhclient [23:50:41] RECOVERY - nutcracker process on mw1116 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:50:41] RECOVERY - Check size of conntrack table on mw1116 is OK: OK: nf_conntrack is 0 % full [23:50:52] RECOVERY - HHVM processes on mw1116 is OK: PROCS OK: 6 processes with command name hhvm [23:51:01] RECOVERY - salt-minion processes on mw1116 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:51:01] RECOVERY - configured eth on mw1116 is OK: OK - interfaces up [23:51:30] RECOVERY - RAID on mw1116 is OK: OK: no RAID installed [23:51:31] RECOVERY - Disk space on mw1116 is OK: DISK OK [23:56:31] PROBLEM - nutcracker port on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:41] PROBLEM - Check size of conntrack table on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:50] PROBLEM - dhclient process on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:50] PROBLEM - nutcracker process on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:01] PROBLEM - HHVM processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:10] PROBLEM - configured eth on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:31] PROBLEM - RAID on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:59:10] PROBLEM - salt-minion processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.